Domestic large-scale video website, the content of the home page is basically channel content editor recommended. What is YouTube 's recommended algorithm?The first stage, based on user - video graph traversal algorithm, 2008 [ 1 ]. At this stage, YouTube believes users should be recommended videos of the same kind that have been watched, or have the same label. At this point, however, YouTube 's video is tens of millions of orders of magnitude, the part with the label is very small, so how to effectively expand the video label, it is considered to be the core issue of recommendations. The core of the solution has two pieces, one is based on the user common view record structure ( video co - view graph ); The second is based on this data structure algorithm, known as the adsorption algorithm ( adsorptional gorithm ).
The construction of the common viewing relationship of video can be viewed from two perspectives, one is the composition of video, one is the composition of video - user, the " video" graph can be seen as extracted by the" video user" graph ( figure 1 ). The edges between the videos can be the number of users who have viewed both videos simultaneously, or the number of times they have been viewed simultaneously in the same session, even taking the order into account. So how do you actually tag the video? Labels can be regarded as a classification, the so-called " near Zhu Zhechi, near ink black", in the graph structure, a node 's information and attributes can be obtained through the surrounding nodes. " tag" is no exception. The core idea of adsorptional gorithm is that some nodes will have tags, each iteration, can pass the tags to the neighboring nodes, so keep iteration, until the tags are stably distributed in the node. The pseudo code is as follows:
Where v is a set of nodes, e is a set of edges, w is the weight between nodes and edges, l is a set of labels, VL is a node with labels in v, and each video corresponds to a distribution probability LV of labels. Each iteration recalculates the label distribution for all nodes. The distribution of the labels corresponding to the nodes is obtained by the strength of the relationship between the adjacent nodes to which the nodes are connected, and the labels are accumulated after the product of the distribution probability of the adjacent nodes. This algorithm is similar to PageRank and Markov chain walk process, because the label weight in each node comes from the linear combination of the corresponding weights of the surrounding nodes, and also similar to the linear system. In addition, the paper does not spend space on how to use this algorithm for alternative generation of recommendations, only said that the iterative stability of the user 's tag in the graph structure as an alternative ( the basis of the generation ), or link to the alternative video. Similarly, there is no final ordering of the papers and how to merge multiple alternative results, although this module is already available in YouTube 's recommendation system at this stage. The author thinks that this algorithm can be classified as " user portrait" recommended method category. Using tags as video and user 's description, the tag information of users and videos is mined in some way as a link to each other. Youtube compares this method with popular results and simple collaboration, and has achieved success. The test method is also relatively preliminary: the effect evaluation is carried out completely offline, the new user cannot be evaluated, and the value of the newly generated content cannot be measured; In addition, for video, click as a measure is not enough, play time is a must consider factors.
At this stage, YouTube considers it necessary to recommend similar videos of the videos that the user has viewed to the user. And what is a similar video? It is mainly defined by user behavior, which can be as follows: 1. video watched by a certain number of users; 2. video that is often viewed simultaneously in the same session; 3. video that is often viewed simultaneously in the same session, taking into account sequence information. With these options, the effectiveness of the information is gradually better, but the data is gradually sparse
A preliminary recommended alternative is a similar video in which the user has consumed the video. As the above formula, s is the video set consumed by the user, vi is one of the videos in s, and ri is the similar video set corresponding to vi. The final alternative set c is the union of all ris. In general, the results generated in this way are sufficient as an alternative, but often the content focus is difficult to find new video for the user, and there are cases where the alternative results are insufficient. In this case, the similar video set can continue to expand, from " similar video" to" similar video of similar video", and so on a certain number of iterations to obtain the final alternative set. After alternative generation is the sorting stage, mainly considering two kinds of factors. One is the quality of the video, including the number of video playback, scoring, etc.; Second, the user 's demand information, including some information in the user 's viewing history, such as the number of video viewing, and viewing time, etc.; A linear formula can be used to consider these two kinds of factors ( not mentioned here how linear formula comes from, should not be a racquet head - _ - # ). In the end, only a small number of alternative results can be presented, so only part of the data can be selected, and the process needs to deal with diversity issues: removing tag - like data or removing data belonging to the same channel, and further clustering and content analysis based methods can also be adopted. The paper also describes the specific system implementation plan. Because the alternative results for each user can remain completely unchanged for a certain period of time, the offline calculation method is selected. But doing so will lead to poor effectiveness, so YouTube optimizes the data generation process and enables data updates several times a day. The system architecture is mainly divided into three parts: data collection, alternative generation and recommendation service. After the user logs are extracted, they are stored in bigtable, and then an alternative is generated based on MapReduce, and the resulting generated results are stored in the bigtable server that provides online services. Recommended services do not involve too much real-time computing, and delay time is more network transmission. To verify the effectiveness of this approach, YouTube has chosen an online a / b test approach, with the main metrics including CTR, longctr ( active click over a certain length of time ), average session viewing time, first viewing time, and recommended coverage.
This paper describes the optimization method of " related video", i.e. the video recommended by the user when watching a video. But essentially defines a similar or related video calculation. The definition of " similar object" is the core issue of recommendation, with different calculation methods, also means that there are new recommended methods. Why do you have a new " related video" calculation method? Collaborative filtering is the best method at that time, but it is suitable for a certain user to watch recorded video, but for new video and long-tailed video, not good application.
Let 's see how YouTube did it. First, the video is described through the < subject, weight > collection. As shown above, the worldwarz film contains four themes and corresponding weights. The theme name is extracted from the description information of the video, the keywords defined by the uploading party, the retrieval words corresponding to the searched viewing records, the name of the broadcast list and the like. In the second step, the similarity between the video and the video is calculated. Mainly through the calculation of two sets of topics
In the experiment stage, YouTube mainly uses online experiment to verify the effect. Metrics include viewing time, viewing completion rate ( measure how much video is viewed from head to head ), and drop rate, i.e., the percentage of no relevant video is viewed ( in this case, user behavior terminates ). From these three indicators, the proposed method and collaborative filtering combined, the effect is significantly improved, and the method based on topic weight retraining is better than the artificial fitting ranking formula method using search theory. It is worth mentioning that the similarity calculation method proposed in this paper is fundamentally different from the way based on user behavior ( e.g. collaborative filtering ), has less dependence on user behavior, is suitable for new data and long tail data, and can greatly restrain Matthew effect. At the same time, this paper also provides a mature implementation: alternative generation based on the search bottom layer. The retrieval sentence is constructed from the video subject information being viewed and is queried in the inverted index. Furthermore, it was also mentioned that the effect would be further enhanced by merging with alternative sets of collaborative filtering methods through the reordering module
If this article is helpful to you. Please praise or comment on this article. thank you