Ranking Pages for User Queries
A typical query submitted to a document search engine consists of a few keywords. Such a query can also be represented as a set of terms with weights. The degree of match between a page and a query, often called the similarity, can be measured by the terms they share. A simple approach is to add up the products of the weights of the terms that appear in both the query and the page.
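As a minimal sketch of this simple approach, the Python fragment below represents the query and the page as dictionaries that map terms to weights. The function name, the sample terms, and the weights are illustrative assumptions, not taken from any particular search engine.

```python
def dot_product_similarity(query_weights, page_weights):
    """Add up the products of the weights of terms shared by query and page."""
    # Only terms present in both the query and the page contribute.
    shared_terms = query_weights.keys() & page_weights.keys()
    return sum(query_weights[t] * page_weights[t] for t in shared_terms)


# Hypothetical term weights for one query and one page.
query = {"web": 1.0, "search": 2.0}
page = {"search": 3.0, "engine": 1.0, "web": 0.5}

score = dot_product_similarity(query, page)
print(score)  # 1.0 * 0.5 + 2.0 * 3.0 = 6.5
```

The term "engine" occurs only in the page, so it adds nothing to the score.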
This approach yields larger similarities for pages that share more important terms with a query. However, it tends to favor longer pages over shorter ones. This problem is often addressed by dividing the above similarity by the product of the lengths of the query and the page. The function that computes this kind of similarity is called the Cosine function. The length of each page can be computed beforehand and stored at the search engine site.
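The normalization described above can be sketched as follows. The sketch assumes that the "length" of a query or page is the Euclidean norm of its weight vector, which is the usual convention for the Cosine function; the function names and sample weights are hypothetical.

```python
import math


def vector_length(weights):
    """Euclidean length of a term-weight vector (assumed definition of length)."""
    return math.sqrt(sum(w * w for w in weights.values()))


def cosine_similarity(query_weights, page_weights):
    """Sum of weight products over shared terms, divided by both lengths."""
    query_len = vector_length(query_weights)
    page_len = vector_length(page_weights)
    if query_len == 0 or page_len == 0:
        return 0.0  # an empty query or page matches nothing
    shared_terms = query_weights.keys() & page_weights.keys()
    dot = sum(query_weights[t] * page_weights[t] for t in shared_terms)
    return dot / (query_len * page_len)


query = {"search": 1.0}
short_page = {"search": 1.0, "engine": 1.0}
# Hypothetical longer page: the same content repeated, so every weight doubles.
long_page = {"search": 2.0, "engine": 2.0}

short_score = cosine_similarity(query, short_page)
long_score = cosine_similarity(query, long_page)
```

Without normalization the longer page would score twice as high as the shorter one. After dividing by the lengths, both pages receive the same similarity. In a real system the page lengths would be looked up rather than recomputed for every query, as noted above.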
Many methods exist for ranking Web pages for user queries, and different search engines likely employ different ranking techniques. For example, some ranking methods also consider the proximity of the query terms within a page. As another example, a search engine may keep track of the number of times each page has been accessed by users and use such information to help rank pages.
Google (www.google.com) is one of the most popular search engines on the Web. A main reason for Google's success is its powerful ranking method, which can differentiate more important pages from less important ones even when they all contain the query terms the same number of times. Google uses the linkage information among Web pages (i.e., how Web pages are linked) to derive the importance of each page.
A link from page A to page B is placed by the author of page A. Intuitively, the existence of such a link is an indication that the author of page A considers page B to be of some value. On the Web, a page may be linked from many other pages and these links can be aggregated in some way to reflect the overall importance of the page. For a given page, PageRank is a measure of the relative importance of the page on the Web, and this measure is computed based on the linkage information. The following are the three main ideas behind the definition and computation of PageRank.
1. Pages that are linked from more pages are likely to be more important. In other words, the importance of a page should be reflected by the popularity of the page among the authors of all Web pages.
2. Pages that are linked from more important pages are likely to be more important themselves.
3. Pages that have links to more pages have less influence over the importance of each of the linked pages. In other words, if a page has more child pages, then it can only propagate a smaller fraction of its importance to each child page.
Based on the above insights, the founders of Google developed a method to calculate the importance (PageRank) of each page on the Web.
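The three ideas above can be illustrated with a small iterative computation. The sketch below follows the commonly published formulation of PageRank. The damping factor of 0.85, the even redistribution of rank from pages without outgoing links, the fixed number of iterations, and the tiny example graph are all assumptions made for illustration. None of them comes from this text, and the sketch is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively estimate PageRank for a graph given as {page: [linked pages]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with equal importance for every page.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share (assumed damping term).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Idea 3: importance is split evenly among the child pages.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    # Ideas 1 and 2: each incoming link adds importance,
                    # and more important parents add more.
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web: D links to everyone, nobody links to D.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

In this example page C is linked from three pages and ends up with the highest rank. Page A has only two incoming links, but one of them comes from the important page C, so A ranks well above B. Page D has no incoming links and ranks lowest.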
The PageRanks of Web pages can be combined with other measures, such as content-based ones, to indicate the overall relevance of a page with respect to a given query. For example, for a given query, a page may be ranked based on a weighted sum of its similarity with the query and its PageRank. Among pages whose similarities to the query are about the same, this method ranks those with higher PageRanks first.
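A weighted sum of this kind might look like the sketch below. The weight of 0.7 on similarity, the assumption that both measures are already scaled to the range 0 to 1, and the sample pages are hypothetical; how an actual search engine balances the two measures is not stated here.

```python
def combined_score(similarity, page_rank, similarity_weight=0.7):
    """Weighted sum of query similarity and PageRank (both assumed in [0, 1])."""
    return similarity_weight * similarity + (1.0 - similarity_weight) * page_rank


# Hypothetical candidates: page name -> (similarity to the query, PageRank).
candidates = {
    "page_x": (0.80, 0.10),
    "page_y": (0.80, 0.40),
    "page_z": (0.30, 0.90),
}

ranking = sorted(
    candidates,
    key=lambda name: combined_score(*candidates[name]),
    reverse=True,
)
print(ranking)  # ['page_y', 'page_x', 'page_z']
```

Pages page_x and page_y match the query equally well, so the higher PageRank of page_y places it first. Page page_z has the highest PageRank but matches the query poorly, so with this weighting it still ranks last.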
Date added: 2024-07-23