Metasearch Engine
A metasearch engine is a system that provides unified access to multiple existing search engines. When a metasearch engine receives a query from a user, it sends the query to multiple existing search engines, and it then combines the results returned by these search engines and displays the combined results to the user.
A metasearch engine makes it easy for a user to search multiple search engines simultaneously while submitting just one query. A big benefit of a metasearch engine is its ability to combine the coverage of many search engines. As metasearch engines interact with the search interfaces of search engines, they can use Deep Web search engines just as easily as Surface Web search engines. Therefore, metasearch engine technology provides an effective mechanism to reach a large portion of the Deep Web by connecting to many Deep Web search engines.
Metasearch Engine Architecture. A simple metasearch engine consists of a user interface for users to submit queries, a search engine connection component for programmatically submitting queries to its employed search engines and receiving result pages from them, a result extraction component for extracting the search result records from the returned result pages, and a result merging component for combining the results.
If a metasearch engine employs a large number of search engines, then a search engine selection component is needed. This component determines which search engines are likely to contain good matching results for any given user query so that only these search engines are used for this query. Search engine selection is necessary for efficiency considerations. For example, suppose only the 20 best-matched results are needed for a query and there are 1000 search engines in a metasearch engine.
The 20 best-matched results can come from at most 20 search engines (the extreme case in which each result comes from a different one), so at least 980 search engines are not useful for this query. Sending a query to useless search engines causes serious inefficiencies, such as heavy network traffic from transmitting unwanted results and wasted system resources for evaluating the query.
We may have metasearch engines for document search engines and metasearch engines for database search engines. These two types of metasearch engines, though conceptually similar, need different techniques to build. They will be discussed in the next two subsections.
Document Metasearch Engine. A document metasearch engine employs document search engines as its underlying search engines. In this subsection, we discuss some aspects of building a document metasearch engine, including search engine selection, search engine connection, result extraction, and merging.
Search Engine Selection. When a metasearch engine receives a query from a user, it determines which search engines are likely to contain pages useful for the query and therefore should be used to process it. Before search engine selection can be performed, some information representing the contents of the set of pages of each search engine is collected. The information about the pages in a search engine is called the representative of the search engine.
The representatives of all search engines used by the metasearch engine are collected in advance and are stored with the metasearch engine. During search engine selection for a given query, search engines are ranked based on how well their representatives match the query.
Different search engine selection techniques exist and they often employ different types of search engine representatives. A simple representative of a search engine may contain only a few selected key words or a short description. This type of representative is usually produced manually by someone who is familiar with the contents of the search engine.
When a user query is received, the metasearch engine can compute the similarities between the query and the representatives, and then select the search engines with the highest similarities. Although this method is easy to implement, this type of representative provides only a general description about the contents of search engines. As a result, the accuracy of the selection may be low.
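The keyword-based selection just described can be sketched as follows. This is only an illustration: the engine names, the representative texts, and the choice of cosine similarity over raw term counts are assumptions, not details given in the text.

```python
import math
from collections import Counter

def cosine_similarity(query_terms, rep_terms):
    """Cosine similarity between two bags of words (raw term counts)."""
    q, r = Counter(query_terms), Counter(rep_terms)
    dot = sum(q[t] * r[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in r.values())))
    return dot / norm if norm else 0.0

def select_engines(query, representatives, k=2):
    """Return up to k engine names whose representatives best match the query."""
    query_terms = query.lower().split()
    scored = [
        (cosine_similarity(query_terms, rep.lower().split()), name)
        for name, rep in representatives.items()
    ]
    # Highest similarity first; ties broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]

# Hypothetical, manually written representatives (a few key words each).
REPRESENTATIVES = {
    "medsearch": "medicine health disease drug clinical",
    "lawsearch": "law court case statute legal",
    "sportsearch": "football tennis score league match",
}

selected = select_engines("clinical drug trials", REPRESENTATIVES)
```

Because such a representative holds only a handful of words, a query whose terms are not among them matches no engine at all, which illustrates why the accuracy of this method may be low.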
More elaborate representatives collect detailed statistical information about the pages in each search engine. These representatives typically collect one or several pieces of statistical information for each term in each search engine. As it is impractical to find out all the terms that appear in some pages in a search engine, an approximate vocabulary of terms for a search engine can be used. Such an approximate vocabulary can be obtained from pages retrieved from the search engine using sample queries.
Some of the statistics that have been used in proposed search engine selection techniques include, for each term, its document frequency, its average or maximum weight in all pages having the term, and the number of search engines that have the term. With the detailed statistics, more accurate estimation of the usefulness of each search engine with respect to any user query can be obtained.
The collected statistics may be used to compute the similarity between a query and each search engine, to estimate the number of pages in a search engine whose similarities with the query are above a threshold value, and to estimate the similarity of the most similar page in a search engine with respect to a query. These quantities allow search engines to be ranked for any given query and the top-ranked search engines can then be selected to process the query.
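As a rough illustration of ranking with per-term statistics, the sketch below stores, for each term in each engine, its document frequency and its average weight, and scores an engine by summing (document frequency / number of pages) times average weight over the query terms. This scoring formula and all the numbers are assumptions chosen for the example; it is not one of the published estimation methods.

```python
def engine_score(query_terms, rep):
    """Sum, over query terms, of the fraction of pages containing the term
    multiplied by the term's average weight in those pages."""
    n_pages = rep["num_pages"]
    score = 0.0
    for term in query_terms:
        stats = rep["terms"].get(term)
        if stats is None:
            continue  # term not in this engine's (approximate) vocabulary
        doc_freq, avg_weight = stats
        score += (doc_freq / n_pages) * avg_weight
    return score

def rank_engines(query, reps):
    """Return (engine, score) pairs, best-scoring engine first."""
    terms = query.lower().split()
    scores = {name: engine_score(terms, rep) for name, rep in reps.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical representatives: term -> (document frequency, average weight).
REPS = {
    "A": {"num_pages": 1000, "terms": {"deep": (200, 0.5), "web": (500, 0.25)}},
    "B": {"num_pages": 100, "terms": {"web": (10, 0.5)}},
}

ranking = rank_engines("deep web", REPS)
```

With these numbers, engine A scores 0.2 × 0.5 + 0.5 × 0.25 = 0.225 and engine B scores 0.1 × 0.5 = 0.05, so A is ranked first.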
It is also possible to generate search engine representatives by learning from the search results of past queries. In this case, the representative of a search engine is simply the knowledge indicating its past performance with respect to different queries. In the SavvySearch metasearch engine (now www.search.com), the learning is carried out as follows. For a search engine, a weight is maintained for each term that has appeared in previous queries.
The weight of a term for a search engine is increased or decreased depending on whether the search engine returns useful results for a query containing the term. Over time, if a search engine has a large positive (negative) weight for a term, the search engine is considered to have responded well (poorly) to the term in the past. When a new query is received by the metasearch engine, the weights of the query terms in the representatives of different search engines are aggregated to rank the search engines. The ProFusion metasearch engine also employs a learning-based approach to construct the search engine representatives.
ProFusion uses training queries to find out how well each search engine responds to queries in 13 different subject categories. The knowledge learned about each search engine from training queries is used to select search engines to use for each user query and the knowledge is continuously updated based on the user’s reaction to the search result (i.e., whether a particular page is clicked by the user).
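The term-weight learning described above for SavvySearch can be sketched in a simplified form. The text does not give the actual update rule, so the fixed step size, the class name `LearnedSelector`, and the use of a plain sum to aggregate weights are all assumptions made for illustration.

```python
from collections import defaultdict

class LearnedSelector:
    """Keeps one weight per (engine, term) pair, learned from past queries."""

    def __init__(self, step=1.0):
        self.step = step  # assumed fixed increment/decrement
        self.weights = defaultdict(lambda: defaultdict(float))

    def record_feedback(self, engine, query, useful):
        """Raise the weights of the query terms if the engine returned
        useful results for the query, lower them otherwise."""
        delta = self.step if useful else -self.step
        for term in set(query.lower().split()):
            self.weights[engine][term] += delta

    def rank(self, query, engines):
        """Rank engines by the summed weights of the query terms."""
        terms = set(query.lower().split())
        scores = {
            e: sum(self.weights[e].get(t, 0.0) for t in terms) for e in engines
        }
        return sorted(engines, key=lambda e: (-scores[e], e))

selector = LearnedSelector()
selector.record_feedback("A", "python tutorial", useful=True)
selector.record_feedback("B", "python tutorial", useful=False)
selector.record_feedback("B", "cooking recipes", useful=True)
order = selector.rank("python", ["A", "B", "C"])
```

Here engine C has no history for the term, so it stays at weight zero and ranks between the engine that responded well (A) and the one that responded poorly (B).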
Search Engine Connection. Usually, the search interface of a search engine is implemented using an HTML form tag with a query textbox. The form tag contains all information needed to connect to the search engine via a program. Such information includes the name and the location of the program (i.e., the search engine server) that processes user queries as well as the network connection method (i.e., the HTTP request method, usually GET or POST). The query textbox has an associated name and is used to fill out the query. The form tag of each search engine interface is pre-processed to extract the information needed for program connection.
After a query is received by the metasearch engine and the decision is made to use a particular search engine, the query is paired with the name of that search engine's query textbox (forming a name-value pair) and sent to the server of the search engine using the HTTP request method supported by the search engine. After the query is processed by the search engine, a result page containing the search results is returned to the metasearch engine.
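A minimal sketch of this connection step, using Python's standard `urllib`, is given below. The `form_info` dictionary stands for the information pre-extracted from a search engine's form tag; its keys, the URL `https://search.example.com/find`, and the textbox name `q` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_search_request(form_info, query):
    """Build the HTTP request that submits `query` to one search engine."""
    # Pair the query with the name of the engine's query textbox.
    encoded = urlencode({form_info["query_field"]: query})
    method = form_info["method"].upper()
    if method == "GET":
        # GET: the name-value pair travels in the URL's query string.
        separator = "&" if "?" in form_info["action"] else "?"
        return Request(form_info["action"] + separator + encoded, method="GET")
    # POST: the name-value pair travels in the request body.
    return Request(
        form_info["action"],
        data=encoded.encode("utf-8"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

# Hypothetical connection information extracted from a form tag.
form_info = {
    "action": "https://search.example.com/find",
    "method": "get",
    "query_field": "q",
}
request = build_search_request(form_info, "deep web")
```

Passing the request to `urllib.request.urlopen` would send it and return the result page. Real search forms often carry additional fields that would also need to be submitted; they are left out here for brevity.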
Search Result Extraction. A result page returned by a search engine is a dynamically generated HTML page. In addition to the search result records for a query, a result page usually also contains some unwanted information/links such as advertisements, search engine host information, or sponsored links. It is essential for the metasearch engine to correctly extract the search result records on each result page.
A typical search result record corresponds to a Web page found by the search engine. It usually contains the URL and the title of the page as well as some additional information about the page, often called the snippet of the page (usually the first few sentences of the page plus, for example, the date on which the page was created).
As different search engines organize their result pages differently, a separate result extraction program (also called extraction wrapper) needs to be generated for each search engine. To extract the search result records of a search engine, the structure/format of its result pages needs to be analyzed to identify the region(s) that contain the records and the separators that separate different records. Based on this analysis, a wrapper is constructed to extract the results of any query for the search engine. Extraction wrappers can be manually, semi-automatically, or automatically constructed.
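To make the idea of a manually constructed wrapper concrete, here is a small sketch built on Python's standard `html.parser`. It assumes a hypothetical result page layout in which each record is a `div` with class `result` containing a link (URL and title) and a paragraph with class `snippet`; a real search engine would need its own analysis and its own wrapper.

```python
from html.parser import HTMLParser

class ResultWrapper(HTMLParser):
    """Extracts records from the assumed <div class="result"> layout."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record currently being built
        self._field = None    # field receiving text: "title" or "snippet"
        self._depth = 0       # nesting depth of <div> tags inside a record

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._current is not None:
                self._depth += 1
            elif attrs.get("class") == "result":
                self._current = {"url": "", "title": "", "snippet": ""}
                self._depth = 1
        elif self._current is not None:
            if tag == "a" and not self._current["url"]:
                self._current["url"] = attrs.get("href") or ""
                self._field = "title"
            elif tag == "p" and attrs.get("class") == "snippet":
                self._field = "snippet"

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if tag in ("a", "p"):
            self._field = None
        elif tag == "div":
            self._depth -= 1
            if self._depth == 0:  # closing tag of the record itself
                record = {k: v.strip() for k, v in self._current.items()}
                self.records.append(record)
                self._current = None

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data

def extract_records(html):
    wrapper = ResultWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records

PAGE = """
<html><body>
<div class="ad"><a href="https://ads.example.com">Buy now</a></div>
<div class="result">
  <a href="https://example.org/a">First page</a>
  <p class="snippet">Text about metasearch engines.</p>
</div>
<div class="result">
  <a href="https://example.org/b">Second page</a>
  <p class="snippet">Text about the Deep Web.</p>
</div>
</body></html>
"""

records = extract_records(PAGE)
```

The advertisement block is skipped because it lies outside the region identified as containing records, which is the role the region and separator analysis plays in a wrapper.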
Result Merging. Result merging is the task of combining the results returned from multiple search engines into a single ranked list. Ideally, pages in the merged result should be ranked in descending order of the global matching scores of the pages, which can be accomplished by fetching/downloading all returned pages from their local servers and computing their global matching scores in the metasearch engine. For example, the Inquirus metasearch engine employs such an approach. The main drawback of this approach is that the time it takes to fetch the pages might be long.
Most metasearch engines use the local ranks of the returned pages and their snippets to perform result merging to avoid fetching the actual pages (16). When snippets are used to perform the merging, a matching score of each snippet with the query can be computed based on several factors such as the number of unique query terms that appear in the snippet and the proximity of the query terms in the snippet.
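The two factors mentioned above can be combined in many ways; the sketch below shows one possibility. The formula (number of matched query terms, plus a proximity bonus equal to that number divided by the length of the shortest snippet window containing all matched terms) is an assumption made for illustration, not a formula from the text.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def smallest_window(tokens, terms):
    """Length in tokens of the shortest span containing every term in `terms`.
    Assumes `terms` has at least two elements, all present in `tokens`."""
    counts = {}
    best = len(tokens)
    left = 0
    for right, token in enumerate(tokens):
        if token in terms:
            counts[token] = counts.get(token, 0) + 1
        # Shrink the window from the left while it still covers all terms.
        while len(counts) == len(terms):
            best = min(best, right - left + 1)
            head = tokens[left]
            if head in counts:
                counts[head] -= 1
                if counts[head] == 0:
                    del counts[head]
            left += 1
    return best

def snippet_score(query, snippet):
    """Score a snippet by matched unique query terms and their proximity."""
    tokens = tokenize(snippet)
    matched = set(tokenize(query)) & set(tokens)
    if not matched:
        return 0.0
    score = float(len(matched))
    if len(matched) >= 2:
        # Proximity bonus in (0, 1]: largest when matched terms are adjacent.
        score += len(matched) / smallest_window(tokens, matched)
    return score

adjacent = snippet_score("deep web", "The deep web is large")
spread = snippet_score("deep web", "A deep dive into the web")
```

Both snippets contain both query terms, but the first scores higher because the terms are adjacent, while in the second they are spread over a five-token window.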
Recall that when search engine selection is performed for a given query, the usefulness of each search engine is estimated and is represented as a score. The search engine scores can be used to adjust the matching scores of retrieved search records, for example, by multiplying the matching score of each record by the score of the search engine that retrieved the record.
Furthermore, if the same result is retrieved by multiple search engines, the multiplied scores of the result from these search engines are aggregated, or added up, to produce the final score for the result. This type of aggregation gives preference to those results that are retrieved by multiple search engines. The search results are then ranked in descending order of the final scores.
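The score adjustment and aggregation described in the last two paragraphs can be sketched as follows. Using the URL to decide that two results are the same, as well as the engine names and all score values, are assumptions made for the example.

```python
from collections import defaultdict

def merge_results(results_by_engine, engine_scores):
    """Merge per-engine result lists into one ranked list.

    results_by_engine: engine name -> list of (url, matching_score)
    engine_scores:     engine name -> score from search engine selection
    """
    final_scores = defaultdict(float)
    for engine, records in results_by_engine.items():
        for url, matching_score in records:
            # Adjust by the engine score, then add up across engines
            # when the same result is retrieved more than once.
            final_scores[url] += matching_score * engine_scores[engine]
    # Descending final score; ties broken by URL for a stable order.
    return sorted(final_scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical inputs.
results_by_engine = {
    "A": [("https://example.org/1", 0.5), ("https://example.org/2", 1.0)],
    "B": [("https://example.org/1", 1.0), ("https://example.org/3", 0.25)],
}
engine_scores = {"A": 0.5, "B": 1.0}

merged = merge_results(results_by_engine, engine_scores)
```

In this example the first page is returned by both engines, so its final score is 0.5 × 0.5 + 1.0 × 1.0 = 1.25, which places it ahead of the pages returned by only one engine (0.5 and 0.25).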
Date added: 2024-07-23