Database Metasearch Engine
A database metasearch engine provides a unified access to multiple database search engines. Usually, multiple database search engines in the same application domain (e.g., auto, book, real estate, flight) are integrated to create a database metasearch engine. Such a metasearch engine over multiple e-commerce sites allows users to do comparison-shopping across these sites. For example, a metasearch engine on top of all book search engines allows users to find desired books with the lowest price from all booksellers.
A database metasearch engine is similar to a document metasearch engine in architecture. Components such as search engine connection, result extraction, and result merging are common in both types of metasearch engines, but the corresponding components for database metasearch engines need to deal with more structured data.
For example, result extraction needs to extract not only the returned search records (say books) but also lower level semantic data units within each record such as the titles and prices of books. One new component needed for a database metasearch engine is the search interface integration component. This component integrates the search interfaces of multiple database search engines in the same domain into a unified interface, which is then used by users to specify queries against the metasearch engine.
This component is not needed for document metasearch engines because document search engines usually have very simple search interfaces (just a textbox). In the following subsections, we present some details about the search interface integration component and the result extraction component. For the latter, we focus on extracting lower level semantic data units within records.
Search Interface Integration. To integrate the search interfaces of database search engines, the first step is to extract the search fields on the search interfaces from the HTML Web pages of these interfaces. A typical search interface of a database search engine has multiple search fields. An example of such an interface is shown in Fig. 2. Each search field is implemented by text (i.e., field label) and one or more HTML form control elements such as textbox, selection list, radio button, and checkbox.
The text indicates the semantic meaning of its corresponding search field. A search interface can be treated as a partial schema of the underlying database, and each search field can be considered as an attribute of the schema. Search interfaces can be manually extracted but recently there have been efforts to develop techniques to automate the extraction. The main challenge of automatic extraction of search interfaces is to group form control elements and field labels into logical attributes.
After all the search interfaces under consideration have been extracted, they are integrated into a unified search interface to serve as the interface of the database metasearch engine. Search interface integration consists of primarily two steps. In the first step, attributes that have similar semantics across different search interfaces are identified. In the second step, attributes with similar semantics are mapped to a single attribute on the unified interface.
In general, it is not difficult for an experienced human user to identify matching attributes across different search interfaces when the number of search interfaces under consideration is small. For applications that need to integrate a large number of search interfaces or need to perform the integration for many domains, automatic integration tools are needed. WISE-Integrator is a tool that is specifically designed to automate the integration of search interfaces. It can identify matching attributes across different interfaces and produce a unified interface automatically.
Result Extraction and Annotation. For a document search engine, a search result record corresponds to a retrieved Web page. For a database search engine, however, a search result record corresponds to a structured entity in the database. The problem of extracting search result records from the result pages of both types of search engines is similar (see search result extraction section).
However, a search result record of a database entity is more structured than that of a Web page, and it usually consists of multiple lower level semantic data units that need to be extracted and annotated with appropriate labels to facilitate further data manipulation such as result merging.
Wrapper induction in is a semi-automatic technique to extract the desired information from Web pages. It needs users to specify what information they want to extract, and then the wrapper induction system induces the rules to construct the wrapper for extracting the corresponding data. Much human work is involved in such wrapper construction. Recently, research efforts have been put on how to automatically construct wrapper to extract structured data.
To automatically annotate the extracted data instances, currently there are three basic approaches: ontology-based, search interface schema-based, and physical layout-based. In the ontology-based approach, a task-specific ontology (i.e., conceptual model instance) is usually predefined, which describes the data of interest, including relationships, lexical appearance, and context keywords. A database schema and recognizers for constants and keywords can be produced by parsing the ontology.
Then the data units can be recognized and structured using the recognizers and the database schema. The search interface schema-based approach is based on the observation that the complex Web search interfaces of database search engines usually partially reflect the schema of the data in the underlying databases. So the data units in the returned result record may be the values of a search field on search interfaces.
The search field labels are thereby assigned to corresponding data units as meaningful labels. The physical layout-based approach assumes that data units usually occur together with their class labels; thus it annotates the data units in such a way that the closest label to the data units is treated as the class label. The headers of a visual table layout are also another clue for annotating the corresponding column of data units. As none of the three approaches is perfect, a combination of them will be a very promising approach to automatic annotation.
Date added: 2024-07-23; views: 121;