Challenges of Document Search Engines
Although Web search engines such as Google, Yahoo, and MSN are widely used to find information on the Web, a number of challenges remain in improving their quality. In the following, we briefly introduce some of these challenges.
Freshness. Currently, most search engines depend on Web crawlers to collect Web pages from numerous Web sites and build their index databases from the fetched pages. To refresh the index database and provide up-to-date pages, they periodically (e.g., once a month) recollect Web pages from the Internet and rebuild the index database.
As a result, pages that have been added, deleted, or changed since the last crawl are not reflected in the current index database. Some pages thus cannot be found through the search engine, some retrieved pages are no longer available on the Web (i.e., dead links), and some pages are ranked on the basis of obsolete content. Keeping the index database up to date for a large search engine is a challenging issue.
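One widely used remedy is incremental recrawling: instead of re-fetching every page on a fixed schedule, the crawler asks each Web server whether a page has changed since the last crawl, using HTTP conditional requests, and re-indexes only the pages that changed. The following Python sketch shows the core idea; the function name and the shape of the stored index are illustrative assumptions, not the design of any particular search engine.

    import urllib.request
    import urllib.error

    def refetch_if_changed(url, last_modified):
        # Hypothetical helper: send an If-Modified-Since header so the
        # server can reply 304 Not Modified instead of the full page.
        req = urllib.request.Request(url)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                # Page changed (or server ignored the header): re-index it.
                return resp.read(), resp.headers.get("Last-Modified")
        except urllib.error.HTTPError as e:
            if e.code == 304:
                # Not modified: the copy in the index is still fresh.
                return None, last_modified
            raise

    # Usage sketch: index maps url -> Last-Modified value from the last crawl.
    # body, stamp = refetch_if_changed(url, index[url])
    # if body is not None: rebuild this page's index entries.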
Coverage. It was estimated that no search engine indexes more than one-third of the "publicly indexable Web." One important reason is that Web crawlers can only reach pages that are linked, directly or indirectly, to the initial seed URLs. The "Bow Tie" theory of Web structure (10) indicates that only about 30% of Web pages are strongly connected, which further illustrates this limitation of Web crawlers. How to fetch more Web pages, including those in the Deep Web, is a problem that needs further research.
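The reachability limitation can be made concrete with a toy breadth-first crawler: any page with no link path from the seed URLs is simply never discovered, which is one reason Deep Web content stays invisible. The URLs and link graph below are invented for illustration.

    from collections import deque

    def crawl(seeds, link_graph):
        # Breadth-first traversal of the link graph starting from the seeds;
        # in a real crawler, link_graph.get(url) would be the links parsed
        # out of the fetched page.
        seen, queue = set(seeds), deque(seeds)
        while queue:
            url = queue.popleft()
            for out in link_graph.get(url, []):
                if out not in seen:
                    seen.add(out)
                    queue.append(out)
        return seen

    link_graph = {
        "seed.example": ["a.example"],
        "a.example": ["b.example"],
        "island.example": ["hidden.example"],  # no inbound link from the seeds
    }
    print(crawl(["seed.example"], link_graph))
    # -> {'seed.example', 'a.example', 'b.example'}; island.example and
    #    hidden.example are never found, however long the crawl runs.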
Quality of Results. Quality of results refers to how well the returned pages match a given keyword query. Given a keyword query, a user wants the most relevant pages to be returned. Suppose a user submits "apple" as a query; a typical search engine will return all pages containing the word "apple," whether they concern an apple pie recipe or Apple computers. Both keyword-based similarity matching and the lack of query context compromise the quality of the returned pages.
One promising technique for improving the quality of results is personalized search, in which a profile is maintained for each user. The profile contains the user's personal information, such as specialty and interests, as well as information obtained by tracking the user's Web surfing behavior, such as which pages the user has clicked and how long the user spent reading them. A user's query can then be expanded based on his/her profile, and pages can be retrieved and ranked according to how well they match the expanded query.
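As a rough illustration of this idea, the sketch below accumulates a profile from clicked pages weighted by reading time, expands the ambiguous query "apple" with the user's strongest interest terms, and ranks pages by overlap with the expanded query. All names here are hypothetical, and a real system would use much richer weighting than raw reading time and term overlap.

    from collections import Counter

    class UserProfile:
        def __init__(self):
            self.term_weights = Counter()  # interest terms learned from clicks

        def record_click(self, page_terms, reading_time_sec):
            # Weight terms from a clicked page by how long the user read it.
            for term in page_terms:
                self.term_weights[term] += reading_time_sec

        def top_terms(self, n=3):
            return [t for t, _ in self.term_weights.most_common(n)]

    def expand_query(query_terms, profile, n_extra=3):
        # Append the user's strongest interest terms to the raw query.
        extra = [t for t in profile.top_terms(n_extra) if t not in query_terms]
        return list(query_terms) + extra

    def rank_pages(pages, expanded_query):
        # Score each page by simple term overlap with the expanded query.
        def score(terms):
            return len(set(expanded_query) & set(terms))
        return sorted(pages, key=lambda p: score(p["terms"]), reverse=True)

    # A user who reads computing pages issues the ambiguous query "apple".
    profile = UserProfile()
    profile.record_click(["computer", "macintosh", "software"], 120)
    profile.record_click(["laptop", "computer"], 60)

    pages = [
        {"url": "pie.example/recipe", "terms": ["apple", "pie", "recipe"]},
        {"url": "mac.example/news",   "terms": ["apple", "computer", "macintosh"]},
    ]
    for page in rank_pages(pages, expand_query(["apple"], profile)):
        print(page["url"])  # mac.example/news ranks first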
Natural Language Query. Currently, most search engines accept only keyword queries. However, keywords cannot precisely express users' information needs. Natural language queries, such as "Who is the president of the United States?", often call for direct answers that most current search engines cannot provide. Processing natural language queries requires not only understanding the semantics of a user's query but also different mechanisms for parsing and indexing Web pages.
The search engine ask.com can answer some simple natural language queries, such as "Who is the president of the United States?" and "Where is Chicago?", using its Web Answer capability. However, ask.com cannot yet answer general natural language queries, and there is still a long way to go before such queries can be answered precisely.
Querying Non-Text Corpus. In addition to textual Web pages, a large amount of image, video, and audio data also exists on the Web. How to index and retrieve such data effectively and efficiently is an open research problem for Web search engines. Although some search engines, such as Google and Yahoo, can search images, their technologies are still mostly based on keyword matching.