Box 2.2
How Search Engines Work
Search engines help users find information stored in Web pages on the Internet. Typically, a user types some words (the "search query") into a search engine, and the search engine returns a number of "links" on its results page. To reach any of these results, the user clicks on the link, which takes the user away from the search engine to the page at the uniform resource locator (URL) corresponding to that link.
A search engine works by matching the user's query against an index of Web pages (documents) on the Internet that it has stored in a database. An index is necessary because with over 2 billion pages on the Internet, a real-time search of all of them when someone makes an information request would be prohibitively expensive and time-consuming. An index allows a search to be completed in a much smaller amount of time (seconds rather than days or weeks), though at the cost of some incompleteness and inaccuracy (because pages may have changed or been added since the index was created).
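The idea of matching a query against a prebuilt index, rather than scanning every page at query time, can be illustrated with a minimal sketch. It assumes a tiny in-memory collection of pages; the URLs, the page text, and the function names build_index and search are hypothetical and chosen only for illustration. Real search engines use far more elaborate indexing and ranking.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word of the query (simple AND matching)."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical pages standing in for documents a spider has already retrieved.
pages = {
    "http://example.org/a": "search engines index web pages",
    "http://example.org/b": "spiders retrieve web pages",
    "http://example.org/c": "an index speeds up search",
}
index = build_index(pages)
print(sorted(search(index, "web pages")))
```

Because the index is built ahead of time, a query only looks up a few words instead of reading every page. The trade-off is the one noted above: a page that changes after indexing is still matched against its old text.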
No search engine indexes (or even could index) all of the pages on the Web, and each search engine indexes a different set of pages. For this reason, and because of the dynamic nature of the Web, all search engines are inherently "incomplete," and the contents of their indexes (and thus search results) differ from one another.
A search engine builds its index of Web pages by sending out a "spider" to retrieve the pages from Web sites. Spiders retrieve only static pages, not pages that are hidden in databases or are dynamically generated. Most spiders also obey the robots.txt file on a Web site; if the file says, "Do not index this site," they do not index that site. The resulting indexes can store millions of words and hundreds of thousands of sites.
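As a rough illustration of how a well-behaved spider might honor a site's exclusion file, the sketch below uses Python's standard urllib.robotparser module. In practice the file is named robots.txt and expresses its rules as User-agent and Disallow lines rather than as a sentence. The file contents, the spider name "ExampleSpider", the URLs, and the helper function allowed are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: every spider is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(allowed(ROBOTS_TXT, "ExampleSpider", "http://example.org/index.html"))
print(allowed(ROBOTS_TXT, "ExampleSpider", "http://example.org/private/data.html"))
```

A file containing "Disallow: /" under "User-agent: *" corresponds to the "Do not index this site" case. Compliance is voluntary: the file states the site's request, and it is up to each spider to check it before retrieving pages.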
A paper published in Nature in 1999 estimated the types of material indexed, excluding commercial sites. "Scientific and educational" sites were the largest category. Health sites, personal sites, and the sites of societies (scholarly or other) each accounted for a larger percentage than that estimated for pornography; that is, only a few percent of Web pages contained material that could reasonably be characterized as adult-oriented, sexually explicit material.1
For a more detailed description of how search engines index Web pages, interpret queries, and search their databases, see Appendix C.
1 Steve Lawrence and Lee Giles. 1999. "Accessibility of Information on the Web," Nature 400: 107-109.