Publication Data
Indexing the World Wide Web: The Journey So Far
Abstract: In this chapter, we describe the key indexing components of
today’s web search engines. As the World Wide Web has grown, the systems and methods
for indexing have changed significantly. We present the data structures used, the
features extracted, the infrastructure needed, and the options available for designing
a brand-new search engine. We highlight techniques that improve the relevance of results,
discuss trade-offs to best utilize machine resources, and cover distributed processing
concepts in this context. In particular, we delve into the topics of indexing phrases
instead of terms, storage in memory vs. on disk, and data partitioning. We conclude
with some thoughts on organizing information for newly emerging forms of data.
