Inverted Index for Search Engines

By Enzo Calamia

Inverted indexing is a technique to perform full-text searches across multiple documents efficiently.

The idea is to build a key-value store where keys are the terms encountered in each indexed document. The values are sets; each of them contains the references to the documents in where the term has been found.

Indexing Documents

When indexing a document, each term (word) is scanned. Each encountered term can be skipped if it is considered nonrelevant. It is typically the case of words such as "the," "a," "this," "that." We call those terms stop words. A term could be added to the index as a new key if it did not previously exist. If the term is already present (because other documents containing this term have already been indexed), the set is updated, so it now includes the reference of the document.

For example, let's say we have the three following documents.

d1: "Penguins are a group of aquatic flightless birds"

d2: "Dolphins are a widely distributed and diverse group of aquatic mammals"

d3: "Japanese bush warblers are Asian passerine birds more often heard than seen"

The resulting inverted index is the following structure. (Note that we have skipped the stop words.)

INDEX = {
  'aquatic': {'d1', 'd2'},
  'asian': {'d3'},
  'birds': {'d3', 'd1'},
  'bush': {'d3'},
  'distributed': {'d2'},
  'diverse': {'d2'},
  'dolphins': {'d2'},
  'flightless': {'d1'},
  'group': {'d1', 'd2'},
  'heard': {'d3'},
  'japanese': {'d3'},
  'mammals': {'d2'},
  'more': {'d3'},
  'often': {'d3'},
  'passerine': {'d3'},
  'penguins': {'d1'},
  'seen': {'d3'},
  'warblers': {'d3'},
  'widely': {'d2'}
}

Searching

Now, searching for the term birds is performed by a simple lookup on key birds, which returns the set {d3, d1}. The documents d1 and d3 are returned as expected because they both contain the term birds.

When searching for more than one term, multiple lookups are performed, and the resulting sets are merged with a set intersection.

Searching for terms aquatic birds gives us two sets: {d1, d2} and {d3, d1}. The intersection of those sets is: {d1, d2} ∩ {d3, d1} = {d1}. The document d1 is returned, and it contains both terms aquatic and birds.

Such lookups are generally performed in O(1) (hash map) or O(log n) (B-Trees or similar), which is efficient.
However, the set intersection algorithm complexity is necessarily linear. In practice, this may not be a problem because search results are generally limited to a fixed number of items (e.g., only the n most recent items are taken into account) thus limiting the intersection computation.

Basic implementation

For this article, I have written a short Python implementation.