# Inverted Index for Search Engines

By Enzo Calamia

Inverted indexing is a technique for performing full-text searches across multiple documents efficiently.

The idea is to build a key-value store whose keys are the terms encountered in the indexed documents. Each value is a set containing references to the documents in which the term was found.

## Indexing Documents

When indexing a document, each term (word) is scanned. A term is skipped if it is considered irrelevant, which is typically the case for words such as "the," "a," "this," and "that." We call those terms stop words. Any other term is added to the index as a new key if it does not already exist, with a set containing the reference to the document as its value. If the term is already present (because other documents containing it have already been indexed), its set is updated to include the reference to the new document.

For example, let's say we have the following three documents:

```
d1: "Penguins are a group of aquatic flightless birds"
d2: "Dolphins are a widely distributed and diverse group of aquatic mammals"
d3: "Japanese bush warblers are Asian passerine birds more often heard than seen"
```

The resulting inverted index is the following structure. (Note that we have skipped the stop words.)

```
INDEX = {
    'aquatic': {'d1', 'd2'},
    'asian': {'d3'},
    'birds': {'d3', 'd1'},
    'bush': {'d3'},
    'distributed': {'d2'},
    'diverse': {'d2'},
    'dolphins': {'d2'},
    'flightless': {'d1'},
    'group': {'d1', 'd2'},
    'heard': {'d3'},
    'japanese': {'d3'},
    'mammals': {'d2'},
    'more': {'d3'},
    'often': {'d3'},
    'passerine': {'d3'},
    'penguins': {'d1'},
    'seen': {'d3'},
    'warblers': {'d3'},
    'widely': {'d2'}
}
```
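The indexing steps described above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the names `STOP_WORDS`, `tokenize`, and `build_index` are hypothetical, and the stop-word list is an assumption chosen so that the three example documents reproduce the index shown above.

```python
import re
from collections import defaultdict

# Assumed stop-word list (not from the article); a real one would be longer.
STOP_WORDS = {"a", "and", "are", "of", "than", "the", "this", "that"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Map each term that is not a stop word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            if term in STOP_WORDS:
                continue  # skip irrelevant terms
            # Creates the set on first sight of the term, updates it otherwise.
            index[term].add(doc_id)
    return dict(index)


documents = {
    "d1": "Penguins are a group of aquatic flightless birds",
    "d2": "Dolphins are a widely distributed and diverse group of aquatic mammals",
    "d3": "Japanese bush warblers are Asian passerine birds more often heard than seen",
}

index = build_index(documents)
```

Using `defaultdict(set)` folds the "new key" and "existing key" cases into a single `add` call.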

## Searching

Now, searching for the term `birds` is performed by a simple lookup on the key `birds`, which returns the set `{d3, d1}`. The documents `d1` and `d3` are returned as expected because they both contain the term `birds`.

When searching for more than one term, multiple lookups are performed, and the resulting sets are merged with a set intersection.

Searching for the terms `aquatic birds` gives us two sets: `{d1, d2}` and `{d3, d1}`. The intersection of those sets is `{d1, d2} ∩ {d3, d1} = {d1}`. The document `d1` is returned, and it contains both terms, `aquatic` and `birds`.
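As a rough sketch, a multi-term search could look like the following. The function name `search` is hypothetical, and the snippet assumes the query is split on whitespace and contains no stop words. It repeats a fragment of the example index so that it runs on its own.

```python
def search(index, query):
    """Return the ids of the documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # One lookup per term; a term missing from the index matches no document.
    postings = [index.get(term, set()) for term in terms]
    # Merge the per-term results with a set intersection.
    return set.intersection(*postings)


# Fragment of the example index, repeated here for self-containment.
index = {
    "aquatic": {"d1", "d2"},
    "birds": {"d3", "d1"},
    "mammals": {"d2"},
}

single = search(index, "birds")        # documents d1 and d3
both = search(index, "aquatic birds")  # document d1 only
```

A term that is absent from the index yields an empty set, so any query containing it returns no documents.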

Such lookups are generally performed in *O(1)* on average (hash map) or *O(log n)* (B-trees or similar), which is efficient.

However, computing the set intersection takes time linear in the size of the sets involved. In practice, this may not be a problem because search results are generally limited to a fixed number of items (e.g., only the `n` most recent items are taken into account), which bounds the cost of the intersection.
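To make the cost concrete, here is one possible way to keep the intersection cheap, assuming the sets are hash-based: iterate over the smallest set only and stop once enough matches are found. The function `intersect` and its `limit` parameter are hypothetical, and this sketch stops at arbitrary matches rather than the most recent ones, which would require the document references to be ordered by date.

```python
from itertools import islice


def intersect(sets, limit=None):
    """Intersect hash sets, scanning only the smallest one.

    If `limit` is given, stop as soon as that many matches are found.
    """
    if not sets:
        return set()
    ordered = sorted(sets, key=len)
    smallest, others = ordered[0], ordered[1:]
    # Membership tests are O(1) on average, so the work is proportional
    # to the size of the smallest set times the number of other sets.
    matches = (doc for doc in smallest if all(doc in other for other in others))
    # islice consumes the generator lazily, so scanning stops at `limit`.
    return set(islice(matches, limit))


result = intersect([{"d1", "d2"}, {"d3", "d1"}])
capped = intersect([{"d1", "d2", "d3"}, {"d1", "d2", "d3", "d4"}], limit=2)
```

With `limit=None`, the function returns the full intersection.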

## Basic Implementation

For this article, I have written a short Python implementation.