How to build Google Search, for your data center

Search 101: a scalable file index

This article is part of a series about building Igneous DataDiscover, a searchable file index. See the overview post for the full series.

[Diagram: the DataDiscover pipeline. See the first post in this series for an overview of this diagram.]

We’ve come a long way since we first started this series! First, we pulled data from your storage device onto a VM in your data center. Then, the VM packaged this data and wrote it into a sorted list in the cloud. Now, finally, we are ready to use this data to build a search index, a metadata structure that can be used to process queries and return results.

Why do we need a search index?

Let’s suppose our entire dataset — all the files on the customer storage device — consists of only the following 3 files:

The metadata list — all the files that we want to search over:

    Index  Name              Owner
    0      /data/music.mp3   bob
    1      /data/sounds.mp3  alice
    2      /data/words.txt   alice

Let’s call this the metadata list. This sorted list, coupled with binary search, is great if we want to find files by name or prefix. For instance, the query “Find the file ‘/data/sounds.mp3’” can easily be answered in O(log N) time. However, a query like “Find any file whose name contains ‘sounds’” cannot be answered using binary search. Instead, we would have to iterate over the entire dataset looking for a match.
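To make this concrete, here is a minimal sketch in Python (an illustration, not the production code) of both kinds of lookup against a sorted list of names:

    import bisect

    # The metadata list: file names kept in sorted order.
    names = ["/data/music.mp3", "/data/sounds.mp3", "/data/words.txt"]

    def find_exact(name):
        """O(log N): binary search for an exact file name."""
        i = bisect.bisect_left(names, name)
        return i if i < len(names) and names[i] == name else None

    def find_containing(substring):
        """O(N): ordering doesn't help here, so every entry must be scanned."""
        return [n for n in names if substring in n]

    print(find_exact("/data/sounds.mp3"))  # 1
    print(find_containing("sounds"))       # ['/data/sounds.mp3']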

Similarly, suppose we wanted to answer the query “What is the total size of all of Alice’s files?” If our data was sorted by Owner, this would be easy, but it’s not. So, the fastest way to answer this query is to iterate the entire dataset, taking O(N) time.
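As a sketch (the sizes here are invented purely for illustration), the best we can do without an index is a full pass over the list:

    # Hypothetical (name, owner, size) rows; the sizes are made up.
    rows = [("/data/music.mp3", "bob", 4_000_000),
            ("/data/sounds.mp3", "alice", 2_500_000),
            ("/data/words.txt", "alice", 1_200)]

    # O(N): every row has to be inspected, because the list isn't sorted by owner.
    total_size = sum(size for _, owner, size in rows if owner == "alice")
    print(total_size)  # 2501200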

Fundamentally, binary search is great when the data is sorted by the field — name, extension, owner, etc — that we’re querying for, and completely useless otherwise.

A search index is a data structure that will allow us to quickly perform lookups on specific fields, even if our data is not sorted by that field. But, there’s a catch: we have to choose our fields ahead of time.

Choosing Fields and Terms

Before building the search index, we need to decide which fields we are going to support. In our example, it’s reasonable to allow searching by Path, Extension and Owner.

Then, we need to define an extraction function that, given a (Name, Metadata) pair, creates a list of terms for each of our chosen fields.

For /data/words.txt, that looks like this:

Fields and terms for /data/words.txt:

    Path:       data, words
    Extension:  .txt
    Owner:      alice

Note that, for a single field, we can have multiple extracted terms. In this case, searches for both “Path=data” and “Path=words” should return /data/words.txt.
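Here is a sketch of what such an extraction function might look like in Python; the tokenization rules (split the path on “/”, strip the extension from each component) are assumptions for illustration, not necessarily what DataDiscover does:

    import os

    def extract_terms(name, metadata):
        """Map a (Name, Metadata) pair to a list of (field, term) pairs."""
        terms = []
        # Path: one term per path component, with any extension stripped.
        for part in name.strip("/").split("/"):
            terms.append(("Path", os.path.splitext(part)[0]))
        # Extension and Owner: one term each.
        terms.append(("Extension", os.path.splitext(name)[1]))
        terms.append(("Owner", metadata["owner"]))
        return terms

    print(extract_terms("/data/words.txt", {"owner": "alice"}))
    # [('Path', 'data'), ('Path', 'words'), ('Extension', '.txt'), ('Owner', 'alice')]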

This is a mapping from a file to all fields and terms for that file. But, really, we want the inverse — a way to get from a specific field and term (like Path=words) to the matching file(s). That’s where our search index comes in.

Search Index

A search index (sometimes also called an inverted index) flips the mapping we have: term to file instead of file to term. To do this, we create a list of all fields and terms, and for each, list every file (and its index in the original metadata list) that matches that term.

Path=data
    /data/music.mp3 (index 0)
    /data/sounds.mp3 (index 1)
    /data/words.txt (index 2)
Path=music
    /data/music.mp3 (index 0)
Path=sounds
    /data/sounds.mp3 (index 1)
Path=words
    /data/words.txt (index 2)
Extension=.mp3
    /data/music.mp3 (index 0)
    /data/sounds.mp3 (index 1)
Extension=.txt
    /data/words.txt (index 2)
Owner=alice
    /data/sounds.mp3 (index 1)
    /data/words.txt (index 2)
Owner=bob
    /data/music.mp3 (index 0)

This index works just like an index in a book. If we wanted to find information about cats in an encyclopedia, we’d flip to the back, look for the index entry for “cats”, and know that all the pages there are relevant to our search. Similarly, to find the total size of Alice’s files, we check our search index for “Owner=alice”, look up each match in our metadata list, and sum the sizes together.
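As a sketch, with the index written out as a plain dictionary of term → list of row indices (and with invented sizes), that lookup touches only the matching rows:

    # Term -> indices of the matching rows in the metadata list.
    index = {"Owner=alice": [1, 2],
             "Owner=bob": [0],
             "Extension=.mp3": [0, 1]}

    # Per-row sizes; these values are hypothetical.
    sizes = [4_000_000, 2_500_000, 1_200]

    # Only rows 1 and 2 are touched -- no full scan required.
    print(sum(sizes[i] for i in index["Owner=alice"]))  # 2501200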

If we end up with T total terms, creating the search index takes at most O(N·T) time: we iterate over the metadata list and, for each file, compute all of its field/term pairs from its metadata, then append the file (and its index) to the appropriate term lists. In practice each file only produces a handful of terms, so the total work is much closer to linear in N.
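A minimal sketch of that pass, using the same simplified term extraction as before (again an assumption for illustration, not the real implementation):

    from collections import defaultdict

    # (name, owner) pairs standing in for the full metadata list.
    metadata_list = [("/data/music.mp3", "bob"),
                     ("/data/sounds.mp3", "alice"),
                     ("/data/words.txt", "alice")]

    index = defaultdict(list)  # "Field=term" -> list of row indices

    for i, (name, owner) in enumerate(metadata_list):
        stem, ext = name.rsplit(".", 1)
        terms = ["Path=" + part for part in stem.strip("/").split("/")]
        terms += ["Extension=." + ext, "Owner=" + owner]
        for term in terms:
            index[term].append(i)

    print(index["Owner=alice"])      # [1, 2]
    print(index["Extension=.mp3"])   # [0, 1]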

This process isn’t particularly fast, but it’s worth it — like my grandmother used to say, “a bit of effort up-front building a search index goes a long way to save you expensive O(N) queries later.”

Compression

“But wait!”, you exclaim. “Surely storing all these lists is wildly space-inefficient!” And you are absolutely right. So, we change the format of the lists, using a bitmap instead.

For every term, we create a bitmap — a list of 0’s and 1’s — of size N, where N is the total number of files (in this case, 3). In this bitmap, we set bit i (0-indexed) to 1 if file i matches the given term. For example, the list for “Owner=alice” becomes the bitmap 011, because entries 1 and 2 in our metadata list match this query. So, the search index is transformed to look like this:

Path=data        111
Path=music       100
Path=sounds      010
Path=words       001
Extension=.mp3   110
Extension=.txt   001
Owner=alice      011
Owner=bob        100

This format is definitely a little more concise. Now, to look up the total size of Alice’s files, we go to the bitmap for “Owner=alice”, see that bits 1 and 2 are set, look up rows 1 and 2 in our original dataset, and sum those sizes together.
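One simple way to sketch this in Python is to store each bitmap as an integer, with the leftmost bit standing for row 0 to match the table above (this illustrates the idea, not the on-disk format; the sizes are invented):

    N = 3
    sizes = [4_000_000, 2_500_000, 1_200]   # hypothetical per-row sizes

    # Leftmost bit = row 0, so "Owner=alice" (011) means rows 1 and 2 match.
    bitmaps = {"Owner=alice": 0b011, "Extension=.mp3": 0b110}

    def matching_rows(bits):
        """Return the row numbers whose bit is set."""
        return [i for i in range(N) if bits & (1 << (N - 1 - i))]

    rows = matching_rows(bitmaps["Owner=alice"])    # [1, 2]
    print(sum(sizes[i] for i in rows))              # 2501200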

When we get to billions of files, these bitmaps can become quite large — a billion bits per term is about 120 MB, and a typical dataset has tens or hundreds of thousands of terms. However, it’s important to note that these bitmaps are very sparse, meaning that most bitmaps will contain far more zeros than ones. This is because most terms apply to a small proportion of all files — for example, Alice probably only owns a few thousand out of a billion files, meaning that the bitmap for “Owner=alice” will be mostly zeros.

This sparseness property means that the bitmaps are highly compressible, allowing us to save a significant amount of space without sacrificing performance.
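As a rough illustration of how well a sparse bitmap compresses, here general-purpose zlib stands in for the purpose-built bitmap compression schemes (such as run-length encoding or Roaring bitmaps) that a real index would more likely use:

    import random
    import zlib

    # A sparse bitmap: one million files, only ~2,000 of them owned by "alice".
    n_files = 1_000_000
    bitmap = bytearray(n_files // 8)               # 125,000 bytes, all zeros
    random.seed(0)
    for i in random.sample(range(n_files), 2_000):
        bitmap[i // 8] |= 1 << (i % 8)

    compressed = zlib.compress(bytes(bitmap))
    print(len(bitmap), "->", len(compressed), "bytes")
    # Prints something like "125000 -> a few thousand bytes".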

Complex Queries

We get another nice property from bitmaps — it is very easy to perform bitwise arithmetic on them. This allows us to easily perform AND and OR operations, like “Find mp3 files owned by Alice.” To do this, we take the bitmap for “Extension=.mp3” (110) and “Owner=alice” (011), bitwise AND them together, and use the result (010) to identify the rows in the original table that match our query — in this case, row 1: /data/sounds.mp3.

AND’ing our bitmaps column by column yields a new bitmap that answers the compound query:

    Extension=.mp3   110
    Owner=alice      011
    AND              ---
                     010   ->  /data/sounds.mp3 (index 1)

We can take a similar approach for OR queries, as well as complex combinations of ANDs and ORs. NOT is also straightforward; we can just invert the entire bitmap, replacing ones with zeros and vice versa.
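Using the same integers-as-bitmaps sketch as before, these boolean operators map directly onto Python’s bitwise operators:

    N = 3
    names = ["/data/music.mp3", "/data/sounds.mp3", "/data/words.txt"]
    mp3   = 0b110   # Extension=.mp3
    alice = 0b011   # Owner=alice

    def rows(bits):
        """Translate a bitmap back into file names (leftmost bit = row 0)."""
        return [names[i] for i in range(N) if bits & (1 << (N - 1 - i))]

    print(rows(mp3 & alice))               # AND: ['/data/sounds.mp3']
    print(rows(mp3 | alice))               # OR:  all three files
    print(rows(~alice & ((1 << N) - 1)))   # NOT: ['/data/music.mp3']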

Continuous Queries

Our field and term strategy is great for representing discrete fields, like “owner”, where there are only a few term options. However, it’s not as straightforward to represent fields like “size” or “last modified time”, where there are many possibilities. It wouldn’t make sense to, for instance, create a separate bitmap for every possible time value.

Instead, we group times into buckets — for example, one for every day since 1960. We can then create terms and bitmaps for each of these time buckets — about 22,000 of them. Now, if we want to find all files modified between two times, we simply OR together the bitmaps for all the days that fall within that range.
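Here is a sketch of the bucketing and of a range query, again with integers standing in for bitmaps; the per-day granularity and the 1960 epoch follow the example above, while the file timestamps are invented:

    from datetime import date, datetime

    EPOCH = date(1960, 1, 1)

    def day_bucket(ts: datetime) -> int:
        """Bucket a timestamp into 'days since 1960-01-01'."""
        return (ts.date() - EPOCH).days

    # Hypothetical modification times for rows 0, 1 and 2.
    mtimes = [datetime(2020, 3, 1), datetime(2020, 3, 2), datetime(2020, 6, 30)]

    # One bitmap (here an int, bit i = row i) per day bucket.
    day_bitmaps = {}
    for i, ts in enumerate(mtimes):
        bucket = day_bucket(ts)
        day_bitmaps[bucket] = day_bitmaps.get(bucket, 0) | (1 << i)

    # Range query: OR together the bitmap of every day in the range.
    lo = day_bucket(datetime(2020, 3, 1))
    hi = day_bucket(datetime(2020, 3, 31))
    result = 0
    for day in range(lo, hi + 1):
        result |= day_bitmaps.get(day, 0)

    print(bin(result))  # 0b11 -> rows 0 and 1 were modified in March 2020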

Conclusion

Together, these strategies allow us to query our metadata quickly. Instead of having to iterate over every file for a query, we iterate over only the matches, saving a lot of time. As a tradeoff, we have to build and store hundreds of thousands of bitmaps, which together make up our search index.

As it turns out, both building and querying a search index are highly parallelizable, allowing us to significantly speed up the process if we break the work down properly. Managing this parallelism is an interesting distributed systems problem, which we will explore next time.
