Growing a search index from 100 million to 100 billion files
In previous posts of this series, we’ve been focusing mostly on the algorithms and strategies that we used to build Data Discover, from reading to transporting and finally indexing and querying the data. However, even with carefully chosen data structures and algorithms, every part of the process will quickly run into the limits of what a single machine can do. The only way to get beyond this point is to split the work amongst multiple computers. Therefore, it’s critical to ensure that we have a strategy to break down and parallelize every step of the process. This will enable us to scale horizontally by adding more machines when we need to handle more load.
This post will go through our data pipeline step by step to describe how it scales.
Reading and Transporting Metadata
The first major parallelization decision that needs to be made is how to break down the data into manageable chunks. Each chunk will be its own queryable metadata index, and can be processed independently of everything else. On one extreme, we could treat the entire data center as a single entity, creating just one index, but this doesn’t scale well, as we would be unable to split the total work across many machines.
On the opposite extreme, we could create a metadata index for every file or for every directory in the data center. This, too, is a bad idea at scale — having billions of tiny indices is extremely inefficient, since our algorithms and data structures are optimized for large amounts of data.
We want to choose a split that would create only hundreds or thousands of indices — enough to break the problem down, but not so many that each index becomes tiny. For us, the natural split was to use the volume level on the filer (NFS exports or SMB shares) — every volume on the storage device is treated as a separate data set, entirely independent from all the others. This is a pretty good choice because large exports typically don’t exceed a billion files, which is an amount we can handle with just one index. On the low end, exports could have just tens or hundreds of files, which is inefficient but fairly atypical.
Once we have this split logically laid out, we can easily assign different volumes to different VMs, which can work in parallel. If we need this step to happen faster we can stand up additional VMs in order to process more volumes at a time.
Microservices in the Cloud
On the cloud, we have broken down the process of indexing and querying data into several separate components, each handled by a separate microservice. For the unfamiliar, the microservice pattern involves creating separate, independently deployed services (for example, executable binaries on Docker images), each of which performs a single subtask, and interacts with the others to perform the overarching task. In contrast, a monolith is a single service that does all the work itself.
Lately, microservice-based architecture is all the rage, and for good reason. In addition to being easier to reason about and more fault-tolerant, a well-designed set of microservices is independently scalable. This means, if one part of the system needs to handle a heavier load, it is possible to scale just that part of the system while leaving other sections alone. In a crude example, if Facebook suddenly had a significant increase in commenting activity without an increase in posts, they could add more nodes to the Comments Service cluster and leave the Posts Service cluster as-is (assuming that they chose to architect their system this way).
Choosing a microservices architecture is as much an art as it is a science. In a good design, each microservice does one nontrivial thing and has clear ownership of that thing. An easy test is to ask “What are the tasks of my process that I want to be able to scale independently?”. For us, the three tasks that we needed to scale independently were indexing, querying, and namespacing. Each task will be handled by a separate service, which will consist of one or more machines working together. We’ll go into each of the services in order.
Our indexing service is in charge of creating search indices. As described in earlier posts, it receives and combines data into a sorted LSM tree. Then, it uses this tree to build a search index. Each indexing service can create a few indices at a time — if we have many indices to create we can spin up more instances to go faster.
After indices are created, the indexing service is no longer responsible for them. Instead, a separate querying service is in charge of opening the indices and performing queries.
Since the indexing and querying services are separate, increased index load for new data does not affect query performance. Similarly, a slew of customer-issued queries won’t slow down our indexing. If we find that our query times are slower than desired due to a high volume of data, we can scale this service up in order to query more indices at once.
The catch with our strategy of breaking down the data into a per volume is that, eventually, we’ll need to recombine the data. If we have a file system containing /data/volume1 and /data/volume2, we will create separate indices for each. A query to one index will contain results only for the volume it represents. However, users will still want to be able to perform a global query on /data, meaning that we will need to merge results from multiple indices.
So, when a query comes in for /data, the Namespacing Service will issue multiple parallel requests, performing that query on /data/volume1 and /data/volume2. These requests get picked up by elements of the Querying Service, which will then perform the queries on the appropriate indices and return the results to the Namespacing Service. Finally, the Namespacing Service aggregates the results from the separate queries and returns the final result to the caller.
This aggregation step cannot be parallelized, but is a relatively lightweight operation — typically, a simple sum or concatenation of results. As a result, we limit Namespacing Service to a single instance per deployment, since additional resources are not needed on this level.
Our parallelizable architecture allows us to grow to enormous scale before we hit bottlenecks. At our largest customers, we are able to process tens or hundreds of billions of files within a weekend. This enables us to reindex customer data at least once a week in order to keep our visualizations fresh. And because of our parallelized query architecture, we can respond to most queries within a matter of seconds, ensuring our UI is responsive and usable.
The end result is a system that allows customers to search across one or many data centers to find files by specific metadata attributes. Additionally, they can quickly identify large amounts of data that are taking up valuable space on expensive storage, and either delete or move this data to the cloud. For many, this is a game-changer, simplifying the process of learning about and managing their data.
Thank you to Lily Bowdler and Carolyn Hughes, without whom this series would not have been possible. Data Discover represents the collaborative work of many people across the Igneous Engineering team and wider organization. To learn more about it or try it yourself, check out our website.