You won’t believe how we efficiently exported a billion rows!

7 min readOct 30, 2020

Shipping metadata to the cloud with log-structured merge trees

This article is part of a series about building Igneous Data Discover, a searchable file index. Click Here for the overview.

Go to the first post in this series for an overview of this diagram

In the previous post, we talked about how we could efficiently pull metadata from billions of files in a file system. But this metadata alone is not enough to quickly answer queries. Consider, for example, the sample question we posed at the beginning: What is the total size of all .mp3 files in our datacenter?”. To answer this using our big list of (filename, file metadata) pairs, we’d need to go check every file to see if it matches our filter criteria, and if it does, add the result to a growing total.

Ultimately, to answer this and other queries efficiently, we’ll need to transform the data into a structure called a search index, but building a search index is memory- and CPU-intensive. So, before we can work on building the search index, we need to ship this data from the scanning VM onto a more powerful compute node. This article is about the crucial data transfer step, and we will discuss how to build the index next time.

Transferring Metadata to the Cloud

As you may imagine, it’s nontrivial to move metadata about a billion files from our VM to a compute node. To do so efficiently, we’ll need to carefully choose a strategy that allows for fast transfer, in order to keep up with the speed of our scan. At the same time, we need the metadata to end up in sorted order on the cloud, because it must be sorted to build a search index.

To help think through our method, let’s consider two easy approaches:

Approach 1: We can keep a sorted list in the cloud, on disk. Every time our scanner discovers a new file, we upload its metadata to the cloud and insert it into the sorted list.

Under this approach, every insert would take O(log N) time, where N is the size of the sorted list. As the scan progresses, we’d notice inserts taking longer and longer as N increases. This is not great, because we’d prefer to finish the scan part of the task as soon as possible in order to free up resources in the customer’s datacenter.

We can do slightly better by adding an intermediate data structure — keeping sorted file metadata in memory on the VM, and periodically writing this list to the cloud, where it gets merged into the sorted list. However, later batches will still take a long time to insert, delaying the scan.

Approach 2: Alternatively, we can optimize for fast writes by postponing all sorting until the scan is over. We could keep unsorted batches in memory on the VM, and periodically write this list to the cloud, where they get appended to a giant unsorted list. When the scan is finished, we run an O(N log N) sort on the list.

This approach has some nice properties on the write side — every insert is constant time, and batching inserts together minimize the number of network requests. However, the sorting phase at the end of the scan would take a large amount of time and space.

So, we’d like to get the advantages of quick inserts, without paying the cost of a long post-processing phase. And, we’d like to minimize network requests by staging batches of metadata locally before writing them to our cloud data structure in bulk.

This locally staged metadata needs to be in a data structure that can be quickly written to, easily shipped to the cloud, and efficiently merged with the rest of our data there. And, this data structure should enable a sorted read when we’re all done.

Log-Structured Merge Trees

We decided to use a variant of the log-structured merge (LSM) tree, a data structure that, thanks to its internal layout, provides the performance characteristics we need. While LSM trees are not the only tool for the job (and may not even be the best, depending on exact needs), they meet our requirements. We’d already invested the time into implementing a highly optimized LSM Tree for our backup and Data Flow applications, so we were able to adapt our existing libraries for this new purpose.

The rest of this article will focus on what LSM trees are and why they work well for our use case. This discussion will stay high level, but I encourage you to check out some of these sources for more detail:
https://www.youtube.com/watch?v=_5vrfuwhvlQ
https://yetanotherdevblog.com/lsm/
https://en.wikipedia.org/wiki/Log-structured_merge-tree

Inserts

An LSM tree works because it’s a hybrid of the two approaches we explored above. It consists of two components — a small sorted list held in memory, and a pyramid-like structure stored on disk. Conceptually, the pyramid structure consists of a series of sorted layers, stacked on top of each other.

https://tenor.com/view/reverse-funnel-gif-9554341 — “*It’s Always Sunny in Philadelphia”*

To insert into an LSM, the metadata uploader on the VM builds up a number of small in-memory sorted lists. These lists are of size k, where k relates to how much data we’re willing to hold in memory. In the following diagrams, k=5 rows (typically, k is around 10,000). Each insert is O(log k), but because k is small, inserts are still very fast.

An in-memory list reaches its maximum size k, and then is sent to the cloud where it becomes Layer 0

After an in-memory list reaches its maximum size, it gets sent up to the cloud, where it becomes the top layer of our pyramid structure — Layer 0. When another list reaches its maximum size, it too is sent to the cloud and forms a new layer on top — Layer 1.

A second in-memory list reaches its maximum size, and becomes Layer 1 in the cloud structure

And that’s it! There’s nothing else to do on insertion, so inserting it into the table is super fast.

Compaction

Like a plate in a buffet, an LSM tree can only have so many layers piled onto it before it causes serious problems for you later. If we kept adding layers, we’d end up with N/k total layers at the end — which would again require a long post-processing phase to process and merge. So, we merge layers as we go in a process called compaction.

When we get close to our limit of total layers (typically around 7–10), we choose a layer and merge it into the layer below it. So, L1 and L0 could get merged together into a new sorted list, which would become the new L0.

When we get too many layers, adjacent layers are merge-sorted together to form a new, larger layer

As our layers get large, this can be a pretty expensive process. However, it’s important to note that this process can happen in the background, without blocking or slowing down inserts! So, we are able to sort as we go, while also maintaining a fast insert speed.

As we keep inserting layers, we merge layers together so that the bottom layers become larger and larger, and get updated less and less frequently. There’s a lot of activity toward the top of the pyramid, but not so much down below.

L0 is much larger and much less frequently touched then L3

Merging layers together is a little like the game 2048 — the larger tiles stay untouched as you merge smaller tiles together, but once you have another large tile, you can create even larger bottom layers.

The larger numbers in 2048 are like the bottom layers of an LSM tree

Reading

Because we have multiple layers, looking up specific values in an LSM tree is a little more expensive than if we had a single sorted list. To find the value for a particular key, we need to check each layer until we find what we’re looking for.

LSM trees have a number of tricks, such as Bloom Filters, that help save a lot of time here. But for DataDiscover, we are just using the LSM tree as a vehicle to transfer the data and merge it into a sorted table. We don’t expect to be performing specific lookups — instead, we will simply read the whole thing in sorted order to build our search index. So, we don’t need to optimize for arbitrary key lookups.

And there you have it! In a nutshell, LSM trees allow us to write data very quickly, producing a sorted list of all our key/value pairs without requiring a long post-processing phase. After we are done transferring all of the data from the source system, we are finally ready to use it to build a search index. More on that next time…

Special thanks to Jarrett Gaddy and Irene Jiang for contributing to the metaphors used in this post.