I think it’s safe to say that just about everyone who’s been to college misses it, at least a little bit. Things were simple and good and pure. For example, in the early days of my computer science classes, we were told to think about computers as little Turing Machines zipping around on infinite tapes of memory. Infinite! Probably not everyone misses college because of Turing Machines, but I think many of you can relate, right? As I’m sure you know, memory is in fact quite finite, and I soon learned in the real world of computing a lot of thought goes into how to manage that little scratchpad of precious memory.
At ThoughtSpot, the scratchpad is probably a bit bigger than average (several TBs of memory distributed over tens of machines) but the data we’re dealing with is just as big, and given our outrageously aggressive performance standards, memory management is absolutely crucial. We chose to use Google’s TCMalloc memory manager because it’s famously fast (it’s used by Chrome!). However it’s major flaw is that it’s very (very) memory greedy (it’s used by Chrome…). We’ve had a lot of fun wrestling with TCMalloc and we hope you find our experiences useful or at least interesting!
TCMalloc: What You Need to Know
TCMalloc is a memory allocator implementation from Google tailored for concurrent memory access. When used in a multithreaded environment, most traditional memory allocators require a shared lock around the heap (or at least around discrete blocks of heap memory called arenas). Conversely, TCMalloc shines in such environments by reducing lock contention, thereby improving performance. This is accomplished by affording each thread a ‘thread cache,’ or free lists of memory objects across various size classes. Because each thread cache is local to its respective thread, there is no lock contention when a thread allocates memory from its cache. When a thread cache free list of a particular size class is empty, more objects of that size are fetched from a central cache, a free list structure global to all threads. Because it is accessible from each thread, the central cache requires a concurrency control mechanism. However, rather than locking the entire cache, TCMalloc instead only locks the central cache size class from which a thread cache is replenishing. Therefore, simultaneous free list refreshes across different size classes from the central cache to thread caches is permitted. When the central cache is empty, a run of pages is allocated from a central page allocator (which directly reads discrete blocks of memory from the heap) as a span and divided into objects of the size class. The heap itself is maintained as a list of spans that are either allocated or free.
When a thread receives an allocation request, the request is first classified as large or small. If large, memory is allocated from the page allocator itself (the heap accessed in units of pages) and the allocation is recorded in terms of units of spans. If small, the thread cache first attempts to fulfill the request. If unable to (the free list of the requested size class is empty), the thread cache is refreshed from the central cache (which too is refreshed from the page allocator if need be). When memory is freed, it is reinserted into the respective free list in the current thread’s thread cache or its corresponding pages are marked as free in the case of large allocations.
Garbage collection plays an important role in regulating the performance of thread caches. Namely, garbage collection is required for thread caches when the size of the cache exceeds a predefined limit. This threshold often decreases as the number of threads increases to prevent fragmentation and corresponding memory wastage. In the garbage collection process, objects are moved from the thread cache to the central cache until the thread cache size threshold is satisfied.
OK Cool...Why Should I Use It?
TCMalloc’s most obvious benefit is reducing lock contention (which in turn yields faster performance) by giving each thread a cache from which to satisfy as many requests as possible. Thread caches enable fewer accesses to lock-protected resources, such as the first potential point of lock contention: refreshing from the central cache. Even still, contention only occurs if two threads are requesting refreshes of the same size class at the same time (since there is one lock per size class, not across the entire central cache).
Comparisons between TCMalloc and the traditional PTMalloc2 (one of the most prevalent multithreaded allocators before TCMalloc) are below and performance is measured in terms of millions of ops/sec (less lock contention yields more ops/sec). As one might assume, the benefits are not very apparent when comparing performance on one thread (left), while they become clear when considering performance across many threads (right):
Another benefit of TCMalloc is its avoidance of a header/footer metadata section per memory object in the free list. In traditional allocators, this overhead uses 4 bytes per object, which is especially wasteful for small allocations. With TCMalloc, memory metadata is instead maintained in a heap map that maps page numbers to span pointers, enabling full utilization of object contents.
Hmm...Seems Too Good To Be True
TCMalloc’s allocation and freeing pattern encourages reliance on thread caches, but these caches can yield fragmentation over time. For example, consider a usage pattern where one thread experiences a steady stream of allocation requests for the same size. While such usage will yield high turnover of one size class free list in the thread cache, there is zero utilization of other size class free lists. The majority of the memory in thread caches, which mainly exists in unused free lists, is wasted.
This intuition reveals a major issue with TCMalloc that starts with a cold start problem: TCMalloc initializes all thread caches with free lists of all size classes and continually pages more in when needed. Thread cache contents are not redistributed until garbage collection, which occurs when the cache reaches its size threshold. Thus, until a thread cache reaches its threshold, fragmentation within the thread cache can go unchecked. Moreover, when the majority of thread caches are free, TCMalloc holds onto a lot of unused memory. In fact, if you’ve ever noticed Google Chrome consuming jaw-dropping amounts of memory on your laptop, TCMalloc is pretty much the reason (of course, we love Chrome nonetheless)! Such memory greed has significant impacts on the Sage (search) and Falcon (database) workers of the ThoughtSpot system. These workers perform extremely memory intensive operations on hundreds of gigabytes of data (on each machine in the cluster). Given these types of allocation patterns, we found that machines were running out of memory left and right. When a worker’s request exceeds the allowed memory, then the cluster goes down (this is bad - it means downtime for the customer!). Needless to say, this was not an acceptable risk for us, so we took matters into our own hands.
Memory Usage (kb) vs Time(sec).
Combating TCMalloc’s memory greed
The first optimization ThoughtSpot leverages in conjunction with TCMalloc is a process-level utility called mnotify, which relays a notification from the kernel when the process surpasses a memory threshold. The utility leverages the event control support of memcg in the kernel to enable listeners that can execute arbitrary code in response to events (in this case, memory threshold notifications from mnotify). The memory threshold of mnotify is different from the TCMalloc-internal threshold that triggers garbage collection; it is not necessarily static and can be set as a percent of overall system memory or process memory. Furthermore, when broadcast, mnotify provides additional levels of severity: one level calls for all memory to be flushed from its thread caches to the central cache and another level calls for all memory to be flushed from the central cache to the page allocator. In this manner, mnotify acts as a process-wide listener that proactively flushes caches upon threat of fragmentation.
Our second optimization is proactively releasing memory upon building each Sage (search) index (which can be enormous - on the order of dozens or potentially 100 gigabytes). These builds use a lot of memory but happen relatively infrequently, so proactive releasing is a measure that is carried out when a memory-intensive operation such as an index build is scheduled. Builds are carried out across a collection of threads that each take responsibility for building an index for a shard of the data. Let’s say there are 4 shards, and 16 threads. Each time a shard is re-built, the thread cache for the thread that performed the build becomes huge. After a few builds, all 16 thread caches have blown up, and the system is probably running dangerously low on memory. To fix this, we explicitly release memory from the thread cache back to the central cache after each build. The central cache still gets pretty big, but no more than 4 shards worth of memory (because our 4 shards are built in parallel). This way, we achieve a 4x improvement in memory bloat.
For Your Consideration
While it comes with its share of issues, TCMalloc has enabled ThoughtSpot to handle large, concurrent memory workloads. However, it is not always a drop-in replacement for malloc! TCMalloc is optimized for high concurrency situations where individual threads have variable memory workloads. It was particularly fruitful for our engineering efforts because of our database engine, which is an intensive user of dynamic memory. Each search (which could utilize several cooperating threads) has varied memory requirements (based on the scope of the query in terms of the memory footprint of the tables/columns searched), making TCMalloc a particularly good fit with our architecture. Despite its speed, TCMalloc’s memory greed had to be tempered with additional optimizations that prevent out of memory situations: mnotify and proactive memory release enable ThoughtSpot to continuously regulate the memory overhead of thread caches and limit fragmentation. Without these additions, the nature of our workloads would cause the system to frequently hit the memory limit, causing production outages.
So while we don’t quite yet have infinite tape like we “did” in college, we’ve been able to wrangle TCMalloc to take advantage of every last bit of tape we do have at our disposal!