Understand a C++ Program's Memory Usage with Pprof

We live in a world with more data than ever before. This data explosion is only accelerating, meaning systems need to look at more and more information. For example, many of our customers typically run queries such as “best performing stores by revenue in 2019” over billions of rows and data values, and they expect insights back in seconds. Therefore, we need to make our individual C++ services as light as possible by looking for every opportunity to cut down memory usage.

Pprof is a very powerful tool to achieve this goal. We recently used it to find bottlenecks in one of our customer usage services, which keep historical searches and user preferences. The result? We reduced its memory footprint from 4.4GB to 2.1GB.

One of the most useful ways to use pprof is to generate an image of the current heap memory usage. For example:

[Image: pprof heap graph of the service]

Here’s how to do it.

Step 1: Getting an SVG representation of the current heap allocations:

  1. Copy the binary of the service you would like to profile to your local machine. For example: user_usage_service_main.
  2. Get a copy of the heapz_dump (the heap memory currently allocated by the service).
    1. If you are using TCMalloc for memory allocations, you can dump the heap memory using the following code snippet:

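#include <fstream>
#include <string>
#include "gperftools/malloc_extension.h"

// outdir is assumed to be defined by the caller: the directory to write the dump to.
const std::string outfile = outdir + "/heapz_dump";
std::string data;
// Fill data with the sampled heap profile of the running process.
MallocExtension::instance()->GetHeapSample(&data);
// Write the profile to the heapz_dump file.
std::ofstream(outfile, std::ios::binary) << data;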

  3. Confirm you have pprof on your machine by running: which pprof. You should see a path printed in the console if it is installed. Otherwise, for pprof installation see: https://github.com/google/pprof
  4. Now run pprof, passing the binary as the first argument and the heapz_dump file as the second argument. For example:

pprof user_usage_service_main heapz_dump  

     Note: the heapz_dump is a series of call-stack addresses and the number of bytes allocated from each stack. The binary holds the symbol names for those addresses. With the binary and the heapz_dump, pprof can match the function names to the amount of memory they allocated. If you open the heapz_dump, it starts with a header line followed by one record per sampled stack (an example of the header is quoted under Important notes and caveats).

  5. At this point a pprof shell will open: (pprof). Run the command web. This normally loads a web page with the SVG, but if pprof is running on a remote workstation you will need to copy the file to your local machine instead. You will see a log line in pprof such as:

'Loading web page file:////tmp/pprof15228.0.svg'. All we need to do is scp the SVG image. For example: scp <user>@<IP>:/tmp/pprof15228.0.svg  /Users/Desktop/
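To make the dump format a little more concrete, here is a minimal sketch that reads the header line of a heapz_dump. It assumes the header follows the layout of the example line quoted later in this post, and that the four numbers are in-use objects, in-use bytes, allocated objects and allocated bytes, which is our understanding of the format rather than something this post verifies. HeapHeader and ParseHeapHeader are names made up for this sketch and are not part of pprof or gperftools.

```cpp
#include <cstdio>
#include <string>

// Fields of the first line of a heap dump, e.g.
// "heap profile: 23684: 345859024 [ 23684: 345859024] @ heap_v2/262144".
// The field meanings below are an assumption about the format.
struct HeapHeader {
  long long in_use_objects = 0;
  long long in_use_bytes = 0;
  long long alloc_objects = 0;
  long long alloc_bytes = 0;
  long long sample_period = 0;  // the number after "heap_v2/"
};

// Hypothetical helper: fills *header and returns true when the line matches
// the expected layout, otherwise returns false.
bool ParseHeapHeader(const std::string& line, HeapHeader* header) {
  HeapHeader parsed;
  const int matched = std::sscanf(
      line.c_str(), "heap profile: %lld: %lld [ %lld: %lld] @ heap_v2/%lld",
      &parsed.in_use_objects, &parsed.in_use_bytes, &parsed.alloc_objects,
      &parsed.alloc_bytes, &parsed.sample_period);
  if (matched != 5) return false;
  *header = parsed;
  return true;
}
```

Calling ParseHeapHeader on the first line of the dump is a quick way to check which sampling period a service was actually running with before reading too much into the graph.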

Step 2: Inspecting the memory SVG:

The generated SVG will look like the image above. Here is how to interpret it:

  •  In the top left you will find the name of the service, the total heap memory recorded, and the size of the nodes dropped. The total should be very close to the physical + swap memory reported by a call to MallocExtension::instance()->GetStats(buf.get(), kBufSize). Unless the dropped size is specified as 0, not all nodes will be present in the graph, because showing every node makes it harder to read.
  •  Every node in the graph includes
    • The name of the function
    • The memory it is currently holding (in most cases this will be 0)
    • The memory currently held by the functions it called (pprof reports this as a cumulative figure that also includes the node's own memory). The leaves should add up to the total shown at the top of the image (this is not always the case, see the explanation below). Arrows point from caller to callee.
  • Important to note:
    • The image is a snapshot of where heap memory is allocated in the service at the time the heapz_dump was collected, together with the call stacks that led to those allocations. It shows the heap memory currently in use, not the total allocations made throughout the lifetime of the program (i.e. freed memory will not be presented).
    • The heap profile covers only heap memory and doesn't include stack memory.
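The two numbers on each node can be illustrated with a toy call tree. This is only a sketch: the function names and byte counts are invented, and a real pprof graph can have several callers per node, which a simple tree does not capture.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Toy call-tree node; every name and number here is made up for illustration.
struct Node {
  std::string function;
  std::int64_t flat_bytes;     // memory held by allocations made directly in this function
  std::vector<Node> children;  // functions it called
};

// Memory held by a node together with everything reachable through its callees.
std::int64_t CumulativeBytes(const Node& node) {
  std::int64_t total = node.flat_bytes;
  for (const Node& child : node.children) total += CumulativeBytes(child);
  return total;
}

// Example tree: main -> LoadIndex -> {BuildStats, ReadRows}.
Node ExampleTree() {
  Node build_stats{"BuildStats", 700, {}};
  Node read_rows{"ReadRows", 300, {}};
  Node load_index{"LoadIndex", 0, {build_stats, read_rows}};
  return Node{"main", 0, {load_index}};
}
```

In this toy tree main and LoadIndex hold nothing themselves, which mirrors the "in most cases this will be 0" remark above, while the two leaves account for the whole total.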

Section 3: Additional useful commands:

  1. To generate the SVG without dropping any nodes or edges run the following command:

pprof -nodefraction=0 -edgefraction=0 user_usage_service_main heapz_dump

  2. To see how much heap memory was allocated by a particular function call, you can write the heap profile out as text and see (a) each call stack and (b) the memory allocated by that stack. To do so:
    1. Set the HEAPPROFILE environment variable to a file prefix such as pprof_memory when running the binary. For example, run the command

env HEAPPROFILE=pprof_memory ./build-dbg/<name of your service test binary>

    2. Then run pprof with the binary and the generated heap profile file, using the following flags:

pprof --lines --stacks --show_bytes --text --alloc_space ./build-dbg/<your service test binary>  pprof_memory.0001.heap

    3. This will print the full call stacks and the memory allocated by each, as recorded in pprof_memory.0001.heap. Redirect the output to a text file if you want to keep it.

Section 4: Important notes and caveats:

  • The SVG we generate does not cover 100% of the memory taken by the program. Rather, it is built from a sample of the memory allocations made by the program. The larger the allocation, the higher the probability it was sampled. That's why you might notice that the sum of the leaf nodes' memory doesn't add up to 100%. By default, the sampling rate is set to 262144 bytes (i.e. allocations much larger than 262144 bytes are almost certain to be recorded, and smaller allocations are recorded with a probability that grows with their size). The sampling rate is given in the first line of the heapz_dump file. For example:

'heap profile: 23684: 345859024 [ 23684: 345859024] @ heap_v2/262144 14: 1120 [ 14: 1120] @ 0x175f01f 0x13c82e0 0x157'.

  • If you would like to increase the sampling rate (i.e. decrease the number from the default 262144) do the following:
    1. Stop your running service.
    2. Change the TCMALLOC_SAMPLE_PARAMETER value in your configuration file from 262144 to another number (numbers lower than 5000 will normally cause your service to crash because holding this many samples is memory intensive).
    3. Restart the service with the new configuration. The sampling will now use your new value. Note: once you restart the service, you might notice that the memory reported by pprof (in the upper left corner of the SVG) doesn't match the heap memory seen using a tool like atop. This difference comes from the memory used to store the samples themselves. pprof is smart enough not to add those to the heap profile it presents to you.
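To get a feel for how the sample parameter affects what ends up in the profile, here is a rough model of the chance that a single allocation is sampled. It assumes the gaps between samples follow an exponential distribution whose mean is the sample parameter, which is our understanding of how the sampler behaves and should be treated as an assumption. SampleProbability and UnsampleScale are hypothetical helper names.

```cpp
#include <cmath>

// Approximate probability that one allocation of size_bytes is sampled when a
// sample is taken on average once every sample_period bytes (assumed model).
double SampleProbability(double size_bytes, double sample_period) {
  return 1.0 - std::exp(-size_bytes / sample_period);
}

// Factor by which a sampled allocation of this size would be scaled up to
// estimate the real total, under the same assumed model.
double UnsampleScale(double size_bytes, double sample_period) {
  return 1.0 / SampleProbability(size_bytes, sample_period);
}
```

Under this model an 80-byte allocation is very unlikely to be sampled at a parameter of 262144 and far more likely at 5000, which is also why lowering the parameter means storing many more samples.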

Need Proof? Here’s what we did at ThoughtSpot

  1. After noticing one of our services consumed more memory than necessary, we ran pprof and noticed the following node in the graph:

  2. We realized that the UsageStats struct, which is stored many times in our index, was consuming almost 1GB, much more than expected.
  3. We found out that we mistakenly reported the size of the deque inside that struct using sizeof(deque), which returned 54 bytes regardless of the deque's contents. sizeof only reports the size of the deque object itself and not the heap memory the deque allocates.
  4. We then ran a small test where we created an std::deque and profiled the heap memory it takes (see the second item under Additional useful commands to get the memory taken by specific stack calls). We found out that a deque actually takes 512 bytes of heap memory. Multiplied across many instances, that consumed a lot of memory!
  5. We replaced the deque with a vector, which only takes 24 bytes of heap memory after initialization. This by itself led to a 1.3GB reduction in memory consumption.

The main takeaway: with C++ standard data structures, it is hard to know how much heap memory they take without knowing their internal implementations. When in doubt, pprof can be used to find out how much memory these data structures consume. Simply build a test where you initialize one such structure and add some elements, then use the commands in Section 3 to see the full list of memory allocations.
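As a quick cross-check that does not involve pprof at all, a counting allocator can show the gap between sizeof and the heap memory a container actually requests. This is a different technique from the profiling described above, and CountingAllocator, LiveHeapBytesFor and g_live_bytes are names invented for this sketch. The exact numbers depend on your standard library implementation, so treat them as something to measure rather than assume.

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <vector>

// Bytes currently allocated through CountingAllocator (illustrative global).
static std::size_t g_live_bytes = 0;

// Minimal allocator that forwards to std::allocator and tracks live bytes.
template <typename T>
struct CountingAllocator {
  using value_type = T;
  CountingAllocator() = default;
  template <typename U>
  CountingAllocator(const CountingAllocator<U>&) {}
  T* allocate(std::size_t n) {
    g_live_bytes += n * sizeof(T);
    return std::allocator<T>().allocate(n);
  }
  void deallocate(T* p, std::size_t n) {
    g_live_bytes -= n * sizeof(T);
    std::allocator<T>().deallocate(p, n);
  }
};
template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) { return false; }

using CountedDeque = std::deque<int, CountingAllocator<int>>;
using CountedVector = std::vector<int, CountingAllocator<int>>;

// Heap bytes held by a container after pushing `count` ints into it.
template <typename Container>
std::size_t LiveHeapBytesFor(int count) {
  Container c;
  for (int i = 0; i < count; ++i) c.push_back(i);
  return g_live_bytes;  // read while c is still alive
}
```

Comparing LiveHeapBytesFor<CountedDeque>(1) with sizeof(CountedDeque) on your own toolchain shows how far sizeof can be from the memory a single small deque really holds.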

Further reading:

  1. Google's pprof manual
  2. Pprof shell commands
  3. Memory bottleneck case study (implemented in Go, but many of the examples also apply to C++)