Poor performance can be caused by memory usage. Paul Floyd introduces Valgrind’s heap memory profiling tool Massif.
This time there is a bit more good news. You’ve identified your CPU bottlenecks and have ironed them out. But now you have customers complaining that performance is poor when they are using very big projects. When you hear the word ‘big’, you should start thinking ‘memory’.
Using system profiling tools
Before you do any lengthy testing, you can get a quick impression of your application’s memory use by using 'top' (or a similar process monitor tool).
Here I’m running a simulator on a 48Gbyte server. You should be looking at the Mem: and Swap: lines in the header block and the VIRT (virtual memory) and RES (resident memory) columns in Figure 1.
top - 17:21:05 up 9 days, 2:16, 2 users, load average: 1.09, 1.05, 1.01
Tasks: 232 total, 2 running, 230 sleeping, 0 stopped, 0 zombie
Cpu(s): 12.6%us, 0.0%sy, 0.0%ni, 87.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:  49414760k total, 21724376k used, 27690384k free, 185660k buffers
Swap: 49151992k total, 2128k used, 49149864k free, 2669128k cached

  PID USER   PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
12733 paulf  20  0 16.5g  16g  22m R 100.0 34.5 159:06.14 sim64.exe
23657 paulf  20  0 13252 1264  884 R   0.7  0.0   0:02.56 top

Figure 1
You can also get an idea of what your system is doing by using a tool like vmstat (on UNIX-like systems). If you type "vmstat 5" then you’ll sample the system every 5 seconds (with the first line being the average since the system booted). You’ll see something like Figure 2.
vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd     free   buff   cache   si so bi bo   in  cs us sy id wa st
 1  0   2128 27686448 185312 2667364    0  0  0  3    0   4 11  3 84  1  0
 1  0   2128 27686876 185312 2667364    0  0  0  0 1136 134 12  0 88  0  0
 1  0   2128 27687488 185312 2667364    0  0  0  0 1088  74 13  0 87  0  0

Figure 2
swpd, free, buff and cache correspond to similarly named values in the Mem: and Swap: lines of top. When your performance is really poor, it’s the si and so values that you should look at. They are the amounts of memory being swapped in and out from disk per second. Swapping is bad news: disk drives are many orders of magnitude slower than RAM.
Enough of system profiling, I’m here to write about Valgrind. In particular, Massif, Valgrind’s heap memory profiling tool. Massif will give you an overview of the memory used by your application over time. You can use it to help you to identify where and when your application allocates memory on the heap with a view to either reducing the amount of memory that you allocate or freeing it sooner. It can indicate that you have memory leaks, but it will only tell you about memory use, not memory abuse (like unreachable memory due to pointers going out of scope or being overwritten).
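To make that use-versus-abuse distinction concrete, here is a small sketch of my own (not from the article’s examples): overwriting a pointer makes its first block unreachable, yet Massif keeps counting those bytes as in use. The alloc_block() helper and its byte counter are hypothetical, there only to mimic what Massif would report.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: tracks the bytes Massif would consider "in use".
static std::size_t live_bytes = 0;

long *alloc_block(std::size_t n)
{
    live_bytes += n * sizeof(long);
    return new long[n];
}

void lose_a_block()
{
    long *p = alloc_block(1000);  // first allocation becomes live
    p = alloc_block(1000);        // first block now unreachable, but Massif
                                  // still counts it: memory use, not abuse
    delete [] p;                  // frees only the second block
    live_bytes -= 1000 * sizeof(long);
    // live_bytes still includes the lost first block; it is Memcheck,
    // not Massif, that would report it as definitely lost.
}
```

Running this under Massif would show a flat 8000-byte plateau (on a 64-bit system) that never returns to zero; only Memcheck would tell you why.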
I’ll start with a very simple example that just allocates and frees some memory (Listing 1).
// m1.cpp
#include <unistd.h>

const size_t DATA_SIZE = 100U;
const size_t BLOCK_SIZE = 1000000;

int main()
{
    long **data = new long *[DATA_SIZE];
    for (size_t i = 0; i < DATA_SIZE; ++i) {
        data[i] = new long[BLOCK_SIZE];
        // do something with data[i]
        usleep(100000);
    }
    sleep(1);
    for (size_t i = 0; i < DATA_SIZE; ++i) {
        delete [] data[i];
    }
    delete [] data;
}

Listing 1
To select Massif, run valgrind with the --tool=massif option. In this case, since the example uses sleep() and usleep(), I’ll add the --time-unit=ms option, the default time unit being instructions executed. There are further options to profile the stack, control the call context depth recorded and the number of snapshots taken. There isn’t much output to the terminal, just the small piece of header information in Figure 3.
valgrind --tool=massif --time-unit=ms ./m1
==12939== Massif, a heap profiler
==12939== Copyright (C) 2003-2012, and GNU GPL'd, by Nicholas Nethercote
==12939== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==12939== Command: ./m1
==12939==
==12939==

Figure 3
The important information is written to a file named massif.out.<pid> (you can control the file name with the --massif-out-file option). This isn’t really meant for human consumption. For completeness, I’ll include an extract, see Figure 4.
desc: --time-unit=ms
cmd: ./m1
time_unit: ms
#-----------
snapshot=0
#-----------
time=0
mem_heap_B=0
mem_heap_extra_B=0
mem_stacks_B=0
heap_tree=empty
[content deleted]
#-----------
snapshot=49
#-----------
time=9830
mem_heap_B=784000800
mem_heap_extra_B=345752
mem_stacks_B=0
heap_tree=detailed
n2: 784000800 (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
 n0: 784000000 0x400759: main (m1.cpp:12)
 n0: 800 in 1 place, below massif's threshold (01.00%)
[more content]

Figure 4
The output file is meant to be processed by ms_print, which writes to the terminal. First there is a short summary, then an ASCII art plot of the memory use as a function of time. Figure 5 is what my example code produces.
MB 763.3^ # | ::@:::::::# | ::: @: # | ::::::: @: # | :::::: ::: @: # | :@::: ::: ::: @: # | :::::@::: ::: ::: @: # | ::::: ::@::: ::: ::: @: # | ::::::::: ::@::: ::: ::: @: # | :@: ::::::: ::@::: ::: ::: @: # | ::::@: ::::::: ::@::: ::: ::: @: # | ::::: ::@: ::::::: ::@::: ::: ::: @: # | :::::: : ::@: ::::::: ::@::: ::: ::: @: # | :::::: ::: : ::@: ::::::: ::@::: ::: ::: @: # | :::: :::: ::: : ::@: ::::::: ::@::: ::: ::: @: # | ::@: :: :::: ::: : ::@: ::::::: ::@::: ::: ::: @: # | :::: @: :: :::: ::: : ::@: ::::::: ::@::: ::: ::: @: # | :@@:: : @: :: :::: ::: : ::@: ::::::: ::@::: ::: ::: @: # | :::::@ :: : @: :: :::: ::: : ::@: ::::::: ::@::: ::: ::: @: # | ::@:: ::@ :: : @: :: :::: ::: : ::@: ::::::: ::@::: ::: ::: @: # 0 +----------------------------------------------------------------------->s 0 11.13 |
Figure 5
As expected, the memory use rises monotonically for 10 seconds. The application allocates a bit over 800 million bytes, or 762.94MB, which is quite close to the peak shown in the graph. After this, the ms_print output shows information about snapshots taken during the execution of the application. Every snapshot gets a one-line summary of the heap use; by default, every tenth snapshot also shows details of where the memory was allocated. Figure 6 is the end of the output from the m1 test application.
 39          7,825      624,275,992      624,000,800       275,192            0
99.96% (624,000,800B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.96% (624,000,000B) 0x400759: main (m1.cpp:12)
|
->00.00% (800B) in 1+ places, all below ms_print's threshold (01.00%)
--------------------------------------------------------------------------------
  n        time(ms)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 40          7,926      632,279,520      632,000,800       278,720            0
 41          8,126      648,286,576      648,000,800       285,776            0
[content deleted]
 48          9,530      760,335,968      760,000,800       335,168            0
 49          9,830      784,346,552      784,000,800       345,752            0
99.96% (784,000,800B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.96% (784,000,000B) 0x400759: main (m1.cpp:12)
|
->00.00% (800B) in 1+ places, all below ms_print's threshold (01.00%)
--------------------------------------------------------------------------------
  n        time(ms)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 50          9,931      792,350,080      792,000,800       349,280            0
 51         11,134      800,353,608      800,000,800       352,808            0
99.96% (800,000,800B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.96% (800,000,000B) 0x400759: main (m1.cpp:12)
|
->00.00% (800B) in 1+ places, all below ms_print's threshold (01.00%)

Figure 6
We can see that the figure for the useful-heap matches what we expect. The extra-heap figure is the memory allocated for book-keeping. In a real-world application, there would be an extensive tree of calls displayed, showing where memory was allocated and how much (% of total and number of bytes). In this example, everything is done in main(), so there’s not much context to see. Now that we have this information, what do we want to do with it? Generally, two things: try to free memory earlier and try to find more efficient data structures.
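Taking the first of those to its limit: if each block is only needed transiently, a single reusable buffer caps the peak at one block rather than DATA_SIZE blocks. This sketch is my own (the processing with std::iota stands in for real work on the block), not part of the article’s test programs.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch: process data_size logical blocks while keeping only one
// block's worth of memory live at any time. With the article's sizes,
// the Massif peak drops from ~763MB to ~8MB.
long process_all_blocks(std::size_t data_size, std::size_t block_size)
{
    std::vector<long> block(block_size);  // allocated once, reused
    long total = 0;
    for (std::size_t i = 0; i < data_size; ++i) {
        // stand-in for "do something with data[i]"
        std::iota(block.begin(), block.end(), static_cast<long>(i));
        total += block.front() + block.back();
    }
    return total;
}   // the one buffer is freed here; nothing accumulates across iterations
```

Whether this is applicable depends, of course, on whether the algorithm really needs all blocks live at once; if it does, the remaining lever is a more compact representation per block.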
Now, let’s go back to the small example and make some changes to improve the memory use. (See Listing 2.)
// m2.cpp
#include <unistd.h>

const size_t DATA_SIZE = 100U;
const size_t BLOCK_SIZE = 1000000;

int main()
{
    long **data = new long *[DATA_SIZE];
    for (size_t i = 0; i < DATA_SIZE; ++i) {
        data[i] = new long[BLOCK_SIZE];
        // do something with data[i]
        usleep(100000);
        delete [] data[i];
        usleep(100000);
    }
    sleep(1);
    delete [] data;
}

Listing 2
Now the memory is freed straight after it is used, rather than all at once at the end of the application. The ASCII art graph now looks like Figure 7.
MB 7.634^#::@@::@::::::@:: @@:::::@@ : @::: ::::@:: :::::: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : |#: @ : @ : : :@:: @ ::: :@ : @:: ::: @: :: :: : : 0 +----------------------------------------------------------------------->s 0 21.15 |
Figure 7
Notice that the peaks where memory is allocated are not evenly spaced. This is an artefact of the sampling, and if your application does a lot of allocation and freeing, you may need to play with the --detailed-freq or --max-snapshots options.
As an alternative to the ASCII art, there is a separate tool, ‘massif-visualizer’ (Figure 8).
Figure 8
I installed massif-visualizer on Kubuntu with the package manager. It wasn’t available in the package manager of openSUSE, and I had problems with CMake dependencies when trying to build it from source. The GUI, on top of being a bit prettier than the ASCII drawings, allows you to interact with the snapshots. However, it does not show stack memory if you use the --stacks=yes option. This leads to an example that does use stack profiling. Listing 3 is the code (from Bertrand Meyer’s A Touch of Class, p. 477, translated from Eiffel to C++). This is a nice example of stack use that is hard to analyze.
// bizarre.cpp
#include <iostream>
#include <cassert>

long int bizarre(long int n)
{
    assert(n >= 1);
    if (n == 1) {
        return n;
    } else if (0L == n%2L) {
        return bizarre(n/2L);
    } else {
        return bizarre((3L*n + 1L)/2L);
    }
}

int main()
{
    for (long int i = 1L; i < 200000000L; ++i) {
        bizarre(i);
    }
}

Listing 3
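The stack use is hard to analyze precisely because the recursion depth varies chaotically with n. For comparison (my addition, not from the book), an iterative version computes the same result in constant stack space, so profiling it with --stacks=yes would show an essentially flat stack line:

```cpp
#include <cassert>

// Iterative equivalent of bizarre(): the same sequence of steps, but a
// loop instead of recursion, so the stack depth no longer depends on n.
// Returns 1 for every n >= 1, assuming the sequence terminates (it does
// for all values the original test loop tries).
long bizarre_iter(long n)
{
    assert(n >= 1);
    while (n != 1) {
        if (n % 2 == 0) {
            n = n / 2;
        } else {
            n = (3 * n + 1) / 2;
        }
    }
    return n;
}
```

The recursive form is the interesting one to profile, of course; the point of the comparison is that Massif’s stack profiling shows you cost that a rewrite like this would remove.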
Profiling this with valgrind --tool=massif --stacks=yes ./bizarre gives Figure 9.
KB 18.77^ # | # | # | # | # | # | @ # | @ # | @ # | @ # | @ # | @ # | @ # | @ # | @ # : : | @ # : ::: :: | @ # :: : @@ : ::: ::: | @ :: ::: ::: # : ::: :: : :: @@ ::: @ :: :::::::@:: | @ : :@@::: :: # ::: :: ::: : : : @ @: : @ :: : :::::@:@ | :@: : :@ ::: :: #:::: ::: ::: : ::: : :: :@ @: :::@ :: : :::::@:@ 0 +----------------------------------------------------------------------->Gi 0 548.2 |
Figure 9
In one last example, let’s see what happens if we use a low-level allocation function like mmap (Listing 4).
// m3.cpp
#include <unistd.h>
#include <sys/mman.h>

const size_t DATA_SIZE = 100U;
const size_t BLOCK_SIZE = 1000000;

int main()
{
    long **data = new long *[DATA_SIZE];
    for (size_t i = 0; i < DATA_SIZE; ++i) {
        data[i] = reinterpret_cast<long *>(
            mmap(NULL, BLOCK_SIZE*sizeof(long),
                 PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_ANONYMOUS, -1, 0));
        // do something with data[i]
        usleep(100000);
    }
    sleep(1);
    for (size_t i = 0; i < DATA_SIZE; ++i) {
        munmap(data[i], BLOCK_SIZE*sizeof(long));
    }
    delete [] data;
}

Listing 4
With this, all I see is the memory allocated for 'data'. (I haven’t included the graph here as it’s just an ASCII rectangle showing 808 bytes allocated and a time of 11.14s.) Where have the other 763MB gone? By default, Massif does not trace memory allocated by low-level functions like mmap. If you want to see that memory (which might be the case if you are using a custom allocator based on mmap), then you will need to add the --pages-as-heap=yes option. If I add this option to the above example, then I see all of the memory being allocated (in fact, 777.5MB, since it now includes memory allocated when loading the application).
Note that normally on Linux, glibc will use mmap to satisfy malloc requests above a certain threshold, but Valgrind intercepts the call to malloc, so there is no need to use --pages-as-heap unless you are calling mmap directly.
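To illustrate that threshold (a glibc-specific sketch, not something Massif requires): mallopt() lets you lower M_MMAP_THRESHOLD so that even modest allocations are served by mmap internally. Under Massif this makes no difference to the report, precisely because Valgrind replaces malloc itself.

```cpp
#include <malloc.h>   // mallopt, M_MMAP_THRESHOLD: glibc-specific, not portable
#include <cstdlib>

// Ask glibc to satisfy any allocation above 4KB via mmap internally.
// Massif still attributes the allocation to malloc either way; only
// direct calls to mmap (as in Listing 4) need --pages-as-heap=yes.
void *big_alloc(std::size_t bytes)
{
    mallopt(M_MMAP_THRESHOLD, 4096);
    return std::malloc(bytes);   // 1MB request would now use mmap internally
}
```

This only matters if you are curious about where the pages come from; for deciding what to free earlier, the default heap-level view is the right one.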
That just about wraps up this installment. As you can see, Massif is straightforward to use and it presents a simple view of the memory use of your application.
In my next article, I’ll cover two Valgrind tools for detecting thread hazards, Helgrind and DRD.