A few weeks ago I posted a series of articles explaining the ideas and goals behind memory contexts, the basics of their implementation in PostgreSQL, and some common issues you may run into when using them.
I mentioned there are two main reasons why memory contexts were introduced. First, to simplify memory management - tracking the life cycle of all the allocated pieces, making it less burdensome for developers and preventing some of the usual memory leak scenarios.
The second goal is (of course) improving performance, because everyone knows (or rather believes) that generic allocators (the malloc/free implementations provided by your OS - kernel/libc/...) are slow. But is that still true?
After all, there are papers like Reconsidering Custom Memory Allocation essentially demonstrating that most custom allocators don't really perform any better than a generic one (although they claim that allocators based on regions - aka blocks - are a notable exception).
Also, we're getting a new kernel version every few months, presumably with improvements in memory management too. So maybe there have been some significant improvements rendering the additional complexity of a custom allocator pointless?
We can't just zap the memory contexts completely, as we'd lose the first benefit (tracking the allocated pieces). But maybe we could change the implementation so that it really is just a thin wrapper around malloc/free with simple tracking ... so let's try that.
As explained here, the default implementation is based on blocks, which are then sliced into chunks in response to palloc() calls. All this happens in PostgreSQL code, and the consequence is that it's impossible to free the chunks directly (by calling free()), because it's the whole block that gets allocated by malloc().
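To make this concrete, here's a minimal standalone sketch of the block/chunk relationship (not actual PostgreSQL code, just an illustration of why free() can't be used on individual chunks):

```c
#include <stdlib.h>

#define BLOCK_SIZE 8192

/* a block obtained from malloc(), carved into chunks by the allocator */
typedef struct Block
{
    char   *freeptr;            /* first unused byte in the block */
    char    data[BLOCK_SIZE];   /* space handed out as chunks */
} Block;

int
main(void)
{
    Block  *block = malloc(sizeof(Block));
    void   *chunk;

    block->freeptr = block->data;

    /* "palloc" a 64-byte chunk - just pointer arithmetic, no malloc() */
    chunk = block->freeptr;
    block->freeptr += 64;

    /*
     * free(chunk) would be undefined behavior: chunk points into the middle
     * of the block, and only the block pointer came from malloc().
     */
    (void) chunk;
    free(block);
    return 0;
}
```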
So let's rip out all the block-related stuff, leaving us with contexts looking like this:
```c
typedef struct AllocSetContext
{
    MemoryContextData header;   /* Standard memory-context fields */
    AllocChunk  chunks;         /* doubly-linked list of chunks */
} AllocSetContext;
```
Compare this to the original structure:
```c
typedef struct AllocSetContext
{
    MemoryContextData header;   /* Standard memory-context fields */

    /* Info about storage allocated in this context: */
    AllocBlock  blocks;         /* head of list of blocks in this set */
    AllocChunk  freelist[ALLOCSET_NUM_FREELISTS];   /* free chunk lists */

    /* Allocation parameters for this context: */
    Size        initBlockSize;  /* initial block size */
    Size        maxBlockSize;   /* maximum block size */
    Size        nextBlockSize;  /* next block size to allocate */
    Size        allocChunkLimit;    /* effective chunk size limit */
    AllocBlock  keeper;         /* if not NULL, keep this block over resets */
} AllocSetContext;
```
Similarly, it's necessary to modify the "chunk header" by adding pointers to the previous/next chunk in the list (note that the fields need to be added at the beginning, so as not to interfere with the standard chunk header defined in memutils.h):
```c
typedef struct AllocChunkData
{
    /* doubly-linked list (needs to be before the standard chunk header) */
    AllocChunk  prev;           /* previous item in the list (NULL -> head) */
    AllocChunk  next;           /* next item in the list (NULL -> last) */

    /* aset is the owning aset if allocated, or the freelist link if free */
    void       *aset;
    /* size is always the size of the usable space in the chunk */
    Size        size;
#ifdef MEMORY_CONTEXT_CHECKING
    /* when debugging memory usage, also store actual requested size */
    /* this is zero in a free chunk */
    Size        requested_size;
#endif
} AllocChunkData;
```
I could go over the various methods (AllocSetAlloc, AllocSetFree, AllocSetRealloc,
...) and explain the changes in each of them, but that'd be quite a waste of time.
The methods are rather simple, basically just wrappers around malloc(), free()
and so on, with a bit of code to update the list of chunks.
For example AllocSetAlloc allocates the chunk, adds it to the list, and that's all.
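To give an idea what that looks like, here's a minimal standalone sketch of the same approach (a simplified illustration, not the actual patch - it ignores the memory-context machinery, error handling and alignment):

```c
#include <stdlib.h>

/* per-chunk header, linking the chunk into the context's list */
typedef struct Chunk
{
    struct Chunk *prev;         /* previous chunk in the context (NULL = head) */
    struct Chunk *next;         /* next chunk in the context (NULL = last) */
    size_t        size;         /* usable size of the chunk */
} Chunk;

typedef struct Context
{
    Chunk *chunks;              /* head of the doubly-linked list of chunks */
} Context;

/* allocate a chunk with malloc() and add it to the context's list */
static void *
context_alloc(Context *cxt, size_t size)
{
    Chunk *chunk = malloc(sizeof(Chunk) + size);

    if (chunk == NULL)
        return NULL;

    chunk->size = size;
    chunk->prev = NULL;
    chunk->next = cxt->chunks;
    if (cxt->chunks != NULL)
        cxt->chunks->prev = chunk;
    cxt->chunks = chunk;

    return (char *) chunk + sizeof(Chunk);  /* pointer just past the header */
}

/* unlink a chunk from the list and return it to libc */
static void
context_free(Context *cxt, void *ptr)
{
    Chunk *chunk = (Chunk *) ((char *) ptr - sizeof(Chunk));

    if (chunk->prev != NULL)
        chunk->prev->next = chunk->next;
    else
        cxt->chunks = chunk->next;
    if (chunk->next != NULL)
        chunk->next->prev = chunk->prev;

    free(chunk);
}

/* free everything still allocated in the context (the "reset" operation) */
static void
context_reset(Context *cxt)
{
    while (cxt->chunks != NULL)
        context_free(cxt, (char *) cxt->chunks + sizeof(Chunk));
}

int
main(void)
{
    Context cxt = {NULL};
    char   *a = context_alloc(&cxt, 64);
    char   *b = context_alloc(&cxt, 128);

    (void) b;                   /* b is released by the reset below */

    context_free(&cxt, a);      /* individual chunks can be freed directly now */
    context_reset(&cxt);        /* ... and whatever is left goes away in one pass */
    return 0;
}
```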
I'll just give you the patch, so that you can see for yourself if interested, and present some numbers comparing the performance to the current master.
The standard benchmark in the PostgreSQL world is pgbench, of course. After
initializing a database with scale 300 (~4.5 GB), which neatly fits into RAM, I did
a bunch of short read-only runs with 8 clients (on an 8-core machine) like this:
pgbench -S -c 8 -j 8 -T 60 test
With and without the patch, of course. The numbers were pretty stable (mostly within 1%), and the averages look like this:
| - | tps |
|---|---|
| master | 85809 |
| simple-context | 70464 |
That's ~20% less compared to the default memory context - a pretty significant
difference. And perf top reveals that this really is mostly due to malloc,
free and related functions. Apparently, a custom allocator still makes a big difference.
Let's see another benchmark, just to be sure - reindexing is an allocation-intensive workload too, so let's do this (on the same database):
psql test -c "set trace_sort=on; reindex index pgbench_accounts_pkey;"
If you have maintenance_work_mem set to a sufficiently large value, e.g. 4GB, this will do an in-memory sort. Thanks to trace_sort=on you should see something like this in the log (for master):
```
LOG: internal sort ended, 25 KB used: CPU 0.97s/3.40u sec elapsed 5.08 sec
LOG: internal sort ended, 1723933 KB used: CPU 1.48s/4.20u sec elapsed 10.70 sec
```
and
```
LOG: internal sort ended, 25 KB used: CPU 1.35s/3.98u sec elapsed 5.46 sec
LOG: internal sort ended, 2192683 KB used: CPU 1.78s/4.83u sec elapsed 10.89 sec
```
for the modified allocator. And again, after averaging the results (this time it's slightly more volatile than in the pgbench case), you'll get something like this:
| - | small (sec) | large (sec) |
|---|---|---|
| master | 4.97 | 10.46 |
| simple-context | 5.44 | 10.87 |
Again, there's a performance difference, although much smaller (~5% for the "large" sort).
Apparently, malloc() is still not as fast as the custom allocator (at least for this particular workload). But maybe we can do better, without giving up the ability to free the individual chunks (i.e. without reintroducing the blocks), by reintroducing a freelist?
Note: You may also notice that the amount of memory needed for the sort increased by about 30% - originally it needed 1723933 KB, now it needs 2192683 KB. This is most likely due to adding the prev/next pointers (2x 8B) to the chunk header. The difference is exactly 30,000,000 x 16B, which perfectly matches the pgbench scale used (300). Apparently this reindex allocates a lot of small chunks (it's rebuilding an index on an INTEGER column, so that's not entirely surprising).
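For the record, the arithmetic behind that claim:

2192683 KB - 1723933 KB = 468750 KB = 480,000,000 B = 30,000,000 x 16 B

i.e. presumably one extra 16B chunk header for each of the 30 million rows in pgbench_accounts at scale 300.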
One of the things I've removed from the AllocSetContext structure was a freelist.
I could simply reintroduce it at the same place, but hey - we can do something different
and quite interesting now!
As the chunks are not "attached" to the context through a block anymore, we can "detach" them and move them to a global freelist. And we can also limit the size of the freelist (which was not possible before - an unbounded freelist can cause excessive memory usage and even OOM issues with some of the usage patterns described here).
And that's exactly what I did in the second patch, which introduces a global freelist:
```c
/* less than 8MB of memory in the whole freelist */
#define MAX_FREELIST_ITEMS  512

/*
 * Global freelist
 *
 * Each freelist entry contains a singly-linked list of chunks, and a
 * counter of chunks for each freelist - to limit the amount of 'free'
 * memory (to prevent extensive memory usage).
 */
typedef struct FreeListData
{
    AllocChunk  chunks;
    int64       count;
} FreeListData;

/* the global freelist */
static FreeListData freelist[ALLOCSET_NUM_FREELISTS] =
    {{NULL, 0}, {NULL, 0}, {NULL, 0}, {NULL, 0}, {NULL, 0}, {NULL, 0},
     {NULL, 0}, {NULL, 0}, {NULL, 0}, {NULL, 0}, {NULL, 0}};
```
Again, I'll not explain all the implementation details here - see the code, if interested. It's quite straightforward, adding a bit of freelist management where needed.
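For illustration, the freelist handling boils down to a pattern like this (a standalone simplified sketch, not the actual patch - the real code works with the AllocChunk headers and size classes from aset.c, and requests above 8kB are not handled here):

```c
#include <stdio.h>
#include <stdlib.h>

#define NUM_FREELISTS       11      /* size classes 8B .. 8kB */
#define MAX_FREELIST_ITEMS  512     /* cap on chunks kept per size class */
#define MIN_CHUNK_LOG2      3       /* smallest class is 2^3 = 8 bytes */

typedef struct Chunk
{
    struct Chunk *next;             /* next chunk in the freelist */
    size_t        size;             /* rounded-up (power-of-2) chunk size */
} Chunk;

typedef struct FreeList
{
    Chunk  *chunks;                 /* singly-linked list of free chunks */
    int     count;                  /* number of chunks currently stashed */
} FreeList;

static FreeList freelist[NUM_FREELISTS];

/* map a request size (> 0, <= 8kB) to its power-of-two size class index */
static int
size_class(size_t size)
{
    int     idx = 0;

    size = (size - 1) >> MIN_CHUNK_LOG2;
    while (size > 0)
    {
        idx++;
        size >>= 1;
    }
    return idx;
}

/* allocate a chunk, reusing one from the freelist when possible */
static void *
freelist_alloc(size_t size)
{
    int     idx = size_class(size);
    Chunk  *chunk = freelist[idx].chunks;

    if (chunk != NULL)
    {
        /* pop a recycled chunk of the right size class */
        freelist[idx].chunks = chunk->next;
        freelist[idx].count--;
    }
    else
    {
        /* nothing to reuse, fall back to malloc() */
        chunk = malloc(sizeof(Chunk) + ((size_t) 1 << (idx + MIN_CHUNK_LOG2)));
        if (chunk == NULL)
            return NULL;
        chunk->size = (size_t) 1 << (idx + MIN_CHUNK_LOG2);
    }
    return (char *) chunk + sizeof(Chunk);
}

/* free a chunk - stash it in the freelist unless the size class is full */
static void
freelist_free(void *ptr)
{
    Chunk  *chunk = (Chunk *) ((char *) ptr - sizeof(Chunk));
    int     idx = size_class(chunk->size);

    if (freelist[idx].count < MAX_FREELIST_ITEMS)
    {
        chunk->next = freelist[idx].chunks;
        freelist[idx].chunks = chunk;
        freelist[idx].count++;
    }
    else
        free(chunk);
}

int
main(void)
{
    void   *a = freelist_alloc(100);    /* malloc()ed, rounded up to 128B */
    void   *b;

    freelist_free(a);                   /* stashed in the freelist, not freed */
    b = freelist_alloc(100);            /* reuses the stashed chunk */
    printf("reused: %s\n", (a == b) ? "yes" : "no");
    return 0;
}
```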
Let's see how this changes the benchmark results. First, the pgbench:
| - | tps |
|---|---|
| master | 85809 |
| simple-context | 70464 |
| simple-with-freelist | 84971 |
Not bad - we're back at 99% of the original performance (still, there's a consistent gap of ~1%).
And the reindex benchmark:
| - | small (sec) | large (sec) |
|---|---|---|
| master | 4.97 | 10.46 |
| simple-context | 5.44 | 10.87 |
| simple-with-freelist | 4.99 | 10.52 |
In this case, the difference is pretty much gone.
Clearly, the custom allocator still makes sense, performing consistently better than the
malloc-oriented allocator presented here. There might be some other better-performing
custom allocator, but using malloc directly is not a viable choice.
Note: All this was tested on a Gentoo machine with glibc 2.19 (2.19-r1) and kernel 3.16.1. I'm not sure about the other platforms supported by PostgreSQL - it might be that some other *nix platform uses a much better malloc implementation or something like that.
Feel free to comment.
The custom allocator with a global freelist seems quite interesting though - it's still not beating the block-based allocator, but it's not much slower either. And it has the nice property that it keeps only a limited number of freed chunks, so maybe it could be a good choice for cases like the array_agg issue.
The problem described in that bug report has two causes. First, each array_agg group creates a private memory context, and an allocation set always creates at least a 1kB block. So if the groups are tiny, this is very wasteful. Second, array_agg is a perfect example of a pattern with constantly growing requests - whenever you add another element to the array (and it happens to exceed the existing buffer), you've just wasted that chunk, because it can't be freed or reused for any future requests (as the context is private).
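A back-of-the-envelope illustration of that second point (assuming a hypothetical doubling growth strategy - the actual array_agg growth may differ):

```c
/*
 * Every time the array outgrows its buffer, a bigger chunk is allocated
 * and the old one stays stranded in the group's private context.
 */
#include <stdio.h>

int
main(void)
{
    long    used = 8;           /* current buffer size in bytes */
    long    wasted = 0;         /* bytes stranded in outgrown buffers */

    while (used < 1024 * 1024)  /* grow the buffer up to 1MB */
    {
        wasted += used;         /* the old buffer cannot be reused */
        used *= 2;              /* hypothetical doubling growth */
    }
    printf("final buffer %ld bytes, stranded %ld bytes\n", used, wasted);
    return 0;
}
```

With doubling growth the stranded chunks add up to almost as much memory as the final buffer itself, and none of it can be reused by the other groups.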
Maybe a custom memory context could be a good solution to cases like this - probably not as efficient as using a shared context for all the groups, but maybe easier to use.
For example with the simple context, I'm able to run the queries mentioned in the array_agg bug report. It's still consuming a few gigabytes of memory (which is much more than should be necessary), but it's much better than the tens of gigabytes consumed before.
Copyright © 2014 Tomas Vondra