How much benefit do we get from Allocation set?

2014/11/24 by Tomas Vondra

A few weeks ago I posted a series of articles explaining the ideas and goals behind memory contexts, basics of implementation in PostgreSQL, and even some common issues you may run into when using the memory contexts.

I mentioned there are two main reasons why memory contexts are introduced. First, to simplify the memory management - tracking life cycle of all the allocated pieces, making it less burdensome for the developers and also preventing some usual memory leak scenarios etc.

The second goal is (of course) improving performance. Because everyone knows (or rather believes) that the generic allocators (malloc/free implementations provided by your OS - kernel/libc/...) are slow. But is that still true?

After all, there are papers like Reconsidering Custom Memory Allocation essentially demonstrating that most custom allocators don't really perform any better than the a generic one (although they claim that allocators based on regions - aka blocks - are a notable exception).

Also, we're getting a new kernel version every few months, presumably getting improvements in the memory management too. So maybe there were some significant improvements rendering the additional complexity of custom allocator pointless?

We can't just zap the memory contexts completely, as we'd loose the first benefit (tracking the allocated pieces). But maybe we could change the implementation so that it really is just a thin wrapper around malloc/free with a simple trackinng ... so let's try that.

As explained here, the default implementation is based on blocks, that are then slided into chunks in response to palloc() calls. All this happens in PostgreSQL code, and the consequence is that it's impossible to free the chunks directly (by calling free()) because it's the whole block that gets allocated by malloc().

So let's rip out all the block-related stuff, leaving us with contexts looking like this:

typedef struct AllocSetContext
    MemoryContextData header;   /* Standard memory-context fields */
    AllocChunk chunks;          /* doubly-linked list of chunks */
} AllocSetContext;

Compare this to the original structure:

typedef struct AllocSetContext
    MemoryContextData header; /* Standard memory-context fields */

    /* Info about storage allocated in this context: */
    AllocBlock blocks; /* head of list of blocks in this set */
    AllocChunk freelist[ALLOCSET_NUM_FREELISTS]; /* free chunk lists */

    /* Allocation parameters for this context: */
    Size    initBlockSize; /* initial block size */
    Size    maxBlockSize; /* maximum block size */
    Size    nextBlockSize; /* next block size to allocate */
    Size    allocChunkLimit; /* effective chunk size limit */

    AllocBlock keeper; /* if not NULL, keep this block over resets */

} AllocSetContext;

Similarly, it's necessary to modify the "chunk header" by adding pointers to the previous/next chunk in the list (note that the fields need to be added to the beginning, not to interfere with the standard chunk header defined in memutils.h:

typedef struct AllocChunkData
    /* doubly-linked list (needs to be before the standarch chunk header) */
    AllocChunk  prev;   /* previous item in the list (NULL -> head) */
    AllocChunk  next;   /* next item in the list (NULL -> last) */

    /* aset is the owning aset if allocated, or the freelist link if free */
    void    *aset;
    /* size is always the size of the usable space in the chunk */
    Size    size;
    /* when debugging memory usage, also store actual requested size */
    /* this is zero in a free chunk */
    Size    requested_size;
} AllocChunkData;

I could go over the various methods (AllocSetAlloc, AllocSetFree, AllocSetRealloc, ...) and explain the changes in each of them, but that'd be quite a waste of time. The methods are rather simple, basically just wrappers around malloc(), free() and so on, with a bit of code to update the list of chunks.

For example AllocSetAlloc allocates the chunk, adds it to the list, and that's all.

I'll just give you the patch, so that you can see for yourself if interested, and present you some numbers comparing the performance to the current master.

The standard benchmark in the PostgreSQL world is pgbench, of course. After initializing a database with scale 300 (~4.5 GB), which neatly fits into RAM, I did a bunch of short read-only runs with 8 clients (on a 8-core machine) like this:

pgbench -S -c 8 -j 8 -T 60 test

With and without the patch, of course. The numbers were pretty stable (mostly within 1%), and the averages look like this:

- tps
master 85809
simple-context 70464

That's ~20% less compared to the default memory context - a pretty significant difference. And perf top reveals that this really is mostly due to malloc, free and related functions. Apparently, custom allocator still makes a big difference.

Let's see another benchmark, just to be sure - reindexing is allocation-intensive workload too, so let's do this (on the same database):

psql test -c "set trace_sort=on; reindex index pgbench_accounts_pkey;"

If you have maintenance_work_mem set to a sufficiently large value, e.g. 4GB, this will do in-memory sort. Thanks to the trace_sort=on you should get se something like this in the log (for the master):

LOG:  internal sort ended, 25 KB used: CPU 0.97s/3.40u sec elapsed 5.08 sec
LOG:  internal sort ended, 1723933 KB used: CPU 1.48s/4.20u sec elapsed 10.70 sec


LOG:  internal sort ended, 25 KB used: CPU 1.35s/3.98u sec elapsed 5.46 sec
LOG:  internal sort ended, 2192683 KB used: CPU 1.78s/4.83u sec elapsed 10.89 sec

for the modified allocator. And again, after averaging the results (this time it's slightly more volatile than in the pgbench case), you'll get something like this:

- small large
master 4.97 10.46
simple-context 5.44 10.87

Again, there's a performance difference, although much smaller (~5% for the "large" sort).

Apparently, the malloc() is still not as fast as the custom allocator (at least for this particular workload). But maybe we can do better, without giving up the ability to free the individual chunks (i.e. without reintroducing the blocks) by reintroducing a freelist?

Note: You can also notice that the amount of memory needed for the sort increased by about 30% - while originally it needed 1723933 KB, not it needs 2192683 KB. This is most likely due to adding the prev/next pointers (2x 8B) to the chunk header. The difference is exactly 30.000.000 x 16B, which perfectly matches the pgbench scale used (300). Apparently this reindex allocates a lot of small chunks (it's rebuilding an index on an INTEGER column, so that's not entirely surprising).

Global freelist

One of the things I've removed from the AllocSetContext structure was a freelist. I could simply reintroduce it at the same place, but hey - we can do something different and quite interesting now!

As the chunks are not "attached" to the context through a block anymore, we can "detach" them and move them to a global freelist. And we can also limit the size of the freelist (which was not possible before, and might cause excessive memory usage and even OOM issues with some usage patterns described here).

And that's exactly what I did in the second patch, which introduces a global freelist

/* less than 8MB of memory in the whole freelist */

 * Global freelist
 * Each freelist entry contains a singly-linked list of chunks, and a
 * counter of chunks for each freelist - to limit the amount of 'free'
 * memory (to prevent extensive memory usage).
typedef struct FreeListData
    AllocChunk  chunks;
    int64       count;
} FreeListData;

/* the global freelist */
static FreeListData freelist[ALLOCSET_NUM_FREELISTS]
                = {{NULL, 0}, {NULL, 0}, {NULL, 0}, {NULL, 0}, {NULL, 0},
                {NULL, 0}, {NULL, 0}, {NULL, 0}, {NULL, 0}, {NULL, 0},
                {NULL, 0}};

Again, I'll not explain all the implementation details here - see the code, if interested. It's quite straightforward, adding a bit of freelist management where needed.

Let's see how this changes the benchmark results. First, the pgbench:

- tps
master 85809
simple-context 70464
simple-with-freelist 84971

Not bad - we're back at 99% of the original performance (still, the gap is quite consistent ~1%).

And the reindex benchmark:

- small large
master 4.97 10.46
simple-context 5.44 10.87
simple-with-freelist 4.99 10.52

In this case, the difference is pretty much gone.


Clearly, the custom allocator still makes sense, performing consistently better than the malloc-oriented allocator presented here. There might be some other better-performing custom allocator, but using malloc directly is not a viable choice.

Note: BTW all this was tested on a Gentoo machine with glibc 2.19 (2.19-r1) and kernel 3.16.1. I'm not sure about the other platforms supported by PostgreSQL - it might be that some other *nix platform uses much better malloc implementation or something like that. Feel free to comment.

The custom allocator with global freelist seems quite interesting though - it's still not beating the block-based allocator, but it's not much slower either. And it has the nice feature that it keeps only limited amount of freed chunks, so maybe it could be a good choice for cases like the array_agg issue.

The problem described in that bug report has two causes. First, each array_agg group creates a private memory context, and allocation set always creates at least 1kB block. So if the groups are tiny, this is very wasteful. Second, the array_agg is a perfect example of pattern with constantly growing requests - whenever you add another element into the array (and it happens to exceed the existing buffer), you just wasted that chunk because it can't be freed or reused for any future requests (as the context is private).

Maybe a custom memory context could be a good solution to cases like this - probably not as efficient as using a shared context for all the groups, but maybe easier to use.

For example with the simple context, I'm able to run the queries mentioned in the array_agg bug report. It's still consuming a few gigabytes of memory (which is much more than should be necessary), but it's much better than the tens of gigabytes consumed before.

comments powered by Disqus