count_distinct - improved memory usage


Last week, I briefly explained the recent improvements in count_distinct, a custom alternative to the native COUNT(DISTINCT) aggregate. I presented some basic performance comparisons illustrating that count_distinct is in most cases significantly faster than the native COUNT(DISTINCT). However, performance was not the main goal of the changes, and the improvement in this respect is not that big (a few percent, maybe, which is negligible compared to the ~3x speedup over COUNT(DISTINCT)). The main issue was excessive memory consumption, and I promised to present the improvements. So here we go!

If I had to show you a single chart illustrating the improvements, it'd be this one. It compares the amount of memory necessary to aggregate the large tables from the previous post (100M rows). Remember, the lower the values the better, and red is the new implementation:

memory-large.png

While the old implementation used 1.5-2GB in most cases, the new implementation needs just ~600MB. I don't want to brag, but that's a pretty significant improvement (although you could also say I messed up the initial implementation and have now merely fixed it).


So, why did the hash table suck so much?

Clearly, the original count_distinct implementation was quite bad at dealing with memory, and there are three reasons why. If you're writing PostgreSQL extensions in C, these might give you a few hints for improvements.

Simple hash table implementation

First, I used a very simple hash table implementation, resulting in unnecessary overhead for each value - you need the buckets, a pointer to the next item in the same bucket, etc. This could be lowered a bit by using a more elaborate hash table implementation (e.g. open addressing), but either the gains were not as significant as I expected, or the resulting implementation was more complex than I wished for, or slower.

After a lot of experiments, I switched to this simple array-based structure:

typedef struct element_set_t {

    uint32 item_size; /* length of the value (depends on the actual data type) */
    uint32 nsorted;   /* number of items in the sorted part (distinct) */
    uint32 nall;      /* number of all items (unsorted part may contain duplicates) */
    uint32 nbytes;    /* number of bytes in the data array */

    /* aggregation memory context (reference, so we don't need to do lookups repeatedly) */
    MemoryContext aggctx;

    /* elements */
    char * data;      /* nsorted items first, then (nall - nsorted) unsorted items */

} element_set_t;

And this resulted in pretty much zero per-item overhead. Yes, the array will always be a bit larger than necessary, but on average the overhead is going to be much lower than with any hash table implementation.

Consider that if for each 32-bit item you have to keep a 64-bit pointer to the next item, that's twice the size of the data itself - a pretty significant overhead. The hash table I was using was not really this dumb (it used batches of items, ...), but you get the idea.
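To illustrate the idea, here's a minimal standalone sketch of such an array-based structure, using plain malloc and fixed 32-bit values in place of palloc and the variable item_size - the names and details are illustrative, not the extension's actual code:

```c
#include <stdint.h>
#include <stdlib.h>

/* simplified variant of the element_set_t idea, fixed 4B items */
typedef struct int_set_t {
    uint32_t  nsorted;  /* distinct items in the sorted prefix */
    uint32_t  nall;     /* all items (tail may contain duplicates) */
    uint32_t  maxitems; /* allocated capacity of the array */
    uint32_t *data;     /* nsorted sorted items, then unsorted tail */
} int_set_t;

static int cmp_uint32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *) a;
    uint32_t y = *(const uint32_t *) b;

    return (x > y) - (x < y);
}

/* sort the whole array and squash duplicates, making everything
 * part of the sorted (distinct) prefix again */
static void compact(int_set_t *set)
{
    uint32_t i, n = 0;

    qsort(set->data, set->nall, sizeof(uint32_t), cmp_uint32);

    for (i = 0; i < set->nall; i++)
        if (n == 0 || set->data[n - 1] != set->data[i])
            set->data[n++] = set->data[i];

    set->nsorted = set->nall = n;
}

/* append a value; when the array fills up, compact it first and
 * double the capacity only if compaction freed no space */
static void add_item(int_set_t *set, uint32_t value)
{
    if (set->nall == set->maxitems)
    {
        compact(set);

        if (set->nall == set->maxitems)
        {
            set->maxitems *= 2; /* keep sizes a power of two */
            set->data = realloc(set->data,
                                set->maxitems * sizeof(uint32_t));
        }
    }

    set->data[set->nall++] = value;
}
```

Counting the distinct values is then trivial - compact once more at the end and read nsorted. The real extension is smarter about merging the already-sorted prefix instead of re-sorting everything, but the principle is the same.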

Allocating tiny pieces of memory

Second, I was using palloc to allocate very small pieces of memory (just a few bytes at a time). That, however, results in significant hidden overhead, because palloc prepends each allocated piece with a ~16B header. So calling palloc(50) actually consumes 66 bytes of memory - a 16B header, followed by the 50B you requested.

This header contains information associating the allocated piece with a MemoryContext, a concept used by PostgreSQL to simplify memory management. I'm not going to explain the details of how MemoryContext works (read the comments in palloc.h), but if you've ever had to deal with a complex C codebase, you probably know how tedious and error-prone memory management can be. Tracking every little piece of memory as it is passed from function to function is no fun; the MemoryContext concept makes this much easier, and the palloc overhead is the price for that.

But there are ways to lower the cost - the most obvious one is allocating larger pieces of memory. You'll still pay for the 16B header, but the relative overhead will be much lower (16B on a 50B allocation is ~30%, on a 500B allocation it's only ~3%). And that's exactly what the array-based implementation does: it allocates a single array as a whole, not a tiny piece of memory for each separate item.

Memory context internals

And the third reason is really about memory context internals. The truth is that memory contexts don't internally work with pieces of arbitrary size - they work with sizes that are powers of 2. So if you do palloc(10), the implementation really does palloc(16), because 16 is the nearest larger power of 2. Similarly for higher values - palloc(1100) allocates the same amount of memory as palloc(2048), and so on.

So how can you use this to your benefit? Only request sizes that are powers of two, matching the memory context internals perfectly, because that minimizes the amount of allocated memory you can't actually use. And again, this is exactly what the new implementation does.
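The rounding behavior can be illustrated with a tiny standalone function (a simplification - the real allocator also treats chunks above a size threshold differently):

```c
#include <stdint.h>

/* round a request up to the next power of two - this mimics what
 * memory contexts do internally for small chunks, so any request
 * between 1025 and 2048 bytes consumes a full 2048B chunk */
static uint32_t next_pow2(uint32_t n)
{
    uint32_t p = 1;

    while (p < n)
        p <<= 1;

    return p;
}
```

So palloc(10) ends up consuming a 16B chunk and palloc(1100) a 2048B one. Growing the array by doubling means every request is itself a power of two, so no chunk space is wasted.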

Memory consumption comparison

So, let's see the memory consumption of the new (sorted array) implementation, compared to the old (hash table) one. I'll use the same datasets and queries as in the previous post. The consumed memory was measured by placing a call to MemoryContextStats at an appropriate place in the code, so it includes everything - the palloc overhead, the overhead of the structures, unused space in the array, and of course the overhead of the HashAggregate itself (dynahash, palloc, ...).

The hash-table and sorted-array columns denote the old and new implementations, respectively. I've omitted the native aggregation, because it always leads to a GroupAggregate plan, keeping a single group in memory (making the memory consumption mostly irrelevant). The cases with the largest number of groups are omitted for the same reason (see "Aggregates vs. sort" in the previous post for an explanation).

The data column tracks how much actual data needs to be kept in memory (groups x items per group x 4B).

Small dataset (100k rows)

memory-small.png

 groups   items per group   data     hash-table   sorted-array
      1            100000   390 kB      8440 kB         568 kB
     10             10000   390 kB      4249 kB         889 kB
    100              1000   390 kB      4088 kB        1016 kB
  10000                10   390 kB      8184 kB        4088 kB
    992               100   387 kB      2040 kB        1016 kB
    992                 9    35 kB      1016 kB         248 kB
    101               640   258 kB      2040 kB         504 kB

Medium dataset (10M rows)

memory-medium.png

 groups   items per group   data    hash-table   sorted-array
      1          10000000   39 MB       260 MB          64 MB
     10           1000000   39 MB       218 MB          44 MB
    100            100000   39 MB       353 MB          52 MB
  10000              1000   39 MB       224 MB          88 MB
   9992              1000   39 MB       177 MB          63 MB
   9992                80    3 MB        24 MB           8 MB
  10000               644   25 MB       104 MB          48 MB

Large dataset (100M rows)

memory-large.png

 groups   items per group   data     hash-table   sorted-array
      1         100000000   381 MB      2068 MB         512 MB
     10          10000000   381 MB      2384 MB         640 MB
    100           1000000   381 MB      2116 MB         406 MB
  10000             10000   381 MB      2557 MB         786 MB
   9999             10000   381 MB      1755 MB         589 MB
   9999                96     4 MB        32 MB           8 MB
 100000               644   245 MB      1040 MB         408 MB

Summary

It's immediately clear that the new implementation uses only a fraction (usually 10-25%) of the memory needed by the hash table implementation. So either the old implementation was really bad, or reducing the overhead really helped a lot. Most likely it's a combination of both.

The most important lesson learned for me is that sometimes we concentrate too much on optimizing the current approach (an implementation based on a hash table), when it'd actually be better to re-evaluate the assumptions and try a completely different approach. Hash tables are great for lookups, but maybe not the best tool when you need to remove duplicate values from a set.

It's also clear that the new implementation still uses about ~2x the amount of memory required by the data itself. I believe that's pretty much the minimum achievable overhead while still keeping the performance. First, doubling the array size pretty much means ~50% overhead on average (right before doubling it's 0%, right after it's 100%). Also, there's additional overhead outside this extension - the Hash Aggregate needs to keep its own hash table and some other state, etc.

There are certainly ways to improve Hash Aggregate in general - for example, in the discussion below the previous post, Noah Misch asked whether it was possible to limit the memory consumption by spilling the groups to disk. It's certainly possible, but not from the extension, because the extension only sees isolated groups - it's pretty much impossible to make good decisions about when to spill to disk, what to spill, etc. This needs to be done at the Hash Aggregate level, and by coincidence this is currently being discussed on pgsql-hackers. It's quite a complex change and there's a lot of work to be done (so don't hold your breath).


Published on 2014/08/25
