count_distinct improvements


2014/08/19 by Tomas Vondra

Almost a year ago, I wrote a custom experimental aggregate replacing COUNT(DISTINCT). The problem with the native COUNT(DISTINCT) is that it forces a sort on the input relation, and when the amount of data is significant (say, tens of millions rows), that may be a significant performance drag. And sometimes we really need to do COUNT(DISTINCT).

Extensibility is one of the great strengths of PostgreSQL - users of most other databases can only dream about things like defining custom aggregates. And this extension point was exactly the foundation for my idea was - implementing a custom aggregate, counting the distinct items in a more efficient way.

That's how count_distinct was conceived, and the principle was really simple. For each group, a small hash table is maintained, making it trivial to keep track of distinct values (and counting them). And it worked quite well - instead of the COUNT(DISTINCT) query

SELECT COUNT(DISTINCT column) FROM table;

you can call the customa aggregate

SELECT COUNT_DISTINCT(column) FROM table;

and in most cases it was much faster (usually ~3x), without doing the sort, etc. The main disadvantage was memory consumption - the overhead of the additional hash table structure and palloc overhead was bad (and in some cases quite terrible - consuming an order of magnitude more memory than the amount of data being processed). I experimented with various hash table variants, allocation schemes, but either the impact on performance was unacceptable, or the memory consumption was not much lower. Until about a month ago ...


Luckily, I've been messing with the hash join implementation recently, and it occured to me that a hash table may not be the right tool for this task. Hash tables are great for lookups, but that's not really the point of count_distinct - it only needs to detect and remove duplicates, and lookups are only one of the ways to achieve that goal.

Also, as the hash table grows, the L2/L3 cache hit ratio gets worse and worse. And after a bit of experimenting, I managed to come up with an implementation that uses a simple sorted array, is actually slightly faster than the hash table version and uses only a fraction of memory required by the initial implementation.

How it works

The original implementation (up until commit dffc0263) uses a very simple hash table implementation. Eseentially an array of buckets, each pointing to a list of items. When a new value is passed to the group, it's checked whether the value is in the hash table, and if it's a new element, it's added. As the number of items grows, the hash table is occasionally resized to keep good lookup performance. Nothing really surprising here.

The new implementation scraps all of this, and replaces the hash table with a simple array, split into three sections:

  • sorted items
  • unsorted items
  • free space

Adding a value to the group is very simple:

  1. if there's enough free space, just put the item there and you're done
  2. otherwise (if there's not enough free space) perform a "compaction" of the data
    • sort the 'unsorted' part (e.g. using qsort)
    • remove duplicates from the 'unsorted' part (a single pass through the now-sorted array)
    • merge the two (now sorted) pieces into a single sorted array (merge sort)
  3. if only a small fraction of the space is freed, double the array size (to prevent frequent compactions)

The implementation is actually much shorter compared to the hash table version, and there's pretty much no memory overhead (sure, there's some free space in the array, but no palloc overhead, no additional pointers ...). Also, the compaction happens only very rarely.

I have already released it at PGXN, so doing pgxnclient install count_distinct gives you the new implementation (version 1.2.1).

Performance comparison

To evaluate and compare the performance of the implementantions, I used three different datasets - small (100k rows), medium (10M rows) and large (100M rows). For each size, there were multiple tables with different features (cardinality of columns, correlation, ...) and a range of queries aggregating the data - different number of groups, distinct values per group, etc. The queries and scripts for generating the tables are available in the github repository.

In the charts, native means COUNT(DISTINCT) as we have it in PostgreSQL, hash-table means the original count_distinct implementation, sorted-array is the new improved implementation. The charts track duration of queries in milliseconds, so lower values are better.

What surprised me a bit is that there's pretty much no difference between sorted and unsorted data for any of the implementations (including native). The most important parameter seems to be the number of groups, and number of (distinct) values per group. So the charts only show performance (duration of the queries) for number of groups returned by the query.

Of course, the queries were chosen specifically to test the COUNT(DISTINCT) in isolation - in practice, queries tend to be more complex (joins, other WHERE conditions, ...). Those other operations may consume considerable amount of time, possibly making the speedup negligible. So take the results with a grain of salt.

Small dataset (100k rows)

While the goal of count_distinct is improving performance on large datasets, we can't just slow-down performance on small datasets. We can't just go and hurt performance for many users (and customers) to speed up queries on the few really huge tables. And it seems both the old and new count_distinct implementations handle small tables just fine.

distinct-small.png

The performance is about the same, except when the number of groups gets really high. Then it's actually much faster than the native implementation.

Medium dataset (10M rows)

Once we get to medium-sized tables, the performance difference is immediately apparent:

distinct-medium.png

The count_distinct is significantly faster than native COUNT(DISTINCT), with the exception of a query with 1 group, containing 10M distnct values, i.e. about the same as doing

SELECT COUNT(DISTINCT id) FROM table;

where id is a unique column (or almost unique column). This however seems like a rather unlikely case (at least in our application), and once you add a GROUP BY clause (with more than one group), or counting data with duplicate values, count_distinct quickly wins.

Large dataset (100M rows)

For the large dataset, count_distinct clearly wins in all cases, including counting a unique column.

distinct-large.png

The speedup, compared to the native implementation seems about the same as for the medium dataset (i.e. about ~5x speedup).

Aggregates vs. sorting

One strange thing you have probably noticed in the charts above is that for large number of groups (the two last columns in each chart), the performance gets much worse. While for lower group counts it's ~5x, here it suddenly drops to "only" ~2x speedup. Why is that?

The problem here is that the query plan in those cases actually looks like this:

                                      QUERY PLAN
-----------------------------------------------------------------------------------
  GroupAggregate  (cost=16097386.18..18097386.82 rows=100000032 width=8)
    Group Key: large_table.col_a
    ->  Sort  (cost=16097386.18..16347386.26 rows=100000032 width=8)
          Sort Key: huge_10000.col_a
          ->  Seq Scan on large_table  (cost=0.00..1442478.32 rows=100000032 width=8)

That is, it's still sorting the data, and the reason is that the planner does not know how large the aggregate state (memory for the sorted arrays) will get, and uses some defaults that happen to be quite high in this case. So the reasoning of the planner is about this:

Hey, I'll get ~100M groups! I wonder how much memory that is ... Well, the aggregate uses internal data type to pass state, which is probably a pointer to a structure of unknown size. Let's use 1kB.

Which amounts to ~100GB in this case, and that's way higher than my work_mem value (and I'd bet most sane work_mem values in general). So the planner concludes HashAggregate (keeping all the groups in memory) is not a viable plan and goes for sorting and a GroupAggregate (because that needs to keep a single group in the memory).

Of course, the actual memory requirements are much lower (single group for 100M 32-bit integers => 512MB), but in fairness the planner has no way to know this. It's better to choose a slower plan and eventually get the result, than to shoot for speed and get terminated by the dreaded OOM killer.

Summary and comments

So, that's how the performance improved by switching to a different implementation. The new count_distinct implementation is much faster, except for one case that I believe is an uncommon corner case. It's also much more memory efficient, but I haven't discussed that today - expect another post in a few days.

The current count_distinct implementation only works with fixed-length types - INT, BIGINT, ... whatever uses a fixed size (less than 8B) should work. Extending to varlena types is possible (and I have a quite clear idea on how to do that), but I'm not currently working on that and it's not a high priority for me.

If you need to count distinct values for such type (e.g. a TEXT or BYTEA), one way to do that is hash the value into an INT or BIGINT and then pass the result to count_distinct. For example pghashlib is a good library for such hashes. It does not support 64-bit version of MurmurHash3, which is one of the most popular algorithms these days, but CityHash is a great replacement (some consider it better).

However, using a hash instead of the actual value turns the count_distinct into a cardinality estimator as there's a probability of hash collisions, no matter how small the probability is. For small number of distinct values (per group) it's quite small, especially when using 64-bit hashes, but it's not zero. I have considered using this approach (keeping only a hash of the varlena value), and decided not to use it for these reasons.

If a cardinality estimate is sufficient for you, there are better solutions - for example HyperLogLog from aggregatedknowledge or me.

I also mentioned that as the hash tables grow, the L2/L3 cache hit ratio gets worse, and that one of the goals of using a simpler structure was to improve that. I don't think that turned out to work quite well - the new implementation is a tad faster than the old one, but not significantly. Had there been a singificant improvement in L2/L3 hit ratio, the difference would be much more visible. I believe this is due to the random access to the "large" hash table maintained by HashAggregate itself.





comments powered by Disqus