PostgreSQL performance with gcc, clang and icc


2014/11/18 by Tomas Vondra

On Linux, "compiler" is usually a synonym for gcc, but clang is gaining more and more adoption. Over the years, Phoronix has published several articles comparing the performance of various clang and gcc versions, suggesting that while clang improves over time, gcc still wins in most benchmarks - except maybe "compilation time", where clang is a clear winner. But none of those benchmarks is really a database-style application, so the question is how much difference you can get by switching compilers (or compiler versions). So I ran a bunch of tests with gcc versions 4.1-4.9, clang 3.1-3.5, and, just for fun, icc 2013 and 2015. Here are the results.


I ran the two usual types of tests - pgbench, representing a transactional workload (lots of small queries), and a subset of the TPC-DS benchmark, representing analytical workloads (a few queries chewing through large amounts of data).

I'll present results from a machine with an i5-2500k CPU, 8GB of RAM and an SSD, running Gentoo with kernel 3.12.20. I did some rudimentary PostgreSQL tuning, mostly by tweaking postgresql.conf like this:

shared_buffers = 1GB
work_mem = 128MB
maintenance_work_mem = 256MB
checkpoint_segments = 64
effective_io_concurrency = 32

I do have results from another machine, but in general they confirm the results presented here. PostgreSQL was compiled like this:

./configure --prefix=...
make install

i.e. nothing special (no custom tweaks, etc.). The rest of the system is compiled with gcc 4.7.
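
How each compiler was selected is not spelled out above. One common way with autoconf-based builds such as PostgreSQL is to pass CC to configure, so here is a minimal sketch that only assembles the command lines (it does not run them). The compiler binary names and the install directories are hypothetical placeholders, not the ones used for these tests.

```python
import shlex

# Hypothetical compiler binaries; the real names depend on how the
# distribution installs several compiler versions side by side.
COMPILERS = ["gcc-4.9.2", "clang", "icc"]


def build_commands(cc, prefix):
    """Return the shell commands for a default build using compiler `cc`.

    `prefix` is whatever install directory you pick; the post elides it,
    so no particular path is implied here.
    """
    configure = ["./configure", f"--prefix={prefix}", f"CC={cc}"]
    return [shlex.join(configure), "make install"]


if __name__ == "__main__":
    for cc in COMPILERS:
        # One install directory per compiler (illustrative layout only).
        for cmd in build_commands(cc, f"/tmp/pg-{cc}"):
            print(cmd)
```

Keeping one install prefix per compiler makes it easy to switch between builds while reusing the same data directory and postgresql.conf.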

pgbench

I ran pgbench with three dataset sizes - small (~150MB), medium (~25% of RAM) and large (~200% of RAM). For each scale I ran pgbench with 4 clients (which is the number of cores on the CPU) for 15 minutes, repeated 3x, and averaged the results. I did all of this in both read-write and read-only mode.
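
A rough sketch of what such runs look like on the command line, assuming the standard pgbench options (-i and -s for initialization at a given scale, -c for clients, -T for duration in seconds, -S for the built-in SELECT-only script). The database name "bench" is a placeholder, and the exact invocations used for these tests may have differed.

```python
import shlex
from statistics import mean

CLIENTS = 4            # one client per core, as described above
DURATION_S = 15 * 60   # 15-minute runs
RUNS = 3               # each test repeated three times


def init_command(scale, db="bench"):
    """Command that (re)creates the pgbench tables at the given scale."""
    return shlex.join(["pgbench", "-i", "-s", str(scale), db])


def run_command(read_only, db="bench"):
    """Command for one timed run, read-write by default."""
    argv = ["pgbench", "-c", str(CLIENTS), "-T", str(DURATION_S)]
    if read_only:
        argv.append("-S")  # built-in SELECT-only script
    argv.append(db)
    return shlex.join(argv)


def average_tps(samples):
    """Average the tps values reported by the repeated runs."""
    return mean(samples)


if __name__ == "__main__":
    print(init_command(10))            # small dataset
    for _ in range(RUNS):
        print(run_command(read_only=True))
```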

The first observation is that once you start hitting the drives, the compiler makes absolutely no measurable difference. That makes the results from all the read-write tests (for all scales) uninteresting, as well as the read-only test on the large dataset - for all these tests I/O is the main bottleneck (and that's something the compiler can't really influence).

So we're left with just the read-only benchmark on small and medium datasets, where the results look like this:

compiler    tps (small, scale=10)   tps (medium, scale=140)
gcc 4.1.2   52932                   49837
gcc 4.2.4   53071                   50219
gcc 4.3.6   52147                   49396
gcc 4.4.7   52597                   49834
gcc 4.5.4   53537                   50143
gcc 4.6.4   53238                   49959
gcc 4.7.4   54383                   51033
gcc 4.8.3   54494                   51627
gcc 4.9.2   55084                   52515
clang 3.1   55160                   51748
clang 3.2   55848                   52197
clang 3.3   54946                   51906
clang 3.4   55297                   52306
clang 3.5   55800                   52458
icc 2013    52249                   49197
icc 2015    52064                   49064

Let's use the gcc 4.1.2 results as a baseline, and express the other results as a percentage of the baseline. So 100 means "same as gcc 4.1.2", 90 means "10% slower than gcc 4.1.2" and so on. On a chart it then looks like this (the higher the number, the better):

pgbench-comparison.png

Not really a dramatic difference:

  • gcc 4.9 and clang 3.5 are winners, with ~4-5% improvement over gcc 4.1.2
  • gcc improves over time, with the exception of 4.3/4.4, where the performance dropped below 4.1
  • clang is very fast right from 3.1, peaking at 3.2 on the small dataset (where it is slightly better than 3.5, although 3.5 wins on the medium one)
  • surprisingly, icc gives the worst results here
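
The percentages behind these observations can be reproduced directly from the table. A minimal sketch (the dictionary simply copies the tps values above; the function name is mine):

```python
# tps values copied from the table: (small scale=10, medium scale=140)
TPS = {
    "gcc 4.1.2": (52932, 49837),
    "gcc 4.2.4": (53071, 50219),
    "gcc 4.3.6": (52147, 49396),
    "gcc 4.4.7": (52597, 49834),
    "gcc 4.5.4": (53537, 50143),
    "gcc 4.6.4": (53238, 49959),
    "gcc 4.7.4": (54383, 51033),
    "gcc 4.8.3": (54494, 51627),
    "gcc 4.9.2": (55084, 52515),
    "clang 3.1": (55160, 51748),
    "clang 3.2": (55848, 52197),
    "clang 3.3": (54946, 51906),
    "clang 3.4": (55297, 52306),
    "clang 3.5": (55800, 52458),
    "icc 2013": (52249, 49197),
    "icc 2015": (52064, 49064),
}


def percent_of_baseline(tps, baseline="gcc 4.1.2"):
    """Express every result as a percentage of the baseline (100 = same)."""
    base = tps[baseline]
    return {
        name: tuple(round(100.0 * value / ref, 1) for value, ref in zip(values, base))
        for name, values in tps.items()
    }


if __name__ == "__main__":
    for name, (small, medium) in percent_of_baseline(TPS).items():
        print(f"{name:10} {small:6.1f} {medium:6.1f}")
```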

TPC-DS

Now, the data warehouse benchmark. I've used a small dataset (1GB), so that it fits into memory - otherwise we'd hit the I/O bottlenecks and the compilers would make no difference. First, let's load the data - the script performs these operations:

  • COPY data into all the tables
  • create indexes
  • VACUUM FULL (not really necessary)
  • VACUUM FREEZE
  • ANALYZE
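
The sequence above can be sketched as a list of SQL statements. This is only an illustration of the ordering, not the actual load script: the data file paths and the index definition are hypothetical, and it assumes pipe-delimited input files that COPY can read as-is.

```python
def load_script(tables, indexes):
    """Return the SQL statements for the load phases, in order.

    tables  -- mapping of table name to data file path (paths are placeholders)
    indexes -- list of CREATE INDEX statements (illustrative only)
    """
    statements = [
        f"COPY {table} FROM '{path}' WITH (DELIMITER '|');"
        for table, path in tables.items()
    ]
    statements += [ddl.rstrip(";") + ";" for ddl in indexes]
    # Database-wide maintenance, matching the last three steps of the list.
    statements += ["VACUUM FULL;", "VACUUM FREEZE;", "ANALYZE;"]
    return statements


if __name__ == "__main__":
    script = load_script(
        {"store_sales": "/path/to/store_sales.dat"},        # placeholder path
        ["CREATE INDEX ON store_sales (ss_sold_date_sk)"],  # example index
    )
    print("\n".join(script))
```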

The results (in seconds) look like this:

tpcds-load.png

compiler    copy   indexes   vacuum full   vacuum freeze   analyze   total
gcc 4.1.2   110    131       168           5               8         422
gcc 4.2.4   105    128       162           5               8         408
gcc 4.3.6   103    127       160           4               7         401
gcc 4.4.7   102    127       160           4               7         400
gcc 4.5.4   101    126       160           4               6         397
gcc 4.6.4   103    128       162           5               8         406
gcc 4.7.4   100    122       156           3               6         387
gcc 4.8.3   101    122       155           3               6         387
gcc 4.9.2   102    118       150           3               8         381
clang 3.1   108    129       162           4               8         411
clang 3.2   104    125       160           4               6         399
clang 3.3   105    125       160           3               6         399
clang 3.4   106    126       161           3               8         404
clang 3.5   105    127       162           4               8         406
icc 2013    106    129       163           4               8         410
icc 2015    105    125       160           4               6         400

According to the totals, the difference between the slowest (gcc 4.1.2) and the fastest (gcc 4.9.2) is ~10%. Again, gcc improves fairly steadily (4.6.4 being the one regression), which is nice. Clang has actually been getting slightly slower since 3.2, which is not so nice, and clang 3.5 is ~6.5% slower than gcc 4.9.2. And icc is somewhere in between, with a nice speedup between the 2013 and 2015 versions.
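
For the record, this is how the totals and the two percentages quoted above fall out of the table. A small sketch using three rows copied from it (the helper names are mine):

```python
# (copy, indexes, vacuum full, vacuum freeze, analyze) in seconds,
# copied from the table for the three compilers mentioned in the text
LOAD = {
    "gcc 4.1.2": (110, 131, 168, 5, 8),
    "gcc 4.9.2": (102, 118, 150, 3, 8),
    "clang 3.5": (105, 127, 162, 4, 8),
}

TOTALS = {name: sum(steps) for name, steps in LOAD.items()}


def percent_saved(slow, fast):
    """How much shorter `fast` is, as a percentage of `slow`."""
    return 100.0 * (slow - fast) / slow


def percent_slower(slow, fast):
    """How much longer `slow` takes, as a percentage of `fast`."""
    return 100.0 * (slow - fast) / fast


if __name__ == "__main__":
    print(TOTALS)
    print(round(percent_saved(TOTALS["gcc 4.1.2"], TOTALS["gcc 4.9.2"]), 1))
    print(round(percent_slower(TOTALS["clang 3.5"], TOTALS["gcc 4.9.2"]), 1))
```

Note that the two percentages use different denominators: the first is relative to the slower build, the second relative to the faster one.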

But that was just loading the data, what about the actual queries? TPC-DS specifies 99 query templates. Some of those use features not yet available in PostgreSQL, leaving us with 61 PostgreSQL-compatible templates. Sadly 2 of those did not complete within 30 minutes on the 1GB dataset (clearly, room for improvement), so the actual benchmark consists of 59 queries.

The chart of the total duration of three runs per query, using gcc 4.1.2 as a baseline (just like for pgbench, but this time lower numbers are better), looks like this:

tpcds-queries.png

Clearly, the differences are much more significant than in the pgbench results. Again, gcc continuously improves over time, with 4.9.2 being the winner here - the difference between 4.1.2 and 4.9.2 is an astonishing ~15%. That's a pretty significant improvement - good work, GCC developers!

Clang results fluctuate a lot - 3.1, 3.3 and 3.5 are quite good (not as good as gcc 4.9.2, though).

And icc is again somewhere in the middle - faster than gcc 4.1.2 but nowhere near as fast as gcc 4.9.2 or the "good" clang versions. And this time the 2015 version actually slowed down (contrary to the previous results).

Summary

If your workload is transactional (pgbench-like), the compiler does not matter that much - either you're hitting the disks (and the compiler does not matter at all), or the differences are within ~5% of gcc 4.1.2. And if a gain this small is significant enough for you to warrant switching compilers, you should probably consider getting slightly more powerful hardware (a CPU with more cores, faster RAM, better storage, ...).

Analytical workloads are a different case - gcc is a clear winner, and if you're using an ancient version (say, 4.3 or older), you can get ~10% speedup by switching to 4.7, or ~15% to 4.9. In any case, the newer the version, the better.




