Compiler optimizations vs PostgreSQL


2014/12/01 by Tomas Vondra

About two weeks ago I posted a performance comparison of PostgreSQL compiled using various compilers and compiler versions. The conclusion was that for pgbench-like workloads (lots of small transactions), the compilers make very little difference. Maybe one or two percent, if you're using a reasonably fresh compiler version. For analytical workloads (queries processing large amounts of data), the compilers make a bit more difference - gcc 4.9.2 giving the best results (about 15% faster than gcc 4.1.2, with the other compilers/versions somewhere between those two).

Those results were, however, measured with the default optimization level (which is -O2 for PostgreSQL), and one of the questions in the discussion below that article was what difference the other optimization settings (like -O3, -march=native and -flto) would make. So here we go!


I repeated the two tests described in the previous post - pgbench and TPC-DS, with almost exactly the same configuration. If you're interested in the details, read that post.

The only thing that really changed is that while compiling the PostgreSQL sources, I modified the optimization level or enabled the additional options. In total, I decided to test these combinations:

  • clang
      ◦ -O3
      ◦ -O4 (available since clang-3.4)
  • gcc
      ◦ -O3
      ◦ -O3 -march=native (since gcc-4.2)
      ◦ -O3 -march=native -flto (since gcc-4.5)

When combined with all the clang and gcc versions, this amounts to 43 combinations. I haven't done the tests for the Intel C Compiler.
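The count of 43 can be reproduced with a short Python sketch. It assumes the compiler versions are the ones listed in the result tables of this post, and that the -O2 baseline builds are included in the total; the names used here are only illustrative.

```python
# Compiler versions as they appear in the result tables of this post.
CLANG_VERSIONS = ["3.1", "3.2", "3.3", "3.4", "3.5"]
GCC_VERSIONS = ["4.1.2", "4.2.4", "4.3.6", "4.4.7", "4.5.4",
                "4.6.4", "4.7.4", "4.8.3", "4.9.2"]

# (flags, first version tested with them); None means "all tested versions".
# Assumption: the -O2 baseline is counted as one of the combinations.
CLANG_FLAGS = [("-O2", None), ("-O3", None), ("-O4", "3.4")]
GCC_FLAGS = [
    ("-O2", None),
    ("-O3", None),
    ("-O3 -march=native", "4.2"),
    ("-O3 -march=native -flto", "4.5"),
]


def as_tuple(version):
    """Turn '4.9.2' into (4, 9, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def combinations(compiler, versions, flag_sets):
    """Yield (compiler, version, flags) for every supported pairing."""
    for version in versions:
        for flags, since in flag_sets:
            if since is None or as_tuple(version) >= as_tuple(since):
                yield compiler, version, flags


ALL_BUILDS = (list(combinations("clang", CLANG_VERSIONS, CLANG_FLAGS))
              + list(combinations("gcc", GCC_VERSIONS, GCC_FLAGS)))

if __name__ == "__main__":
    print(len(ALL_BUILDS), "builds")
    for compiler, version, flags in ALL_BUILDS:
        print(compiler, version, flags)
```

That gives 12 clang builds and 31 gcc builds.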

BTW if you're interested in more detailed results, see this spreadsheet, or download this OpenDocument spreadsheet (same data).

pgbench

For the small dataset (~150MB), the results of the read-only test are depicted on the following chart, showing the number of transactions per second (50k-58k range) - so the higher the number, the better.

The results are sorted by compiler (clang, gcc), optimization level and finally compiler version. The bars depict minimum, average and maximum tps (from 3 runs), to give an idea of how volatile the results are - I haven't found a better chart type in Google Drive or LibreOffice Calc.
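To make the min/avg/max bars concrete, here is a minimal Python sketch of how three runs could be reduced to those three numbers, plus a relative spread as a rough volatility measure. The run values are made up for illustration (they are not measurements from this test), and the spread metric is my own addition.

```python
from statistics import mean


def summarize(tps_runs):
    """Reduce the tps of several runs to min / avg / max and a spread in %."""
    low, high, avg = min(tps_runs), max(tps_runs), mean(tps_runs)
    return {
        "min": low,
        "avg": avg,
        "max": high,
        # (max - min) relative to the average, as a crude volatility indicator
        "spread_pct": 100.0 * (high - low) / avg,
    }


# Hypothetical tps values for three runs of one build (not measured data).
example_runs = [54100, 54600, 55300]
summary = summarize(example_runs)

if __name__ == "__main__":
    print(summary)
```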

optimizations-pgbench-small-read-only.png

Now, the first thing you probably notice is that for clang, the higher optimization levels mostly lower the performance. The newer the version, the worse the impact - while clang 3.5 gives >55k transactions with -O2, it drops to 54k with -O3 and 53k with -O4. The 2-4% difference is not a big one, but it's pretty consistent and it certainly is not in the direction we hoped for.

With gcc, the situation is more complicated - the -O3 and -O3 -march=native levels result in slightly worse performance, although the difference is not as significant as for clang. The results however seem less volatile (e.g. 4.4 is a good example of that).

The "Link Time Optimization" is a different story, increasing the performance for most versions, especially compared to the -O3 -march=native results. For example 4.8 jumps from 53k to 55k tps, and 4.9 jumps from 54k to 56k. Compared to the -O2 results it's not that great, though (it's still faster, but the difference is smaller).

The other way to look at the data is by looking at the results like this:

compiler     -O2     -O3     -O4
clang 3.1    54808   54666   -
clang 3.2    55320   54957   -
clang 3.3    55314   54909   -
clang 3.4    55144   54474   54259
clang 3.5    55766   54628   53848

compiler     -O2     -O3     -O3 -march=native   -O3 -march=native -flto
gcc 4.1.2    52808   53019   -                   -
gcc 4.2.4    53474   52829   53053               -
gcc 4.3.6    52355   52634   52465               -
gcc 4.4.7    51685   52070   52194               -
gcc 4.5.4    53739   52828   52663               56085
gcc 4.6.4    53144   53632   52899               54973
gcc 4.7.4    54354   53572   53001               52451
gcc 4.8.3    54350   52390   52753               54842
gcc 4.9.2    54669   54036   53758               56151

And after computing the difference against the -O2 results for each version, you'll get this:

compiler     -O2 (tps)   -O3      -O4
clang 3.1    54808       -0.26%   -
clang 3.2    55320       -0.66%   -
clang 3.3    55314       -0.73%   -
clang 3.4    55144       -1.21%   -1.60%
clang 3.5    55766       -2.04%   -3.44%

compiler     -O2 (tps)   -O3      -O3 -march=native   -O3 -march=native -flto
gcc 4.1.2    52808       0.40%    -                   -
gcc 4.2.4    53474       -1.20%   -0.79%              -
gcc 4.3.6    52355       0.53%    0.21%               -
gcc 4.4.7    51685       0.74%    0.98%               -
gcc 4.5.4    53739       -1.70%   -2.00%              4.36%
gcc 4.6.4    53144       0.92%    -0.46%              3.44%
gcc 4.7.4    54354       -1.44%   -2.49%              -3.50%
gcc 4.8.3    54350       -3.61%   -2.94%              0.91%
gcc 4.9.2    54669       -1.16%   -1.67%              2.71%
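The percentages are plain relative changes against the -O2 column. Here is a minimal Python sketch using the gcc averages from the first pgbench table (the function and variable names are mine). Differences of about 0.01 against the published percentages are to be expected, presumably because those were computed from unrounded averages.

```python
# Average tps per gcc version, in the same column order as the table.
# None marks a combination that was not tested.
COLUMNS = ["-O2", "-O3", "-O3 -march=native", "-O3 -march=native -flto"]
GCC_TPS = {
    "4.1.2": [52808, 53019, None, None],
    "4.2.4": [53474, 52829, 53053, None],
    "4.3.6": [52355, 52634, 52465, None],
    "4.4.7": [51685, 52070, 52194, None],
    "4.5.4": [53739, 52828, 52663, 56085],
    "4.6.4": [53144, 53632, 52899, 54973],
    "4.7.4": [54354, 53572, 53001, 52451],
    "4.8.3": [54350, 52390, 52753, 54842],
    "4.9.2": [54669, 54036, 53758, 56151],
}


def change_vs_baseline(row):
    """Percent change of each non-baseline column relative to row[0] (-O2).

    Higher tps is better, so a positive value means a speed-up.
    """
    baseline = row[0]
    return [None if tps is None else 100.0 * (tps - baseline) / baseline
            for tps in row[1:]]


if __name__ == "__main__":
    for version, row in GCC_TPS.items():
        cells = ["-" if c is None else f"{c:+.2f}%"
                 for c in change_vs_baseline(row)]
        print(f"gcc {version}", row[0], *cells)
```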

This only confirms that on clang all the higher optimization levels hurt the performance (although only a tiny bit). For gcc, the only thing that makes a bit of difference in the right direction is the -flto flag. But even this makes less difference than the compiler version (gcc-4.9.2 with -O2 is almost as fast as gcc-4.8.3 with -flto).

TPC-DS

OK, so that was a transactional workload. Now let's see the impact on analytical workloads ... first, the data load, consisting of the same steps as before:

  • COPY data into all the tables
  • create indexes
  • VACUUM FULL (not really necessary)
  • VACUUM FREEZE
  • ANALYZE

but for all the various optimization combinations:

optimizations-tpc-ds-loading.png

Clearly, no significant impact - exactly as in the initial post. In case you prefer the tabular form of the results (similar to the one presented for pgbench), this time tracking the total duration of the loading process (in seconds):

compiler     -O2   -O3   -O4
clang-3.1    407   407   -
clang-3.2    399   396   -
clang-3.3    399   395   -
clang-3.4    406   397   411
clang-3.5    405   405   411

compiler     -O2   -O3   -O3 -march=native   -O3 -march=native -flto
gcc-4.1.2    401   406   -                   -
gcc-4.2.4    407   402   397                 -
gcc-4.3.6    401   398   400                 -
gcc-4.4.7    400   402   398                 -
gcc-4.5.4    398   394   391                 393
gcc-4.6.4    406   398   400                 397
gcc-4.7.4    385   384   384                 387
gcc-4.8.3    390   384   390                 383
gcc-4.9.2    379   383   374                 379

and as a speedup versus the -O2 for the same compiler version (negative values mean slowdown):

compiler     -O2 (s)   -O3      -O4
clang-3.1    407       -0.10%   -
clang-3.2    399       0%       -
clang-3.3    399       1%       -
clang-3.4    406       2%       -1.18%
clang-3.5    405       0%       -1.38%

compiler     -O2 (s)   -O3      -O3 -march=native   -O3 -march=native -flto
gcc-4.1.2    401       -1.11%   -                   -
gcc-4.2.4    407       1%       2%                  -
gcc-4.3.6    401       0%       0%                  -
gcc-4.4.7    400       -0.61%   0%                  -
gcc-4.5.4    398       1%       1%                  1%
gcc-4.6.4    406       1%       1%                  2%
gcc-4.7.4    385       0%       0%                  -0.57%
gcc-4.8.3    390       1%       -0.02%              1%
gcc-4.9.2    379       -1.04%   1%                  -0.13%

Now, let's see the impact on query performance (notice the chart shows range 150-210, in seconds):

optimizations-tpc-ds.png

And the results in the tabular form:

compiler     -O2   -O3   -O4
clang-3.1    176   174   -
clang-3.2    176   172   -
clang-3.3    174   185   -
clang-3.4    189   176   181
clang-3.5    174   175   179

compiler     -O2   -O3   -O3 -march=native   -O3 -march=native -flto
gcc-4.1.2    186   200   -                   -
gcc-4.2.4    189   186   186                 -
gcc-4.3.6    189   186   185                 -
gcc-4.4.7    181   178   183                 -
gcc-4.5.4    173   169   166                 160
gcc-4.6.4    171   173   172                 153
gcc-4.7.4    171   166   183                 160
gcc-4.8.3    171   170   172                 161
gcc-4.9.2    164   167   162                 153

and as a speedup versus the -O2 for the same compiler version (negative values mean slowdown):

compiler     -O2 (s)   -O3      -O4
clang-3.1    176       1.14%    -
clang-3.2    176       2.27%    -
clang-3.3    174       -6.32%   -
clang-3.4    189       6.88%    4.23%
clang-3.5    174       -0.57%   -2.87%

compiler     -O2 (s)   -O3      -O3 -march=native   -O3 -march=native -flto
gcc-4.1.2    186       -7.53%   -                   -
gcc-4.2.4    189       1.59%    1.59%               -
gcc-4.3.6    189       1.59%    2.12%               -
gcc-4.4.7    181       1.66%    -1.10%              -
gcc-4.5.4    173       2.31%    4.05%               7.51%
gcc-4.6.4    171       -1.17%   -0.58%              10.53%
gcc-4.7.4    171       2.92%    -7.02%              6.43%
gcc-4.8.3    171       0.58%    -0.58%              5.85%
gcc-4.9.2    164       -1.83%   1.22%               6.71%
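Since these tables hold durations, lower is better, so the sign convention is flipped compared to the tps tables: the speedup is the time saved relative to the -O2 run. A small Python sketch (function name is mine), checked against a few rows of the query durations:

```python
def speedup_pct(baseline_s, duration_s):
    """Speedup in percent relative to the baseline duration.

    Positive means the build finished faster than -O2, negative means slower.
    """
    return 100.0 * (baseline_s - duration_s) / baseline_s


# (compiler, flags, -O2 duration, duration with flags), from the query table.
SAMPLES = [
    ("gcc-4.6.4", "-O3 -march=native -flto", 171, 153),
    ("gcc-4.7.4", "-O3 -march=native", 171, 183),
    ("gcc-4.1.2", "-O3", 186, 200),
    ("clang-3.4", "-O3", 189, 176),
]

if __name__ == "__main__":
    for compiler, flags, base, duration in SAMPLES:
        print(f"{compiler} {flags}: {speedup_pct(base, duration):+.2f}%")
```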

For clang, the results vary for each version - on 3.3 the -O3 results in ~6% slowdown, on 3.4 it's ~6% speed-up. For the last version (3.5) it's a slight slowdown for both -O3 and -O4.

For gcc, the -O3 and -O3 -march=native flags are a bit unpredictable - on older versions this might give either slight improvement or significant slowdown (see for example gcc-4.7.4 where -O3 gives ~3% speed-up, but -O3 -march=native results in ~7% slowdown).

The only flag that really matters on gcc is apparently -flto (i.e. Link Time Optimization), giving ~6-8% speedup for most versions (and ~10% on gcc-4.6.4). That's not negligible, although it's not a ground-breaking speed-up either.

Summary

  • The various optimization flags don't have much impact - in most cases it's ~1-2%.
  • When they do have an impact, it's often in the unexpected (and undesirable) direction, actually making it slower.
  • The one flag that apparently makes a measurable difference in the right direction is -flto, giving ~3% speed-up in pgbench and ~7% in TPC-DS.
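The post does not say how the ~3% and ~7% figures for -flto were derived. One reading that is consistent with the tables is the median gain over -O2 across the five gcc versions that support -flto, which this Python sketch computes (this is my assumption, not necessarily how the numbers were obtained):

```python
from statistics import median

# gcc version -> (-O2 value, -O3 -march=native -flto value), from the tables.
PGBENCH_TPS = {          # higher is better
    "4.5.4": (53739, 56085),
    "4.6.4": (53144, 54973),
    "4.7.4": (54354, 52451),
    "4.8.3": (54350, 54842),
    "4.9.2": (54669, 56151),
}
TPCDS_SECONDS = {        # lower is better
    "4.5.4": (173, 160),
    "4.6.4": (171, 153),
    "4.7.4": (171, 160),
    "4.8.3": (171, 161),
    "4.9.2": (164, 153),
}

# Throughput: gain is the relative increase in tps.
pgbench_gains = [100.0 * (lto - o2) / o2 for o2, lto in PGBENCH_TPS.values()]
# Duration: gain is the relative reduction in run time.
tpcds_gains = [100.0 * (o2 - lto) / o2 for o2, lto in TPCDS_SECONDS.values()]

pgbench_median = median(pgbench_gains)
tpcds_median = median(tpcds_gains)

if __name__ == "__main__":
    print(f"pgbench median -flto gain: {pgbench_median:.2f}%")
    print(f"TPC-DS  median -flto gain: {tpcds_median:.2f}%")
```

The median is used here because a single outlier (gcc-4.7.4 in pgbench, where -flto is slower than -O2) pulls the mean down noticeably.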



