So, in the introduction post I briefly described what was the motivation of this whole effort, what hardware and benchmarks were used, etc. So let's see the results for the first benchmark - the well known pgbench and how they evolved since PostgreSQL 7.4.
If there's a one chart in this blog post that you should remember, it's probably the following one - it shows throughput (transactions per second, y-axis) for various numbers of clients (x-axis), for PostgreSQL releases since 7.4.
Note: If a version is not shown, it means "same (or almost the same) as the previous version" - for example 9.3 and 9.4 give almost the same performance as 9.2 in this particular test, so only 9.2 is on the chart.
So on 7.4 we could do up to 10k transactions per second (although the scalability was poor, and with more clients the throughput quickly dropped to ~3k tps). Since then the performance gradually improved, and on 9.2 we can do more than 70k tps, and keep this throughput even for higher client counts. Not bad ;-)
Also note that up until PostgreSQL 9.1 (including), we've seen a significant performance drop once we exceeded the number of CPUs (the machine has 8 cores in total, which is exactly the point where the number of transactions per second starts decreasing on the chart).
Of course, these are the result of just one test (read-only on medium dataset), and gains in other tests may be more modest, but it nicely illustrates the amount of improvements that happened sice PostgreSQL 7.4.
This post is a bit long, but don't worry - most of it are pretty images ;-)
Update: There's a hackernews thread with a discussion about the benchmark, results etc.
If you know these pgbench basics, you may just skip to the "Results" section, presenting the results of the benchmarks.
Pgbench has two main modes - read-write and read-only (there's also a third mode, and an option to use custom scripts, but I won't talk about those).
The read-write mode executes a simple transaction, simulating a withdrawal from an account - updating the balance in "accounts" table, selecting the current balance, updating tables representing a branch and a teller, and recording a row into a history.
BEGIN; UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; SELECT abalance FROM pgbench_accounts WHERE aid = :aid; UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); END;
As you can see, all the conditions are on a PK column, always work with a single row, etc. Simple.
The read-only mode is even simpler - it simply executes only the "SELECT" query from the read-write mode, so it's pretty much this:
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
Again, access through PK column - very simple.
In short, scale determines size of the database as a number of rows in the main "accounts" table - the supplied value gets multiplied by 100.000 and that's how many rows in that table you get. This of course determines the size on disk, as every 100.000 rows corresponds to 15MB on disk (including indexes etc.).
When choosing the scale for your benchmark, you have three basic choices, each testing something slightly different.
The third important parameter I'd like to discuss is client count, determining the number of concurrent connections used to execute queries.
Let's assume you measure performance with a single client, and you get e.g. 1000 transactions per second. Now, what performance will you get with 2 clients? In an ideal world, you'd get 2000 tps, twice the performance with a single client.
For N clients you'd get N-times the performance of a single client, but it's clear it doesn't work like that, because sooner or later you'll run into some bottleneck. Either the CPUs will become fully utilized, the I/O subsystem will be unable to handle more I/O requests, you'll saturate network or memory bandwidth, or hit some other bottleneck.
As the number of clients is increased, we usually see three segments on the charts:
On a simple chart, it might look like this:
Of course - this is how it would look in an ideal world. In practice, the initial growth is not linear, and the throughtput in the second part is not constant but drops over time - sometimes fast (bad), sometimes gradually (better). We'll see examples of this.
For sake of clarity and brevity, I've left out the latencies from the charts in this blog post. A proper latency analysis would require a considerable amount of space (to show averages, percentiles, ...) but the results turned out to be especially dull and boring in this case. You have to trust me that the transaction rate (and implied average latency) is a sufficient description of the results in this case.
By default, all the pgbench clients (connections) are handled by a single thread - the thread submits new queries, collects results, writes statistics etc. When the queries are long, or when the number of connections is low, this is fine. But as the queries get shorter (e.g. with read-only test on small dataset), or when the number of connections grows (on many-core systems), the management overhead becomes significant.
That's why pgbench has
--jobs N (or
-j N) option, to specify the number of threads. The value has to be a divisor of number of clients, so for example with 8 clients you may use 1, 2, 4 or 8 threads, but not 5 (because 8 % 5 = 3). There are various smart strategies to set the value, but the easiest and most straightforward way is to use the same value for both options, and that's what I used for this benchmark.
pgbench tool evolves with PostgreSQL, so when new features are added to PostgreSQL, which means new
pgbench versions may not work with older PostgreSQL releases (e.g. on releases not supporting
fillfactor storage option, which is used by
pgbench when initializing tables). There are two ways to overcome this - either use
pgbench version matching the PostgreSQL release, or use the new version but patch it so that it works with older releases (by skipping features not present in that PostgreSQL version).
I've used the latter approach, mostly because I wanted to use some of the features available only in recent pgbench versions (especially
The patch I used is available here. It changes three things:
-Xoption to disable
IF EXISTSwhen dropping table (because that's not available on old releases)
-Qoption to entirely skip cleanup (because the script simply recreates the database anyway)
fillfactoron all the tables (so the default version is used)
None of the changes should have no impact on the results, because the first two just remove new syntax (which is not necessary), and after removing the explicit fillfactor option the default will be used (which is 100 on all versions).
If you're interested in the script driving the testing, it's available here. It's a bit hackish, but it should give you an idea of how the testing was done.
So, given all the possible parameters (mode, scale, client count), what combinations were benchmarked?
That gives 84 combinations in total, and for each combination there were 3 runs, 30 minutes each. That's ~126 hours per version (not including initialization and warmup), so ~6 days per PostgreSQL version. There were 12 versions benchmarked (7.4, 8.0, 8.1, 8.2, 8.3, 8.4, 9.0, 9.1, 9.2, 9.3, 9.4b2, head) so the total runtime is close to 70 days.
Hopefully that convinces you this was not a rushed unreliable benchmark.
There were only minimal changes to the configuration file:
# machine has 16GB of RAM in total shared_buffers = 2GB work_mem = 64MB maintenance_work_mem = 512MB checkpoint_segments = 64 checkpoint_completion_target = 0.9 effective_cache_size = 8GB # when executed on the SSD (otherwise 4) random_page_cost = 2
Of course, there were some tweaks (e.g. because older versions don't know
checkpoint_completion_target) but no extensive per-version tuning was performed, as that'd make the testing much more time consuming (to the point that it'd be impossible).
I however did a limited amount of tests to see how big impact may such tuning have - I'll show you a chart illustrating that later (spoiler: the older versions may be much faster with a bit of tuning, but the new ones are still faster).
So, let's see the results! I'll only show some of the charts - the rest of them is usually quite similar.
First, the chart we've already seen:
This is a read-only workload, with a dataset that completely fits into RAM. So this is CPU-bound, with locking etc. Yet PostgreSQL 9.4 is 7-20x faster than 7.4. Not bad, I guess.
The other interesting observation is that it scales linearly with the number of cores - in this case it's actually a bit better (the performance on 8 cores is more than 2x of 4 cores), most likely thanks to caching effects. During the 9.2 development this was thoroughly tested (by Robert Haas) and the conclusion was that we do scale linearly up to 64 cores (and probably beyond, but we didn't have machines with more cores).
There's also an excellent talk by Heikki Linnakangas about all the improvements done in PostgreSQL 9.2, presented at FOSDEM 2012.
This is of course dependent on workload - the improvements in locking are most visible on read-only workloads on small datasets, because that's mostly cached and should do much I/O. Once you start hitting any kind of I/O, it will probably overshadow these improvements. It's also related to improvements made in Linux kernel, for example the lseek scalability improvements committed into kernel 3.2. And finally all this is of course dependent on the exact CPU type - on older CPUs the improvements may not be as great.
Let's see how the read-only workload performs on a large dataset (thus not fitting into RAM), on an SSD drive
This is clearly I/O bound, because pgbench generates a lot of random I/O (although the SSD handles that much better than rotational drives). Because of the I/O bottleneck the difference is not as significant as with the medium dataset, but 2-4x speedup is still really nice.
There were two releases that significantly improved the performance - first 8.0, then 8.1 (and that's mostly what we get on 9.4, performance-wise).
So, what happens when we keep the large dataset, but do a read-write test?
Interesting. The 7.4 is doing rather poorly (just ~500 transactions per second on an SSD). The 8.x releases perform much better, with roughly ~2x the performance, but the throughput is mostly constant after reaching 8 clients (which is the number of CPUs in the machine).
But since 9.1, we see a quite interesting behavior - the throughput grows even after 8 clients (where it's mostly equal to 8.x), up to ~2500 tps with 32 clients. And the chart suggests it would probably grow even further with more clients (had it been part of the benchmark). This can happen because SSDs are quite good with handling parallel requests (thanks to internal architecture, which uses multiple independend channels), and since 9.1 we can saturate it much better.
Note 1: The read-write results on medium dataset are very similar to this.
If you ask how would the comparison look on traditional rotational drives, this is the answer:
For the record, these are 10k SAS drives, connected to a HP P400 controller (RAID10 on 6 drives).
It's interesting that 8.4 is slightly faster than 9.4 for higher client counts, but we're still doing better than 7.4, but clearly the I/O is a significant bottleneck that's difficult to beat at the code level.
Another interesting question is how much this depends on hardware, so here is a comparison with results from the machine with i5-2500k CPU.
Wow! On PostgreSQL 7.4 the difference is "only" 2x, but on 9.4 the i5 really crushes the Xeon, giving nearly ~4x the performance. Also, the i5 has only 4 cores while the Xeon has 8 cores (in 2 sockets), so that's ~8x the performance per socket, while the frequencies are about the same (3 GHz vs. 3.3 GHz).
Just for the record, this difference is not only thanks to the CPU - the i5 system has a better memory throughput, the SSD is not connected using a lousy PCIe controller (which limits the bandwidth), etc. But clearly, modern hardware can make a huge difference.
I also mentioned that each PostgreSQL version may have different "optimal" configuration values. That makes sense, because the development happens in the context of the current hardware. When 7.4 was released 10 years ago, most machines had only 4GB of RAM (not to mention that everything was still 32-bit and thus capped to 2GB), so there was no point in optimizing for large
shared_buffers values, for example.
So, what happens if we use much lower settings for
shared_buffers (only 128MB instead of the 2GB used in all the previous benchmarks)? This:
On 7.4, using
shared_buffers=128MB improved the performance of
shared_buffers=2GB by a factor of ~3x. On 8.0 there's still some difference, but it's much smaller (~5%). The nice thing is that 9.4b1 is faster even with the 2GB.
Note: There are also other reasons to use small
shared_buffers values on older versions (absence of
checkpoint_completion_target and usually running them on older filesystems like ext3, with poor fsync behavior).
Those were medium and large datasets, but what about the small one? The read-only results look quite similar to the medium dataset, but the read-write seems interesting on it's own:
Wow! 8.2 really improved this, boosting the performance 5x, compared to 8.1. The following releases further improved the results by ~25%, but 8.2 was clearly a huge jump forward.
And finally, let's do two more comparisons of the Xeon and i5 based systems, with read-only and read-write benchmarks on the small dataset.
The solid lines are Xeon, the dashed lines are i5. This time the Xeon actually wins over i5 (unlike with the large dataset), at least on 9.4 - having 2x the cores matters here, apparently. It's also nice to see that 9.4 fixed the performance degradation on the Xeon, with client counts higher than the number of CPU cores (on the i5 this did not happen at all).
On read-write workload, i5 wins again (despite the smaller number of cores).
Let me summarize the main points we've learned from the benchmark results:
shared_buffersetc.), because that's the context of their development.