One of the questions I keep getting in response to the talks and posts about filesystem performance is "What about I/O schedulers?" Sadly I didn't have much reliable data on this topic before, and what little I had I published some time ago. But I've been asking myself this question too, so naturally I decided to run additional benchmarks, hopefully improving the reliability somewhat. So let's look at what I've measured.
The main reason why I haven't been very happy with the previous measurements is that the setup was not particularly representative of actual deployments. Most serious deployments don't run on a single SSD device, but at least separate the data directory and WAL somehow. As I finally received the additional SSD drives I ordered some time ago, I've modified the setup this way - one SSD for WAL, one SSD for data. This also allows setting different I/O schedulers for WAL and data separately - maybe one scheduler is good for WAL and a different one for data?
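For readers who want to try a split setup like this, here is a minimal sketch of inspecting and switching the scheduler per block device through sysfs. It is not the tooling used for these benchmarks. The device names (sdb, sdc) and the sysfs_root parameter are hypothetical and exist only for illustration and testing; writing to the real /sys requires root, and newer kernels offer a different set of scheduler names than the ones discussed here.

```python
from pathlib import Path


def parse_scheduler_line(text):
    """Parse the contents of a sysfs 'scheduler' file.

    The kernel lists all available schedulers on one line and marks the
    active one with square brackets, e.g. "noop deadline [cfq]".
    Returns (active, available).
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def scheduler_path(device, sysfs_root="/sys"):
    # sysfs_root is a hypothetical knob so the code can be pointed at a fake tree.
    return Path(sysfs_root) / "block" / device / "queue" / "scheduler"


def get_scheduler(device, sysfs_root="/sys"):
    """Return (active, available) schedulers for a block device such as 'sdb'."""
    return parse_scheduler_line(scheduler_path(device, sysfs_root).read_text())


def set_scheduler(device, name, sysfs_root="/sys"):
    """Switch the scheduler for one device (needs root on a real system)."""
    _, available = get_scheduler(device, sysfs_root)
    if name not in available:
        raise ValueError(f"{name!r} not offered for {device}: {available}")
    scheduler_path(device, sysfs_root).write_text(name)
```

Assuming the WAL lives on sdb and the data directory on sdc (made-up names), set_scheduler("sdb", "noop") followed by set_scheduler("sdc", "cfq") would give one of the combinations tested here. The setting does not survive a reboot.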
Those results suggest that in terms of average throughput, the I/O scheduler on XLOG does not actually matter, while the I/O scheduler on the data directory can indeed make a significant difference.
The interesting thing is that although the common recommendation is to use the "deadline" I/O scheduler for databases or "noop" for SSD devices, in this benchmark "cfq" actually performed better than both of those schedulers. Compared to "deadline" the difference is not that big (just ~3%), but "noop" is almost 10% slower.
But of course, average throughput may be quite misleading, hiding various types of performance variability. One way to illustrate the variability might be to simply plot the number of transactions completed each second, but that gets very confusing very quickly. So instead let's look at the statistical distribution of this metric - a 60-minute run gives us 3600 observations, which we can use to build an approximation of the statistical distribution.
A good way to represent the distribution is the CDF (cumulative distribution function), which for the "tps" looks like this:
Firstly, this chart once again suggests that the scheduler on XLOG does not really matter - there are three clearly separated groups of curves, depending on the scheduler set on the data directory.
If you're not familiar with CDFs, they're not that difficult to read - pick a value on the x-axis, for example 6000 tps, and find the matching value on the y-axis for a given curve. For the "blue" curves on the previous chart it's ~0.5, which means ~50% of values are below 6000. In other words, 6000 is the median of the distributions when the "noop" scheduler is used for the data directory. Or you may go from the other direction and pick a value on the y-axis, and use that to find the matching values on the x-axis, which are effectively percentiles - for example by using 0.5, you can easily find the medians for all the curves.
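Both ways of reading the chart can be sketched in a few lines. This is only an illustration of an empirical CDF, not the code used to produce the charts, and the sample values in the usage note are made up. Note that this version counts values less than or equal to x.

```python
import math
from bisect import bisect_right


def ecdf(samples):
    """Return a function x -> fraction of samples that are <= x."""
    xs = sorted(samples)
    n = len(xs)

    def cdf(x):
        # bisect_right gives the number of sorted values <= x
        return bisect_right(xs, x) / n

    return cdf


def percentile(samples, q):
    """Return the smallest observed value whose ECDF is >= q (0 < q <= 1)."""
    xs = sorted(samples)
    k = max(math.ceil(q * len(xs)) - 1, 0)
    return xs[k]
```

With six made-up per-second tps values such as [5800, 5900, 6000, 6100, 6200, 6300], reading the x-axis first means calling ecdf(values)(6000), and reading the y-axis first means calling percentile(values, 0.5). A real 60-minute run would supply 3600 values instead.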
The CDF also visualizes variability of the data - the steeper the curve, the less variable the data. Had the performance been perfectly consistent, i.e. had all the values been exactly the same (same performance each second), the CDF would be a step function, going from 0 to 1 in a single point.
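One way to put a number on "steepness" is the horizontal distance between two percentiles - the narrower the range, the steeper the curve. The sketch below is an assumption of mine rather than a metric used in the charts, and the 5th/95th percentile bounds are arbitrary defaults.

```python
import math


def percentile(samples, q):
    """Return the smallest observed value whose empirical CDF is >= q."""
    xs = sorted(samples)
    k = max(math.ceil(q * len(xs)) - 1, 0)
    return xs[k]


def spread(samples, lo=0.05, hi=0.95):
    """Width of the tps range between two percentiles.

    A perfectly consistent run (step-function CDF) has spread 0;
    a flatter CDF with long tails has a larger spread.
    """
    return percentile(samples, hi) - percentile(samples, lo)
```

For a perfectly consistent run, e.g. 3600 identical values, spread() returns 0, which corresponds to the step-function CDF described above.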
It's obvious that "cfq" performs better than "deadline" or "noop" not just in terms of average throughput, but also in terms of variability. The "cfq" curves are steeper than for the other schedulers, and also the "tails" are much shorter.
We can use the same method to analyze latencies - I didn't collect the full transaction log, but let's look at minimum and maximum latencies (collected per second). For the next two charts, the x-axis is the latency in microseconds.
For the minimum latencies, "noop" performs best and "cfq" worst, but the difference is quite negligible - less than 10 microseconds for all percentiles. I'd guess it's mostly due to how expensive the scheduling algorithm is, and clearly "noop" is cheapest as it does nothing, while "cfq" does a fair amount of reordering and such.
The maximum latencies are much more interesting - the values mostly depend on the I/O performance, and how well the I/O scheduler reorganizes the requests for the device. While it's often recommended to use "noop" for SSD devices, these results contradict that, as it results in the worst latencies. The same holds for "deadline", although there the difference is not that bad. This mostly matches the difference in average throughput.
The benchmarks were executed on a system used for some of the previous tests, i.e.
The data (tooling and results) are available at bitbucket.
I find it slightly surprising that "cfq" so clearly wins over "deadline" and (especially) "noop." I'd not dare to suggest that "cfq" universally beats the other schedulers - there may be other PostgreSQL workloads where "deadline" or "noop" performs much better, not to mention other types of applications (i.e. not databases). I also wouldn't be surprised if the results were slightly different for other types of storage, e.g. rotational devices or even different types/models of SSDs.
In the case of "noop" I believe this means that the optimizations performed by "cfq" (reordering and coalescing of requests, etc.) still matter even on SSD drives.
For "deadline" I'm not quite sure. The ideas behind this scheduler sound like a great match for PostgreSQL and databases in general - preferring reads over writes should work great, because reads in databases are synchronous (client has to wait for them) while writes are asynchronous (happen mostly in the background during checkpoint). Yet "deadline" does not beat "cfq" - similarly to "noop" it might be due to not performing some of the optimizations done by "cfq."
Actually, while there are a lot of "deadline works great for databases" claims, there are also discussions about bad experiences with the "deadline" scheduler. For example there was this thread on pgsql-hackers in 2010, discussing issues with the "deadline" scheduler (and also workloads that work poorly with the other schedulers). Definitely worth a read, and it also links to various benchmarks, like fsopbench.
Another thing is that there are a lot of tunables for the "deadline" scheduler, so maybe there's some configuration that works much better. I've been experimenting particularly with write_expire, but that hardly affected the performance - results are included in the git repository on bitbucket. In any case, this probably means that simply switching to the "deadline" scheduler may not really give you anything without further tuning.
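For anyone who wants to experiment with these tunables, the scheduler exposes them as files under the device's queue/iosched directory in sysfs. The sketch below reads and sets them; it is not the tooling from the repository. The device name and the sysfs_root parameter are hypothetical, and writing to the real /sys requires root.

```python
from pathlib import Path


def iosched_dir(device, sysfs_root="/sys"):
    # sysfs_root is a hypothetical knob so the code can be pointed at a fake tree.
    return Path(sysfs_root) / "block" / device / "queue" / "iosched"


def read_tunables(device, sysfs_root="/sys"):
    """Return {tunable name: current value} for the device's active scheduler."""
    directory = iosched_dir(device, sysfs_root)
    return {
        p.name: p.read_text().strip()
        for p in sorted(directory.iterdir())
        if p.is_file()
    }


def set_tunable(device, name, value, sysfs_root="/sys"):
    """Write one tunable, e.g. write_expire (needs root on a real system)."""
    path = iosched_dir(device, sysfs_root) / name
    if not path.exists():
        raise KeyError(f"{name!r} is not a tunable of the active scheduler")
    path.write_text(str(value))
```

The set of files depends on which scheduler is active for the device, so it makes sense to call read_tunables() after switching schedulers and before changing anything.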