In case you haven't noticed, the schedule for pgconf.eu 2015 was published a few days ago. One of my talks is called "PostgreSQL Performance on EXT4, XFS, F2FS, BTRFS and ZFS" and aims to compare PostgreSQL performance on modern Linux file systems (and also the impact of various tuning options like write barriers, discard etc.). It's not an entirely new talk - it's a reworked (and significantly improved) version of a talk I gave at the 5432 conference in May.
One of the rather surprising results was the EXT4 vs XFS comparison - even though XFS is usually presented and perceived as the faster option (and EXT4 as the more "conservative" choice), the results were quite clearly in favor of EXT4. The difference was ~10%, so not a negligible difference. Let me briefly discuss this and also show some of the updated results that I'll present in Vienna. And come to pgconf.eu and see the whole updated talk!
The talk in Milan presented a bunch of slides illustrating the "EXT4 is faster than XFS" conclusion, but let me show just two of them. The first one compares pgbench read-write results on a large data set (2x RAM) with 16 clients (the machine only has 4 cores) on a range of file systems:
Let's ignore the other file systems and only look at EXT4 and XFS results. The best EXT4 result (with TRIM/DISCARD enabled, write barriers disabled and aligned to SSD blocks) is ~4700 tps, while the best XFS result (TRIM/DISCARD enabled, write barriers disabled) was ~4100 tps, so more than 10% slower.
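To make the size of that gap concrete, here is a minimal sketch of the arithmetic. It uses only the two rounded chart readings quoted above, so the result is approximate; the function name `relative_gap` is just an illustrative helper.

```python
def relative_gap(best_tps, other_tps):
    """Return how much slower `other_tps` is, as a fraction of `best_tps`."""
    return (best_tps - other_tps) / best_tps


# Rounded values read off the chart (approximate, not exact measurements).
ext4_tps = 4700
xfs_tps = 4100

gap = relative_gap(ext4_tps, xfs_tps)
print(f"XFS is about {gap:.1%} slower than EXT4 on this run")
```

With these rounded inputs the gap comes out at roughly 13%, which is in the same ballpark as the ~10% figure mentioned earlier.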
Now, that's just the total performance of the whole 30-minute run, so let's see the second slide, a chart showing transactions per second for the first 180 seconds:
We see that the general behavior is about the same - the dips are a typical consequence of a checkpoint, and otherwise the green line (EXT4) is consistently above the red one (XFS). So that suggests EXT4 really performs better than XFS, at least in pgbench.
I received plenty of feedback after I gave this talk in Milan, and one of the comments was actually a question - why do the checkpoints happen that frequently? The previous chart only shows the first 180 seconds, yet there are about 10 checkpoints, so a checkpoint happens every ~20 seconds. This is because I hadn't increased checkpoint_segments (more about this later), so every time the 3 segments filled up, a checkpoint was necessary. Surely that's not really a good representation of production databases! In production we usually shoot for much less frequent checkpoints, ideally triggered by timeout - increasing checkpoint_segments to 256 or 1024 is not really uncommon these days.
Using the default number of checkpoint segments (3) was a conscious choice, though. The reasoning was that this further increases the load on the file system, and stressing the file system is the ultimate goal of the benchmark. But after the discussions, I started to suspect that maybe this really was not entirely correct - maybe it stresses the file systems in a slightly different way than a well-tuned database would? It also clearly interacts with the kernel internals: for example, vm.dirty_expire_centisecs is set to 30 seconds by default, so it can't kick in for checkpoints that happen more frequently.
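To check this on your own machine, here is a small sketch that reads the setting and compares it with a checkpoint interval. The kernel exposes the value (in hundredths of a second) at /proc/sys/vm/dirty_expire_centisecs; the helper names and the simple "interval must exceed the expiry time" rule are my own simplification of the reasoning above, not a precise model of kernel writeback.

```python
from pathlib import Path

# Linux exposes the setting here, in centiseconds (default 3000 = 30 s).
DIRTY_EXPIRE_PATH = Path("/proc/sys/vm/dirty_expire_centisecs")


def centisecs_to_seconds(centisecs):
    """Convert a value in hundredths of a second to seconds."""
    return centisecs / 100.0


def read_dirty_expire_seconds(path=DIRTY_EXPIRE_PATH):
    """Return the current expiry time in seconds, or None if unavailable."""
    try:
        return centisecs_to_seconds(int(path.read_text().strip()))
    except (OSError, ValueError):
        return None  # not Linux, or the file is unreadable


def can_expire_between_checkpoints(expire_s, checkpoint_interval_s):
    """Rough rule of thumb: time-based expiry only matters if checkpoints
    are spaced further apart than the expiry time."""
    return checkpoint_interval_s > expire_s


expire = read_dirty_expire_seconds()
if expire is not None:
    for interval in (20, 240):  # seconds between checkpoints
        print(interval, can_expire_between_checkpoints(expire, interval))
```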
So this is one of the things that I decided to change in the new round of benchmarks, increasing checkpoint_segments to 512 (8GB). Let's see if that changed the results, perhaps even making XFS faster than EXT4.
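The 8GB figure follows from the WAL segment size. A quick sketch of the arithmetic, assuming the default 16MB segment size (a build-time option, so it may differ on a custom build):

```python
# Default WAL segment size; assumed here, as it is a compile-time option.
WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def wal_between_checkpoints(checkpoint_segments, segment_bytes=WAL_SEGMENT_BYTES):
    """WAL volume (in bytes) that triggers a segment-based checkpoint."""
    return checkpoint_segments * segment_bytes


for segments in (3, 512):
    mib = wal_between_checkpoints(segments) / (1024 * 1024)
    print(f"checkpoint_segments={segments}: {mib:.0f} MiB of WAL")
```

So the default of 3 segments forces a checkpoint after only 48MB of WAL, while 512 segments allow 8GB.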
First, let's see EXT4 results, with various optimization techniques:
You can't really see the first three data series (with different alignment), because they perfectly overlap - either the alignment does not really matter, or perhaps I did not manage to misalign the partition badly enough. Disabling write barriers, however, makes a huge difference - the performance improved by about 25% just by adding this mount option (and it's safe here, because this SSD has a capacitor).
Interestingly, enabling DISCARD/TRIM did not really change the results much - it improved the performance by a percent or two, much less than disabling write barriers. I wasn't really expecting a huge difference though, because the benchmark utilizes just a small part (say, ~20GB) of the 100GB SSD, so there's plenty of free space for the internal garbage collection algorithms. It'd probably be more important when the SSD gets full. In any case, this does not confirm the fears that enabling TRIM actually hurts performance (at least with this particular SSD drive).
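If you want to verify which of these options a file system is actually mounted with, a minimal sketch is to parse /proc/mounts. The sample line, device name and mount point below are hypothetical, and the exact spelling the kernel reports may differ by file system and version (for example barrier=0 instead of nobarrier), so treat the option names as assumptions.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of mount options for `mount_point`, or None if absent.

    Expects text in /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            return set(fields[3].split(","))
    return None


# Hypothetical sample; on a real system read the text from /proc/mounts.
SAMPLE = "/dev/sda1 /mnt/pgdata ext4 rw,noatime,nobarrier,discard 0 0\n"

opts = mount_options(SAMPLE, "/mnt/pgdata")
print("write barriers disabled:", "nobarrier" in opts)
print("online discard enabled:", "discard" in opts)
```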
Now let's see XFS:
Pretty much the same result, including the non-importance of alignment, significant improvement after disabling write barriers and minor improvements after enabling TRIM.
Let's compare the best results in a single chart:
So, EXT4 is still a bit faster than XFS, but the difference is much smaller than before.
Now, let's see the transaction logs for the whole 30-minute runs (not just 180 seconds as before):
Again, the overall pattern is about the same - regular dips because of checkpoints (now spaced ~240 seconds apart), but clearly the EXT4 performance is much more variable and less predictable. If you're designing services that need to provide SLA guarantees, you'll probably happily trade a few percent of performance for significantly lower variability.
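One way to put a number on "variable" is to compute per-second throughput from the pgbench per-transaction log (the -l option) and compare the coefficient of variation between runs. This is only a sketch, not the script used for the charts: it assumes the log layout "client_id transaction_no time file_no time_epoch time_us", and the sample lines are made up.

```python
from collections import Counter
from statistics import mean, pstdev


def tps_per_second(log_lines):
    """Count completed transactions for each wall-clock second of the run."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        counts[int(fields[4])] += 1  # fields[4] is time_epoch (Unix seconds)
    if not counts:
        return []
    first, last = min(counts), max(counts)
    # Seconds with no completed transactions count as 0 tps.
    return [counts.get(second, 0) for second in range(first, last + 1)]


def variability(samples):
    """Return (mean tps, coefficient of variation) for per-second samples."""
    avg = mean(samples)
    return avg, pstdev(samples) / avg


# Made-up sample log: four transactions spread over four seconds.
sample_log = [
    "0 1 500 0 100 10",
    "0 2 500 0 100 20",
    "1 1 500 0 101 5",
    "0 3 500 0 103 1",
]
series = tps_per_second(sample_log)
avg, cv = variability(series)
print(series, avg, round(cv, 3))
```

A lower coefficient of variation at a similar mean means a steadier, more predictable throughput, which is what matters for SLA-style guarantees.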
One last thing I'd like to point out - all this only really applies to large data sets, i.e. data sets where the active set (i.e. the subset you're actually accessing) exceeds the available RAM.
For example, for a "medium" data set (i.e. roughly 50% of RAM), the EXT4 vs. XFS comparison looks like this:
so pretty much no difference at all.
Does this prove that EXT4 is better than XFS? Absolutely not. The results presented here are really specific to this particular benchmark and hardware, and I wouldn't really dare to extrapolate them to other types of storage (say, large RAID arrays built from spinning rust), to other types of workload, or perhaps even to other kernel versions (this benchmark was done on kernel 4.0.4).
It is clear, however, that while EXT4 is a bit faster, the difference is not huge (it surely is not the case that EXT4 somehow beats the crap out of XFS). In any case, if you're choosing between EXT4 and XFS, it does not really matter which one you choose (performance-wise).
Also, file systems are not just about performance, but also about reliability (because the data is often the most valuable thing we have) and other useful features. This is why there's ZFS or BTRFS, for example.