PostgreSQL on SSD - 4kB or 8kB pages?


2015/11/16 by Tomas Vondra

As you may be aware, PostgreSQL splits data files into 8kB pages by default. It's possible to use different page sizes (up to 32kB), but that requires compiling custom packages so almost no one does that. It's quite rare to come across an installation using different page size at a customer, for example.

Why 8kB data pages and not a different size? Clearly, smaller or larger pages might be more efficient in some cases, and indeed some other databases use different page sizes. For example InnoDB uses 16KB pages by default, and some databases even allow a mix of page sizes. So there seems to be a gain in using different page sizes, otherwise engineers working on those products would not waste time implementing it.

So what factors determine the optimal page size? And how do SSD disks change the results?


One of the fundamental rules about page size, called Five Minute Rule, was formulated by Jim Gray and G. F. Putzolu in 1985, and approaches the problem as an economic question "Given disk and RAM prices, how long should a page of given size be kept in memory?" The equation expressing the rule is essentially this

break_even_interval = (pages_per_MB * price_per_MB_RAM) * (IOPS_per_disk * price_per_disk)

and for values common in 1985 (1kB pages and disk/RAM prices in 1985), the break even interval came out as roughly 5 minutes (hence the name of the rule).

However since 1985 a lot has changed - the prices of RAM and disks plummeted, performance of rotational storage slightly improved and so on. Despite that, the 1997 revision of the paper came to about the same conclusion - the break even interval is still about 5 minutes, however assuming 8kB page size (not 1kB as used in the 1985 paper).

This is probably also the reason why PostgreSQL uses 8kB pages - Postgres95 (as it was called at that time) was released shortly before the paper was published, but it had to follow the same reasoning.

But again, a lot has changed since 1997 - the RAM/disk prices continued to plummet and, probably more importantly, flash storage is becoming the de facto storage standard in the last few years. And once again, there were updates to the paper in 2008 in ACM Queue, and in 2010 in The Register.

The general conclusion of both those updates is that the "break even" interval for 8kB pages significantly increases (up to tens of minutes or hours) and that to get back to the ~5 minute interval, the page size needs to significantly increase (to 64kB or even more).

So why PostgreSQL still uses 8kB pages?

The original paper, formulated in 1985, was aiming to answer a particular question:

Given a page size and costs for memory and disks, when does it become economical to buy memory instead of adding disks?

But it's interesting to realize that we may choose which of the variables are fixed and solve the equation for a different one, and that those variable are not entirely independent. And of course, the equation is just a simplified model of the problem as stated in 1985, thus ignoring costs that were negligible on rotational storage but may got significant on flash.

What does it mean that the variables are not entirely independent? Firstly, it's naive to assume that disks can do the same number of IOPS for all page sizes - for rotational devices the difference is not that significant as the most expensive part are the seeks. Once you move the heads to the proper location, it does not make much difference what amount of data you actually read (up to some point, of course).

For example this review of an enterprise 10k enterprise drive from Seagate can do 400 IOPS both for 4kB and 8kB random reads.

But on flash storage this is no longer true - it's quite common to see significant performance differences with varying request sizes. For example for Intel S3700 drives, the data sheet says that with 4kB pages the drive can do up to 75000 IOPS, but with 8kB it's just 47500 IOPS (i.e. a 40% drop in performance).

The other difference that comes with flash storage is that the price for reads and writes is not the same. For example the Intel S3700 can do 75000 IOPS for 4kB reads, but only 36000 IOPS for writes (i.e. less than 50%). On rotational devices, the difference is much less significant.

Another thing not reflected in the equation is the impact of page size on cache. The page size directly determines the number of "slots" in a cache - given 1GB of cache you may split it into 262.144 slots (4kB) or 131.072 slots (8kB), which of course impacts adaptivity of the cache. For example assume you frequenly access only 200.000 entries, but that those entries are scattered through much larger data set, i.e. each of those interesting entries is on a different page. With 4kB pages it's possible to keep all those pages in cache at once, with 8kB pages that's not possible.

So while the recent reviews of the "5 minute rule" suggest increasing the page size, these costs present a natural pressure against such increase, both on rotational and (especially) flash storage.

PostgreSQL with 4kB pages on SSD

To illustrate the effect of using s smaller page size on SSD, I've done a bunch of pgbench tests with 4kB and 8kB page size, and I'll present a simple comparison here. The test was done on a rather small system (i5-2500k, 8GB of RAM, S3700 SSD) on ~75GB data set (scale=5000). The database was reasonably tuned, most importantly

  • shared_buffers = 1GB
  • min_wal_size = 1GB
  • max_wal_size = 8GB

The results of a long read-write pgbench run (4 hours, 16 clients) look like this:

page-size-tps.png

So using a smaller data page resulted in ~30% increase in performance, which is a significant improvement. It gets even more interesting when we look at the amount of data written to the device during the pgbench run (4 hours), as tracked by Host_Writes_32MiB - one of the SMART counters.

page-sizes-data-written.png

That is, with 8kB pages the database wrote about 962 GB of data, while with 4kB pages the amount of data is just 805 GB. But this comparison is not entirely fair, because the 4kB database also did more work, as it operated on higher transaction rate - let's compensate for this difference, by reducing the 4kB results accordingly (as if it operated on the same transaction rate).

page-sizes-data-written-compensated.png

So given the same transaction rate the difference is about 25%, which is quite significant - it may have impact on service life of the SSD device, which is closely related to the amount of writes.

Summary

The "Five minute rule" is useful and embodies a very interesting view approach to building database systems, based on simple economic reasoning. While I believe it's fundamentally right, it's just a simplified model and it's crucial to realize which aspects it neglects - in this post I've mentioned several important aspects of flash storage.

Does this mean that the smaller the page the better? Certainly not - the 4kB page was not chosen incidentally, but because that's the size of pages used by the S37000 SSD internally. In this case it eliminates write amplification, i.e. needlessly "dirtying" both 4kB half-pages when writing a single 8kB page. But lowering the page size would not help, at least not on this particular SSD. Not all SSD drives use 4kB pages, though - some SSD use smaller or larger pages (e.g. Samsung is using 8kB in some drives).





comments powered by Disqus