The two kinds of stats in PostgreSQL


2014/08/12 by Tomas Vondra

Every now and then, someone gets confused about resetting stats in PostgreSQL, and the impact it may have on planner and other parts of the database. Perhaps the documentation might be a tad more clear on this, because while it certainly can be confusing for those who never had to deal with stats in PostgreSQL before. But it's not just about newbies - I wrote a patch in this area for 9.3, and I get occasionally confused too.

The most surprising fact for most users is that 'stats' may actually mean one of two things - statistics describing distribution of the data, and monitoring statistics, tracking some interesting counters about the operation of the database system itself. Each of the kinds has different purposes, is stored differently, and the impact when the data is missing is very different.

So let's see what is the purpose of the two kinds of stats, what are the common issues and what happens when the data get missing for some reason.


Data distribution stats

The first kind od stats tracks distribution of data - number of distinct values, most common values in the column, histogram of data etc. This is the information used when planning the queries - it allows answering questions like:

  • How many rows match the condition? (selectivity estimation)
  • How many rows is produced by the join? (selectivity estimation)
  • How much memory will be needed for the aggregate?

Essentially those are some of the questions asked by the planner/optimizer when deciding what is the best way to execute a query.

This kind of statistics is collected by ANALYZE (or autovacuum) and stored in pretty much "regular" tables, protected by a transaction log just like the regular data. Check pg_statistic system catalog, or rather pg_stats which is a view on top of pg_statistic making the stats easier to read for humans.

A lot of things can go wrong with the stats, mostly resulting in choice of a bad plan and bad query performance. There are various reasons why that may happen (inaccurate stats, complex conditions, correlated columns, ...) - I'm not going to discuss this kind of issues here, because this is not the kind of issues faced by people who consider "resetting statistics." But hey, a good topic for another blog post!

As far as I know, there's no command / function to reset this kind of stats, and there's a good reason for that - it's not really needed. I'm not aware of a problem with that might be solved by removing this kind of stats. You could probably remove the data from the catalog using a simple DELETE (but, I never tried it).

Monitoring stats

The other kind of data is collected by statistics collector. Let me quote the very first paragraph of the documentation:

PostgreSQL's statistics collector is a subsystem that supports collection and reporting of information about server activity. Presently, the collector can count accesses to tables and indexes in both disk-block and individual-row terms. It also tracks the total number of rows in each table, and information about vacuum and analyze actions for each table. It can also count calls to user-defined functions and the total time spent in each one.

So when you need to know how frequently is accessed a particular table, whether it's read sequentially or accessed using an index, etc. this is the kind of stats collected by statistics collector. If you want to view this kind of stats, there's a quite large number of system views:

  • pg_stat_activity
  • pg_stat_archiver
  • pg_stat_bgwriter
  • pg_stat_database
  • pg_stat_all_tables
  • pg_stat_sys_tables
  • ...
  • pg_statio_all_tables
  • pg_statio_sys_tables
  • pg_statio_user_tables
  • pg_statio_all_indexes
  • pg_statio_sys_indexes
  • ...

The naming scheme is obvious (starts with pg_stat_ or pg_statio_) and the names are quite self-explanatory. For example when you need info about tables owned by the current user, you may go either to pg_stat_user_tables or pg_statio_user_tables, depending on whether you want simple stats (number of seq scans, index scans, ...) or IO-related stats (number of blocks read, hit ratios etc.). Similarly for other objects (indexes, functions, ...) and statistics related to replication and other parts of the database system.

The thing is - these are "just views" on top of a pgstat.stat file (or, since 9.3, a collection of per-database files). This is very different from the data distribution stats, because these are simple binary files, not protected by a transaction log and the access methods are very simple - essentially just "read the whole file" and "write the whole file."

And the problem is ...

Also, this is the root of the issues that drive people to reset statistics, to the mailing list etc. ...

The exact logic behind the statistics collector implementation is somewhat complex (multiple processes reporting the data over UDP, single process receiving and writing the data in regular intervals or on request, etc.) but in the end it boils down to this:

  • each database object (table, index, ...) has a fixed-length record in the file
  • the file gets written quite frequently (say, twice a second)

For small clusters (few databases with a small number of objects), this is not a big deal - a cluster with 10 databases, each with 100 tables and 1000 indexes will still have less than 1MB of statistics. Who cares if you write that twice a second? If this amount of IO means trouble for you, you probably have far more serious issues elsewhere.

But with larger clusters, the issues get much worse. With 1000 databases, each with 100 of tables and 1000 indexes, you may have ~50 MB of statistics. Write that twice a second, and you have ~100MB/s of writes. And don't forget you need to read it from time to time.

Now, consider there are far larger clusters, with either more databases or objects (or both). On our machines, we regularly see ~500MB of stats, mostly because the application creates a lot of tables with intermediate results. But there are other clusters with similar issues (think about shared database servers on large hosting sites, for example).

In my experience, this is why most people started to look at stat reset in the first place. One day they get a report about a performance issue on the database server, after a bit of poking they start iotop and notice the postmaster process is writing hundreds of MBs per second. This is also sometimes accompained by messages by messages like this in the PostgreSQL log:

WARNING: pgstat wait timeout

which essentially means that a backend process asked postmaster to write fresh data, but postmaster was unable to write the data fast enough (usually because of overloaded IO, or maybe because the filesystem is full, etc.).

Removing the data? Not really a solution ...

The first idea often is to remove the data by calling pg_stat_reset, but the problem with this approach is that it does not really work.

It may work for a short while, but as the objects are accessed (by users, autovacuum, backups, ...) the file will start growing again. It may take a while (hours, days, ...) but in the end you'll get into the same situation.

Also, I haven't mentioned that autovacuum uses this to store information important for it to work correctly and efficiently - timestamps of the last vacuum/analyze of a table, and counters of modified rows. If you remove this, autovacuum will think nothing was vacuumed/analyzed yet and start from scratch. Which essentially means reading all the data you have in the cluster.

So how to fix this properly?

tmpfs

The one universal solution that works pretty well is a RAM-based filesystem. Linux has tmpfs, most other system have something similar. Do something like this

mkdir /mnt/pg_stats_tmp
mount -t tmpfs -o size=1G none /mnt/pg_stats_tmp

You'll get a tmpfs filesystem that may grow up to 1GB (it's not reserved all the time). And then change stats_temp_directory in postgresql.conf so that it points to the new path:

stats_temp_directory = '/mnt/pgstats_tmp'

After restart, the PostgreSQL will copy the files to the new location (and back when it's stopped).

Be careful when setting the filesystem size - you need at least 2x the size of the file, because the file is not overwritten but a new copy is written and it's renamed. So if your pgstat.stat file has ~200MB, you need at least ~400MB of space. Also, when the file grew to ~200MB there's no reason why it shouldn't grow to 300MB, so add some spare space. The tmpfs size is not dedicated (i.e. if you don't use the RAM, it can be used by other processes). Setting some alert on free space is probably a good idea.

9.3 makes this much less of a problem

The tmpfs solution worked fine for a very long time for us, but it has one problem, which was much more visible after the IO bottleneck was gone. As all the data were stored in a single file, reading just a small part of it meant reading and parsing everything. So for example every time an autovacuum worker starts and requests data for a single database, the whole file needs to be read and parsed.

With large clusters (like ours), the autovacuum workers are running pretty much non-stop, and most of the time find out that particular database does not need any maintenance at all. Writing, reading and processing a 500MB file is not exactly free - what we saw was ~1 CPU doing just this. And we weren't particularly happy about this, even though we had 23 other cores in the system.

So we improved this a bit for 9.3 by splitting the single file into per-database pieces. When the postmaster regularly writes the data, it writes only the few 'dirty' pieces, and also when a process requests stats for a particular database, only that piece gets written. Which means much less IO activity and CPU for parsing. It also has the nice bonus that we don't need 2x the disk space, because only copies of the per-database pieces need to be kept while writing (not the whole file).

With these improvements, I/O generated by the stats is no longer a problem for us - we could move it back from the tmpfs but we've decided not to.

Of course, this does only work if you hava multiple databases with many objects. If most objects belong to a single database, or if you have a single database with objects separated in schemas, this does not work. And AFAIK there's no straightforward way to make it work, sadly :-(

Summary

Hmmmmm, I started with a brief "two kinds of stats in PostgreSQL" blog post, but apparently I ended with a post on what are the issues with monitoring stats and how to solve them. Anyway, to summarize it a bit - there are two kind of stats:

Data distribution stats

  • statistics describing distribution of the data, collected by ANALYZE / autovacuum
  • used by the planner/optimizer when planning queries
  • stored in the database, pretty much regular tables protected by WAL
  • no official way to reset this kind of statistics
  • the problems usually result in choosing poor query plans (more on this later)

Monitoring stats

  • statistics tracking the operation of the database system itself
  • used for monitoring purposes and by autovacuum (to identify objects that need maintenance)
  • stored outside database, in a binary file (pgstat.stat) or a collection of per-database files (since 9.3)
  • this is what gets reset by pg_stat_reset()
  • the most common problem is high I/O load when the pgstat.stat file gets large (due to tracking many database objects)
  • resetting the stats is not a solution - it won't make the problem go away, and it negatively impacts autovacuum (possibly causing I/O load far worse than it solved)
  • pre-9.3 solution: move the pgstat.stat file to a tmpfs filesystem, consider upgrading to 9.3
  • 9.3 and beyond: you shouldn't really have problems with this (unless you're using a single database with schemas)




comments powered by Disqus