Some time ago I explained that there really are two kinds of statistics in PostgreSQL, what the common issues with statistics tracking database activity are, and how (not) to fix them.
That however does not mean there are no issues with data distribution (planner) statistics, and as that's one of my areas of interest, in this post I'll discuss the usual issues with data distribution statistics, which users usually observe as slow (or even failing) queries. I'll also mention the possible remedies (if any).
As mentioned in the initial post, data distribution stats describe the data stored in the database, so that the planner can use them to choose the best plan. For each column there's a bunch of info about the data distribution:

- fraction of NULL values
- number of distinct values (ndistinct)
- list of the most common values (MCV) and their frequencies
- histogram of the values
- correlation with the physical order of rows

and various additional info. All this is subsequently used when planning and optimizing queries - estimating selectivity of
WHERE conditions, cardinality of joins, amount of memory needed to store some auxiliary data structures (e.g. hash tables for a hash join and hash aggregate), etc.
The official documentation explains the basics quite well - the overview of statistics used by the planner and the range of row estimation examples are definitely worth reading. And of course, understanding EXPLAIN is an absolute necessity for those who investigate slow queries.
But let's do at least a very (very very) quick crash course here, because we can't really talk about the failures otherwise.
So what happens when the database is planning a query? How does it decide whether to use an index or not, whether to aggregate using a sorted or hashed approach, etc.? In the past, databases were often rule-based, i.e. those planning decisions were in a sense hard-coded, either into the database itself, or into the query. But nowadays most databases use cost-based optimization, i.e. they try to associate each plan with a "cost" of the execution (factoring in expected I/O, CPU, memory requirements and such), and then choose the cheapest one, because it's also assumed to be the fastest one.
To do this, the database has to make reasonably accurate estimates of various things - sizes of source relations and joins, selectivity of the conditions, sizes of aggregated relations (
GROUP BY results) etc. This is where the data distribution / planner stats are absolutely vital.
For example when you run a query like this:
SELECT * FROM events WHERE event_date BETWEEN '2014-01-10' AND '2014-01-11';
the optimizer peeks into the pg_class catalog (a special kind of table) and reads reltuples and relpages for the events table, so that it knows how many rows are there (aka cardinality) and how large the table is on disk (how many 8kB pages it has).
Then it peeks into pg_statistic catalog, reads the histogram and list of most common values for the
event_date column, and estimates what portion of rows matches the
WHERE condition. This is known as a selectivity of the condition, and by multiplying it with
reltuples you get the expected size of the query result.
And then it may use this information to decide whether to use index to lookup these rows, for example.
So, what statistics are available, actually? For each column, there may be these statistics:

- null_frac - fraction of NULL values
- avg_width - average width of the values (in bytes)
- n_distinct - number (or ratio) of distinct values
- most_common_vals / most_common_freqs - list of the most common values and their frequencies (the MCV list)
- histogram_bounds - equi-depth histogram of the values not covered by the MCV list
- correlation - correlation between the values and the physical order of rows
Not all statistics may be available - for mostly uniform distributions there may be no most common values, for columns with just a few values there may not be histograms, etc. I've also omitted a few columns related to array element statistics, introduced a few versions back.
BTW if you want to look at these stats, don't use
pg_statistic directly, because the data there are stored in a format suitable for the planner (not for humans). Use pg_stats instead, which is a view on top of
pg_statistic making the stats comprehensible.
To compute the selectivity of a condition, the planner fetches this info for the
event_date column referenced in the query, looks at the MCV list and histogram, and uses it to compute the selectivity of the condition - let's say it's ~5% of the rows. It then takes
reltuples (an estimate of the current number of rows in the table) from pg_class, and multiplies it with the selectivity to get the expected number of rows returned by the query. If the table has 1.000.000 rows, the query will probably return about 50.000 rows.
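The arithmetic can be sketched in a few lines. This is not the planner's actual code, just an illustration of estimating a range condition from an equi-depth histogram (each bucket holds the same fraction of rows) and multiplying by reltuples - the bucket boundaries and row count below are made up:

```python
import bisect

def range_selectivity(histogram_bounds, value):
    """Fraction of rows with column < value, assuming an equi-depth
    histogram (each of the len(bounds)-1 buckets holds the same number
    of rows) and linear interpolation within a bucket - roughly the
    idea behind the planner's scalar range estimates."""
    nbuckets = len(histogram_bounds) - 1
    if value <= histogram_bounds[0]:
        return 0.0
    if value >= histogram_bounds[-1]:
        return 1.0
    # find the bucket containing the value
    i = bisect.bisect_right(histogram_bounds, value) - 1
    lo, hi = histogram_bounds[i], histogram_bounds[i + 1]
    frac = (value - lo) / (hi - lo)   # position inside the bucket
    return (i + frac) / nbuckets

# 10 equi-depth buckets over values 0..100 (hypothetical stats)
bounds = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
sel = range_selectivity(bounds, 25)   # 25 is halfway through bucket 2
print(sel * 1_000_000)                # -> 250000.0 rows out of 1M
```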
It then does similar evaluation for other parts of the query (other conditions, joins, aggregations, ...) and uses this info to estimate the "execution cost" of the entire plan. Of course, doing this for a single plan would be pointless - the point is that the planner generates multiple possible plans (often very many), computes the expected cost for each of them and then chooses the "cheapest one" because it's expected to be the fastest.
For example there might be an index on the
event_date column, so the planner needs to decide whether to perform a simple Sequential Scan, an Index Scan, or a Bitmap Index Scan. For small fractions of the table an Index Scan is the best option, for large portions it's the Sequential Scan, and the Bitmap Scan is somewhere in between. So the planner generates three possible plans, assigns each a cost (based on how many I/O and CPU resources they'll need), and then chooses the cheapest one.
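As a toy illustration of that choice, here's a tiny cost model in Python. The cost constants are entirely made up (the real ones live in costsize.c and are far more elaborate) - only the shape of the comparison matters:

```python
# A toy cost model illustrating cost-based plan choice. The constants
# are invented for illustration; they are not PostgreSQL's.
def seq_scan_cost(pages, rows):
    # read every page sequentially, apply the filter to every row
    return pages * 1.0 + rows * 0.01

def index_scan_cost(matching_rows):
    # roughly one random page fetch per matching row
    return matching_rows * 2.0

def bitmap_scan_cost(pages, matching_rows):
    # pay a startup cost to build the bitmap, then read pages in order
    return 100.0 + matching_rows * 0.5 + min(pages, matching_rows) * 1.0

def choose_plan(pages, rows, selectivity):
    matching = rows * selectivity
    costs = {
        "Seq Scan": seq_scan_cost(pages, rows),
        "Index Scan": index_scan_cost(matching),
        "Bitmap Heap Scan": bitmap_scan_cost(pages, matching),
    }
    return min(costs, key=costs.get)   # cheapest plan wins

# 10,000 pages, 1M rows
print(choose_plan(10_000, 1_000_000, 0.0001))  # tiny fraction -> Index Scan
print(choose_plan(10_000, 1_000_000, 0.01))    # medium -> Bitmap Heap Scan
print(choose_plan(10_000, 1_000_000, 0.9))     # most of the table -> Seq Scan
```

Note how the winning plan flips purely because the estimated selectivity changes - which is exactly why a bad selectivity estimate can flip the choice the wrong way.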
This process is usually called cost-based optimization and it's one of the crucial pieces that turns modern database systems into declarative programming environments. The database receives a declarative specification of the result (aka SQL query) and chooses the execution path it believes to be the most efficient one. What could possibly go wrong with that, right?
Well, sometimes things go wrong. The main issue is that data distribution statistics are just simplified summaries of data, and thus inaccurate (not describing all the tiny details) by nature. You can see this as a lossy compression, where by removing some details you get a more compact representation of the data, but this inaccuracy may propagate to the estimates and then into the estimated cost of a plan. And if the estimates for multiple plans get sufficiently wrong, an inefficient one may be chosen in the end, resulting in much longer query execution.
But how could that happen? I'll try to explain that in a minute, let's look at the cost first.
At this point, you're probably screaming "But the costs are just estimates, so they're going to be wrong all the time! How the hell can this even work?" And to some degree you're right - the costs are inaccurate all the time, but it works fine most of the time thanks to a set of "cost stability principles" that makes this much more reliable than you might expect (although sometimes I'm pretty sure it works mostly thanks to gnomes and pixies trapped in the CPU).
The cost estimates are generally expected to have these properties (I'm not aware of established terms for those properties. If there are, please let me know in the discussion or by e-mail.):

- correlation - cheaper plans are expected to be faster than more expensive ones
- estimation stability - a small error in the estimates should only cause a small difference in the cost
- cost stability - plans with similar costs should have similar durations
This does not say the relation between the cost and actual runtime is linear, or that you can reliably estimate the duration from the cost. Cost 1 usually means a few milliseconds, and cost 1000000000 may easily result in queries running for hours or days, but the relation is very complex and non-linear.
It also does not say it makes sense to compare cost between different queries, it only says you can compare costs for plans for a given query. Sure, to some extent it is possible, but there's not much use for that anyway (because the costs are a tool to choose plans for a particular query, not plans across queries).
It however does say that you don't need to worry about accuracy of cost estimates too much. The goal is to choose an efficient execution plan - the most efficient one in the ideal case. But if there are two plans with similar costs, a small estimation error may result in choosing the second plan. But as the cost difference is also small (estimation stability), the difference in actual duration of the two plans should be small too (cost stability).
In practice, the most efficient plan is usually way cheaper than the other plans, so this works fine unless there are significant estimation errors. And "significant" is usually interpreted as "at least an order of magnitude wrong" - e.g. estimating that a condition matches 1000 rows while in reality it matches 100.000 (i.e. 100x more). A difference this large may easily result in a poor plan choice further down the road - choosing a sequential scan when an index scan would perform much better, etc.
But how could an estimate get this wrong? A number of reasons, actually ...
Sometimes, the distribution is so complex that the default level of detail (number of entries in the MCV list, number of buckets in the histogram) is not enough, which results in inaccurate selectivity estimates. How detailed the statistics are is determined by
default_statistics_target, which specifies how many items may be tracked in an MCV list and how many buckets a histogram may have. Since 8.4 the default value is 100, so MCV lists may have up to 100 items and histograms may have up to 100 buckets.
So, let's see an example where the MCV list size is insufficient. First, let's construct a table with 1000 frequent values, and many (999000) values that are unique.
CREATE TABLE t (v INT);

INSERT INTO t SELECT mod(i,1000)+1 FROM generate_series(1,1000000) s(i);
INSERT INTO t SELECT i FROM generate_series(1001, 1000000) s(i);

ANALYZE t;
To accurately represent this in a MCV list, we'd need up to 1000 entries, but we only have 100. So let's see some queries - first for the unique values.
EXPLAIN ANALYZE SELECT * FROM t WHERE v = 10000;

                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on t  (... rows=58 ...) (actual ... rows=1 ...)
   Filter: (v = 10000)
   Rows Removed by Filter: 1998999
Well, it's not perfectly accurate, but not bad - the unique values can't get to the MCV list, and have to be estimated using histogram. So some fuzziness is expected. Now, let's see one of the common values:
EXPLAIN ANALYZE SELECT * FROM t WHERE v = 100;

                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on t  (... rows=58 ...) (actual ... rows=1000 ...)
   Filter: (v = 100)
   Rows Removed by Filter: 1998000
Not that great. Apparently this value did not make it into the MCV list, and falls back to the histogram just like the unique values. (The values that make it to the MCV list may differ for each random sample of rows, so you may have to try a few values to get a misestimate.)
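For values not in the MCV list, the planner essentially spreads the frequency left over after the MCV entries evenly across the remaining distinct values (roughly the logic of var_eq_const in selfuncs.c). A rough Python sketch of that arithmetic, with made-up stats loosely resembling the example table:

```python
# A sketch of how "v = const" is estimated when const is NOT in the
# MCV list: the frequency not covered by the MCV entries is spread
# evenly over the remaining distinct values. All numbers are made up.
def eq_selectivity_non_mcv(mcv_freqs, ndistinct):
    remaining_freq = 1.0 - sum(mcv_freqs)            # rows outside the MCV list
    remaining_distinct = ndistinct - len(mcv_freqs)  # values outside the MCV list
    return remaining_freq / remaining_distinct

# say the 100 MCV entries cover ~5% of rows, and there are ~999,100
# distinct values in total (hypothetical stats, not the real ones)
sel = eq_selectivity_non_mcv([0.0005] * 100, 999_100)
print(round(sel * 2_000_000))  # -> 2 rows estimated in a ~2M-row table
```

Presumably this, combined with ANALYZE's own (under-)estimate of ndistinct, is where estimates like the rows=58 above come from.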
Luckily, this is quite easy to fix, because that's exactly what default_statistics_target is for - just crank it up a bit (either globally or for a single column), run ANALYZE and you're done.
SET default_statistics_target = 1000;
ANALYZE t;

EXPLAIN ANALYZE SELECT * FROM t WHERE v = 100;

                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on t  (... rows=1079 ...) (actual ... rows=1000 ...)
   Filter: (v = 100)
   Rows Removed by Filter: 1998000
It's good to do this only for columns that actually need this (those with strange distributions and often used in queries), because it means higher overhead both for
ANALYZE and planning. For example
ALTER TABLE events ALTER COLUMN event_date SET STATISTICS 1000;
increases the statistics target on
event_date column from 100 to 1000, making the MCV lists and histograms 10x more detailed.
A somewhat special case of inaccurate statistics are the ndistinct estimates (number of distinct values in the column). It sounds quite simple but is actually surprisingly difficult to estimate reliably. For some data distributions (correlated columns) it's made worse by our current row sampling implementation, producing imperfect row samples (not quite random), which causes serious issues in the ndistinct estimator (which of course assumes random row samples).
For example let's create a table with 100.000.000 rows, containing 10.000.000 distinct values in the first column (the padding column is there to make the table larger, which influences the sampling):
CREATE TABLE t AS SELECT i/10 AS a, md5(i::text) AS padd FROM generate_series(1,100000000) s(i);
Now, let's analyze the table and see the ndistinct estimate for the first column
ANALYZE t;

SELECT n_distinct FROM pg_stats WHERE tablename = 't' AND attname = 'a';

 n_distinct
------------
     421450
Well, we know there are 10M distinct values, but the estimate is just 421.450, so about 23x under-estimated. Let's see what would happen if the table was even larger, by lowering the statistics target (making the sample smaller relative to the table).
SET default_statistics_target = 10;
ANALYZE t;

SELECT n_distinct FROM pg_stats WHERE tablename = 't' AND attname = 'a';

 n_distinct
------------
      49005
So this time the estimate is about 200x under-estimated, and it's not difficult to come up with even worse examples.
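To see why sampling matters here, consider the Haas-Stokes "Duj1" estimator, which (per the comments in src/backend/commands/analyze.c) is what ANALYZE uses. The following Python sketch is a simplified experiment, not the actual implementation - it shows that the estimator behaves reasonably on a truly random sample, but collapses when the sample is taken block-wise from clustered data (roughly what page-level sampling does):

```python
import random

def duj1_estimate(sample, total_rows):
    # Haas-Stokes "Duj1" estimator: D = n*d / (n - f1 + f1*n/N)
    # n = sample size, d = distinct values in the sample,
    # f1 = values seen exactly once in the sample, N = total rows
    n = len(sample)
    counts = {}
    for v in sample:
        counts[v] = counts.get(v, 0) + 1
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)
    return n * d / (n - f1 + f1 * n / total_rows)

random.seed(1)
N = 1_000_000
rows = [i // 10 for i in range(N)]   # 100,000 distinct values, clustered

# a truly random row sample works fine ...
random_sample = random.sample(rows, 30_000)

# ... but sampling whole "blocks" of adjacent rows sees each clustered
# value either ~10 times or not at all, so f1 collapses - and the
# estimate collapses with it
blocks = random.sample(range(N // 100), 300)
block_sample = [rows[b * 100 + i] for b in blocks for i in range(100)]

print(round(duj1_estimate(random_sample, N)))  # close to the true 100,000
print(round(duj1_estimate(block_sample, N)))   # a massive under-estimate
```

The block-sampled estimate is off by more than an order of magnitude, for exactly the reason mentioned above: the sample is not quite random.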
Cases like this may easily cause OOM errors in HashAggregate if you need to do GROUP BY on the under-estimated column (sadly, hash aggregate is about the only node that still does not respect work_mem). Increasing the statistics_target often improves the ndistinct estimates (we've seen that lowering it makes the estimate worse), but the maximum value is 10000, which may not be sufficient for very large tables. Moreover, there's a better solution - overriding the estimate with a fixed value (which is not quite possible with MCV lists or histograms, because those are complex statistics).
ALTER TABLE events ALTER COLUMN event_date SET (n_distinct = 12345);
sets the number of distinct values in the
event_date column to 12345 (you may remove the override by setting it to 0).
Even if you have accurate statistics for all the columns, it's quite simple to make them useless by using conditions that are somehow incompatible with the statistics. The statistics are applicable only to simple column conditions - once you use the column in an expression (for example
column LIKE '%aaa%') or when you apply a function (like
UPPER(column) = 'ABC' or
date_part('year', column) = 2014) it's pretty much game over.
Such complex conditions make it mostly impossible to use the statistics at all, because the planner does not know how to apply the statistics to the expression (which may change the ordering, for example, so the histograms make no sense), or how to "undo" the expression, which is actually quite a tricky thing (likely impossible in general, especially when it's a function call, and thus entirely opaque to the planner). In those cases the planner just uses some reasonable default selectivities, which may work most of the time, but obviously not always.
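To illustrate what that fallback means for the estimate: the constants below are my reading of DEFAULT_EQ_SEL and DEFAULT_INEQ_SEL in src/include/utils/selfuncs.h, so treat the exact values as an assumption - the point is that they're fixed, regardless of the data.

```python
# When the planner has no usable statistics for an expression, it falls
# back to hard-wired default selectivities (values assumed here, based
# on my reading of selfuncs.h).
DEFAULT_EQ_SEL = 0.005          # for "expression = const" with no stats
DEFAULT_INEQ_SEL = 1.0 / 3.0    # for "expression < const" with no stats

reltuples = 1_000_000
# no matter what the data actually looks like, the estimates are fixed:
print(reltuples * DEFAULT_EQ_SEL)    # -> 5000.0
print(reltuples * DEFAULT_INEQ_SEL)
```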
Sometimes it's possible to fix this manually by inverting the conditions - sometimes it's as trivial as rewriting
column + 1 > 100 to
column > 99. Sometimes it's necessary to apply additional knowledge of what the function does - e.g.
date_part('year', column) = 2014 may be rewritten as
column >= '2014-01-01' AND column < '2015-01-01'.
But sometimes it's not really possible - some conditions simply are complex by nature and can't be rewritten to make them compatible with statistics :-(
Of course, there are various other ways to increase the complexity of queries - joins are a prime example, because join conditions not only compare multiple columns, but those columns come from different tables.
So far we've been talking about estimating a single condition, but what about multiple conditions? Let's say we have two conditions
(column_a = 1) AND (column_b = 2), and we need to estimate them.
By default, most databases assume that all the conditions are independent, which means you can simply multiply the selectivities of the individual conditions to get the selectivity of the whole WHERE clause (and thus the cardinality of the result). This is based on the observation that selectivities are actually probabilities of the event "row matches condition", and that the probability of multiple independent events occurring together is equal to the product of the individual probabilities.
So when you have
WHERE condition_a AND condition_b, and you know that each condition matches 10% of the rows, you can do (0.1 * 0.1) = 0.01, and you know that the whole WHERE clause matches ~1% of the whole table. But this was based on the assumption of independence, so what if the columns are correlated in some way?
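It's easy to check with a quick simulation what the independence assumption does to perfectly correlated columns (plain Python, no database involved - the column values are made up):

```python
import random

random.seed(42)
N = 100_000
# two perfectly correlated columns - b is simply a copy of a
a_values = [random.random() for _ in range(N)]
rows = [(a, a) for a in a_values]

sel_a = sum(1 for a, b in rows if a < 0.1) / N
sel_b = sum(1 for a, b in rows if b < 0.1) / N
actual = sum(1 for a, b in rows if a < 0.1 and b < 0.1) / N

# the independence assumption multiplies the selectivities ...
print(round(sel_a * sel_b, 3))  # ~0.01, i.e. ~1% of rows
# ... but with perfectly redundant conditions, reality is ~10%
print(round(actual, 3))
```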
For example let's assume that the columns contain exactly the same values
CREATE TABLE t AS SELECT i AS a, i AS b, i AS c
                    FROM generate_series(1,1000000) s(i);

ANALYZE t;
and use two conditions that each matches 10% of the table
EXPLAIN ANALYZE SELECT * FROM t WHERE (a < 100000) AND (b < 100000);
The optimizer expects this to match 1% of the table, but in reality this matches 10% because the conditions are perfectly redundant:
                            QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on t  (cost=... rows=9973 ...) (actual ... rows=99999 ...)
   Filter: ((a < 100000) AND (b < 100000))
   Rows Removed by Filter: 900001
so the estimate is 10x lower than it should be. It's not very difficult to make the difference much larger. For example you may add another condition (with the same selectivity), lowering the estimate by a factor of 10:
EXPLAIN ANALYZE SELECT * FROM t
 WHERE (a < 100000) AND (b < 100000) AND (c < 100000);

                            QUERY PLAN
---------------------------------------------------------------------
 Seq Scan on t  (... rows=1038 ...) (actual ... rows=99999 ...)
   Filter: ((a < 100000) AND (b < 100000) AND (c < 100000))
   Rows Removed by Filter: 900001
Alternatively it's possible to use more selective conditions, which also increases the difference (thanks to the multiplication)
EXPLAIN ANALYZE SELECT * FROM t WHERE (a < 10000) AND (b < 10000);

                      QUERY PLAN
------------------------------------------------------
 Seq Scan on t  (... rows=96 ...) (... rows=9999 ...)
   Filter: ((a < 10000) AND (b < 10000))
   Rows Removed by Filter: 990001
So far we've only seen under-estimates, i.e. the estimated number of rows was much lower than the actual value. It's quite simple to construct examples of the opposite:
EXPLAIN ANALYZE SELECT * FROM t WHERE (a < 500000) AND (b > 500000);

                             QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on t  (... rows=249993 ...) (actual ... rows=0 ...)
   Filter: ((a < 500000) AND (b > 500000))
   Rows Removed by Filter: 1000000
Which exploits the fact that both columns really contain the same values, so this particular combination of conditions is "incompatible."
But those are simple artificial examples, constructed as an illustration - what about actual data?
In real-world data sets, the independence assumption is only rarely met perfectly. Sometimes columns may be truly independent, often the correlation is very weak (so that it does not really impact the estimates significantly), but sometimes it's very strong and makes the estimates significantly off - just like in the previous examples.
But even if the columns are strongly correlated, it may not be an issue - it really depends on what types of queries you're executing, i.e. what kind of workload you're dealing with. So let's talk about OLTP and OLAP workloads for a while.
OLTP is the kind of workload that naturally works with small subsets of the data - it accesses individual records using a PK, a few dozen rows using an index, and so on. So this kind of workload already works with rather low estimates, and the under-estimates won't change the plan significantly. If the query was using a Nested Loop before, it's still going to choose a Nested Loop.
OLAP workloads however work with much larger sets of rows, as they perform analytical queries - large aggregations, selections of large subsets of the data, etc. In this case, an under-estimate may easily change the plan - a scan may switch from a Bitmap Index Scan to a plain Index Scan, a join may switch from a Hash Join to a Nested Loop, and so on. If you've ever dealt with such issues, you know how serious this is.
Of course, applications are often a mix of OLTP and OLAP queries. An OLTP application may use a few analytical queries for reporting purposes, or perform batch updates - both of which are rather OLAP-style queries. Similarly, an OLAP application may allow ad-hoc updates of individual records and other OLTP-style queries.
When I was explaining the impact on OLTP and OLAP workloads in the previous section, I was only talking about under-estimates. There's a good reason for that - the consequences of under-estimates are usually much more severe, in my experience.
Of course, if an over-estimate changes the plan (e.g. by choosing a Bitmap Index Scan instead of an Index Scan, or a Hash Join instead of a Nested Loop), it's likely to make the query slower. But the actual cost should be lower than the estimated cost of the chosen plan, because it's not processing as many rows as expected - a Hash Join needs to build the hash table anyway, but then it will perform maybe 100 lookups instead of the estimated 100.000. A Bitmap Index Scan will have to build the bitmaps even though there are just 100 matching rows. And so on. In a sense, the estimated cost of the plan is an upper boundary of the actual cost.
With the under-estimates, it does not really work this way - there's no such upper cost limit, and the actual cost may grow arbitrarily large.
Anyway, estimates on correlated columns are not a new problem (see for example this paper from VLDB 1997), but it's surprisingly difficult to solve well (without the overhead making it inefficient in practice), and only very few databases implement multi-column statistics (or something like that). We currently don't have anything to address this in PostgreSQL, although I'm working on a patch that should make this possible.