Thinking about ANALYZE stats and autovacuum and large non-uniform tables - Mailing list pgsql-hackers

From Greg Stark
Subject Thinking about ANALYZE stats and autovacuum and large non-uniform tables
Msg-id CAM-w4HM4kJGqykoLp8TqeGRthLbyj207yQrQLoBn-0hU8f97Mw@mail.gmail.com
Responses Re: Thinking about ANALYZE stats and autovacuum and large non-uniform tables  (Thomas Munro <thomas.munro@gmail.com>)
Re: Thinking about ANALYZE stats and autovacuum and large non-uniform tables  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
One problem I've seen in multiple databases is when a table has a
mixture of data sets within it, e.g. a queue table where 99% of the
entries are "done" but most queries are working with the 1% that are
"new" or in other states. Often the statistics are skewed by the
"done" entries and give bad estimates for query planning when the
query is actually looking at the other rows.
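
Concretely (the table and column names here are hypothetical), the skew shows up directly in pg_stats and in the planner's estimates:

```sql
-- Hypothetical queue table where ~99% of rows have status = 'done'.
-- After ANALYZE, 'done' dominates the most-common-values list:
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'task_queue' AND attname = 'status';

-- The planner therefore estimates very low selectivity for exactly
-- the rows the application actually works with:
EXPLAIN SELECT * FROM task_queue WHERE status = 'pending';
```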

We've always talked about this as a "skewed distribution" or
"intercolumn correlation" problem. And we've developed some tools for
dealing with those issues. But I've been thinking that's not the only
problem with these cases.

The problem I'm finding is that the distribution of these small
subsets can swing quickly, and understanding intercolumn correlations,
even if we could do it perfectly, would be no help at all.

Consider a table with millions of rows that are "done" but none that
are "pending". Inserting just a few hundred or thousand new pending
rows makes any estimates based on the existing statistics entirely
incorrect. Even if we had perfect statistics capable of making perfect
estimates they would be entirely wrong once a few inserts of pending
rows are done.
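
A sketch of that scenario (hypothetical "jobs" table; the thresholds cited are the stock autovacuum defaults):

```sql
-- Millions of finished rows, statistics freshly gathered:
CREATE TABLE jobs (id bigserial, status text NOT NULL);
INSERT INTO jobs (status) SELECT 'done' FROM generate_series(1, 10000000);
ANALYZE jobs;

-- A burst of new work arrives:
INSERT INTO jobs (status) SELECT 'pending' FROM generate_series(1, 5000);

-- The planner still believes 'pending' is vanishingly rare: 5000
-- modified rows are far below the default analyze threshold
-- (autovacuum_analyze_scale_factor = 0.1 of the table, plus 50 rows),
-- so no automatic re-ANALYZE fires and the estimate stays near zero:
EXPLAIN SELECT * FROM jobs WHERE status = 'pending';
```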

Worse, even n_dead_tup, n_mod_since_analyze, etc. are affected by
this. It's easy (at least on older versions; maybe Peter's work has
fixed this for btree) to get severe index bloat because vacuum doesn't
run for a long time relative to the size of the busy portion of the
table.

I imagine that to really tackle this we should be doing something like
noticing when inserts, updates, and deletes affect key values that are
"rare" according to the statistics, and triggering autovacuum ANALYZE
runs that use indexes to update the statistics only for the relevant
key ranges.

Obviously this could get complex quickly. Perhaps it should be
something users could declare: some kind of "partitioned statistics",
where you declare a WHERE clause and we generate statistics for the
table where that clause is true. Then we could fairly easily also
count things like n_mod_since_analyze for that WHERE clause too.
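
Purely as an illustration of the idea: CREATE STATISTICS exists today, but the WHERE clause below is the hypothetical extension, not anything PostgreSQL currently accepts.

```sql
-- Hypothetical syntax: gather extended statistics only over the
-- "interesting" subset of the table, and track modification counters
-- (n_mod_since_analyze etc.) per statistics object so a targeted
-- re-ANALYZE can fire when that subset churns:
CREATE STATISTICS jobs_pending_stats
    ON status, created_at FROM jobs
    WHERE status <> 'done';   -- <- does not exist today
```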

And yes, partitioning the table could be a solution to this in some
cases. I think there are reasons why it isn't always going to work for
these issues though, not least that users will likely have other ways
they want to partition the data already.
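
For completeness, the closest thing available today is list-partitioning on the status column (hypothetical names again), since each partition gets its own statistics and its own autovacuum bookkeeping:

```sql
CREATE TABLE work_items (
    id     bigserial,
    status text NOT NULL
) PARTITION BY LIST (status);

CREATE TABLE work_items_done    PARTITION OF work_items FOR VALUES IN ('done');
CREATE TABLE work_items_pending PARTITION OF work_items FOR VALUES IN ('pending');

-- A few thousand inserts into work_items_pending now easily cross
-- that partition's analyze threshold, even though they are noise
-- relative to work_items_done.
```

But as noted above, this burns the table's one partitioning scheme on a bookkeeping concern, which users may well need for something else.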


-- 
greg


