Robert Haas <robertmhaas@gmail.com> writes: > ... how > important is stability to ANALYZE? If you *either* retake your MVCC > snapshots periodically as you re-scan the table *or* use a non-MVCC > snapshot for the scan, you can get those same kinds of artifacts: you > might see two copies of a just-updated row, or none. Maybe this would > actually *break* something - e.g. could there be code that would get > confused if we sample multiple rows for the same value in a column > that has a UNIQUE index? But I think mostly the consequences would be > that you might get somewhat different results from the statistics.
Yeah, that's an excellent point. I can imagine somebody complaining "this query clearly matches a unique index, why is the planner estimating multiple rows out?". But most of the time it wouldn't matter much. (And I think you can get cases like that anyway today.)
> It's not clear to me that it would even be correct to categorize those > somewhat-different results as "less accurate."
Estimating two rows where the correct answer is one row is clearly "less accurate". But I suspect you'd have to be quite unlucky to get such a result in practice from Simon's proposal, as long as we weren't super-aggressive about changing ANALYZE's snapshot a lot.
Seems like we're agreed we can use more than one snapshot, the only discussion is "how many?"
The more you take the more weirdness you will see, so adopting an approach of one-snapshot-per-row seems like the worst case for accuracy, even if it does make analyze faster.
(If we do want to speed up ANALYZE, we should use the system block sampling approach, but the argument against that is less independence of rows.)
Keeping the discussion on reducing the impact of bernoulli sampled analyze, I was imagining we would do one snapshot for each block of rows with default statistics_target, so that default behavior would be unaffected. Larger settings would be chunked according to the default, so stats_target=10k(max) would take a 10000/100 = 100 snapshots. That approach allows people to vary that using an existing parameter if needed.