On Thu, Aug 6, 2020 at 5:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > It's not clear to me that it would even be correct to categorize those
> > somewhat-different results as "less accurate."
>
> Estimating two rows where the correct answer is one row is clearly
> "less accurate" [ than estimating one row ].
That's a tautology, so I can't argue with it as far as it goes.
Thinking about it more, there are really two ways to think about an
estimated row count.
On the one hand, if you think of the row count estimate as the number
of rows that are going to pop out of a node, then it's always right to
think of a unique index as limiting the number of occurrences of a
given value to 1. But, if you think of the row count estimate as a way
of estimating the amount of work that the node has to do to produce
that output, then it isn't.
If a table has a lot of inserts and deletes, or a lot of updates,
index scans might have to do a lot of extra work chasing down index
pointers to tuples that end up being invisible to our scan. The scan
may not have any filter quals at all, and even if it does, they are
likely cheap to evaluate compared to the cost of finding a locking
buffers and checking visibility, so the dominant cost of the scan is
really based on the total number of rows that are present, not the
number that are visible. Ideally, the presence of those rows would
affect the cost estimate for the node in a way very similar to
expecting to find more rows. At the same time, it doesn't work to just
bump up the row count estimate for the node, because then you'll think
more rows will be output, which might cause poor planning decisions at
higher levels.
It doesn't seem easy to get this 100% right. Tuple visibility can
change very quickly, much faster than the inter-ANALYZE interval. And
sometimes tuples can be pruned away very quickly, too, and the index
pointers may be opportunistically removed very quickly, too.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company