Re: strange row count estimates with conditions on multiple column - Mailing list pgsql-general

From Tomas Vondra
Subject Re: strange row count estimates with conditions on multiple column
Date
Msg-id 4CE3501F.9070104@fuzzy.cz
Whole thread Raw
In response to Re: strange row count estimates with conditions on multiple column  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: strange row count estimates with conditions on multiple column  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-general
OK, thanks for the explanation.

Cases (A), (B) and (D) are clear now. But I'm not sure about (C) ...

Dne 17.11.2010 04:03, Tom Lane napsal(a):
> Tomas Vondra <tv@fuzzy.cz> writes:
>> I'm not quite sure why (C) has an estimate of 1.
>
> It's smart enough to see that each of the clauses is a range constraint
> on the variable, so you get fairly tight estimates on the number of
> matches ... and then those two small selectivities are multiplied
> together.  It does not however notice that the range bounds are actually
> equal, which would allow it to convert the estimate to a simple equality
> estimate, which in many cases (including this one) would be better.
> I think we've discussed special-casing that, but it doesn't look like
> anybody got around to it yet.  It's a little bit tricky to do because
> the range estimator doesn't really distinguish < from <= --- which
> normally doesn't matter a lot, but it does when you're considering
> "x >= 33 and x <= 33" versus "x > 33 and x < 33".

OK, but how this leads to the estimate of 1 row?

Estimate for condition

   ... WHERE (col_a BETWEEN 33 AND 33)

is about 10k rows, which is quite precise. On the other side estimate
for condition

   ... WHERE (col_b BETWEEN 33 AND 33)

is 1 row, which is very imprecise (actual value is about 5000).

While the estimate related to col_a is based on most_common_vals/freqs,
estimate related to col_b is based on a histogram. So I guess this is
somehow related - it seems like it's not just counting the bins (value
33 hits one bin, there are 100 bins => estimate is 1% of rows), it's
probably counting what portion of the bin is overlapped by the range.
And in this case "33-33 = 0" ...

I've been playing with this a little bit, and I've noticed another
'strange' thing. When I rewrite the condition like this

   ... WHERE (col_b BETWEEN 32.9 AND 33.1)

so that the range is not of zero length, and everything works fine, the
estimate is exactly 5000. But when I increase the lower bound a bit

   ... WHERE (col_b BETWEEN 33 AND 33.1)

the estimate suddenly jumps to 276667. So narrowing an range actually
increases the interval for some reason?

regards
Tomas

pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: Survey on backing up unlogged tables: help us with PostgreSQL development!
Next
From: Tom Lane
Date:
Subject: Re: strange row count estimates with conditions on multiple column