Thread: strange row count estimates with conditions on multiple columns

strange row count estimates with conditions on multiple columns

From: Tomas Vondra
Date:
Hi everyone,

I've just noticed a strange behaviour when estimating row counts (I'm
running 9.0.1). A small demonstration - let's create a table with two
columns and fill it with data so that the columns are not independent:

=====================================================================

-- table with two columns
create table test_table (col_a int, col_b int);

-- fill it with correlated data (0..99, 0..199)
insert into test_table select i%100, i%200 from generate_series(1,
1000000) s(i);

-- update statistics
analyze test_table;

=====================================================================

OK, now let's run a few simple queries, producing exactly the same
output but very different row estimates:

A) SELECT * FROM test_table WHERE col_a = 33 AND col_b = 33;

Seq Scan on test_table  (cost=0.00..19425.00 rows=47 width=8) (actual
time=0.025..216.273 rows=5000 loops=1)
   Filter: ((col_a = 33) AND (col_b = 33))
 Total runtime: 221.676 ms

B) SELECT * FROM test_table WHERE (col_a, col_b) = (33, 33);

The plan is exactly the same as for (A).

C) SELECT * FROM test_table WHERE (col_a BETWEEN 33 AND 33) AND (col_b
BETWEEN 33 AND 33);

 Seq Scan on test_table  (cost=0.00..24425.00 rows=1 width=8) (actual
time=0.025..282.725 rows=5000 loops=1)
   Filter: ((col_a >= 33) AND (col_a <= 33) AND (col_b >= 33) AND (col_b
<= 33))
 Total runtime: 288.127 ms

D) SELECT * FROM test_table WHERE (col_a, col_b) BETWEEN (33, 33) AND
(33, 33);

 Seq Scan on test_table  (cost=0.00..24425.00 rows=227232 width=8)
(actual time=0.022..238.958 rows=5000 loops=1)
   Filter: ((ROW(col_a, col_b) >= ROW(33, 33)) AND (ROW(col_a, col_b) <=
ROW(33, 33)))
 Total runtime: 244.353 ms

=====================================================================

So the estimated row counts are these:

A) 47
B) 47
C) 1
D) 227232

Results from (A) and (B) seem strange to me because AFAIK there are no
multi-column statistics available, and according to this thread

http://archives.postgresql.org/pgsql-hackers/2009-03/msg00052.php

the single-column estimates are not multiplied (which would be OK only
in case of statistically independent columns). Yet the estimates somehow
match the product:

1,000,000 * (1/200) * (1/100) = 1,000,000 / 20,000 = 50


I'm not quite sure why (C) has an estimate of 1. Column col_a has only 100
distinct values, so it uses most_common_vals/most_common_freqs, and all
the values will be there (statistics target is 100) along with
frequencies. This gives about 1% selectivity.

Column col_b has 200 distinct values, uniformly distributed, so the
estimates are based on a histogram - there are 100 bins, the range fits
into a single bin, giving 1% selectivity.

But no matter what I do, I'm not sure how to combine these two estimates
into 0.0001% (1 row out of a million).

And I do have exactly the same problem with the estimate in (D). Where
the heck did 227232 come from?

regards
Tomas

Re: strange row count estimates with conditions on multiple columns

From: Tom Lane
Date:
Tomas Vondra <tv@fuzzy.cz> writes:
> Results from (A) and (B) seem strange to me because AFAIK there are no
> multi-column statistics available, and according to this thread

> http://archives.postgresql.org/pgsql-hackers/2009-03/msg00052.php

> the single-column estimates are not multiplied (which would be OK only
> in case of statistically independent columns).

You're misreading that thread: it's discussing row inequality
comparisons, as in your example (D).  Row equality comparisons are the
same as a bunch of per-column equality comparisons, which is why (A) and
(B) behave the same, and for that you *will* get a multiplication of the
assumed-independent clause selectivities.
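
A quick cross-check against the test_table from upthread (rough numbers,
not the planner's exact arithmetic):

  EXPLAIN SELECT * FROM test_table WHERE col_a = 33;  -- roughly 10000 rows
  EXPLAIN SELECT * FROM test_table WHERE col_b = 33;  -- roughly 5000 rows

Multiplying the two per-column selectivities gives
1,000,000 * (1/100) * (1/200) = 50, which is where the 47-ish estimate
in (A) and (B) comes from (the difference is just sampling noise in the
per-value frequencies).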

> I'm not quite sure why (C) has an estimate of 1.

It's smart enough to see that each of the clauses is a range constraint
on the variable, so you get fairly tight estimates on the number of
matches ... and then those two small selectivities are multiplied
together.  It does not however notice that the range bounds are actually
equal, which would allow it to convert the estimate to a simple equality
estimate, which in many cases (including this one) would be better.
I think we've discussed special-casing that, but it doesn't look like
anybody got around to it yet.  It's a little bit tricky to do because
the range estimator doesn't really distinguish < from <= --- which
normally doesn't matter a lot, but it does when you're considering
"x >= 33 and x <= 33" versus "x > 33 and x < 33".

> And I do have exactly the same problem with the estimate in (D). Where
> the heck did 227232 come from?

It doesn't recognize that this case is a range comparison (which was a
point made in the thread you cited).  So you get a dumb multiplication
of the selectivities for col_a >= 33 and col_a <= 33.
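
A rough back-of-the-envelope check: with col_a uniform over 0..99,
col_a >= 33 has selectivity about 67/100 and col_a <= 33 about 34/100,
and 1,000,000 * 0.67 * 0.34 is roughly 228,000 - which is the
neighborhood the 227232 estimate lives in (the gap is sampling noise in
the per-value frequencies).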

            regards, tom lane

Re: strange row count estimates with conditions on multiple columns

From: Tomas Vondra
Date:
OK, thanks for the explanation.

Cases (A), (B) and (D) are clear now. But I'm not sure about (C) ...

On 17.11.2010 04:03, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> I'm not quite sure why (C) has an estimate of 1.
>
> It's smart enough to see that each of the clauses is a range constraint
> on the variable, so you get fairly tight estimates on the number of
> matches ... and then those two small selectivities are multiplied
> together.  It does not however notice that the range bounds are actually
> equal, which would allow it to convert the estimate to a simple equality
> estimate, which in many cases (including this one) would be better.
> I think we've discussed special-casing that, but it doesn't look like
> anybody got around to it yet.  It's a little bit tricky to do because
> the range estimator doesn't really distinguish < from <= --- which
> normally doesn't matter a lot, but it does when you're considering
> "x >= 33 and x <= 33" versus "x > 33 and x < 33".

OK, but how does this lead to the estimate of 1 row?

Estimate for condition

   ... WHERE (col_a BETWEEN 33 AND 33)

is about 10k rows, which is quite precise. On the other hand, the
estimate for the condition

   ... WHERE (col_b BETWEEN 33 AND 33)

is 1 row, which is very imprecise (actual value is about 5000).

While the estimate related to col_a is based on most_common_vals/freqs,
the estimate related to col_b is based on a histogram. So I guess this is
somehow related - it seems like it's not just counting the bins (value
33 hits one bin, there are 100 bins => estimate is 1% of rows), it's
probably counting what portion of the bin is overlapped by the range.
And in this case "33-33 = 0" ...

I've been playing with this a little bit, and I've noticed another
'strange' thing. When I rewrite the condition like this

   ... WHERE (col_b BETWEEN 32.9 AND 33.1)

so that the range is not of zero length, everything works fine and the
estimate is exactly 5000. But when I increase the lower bound a bit

   ... WHERE (col_b BETWEEN 33 AND 33.1)

the estimate suddenly jumps to 276667. So narrowing the range actually
increases the estimate for some reason?

regards
Tomas

Re: strange row count estimates with conditions on multiple columns

From: Tom Lane
Date:
Tomas Vondra <tv@fuzzy.cz> writes:
> Estimate for condition
>    ... WHERE (col_a BETWEEN 33 AND 33)
> is about 10k rows, which is quite precise. On the other hand, the
> estimate for the condition
>    ... WHERE (col_b BETWEEN 33 AND 33)
> is 1 row, which is very imprecise (actual value is about 5000).

That's an artifact of your test case.  There are exactly 100 distinct
values in col_a, which means that they all fit into the most_common_vals
array (assuming you're using default_statistics_target = 100).  So the
planner actually has complete information about the contents of col_a,
modulo some sampling inaccuracy about the precise frequency of each
value.  It should be expected to produce pretty good estimates for the
sorts of expressions it can estimate, and it does.  In col_b, there are
200 values, so they can't be represented by most_common_vals, and in
fact ANALYZE notices that none of them are really much more common than
any other.  So it throws up its hands and doesn't generate an MCV list
at all, just a histogram.  That doesn't provide a lot of foothold for
the range estimator to give an exact estimate for a very narrow range.
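
If you want to see what ANALYZE actually stored, querying the standard
pg_stats view for the table from upthread should show it:

  SELECT attname, n_distinct, most_common_vals, histogram_bounds
    FROM pg_stats
   WHERE tablename = 'test_table';

col_a should come back with a full most_common_vals array, while col_b
should have only histogram_bounds.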

> I've been playing with this a little bit, and I've noticed another
> 'strange' thing. When I rewrite the condition like this

>    ... WHERE (col_b BETWEEN 32.9 AND 33.1)

> so that the range is not of zero length, everything works fine and the
> estimate is exactly 5000. But when I increase the lower bound a bit

>    ... WHERE (col_b BETWEEN 33 AND 33.1)

> the estimate suddenly jumps to 276667.

If you look closely at what EXPLAIN is saying, that expression expands
as

 Filter: ((col_b >= 33) AND ((col_b)::numeric <= 33.1))

So the things being compared to aren't the same, and it doesn't
recognize this as a range constraint, and you get the dumb
product-of-independent-inequality-probabilities estimate.

The "exact" estimate you got in the first case is more of a lucky guess
than anything else, too.  It does realize that it's got a range
constraint in that case, but it has no stats about the value of the
expression col_b::numeric.  So you're just getting a default estimate
that by coincidence matches the correct answer.
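
If you keep both bounds as integers, both clauses stay on col_b itself
and the histogram does get used for a range estimate - for example
something like

  EXPLAIN SELECT * FROM test_table WHERE col_b BETWEEN 33 AND 34;

should be estimated off col_b's histogram rather than off a default.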

            regards, tom lane

Re: strange row count estimates with conditions on multiple columns

From: Tomas Vondra
Date:
On 17.11.2010 05:22, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> Estimate for condition
>>    ... WHERE (col_a BETWEEN 33 AND 33)
>> is about 10k rows, which is quite precise. On the other hand, the
>> estimate for the condition
>>    ... WHERE (col_b BETWEEN 33 AND 33)
>> is 1 row, which is very imprecise (actual value is about 5000).
>
> That's an artifact of your test case.  There are exactly 100 distinct
> values in col_a, which means that they all fit into the most_common_vals
> array (assuming you're using default_statistics_target = 100).  So the
> planner actually has complete information about the contents of col_a,
> modulo some sampling inaccuracy about the precise frequency of each
> value.  It should be expected to produce pretty good estimates for the
> sorts of expressions it can estimate, and it does.  In col_b, there are
> 200 values, so they can't be represented by most_common_vals, and in
> fact ANALYZE notices that none of them are really much more common than
> any other.  So it throws up its hands and doesn't generate an MCV list
> at all, just a histogram.  That doesn't provide a lot of foothold for
> the range estimator to give an exact estimate for a very narrow range.

Yes, I understand why the MCV list is not used for col_b, and I do
understand that the estimate may not be precise. But I'm wondering
what's a better estimate in such cases - 1, 5000, any constant, or
something related to the histogram?

I'd probably vote for the size of the histogram bucket (or maybe half
of it), as it seems like a good upper bound, and I don't see why
any constant (1, 5000, whatever) should be better. But maybe I'm missing
something.

> If you look closely at what EXPLAIN is saying, that expression expands
> as
>
>  Filter: ((col_b >= 33) AND ((col_b)::numeric <= 33.1))

Aha! I hadn't noticed that '::numeric' cast - it'd be nice to make
this more visible. I bet I'm not the only one who has missed it.

> So the things being compared to aren't the same, and it doesn't
> recognize this as a range constraint, and you get the dumb
> product-of-independent-inequality-probabilities estimate.
>
> The "exact" estimate you got in the first case is more of a lucky guess
> than anything else, too.  It does realize that it's got a range
> constraint in that case, but it has no stats about the value of the
> expression col_b::numeric.  So you're just getting a default estimate
> that by coincidence matches the correct answer.

Aaaargh, it looked like a perfect estimate :-(

BTW I think the default estimate used to be 1000 - was it changed in
one of the 8.x releases? Can you point me to the docs? I've even tried
to find that in the sources, but unsuccessfully.

regards
Tomas

Re: strange row count estimates with conditions on multiple columns

From: Tomas Vondra
Date:
> BTW I think the default estimate used to be 1000 - was it changed in
> one of the 8.x releases? Can you point me to the docs? I've even tried
> to find that in the sources, but unsuccessfully.

OK, I've found it right after submitting the e-mail.

It's defined in selfuncs.h as DEFAULT_RANGE_INEQ_SEL, equal to 0.005.
Which makes sense, as it's a fraction of the table size. I had mixed it
up with some other default which used to be 1000 (a table size estimate,
I think).

regards
Tomas

Re: strange row count estimates with conditions on multiple columns

From: Tom Lane
Date:
Tomas Vondra <tv@fuzzy.cz> writes:
> Yes, I understand why the MCV list is not used for col_b, and I do
> understand that the estimate may not be precise. But I'm wondering
> what's a better estimate in such cases - 1, 5000, any constant, or
> something related to the histogram?

It is doing it off the histogram.  The logic is actually quite good
I think for cases where the data granularity is small compared to the
histogram bucket width.  For cases like we have here, the assumption
of a continuous distribution fails rather badly --- but it's pretty
hard to see how to improve it without inserting a lot of type-specific
assumptions.
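
To put some numbers on that: under a continuous-distribution assumption
the interval [33,33] really would contain essentially 0% of the rows, so
a near-zero selectivity is the "right" answer there. It's only because
col_b is integer-valued, with 0.5% of the rows (5000 out of 1,000,000)
sitting exactly on 33, that the estimate ends up so far off.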

> BTW I think the default estimate used to be 1000 - was it changed in
> one of the 8.x releases? Can you point me to the docs? I've even tried
> to find that in the sources, but unsuccessfully.

It's DEFAULT_RANGE_INEQ_SEL, and AFAIR it hasn't changed in quite a while.
But I wouldn't be surprised if the behavior of this example changed when
we boosted the default statistics target.
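
If you want to experiment with that, something along these lines raises
the target for just the one column (1000 here is only an example value):

  ALTER TABLE test_table ALTER COLUMN col_b SET STATISTICS 1000;
  ANALYZE test_table;

With a target that large, all 200 col_b values are likely to end up in
the most_common_vals list, which should change the estimates for these
queries quite a bit.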

            regards, tom lane

Re: strange row count estimates with conditions on multiple columns

From: Tomas Vondra
Date:
On 17.11.2010 06:58, Tom Lane wrote:
>> BTW I think the default estimate used to be 1000 - was it changed in
>> one of the 8.x releases? Can you point me to the docs? I've even tried
>> to find that in the sources, but unsuccessfully.
>
> It's DEFAULT_RANGE_INEQ_SEL, and AFAIR it hasn't changed in quite a while.
> But I wouldn't be surprised if the behavior of this example changed when
> we boosted the default statistics target.

I've been thinking about this and I think it might be improved. If I
understand the logic correctly, it says 'use half of the histogram bin
size'. But the value

#define DEFAULT_RANGE_INEQ_SEL 0.005

says it's always 0.5%, which is not true if the STATISTICS TARGET is not
100. This could actually yield 10x more precise estimates when the
STATISTICS TARGET is set to 1000.

OK, I know the default value is 100, just thinking about how to improve
the estimates.

Tomas

Re: strange row count estimates with conditions on multiple columns

From: Tom Lane
Date:
Tomas Vondra <tv@fuzzy.cz> writes:
> On 17.11.2010 06:58, Tom Lane wrote:
>>> BTW I think the default estimate used to be 1000 - was it changed in
>>> one of the 8.x releases? Can you point me to the docs? I've even tried
>>> to find that in the sources, but unsuccessfully.
>>
>> It's DEFAULT_RANGE_INEQ_SEL, and AFAIR it hasn't changed in quite a while.
>> But I wouldn't be surprised if the behavior of this example changed when
>> we boosted the default statistics target.

> I've been thinking about this and I think it might be improved. If I
> understand the logic correctly, it says 'use half of the histogram bin
> size'. But the value

> #define DEFAULT_RANGE_INEQ_SEL 0.005

> says it's always 0.5%, which is not true if the STATISTICS TARGET is not
> 100. This could actually yield 10x more precise estimates when the
> STATISTICS TARGET is set to 1000.

Huh?  The default estimates are completely unrelated to the size of the
histogram, and certainly unrelated to the default size of the
histogram.  We use those estimates when we don't have relevant stats.
It's pure wishful thinking to suppose that changing the statistics
target would have any impact on what the estimate ought to be in such
a case.

I believe the actual reasoning for setting the default estimates that
are under 1% is that we wanted to encourage indexscan choices in such
cases.  Once it's small enough for that, making it even smaller doesn't
really help --- and that does risk making bad join choices.  You don't
want the thing coming up with one-row estimates unless there's real
evidence for such an estimate.

            regards, tom lane

Re: strange row count estimates with conditions on multiple columns

From: tv@fuzzy.cz
Date:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> I've been thinking about this and I think it might be improved. If I
>> understand the logic correctly, it says 'use half of the histogram bin
>> size'. But the value
>
>> #define DEFAULT_RANGE_INEQ_SEL 0.005
>
>> says it's always 0.5%, which is not true if the STATISTICS TARGET is not
>> 100. This could actually yield 10x more precise estimates when the
>> STATISTICS TARGET is set to 1000.
>
> Huh?  The default estimates are completely unrelated to the size of the
> histogram, and certainly unrelated to the default size of the
> histogram.  We use those estimates when we don't have relevant stats.
> It's pure wishful thinking to suppose that changing the statistics
> target would have any impact on what the estimate ought to be in such
> a case.

Oops, sorry for the crazy gibberish I posted earlier. I thought those
default estimates worked somewhat differently and hadn't checked it in
the code. The proposed 'optimization' obviously does not make any sense.

regards
Tomas