Thread: statistics horribly broken for row-wise comparison

statistics horribly broken for row-wise comparison

From: Merlin Moncure
Date:
It looks like for row-wise comparison, only the first column is used
for generating the expected row count.  This can lead to bad plans in
some cases.

Test case (takes seconds to minutes depending on hardware):

create table range as
  select v as id,
         v % 500 as key,
         now() + ((random() * 1000) || 'days')::interval as ts
    from generate_series(1,10000000) v;

create index range_idx on range(key, ts);

explain analyze select * from range
 where (key, ts) >= (222, '7/11/2009')
   and (key, ts) <= (222, '7/12/2009')
 order by key, ts;
 

result (cut down a bit):

Sort  (cost=469723.46..475876.12 rows=2461061 width=16) (actual
time=0.054..0.056 rows=13 loops=1)
  Sort Key: key, ts
  Sort Method:  quicksort  Memory: 25kB

Note the expected row count vs. what was actually returned.  Varying
the ts parameters changes the expected rows, but varying the key does
not.  Note that for this test case the returned plan is ok, but
obviously the planner will freak out and drop to a seq scan or do
other nefarious things, depending on circumstances.
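
One way to check which column drives the estimate is to vary each
bound in turn and watch the estimated row count, e.g. (same test
table as above):

explain select * from range
 where (key, ts) >= (111, '7/11/2009')
   and (key, ts) <= (111, '7/12/2009');  -- different key, same ts bounds

explain select * from range
 where (key, ts) >= (222, '1/1/2009')
   and (key, ts) <= (222, '12/31/2009'); -- same key, much wider ts bounds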

I confirmed this on 8.2 and HEAD (a month old or so).

merlin


Re: statistics horribly broken for row-wise comparison

From: Merlin Moncure
Date:
On Mon, Mar 2, 2009 at 4:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> It looks like for row-wise comparison, only the first column is used
> for generating the expected row count.  This can lead to bad plans in
> some cases.
>
> Test case (takes seconds to minutes depending on hardware):
>
> create table range as select v as id, v % 500 as key, now() +
> ((random() * 1000) || 'days')::interval as ts from
> generate_series(1,10000000) v;
>
> create index range_idx on range(key, ts);
>
> explain analyze select * from range where (key, ts) >= (222, '7/11/2009') and
>        (key, ts) <= (222, '7/12/2009')
>        order by key, ts;
>
> result (cut down a bit)
> Sort  (cost=469723.46..475876.12 rows=2461061 width=16) (actual
> time=0.054..0.056 rows=13 loops=1)
>   Sort Key: key, ts
>   Sort Method:  quicksort  Memory: 25kB
>
> Note the expected row count vs. what was actually returned.  Varying
> the ts parameters changes the expected rows, but varying the key does
> not.

Oops, that's backwards: the expected row count depends on key (the
first column in the index) only.

merlin


Re: statistics horribly broken for row-wise comparison

From: Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> It looks like for row-wise comparison, only the first column is used
> for generating the expected row count.

[ shrug... ]  Short of multi-column statistics, it's hard to see how to
do better.
        regards, tom lane


Re: statistics horribly broken for row-wise comparison

From: Merlin Moncure
Date:
On Mon, Mar 2, 2009 at 8:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> It looks like for row-wise comparison, only the first column is used
>> for generating the expected row count.
>
> [ shrug... ]  Short of multi-column statistics, it's hard to see how to
> do better.

Hm... why can't you just multiply the range estimates for the fields
together when doing a row-wise operation over the key?

For example, in this case if the planner estimates 10% of rows for
key, and 5% of matches for ts, just multiply .1 and .05 to get .005
when you fall into the row operation case.  This would give a
reasonably accurate answer... formally correct, even.  All the
information is there, or am I missing something (not knowing all the
inner workings of the planner, I certainly might be)?
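
For illustration, the per-column estimates are already visible if you
split the condition up (same test table as above; exact numbers will
vary with the random data):

-- estimated fraction of the table for the key range alone
explain select * from range where key between 222 and 222;

-- estimated fraction for the ts range alone
explain select * from range
 where ts between '7/11/2009' and '7/12/2009';

Multiplying the two estimated fractions together is exactly the
proposed row-wise estimate.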

IOW, I don't see this as a 'not enough statistics' problem, more of a
'looking at the statistics wrong for multi-column index range
operations' problem.  Equality works correctly, as it always has.
This is kind of a stats loophole introduced when we got the ability
to do these types of operations correctly in 8.2.

There's no workaround to this problem that I can see, short of
turning off enable_seqscan.

The classic form of this query when looking for only one 'key' (my
problem case):
select * from range where key = x and ts between a and b;

usually gives a plain index scan, which can be really undesirable.
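
To see the contrast on the test table above (estimates will vary with
the random data):

-- classic form: the per-column selectivities get combined
explain select * from range
 where key = 222 and ts between '7/11/2009' and '7/12/2009';

-- row-wise form: the estimate tracks the leading column only
explain select * from range
 where (key, ts) >= (222, '7/11/2009')
   and (key, ts) <= (222, '7/12/2009');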

merlin


Re: statistics horribly broken for row-wise comparison

From: Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> On Mon, Mar 2, 2009 at 8:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Merlin Moncure <mmoncure@gmail.com> writes:
>>> It looks like for row-wise comparison, only the first column is used
>>> for generating the expected row count.
>> 
>> [ shrug... ]  Short of multi-column statistics, it's hard to see how to
>> do better.

> Hm... why can't you just multiply the range estimates for the fields
> together when doing a row-wise operation over the key?

Because it would be the wrong answer, except in the uncommon case where
the field values are completely independent (at least, I would expect
that to be uncommon when people have multicolumn indexes on them).

In the case at hand, I think you're barking up the wrong tree anyway.
It's much more likely that what we need to do is fix
clauselist_selectivity to recognize range conditions involving
RowCompareExprs.
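
For reference, the row comparison's actual semantics, which is what
such a fix would have to estimate, expand per standard SQL row
comparison rules:

-- (key, ts) >= (222, '7/11/2009') is by definition equivalent to:
select * from range
 where key > 222
    or (key = 222 and ts >= '7/11/2009');

A >=/<= pair of these therefore bounds one contiguous slice of the
(key, ts) index order, which is what makes a range-style estimate
possible.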
        regards, tom lane


Re: statistics horribly broken for row-wise comparison

From: Greg Stark
Date:
On Tue, Mar 3, 2009 at 3:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Because it would be the wrong answer, except in the uncommon case where
> the field values are completely independent (at least, I would expect
> that to be uncommon when people have multicolumn indexes on them).


Actually I think it's *more* likely to be true for multicolumn
indexes.  If the two columns were correlated then you wouldn't need
the multicolumn index, since you could just use one of the columns
alone.  The case where people create multicolumn btree indexes is
usually when they have a "major"-"minor" relationship between the two
columns, so for each value of the major key there is a whole range of
minor keys.
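
The test table above is exactly that shape: each major key value
spans essentially the full minor (ts) range, which is the independent
case, e.g.:

select key, count(*), min(ts), max(ts)
  from range
 where key in (0, 222, 499)
 group by key
 order by key;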

Sure, multi-column statistics would be nice, though.

-- 
greg