Thread: avoiding seq scans when two columns are very correlated

avoiding seq scans when two columns are very correlated

From
Ruslan Zakirov
Date:
Hello,

A table has two columns id and EffectiveId. First is primary key.
EffectiveId is almost always equal to id (95%) unless records are
merged. Many queries have id = EffectiveId condition. Both columns are
very distinct and Pg reasonably decides that condition has very low
selectivity and picks sequence scan.

Simple perl script that demonstrates estimation error:
https://gist.github.com/1356744

Estimation is ~200 times off (5 vs 950), for real situation it's very
similar. Understandably difference depends on correlation coefficient.

In application such wrong estimation result in seq scan of this table
winning leading position in execution plans over other tables and
index scans.

What can I do to avoid this problem?

Tested with PostgreSQL 9.0.3 on x86_64-apple-darwin10.6.0, compiled by
GCC i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664),
64-bit

--
Best regards, Ruslan.

Re: avoiding seq scans when two columns are very correlated

From
Tom Lane
Date:
Ruslan Zakirov <ruz@bestpractical.com> writes:
> A table has two columns id and EffectiveId. First is primary key.
> EffectiveId is almost always equal to id (95%) unless records are
> merged. Many queries have id = EffectiveId condition. Both columns are
> very distinct and Pg reasonably decides that condition has very low
> selectivity and picks sequence scan.

I think the only way is to rethink your data representation.  PG doesn't
have cross-column statistics at all, and even if it did, you'd be asking
for an estimate of conditions in the "long tail" of the distribution.
That's unlikely to be very accurate.

Consider adding a "merged" boolean, or defining effectiveid differently.
For instance you could set it to null in unmerged records; then you
could get the equivalent of the current meaning with
COALESCE(effectiveid, id).  In either case, PG would then have
statistics that bear directly on the question of how many merged vs
unmerged records there are.

            regards, tom lane

Re: avoiding seq scans when two columns are very correlated

From
Ruslan Zakirov
Date:
On Fri, Nov 11, 2011 at 7:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ruslan Zakirov <ruz@bestpractical.com> writes:
>> A table has two columns id and EffectiveId. First is primary key.
>> EffectiveId is almost always equal to id (95%) unless records are
>> merged. Many queries have id = EffectiveId condition. Both columns are
>> very distinct and Pg reasonably decides that condition has very low
>> selectivity and picks sequence scan.
>
> I think the only way is to rethink your data representation.  PG doesn't
> have cross-column statistics at all, and even if it did, you'd be asking
> for an estimate of conditions in the "long tail" of the distribution.
> That's unlikely to be very accurate.

Rethinking schema is an option that requires more considerations as we
do it this way for years and run product on mysql, Pg and Oracle.
Issue affects Oracle, but it can be worked around by dropping indexes
or may be by building correlation statistics in 11g (didn't try it
yet).

Wonder if "CROSS COLUMN STATISTICS" patch that floats around would
help with such case?

> Consider adding a "merged" boolean, or defining effectiveid differently.
> For instance you could set it to null in unmerged records; then you
> could get the equivalent of the current meaning with
> COALESCE(effectiveid, id).  In either case, PG would then have
> statistics that bear directly on the question of how many merged vs
> unmerged records there are.

NULL in EffectiveId is the way to go, however when we actually need
those records (not so often situation) query becomes frightening:

SELECT main.* FROM Tickets main
    JOIN Tickets te
        ON te.EffectiveId = main.id
        OR (te.id = main.id AND te.EffectiveId IS NULL)
    JOIN OtherTable ot
        ON ot.Ticket = te.id

Past experience reminds that joins with ORs poorly handled by many optimizers.

In the current situation join condition is very straightforward and effective.

>                        regards, tom lane

--
Best regards, Ruslan.

Re: avoiding seq scans when two columns are very correlated

From
Stuart Bishop
Date:
On Fri, Nov 11, 2011 at 10:01 PM, Ruslan Zakirov <ruz@bestpractical.com> wrote:
> Hello,
>
> A table has two columns id and EffectiveId. First is primary key.
> EffectiveId is almost always equal to id (95%) unless records are
> merged. Many queries have id = EffectiveId condition. Both columns are
> very distinct and Pg reasonably decides that condition has very low
> selectivity and picks sequence scan.
>
> Simple perl script that demonstrates estimation error:
> https://gist.github.com/1356744
>
> Estimation is ~200 times off (5 vs 950), for real situation it's very
> similar. Understandably difference depends on correlation coefficient.
>
> In application such wrong estimation result in seq scan of this table
> winning leading position in execution plans over other tables and
> index scans.
>
> What can I do to avoid this problem?

Does a partial index help? CREATE UNIQUE INDEX foo_idx ON mytab(id)
WHERE id = EffectiveId


--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/

Re: avoiding seq scans when two columns are very correlated

From
Ruslan Zakirov
Date:
On Tue, Nov 15, 2011 at 11:08 AM, Stuart Bishop <stuart@stuartbishop.net> wrote:
> On Fri, Nov 11, 2011 at 10:01 PM, Ruslan Zakirov <ruz@bestpractical.com> wrote:
>> Hello,

[snip]

>> In application such wrong estimation result in seq scan of this table
>> winning leading position in execution plans over other tables and
>> index scans.
>>
>> What can I do to avoid this problem?
>
> Does a partial index help? CREATE UNIQUE INDEX foo_idx ON mytab(id)
> WHERE id = EffectiveId

It doesn't help. Probably reason is the same as for partitions.

>
>
> --
> Stuart Bishop <stuart@stuartbishop.net>
> http://www.stuartbishop.net/
>



--
Best regards, Ruslan.