Thinking About Correlated Columns (again) - Mailing list pgsql-performance

From Shaun Thomas
Subject Thinking About Correlated Columns (again)
Date
Msg-id 5193AA62.2040700@optionshouse.com
Whole thread Raw
Responses Re: Thinking About Correlated Columns (again)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Re: Thinking About Correlated Columns (again)  (Craig James <cjames@emolecules.com>)
Re: Thinking About Correlated Columns (again)  (Thomas Kellerer <spam_eater@gmx.net>)
List pgsql-performance
Hi!

This has been a pain point for quite a while. While we've had several
discussions in the area, it always seems to just kinda trail off and
eventually vanish every time it comes up.

A really basic example of how bad the planner is here:

CREATE TABLE foo AS
SELECT a.id, a.id % 1000 AS col_a, a.id % 1000 AS col_b
   FROM generate_series(1, 1000000) a(id);

CREATE INDEX idx_foo_ab ON foo (col_a, col_b);

ANALYZE foo;

EXPLAIN ANALYZE
SELECT *
   FROM foo
  WHERE col_a = 50
    AND col_b = 50;

Index Scan using idx_foo_ab on foo  (cost=0.00..6.35 rows=1 width=12)
            (actual time=0.030..3.643 rows=1000 loops=1)
   Index Cond: ((col_a = 50) AND (col_b = 50))

Hey, look! The row estimate is off by a factor of 1000. This particular
case doesn't suffer terribly from the mis-estimation, but others do.
Boy, do they ever.

Since there really is no fix for this aside from completely rewriting
the query or horribly misusing CTEs (to force sequence scans instead of
bloated nested loops, for example), I recently tried to use some
existing ALTER TABLE syntax:

ALTER TABLE foo ALTER col_a SET (n_distinct = 1);
ALTER TABLE foo ALTER col_b SET (n_distinct = 1);

The new explain output:

Index Scan using idx_foo_ab on foo  (cost=0.00..9.37 rows=2 width=12)
            (actual time=0.013..0.871 rows=1000 loops=1)
   Index Cond: ((col_a = 50) AND (col_b = 50))

Well... 2 is closer to 1000 than 1 was, but it's pretty clear the
histogram is still the basis for the cost estimates. I'm curious what
benefit overriding the n_distinct pg_stats field actually gives us in
practice.

I'm worried that without an easy answer for cases like this, more DBAs
will abuse optimization fences to get what they want and we'll end up in
a scenario that's actually worse than query hints. Theoretically, query
hints can be deprecated or have GUCs to remove their influence, but CTEs
are forever, or until the next code refactor.

I've seen conversations on this since at least 2005. There were even
proposed patches every once in a while, but never any consensus. Anyone
care to comment?

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


pgsql-performance by date:

Previous
From: "ktm@rice.edu"
Date:
Subject: Re: RT3.4 query needed a lot more tuning with 9.2 than it did with 8.1
Next
From: Heikki Linnakangas
Date:
Subject: Re: Thinking About Correlated Columns (again)