Home > mailing lists

Re: Thinking About Correlated Columns (again) - Mailing list pgsql-performance

From	Craig James
Subject	Re: Thinking About Correlated Columns (again)
Date	May 15, 2013 16:23:18
Msg-id	CAFwQ8rcSrwfJRgtP9GV_t-hKRR4=PmVeqCuGJ7wEnnO-9PD0wA@mail.gmail.com Whole thread
In response to	Thinking About Correlated Columns (again) (Shaun Thomas <sthomas@optionshouse.com>)
Responses	Re: Thinking About Correlated Columns (again) Re: Thinking About Correlated Columns (again)
List	pgsql-performance

Tree view

On Wed, May 15, 2013 at 8:31 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:

[Inefficient plans for correlated columns] has been a pain point for quite a while. While we've had several discussions in the area, it always seems to just kinda trail off and eventually vanish every time it comes up.

...
Since there really is no fix for this aside from completely rewriting the query or horribly misusing CTEs (to force sequence scans instead of bloated nested loops, for example)...
I'm worried that without an easy answer for cases like this, more DBAs will abuse optimization fences to get what they want and we'll end up in a scenario that's actually worse than query hints. Theoretically, query hints can be deprecated or have GUCs to remove their influence, but CTEs are forever, or until the next code refactor.

I've seen conversations on this since at least 2005. There were even proposed patches every once in a while, but never any consensus. Anyone care to comment?

It's a very hard problem. There's no way you can keep statistics about all possible correlations since the number of possibilities is O(N^2) with the number of columns.

We have an external search engine (not part of Postgres) for a domain-specific problem that discovers correlations dynamically and does on-the-fly alterations to its search plan. Our queries are simple conjunctions (A & B & C ...), which makes "planning" easy: you can do indexes in any order you like. So, the system watches each index's performance:

1. An index that's rarely rejecting anything gets demoted and eventually is discarded.
2. An index that's rejecting lots of stuff gets promoted.

Rule 1: If two indexes are strongly correlated, which ever happens to be first in the plan will do all the work, so the second one will rarely reject anything. By demoting it to later in the plan, it allows more selective indexes to be examined first. If an index ends up rejecting almost nothing, it is discarded.

Rule 2: If an index rejects lots of stuff and you promote it to the front of the plan, then "anti-correlated" indexes (those that reject different things) tend to move to the front of the plan, resulting in very efficient indexing.

Typically, our query plans start with roughly 20-100 indexes. The nature of our domain-specific problem is that the indexes are highly correlated (but it's impossible to predict in advance; each query causes unique correlations.) Within a few thousand rows, it usually has dropped most of them, and finishes the query with 5-10 indexes that are about 95% as selective, but much faster, than the original plan.

But there are two caveats. First, our particular query is a simple conjunction (A & B & C ...). It's easy to shuffle the index orders around.

Second, databases tend to be non-homogeneous. There are clusters of similar rows. If you blindly optimize by going through a table in storage order, you may optimize strongly for a highly similar group of rows, then discover once you get out of that section of the database that your "optimization" threw out a bunch of indexes that would be selective in other sections of the database. To solve this, you have to add a mechanism that examines the database rows in random order. This tends to minimize the problem (although in theory it could still happen).

A more typical query (i.e. anything past a simple conjunction) would be much more difficult for a dynamic optimizer.

And I can't see how you can expect a static planner (one that doesn't do on-the-fly modifications to the plan) to find correlations.

Craig

pgsql-performance by date:

From: Heikki Linnakangas
Date: 15 May 2013, 15:52:33
Subject: Re: Thinking About Correlated Columns (again)

From: Shaun Thomas
Date: 15 May 2013, 16:27:35
Subject: Re: Thinking About Correlated Columns (again)

Re: Thinking About Correlated Columns (again) - Mailing list pgsql-performance

Previous

Next