Thread: Hypothetical suggestions for planner, indexing improvement

Hypothetical suggestions for planner, indexing improvement

From
Josh Berkus
Date:
Folks,

An area in which postgresql planner & indexing could be improved have occurred
to me over the last week.    I'd like to share this ideas with you in case it
is worthy of the todo list.

Please excuse me if this issue is already dealt with in CVS; I've been unable
to keep up completely on HACKERS lately.   Please also excuse me if this
issue has been discussed and was tabled due to some theoretical limitation,
such as x^n scaling problems

THE IDEA:  The planner should keep statistics on the correlation of foreign
keys and apply them to the expected row counts for EXISTS clause limitations,
and possibly for other query types as well.

To illustrate:
Database "calendar" has two tables, events and event_days.
Event_days has FK on column event_id to parent table Events.
There is at lease one record in event_days for each record in events, and the
average parent-child relationship is 1 event -> 1.15 event_days records.

This query:
SELECT events.* FROM events
WHERE EXISTS (SELECT event_id FROM event_days
    WHERE event_day BETWEEN '2003-04-08' AND '2003-05-18');

Currently, (in 7.2.4 and 7.3.1) the planner makes the assumption that the
above EXISTS restriction will only filter events by 50% and makes other join
and execution plans accordingly.   In fact, it filters events by 96% and the
ideal execution plan should be quite different.

It would be really keen if planner statistics could be expanded to include
correlation on foriegn keys in order to make more intelligent planner
decisions on the above type of query possible.

Thanks for your attention!


--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Josh Berkus <josh@agliodbs.com> writes:
> THE IDEA:  The planner should keep statistics on the correlation of foreign
> keys and apply them to the expected row counts for EXISTS clause limitations,
> and possibly for other query types as well.

It's a thought.  Keeping complete cross-column correlation stats (for
every combination of columns in the DB) is obviously out of the
question.  If you're gonna do it you need a heuristic to tell you which
combinations of columns are worth keeping track of --- and foreign-key
relationships seem like a reasonable guide to the interesting
combinations.

I'm not sure about the long-term usefulness of optimizing EXISTS per se.
Seems to me that a lot of the present uses of EXISTS are workarounds
for Postgres' historic mistreatment of IN ... which we've attacked more
directly for 7.4.  But cross-column correlations are certainly useful
for estimating join sizes in general.

            regards, tom lane


Re: [HACKERS] Hypothetical suggestions for planner, indexing improvement

From
Josh Berkus
Date:
Tom,

> It's a thought.  Keeping complete cross-column correlation stats (for
> every combination of columns in the DB) is obviously out of the
> question.  If you're gonna do it you need a heuristic to tell you which
> combinations of columns are worth keeping track of --- and foreign-key
> relationships seem like a reasonable guide to the interesting
> combinations.

Yes.  It would also make FKs something more than just an annoying (and slow)
constraint in PostgreSQL.   And it would be a performance feature that most
other RDBMSs don't have ;-)

> I'm not sure about the long-term usefulness of optimizing EXISTS per se.
> Seems to me that a lot of the present uses of EXISTS are workarounds
> for Postgres' historic mistreatment of IN ... which we've attacked more
> directly for 7.4.  But cross-column correlations are certainly useful
> for estimating join sizes in general.

EXISTS is more flexible than IN; how can you do a 3-column corellation on an
IN clause?

The reason that I mention EXISTS is because that's where the lack of
cross-column corellation is most dramatic; the planner seems to estimate a
flat 50% for EXISTS clauses regardless of the content.

--
Josh Berkus
Aglio Database Solutions
San Francisco


Josh Berkus <josh@agliodbs.com> writes:
> The reason that I mention EXISTS is because that's where the lack of
> cross-column corellation is most dramatic; the planner seems to estimate a
> flat 50% for EXISTS clauses regardless of the content.

No "seems to" about that one: see src/backend/optimizer/path/clausesel.c

    else if (is_subplan(clause))
    {
        /*
         * Just for the moment! FIX ME! - vadim 02/04/98
         */
        s1 = (Selectivity) 0.5;
    }

Patches to improve this are welcome ;-).  But I'm not at all sure how to
write something that would extract a reliable selectivity estimate from
a subplan.

            regards, tom lane


Re: [HACKERS] Hypothetical suggestions for planner, indexing

From
Peter Childs
Date:
On Mon, 5 May 2003, Josh Berkus wrote:

> Tom,
>
> > It's a thought.  Keeping complete cross-column correlation stats (for
> > every combination of columns in the DB) is obviously out of the
> > question.  If you're gonna do it you need a heuristic to tell you which
> > combinations of columns are worth keeping track of --- and foreign-key
> > relationships seem like a reasonable guide to the interesting
> > combinations.
>
> Yes.  It would also make FKs something more than just an annoying (and slow)
> constraint in PostgreSQL.   And it would be a performance feature that most
> other RDBMSs don't have ;-)

    That statement seams really strange, If FKs are really only a
safety lock to stop you from putting bad data in your database, It makes
them a bit pointless if you want a nice fast database and you can trust
your users! This does not make them useless and I still have them but from
a purely performance point of view they don't help currently!
    It may be worth adding Partial Matching FKs so that a user can
mark that they think might be a useful match to do. This would help the
fact that in many data sets NULL can mean more than one different thing.
(Don't Know, None, etc) plus using the index on IS NOT NULL queries would
be very handy when you need to know about all the records that you need to
find the information out for, or all the records with no relationship.

Peter Childs

>
> > I'm not sure about the long-term usefulness of optimizing EXISTS per se.
> > Seems to me that a lot of the present uses of EXISTS are workarounds
> > for Postgres' historic mistreatment of IN ... which we've attacked more
> > directly for 7.4.  But cross-column correlations are certainly useful
> > for estimating join sizes in general.
>
> EXISTS is more flexible than IN; how can you do a 3-column corellation on an
> IN clause?
>
> The reason that I mention EXISTS is because that's where the lack of
> cross-column corellation is most dramatic; the planner seems to estimate a
> flat 50% for EXISTS clauses regardless of the content.
>
>


Re: [HACKERS] Hypothetical suggestions for planner,

From
Ron Johnson
Date:
On Mon, 2003-05-05 at 23:25, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > THE IDEA:  The planner should keep statistics on the correlation of foreign
> > keys and apply them to the expected row counts for EXISTS clause limitations,
> > and possibly for other query types as well.
>
> It's a thought.  Keeping complete cross-column correlation stats (for
> every combination of columns in the DB) is obviously out of the
> question.  If you're gonna do it you need a heuristic to tell you which
> combinations of columns are worth keeping track of --- and foreign-key
> relationships seem like a reasonable guide to the interesting
> combinations.

How about generalizing this into something useful for all queries?

And to make this problem not just guess work on the part of the
optimizer, how about having a separate "backslash command" so that
the DBA can add specific/important/crucial predicates that he needs
optimized.

Thus, if there's a query with a large WHERE clause that has an
optimized predicate inside it, the statistics would be used.

> WHERE event_day BETWEEN '2003-04-08' AND '2003-05-18')

What this sounds to me like, though, is that in order for it to
work quickly, the postmaster would have to keep track in system
tables each value of event_day and it's COUNT(*), thus more I/O
overhead.

--
+---------------------------------------------------------------+
| Ron Johnson, Jr.        mailto:ron.l.johnson@cox.net          |
| Jefferson, LA  USA      http://members.cox.net/ron.l.johnson  |
|                                                               |
| The purpose of the military isn't to pay your college tuition |
| or give you a little extra income; it's to "kill people and   |
| break things".  Surprisingly, not everyone understands that.  |
+---------------------------------------------------------------+


Re: [HACKERS] Hypothetical suggestions for planner, indexing improvement

From
"Jim C. Nasby"
Date:
On Mon, May 05, 2003 at 09:33:33PM -0700, Josh Berkus wrote:
> EXISTS is more flexible than IN; how can you do a 3-column corellation on an
> IN clause?

It would be nice to add support for multi-column IN..

WHERE (a, b, c) IN (SELECT a, b, c ...)

BTW, does postgresql handle IN and EXISTS differently? Theoretically if
the optimizer was good enough you could transform one to the other and
not worry about it. Whenever possible, I try and use IN when the
subselect will return a very small number of rows, since IN might be
faster than EXISTS in that case, though it seems most optimizers tend to
fall apart when they see ORs, and a lot of databases transform IN to a
OR b OR c.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: [HACKERS] Hypothetical suggestions for planner, indexing improvement

From
"Jim C. Nasby"
Date:
On Tue, May 06, 2003 at 12:25:33AM -0400, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > THE IDEA:  The planner should keep statistics on the correlation of foreign
> > keys and apply them to the expected row counts for EXISTS clause limitations,
> > and possibly for other query types as well.
>
> It's a thought.  Keeping complete cross-column correlation stats (for
> every combination of columns in the DB) is obviously out of the
> question.  If you're gonna do it you need a heuristic to tell you which
> combinations of columns are worth keeping track of --- and foreign-key
> relationships seem like a reasonable guide to the interesting
> combinations.

What if the optimizer kept on-going statistics on what columns were used
in joins to what other columns? Over time this would allow analyze to
determine on it's own what cross-column correlation stats should be
kept.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


"Jim C. Nasby" <jim@nasby.net> writes:
> It would be nice to add support for multi-column IN..
> WHERE (a, b, c) IN (SELECT a, b, c ...)

RTFM...

> BTW, does postgresql handle IN and EXISTS differently?

Yes.

> Theoretically if the optimizer was good enough you could transform one
> to the other and not worry about it.

No.  They have different responses to NULLs in the subselect result.

            regards, tom lane


Re: [HACKERS] Hypothetical suggestions for planner, indexing

From
Christopher Kings-Lynne
Date:
> On Mon, May 05, 2003 at 09:33:33PM -0700, Josh Berkus wrote:
> > EXISTS is more flexible than IN; how can you do a 3-column corellation on an
> > IN clause?
>
> It would be nice to add support for multi-column IN..
>
> WHERE (a, b, c) IN (SELECT a, b, c ...)

Umm....we DO have that...

Chris


Re: [HACKERS] Hypothetical suggestions for planner, indexing

From
Dennis Björklund
Date:
On Tue, 6 May 2003, Tom Lane wrote:

> > It would be nice to add support for multi-column IN..
> > WHERE (a, b, c) IN (SELECT a, b, c ...)
>
> RTFM...

Maybe he did read the manual:

6.15.3. IN (subquery form)

expression IN (subquery)

The right-hand side of this form of IN is a parenthesized subquery, which
must return exactly one column.

--
/Dennis


Re: [HACKERS] Hypothetical suggestions for planner, indexing improvement

From
"Jim C. Nasby"
Date:
On Tue, May 06, 2003 at 09:45:07AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > It would be nice to add support for multi-column IN..
> > WHERE (a, b, c) IN (SELECT a, b, c ...)
>
> RTFM...

As someone pointed out, the documentation says you can't. In this case
the docs are wrong (I've added a note).

> > BTW, does postgresql handle IN and EXISTS differently?
>
> Yes.
>
> > Theoretically if the optimizer was good enough you could transform one
> > to the other and not worry about it.
>
> No.  They have different responses to NULLs in the subselect result.

They appear to operate the same... what's different?
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


"Jim C. Nasby" <jim@nasby.net> writes:
> On Tue, May 06, 2003 at 09:45:07AM -0400, Tom Lane wrote:
>> RTFM...

> As someone pointed out, the documentation says you can't. In this case
> the docs are wrong (I've added a note).

Perhaps you should have read to the end of the section.

>>> BTW, does postgresql handle IN and EXISTS differently?
>>
>> Yes.
>
> They appear to operate the same... what's different?

Supposing that tab1.col1 contains 1, NULL, 2, then for an outer
table row where col2 = 42

    WHERE outer.col2 IN (SELECT col1 FROM tab1)

will yield NULL (not FALSE).  But

    WHERE EXISTS(SELECT * FROM tab1 WHERE col1 = outer.col2)

will yield FALSE (not NULL).

The distinction doesn't matter at the top level of WHERE, but it
matters a lot underneath a NOT ...

            regards, tom lane


Re: [HACKERS] Hypothetical suggestions for planner, indexing

From
Dennis Björklund
Date:
On Wed, 7 May 2003, Tom Lane wrote:

> > As someone pointed out, the documentation says you can't. In this case
> > the docs are wrong (I've added a note).
>
> Perhaps you should have read to the end of the section.

Oh, you are right (of course). It is documented later on in that section.

I guess it's specified in the SQL spec, but how come a tuple is not an
expression? If it had been an expression the first part of that section
would still apply, which is why I just read the first part.

--
/Dennis


Re: [HACKERS] Hypothetical suggestions for planner, indexing improvement

From
"Jim C. Nasby"
Date:
> Supposing that tab1.col1 contains 1, NULL, 2, then for an outer
> table row where col2 = 42
>
>     WHERE outer.col2 IN (SELECT col1 FROM tab1)
>
> will yield NULL (not FALSE).  But
>
>     WHERE EXISTS(SELECT * FROM tab1 WHERE col1 = outer.col2)
>
> will yield FALSE (not NULL).
>
> The distinction doesn't matter at the top level of WHERE, but it
> matters a lot underneath a NOT ...

OK, but even if a true transform can't be done, couldn't they share the
same set of code to fetch the data for the subquery? Going back to my
original post, I tend to use IN only in cases where I think the subquery
will return a small result-set, and use EXISTS elsewhere. Presumably,
the subquery for an IN will only be run once, while EXISTS will be run
as an inner-loop (I'm guessing here, I could be wrong). It might be
useful if the subquery was executed based on how many rows it
would/might return.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: [HACKERS] Hypothetical suggestions for planner, indexing

From
Tom Lane
Date:
=?ISO-8859-1?Q?Dennis_Bj=F6rklund?= <db@zigo.dhs.org> writes:
> I guess it's specified in the SQL spec, but how come a tuple is not an
> expression? If it had been an expression the first part of that section
> would still apply, which is why I just read the first part.

Well, if you want to think of "(42, 'foo')" as being an expression then
I suppose it could be considered to be all one syntax.  Personally I
think that'd be more confusing rather than less so.

Another alternative is to put both syntax summaries at the top of the
subsection, and combine the two textual descriptions into one.

This is the wrong place for this discussion, though.  If you want to
have a go at rewriting the section, why not put up a sketch on
pgsql-docs and see if people like it?

            regards, tom lane