Thread: Hypothetical suggestions for planner, indexing improvement

Hypothetical suggestions for planner, indexing improvement

From
Josh Berkus
Date:
Folks,

An area in which postgresql planner & indexing could be improved have occurred
to me over the last week.    I'd like to share this ideas with you in case it
is worthy of the todo list.

Please excuse me if this issue is already dealt with in CVS; I've been unable
to keep up completely on HACKERS lately.   Please also excuse me if this
issue has been discussed and was tabled due to some theoretical limitation,
such as x^n scaling problems

THE IDEA:  The planner should keep statistics on the correlation of foreign
keys and apply them to the expected row counts for EXISTS clause limitations,
and possibly for other query types as well.

To illustrate:
Database "calendar" has two tables, events and event_days.
Event_days has FK on column event_id to parent table Events.
There is at lease one record in event_days for each record in events, and the
average parent-child relationship is 1 event -> 1.15 event_days records.

This query:
SELECT events.* FROM events
WHERE EXISTS (SELECT event_id FROM event_days
    WHERE event_day BETWEEN '2003-04-08' AND '2003-05-18');

Currently, (in 7.2.4 and 7.3.1) the planner makes the assumption that the
above EXISTS restriction will only filter events by 50% and makes other join
and execution plans accordingly.   In fact, it filters events by 96% and the
ideal execution plan should be quite different.

It would be really keen if planner statistics could be expanded to include
correlation on foriegn keys in order to make more intelligent planner
decisions on the above type of query possible.

Thanks for your attention!


--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Hypothetical suggestions for planner, indexing improvement

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> THE IDEA:  The planner should keep statistics on the correlation of foreign
> keys and apply them to the expected row counts for EXISTS clause limitations,
> and possibly for other query types as well.

It's a thought.  Keeping complete cross-column correlation stats (for
every combination of columns in the DB) is obviously out of the
question.  If you're gonna do it you need a heuristic to tell you which
combinations of columns are worth keeping track of --- and foreign-key
relationships seem like a reasonable guide to the interesting
combinations.

I'm not sure about the long-term usefulness of optimizing EXISTS per se.
Seems to me that a lot of the present uses of EXISTS are workarounds
for Postgres' historic mistreatment of IN ... which we've attacked more
directly for 7.4.  But cross-column correlations are certainly useful
for estimating join sizes in general.

            regards, tom lane


Re: Hypothetical suggestions for planner, indexing improvement

From
Josh Berkus
Date:
Tom,

> It's a thought.  Keeping complete cross-column correlation stats (for
> every combination of columns in the DB) is obviously out of the
> question.  If you're gonna do it you need a heuristic to tell you which
> combinations of columns are worth keeping track of --- and foreign-key
> relationships seem like a reasonable guide to the interesting
> combinations.

Yes.  It would also make FKs something more than just an annoying (and slow)
constraint in PostgreSQL.   And it would be a performance feature that most
other RDBMSs don't have ;-)

> I'm not sure about the long-term usefulness of optimizing EXISTS per se.
> Seems to me that a lot of the present uses of EXISTS are workarounds
> for Postgres' historic mistreatment of IN ... which we've attacked more
> directly for 7.4.  But cross-column correlations are certainly useful
> for estimating join sizes in general.

EXISTS is more flexible than IN; how can you do a 3-column corellation on an
IN clause?

The reason that I mention EXISTS is because that's where the lack of
cross-column corellation is most dramatic; the planner seems to estimate a
flat 50% for EXISTS clauses regardless of the content.

--
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Hypothetical suggestions for planner, indexing improvement

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> The reason that I mention EXISTS is because that's where the lack of
> cross-column corellation is most dramatic; the planner seems to estimate a
> flat 50% for EXISTS clauses regardless of the content.

No "seems to" about that one: see src/backend/optimizer/path/clausesel.c

    else if (is_subplan(clause))
    {
        /*
         * Just for the moment! FIX ME! - vadim 02/04/98
         */
        s1 = (Selectivity) 0.5;
    }

Patches to improve this are welcome ;-).  But I'm not at all sure how to
write something that would extract a reliable selectivity estimate from
a subplan.

            regards, tom lane


"Jim C. Nasby" <jim@nasby.net> writes:
> It would be nice to add support for multi-column IN..
> WHERE (a, b, c) IN (SELECT a, b, c ...)

RTFM...

> BTW, does postgresql handle IN and EXISTS differently?

Yes.

> Theoretically if the optimizer was good enough you could transform one
> to the other and not worry about it.

No.  They have different responses to NULLs in the subselect result.

            regards, tom lane


Re: Hypothetical suggestions for planner, indexing improvement

From
"Zeugswetter Andreas SB SD"
Date:
> > The reason that I mention EXISTS is because that's where the lack of
> > cross-column corellation is most dramatic; the planner seems to estimate a
> > flat 50% for EXISTS clauses regardless of the content.
>
> No "seems to" about that one: see
> src/backend/optimizer/path/clausesel.c
>
>     else if (is_subplan(clause))
>     {
>         /*
>          * Just for the moment! FIX ME! - vadim 02/04/98
>          */
>         s1 = (Selectivity) 0.5;
>     }

I think the main issue cannot be the correlation in this case.
In this FK PK case an events row exists for each row in the subselect,
so the correlation must be above 1, how much above is irrelevant for the
selectivity estimate (it is only relevant in the below 1 cases).
The selectivity in this case can be estimated by estimating the number of rows
returned from the subselect where the (in the example missing) PK-FK join condition
is removed (here: select distinct event_id from event_day where event_day BETWEEN
'2003-04-08' AND '2003-05-18'). (selectivity = min (1, e. rows of subselect / e. rows of
main select))

The "event_day BETWEEN '2003-04-08' AND '2003-05-18'" is what really reduces the
result set here, and that is not used.

Andreas

PS: in the example the subselect join clause event_id=events.event_id is missing



Re: Hypothetical suggestions for planner, indexing improvement

From
Josh Berkus
Date:
Andreas,

> PS: in the example the subselect join clause event_id=events.event_id is
> missing

Sorry.  I was cutting-and-pasting to reduce the complexity of the query.   The
full query is actually posted on -PERFORMANCE, as is a workaround to the
"exists selectivity problem".

--
Josh Berkus
Aglio Database Solutions
San Francisco



Re: Hypothetical suggestions for planner, indexing

From
Robert Treat
Date:
On Tue, 2003-05-06 at 00:45, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > The reason that I mention EXISTS is because that's where the lack of 
> > cross-column corellation is most dramatic; the planner seems to estimate a 
> > flat 50% for EXISTS clauses regardless of the content.
> 
> No "seems to" about that one: see src/backend/optimizer/path/clausesel.c
> 
>     else if (is_subplan(clause))
>     {
>         /*
>          * Just for the moment! FIX ME! - vadim 02/04/98
>          */
>         s1 = (Selectivity) 0.5;
>     }
> 
> Patches to improve this are welcome ;-).  But I'm not at all sure how to
> write something that would extract a reliable selectivity estimate from
> a subplan.
> 

given that we have so few GUC variables... 

would there be any merit in adding one that would allow folks to change
this assumption? 

Robert Treat



Re: [PERFORM] Hypothetical suggestions for planner, indexing

From
Dennis Björklund
Date:
On Tue, 6 May 2003, Tom Lane wrote:

> > It would be nice to add support for multi-column IN..
> > WHERE (a, b, c) IN (SELECT a, b, c ...)
>
> RTFM...

Maybe he did read the manual:

6.15.3. IN (subquery form)

expression IN (subquery)

The right-hand side of this form of IN is a parenthesized subquery, which
must return exactly one column.

--
/Dennis


Re: [PERFORM] Hypothetical suggestions for planner, indexing improvement

From
"Jim C. Nasby"
Date:
On Tue, May 06, 2003 at 09:45:07AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > It would be nice to add support for multi-column IN..
> > WHERE (a, b, c) IN (SELECT a, b, c ...)
>
> RTFM...

As someone pointed out, the documentation says you can't. In this case
the docs are wrong (I've added a note).

> > BTW, does postgresql handle IN and EXISTS differently?
>
> Yes.
>
> > Theoretically if the optimizer was good enough you could transform one
> > to the other and not worry about it.
>
> No.  They have different responses to NULLs in the subselect result.

They appear to operate the same... what's different?
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


"Jim C. Nasby" <jim@nasby.net> writes:
> On Tue, May 06, 2003 at 09:45:07AM -0400, Tom Lane wrote:
>> RTFM...

> As someone pointed out, the documentation says you can't. In this case
> the docs are wrong (I've added a note).

Perhaps you should have read to the end of the section.

>>> BTW, does postgresql handle IN and EXISTS differently?
>>
>> Yes.
>
> They appear to operate the same... what's different?

Supposing that tab1.col1 contains 1, NULL, 2, then for an outer
table row where col2 = 42

    WHERE outer.col2 IN (SELECT col1 FROM tab1)

will yield NULL (not FALSE).  But

    WHERE EXISTS(SELECT * FROM tab1 WHERE col1 = outer.col2)

will yield FALSE (not NULL).

The distinction doesn't matter at the top level of WHERE, but it
matters a lot underneath a NOT ...

            regards, tom lane


Re: [PERFORM] Hypothetical suggestions for planner, indexing

From
Dennis Björklund
Date:
On Wed, 7 May 2003, Tom Lane wrote:

> > As someone pointed out, the documentation says you can't. In this case
> > the docs are wrong (I've added a note).
>
> Perhaps you should have read to the end of the section.

Oh, you are right (of course). It is documented later on in that section.

I guess it's specified in the SQL spec, but how come a tuple is not an
expression? If it had been an expression the first part of that section
would still apply, which is why I just read the first part.

--
/Dennis


Re: [PERFORM] Hypothetical suggestions for planner, indexing improvement

From
"Jim C. Nasby"
Date:
> Supposing that tab1.col1 contains 1, NULL, 2, then for an outer
> table row where col2 = 42
>
>     WHERE outer.col2 IN (SELECT col1 FROM tab1)
>
> will yield NULL (not FALSE).  But
>
>     WHERE EXISTS(SELECT * FROM tab1 WHERE col1 = outer.col2)
>
> will yield FALSE (not NULL).
>
> The distinction doesn't matter at the top level of WHERE, but it
> matters a lot underneath a NOT ...

OK, but even if a true transform can't be done, couldn't they share the
same set of code to fetch the data for the subquery? Going back to my
original post, I tend to use IN only in cases where I think the subquery
will return a small result-set, and use EXISTS elsewhere. Presumably,
the subquery for an IN will only be run once, while EXISTS will be run
as an inner-loop (I'm guessing here, I could be wrong). It might be
useful if the subquery was executed based on how many rows it
would/might return.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"


Re: [PERFORM] Hypothetical suggestions for planner, indexing

From
Tom Lane
Date:
=?ISO-8859-1?Q?Dennis_Bj=F6rklund?= <db@zigo.dhs.org> writes:
> I guess it's specified in the SQL spec, but how come a tuple is not an
> expression? If it had been an expression the first part of that section
> would still apply, which is why I just read the first part.

Well, if you want to think of "(42, 'foo')" as being an expression then
I suppose it could be considered to be all one syntax.  Personally I
think that'd be more confusing rather than less so.

Another alternative is to put both syntax summaries at the top of the
subsection, and combine the two textual descriptions into one.

This is the wrong place for this discussion, though.  If you want to
have a go at rewriting the section, why not put up a sketch on
pgsql-docs and see if people like it?

            regards, tom lane