Thread: Optimisation of INTERSECT expressions

Optimisation of INTERSECT expressions

From
"Phil Endecott"
Date:
Dear PostgresQL Experts,

I am trying to get to the bottom of some efficiency problems and hope that
you can help.  The difficulty seems to be with INTERSECT expressions.

I have a query of the form
     select A from T where C1 intersect select A from T where C2;
It runs in about 100 ms.

But it is equivalent to this query
     select A from T where C1 and C2;
which runs in less than 10 ms.

Looking at the output of "explain analyse" on the first query, it seems
that PostgresQL always computes the two sub-expressions and then computes
an explicit intersection on the results.  I had hoped that it would notice
that both subexpressions are scanning the same input table T and convert
the expression to the second form.

Is there a reason why it can't do this transformation?

(Of course, I could just re-write my code to use the second form, but my
application is generating these bits of SQL programmatically, and it is not
trivial as in some cases the two tables are not the same so an intersection
really is needed; if PostgresQL can do it for me, that would be much
better.  I don't want to write an SQL parser!)


While studying the same code I found another case where my INTERSECT
expressions don't seem to be optimised as much as I'd like.  In this case,
one of the subexpressions being intersected is empty much of the time.  But
even when it is empty, PostgresQL computes the other (expensive)
subexpression and does an intersect.  Could PostgresQL do something like this:

- guess which subexpression is likely to produce fewest rows
- compute this subexpression
- if empty, return now with an empty result
- compute other subexpression
- compute intersection
- return intersection

Alternatively, it could be defined that the left subexpression is always
computed first and the second not computed if it is empty, like the
behaviour of logical AND and OR operators in C.

Thanks in advance for any suggestions.

--Phil.


Re: Optimisation of INTERSECT expressions

From
Stephan Szabo
Date:
On Tue, 23 Mar 2004, Phil Endecott wrote:

> Dear PostgresQL Experts,
>
> I am trying to get to the bottom of some efficiency problems and hope that
> you can help.  The difficulty seems to be with INTERSECT expressions.
>
> I have a query of the form
>      select A from T where C1 intersect select A from T where C2;
> It runs in about 100 ms.
>
> But it is equivalent to this query
>      select A from T where C1 and C2;
> which runs in less than 10 ms.
>
> Looking at the output of "explain analyse" on the first query, it seems
> that PostgresQL always computes the two sub-expressions and then computes
> an explicit intersection on the results.  I had hoped that it would notice
> that both subexpressions are scanning the same input table T and convert
> the expression to the second form.
>
> Is there a reason why it can't do this transformation?

Probably because noone's bothered to try to prove under what conditions
it's the same.

For example, given a non-unique A, the two queries can give different
answers (if say the same two A values match both C1 and C2 in different
rows how many output rows does each give? *), also given a non-stable A
(for example random) the two queries are not necessarily equivalent.

Re: Optimisation of INTERSECT expressions

From
Stephan Szabo
Date:
On Tue, 23 Mar 2004, Stephan Szabo wrote:

> On Tue, 23 Mar 2004, Phil Endecott wrote:
>
> > Dear PostgresQL Experts,
> >
> > I am trying to get to the bottom of some efficiency problems and hope that
> > you can help.  The difficulty seems to be with INTERSECT expressions.
> >
> > I have a query of the form
> >      select A from T where C1 intersect select A from T where C2;
> > It runs in about 100 ms.
> >
> > But it is equivalent to this query
> >      select A from T where C1 and C2;
> > which runs in less than 10 ms.
> >
> > Looking at the output of "explain analyse" on the first query, it seems
> > that PostgresQL always computes the two sub-expressions and then computes
> > an explicit intersection on the results.  I had hoped that it would notice
> > that both subexpressions are scanning the same input table T and convert
> > the expression to the second form.
> >
> > Is there a reason why it can't do this transformation?
>
> Probably because noone's bothered to try to prove under what conditions
> it's the same.
>
> For example, given a non-unique A, the two queries can give different
> answers (if say the same two A values match both C1 and C2 in different
> rows how many output rows does each give? *), also given a non-stable A
> (for example random) the two queries are not necessarily equivalent.

Ugh, the example got trimmed out for the *

Given a non-unique A, C1 as B>5, c2 as C>5 and the data:
A | B | C
1 | 6 | 1
1 | 1 | 6

The intersect gives 1 row, the and query gives 0 AFAICS.


Re: Optimisation of INTERSECT expressions

From
Tom Lane
Date:
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
> Given a non-unique A, C1 as B>5, c2 as C>5 and the data:
> A | B | C
> 1 | 6 | 1
> 1 | 1 | 6
> The intersect gives 1 row, the and query gives 0 AFAICS.

Another way that the queries are not equivalent is that INTERSECT is
defined to remove duplicate output rows (much like DISTINCT) whereas
the AND form of course won't do that.

            regards, tom lane

Re: Optimisation of INTERSECT expressions

From
"Phil Endecott"
Date:
I asked:
> select A from T where C1 intersect select A from T where C2;
> select A from T where C1 and C2;
> [why isn't the first optimised into the second?]

Stephan Szabo answered:
> Given a non-unique A, C1 as B>5, c2 as C>5 and the data:
> A | B | C
> 1 | 6 | 1
> 1 | 1 | 6
> The intersect gives 1 row, the and query gives 0 AFAICS.

Tom Lane answered:
> Another way that the queries are not equivalent is that INTERSECT is
> defined to remove duplicate output rows (much like DISTINCT) whereas
> the AND form of course won't do that.

Thanks!  In my case the attribute A is unique - it is the primary key - and
I hadn't considered the more general case properly.

So I suppose I'll have to find a more sophisticated way to generate my
queries.  Imagine a user interface for a search facility with various
buttons and text entry fields.  At the moment, for each part of the search
that the user has enabled I create a string of SQL.  I then compose them
into a single statement using INTERSECT.  Each sub-query always returns the
same attribute, but to make things complicated they may come from different
tables.  It now seems that I'll have to merge the queries more thoroughly.
 Does anyone have any suggestions about how to do this?  I'd like a nice
general technique that works for all possible subqueries, as my current
composition with INTERSECT does.


Any thoughts on my other question about empty intersections?

Thanks again for the feedback.

--Phil.

Re: Optimisation of INTERSECT expressions

From
Bruno Wolff III
Date:
On Tue, Mar 23, 2004 at 11:21:39 -0500,
  Phil Endecott <spam_from_postgresql_lists@chezphil.org> wrote:
>  Does anyone have any suggestions about how to do this?  I'd like a nice
> general technique that works for all possible subqueries, as my current
> composition with INTERSECT does.

One adjustment you might make is using INTERSECT ALL if you know there
can't be duplicates. Then time won't be wasted trying to remove duplicates.

Re: Optimisation of INTERSECT expressions

From
Josh Berkus
Date:
Phil,

> So I suppose I'll have to find a more sophisticated way to generate my
> queries.  Imagine a user interface for a search facility with various
> buttons and text entry fields.  At the moment, for each part of the search
> that the user has enabled I create a string of SQL.  I then compose them
> into a single statement using INTERSECT.  Each sub-query always returns the
> same attribute, but to make things complicated they may come from different
> tables.  It now seems that I'll have to merge the queries more thoroughly.
>  Does anyone have any suggestions about how to do this?  I'd like a nice
> general technique that works for all possible subqueries, as my current
> composition with INTERSECT does.

I've done this but it involves a choice between a lot of infrastrucure for
fully configurable queries, or limiting user choice.    The former option
requires that you construct reference tables holding what search fields are
available, what kind of values they hold, and what operators to use while
querying, as well as a table storing the joins used for the various tables
that can be queried.

Based on that, you can construct dynamically a query on any field or combo of
fields listed in your reference tables.

If search options are more constrained, you can simply take the easier path of
hard-coding the query building blocks into a set-returning function.   I do
this all the time for Web search interfaces, where the user only has about 9
things to search on.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Optimisation of INTERSECT expressions

From
Tom Lane
Date:
Bruno Wolff III <bruno@wolff.to> writes:
> On Tue, Mar 23, 2004 at 11:21:39 -0500,
>   Phil Endecott <spam_from_postgresql_lists@chezphil.org> wrote:
>> Does anyone have any suggestions about how to do this?  I'd like a nice
>> general technique that works for all possible subqueries, as my current
>> composition with INTERSECT does.

> One adjustment you might make is using INTERSECT ALL if you know there
> can't be duplicates. Then time won't be wasted trying to remove duplicates.

Actually, I don't think that will help.  UNION ALL is much faster than
UNION, because it doesn't have to match up duplicates, but INTERSECT
and EXCEPT still have to match rows from the inputs in order to find
out if they should emit a row at all.  IIRC there will not be any
noticeable speed difference with or without ALL.

AFAICS, what Phil will want to do is

    SELECT a FROM table1 WHERE cond11 AND cond12 AND ...
    INTERSECT
    SELECT a FROM table2 WHERE cond21 AND cond22 AND ...
    INTERSECT
    ...

which is more painful to assemble than his current approach, but it
shouldn't be *that* bad --- you just need to tag each condition with the
table it applies to, and bring together matching tags when you build the
SQL string.

            regards, tom lane