Thread: Optimisation of INTERSECT expressions
Dear PostgresQL Experts, I am trying to get to the bottom of some efficiency problems and hope that you can help. The difficulty seems to be with INTERSECT expressions. I have a query of the form select A from T where C1 intersect select A from T where C2; It runs in about 100 ms. But it is equivalent to this query select A from T where C1 and C2; which runs in less than 10 ms. Looking at the output of "explain analyse" on the first query, it seems that PostgresQL always computes the two sub-expressions and then computes an explicit intersection on the results. I had hoped that it would notice that both subexpressions are scanning the same input table T and convert the expression to the second form. Is there a reason why it can't do this transformation? (Of course, I could just re-write my code to use the second form, but my application is generating these bits of SQL programmatically, and it is not trivial as in some cases the two tables are not the same so an intersection really is needed; if PostgresQL can do it for me, that would be much better. I don't want to write an SQL parser!) While studying the same code I found another case where my INTERSECT expressions don't seem to be optimised as much as I'd like. In this case, one of the subexpressions being intersected is empty much of the time. But even when it is empty, PostgresQL computes the other (expensive) subexpression and does an intersect. Could PostgresQL do something like this: - guess which subexpression is likely to produce fewest rows - compute this subexpression - if empty, return now with an empty result - compute other subexpression - compute intersection - return intersection Alternatively, it could be defined that the left subexpression is always computed first and the second not computed if it is empty, like the behaviour of logical AND and OR operators in C. Thanks in advance for any suggestions. --Phil.
On Tue, 23 Mar 2004, Phil Endecott wrote: > Dear PostgresQL Experts, > > I am trying to get to the bottom of some efficiency problems and hope that > you can help. The difficulty seems to be with INTERSECT expressions. > > I have a query of the form > select A from T where C1 intersect select A from T where C2; > It runs in about 100 ms. > > But it is equivalent to this query > select A from T where C1 and C2; > which runs in less than 10 ms. > > Looking at the output of "explain analyse" on the first query, it seems > that PostgresQL always computes the two sub-expressions and then computes > an explicit intersection on the results. I had hoped that it would notice > that both subexpressions are scanning the same input table T and convert > the expression to the second form. > > Is there a reason why it can't do this transformation? Probably because noone's bothered to try to prove under what conditions it's the same. For example, given a non-unique A, the two queries can give different answers (if say the same two A values match both C1 and C2 in different rows how many output rows does each give? *), also given a non-stable A (for example random) the two queries are not necessarily equivalent.
On Tue, 23 Mar 2004, Stephan Szabo wrote: > On Tue, 23 Mar 2004, Phil Endecott wrote: > > > Dear PostgresQL Experts, > > > > I am trying to get to the bottom of some efficiency problems and hope that > > you can help. The difficulty seems to be with INTERSECT expressions. > > > > I have a query of the form > > select A from T where C1 intersect select A from T where C2; > > It runs in about 100 ms. > > > > But it is equivalent to this query > > select A from T where C1 and C2; > > which runs in less than 10 ms. > > > > Looking at the output of "explain analyse" on the first query, it seems > > that PostgresQL always computes the two sub-expressions and then computes > > an explicit intersection on the results. I had hoped that it would notice > > that both subexpressions are scanning the same input table T and convert > > the expression to the second form. > > > > Is there a reason why it can't do this transformation? > > Probably because noone's bothered to try to prove under what conditions > it's the same. > > For example, given a non-unique A, the two queries can give different > answers (if say the same two A values match both C1 and C2 in different > rows how many output rows does each give? *), also given a non-stable A > (for example random) the two queries are not necessarily equivalent. Ugh, the example got trimmed out for the * Given a non-unique A, C1 as B>5, c2 as C>5 and the data: A | B | C 1 | 6 | 1 1 | 1 | 6 The intersect gives 1 row, the and query gives 0 AFAICS.
Stephan Szabo <sszabo@megazone.bigpanda.com> writes: > Given a non-unique A, C1 as B>5, c2 as C>5 and the data: > A | B | C > 1 | 6 | 1 > 1 | 1 | 6 > The intersect gives 1 row, the and query gives 0 AFAICS. Another way that the queries are not equivalent is that INTERSECT is defined to remove duplicate output rows (much like DISTINCT) whereas the AND form of course won't do that. regards, tom lane
I asked: > select A from T where C1 intersect select A from T where C2; > select A from T where C1 and C2; > [why isn't the first optimised into the second?] Stephan Szabo answered: > Given a non-unique A, C1 as B>5, c2 as C>5 and the data: > A | B | C > 1 | 6 | 1 > 1 | 1 | 6 > The intersect gives 1 row, the and query gives 0 AFAICS. Tom Lane answered: > Another way that the queries are not equivalent is that INTERSECT is > defined to remove duplicate output rows (much like DISTINCT) whereas > the AND form of course won't do that. Thanks! In my case the attribute A is unique - it is the primary key - and I hadn't considered the more general case properly. So I suppose I'll have to find a more sophisticated way to generate my queries. Imagine a user interface for a search facility with various buttons and text entry fields. At the moment, for each part of the search that the user has enabled I create a string of SQL. I then compose them into a single statement using INTERSECT. Each sub-query always returns the same attribute, but to make things complicated they may come from different tables. It now seems that I'll have to merge the queries more thoroughly. Does anyone have any suggestions about how to do this? I'd like a nice general technique that works for all possible subqueries, as my current composition with INTERSECT does. Any thoughts on my other question about empty intersections? Thanks again for the feedback. --Phil.
On Tue, Mar 23, 2004 at 11:21:39 -0500, Phil Endecott <spam_from_postgresql_lists@chezphil.org> wrote: > Does anyone have any suggestions about how to do this? I'd like a nice > general technique that works for all possible subqueries, as my current > composition with INTERSECT does. One adjustment you might make is using INTERSECT ALL if you know there can't be duplicates. Then time won't be wasted trying to remove duplicates.
Phil, > So I suppose I'll have to find a more sophisticated way to generate my > queries. Imagine a user interface for a search facility with various > buttons and text entry fields. At the moment, for each part of the search > that the user has enabled I create a string of SQL. I then compose them > into a single statement using INTERSECT. Each sub-query always returns the > same attribute, but to make things complicated they may come from different > tables. It now seems that I'll have to merge the queries more thoroughly. > Does anyone have any suggestions about how to do this? I'd like a nice > general technique that works for all possible subqueries, as my current > composition with INTERSECT does. I've done this but it involves a choice between a lot of infrastrucure for fully configurable queries, or limiting user choice. The former option requires that you construct reference tables holding what search fields are available, what kind of values they hold, and what operators to use while querying, as well as a table storing the joins used for the various tables that can be queried. Based on that, you can construct dynamically a query on any field or combo of fields listed in your reference tables. If search options are more constrained, you can simply take the easier path of hard-coding the query building blocks into a set-returning function. I do this all the time for Web search interfaces, where the user only has about 9 things to search on. -- -Josh Berkus Aglio Database Solutions San Francisco
Bruno Wolff III <bruno@wolff.to> writes: > On Tue, Mar 23, 2004 at 11:21:39 -0500, > Phil Endecott <spam_from_postgresql_lists@chezphil.org> wrote: >> Does anyone have any suggestions about how to do this? I'd like a nice >> general technique that works for all possible subqueries, as my current >> composition with INTERSECT does. > One adjustment you might make is using INTERSECT ALL if you know there > can't be duplicates. Then time won't be wasted trying to remove duplicates. Actually, I don't think that will help. UNION ALL is much faster than UNION, because it doesn't have to match up duplicates, but INTERSECT and EXCEPT still have to match rows from the inputs in order to find out if they should emit a row at all. IIRC there will not be any noticeable speed difference with or without ALL. AFAICS, what Phil will want to do is SELECT a FROM table1 WHERE cond11 AND cond12 AND ... INTERSECT SELECT a FROM table2 WHERE cond21 AND cond22 AND ... INTERSECT ... which is more painful to assemble than his current approach, but it shouldn't be *that* bad --- you just need to tag each condition with the table it applies to, and bring together matching tags when you build the SQL string. regards, tom lane