Thread: Overhead of union versus union all
I am combining query results that I know are disjoint. I'm wondering how much overhead there is in calling union versus union all. (Just curious really; I can't see a reason not to use union all.) (cc me please; not subscribed...) THK -- Timothy H. Keitt http://www.keittlab.org/
Tim Keitt wrote: > I am combining query results that I know are disjoint. I'm wondering > how much overhead there is in calling union versus union all. (Just > curious really; I can't see a reason not to use union all.) UNION needs to uniquify the output, for which it plasters an additional sort step, whereas UNION ALL does not need to uniquify its output and thus it can avoid the sort step. Using UNION ALL is recommended wherever possible. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Tim Keitt wrote: > I am combining query results that I know are disjoint. I'm wondering > how much overhead there is in calling union versus union all. (Just > curious really; I can't see a reason not to use union all.) (cc me > please; not subscribed...) > > THK > I think you can test this one yourself pretty easily. Just run the two queries with "explain analyze". Union All should run in about the sum of the separate queries. Plain Union will always be slower, because it takes the same results from "union all" and runs them through an extra sort/distinct or hash step. In my tests, on a query with 600,000 rows, the Plain Union took about 3x as long to complete.
Alvaro Herrera wrote: > Tim Keitt wrote: > > I am combining query results that I know are disjoint. I'm wondering > > how much overhead there is in calling union versus union all. (Just > > curious really; I can't see a reason not to use union all.) > > UNION needs to uniquify the output, for which it plasters an additional > sort step, whereas UNION ALL does not need to uniquify its output and > thus it can avoid the sort step. Using UNION ALL is recommended > wherever possible. Yep, ideally UNION ALL would be the default behavior, but that standard requires otherwise. Many people don't know that UNION has an extra SORT/UNIQUE step. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Alvaro Herrera wrote: > Tim Keitt wrote: >> I am combining query results that I know are disjoint. I'm wondering >> how much overhead there is in calling union versus union all. (Just >> curious really; I can't see a reason not to use union all.) > > UNION needs to uniquify the output, for which it plasters an additional > sort step, whereas UNION ALL does not need to uniquify its output and > thus it can avoid the sort step. Using UNION ALL is recommended > wherever possible. > I think I read somewhere that as of 8.4 it no longer required the sort step, due to the improvements in hashing. Here it is http://wiki.postgresql.org/wiki/WhatsNew84#Performance
Scott Bailey wrote: > Alvaro Herrera wrote: > > Tim Keitt wrote: > >> I am combining query results that I know are disjoint. I'm wondering > >> how much overhead there is in calling union versus union all. (Just > >> curious really; I can't see a reason not to use union all.) > > > > UNION needs to uniquify the output, for which it plasters an additional > > sort step, whereas UNION ALL does not need to uniquify its output and > > thus it can avoid the sort step. Using UNION ALL is recommended > > wherever possible. > > > > I think I read somewhere that as of 8.4 it no longer required the sort > step, due to the improvements in hashing. Here it is > > http://wiki.postgresql.org/wiki/WhatsNew84#Performance Oh, yea, hashing is used in some cases rather than sort. I assume sort is still used if the hash exceeds workmem size. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote: > Scott Bailey wrote: >> Alvaro Herrera wrote: >> > Tim Keitt wrote: >> >> I am combining query results that I know are disjoint. I'm wondering >> >> how much overhead there is in calling union versus union all. (Just >> >> curious really; I can't see a reason not to use union all.) >> > >> > UNION needs to uniquify the output, for which it plasters an additional >> > sort step, whereas UNION ALL does not need to uniquify its output and >> > thus it can avoid the sort step. Using UNION ALL is recommended >> > wherever possible. >> > >> >> I think I read somewhere that as of 8.4 it no longer required the sort >> step, due to the improvements in hashing. Here it is >> >> http://wiki.postgresql.org/wiki/WhatsNew84#Performance > > Oh, yea, hashing is used in some cases rather than sort. I assume sort > is still used if the hash exceeds workmem size. The important point being that it's still more expensive than a plain union all thought, right?
Scott Marlowe wrote: > On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote: > > Scott Bailey wrote: > >> Alvaro Herrera wrote: > >> > Tim Keitt wrote: > >> >> I am combining query results that I know are disjoint. I'm wondering > >> >> how much overhead there is in calling union versus union all. (Just > >> >> curious really; I can't see a reason not to use union all.) > >> > > >> > UNION needs to uniquify the output, for which it plasters an additional > >> > sort step, whereas UNION ALL does not need to uniquify its output and > >> > thus it can avoid the sort step. ?Using UNION ALL is recommended > >> > wherever possible. > >> > > >> > >> I think I read somewhere that as of 8.4 it no longer required the sort > >> step, due to the improvements in hashing. Here it is > >> > >> http://wiki.postgresql.org/wiki/WhatsNew84#Performance > > > > Oh, yea, hashing is used in some cases rather than sort. ?I assume sort > > is still used if the hash exceeds workmem size. > > The important point being that it's still more expensive than a plain > union all thought, right? Yep. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Thu, 2009-07-09 at 20:41 -0600, Scott Marlowe wrote: > On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote: > > Scott Bailey wrote: > >> Alvaro Herrera wrote: > >> > Tim Keitt wrote: > >> >> I am combining query results that I know are disjoint. I'm wondering > >> >> how much overhead there is in calling union versus union all. (Just > >> >> curious really; I can't see a reason not to use union all.) > >> > > >> > UNION needs to uniquify the output, for which it plasters an additional > >> > sort step, whereas UNION ALL does not need to uniquify its output and > >> > thus it can avoid the sort step. Using UNION ALL is recommended > >> > wherever possible. > >> > > >> I think I read somewhere that as of 8.4 it no longer required the sort > >> step, due to the improvements in hashing. Here it is > >> > >> http://wiki.postgresql.org/wiki/WhatsNew84#Performance > > > > Oh, yea, hashing is used in some cases rather than sort. I assume sort > > is still used if the hash exceeds workmem size. > > The important point being that it's still more expensive than a plain > union all thought, right? I think it should be possible to use predtest theorem proving to discard the sort/hash step in cases where we can prove the sets are disjoint. Often there are top-level quals that can be compared in the WHERE clauses of the sub-queries, so a shallow search could be quite profitable in allowing us to rewrite a UNION into a UNION ALL. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > > On Thu, 2009-07-09 at 20:41 -0600, Scott Marlowe wrote: > > On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote: > > > Scott Bailey wrote: > > >> Alvaro Herrera wrote: > > >> > Tim Keitt wrote: > > >> >> I am combining query results that I know are disjoint. I'm wondering > > >> >> how much overhead there is in calling union versus union all. (Just > > >> >> curious really; I can't see a reason not to use union all.) > > >> > > > >> > UNION needs to uniquify the output, for which it plasters an additional > > >> > sort step, whereas UNION ALL does not need to uniquify its output and > > >> > thus it can avoid the sort step. Using UNION ALL is recommended > > >> > wherever possible. > > >> > > > >> I think I read somewhere that as of 8.4 it no longer required the sort > > >> step, due to the improvements in hashing. Here it is > > >> > > >> http://wiki.postgresql.org/wiki/WhatsNew84#Performance > > > > > > Oh, yea, hashing is used in some cases rather than sort. I assume sort > > > is still used if the hash exceeds workmem size. > > > > The important point being that it's still more expensive than a plain > > union all thought, right? > > I think it should be possible to use predtest theorem proving to discard > the sort/hash step in cases where we can prove the sets are disjoint. > Often there are top-level quals that can be compared in the WHERE > clauses of the sub-queries, so a shallow search could be quite > profitable in allowing us to rewrite a UNION into a UNION ALL. I assume we would still need the distinct removal step; we just avoid the sort/hash. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote: > > I think it should be possible to use predtest theorem proving to > discard > > the sort/hash step in cases where we can prove the sets are > disjoint. > > Often there are top-level quals that can be compared in the WHERE > > clauses of the sub-queries, so a shallow search could be quite > > profitable in allowing us to rewrite a UNION into a UNION ALL. > > I assume we would still need the distinct removal step; we just avoid > the sort/hash. I mean it seems possible to prove that the distinct removal step is not necessary, by proving that the various sub-queries are already disjoint. It's a common manual optimization, so automating it seems a reasonable future goal. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > > On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote: > > > > I think it should be possible to use predtest theorem proving to > > discard > > > the sort/hash step in cases where we can prove the sets are > > disjoint. > > > Often there are top-level quals that can be compared in the WHERE > > > clauses of the sub-queries, so a shallow search could be quite > > > profitable in allowing us to rewrite a UNION into a UNION ALL. > > > > I assume we would still need the distinct removal step; we just avoid > > the sort/hash. > > I mean it seems possible to prove that the distinct removal step is not > necessary, by proving that the various sub-queries are already disjoint. > It's a common manual optimization, so automating it seems a reasonable > future goal. I am confused what sub-queries produce _distinct_ output. I know there are some that produce _ordered_ output. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, 2009-07-10 at 09:28 -0400, Bruce Momjian wrote: > Simon Riggs wrote: > > > > On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote: > > > > > > I think it should be possible to use predtest theorem proving to > > > discard > > > > the sort/hash step in cases where we can prove the sets are > > > disjoint. > > > > Often there are top-level quals that can be compared in the WHERE > > > > clauses of the sub-queries, so a shallow search could be quite > > > > profitable in allowing us to rewrite a UNION into a UNION ALL. > > > > > > I assume we would still need the distinct removal step; we just avoid > > > the sort/hash. > > > > I mean it seems possible to prove that the distinct removal step is not > > necessary, by proving that the various sub-queries are already disjoint. > > It's a common manual optimization, so automating it seems a reasonable > > future goal. > > I am confused what sub-queries produce _distinct_ output. I know there > are some that produce _ordered_ output. None, that was not my point. If you have a query like this Select ..., status, ... ... where status = '1' union Select ..., status, ... ... where status = '2'; or a query like this Select '1', ... ... union Select '2', ... ... ; or a query like this Select '1', ... ... union Select status, ... ... where status != '1'; ; then it is clear that we could automatically prove that the the distinct step is redundant and so we could either hash or sort. This is the same as replacing the UNION with UNION ALL. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs wrote: > or a query like this > > Select '1', ... > ... > union > Select status, ... > ... > where status != '1'; > ; > > then it is clear that we could automatically prove that the the distinct > step is redundant and so we could either hash or sort. This is the same > as replacing the UNION with UNION ALL. In the last example, how do you know that status != '1' produces unique output? I assumed UNION gave distinct for the entire output, not just remove duplicates from the two UNION branches; that's how Postgres behaves now: test=> SELECT 1 UNION (SELECT 2 UNION ALL SELECT 2); ?column? ---------- 1 2 (2 rows) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Fri, 2009-07-10 at 09:46 -0400, Bruce Momjian wrote: > Simon Riggs wrote: > > or a query like this > > > > Select '1', ... > > ... > > union > > Select status, ... > > ... > > where status != '1'; > > ; > > > > then it is clear that we could automatically prove that the the distinct > > step is redundant and so we could either hash or sort. This is the same > > as replacing the UNION with UNION ALL. > > In the last example, how do you know that status != '1' produces unique > output? You don't. I was assuming that you could already prove that each subquery was distinct in itself. It's one for the TODO, that's all. I see it often, but I'm not planning to work on the code for this myself. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2009-07-10 at 14:22 +0100, Simon Riggs wrote: > I mean it seems possible to prove that the distinct removal step is not > necessary, by proving that the various sub-queries are already disjoint. > It's a common manual optimization, so automating it seems a reasonable > future goal. There are even simpler cases that postgresql can't optimize. Consider: -- foo has a primary key SELECT * FROM foo UNION SELECT * FROM foo; That's logically equivalent to: SELECT * FROM foo; But postgresql will add a sort anyway. There are lots of optimizations along these lines. They seem obscure, but these optimizations become much more useful when using views or complex queries where the same table appears multiple times. For instance, if you have two views that are projections of the same table, then, you join the views together, you can optimize away the join in some cases, and just scan the original table. I think a lot of these optimizations depend on knowing which tables (or subqueries) are relations in the relational theory sense; i.e. unordered, distinct, and have no NULLs in the relevant attributes. Regards, Jeff Davis
On Fri, Jul 10, 2009 at 6:37 PM, Jeff Davis<pgsql@j-davis.com> wrote: > > -- foo has a primary key > SELECT * FROM foo UNION SELECT * FROM foo; > > That's logically equivalent to: > > SELECT * FROM foo; > > But postgresql will add a sort anyway. Well no, it's equivalent to SELECT DISTINCT * FROM foo; -- greg http://mit.edu/~gsstark/resume.pdf
On Fri, 2009-07-10 at 18:47 +0100, Greg Stark wrote: > > -- foo has a primary key > Well no, it's equivalent to SELECT DISTINCT * FROM foo; I think you missed that "foo" has a primary key. Regards, Jeff Davis
On Fri, Jul 10, 2009 at 11:47 AM, Greg Stark<gsstark@mit.edu> wrote: > On Fri, Jul 10, 2009 at 6:37 PM, Jeff Davis<pgsql@j-davis.com> wrote: >> >> -- foo has a primary key >> SELECT * FROM foo UNION SELECT * FROM foo; >> >> That's logically equivalent to: >> >> SELECT * FROM foo; >> >> But postgresql will add a sort anyway. > > > Well no, it's equivalent to SELECT DISTINCT * FROM foo; And honestly, I'd rather see development effort go into making complex queries run faster (the things like bitmap indexes on disk etc) rather than optimising things that i can optimise myself.