Thread: Overhead of union versus union all

Overhead of union versus union all

From
Tim Keitt
Date:
I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.) (cc me
please; not subscribed...)

THK

--
Timothy H. Keitt
http://www.keittlab.org/

Re: Overhead of union versus union all

From
Alvaro Herrera
Date:
Tim Keitt wrote:
> I am combining query results that I know are disjoint. I'm wondering
> how much overhead there is in calling union versus union all. (Just
> curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step.  Using UNION ALL is recommended
wherever possible.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Overhead of union versus union all

From
Adam Rich
Date:
Tim Keitt wrote:
> I am combining query results that I know are disjoint. I'm wondering
> how much overhead there is in calling union versus union all. (Just
> curious really; I can't see a reason not to use union all.) (cc me
> please; not subscribed...)
>
> THK
>


I think you can test this one yourself pretty easily.  Just run the two
queries with "explain analyze".  Union All should run in about the sum
of the separate queries.  Plain Union will always be slower, because it
takes the same results from "union all" and runs them through an extra
sort/distinct or hash step.  In my tests, on a query with 600,000 rows,
the Plain Union took about 3x as long to complete.



Re: Overhead of union versus union all

From
Bruce Momjian
Date:
Alvaro Herrera wrote:
> Tim Keitt wrote:
> > I am combining query results that I know are disjoint. I'm wondering
> > how much overhead there is in calling union versus union all. (Just
> > curious really; I can't see a reason not to use union all.)
>
> UNION needs to uniquify the output, for which it plasters an additional
> sort step, whereas UNION ALL does not need to uniquify its output and
> thus it can avoid the sort step.  Using UNION ALL is recommended
> wherever possible.

Yep, ideally UNION ALL would be the default behavior, but that standard
requires otherwise.  Many people don't know that UNION has an extra
SORT/UNIQUE step.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Overhead of union versus union all

From
Scott Bailey
Date:
Alvaro Herrera wrote:
> Tim Keitt wrote:
>> I am combining query results that I know are disjoint. I'm wondering
>> how much overhead there is in calling union versus union all. (Just
>> curious really; I can't see a reason not to use union all.)
>
> UNION needs to uniquify the output, for which it plasters an additional
> sort step, whereas UNION ALL does not need to uniquify its output and
> thus it can avoid the sort step.  Using UNION ALL is recommended
> wherever possible.
>

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Re: Overhead of union versus union all

From
Bruce Momjian
Date:
Scott Bailey wrote:
> Alvaro Herrera wrote:
> > Tim Keitt wrote:
> >> I am combining query results that I know are disjoint. I'm wondering
> >> how much overhead there is in calling union versus union all. (Just
> >> curious really; I can't see a reason not to use union all.)
> >
> > UNION needs to uniquify the output, for which it plasters an additional
> > sort step, whereas UNION ALL does not need to uniquify its output and
> > thus it can avoid the sort step.  Using UNION ALL is recommended
> > wherever possible.
> >
>
> I think I read somewhere that as of 8.4 it no longer required the sort
> step, due to the improvements in hashing. Here it is
>
> http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Oh, yea, hashing is used in some cases rather than sort.  I assume sort
is still used if the hash exceeds workmem size.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Overhead of union versus union all

From
Scott Marlowe
Date:
On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:
> Scott Bailey wrote:
>> Alvaro Herrera wrote:
>> > Tim Keitt wrote:
>> >> I am combining query results that I know are disjoint. I'm wondering
>> >> how much overhead there is in calling union versus union all. (Just
>> >> curious really; I can't see a reason not to use union all.)
>> >
>> > UNION needs to uniquify the output, for which it plasters an additional
>> > sort step, whereas UNION ALL does not need to uniquify its output and
>> > thus it can avoid the sort step.  Using UNION ALL is recommended
>> > wherever possible.
>> >
>>
>> I think I read somewhere that as of 8.4 it no longer required the sort
>> step, due to the improvements in hashing. Here it is
>>
>> http://wiki.postgresql.org/wiki/WhatsNew84#Performance
>
> Oh, yea, hashing is used in some cases rather than sort.  I assume sort
> is still used if the hash exceeds workmem size.

The important point being that it's still more expensive than a plain
union all thought, right?

Re: Overhead of union versus union all

From
Bruce Momjian
Date:
Scott Marlowe wrote:
> On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:
> > Scott Bailey wrote:
> >> Alvaro Herrera wrote:
> >> > Tim Keitt wrote:
> >> >> I am combining query results that I know are disjoint. I'm wondering
> >> >> how much overhead there is in calling union versus union all. (Just
> >> >> curious really; I can't see a reason not to use union all.)
> >> >
> >> > UNION needs to uniquify the output, for which it plasters an additional
> >> > sort step, whereas UNION ALL does not need to uniquify its output and
> >> > thus it can avoid the sort step. ?Using UNION ALL is recommended
> >> > wherever possible.
> >> >
> >>
> >> I think I read somewhere that as of 8.4 it no longer required the sort
> >> step, due to the improvements in hashing. Here it is
> >>
> >> http://wiki.postgresql.org/wiki/WhatsNew84#Performance
> >
> > Oh, yea, hashing is used in some cases rather than sort. ?I assume sort
> > is still used if the hash exceeds workmem size.
>
> The important point being that it's still more expensive than a plain
> union all thought, right?

Yep.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Overhead of union versus union all

From
Simon Riggs
Date:
On Thu, 2009-07-09 at 20:41 -0600, Scott Marlowe wrote:
> On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:
> > Scott Bailey wrote:
> >> Alvaro Herrera wrote:
> >> > Tim Keitt wrote:
> >> >> I am combining query results that I know are disjoint. I'm wondering
> >> >> how much overhead there is in calling union versus union all. (Just
> >> >> curious really; I can't see a reason not to use union all.)
> >> >
> >> > UNION needs to uniquify the output, for which it plasters an additional
> >> > sort step, whereas UNION ALL does not need to uniquify its output and
> >> > thus it can avoid the sort step.  Using UNION ALL is recommended
> >> > wherever possible.
> >> >
> >> I think I read somewhere that as of 8.4 it no longer required the sort
> >> step, due to the improvements in hashing. Here it is
> >>
> >> http://wiki.postgresql.org/wiki/WhatsNew84#Performance
> >
> > Oh, yea, hashing is used in some cases rather than sort.  I assume sort
> > is still used if the hash exceeds workmem size.
>
> The important point being that it's still more expensive than a plain
> union all thought, right?

I think it should be possible to use predtest theorem proving to discard
the sort/hash step in cases where we can prove the sets are disjoint.
Often there are top-level quals that can be compared in the WHERE
clauses of the sub-queries, so a shallow search could be quite
profitable in allowing us to rewrite a UNION into a UNION ALL.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Overhead of union versus union all

From
Bruce Momjian
Date:
Simon Riggs wrote:
>
> On Thu, 2009-07-09 at 20:41 -0600, Scott Marlowe wrote:
> > On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:
> > > Scott Bailey wrote:
> > >> Alvaro Herrera wrote:
> > >> > Tim Keitt wrote:
> > >> >> I am combining query results that I know are disjoint. I'm wondering
> > >> >> how much overhead there is in calling union versus union all. (Just
> > >> >> curious really; I can't see a reason not to use union all.)
> > >> >
> > >> > UNION needs to uniquify the output, for which it plasters an additional
> > >> > sort step, whereas UNION ALL does not need to uniquify its output and
> > >> > thus it can avoid the sort step.  Using UNION ALL is recommended
> > >> > wherever possible.
> > >> >
> > >> I think I read somewhere that as of 8.4 it no longer required the sort
> > >> step, due to the improvements in hashing. Here it is
> > >>
> > >> http://wiki.postgresql.org/wiki/WhatsNew84#Performance
> > >
> > > Oh, yea, hashing is used in some cases rather than sort.  I assume sort
> > > is still used if the hash exceeds workmem size.
> >
> > The important point being that it's still more expensive than a plain
> > union all thought, right?
>
> I think it should be possible to use predtest theorem proving to discard
> the sort/hash step in cases where we can prove the sets are disjoint.
> Often there are top-level quals that can be compared in the WHERE
> clauses of the sub-queries, so a shallow search could be quite
> profitable in allowing us to rewrite a UNION into a UNION ALL.

I assume we would still need the distinct removal step;  we just avoid
the sort/hash.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Overhead of union versus union all

From
Simon Riggs
Date:
On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote:

> > I think it should be possible to use predtest theorem proving to
> discard
> > the sort/hash step in cases where we can prove the sets are
> disjoint.
> > Often there are top-level quals that can be compared in the WHERE
> > clauses of the sub-queries, so a shallow search could be quite
> > profitable in allowing us to rewrite a UNION into a UNION ALL.
>
> I assume we would still need the distinct removal step;  we just avoid
> the sort/hash.

I mean it seems possible to prove that the distinct removal step is not
necessary, by proving that the various sub-queries are already disjoint.
It's a common manual optimization, so automating it seems a reasonable
future goal.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Overhead of union versus union all

From
Bruce Momjian
Date:
Simon Riggs wrote:
>
> On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote:
>
> > > I think it should be possible to use predtest theorem proving to
> > discard
> > > the sort/hash step in cases where we can prove the sets are
> > disjoint.
> > > Often there are top-level quals that can be compared in the WHERE
> > > clauses of the sub-queries, so a shallow search could be quite
> > > profitable in allowing us to rewrite a UNION into a UNION ALL.
> >
> > I assume we would still need the distinct removal step;  we just avoid
> > the sort/hash.
>
> I mean it seems possible to prove that the distinct removal step is not
> necessary, by proving that the various sub-queries are already disjoint.
> It's a common manual optimization, so automating it seems a reasonable
> future goal.

I am confused what sub-queries produce _distinct_ output.  I know there
are some that produce _ordered_ output.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Overhead of union versus union all

From
Simon Riggs
Date:
On Fri, 2009-07-10 at 09:28 -0400, Bruce Momjian wrote:
> Simon Riggs wrote:
> >
> > On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote:
> >
> > > > I think it should be possible to use predtest theorem proving to
> > > discard
> > > > the sort/hash step in cases where we can prove the sets are
> > > disjoint.
> > > > Often there are top-level quals that can be compared in the WHERE
> > > > clauses of the sub-queries, so a shallow search could be quite
> > > > profitable in allowing us to rewrite a UNION into a UNION ALL.
> > >
> > > I assume we would still need the distinct removal step;  we just avoid
> > > the sort/hash.
> >
> > I mean it seems possible to prove that the distinct removal step is not
> > necessary, by proving that the various sub-queries are already disjoint.
> > It's a common manual optimization, so automating it seems a reasonable
> > future goal.
>
> I am confused what sub-queries produce _distinct_ output.  I know there
> are some that produce _ordered_ output.

None, that was not my point.

If you have a query like this

 Select ..., status, ...
 ...
 where status = '1'
 union
 Select ..., status, ...
 ...
 where status = '2';

or a query like this

 Select '1', ...
 ...
 union
 Select '2', ...
 ...
 ;

or a query like this

 Select '1', ...
 ...
 union
 Select status, ...
 ...
 where status != '1';
 ;

then it is clear that we could automatically prove that the the distinct
step is redundant and so we could either hash or sort. This is the same
as replacing the UNION with UNION ALL.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Overhead of union versus union all

From
Bruce Momjian
Date:
Simon Riggs wrote:
> or a query like this
>
>  Select '1', ...
>  ...
>  union
>  Select status, ...
>  ...
>  where status != '1';
>  ;
>
> then it is clear that we could automatically prove that the the distinct
> step is redundant and so we could either hash or sort. This is the same
> as replacing the UNION with UNION ALL.

In the last example, how do you know that status != '1' produces unique
output?  I assumed UNION gave distinct for the entire output, not just
remove duplicates from the two UNION branches;  that's how Postgres
behaves now:

    test=> SELECT 1 UNION (SELECT 2 UNION ALL SELECT 2);
     ?column?
    ----------
            1
            2
    (2 rows)

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Overhead of union versus union all

From
Simon Riggs
Date:
On Fri, 2009-07-10 at 09:46 -0400, Bruce Momjian wrote:
> Simon Riggs wrote:
> > or a query like this
> >
> >  Select '1', ...
> >  ...
> >  union
> >  Select status, ...
> >  ...
> >  where status != '1';
> >  ;
> >
> > then it is clear that we could automatically prove that the the distinct
> > step is redundant and so we could either hash or sort. This is the same
> > as replacing the UNION with UNION ALL.
>
> In the last example, how do you know that status != '1' produces unique
> output?

You don't. I was assuming that you could already prove that each
subquery was distinct in itself.

It's one for the TODO, that's all. I see it often, but I'm not planning
to work on the code for this myself.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Overhead of union versus union all

From
Jeff Davis
Date:
On Fri, 2009-07-10 at 14:22 +0100, Simon Riggs wrote:
> I mean it seems possible to prove that the distinct removal step is not
> necessary, by proving that the various sub-queries are already disjoint.
> It's a common manual optimization, so automating it seems a reasonable
> future goal.

There are even simpler cases that postgresql can't optimize. Consider:

-- foo has a primary key
SELECT * FROM foo UNION SELECT * FROM foo;

That's logically equivalent to:

SELECT * FROM foo;

But postgresql will add a sort anyway.

There are lots of optimizations along these lines. They seem obscure,
but these optimizations become much more useful when using views or
complex queries where the same table appears multiple times. For
instance, if you have two views that are projections of the same table,
then, you join the views together, you can optimize away the join in
some cases, and just scan the original table.

I think a lot of these optimizations depend on knowing which tables (or
subqueries) are relations in the relational theory sense; i.e.
unordered, distinct, and have no NULLs in the relevant attributes.

Regards,
    Jeff Davis


Re: Overhead of union versus union all

From
Greg Stark
Date:
On Fri, Jul 10, 2009 at 6:37 PM, Jeff Davis<pgsql@j-davis.com> wrote:
>
> -- foo has a primary key
> SELECT * FROM foo UNION SELECT * FROM foo;
>
> That's logically equivalent to:
>
> SELECT * FROM foo;
>
> But postgresql will add a sort anyway.


Well no, it's equivalent to SELECT DISTINCT * FROM foo;


--
greg
http://mit.edu/~gsstark/resume.pdf

Re: Overhead of union versus union all

From
Jeff Davis
Date:
On Fri, 2009-07-10 at 18:47 +0100, Greg Stark wrote:
> > -- foo has a primary key

> Well no, it's equivalent to SELECT DISTINCT * FROM foo;

I think you missed that "foo" has a primary key.

Regards,
    Jeff Davis


Re: Overhead of union versus union all

From
Scott Marlowe
Date:
On Fri, Jul 10, 2009 at 11:47 AM, Greg Stark<gsstark@mit.edu> wrote:
> On Fri, Jul 10, 2009 at 6:37 PM, Jeff Davis<pgsql@j-davis.com> wrote:
>>
>> -- foo has a primary key
>> SELECT * FROM foo UNION SELECT * FROM foo;
>>
>> That's logically equivalent to:
>>
>> SELECT * FROM foo;
>>
>> But postgresql will add a sort anyway.
>
>
> Well no, it's equivalent to SELECT DISTINCT * FROM foo;

And honestly, I'd rather see development effort go into making complex
queries run faster (the things like bitmap indexes on disk etc) rather
than optimising things that i can optimise myself.