Thread: DISTINCT vs. GROUP BY

DISTINCT vs. GROUP BY

From

Hans-Jürgen Schönig

Date:

19 September 2005, 11:27:58

I was wondering whether it is possible to teach the planner to handle 
DISTINCT in a more efficient way:

em=# explain select distinct lastname from import.testtest;                                   QUERY PLAN
-------------------------------------------------------------------------------- Unique  (cost=2647377.45..2709467.70
rows=1width=7)   ->  Sort  (cost=2647377.45..2678422.58 rows=12418051 width=7)         Sort Key: lastname         ->
SeqScan on testtest  (cost=0.00..370082.51 rows=12418051 
 
width=7)
(4 Zeilen)


Isn't it possible to perform the same operation using a HashAggregate?
We have seen that a GROUP BY workaround is usually a lot faster than 
sort->unique - at least when work_mem is large enough.
best regards,
    hans


-- 
Cybertec Geschwinde & Schönig GmbH
Schöngrabern 134; A-2020 Hollabrunn
Tel: +43/1/205 10 35 / 340
www.postgresql.at, www.cybertec.at

Re: DISTINCT vs. GROUP BY

From

Neil Conway

Date:

19 September 2005, 11:51:59

On Mon, 2005-19-09 at 16:27 +0200, Hans-Jürgen Schönig wrote:
> I was wondering whether it is possible to teach the planner to handle 
> DISTINCT in a more efficient way:
[...]
> Isn't it possible to perform the same operation using a
> HashAggregate? 

One problem is that DISTINCT ON is defined to return the first unique
row (according to the query's ORDER BY) for the set of DISTINCT ON
columns, which can't easily be done via hashing.

-Neil

Re: DISTINCT vs. GROUP BY

From

Tom Lane

Date:

19 September 2005, 12:11:59

Hans-Jürgen Schönig <postgres@cybertec.at> writes:
> I was wondering whether it is possible to teach the planner to handle 
> DISTINCT in a more efficient way:

Probably (although the interactions with ORDER BY might be tricky).
No one has touched that part of the planner in a very long time.
        regards, tom lane

Re: DISTINCT vs. GROUP BY

From

Greg Stark

Date:

19 September 2005, 12:45:25

Neil Conway <neilc@samurai.com> writes:

> On Mon, 2005-19-09 at 16:27 +0200, Hans-Jürgen Schönig wrote:
> > I was wondering whether it is possible to teach the planner to handle 
> > DISTINCT in a more efficient way:
> [...]
> > Isn't it possible to perform the same operation using a
> > HashAggregate? 
> 
> One problem is that DISTINCT ON is defined to return the first unique
> row (according to the query's ORDER BY) for the set of DISTINCT ON
> columns, which can't easily be done via hashing.

Uhm. Sure it can.

DISTINCT is really just special a case of GROUP BY. Even DISTINCT ON is just
GROUP BY with a kind of "first()" aggregate function. What would be really
neat would be to teach GROUP BY about first() and last() and how it can skip
over some index entries and still satisfy the query. Then make DISTINCT and
DISTINCT ON be handled through the exact same code path.

For bonus points teach it that min() and max() can sometimes be treated the
same way if the path is presenting records sorted on that column.

-- 
greg

Re: DISTINCT vs. GROUP BY

From

Tom Lane

Date:

19 September 2005, 16:50:56

Greg Stark <gsstark@mit.edu> writes:
> DISTINCT is really just special a case of GROUP BY. Even DISTINCT ON is just
> GROUP BY with a kind of "first()" aggregate function. What would be really
> neat would be to teach GROUP BY about first() and last() and how it can skip
> over some index entries and still satisfy the query. Then make DISTINCT and
> DISTINCT ON be handled through the exact same code path.

You've missed the point entirely.

first() is not a substitute for sorting the input; it is only useful
if the input comes pre-sorted.  And if you are going to sort the input,
you might as well use the current implementation of DISTINCT ON and
skip the effort and memory-overflow-risk associated with a hashtable.

I do think hash aggregation is a plausible alternative implementation of
plain DISTINCT, but I don't see the case for using it for DISTINCT ON.
        regards, tom lane

Re: DISTINCT vs. GROUP BY

From

Greg Stark

Date:

19 September 2005, 18:00:55

Tom Lane <tgl@sss.pgh.pa.us> writes:

> I do think hash aggregation is a plausible alternative implementation of
> plain DISTINCT, but I don't see the case for using it for DISTINCT ON.

It could be done without presorting the input though not with a simple
first()-like function. It would have be a sort of two-argument min() function
that kept a state variable for the smallest value found so far of the sort
key.

My main motivation here is that it's odd to have two code paths for
implementing the two language constructs when one is really just a special
case of the other. It's a source of cases like this where the code to
implement a query path exists but isn't accessible due to the way the query is
written.

-- 
greg

Re: DISTINCT vs. GROUP BY

From

Bruce Momjian

Date:

19 September 2005, 23:16:41

Added to TODO:
* Allow DISTINCT to use hashing like GROUP BY


---------------------------------------------------------------------------

Greg Stark wrote:
> 
> Neil Conway <neilc@samurai.com> writes:
> 
> > On Mon, 2005-19-09 at 16:27 +0200, Hans-J?rgen Sch?nig wrote:
> > > I was wondering whether it is possible to teach the planner to handle 
> > > DISTINCT in a more efficient way:
> > [...]
> > > Isn't it possible to perform the same operation using a
> > > HashAggregate? 
> > 
> > One problem is that DISTINCT ON is defined to return the first unique
> > row (according to the query's ORDER BY) for the set of DISTINCT ON
> > columns, which can't easily be done via hashing.
> 
> Uhm. Sure it can.
> 
> 
> DISTINCT is really just special a case of GROUP BY. Even DISTINCT ON is just
> GROUP BY with a kind of "first()" aggregate function. What would be really
> neat would be to teach GROUP BY about first() and last() and how it can skip
> over some index entries and still satisfy the query. Then make DISTINCT and
> DISTINCT ON be handled through the exact same code path.
> 
> For bonus points teach it that min() and max() can sometimes be treated the
> same way if the path is presenting records sorted on that column.
> 
> 
> -- 
> greg
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faq
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: DISTINCT vs. GROUP BY

From

"Jim C. Nasby"

Date:

20 September 2005, 16:40:54

On Mon, Sep 19, 2005 at 10:16:36PM -0400, Bruce Momjian wrote:
>  
> Added to TODO:
> 
>     * Allow DISTINCT to use hashing like GROUP BY

3 lines above we have...
Consider using hash buckets to do DISTINCT, rather than sorting
This would be beneficial when there are few distinct values. 

Can you add
http://archives.postgresql.org/pgsql-hackers/2005-09/msg00810.php? All I
could find on the other TODO was
http://archives.postgresql.org/pgsql-committers/2004-09/msg00028.php,
which doesn't help much...
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: DISTINCT vs. GROUP BY

From

Bruce Momjian

Date:

20 September 2005, 18:06:31

Jim C. Nasby wrote:
> On Mon, Sep 19, 2005 at 10:16:36PM -0400, Bruce Momjian wrote:
> >  
> > Added to TODO:
> > 
> >     * Allow DISTINCT to use hashing like GROUP BY
> 
> 3 lines above we have...
> Consider using hash buckets to do DISTINCT, rather than sorting
> This would be beneficial when there are few distinct values. 

OK, I have merged these items into one.

> 
> Can you add
> http://archives.postgresql.org/pgsql-hackers/2005-09/msg00810.php? All I
> could find on the other TODO was
> http://archives.postgresql.org/pgsql-committers/2004-09/msg00028.php,
> which doesn't help much...

What do these URL's have that the current TODO does not?

* Consider using hash buckets to do DISTINCT, rather than sorting
 This would be beneficial when there are few distinct values.  This is already used by GROUP BY.

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: DISTINCT vs. GROUP BY

From

"Jim C. Nasby"

Date:

20 September 2005, 19:08:01

On Tue, Sep 20, 2005 at 05:05:05PM -0400, Bruce Momjian wrote:
> Jim C. Nasby wrote:
> > On Mon, Sep 19, 2005 at 10:16:36PM -0400, Bruce Momjian wrote:
> > >  
> > > Added to TODO:
> > > 
> > >     * Allow DISTINCT to use hashing like GROUP BY
> > 
> > 3 lines above we have...
> > Consider using hash buckets to do DISTINCT, rather than sorting
> > This would be beneficial when there are few distinct values. 
> 
> OK, I have merged these items into one.
> 
> > 
> > Can you add
> > http://archives.postgresql.org/pgsql-hackers/2005-09/msg00810.php? All I
> > could find on the other TODO was
> > http://archives.postgresql.org/pgsql-committers/2004-09/msg00028.php,
> > which doesn't help much...
> 
> What do these URL's have that the current TODO does not?
> 
> * Consider using hash buckets to do DISTINCT, rather than sorting
> 
>   This would be beneficial when there are few distinct values.  This is
>   already used by GROUP BY.

Maybe it's just me, but the recent run-through of the TODO list
indicated that there's a fair number of items that people look at and
don't really knowh what they mean. Providing the context (ie: email
thread) that spawned an idea seems extremely valuable in being able to
explain the idea behind a TODO item. They also usually contain valuable
tips about how a TODO could be implemented. In this example, having
quick reference to the discussion about hashagg and first()/last() would
probably prove useful.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: DISTINCT vs. GROUP BY

From

Bruce Momjian

Date:

20 September 2005, 19:45:17

Jim C. Nasby wrote:
> > What do these URL's have that the current TODO does not?
> > 
> > * Consider using hash buckets to do DISTINCT, rather than sorting
> > 
> >   This would be beneficial when there are few distinct values.  This is
> >   already used by GROUP BY.
> 
> Maybe it's just me, but the recent run-through of the TODO list
> indicated that there's a fair number of items that people look at and
> don't really knowh what they mean. Providing the context (ie: email
> thread) that spawned an idea seems extremely valuable in being able to
> explain the idea behind a TODO item. They also usually contain valuable
> tips about how a TODO could be implemented. In this example, having
> quick reference to the discussion about hashagg and first()/last() would
> probably prove useful.

True, but sometimes the thread winds all around and there isn't a
definative explaination of how to go at something.  I woul rather digest
the information to improve it, rather than require people to wade around
in an email thread.  Is there some detail the TODO is missing?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: DISTINCT vs. GROUP BY

From

"Jim C. Nasby"

Date:

20 September 2005, 20:13:59

On Tue, Sep 20, 2005 at 06:45:07PM -0400, Bruce Momjian wrote:
> Jim C. Nasby wrote:
> > > What do these URL's have that the current TODO does not?
> > > 
> > > * Consider using hash buckets to do DISTINCT, rather than sorting
> > > 
> > >   This would be beneficial when there are few distinct values.  This is
> > >   already used by GROUP BY.
> > 
> > Maybe it's just me, but the recent run-through of the TODO list
> > indicated that there's a fair number of items that people look at and
> > don't really knowh what they mean. Providing the context (ie: email
> > thread) that spawned an idea seems extremely valuable in being able to
> > explain the idea behind a TODO item. They also usually contain valuable
> > tips about how a TODO could be implemented. In this example, having
> > quick reference to the discussion about hashagg and first()/last() would
> > probably prove useful.
> 
> True, but sometimes the thread winds all around and there isn't a
> definative explaination of how to go at something.  I woul rather digest
> the information to improve it, rather than require people to wade around
> in an email thread.  Is there some detail the TODO is missing?

Sure, and I think the TODO items usually are digested fine. But if you
want to know any of the details behind the items (like what the
motivation is, or how it could be done), you have to manually go and
search the archives trying to find something.

In this example, it would have been useful to see if there had been any
discussion about if you could use hashagg to do this or not. But
searching the archive turned up nothing, so I guess we'll never know. If
there was a link to a thread in the TODO, we could see exactly what was
discussed.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461