Thread: DISTINCT vs. GROUP BY
I was wondering whether it is possible to teach the planner to handle DISTINCT in a more efficient way: em=# explain select distinct lastname from import.testtest; QUERY PLAN -------------------------------------------------------------------------------- Unique (cost=2647377.45..2709467.70 rows=1width=7) -> Sort (cost=2647377.45..2678422.58 rows=12418051 width=7) Sort Key: lastname -> SeqScan on testtest (cost=0.00..370082.51 rows=12418051 width=7) (4 Zeilen) Isn't it possible to perform the same operation using a HashAggregate? We have seen that a GROUP BY workaround is usually a lot faster than sort->unique - at least when work_mem is large enough. best regards, hans -- Cybertec Geschwinde & Schönig GmbH Schöngrabern 134; A-2020 Hollabrunn Tel: +43/1/205 10 35 / 340 www.postgresql.at, www.cybertec.at
On Mon, 2005-19-09 at 16:27 +0200, Hans-Jürgen Schönig wrote: > I was wondering whether it is possible to teach the planner to handle > DISTINCT in a more efficient way: [...] > Isn't it possible to perform the same operation using a > HashAggregate? One problem is that DISTINCT ON is defined to return the first unique row (according to the query's ORDER BY) for the set of DISTINCT ON columns, which can't easily be done via hashing. -Neil
Hans-Jürgen Schönig <postgres@cybertec.at> writes: > I was wondering whether it is possible to teach the planner to handle > DISTINCT in a more efficient way: Probably (although the interactions with ORDER BY might be tricky). No one has touched that part of the planner in a very long time. regards, tom lane
Neil Conway <neilc@samurai.com> writes: > On Mon, 2005-19-09 at 16:27 +0200, Hans-Jürgen Schönig wrote: > > I was wondering whether it is possible to teach the planner to handle > > DISTINCT in a more efficient way: > [...] > > Isn't it possible to perform the same operation using a > > HashAggregate? > > One problem is that DISTINCT ON is defined to return the first unique > row (according to the query's ORDER BY) for the set of DISTINCT ON > columns, which can't easily be done via hashing. Uhm. Sure it can. DISTINCT is really just special a case of GROUP BY. Even DISTINCT ON is just GROUP BY with a kind of "first()" aggregate function. What would be really neat would be to teach GROUP BY about first() and last() and how it can skip over some index entries and still satisfy the query. Then make DISTINCT and DISTINCT ON be handled through the exact same code path. For bonus points teach it that min() and max() can sometimes be treated the same way if the path is presenting records sorted on that column. -- greg
Greg Stark <gsstark@mit.edu> writes: > DISTINCT is really just special a case of GROUP BY. Even DISTINCT ON is just > GROUP BY with a kind of "first()" aggregate function. What would be really > neat would be to teach GROUP BY about first() and last() and how it can skip > over some index entries and still satisfy the query. Then make DISTINCT and > DISTINCT ON be handled through the exact same code path. You've missed the point entirely. first() is not a substitute for sorting the input; it is only useful if the input comes pre-sorted. And if you are going to sort the input, you might as well use the current implementation of DISTINCT ON and skip the effort and memory-overflow-risk associated with a hashtable. I do think hash aggregation is a plausible alternative implementation of plain DISTINCT, but I don't see the case for using it for DISTINCT ON. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: > I do think hash aggregation is a plausible alternative implementation of > plain DISTINCT, but I don't see the case for using it for DISTINCT ON. It could be done without presorting the input though not with a simple first()-like function. It would have be a sort of two-argument min() function that kept a state variable for the smallest value found so far of the sort key. My main motivation here is that it's odd to have two code paths for implementing the two language constructs when one is really just a special case of the other. It's a source of cases like this where the code to implement a query path exists but isn't accessible due to the way the query is written. -- greg
Added to TODO: * Allow DISTINCT to use hashing like GROUP BY --------------------------------------------------------------------------- Greg Stark wrote: > > Neil Conway <neilc@samurai.com> writes: > > > On Mon, 2005-19-09 at 16:27 +0200, Hans-J?rgen Sch?nig wrote: > > > I was wondering whether it is possible to teach the planner to handle > > > DISTINCT in a more efficient way: > > [...] > > > Isn't it possible to perform the same operation using a > > > HashAggregate? > > > > One problem is that DISTINCT ON is defined to return the first unique > > row (according to the query's ORDER BY) for the set of DISTINCT ON > > columns, which can't easily be done via hashing. > > Uhm. Sure it can. > > > DISTINCT is really just special a case of GROUP BY. Even DISTINCT ON is just > GROUP BY with a kind of "first()" aggregate function. What would be really > neat would be to teach GROUP BY about first() and last() and how it can skip > over some index entries and still satisfy the query. Then make DISTINCT and > DISTINCT ON be handled through the exact same code path. > > For bonus points teach it that min() and max() can sometimes be treated the > same way if the path is presenting records sorted on that column. > > > -- > greg > > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Mon, Sep 19, 2005 at 10:16:36PM -0400, Bruce Momjian wrote: > > Added to TODO: > > * Allow DISTINCT to use hashing like GROUP BY 3 lines above we have... Consider using hash buckets to do DISTINCT, rather than sorting This would be beneficial when there are few distinct values. Can you add http://archives.postgresql.org/pgsql-hackers/2005-09/msg00810.php? All I could find on the other TODO was http://archives.postgresql.org/pgsql-committers/2004-09/msg00028.php, which doesn't help much... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim C. Nasby wrote: > On Mon, Sep 19, 2005 at 10:16:36PM -0400, Bruce Momjian wrote: > > > > Added to TODO: > > > > * Allow DISTINCT to use hashing like GROUP BY > > 3 lines above we have... > Consider using hash buckets to do DISTINCT, rather than sorting > This would be beneficial when there are few distinct values. OK, I have merged these items into one. > > Can you add > http://archives.postgresql.org/pgsql-hackers/2005-09/msg00810.php? All I > could find on the other TODO was > http://archives.postgresql.org/pgsql-committers/2004-09/msg00028.php, > which doesn't help much... What do these URL's have that the current TODO does not? * Consider using hash buckets to do DISTINCT, rather than sorting This would be beneficial when there are few distinct values. This is already used by GROUP BY. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Tue, Sep 20, 2005 at 05:05:05PM -0400, Bruce Momjian wrote: > Jim C. Nasby wrote: > > On Mon, Sep 19, 2005 at 10:16:36PM -0400, Bruce Momjian wrote: > > > > > > Added to TODO: > > > > > > * Allow DISTINCT to use hashing like GROUP BY > > > > 3 lines above we have... > > Consider using hash buckets to do DISTINCT, rather than sorting > > This would be beneficial when there are few distinct values. > > OK, I have merged these items into one. > > > > > Can you add > > http://archives.postgresql.org/pgsql-hackers/2005-09/msg00810.php? All I > > could find on the other TODO was > > http://archives.postgresql.org/pgsql-committers/2004-09/msg00028.php, > > which doesn't help much... > > What do these URL's have that the current TODO does not? > > * Consider using hash buckets to do DISTINCT, rather than sorting > > This would be beneficial when there are few distinct values. This is > already used by GROUP BY. Maybe it's just me, but the recent run-through of the TODO list indicated that there's a fair number of items that people look at and don't really knowh what they mean. Providing the context (ie: email thread) that spawned an idea seems extremely valuable in being able to explain the idea behind a TODO item. They also usually contain valuable tips about how a TODO could be implemented. In this example, having quick reference to the discussion about hashagg and first()/last() would probably prove useful. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim C. Nasby wrote: > > What do these URL's have that the current TODO does not? > > > > * Consider using hash buckets to do DISTINCT, rather than sorting > > > > This would be beneficial when there are few distinct values. This is > > already used by GROUP BY. > > Maybe it's just me, but the recent run-through of the TODO list > indicated that there's a fair number of items that people look at and > don't really knowh what they mean. Providing the context (ie: email > thread) that spawned an idea seems extremely valuable in being able to > explain the idea behind a TODO item. They also usually contain valuable > tips about how a TODO could be implemented. In this example, having > quick reference to the discussion about hashagg and first()/last() would > probably prove useful. True, but sometimes the thread winds all around and there isn't a definative explaination of how to go at something. I woul rather digest the information to improve it, rather than require people to wade around in an email thread. Is there some detail the TODO is missing? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Tue, Sep 20, 2005 at 06:45:07PM -0400, Bruce Momjian wrote: > Jim C. Nasby wrote: > > > What do these URL's have that the current TODO does not? > > > > > > * Consider using hash buckets to do DISTINCT, rather than sorting > > > > > > This would be beneficial when there are few distinct values. This is > > > already used by GROUP BY. > > > > Maybe it's just me, but the recent run-through of the TODO list > > indicated that there's a fair number of items that people look at and > > don't really knowh what they mean. Providing the context (ie: email > > thread) that spawned an idea seems extremely valuable in being able to > > explain the idea behind a TODO item. They also usually contain valuable > > tips about how a TODO could be implemented. In this example, having > > quick reference to the discussion about hashagg and first()/last() would > > probably prove useful. > > True, but sometimes the thread winds all around and there isn't a > definative explaination of how to go at something. I woul rather digest > the information to improve it, rather than require people to wade around > in an email thread. Is there some detail the TODO is missing? Sure, and I think the TODO items usually are digested fine. But if you want to know any of the details behind the items (like what the motivation is, or how it could be done), you have to manually go and search the archives trying to find something. In this example, it would have been useful to see if there had been any discussion about if you could use hashagg to do this or not. But searching the archive turned up nothing, so I guess we'll never know. If there was a link to a thread in the TODO, we could see exactly what was discussed. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461