Re: SELECT DISTINCT very slow - Mailing list pgsql-general

From Ben Harper
Subject Re: SELECT DISTINCT very slow
Date
Msg-id 6def3e7b0907100541w1c7d3e22td5266ad41cbbd9ae@mail.gmail.com
Whole thread Raw
In response to Re: SELECT DISTINCT very slow  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: SELECT DISTINCT very slow
List pgsql-general
Thanks for all the feedback.

Using GROUP BY is indeed much faster (about 1 second).

Unfortunately I can't use GROUP BY, because what I'm really doing is
SELECT DISTINCT ON(unique_field) id FROM table;

I'm not familiar with the Postgres internals, but in my own DB system
that I have written, I do the skip-scanning thing, and for my system
it was a really trivial optimization to code. I know, I'm always free
to submit a patch, and hopefully someday I will, if it hasn't already
been done by then.

I can't comment on whether this skip-scan optimization is general
enough to warrant the lines of code, but I might as well explain my
use case:
Inside a GIS application, the user wants to categorize the display of
some information based on, in this case, the suburb name.
He clicks a button that says "Add All Unique Categories". This is a
very common operation in this domain.

Again, thanks for all the feedback. I'll upgrade to 8.4 soon.
Ben Harper

On Fri, Jul 10, 2009 at 2:50 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> Greg Stark <gsstark@mit.edu> writes:
>> Not really. The OP doesn't say how wide the record rows are but unless
>> they're very wide it wouldn't pay to use an index for this even if you
>> didn't have to access the heap also. It's going to be faster to scan
>> the whole heap and either sort or use a hash. Currently there aren't
>> many cases where a btree with 6,000 copies of 111 distinct keys is
>> going to be useful.
>
> It was 600,000 not 6,000 ... so a skip-scan might be worth the trouble,
> but as you say we haven't done it.
>
> In any case I think the real issue is that the OP is probably using a
> pre-8.4 release which will always do SELECT DISTINCT via sort-and-unique.
> Hash aggregation would be a whole lot faster for these numbers, even
> if not exactly instantaneous.  He could update to 8.4, or go over to
> using GROUP BY as was recommended upthread.
>
>                        regards, tom lane
>

pgsql-general by date:

Previous
From: Tim Uckun
Date:
Subject: Re: ubuntu packages for 8.4
Next
From: Greg Stark
Date:
Subject: Re: SELECT DISTINCT very slow