Thread: SELECT DISTINCT very slow

From: Ben Harper

Hi,
Can anybody explain this:

Records: 600,000
Field Width: Approximately 20 UTF8 characters - type is VARCHAR(25)
Field is Indexed.

SELECT DISTINCT field FROM table;

Takes about 6 seconds. There are 111 distinct items.

On SQLite, and in another system where I have a B+Tree, this query is
faster than my eye can measure.

Is this a well known issue?

Thanks,
Ben

Re: SELECT DISTINCT very slow

From: Bill Moran

In response to Ben Harper <rogojin@gmail.com>:
> Hi,
> Can anybody explain this:
>
> Records: 600,000
> Field Width: Approximately 20 UTF8 characters - type is VARCHAR(25)
> Field is Indexed.
>
> SELECT DISTINCT field FROM table;
>
> Takes about 6 seconds. There are 111 distinct items.

What's the output of EXPLAIN ANALYZE SELECT DISTINCT field FROM table;?

Does a VACUUM ANALYZE of the table help?  Is the query significantly
faster the second time you run it?
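
Spelled out with the placeholder names from the original post, that would
be:

EXPLAIN ANALYZE SELECT DISTINCT field FROM table;
VACUUM ANALYZE table;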

> Is this a well known issue?

Not that I'm aware of.

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: SELECT DISTINCT very slow

From: Andres Freund

On Thursday 09 July 2009 17:09:13 Ben Harper wrote:
> Hi,
> Can anybody explain this:
>
> Records: 600,000
> Field Width: Approximately 20 UTF8 characters - type is VARCHAR(25)
> Field is Indexed.
>
> SELECT DISTINCT field FROM table;
>
> Takes about 6 seconds. There are 111 distinct items.
>
> On SQLite, and in another system where I have a B+Tree, this query is
> faster than my eye can measure.
>
> Is this a well known issue?
Yes, I think so.

AFAIK the primary cause is that indexes in pg do not store visibility
information, so for every index entry you still have to check on the heap
whether the tuple is visible.
Possibly because of that, PG has no special-case code to optimize such a
DISTINCT query using mostly the index. In principle, for each possible
value of 'field', you would probe the index only long enough to prove that
at least one such entry exists.
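
(For what it's worth, on 8.4 that probing strategy can be emulated by hand
with a recursive CTE; a sketch, assuming a table tbl with an index on
field:

WITH RECURSIVE skip AS (
    SELECT min(field) AS field FROM tbl
  UNION ALL
    -- one index probe for the next value larger than the last one found
    SELECT (SELECT min(field) FROM tbl WHERE field > skip.field)
    FROM skip WHERE skip.field IS NOT NULL
)
SELECT field FROM skip WHERE field IS NOT NULL;

Each step makes one index probe for the next larger value, so the index is
touched roughly once per distinct value rather than once per row.)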

Would it be possible to move that single field into its own table?

Andres

Re: SELECT DISTINCT very slow

From: Pavel Stehule

Hello

If you use pg 8.3 or older, please use GROUP BY:

SELECT field FROM table GROUP BY field;

Regards
Pavel Stehule

2009/7/9 Ben Harper <rogojin@gmail.com>:
> Hi,
> Can anybody explain this:
>
> Records: 600,000
> Field Width: Approximately 20 UTF8 characters - type is VARCHAR(25)
> Field is Indexed.
>
> SELECT DISTINCT field FROM table;
>
> Takes about 6 seconds. There are 111 distinct items.
>
> On SQLite, and in another system where I have a B+Tree, this query is
> faster than my eye can measure.
>
> Is this a well known issue?
>
> Thanks,
> Ben
>

Re: SELECT DISTINCT very slow

From: Greg Stark

On Thu, Jul 9, 2009 at 4:47 PM, Andres Freund <andres@anarazel.de> wrote:
> AFAIK the primary cause is that indexes in pg do not store visibility
> information.

Not really. The OP doesn't say how wide the record rows are but unless
they're very wide it wouldn't pay to use an index for this even if you
didn't have to access the heap also. It's going to be faster to scan
the whole heap and either sort or use a hash. Currently there aren't
many cases where a btree with 6,000 copies of 111 distinct keys is
going to be useful.

Arguably the missing feature here is skip-scans where we scan the
index but only pull out one record for each distinct value. I'm not
sure there's anything particularly stopping Postgres from being able
to do them, but it might be a lot of code for a narrow use case.

--
greg
http://mit.edu/~gsstark/resume.pdf

Re: SELECT DISTINCT very slow

From: Tom Lane

Greg Stark <gsstark@mit.edu> writes:
> Not really. The OP doesn't say how wide the record rows are but unless
> they're very wide it wouldn't pay to use an index for this even if you
> didn't have to access the heap also. It's going to be faster to scan
> the whole heap and either sort or use a hash. Currently there aren't
> many cases where a btree with 6,000 copies of 111 distinct keys is
> going to be useful.

It was 600,000 not 6,000 ... so a skip-scan might be worth the trouble,
but as you say we haven't done it.

In any case I think the real issue is that the OP is probably using a
pre-8.4 release which will always do SELECT DISTINCT via sort-and-unique.
Hash aggregation would be a whole lot faster for these numbers, even
if not exactly instantaneous.  He could update to 8.4, or go over to
using GROUP BY as was recommended upthread.
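
(Schematically, the two plan shapes; a sketch, not actual EXPLAIN output:

pre-8.4:  Unique
            ->  Sort
                  ->  Seq Scan on table

8.4:      HashAggregate
            ->  Seq Scan on table)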

            regards, tom lane

Re: SELECT DISTINCT very slow

From: Ben Harper

Thanks for all the feedback.

Using GROUP BY is indeed much faster (about 1 second).

Unfortunately I can't use GROUP BY, because what I'm really doing is
SELECT DISTINCT ON (unique_field) id FROM table;

I'm not familiar with the Postgres internals, but in my own DB system
that I have written, I do the skip-scanning thing, and for my system
it was a really trivial optimization to code. I know, I'm always free
to submit a patch, and hopefully someday I will, if it hasn't already
been done by then.

I can't comment on whether this skip-scan optimization is general
enough to warrant the lines of code, but I might as well explain my
use case:
Inside a GIS application, the user wants to categorize the display of
some information based on, in this case, the suburb name.
He clicks a button that says "Add All Unique Categories". This is a
very common operation in this domain.

Again, thanks for all the feedback. I'll upgrade to 8.4 soon.
Ben Harper

On Fri, Jul 10, 2009 at 2:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <gsstark@mit.edu> writes:
>> Not really. The OP doesn't say how wide the record rows are but unless
>> they're very wide it wouldn't pay to use an index for this even if you
>> didn't have to access the heap also. It's going to be faster to scan
>> the whole heap and either sort or use a hash. Currently there aren't
>> many cases where a btree with 6,000 copies of 111 distinct keys is
>> going to be useful.
>
> It was 600,000 not 6,000 ... so a skip-scan might be worth the trouble,
> but as you say we haven't done it.
>
> In any case I think the real issue is that the OP is probably using a
> pre-8.4 release which will always do SELECT DISTINCT via sort-and-unique.
> Hash aggregation would be a whole lot faster for these numbers, even
> if not exactly instantaneous.  He could update to 8.4, or go over to
> using GROUP BY as was recommended upthread.
>
>                        regards, tom lane
>

Re: SELECT DISTINCT very slow

From: Greg Stark

On Fri, Jul 10, 2009 at 1:41 PM, Ben Harper <rogojin@gmail.com> wrote:
>
> Unfortunately I can't use GROUP BY, because what I'm really doing is
> SELECT DISTINCT ON (unique_field) id FROM table;

You could do that using GROUP BY if you define a first() aggregate. In
this case that would just be SELECT first(id) AS id FROM (SELECT * FROM
table ORDER BY unique_field, ...) AS s GROUP BY unique_field. In cases
with more fields it gets tiresome fast.
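
(A minimal first() along the usual SQL-level lines; a sketch, not tested
here:

CREATE FUNCTION first_agg(anyelement, anyelement) RETURNS anyelement
    LANGUAGE sql IMMUTABLE STRICT AS 'SELECT $1';

CREATE AGGREGATE first (anyelement) (
    SFUNC = first_agg,   -- always returns the existing state
    STYPE = anyelement
);

STRICT does the real work here: with a strict transition function the
first non-null input becomes the initial state, and every later call just
returns the existing state.)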

In this case 8.4 won't actually help you. It only uses hash aggregates
for DISTINCT, not for DISTINCT ON.


> I'm not familiar with the Postgres internals, but in my own DB system
> that I have written, I do the skip-scanning thing, and for my system
> it was a really trivial optimization to code.

Well, things get tricky quickly when you have to deal with concurrent
inserts and potentially page splits from other transactions. Also consider
how hard it is to prove that a given query falls into this category of
queries.


> Inside a GIS application, the user wants to categorize the display of
> some information based on, in this case, the suburb name.
> He clicks a button that says "Add All Unique Categories". This is a
> very common operation in this domain.

That doesn't look like what this query is doing to me. It's taking one
exemplar from each suburb based on some other constraint (the minimum of
whatever your ORDER BY key is) and returning the id of that data point.
If the ORDER BY key doesn't pin down a specific data point, the row you
get back is non-deterministic.
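
(That is, to make the choice deterministic you'd tie-break explicitly,
e.g.:

SELECT DISTINCT ON (unique_field) id
FROM table
ORDER BY unique_field, id;

which picks, for each unique_field, the row with the smallest id.)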

--
greg
http://mit.edu/~gsstark/resume.pdf

Re: SELECT DISTINCT very slow

From: Jeff Davis

On Fri, 2009-07-10 at 01:36 +0100, Greg Stark wrote:
> Arguably the missing feature here is skip-scans where we scan the
> index but only pull out one record for each distinct value. I'm not
> sure there's anything particularly stopping Postgres from being able
> to do them, but it might be a lot of code for a narrow use case.

Hypothetically, would something like a "sort distinct" operator be of any
use? I wonder how much work it would save if the sort could weed out
duplicate tuples while sorting. That might make a sort-based plan the
better choice in cases where we don't have a good estimate of the number
of distinct values.

Regards,
    Jeff Davis