Home > mailing lists

Re: SELECT DISTINCT very slow - Mailing list pgsql-general

From	Greg Stark
Subject	Re: SELECT DISTINCT very slow
Date	July 10, 2009 12:58:38
Msg-id	407d949e0907100558o309dafa8m2b4d84a437e0dd30@mail.gmail.com Whole thread Raw
In response to	Re: SELECT DISTINCT very slow (Ben Harper <rogojin@gmail.com>)
List	pgsql-general

Tree view

On Fri, Jul 10, 2009 at 1:41 PM, Ben Harper<rogojin@gmail.com> wrote:
>
> Unfortunately I can't use GROUP BY, because what I'm really doing is
> SELECT DISTINCT ON(unique_field) id FROM table;

You could do that using GROUP BY if you define a first() aggregate. In
this case that would just be SELECT first(id) AS id from (select *
from table ORDER BY unique_field, ...) GROUP BY unique_field. In cases
with more fields it gets tiresome fast.

In this case 8.4 won't actually help you. It only uses hash aggregates
for DISTINCT not DISTINCT ON.

> I'm not familiar with the Postgres internals, but in my own DB system
> that I have written, I do the skip-scanning thing, and for my system
> it was a really trivial optimization to code.

Well things get tricky quickly when you have to deal with concurrent
inserts and potentially page splits from other transactions. Also
consider how hard it is to prove that the query falls into this
category of queries.

> Inside a GIS application, the user wants to categorize the display of
> some information based on, in this case, the suburb name.
> He clicks a button that says "Add All Unique Categories". This is a
> very common operation in this domain.

That doesn't look like what this query is doing to me. It's taking one
exemplar from each suburb based on some other constraint (the minimum
of whatever your order by key is) and taking the id of that data
point. If the order by key doesn't specify a specific data point then
it's a non-deterministic record.

--
greg
http://mit.edu/~gsstark/resume.pdf

pgsql-general by date:

From: Ben Harper
Date: 10 July 2009, 12:41:19
Subject: Re: SELECT DISTINCT very slow

From: Bruce Momjian
Date: 10 July 2009, 13:01:11
Subject: Re: Overhead of union versus union all

Re: SELECT DISTINCT very slow - Mailing list pgsql-general

Previous

Next