Re: SELECT DISTINCT very slow - Mailing list pgsql-general

From Greg Stark
Subject Re: SELECT DISTINCT very slow
Date
Msg-id 407d949e0907100558o309dafa8m2b4d84a437e0dd30@mail.gmail.com
Whole thread Raw
In response to Re: SELECT DISTINCT very slow  (Ben Harper <rogojin@gmail.com>)
List pgsql-general
On Fri, Jul 10, 2009 at 1:41 PM, Ben Harper<rogojin@gmail.com> wrote:
>
> Unfortunately I can't use GROUP BY, because what I'm really doing is
> SELECT DISTINCT ON(unique_field) id FROM table;

You could do that using GROUP BY if you define a first() aggregate. In
this case that would just be SELECT first(id) AS id from (select *
from table ORDER BY unique_field, ...) GROUP BY unique_field. In cases
with more fields it gets tiresome fast.

In this case 8.4 won't actually help you. It only uses hash aggregates
for DISTINCT not DISTINCT ON.


> I'm not familiar with the Postgres internals, but in my own DB system
> that I have written, I do the skip-scanning thing, and for my system
> it was a really trivial optimization to code.

Well things get tricky quickly when you have to deal with concurrent
inserts and potentially page splits from other transactions. Also
consider how hard it is to prove that the query falls into this
category of queries.


> Inside a GIS application, the user wants to categorize the display of
> some information based on, in this case, the suburb name.
> He clicks a button that says "Add All Unique Categories". This is a
> very common operation in this domain.

That doesn't look like what this query is doing to me. It's taking one
exemplar from each suburb based on some other constraint (the minimum
of whatever your order by key is) and taking the id of that data
point. If the order by key doesn't specify a specific data point then
it's a non-deterministic record.

--
greg
http://mit.edu/~gsstark/resume.pdf

pgsql-general by date:

Previous
From: Ben Harper
Date:
Subject: Re: SELECT DISTINCT very slow
Next
From: Bruce Momjian
Date:
Subject: Re: Overhead of union versus union all