Re: Querying distinct values from a large table - Mailing list pgsql-performance

From Gregory Stark
Subject Re: Querying distinct values from a large table
Date
Msg-id 877iv3miqu.fsf@stark.xeocode.com
Whole thread Raw
In response to Re: Querying distinct values from a large table  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Querying distinct values from a large table  (Ron <rjpeace@earthlink.net>)
Re: Querying distinct values from a large table  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Gregory Stark wrote:
> >> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on
> >> your data distribution. It's not hard to come up with distributions where it's
> >> 1000x as fast and others where there's no speed difference.)
>
> > So the figure is really "1-1000x"?  I bet this one is more impressive in
> > PHB terms.
>
> Luke has a bad habit of quoting numbers that are obviously derived from
> narrow benchmarking scenarios as Universal Truths, rather than providing
> the context they were derived in.  I wish he'd stop doing that...

In fairness I may have exaggerated a bit there. There is a limit to how much
of a speedup you can get in valid benchmarking situations. A single sequential
scan is always going to be necessary so you're only saving the cost of writing
out the temporary file and subsequent merge passes.

It's hard to generate lots of intermediate merge passes since there are only
O(log(n)) of them. So to get 1000x speedup on a large I/O bound sort you would
have to be sorting something on order of 2^1000 records which is ridiculous.
Realistically you should only be able to save 2-5 intermediate merge passes.

On the other there are some common situations where you could see atypical
increases. Consider joining a bunch of small tables to generate a large result
set. The small tables are probably all in memory and the result set may only
have a small number of distinct values. If you throw out the duplicates early
you save *all* the I/O. If you have to do a disk sort it could be many orders
slower.

This is actually not an uncommon coding idiom for MySQL programmers accustomed
to fast DISTINCT working around the lack of subqueries and poor performance of
IN and EXISTS. They often just join together all the tables in a big cross
join and then toss in a DISTINCT at the top to get rid of the duplicates.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

pgsql-performance by date:

Previous
From: "Luke Lonergan"
Date:
Subject: Re: Querying distinct values from a large table
Next
From: Sidar López Cruz
Date:
Subject: Re: Very slow queries