Re: Querying distinct values from a large table - Mailing list pgsql-performance

From Ron
Subject Re: Querying distinct values from a large table
Date
Msg-id E1HCGjN-0003Gp-GQ@elasmtp-kukur.atl.sa.earthlink.net
Whole thread Raw
In response to Re: Querying distinct values from a large table  (Gregory Stark <gsstark@mit.edu>)
List pgsql-performance
I strongly encourage anyone who is interested in the general external
sorting problem to peruse Jim Gray's site:
http://research.microsoft.com/barc/SortBenchmark/

Ron Peacetree

At 08:24 AM 1/31/2007, Gregory Stark wrote:
>Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > Alvaro Herrera <alvherre@commandprompt.com> writes:
> > > Gregory Stark wrote:
> > >> (Incidentally I'm not sure where 2-5x comes from. It's entirely
> > >> dependent on your data distribution. It's not hard to come up with
> > >> distributions where it's 1000x as fast and others where there's no
> > >> speed difference.)
> >
> > > So the figure is really "1-1000x"?  I bet this one is more impressive in
> > > PHB terms.
> >
> > Luke has a bad habit of quoting numbers that are obviously derived from
> > narrow benchmarking scenarios as Universal Truths, rather than providing
> > the context they were derived in.  I wish he'd stop doing that...
>
>In fairness I may have exaggerated a bit there. There is a limit to how much
>of a speedup you can get in valid benchmarking situations. A single sequential
>scan is always going to be necessary so you're only saving the cost of writing
>out the temporary file and subsequent merge passes.
>
>It's hard to generate lots of intermediate merge passes since there are only
>O(log(n)) of them. So to get a 1000x speedup on a large I/O-bound sort you
>would have to be sorting something on the order of 2^1000 records, which is
>ridiculous. Realistically you should only be able to save 2-5 intermediate
>merge passes.
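
As a back-of-the-envelope check of that bound (the numbers below are
illustrative assumptions, not figures from this thread): an external merge
sort of N records, with initial runs of about M records each and an F-way
merge, needs roughly ceil(log_F(ceil(N/M))) intermediate passes. You can
evaluate that directly in PostgreSQL:

```sql
-- Illustrative only: 10^9 records, 10^6-record initial runs, 100-way
-- merge.  ceil(N/M) = 1000 runs, and ceil(log_100(1000)) = 2 passes.
SELECT ceil(ln(ceil(1e9 / 1e6)) / ln(100)) AS merge_passes;
```

Doubling N only adds a pass when the run count crosses the next power of
the fan-in, which is why the potential savings top out at a handful of
passes.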
>
>On the other hand, there are some common situations where you could see
>atypical increases. Consider joining a bunch of small tables to generate a
>large result set. The small tables are probably all in memory and the result
>set may only have a small number of distinct values. If you throw out the
>duplicates early you save *all* the I/O. If you have to do a disk sort
>instead, it could be many orders of magnitude slower.
>
>This is actually not an uncommon coding idiom for MySQL programmers accustomed
>to fast DISTINCT working around the lack of subqueries and poor performance of
>IN and EXISTS. They often just join together all the tables in a big cross
>join and then toss in a DISTINCT at the top to get rid of the duplicates.
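
A sketch of the idiom described above, with hypothetical table names (not
taken from the thread):

```sql
-- The idiom: join everything, then de-duplicate at the top.  The join
-- can produce a huge intermediate result that must be sorted or hashed
-- just to throw most of it away.
SELECT DISTINCT c.name
FROM customers c, orders o, order_items i
WHERE o.customer_id = c.id
  AND i.order_id = o.id;

-- The same intent expressed with EXISTS: no oversized intermediate
-- result, and no final de-duplication step.
SELECT c.name
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    JOIN order_items i ON i.order_id = o.id
    WHERE o.customer_id = c.id
);
```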
>
>--
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq

