Thread: Querying distinct values from a large table
Greetings!

I have a rather large table with about 5 million rows and a dozen columns. Let's suppose the columns are named 'a', 'b', 'c', etc. I need to query the distinct pairs of ('a';'b') from this table. I use the following query:

SELECT DISTINCT a, b FROM tbl;

but unfortunately, it takes forever to complete. EXPLAIN tells me that the bottleneck is a seqscan on 'tbl', which eats most of the time. Creating a compound index with the following statement:

CREATE INDEX tbl_a_b_idx ON tbl( a, b );

has no effect; postgres simply ignores it, at least according to the EXPLAIN output.

Is there any way to improve the performance of this operation? The table cannot be changed.

--
Igor Lobanov
Internal Development Engineer
SWsoft, Inc.
Igor Lobanov wrote:
> Greetings!
>
> I have rather large table with about 5 millions of rows and a dozen of
> columns. Let's suppose that columns are named 'a', 'b', 'c' etc. I need
> to query distinct pairs of ('a';'b') from this table.
>
> Creating compound index on this table using following statement:
>
> CREATE INDEX tbl_a_b_idx ON tbl( a, b );
>
> gives no effect, postgres simply ignores it, at least according to the
> EXPLAIN output.

What version of PostgreSQL is it?

How many distinct values are you getting back from your 5 million rows? If there are too many, an index isn't going to help.

Can you share the EXPLAIN ANALYSE output? You might want to try increasing work_mem for this one query to speed any sorting.

How often is the table updated? Clustering might buy you some improvements (but not a huge amount, I suspect).

--
Richard Huxton
Archonet Ltd
Richard Huxton wrote:
>> I have rather large table with about 5 millions of rows and a dozen of
>> columns. Let's suppose that columns are named 'a', 'b', 'c' etc. I
>> need to query distinct pairs of ('a';'b') from this table.
>
> What version of PostgreSQL is it?

8.1.4

> How many distinct values are you getting back from your 5 million rows?
> If there are too many, an index isn't going to help.

No more than 10,000.

> Can you share the EXPLAIN ANALYSE output? You might want to try
> increasing work_mem for this one query to speed any sorting.

The real table and column names are obfuscated because of an NDA, sorry.

explain analyze select distinct a, b from tbl

EXPLAIN ANALYZE output is:

 Unique  (cost=500327.32..525646.88 rows=1848 width=6) (actual time=52719.868..56126.356 rows=5390 loops=1)
   ->  Sort  (cost=500327.32..508767.17 rows=3375941 width=6) (actual time=52719.865..54919.989 rows=3378864 loops=1)
         Sort Key: a, b
         ->  Seq Scan on tbl  (cost=0.00..101216.41 rows=3375941 width=6) (actual time=16.643..20652.610 rows=3378864 loops=1)
 Total runtime: 57307.394 ms

> How often is the table updated? Clustering might buy you some
> improvements (but not a huge amount I suspect).

It is updated once per 3-5 seconds.

And one more thing. I don't know if it helps, but column 'a' can take a value from a limited set: 0, 1 or 2. Column 'b' is also an integer (a foreign key, actually).

--
Igor Lobanov
Internal Development Engineer
SWsoft, Inc.
> You might want to try increasing work_mem for this one query to speed
> any sorting.

Forgot to mention that our work_mem setting is 20480 kB.

--
Igor Lobanov
Internal Development Engineer
SWsoft, Inc.
Igor Lobanov wrote:
> Richard Huxton wrote:
>>> I have rather large table with about 5 millions of rows and a dozen
>>> of columns. Let's suppose that columns are named 'a', 'b', 'c' etc. I
>>> need to query distinct pairs of ('a';'b') from this table.
>
>> What version of PostgreSQL is it?
>
> 8.1.4

Current release is 8.1.6 - probably worth upgrading when you've got time. It should be a simple matter of replacing the binaries, but do check the release notes.

>> How many distinct values are you getting back from your 5 million
>> rows? If there are too many, an index isn't going to help.
>
> No more than 10,000.

OK. Should be possible to do something then.

>> Can you share the EXPLAIN ANALYSE output? You might want to try
>> increasing work_mem for this one query to speed any sorting.
>
> Real table and colum names are obfuscated because of NDA, sorry.

Fair enough.

> explain analyze select distinct a, b from tbl
>
> EXPLAIN ANALYZE output is:
>
> Unique  (cost=500327.32..525646.88 rows=1848 width=6) (actual
> time=52719.868..56126.356 rows=5390 loops=1)
>   ->  Sort  (cost=500327.32..508767.17 rows=3375941 width=6) (actual
> time=52719.865..54919.989 rows=3378864 loops=1)
>         Sort Key: a, b
>         ->  Seq Scan on tbl  (cost=0.00..101216.41 rows=3375941
> width=6) (actual time=16.643..20652.610 rows=3378864 loops=1)
> Total runtime: 57307.394 ms

Hmm - am I right in thinking (a,b) are the only two columns on this table? That means you'll have a lot of rows per page, and an index scan could end up fetching lots of pages to check the rows are visible. Still - I'd expect it to be better than a seq scan.

The first thing to try is to put the index back and run "SET enable_seqscan=off" before the explain analyse. That should force it to use the index. Then we'll see what costs it's expecting.

>> How often is the table updated? Clustering might buy you some
>> improvements (but not a huge amount I suspect).
>
> It is updated once per 3-5 seconds.

OK - forget clustering then.
> And one more thing. I don't know if it helps, but column 'a' can have
> value from a limited set: 0, 1 or 2. Column 'b' is also an integer
> (foreign key, actually).

Hmm - it might be worth trying distinct on (b,a) with an index that way around - that would give you greater selectivity at the top level of the btree. Can you repeat the EXPLAIN ANALYSE with that too, please?

--
Richard Huxton
Archonet Ltd
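Putting Richard's two suggestions together, the session-local test might look like this (a sketch only - 'tbl', 'a' and 'b' stand in for the real names, which are under NDA):

```sql
-- Recreate the compound index, then disable seq scans for this
-- session only, to see what cost the planner puts on an index scan:
CREATE INDEX tbl_a_b_idx ON tbl (a, b);
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT DISTINCT a, b FROM tbl;
RESET enable_seqscan;

-- And the reversed-column variant: 'a' has only 3 distinct values,
-- while 'b' is far more selective, so lead the index with 'b':
CREATE INDEX tbl_b_a_idx ON tbl (b, a);
EXPLAIN ANALYZE SELECT DISTINCT b, a FROM tbl;
```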
To be sure about the performance of an index scan, force the planner to use it instead of a seq scan - for example, by adding a dummy WHERE clause to the query: use a condition that holds true for all rows.

--Imad
www.EnterpriseDB.com

On 1/30/07, Richard Huxton <dev@archonet.com> wrote:
> The first thing to try is to put the index back and run "SET
> enable_seqscan=off" before the explain analyse. That should force it to
> use the index. Then we'll see what costs it's expecting.
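A minimal sketch of the dummy-WHERE-clause idea (assuming, as stated earlier in the thread, that 'a' only ever takes the values 0, 1 or 2, so the predicate is true for every row):

```sql
-- Always-true predicate, but it gives the planner an indexable
-- condition on the leading index column to latch onto:
SELECT DISTINCT a, b FROM tbl WHERE a >= 0;
```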
On Tue, 2007-01-30 at 15:33 +0600, Igor Lobanov wrote:
> explain analyze select distinct a, b from tbl
>
> EXPLAIN ANALYZE output is:
>
> Unique  (cost=500327.32..525646.88 rows=1848 width=6) (actual
> time=52719.868..56126.356 rows=5390 loops=1)
>   ->  Sort  (cost=500327.32..508767.17 rows=3375941 width=6) (actual
> time=52719.865..54919.989 rows=3378864 loops=1)
>         Sort Key: a, b
>         ->  Seq Scan on tbl  (cost=0.00..101216.41 rows=3375941
> width=6) (actual time=16.643..20652.610 rows=3378864 loops=1)
> Total runtime: 57307.394 ms

All your time is in the sort, not in the SeqScan.

Increase your work_mem.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Simon Riggs wrote:
> All your time is in the sort, not in the SeqScan.
>
> Increase your work_mem.

Well, even if the sort were instant, it's only going to get him down to ~20 secs.

--
Richard Huxton
Archonet Ltd
On 1/30/07, Simon Riggs <simon@2ndquadrant.com> wrote:
> > explain analyze select distinct a, b from tbl
> >
> > EXPLAIN ANALYZE output is:
> >
> > Unique  (cost=500327.32..525646.88 rows=1848 width=6) (actual
> > time=52719.868..56126.356 rows=5390 loops=1)
> >   ->  Sort  (cost=500327.32..508767.17 rows=3375941 width=6) (actual
> > time=52719.865..54919.989 rows=3378864 loops=1)
> >         Sort Key: a, b
> >         ->  Seq Scan on tbl  (cost=0.00..101216.41 rows=3375941
> > width=6) (actual time=16.643..20652.610 rows=3378864 loops=1)
> > Total runtime: 57307.394 ms
>
> All your time is in the sort, not in the SeqScan.
>
> Increase your work_mem.
Sounds like an opportunity to implement a "Sort Unique" (sort of like a hash, I guess), there is no need to push 3M rows through a sort algorithm to only shave it down to 1848 unique records.
I am assuming this optimization just isn't implemented in PostgreSQL?
--
Chad
http://www.postgresqlforums.com/
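Chad's point can be illustrated outside the database: when only a handful of distinct pairs exist, a single hash pass finds them without ever sorting the 3M input rows. A toy sketch in Python (not how the executor is written, just the shape of the two plans):

```python
def distinct_via_sort(rows):
    """Sort-then-unique, as the Unique-over-Sort plan does: O(n log n)."""
    out = []
    for r in sorted(rows):
        if not out or out[-1] != r:
            out.append(r)
    return out

def distinct_via_hash(rows):
    """One hash pass over the input: O(n), no sort of the full table."""
    return list(dict.fromkeys(rows))  # keeps first-seen order

# Many rows, few distinct (a, b) pairs - the shape of Igor's data:
rows = [(1, 'b'), (0, 'a'), (1, 'b'), (0, 'a'), (2, 'c')] * 1000
assert sorted(distinct_via_hash(rows)) == distinct_via_sort(rows)
```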
As I understand, there's no hashing for DISTINCT, but there is for GROUP BY. GROUP BY will choose between a hash and a sort (or maybe other options?) depending on the circumstances. So you can write
SELECT a, b FROM tbl GROUP BY a,b
and the sort/unique part of the query may run faster.
Brian
----- Original Message ----
From: Chad Wagner <chad.wagner@gmail.com>
To: Simon Riggs <simon@2ndquadrant.com>
Cc: Igor Lobanov <ilobanov@swsoft.com>; Richard Huxton <dev@archonet.com>; pgsql-performance@postgresql.org
Sent: Tuesday, 30 January, 2007 10:13:27 PM
Subject: Re: [PERFORM] Querying distinct values from a large table
Chad,

On 1/30/07 6:13 AM, "Chad Wagner" <chad.wagner@gmail.com> wrote:
> Sounds like an opportunity to implement a "Sort Unique" (sort of like a hash,
> I guess), there is no need to push 3M rows through a sort algorithm to only
> shave it down to 1848 unique records.
>
> I am assuming this optimization just isn't implemented in PostgreSQL?

Not that it helps Igor, but we've implemented single pass sort/unique, grouping and limit optimizations, and it speeds things up to a single seqscan over the data, from 2-5 times faster than a typical external sort.

I can't think of a way that indexing would help this situation given the required visibility check of each tuple.

- Luke
On 1/30/07, Luke Lonergan <llonergan@greenplum.com> wrote:
> Not that it helps Igor, but we've implemented single pass sort/unique,
> grouping and limit optimizations and it speeds things up to a single seqscan
> over the data, from 2-5 times faster than a typical external sort.
Was that integrated back into PostgreSQL, or is that part of Greenplum's offering?
> I can't think of a way that indexing would help this situation given the
> required visibility check of each tuple.
I agree, using indexes as a "skinny" table is a whole other feature that would be nice.
--
Chad
http://www.postgresqlforums.com/
Chad,

On 1/30/07 7:03 AM, "Chad Wagner" <chad.wagner@gmail.com> wrote:
> On 1/30/07, Luke Lonergan <llonergan@greenplum.com> wrote:
>> Not that it helps Igor, but we've implemented single pass sort/unique,
>> grouping and limit optimizations and it speeds things up to a single seqscan
>> over the data, from 2-5 times faster than a typical external sort.
>
> Was that integrated back into PostgreSQL, or is that part of Greenplum's
> offering?

Not yet, we will submit to PostgreSQL along with other executor node enhancements like hybrid hash agg (fixes the memory overflow problem with hash agg) and some other great sort work. These are all "cooked" and in the Greenplum DBMS, and have proven themselves significant on customer workloads with tens of terabytes already.

For now it seems that the "Group By" trick Brian suggested in this thread, combined with lots of work_mem, may speed things up for this case if HashAgg is chosen. Watch out for misestimates of stats though - hash agg may overallocate RAM in some cases.

>> I can't think of a way that indexing would help this situation given the
>> required visibility check of each tuple.
>
> I agree, using indexes as a "skinny" table is a whole other feature that
> would be nice.

Yah - I like Hannu's ideas to make visibility less of a problem. We're thinking about this too.

- Luke
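In 8.1 terms, the "Group By" trick plus extra work_mem might look like this (a sketch; the work_mem figure is an arbitrary example - size it to the expected hash table, and note it applies per sort/hash operation, per backend):

```sql
-- 8.1 takes work_mem as an integer number of kilobytes:
SET work_mem = 262144;              -- 256 MB for this session only

-- Same result set as SELECT DISTINCT a, b, but eligible for HashAgg:
SELECT a, b FROM tbl GROUP BY a, b;

RESET work_mem;
```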
"Luke Lonergan" <llonergan@greenplum.com> writes: > Chad, > > On 1/30/07 6:13 AM, "Chad Wagner" <chad.wagner@gmail.com> wrote: > > > Sounds like an opportunity to implement a "Sort Unique" (sort of like a hash, > > I guess), there is no need to push 3M rows through a sort algorithm to only > > shave it down to 1848 unique records. > > > > I am assuming this optimization just isn't implemented in PostgreSQL? > > Not that it helps Igor, but we've implemented single pass sort/unique, > grouping and limit optimizations and it speeds things up to a single seqscan > over the data, from 2-5 times faster than a typical external sort. Fwiw I also implemented this last September when I worked on the LIMIT/SORT case. It's blocked on the same issue that that patch is: how do we communicate the info that only unique records are needed down the plan to the sort node? Note that it's not just a boolean flag that we can include like the random access flag: The Unique node could in theory have a different key than the sort node. Coming at this fresh now a few months later I'm thinking instead of trying to put this intelligence into the individual nodes, an optimizer pass could run through the tree looking for Sort nodes that can have this bit tweaked. That doesn't quite help the LIMIT/SORT case because the limit isn't calculated until run-time but perhaps it's possible to finesse that issue by providing either the Limit or Sort node with pointers to other nodes. (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on your data distribution. It's not hard to come up with distributions where it's 1000x as fast and others where there's no speed difference.) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Tue, Jan 30, 2007 at 14:33:34 +0600, Igor Lobanov <ilobanov@swsoft.com> wrote:
> Greetings!
>
> I have rather large table with about 5 millions of rows and a dozen of
> columns. Let's suppose that columns are named 'a', 'b', 'c' etc. I need
> to query distinct pairs of ('a';'b') from this table.
>
> Is there any way to somehow improve the performance of this operation?
> Table can not be changed.

DISTINCT currently can't use a hash aggregate plan and will use a sort. If there aren't many distinct values, the hash aggregate plan will run much faster. To get around this limitation, rewrite the query as a GROUP BY. Something like:

SELECT a, b FROM table GROUP BY a, b;
Gregory Stark wrote:
> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on
> your data distribution. It's not hard to come up with distributions where it's
> 1000x as fast and others where there's no speed difference.)

So the figure is really "1-1000x"? I bet this one is more impressive in PHB terms.

--
Alvaro Herrera                 http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro,

On 1/30/07 9:04 AM, "Alvaro Herrera" <alvherre@commandprompt.com> wrote:
>> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on
>> your data distribution. It's not hard to come up with distributions where
>> it's 1000x as fast and others where there's no speed difference.)
>
> So the figure is really "1-1000x"? I bet this one is more impressive in
> PHB terms.

You got me - I'll bite - what's PHB?

- Luke
On Tue, Jan 30, 2007 at 10:20:28AM -0800, Luke Lonergan wrote: > You got me - I'll bite - what's PHB? Usually the Pointy-Haired Boss, a figure from Dilbert. http://en.wikipedia.org/wiki/Pointy_Haired_Boss /* Steinar */ -- Homepage: http://www.sesse.net/
Luke, > You got me - I'll bite - what's PHB? Pointy Haired Boss. It's a Dilbert reference. -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco
Luke Lonergan wrote:
> On 1/30/07 9:04 AM, "Alvaro Herrera" <alvherre@commandprompt.com> wrote:
>> So the figure is really "1-1000x"? I bet this one is more impressive in
>> PHB terms.
>
> You got me - I'll bite - what's PHB?

Pointy-Haired Boss, a term (AFAIK) popularized by the Dilbert comic strip. Why, I just noticed there's even a Wikipedia entry:

http://en.wikipedia.org/wiki/Pointy_Haired_Boss

It seems people enjoy writing hypothetically about their bosses.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Argh!

On 1/30/07 3:00 PM, "Josh Berkus" <josh@agliodbs.com> wrote:
> --Josh
>
> Josh Berkus
> PostgreSQL @ Sun
> San Francisco
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Gregory Stark wrote:
>> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on
>> your data distribution. It's not hard to come up with distributions where it's
>> 1000x as fast and others where there's no speed difference.)
>
> So the figure is really "1-1000x"? I bet this one is more impressive in
> PHB terms.

Luke has a bad habit of quoting numbers that are obviously derived from narrow benchmarking scenarios as Universal Truths, rather than providing the context they were derived in. I wish he'd stop doing that...

regards, tom lane
Tom,
On 1/30/07 9:55 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Gregory Stark wrote:
>>> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on
>>> your data distribution. It's not hard to come up with distributions where
>>> it's
>>> 1000x as fast and others where there's no speed difference.)
>
>> So the figure is really "1-1000x"? I bet this one is more impressive in
>> PHB terms.
>
> Luke has a bad habit of quoting numbers that are obviously derived from
> narrow benchmarking scenarios as Universal Truths, rather than providing
> the context they were derived in. I wish he'd stop doing that...
In this case I was referring to results obtained using grouping and distinct optimizations within sort where the benefit is from the use of a single pass instead of the multiple merge passes for external sort followed by a UNIQUE operator. In this case, the benefit ranges from 2-5x in many examples as I mentioned: "from 2-5 times faster than a typical external sort". This is also the same range of benefits we see for this optimization with a popular commercial database. With the limit/sort optimization we have seen more dramatic results, but I think those are less typical cases.
Here are some results for a 1GB table and a simple COUNT(DISTINCT) on a column with 7 unique values from my dual CPU laptop running Greenplum DB (PG 8.2.1 compatible) on both CPUs. Note that my laptop has 2GB of RAM so I have the 1GB table loaded into OS I/O cache. The unmodified external sort spills the sorted attribute to disk, but that takes little time. Note that the COUNT(DISTINCT) plan embeds a sort as the transition function in the aggregation node.
=================================================================
================= No Distinct Optimization in Sort ==============
=================================================================
lukelonergan=# select count(distinct l_shipmode) from lineitem;
count
-------
7
(1 row)
Time: 37832.308 ms
lukelonergan=# explain analyze select count(distinct l_shipmode) from lineitem; QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=159175.30..159175.31 rows=1 width=8)
Total 1 rows with 40899 ms to end, start offset by 3.189 ms.
-> Gather Motion 2:1 (slice2) (cost=159175.25..159175.28 rows=1 width=8)
recv: Total 2 rows with 39387 ms to first row, 40899 ms to end, start offset by 3.191 ms.
-> Aggregate (cost=159175.25..159175.26 rows=1 width=8)
Avg 1.00 rows x 2 workers. Max 1 rows (seg0) with 39367 ms to end, start offset by 22 ms.
-> Redistribute Motion 2:2 (slice1) (cost=0.00..151672.00 rows=3001300 width=8)
recv: Avg 3000607.50 rows x 2 workers. Max 3429492 rows (seg1) with 0.362 ms to first row, 8643 ms to end, start offset by 23 ms.
Hash Key: lineitem.l_shipmode
-> Seq Scan on lineitem (cost=0.00..91646.00 rows=3001300 width=8)
Avg 3000607.50 rows x 2 workers. Max 3001300 rows (seg0) with 0.049 ms to first row, 2813 ms to end, start offset by 12.998 ms.
Total runtime: 40903.321 ms
(12 rows)
Time: 40904.013 ms
=================================================================
================= With Distinct Optimization in Sort ==============
=================================================================
lukelonergan=# set mpp_sort_flags=1;
SET
Time: 1.425 ms
lukelonergan=# select count(distinct l_shipmode) from lineitem;
count
-------
7
(1 row)
Time: 12846.466 ms
lukelonergan=# explain analyze select count(distinct l_shipmode) from lineitem;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=159175.30..159175.31 rows=1 width=8)
Total 1 rows with 13754 ms to end, start offset by 2.998 ms.
-> Gather Motion 2:1 (slice2) (cost=159175.25..159175.28 rows=1 width=8)
recv: Total 2 rows with 13754 ms to end, start offset by 3.000 ms.
-> Aggregate (cost=159175.25..159175.26 rows=1 width=8)
Avg 1.00 rows x 2 workers. Max 1 rows (seg0) with 13734 ms to end, start offset by 23 ms.
-> Redistribute Motion 2:2 (slice1) (cost=0.00..151672.00 rows=3001300 width=8)
recv: Avg 3000607.50 rows x 2 workers. Max 3429492 rows (seg1) with 0.352 ms to first row, 10145 ms to end, start offset by 26 ms.
Hash Key: lineitem.l_shipmode
-> Seq Scan on lineitem (cost=0.00..91646.00 rows=3001300 width=8)
Avg 3000607.50 rows x 2 workers. Max 3001300 rows (seg0) with 0.032 ms to first row, 4048 ms to end, start offset by 13.037 ms.
Total runtime: 13757.524 ms
(12 rows)
Time: 13758.182 ms
================= Background Information ==============
lukelonergan=# select count(*) from lineitem;
count
---------
6001215
(1 row)
Time: 1661.337 ms
lukelonergan=# \d lineitem
Table "public.lineitem"
Column | Type | Modifiers
-----------------+-----------------------+-----------
l_orderkey | integer | not null
l_partkey | integer | not null
l_suppkey | integer | not null
l_linenumber | integer | not null
l_quantity | double precision | not null
l_extendedprice | double precision | not null
l_discount | double precision | not null
l_tax | double precision | not null
l_returnflag | text | not null
l_linestatus | text | not null
l_shipdate | date | not null
l_commitdate | date | not null
l_receiptdate | date | not null
l_shipinstruct | text | not null
l_shipmode | text | not null
l_comment | character varying(44) | not null
Distributed by: (l_orderkey)
lukelonergan=# select pg_relation_size(oid)/(1000.*1000.) as MB, relname as Table from pg_class order by MB desc limit 10;
mb | table
------------------------+--------------------------------
1009.4755840000000000 | lineitem
230.0559360000000000 | orders
146.7678720000000000 | partsupp
35.8072320000000000 | part
32.8908800000000000 | customer
1.9333120000000000 | supplier
1.1304960000000000 | pg_proc
0.88473600000000000000 | pg_proc_proname_args_nsp_index
0.81100800000000000000 | pg_attribute
0.81100800000000000000 | pg_depend
- Luke
On 1/30/07 9:55 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Gregory Stark wrote:
>>> (Incidentally I'm not sure where 2-5x comes from. It's entirely dependant on
>>> your data distribution. It's not hard to come up with distributions where
>>> it's
>>> 1000x as fast and others where there's no speed difference.)
>
>> So the figure is really "1-1000x"? I bet this one is more impressive in
>> PHB terms.
>
> Luke has a bad habit of quoting numbers that are obviously derived from
> narrow benchmarking scenarios as Universal Truths, rather than providing
> the context they were derived in. I wish he'd stop doing that...
In this case I was referring to results obtained using grouping and distinct optimizations within sort where the benefit is from the use of a single pass instead of the multiple merge passes for external sort followed by a UNIQUE operator. In this case, the benefit ranges from 2-5x in many examples as I mentioned: "from 2-5 times faster than a typical external sort". This is also the same range of benefits we see for this optimization with a popular commercial database. With the limit/sort optimization we have seen more dramatic results, but I think those are less typical cases.
Here are some results for a 1GB table and a simple COUNT(DISTINCT) on a column with 7 unique values from my dual CPU laptop running Greenplum DB (PG 8.2.1 compatible) on both CPUs. Note that my laptop has 2GB of RAM so I have the 1GB table loaded into OS I/O cache. The unmodified external sort spills the sorted attribute to disk, but that takes little time. Note that the COUNT(DISTINCT) plan embeds a sort as the transition function in the aggregation node.
=================================================================
================= No Distinct Optimization in Sort ==============
=================================================================
lukelonergan=# select count(distinct l_shipmode) from lineitem;
count
-------
7
(1 row)
Time: 37832.308 ms
lukelonergan=# explain analyze select count(distinct l_shipmode) from lineitem; QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=159175.30..159175.31 rows=1 width=8)
Total 1 rows with 40899 ms to end, start offset by 3.189 ms.
-> Gather Motion 2:1 (slice2) (cost=159175.25..159175.28 rows=1 width=8)
recv: Total 2 rows with 39387 ms to first row, 40899 ms to end, start offset by 3.191 ms.
-> Aggregate (cost=159175.25..159175.26 rows=1 width=8)
Avg 1.00 rows x 2 workers. Max 1 rows (seg0) with 39367 ms to end, start offset by 22 ms.
-> Redistribute Motion 2:2 (slice1) (cost=0.00..151672.00 rows=3001300 width=8)
recv: Avg 3000607.50 rows x 2 workers. Max 3429492 rows (seg1) with 0.362 ms to first row, 8643 ms to end, start offset by 23 ms.
Hash Key: lineitem.l_shipmode
-> Seq Scan on lineitem (cost=0.00..91646.00 rows=3001300 width=8)
Avg 3000607.50 rows x 2 workers. Max 3001300 rows (seg0) with 0.049 ms to first row, 2813 ms to end, start offset by 12.998 ms.
Total runtime: 40903.321 ms
(12 rows)
Time: 40904.013 ms
=================================================================
================= With Distinct Optimization in Sort ==============
=================================================================
lukelonergan=# set mpp_sort_flags=1;
SET
Time: 1.425 ms
lukelonergan=# select count(distinct l_shipmode) from lineitem;
count
-------
7
(1 row)
Time: 12846.466 ms
lukelonergan=# explain analyze select count(distinct l_shipmode) from lineitem;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=159175.30..159175.31 rows=1 width=8)
Total 1 rows with 13754 ms to end, start offset by 2.998 ms.
-> Gather Motion 2:1 (slice2) (cost=159175.25..159175.28 rows=1 width=8)
recv: Total 2 rows with 13754 ms to end, start offset by 3.000 ms.
-> Aggregate (cost=159175.25..159175.26 rows=1 width=8)
Avg 1.00 rows x 2 workers. Max 1 rows (seg0) with 13734 ms to end, start offset by 23 ms.
-> Redistribute Motion 2:2 (slice1) (cost=0.00..151672.00 rows=3001300 width=8)
recv: Avg 3000607.50 rows x 2 workers. Max 3429492 rows (seg1) with 0.352 ms to first row, 10145 ms to end, start offset by 26 ms.
Hash Key: lineitem.l_shipmode
-> Seq Scan on lineitem (cost=0.00..91646.00 rows=3001300 width=8)
Avg 3000607.50 rows x 2 workers. Max 3001300 rows (seg0) with 0.032 ms to first row, 4048 ms to end, start offset by 13.037 ms.
Total runtime: 13757.524 ms
(12 rows)
Time: 13758.182 ms
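For comparison, on stock PostgreSQL (without the sort-dedup flag benchmarked above) the usual workaround for a sort-bound count(DISTINCT ...) is to push the deduplication into a GROUP BY subquery, which the planner can execute with a HashAggregate instead of a full sort. A hedged sketch against the same table:

```sql
-- Workaround sketch, assuming stock PostgreSQL (no mpp_sort_flags):
-- GROUP BY is eligible for HashAggregate, so the 6M-row input is
-- hashed down to the 7 distinct l_shipmode values without sorting.
SELECT count(*)
FROM (SELECT l_shipmode
      FROM lineitem
      GROUP BY l_shipmode) AS distinct_modes;
```

Whether the planner actually chooses the hash plan depends on work_mem being large enough for the estimated number of groups.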
================= Background Information ==============
lukelonergan=# select count(*) from lineitem;
count
---------
6001215
(1 row)
Time: 1661.337 ms
lukelonergan=# \d lineitem
Table "public.lineitem"
Column | Type | Modifiers
-----------------+-----------------------+-----------
l_orderkey | integer | not null
l_partkey | integer | not null
l_suppkey | integer | not null
l_linenumber | integer | not null
l_quantity | double precision | not null
l_extendedprice | double precision | not null
l_discount | double precision | not null
l_tax | double precision | not null
l_returnflag | text | not null
l_linestatus | text | not null
l_shipdate | date | not null
l_commitdate | date | not null
l_receiptdate | date | not null
l_shipinstruct | text | not null
l_shipmode | text | not null
l_comment | character varying(44) | not null
Distributed by: (l_orderkey)
lukelonergan=# select pg_relation_size(oid)/(1000.*1000.) as MB, relname as Table from pg_class order by MB desc limit 10;
mb | table
------------------------+--------------------------------
1009.4755840000000000 | lineitem
230.0559360000000000 | orders
146.7678720000000000 | partsupp
35.8072320000000000 | part
32.8908800000000000 | customer
1.9333120000000000 | supplier
1.1304960000000000 | pg_proc
0.88473600000000000000 | pg_proc_proname_args_nsp_index
0.81100800000000000000 | pg_attribute
0.81100800000000000000 | pg_depend
- Luke
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Gregory Stark wrote:
>>> (Incidentally I'm not sure where 2-5x comes from. It's entirely
>>> dependent on your data distribution. It's not hard to come up with
>>> distributions where it's 1000x as fast and others where there's no
>>> speed difference.)
>
>> So the figure is really "1-1000x"? I bet this one is more impressive in
>> PHB terms.
>
> Luke has a bad habit of quoting numbers that are obviously derived from
> narrow benchmarking scenarios as Universal Truths, rather than providing
> the context they were derived in. I wish he'd stop doing that...

In fairness, I may have exaggerated a bit there. There is a limit to how
much of a speedup you can get in valid benchmarking situations. A single
sequential scan is always going to be necessary, so you're only saving the
cost of writing out the temporary file and the subsequent merge passes.

It's hard to generate lots of intermediate merge passes, since there are
only O(log(n)) of them. So to get a 1000x speedup on a large I/O-bound
sort you would have to be sorting something on the order of 2^1000
records, which is ridiculous. Realistically you should only be able to
save 2-5 intermediate merge passes.

On the other hand, there are some common situations where you could see
atypical increases. Consider joining a bunch of small tables to generate a
large result set. The small tables are probably all in memory, and the
result set may have only a small number of distinct values. If you throw
out the duplicates early, you save *all* the I/O. If you have to do a disk
sort, it could be many orders of magnitude slower.

This is actually not an uncommon coding idiom for MySQL programmers
accustomed to fast DISTINCT, working around the lack of subqueries and the
poor performance of IN and EXISTS. They often just join together all the
tables in a big cross join and then toss in a DISTINCT at the top to get
rid of the duplicates.
--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
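The idiom Stark describes, and the rewrite that discards duplicates early, can be sketched with hypothetical tables (names are illustrative, not from the thread):

```sql
-- The MySQL-style idiom: build a large join product, then
-- deduplicate the whole thing at the top with DISTINCT.
SELECT DISTINCT c.customer_id
FROM customers c, orders o, order_items i
WHERE o.customer_id = c.customer_id
  AND i.order_id    = o.order_id;

-- Throwing out duplicates early (here via EXISTS) never materializes
-- the large intermediate result, so there is nothing to sort.
SELECT c.customer_id
FROM customers c
WHERE EXISTS (SELECT 1
              FROM orders o
              JOIN order_items i ON i.order_id = o.order_id
              WHERE o.customer_id = c.customer_id);
```

The first form forces the executor to produce every matching row combination before deduplicating; the second can stop probing each customer after the first match.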
I strongly encourage anyone who is interested in the general external
sorting problem to peruse Jim Gray's site:
http://research.microsoft.com/barc/SortBenchmark/

Ron Peacetree

At 08:24 AM 1/31/2007, Gregory Stark wrote:
>[full quote of Gregory's message above trimmed]
Gregory Stark <gsstark@mit.edu> writes:
> On the other hand, there are some common situations where you could see
> atypical increases. Consider joining a bunch of small tables to generate
> a large result set. The small tables are probably all in memory, and the
> result set may have only a small number of distinct values. If you throw
> out the duplicates early, you save *all* the I/O. If you have to do a
> disk sort, it could be many orders of magnitude slower.

Right, we already have support for doing that well, in the form of hashed
aggregation. What needs to happen is to get that to work for DISTINCT as
well as GROUP BY. IIRC, DISTINCT is currently rather thoroughly
intertwined with ORDER BY, and we'd have to figure out some way to
decouple them --- without breaking DISTINCT ON, which makes it a lot
harder :-(

			regards, tom lane
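Until that decoupling happens, the DISTINCT/GROUP BY asymmetry Tom describes can be worked around by hand. Applied to the original poster's query (table and column names from the start of the thread), a hedged sketch:

```sql
-- On 8.1, SELECT DISTINCT always plans Sort + Unique, but the
-- equivalent GROUP BY may be planned as a HashAggregate, skipping
-- the sort of all ~3.4M rows.
SET work_mem = 65536;   -- per-session bump (kB) so the ~10,000-group
                        -- hash table comfortably fits in memory
SELECT a, b
FROM tbl
GROUP BY a, b;
```

The two queries return the same rows (modulo output order); only the available plan shapes differ.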