Thread: How slow is DISTINCT?

How slow is DISTINCT?

From

Wei Weng

Date:

27 February 2002, 15:20:59

I tend to use DISTINCT a lot in my queries to limit the repeated rows.
Is that a good habbit?

How exactly slow is DISTINCT being processed in SQL engines? (not
limited to postgresql, though comments on postgresql would be most relevant)

Thanks

-- 
Wei Weng
Network Software Engineer
KenCast Inc.

Re: How slow is DISTINCT?

From

"Josh Berkus"

Date:

27 February 2002, 16:49:25

Wei Wang,

> How exactly slow is DISTINCT being processed in SQL engines? (not
> limited to postgresql, though comments on postgresql would be most
>  relevant)

I can only give you a relative result, based exlusively on my anecdotalexperience with 7.1:

Fast:  SELECT ...
Slower: SELECT ... GROUP BY x,y,z    or:  SELECT DISCTINCT ON (x) ... (Postgres non-standard extension)
SLowest: SELECT DISTINCT ...

The reason for this is that SELECT DISTINCT is effectively a GROUP BYon all result fields of the query, and if a few of
thearen't indexedthat requires a seq scan.
 

If performance is an issue, you may wish to consider restructuring yourqueries and/or data model to eliminate the
actualduplicate rows.
 

-Josh

Re: How slow is DISTINCT?

From

"Dan Langille"

Date:

27 February 2002, 17:10:25

On 27 Feb 2002 at 13:27, Wei Weng wrote:

> I tend to use DISTINCT a lot in my queries to limit the repeated rows. Is
> that a good habbit?

If that's what you need, use it.

> How exactly slow is DISTINCT being processed in SQL engines? (not
> limited to postgresql, though comments on postgresql would be most
> relevant)

Try it.  Compare the same SQL with and without DISTINCT.
-- 
Dan Langille
The FreeBSD Diary - http://freebsddiary.org/ - practical examples

Re: How slow is DISTINCT?

From

Wei Weng

Date:

27 February 2002, 17:50:45

Josh Berkus wrote:
> Wei Wang,
> 
> 
>>How exactly slow is DISTINCT being processed in SQL engines? (not
>>limited to postgresql, though comments on postgresql would be most
>> relevant)
>>
> 
> I can only give you a relative result, based exlusively on my anecdotal
>  experience with 7.1:
> 
> Fast:  SELECT ...
> Slower: SELECT ... GROUP BY x,y,z
>      or:  SELECT DISCTINCT ON (x) ... (Postgres non-standard extension)
> SLowest: SELECT DISTINCT ...
> 
> The reason for this is that SELECT DISTINCT is effectively a GROUP BY
>  on all result fields of the query, and if a few of the aren't indexed
>  that requires a seq scan.
What if I do thing like

SELECT DISTINCT table1.tid, table1.name, table1.description FROM ...

Does it equal to the scenario 2 or 3?

I am thinking SELECT DISTINCT table1.tid is just a variation (or the 
other way around) of SELECT DISTINCT ON (table1.tid), is that right?

Thanks

-- 
Wei Weng
Network Software Engineer
KenCast Inc.

Re: How slow is DISTINCT?

From

"Josh Berkus"

Date:

27 February 2002, 17:54:46

Wei,

> SELECT DISTINCT table1.tid, table1.name, table1.description FROM ...
> 
> Does it equal to the scenario 2 or 3?

Three.

> I am thinking SELECT DISTINCT table1.tid is just a variation (or the
>  other way around) of SELECT DISTINCT ON (table1.tid), is that right?

Wrong.  SELECT DISTINCT ON table1.tid takes the table1.tid field andthe first related other data it can find.  SELECT
DISTINCTlooks forthe unique combinaiton of all fields selected.  In a multi-table, manyfield query, the former is
fasterthan the latter bceause less data isbeing checked for uniqueness, and extra data is simply discarded.
 

Tom, please correct me if I'm totally out on a limb, here.

-Josh

______AGLIO DATABASE SOLUTIONS___________________________                                      Josh Berkus Complete
informationtechnology      josh@agliodbs.com  and data management solutions       (415) 565-7293 for law firms, small
businesses       fax 621-2533   and non-profit organizations.      San Francisco

Re: How slow is DISTINCT?

From

Wei Weng

Date:

27 February 2002, 18:12:55

Josh Berkus wrote:
> Wei,
> 
> 
>>SELECT DISTINCT table1.tid, table1.name, table1.description FROM ...
>>
>>Does it equal to the scenario 2 or 3?
>>
> 
> Three.
> 
> 
>>I am thinking SELECT DISTINCT table1.tid is just a variation (or the
>> other way around) of SELECT DISTINCT ON (table1.tid), is that right?
>>
> 
> Wrong.  SELECT DISTINCT ON table1.tid takes the table1.tid field and
>  the first related other data it can find.  SELECT DISTINCT looks for
>  the unique combinaiton of all fields selected.  In a multi-table, many
>  field query, the former is faster than the latter bceause less data is
>  being checked for uniqueness, and extra data is simply discarded.
> 
> Tom, please correct me if I'm totally out on a limb, here.

Is that(DISTINCT ON) SQL standard compliant or a Postgresql extension?



-- 
Wei Weng
Network Software Engineer
KenCast Inc.

Re: How slow is DISTINCT?

From

TimothyReaves@westfieldgrp.com

Date:

28 February 2002, 08:48:35

Wei Weng wrote:

> Josh Berkus wrote:
>
>>Wei Wang,
>>
>>
>>
>>>How exactly slow is DISTINCT being processed in SQL engines? (not
>>>limited to postgresql, though comments on postgresql would be most
>>>relevant)
>>>
>>>
>>I can only give you a relative result, based exlusively on my anecdotal
>> experience with 7.1:
>>
>>Fast:  SELECT ...
>>Slower: SELECT ... GROUP BY x,y,z
>>     or:  SELECT DISCTINCT ON (x) ... (Postgres non-standard extension)
>>SLowest: SELECT DISTINCT ...
>>
>>The reason for this is that SELECT DISTINCT is effectively a GROUP BY
>> on all result fields of the query, and if a few of the aren't indexed
>> that requires a seq scan.
>>
> What if I do thing like
>
> SELECT DISTINCT table1.tid, table1.name, table1.description FROM ...
>
> Does it equal to the scenario 2 or 3?
>
> I am thinking SELECT DISTINCT table1.tid is just a variation (or the
> other way around) of SELECT DISTINCT ON (table1.tid), is that right?
>
> Thanks
>
>
    Of course, a more basic question is, why so much repeated data?
Perhaps you should look at the design in your tables.  I'm not assuming
they are wrong, only suggesting that you look.

Re: How slow is DISTINCT?

From

Michael Contzen

Date:

15 April 2002, 08:52:12

Hello,

our problem to 'select distinct' is similar: We have a big table of
about 320.000.000 rows of an datawarehouse application. 55 GB, yes, it
works fine! 

Createing an index on it quite fast (nearly as fast than oracle on the
same data and same machine) 1GHz P4, 4GB-Ram -> 30 Minutes. 

Then we tried 'select distinct one_field' which would result to about
200.000 different values. Postgres needed 6 hours while Oracle managed
it in about 30 minutes.

Looking into the pgsql_tmp directory of this db while doing this
selection showed me a lot of tempfiles nearly as big as the table.

Does postgres sort the whole table without projection to one column an
performs a unique on this whole table?
This would explain the big amount of disk usage in pgsql_tmp and the big
amount of time.

The statement, something could be wrong with the data, is not very
useful: This is data of our electronic cash-desks. Unfortunately our
customers buy every day nearly the same articles - therefor the
repeatition of data :-)


Kind regards

M.Contzen
Developer 
Dohle Systemberatung
Germany


Some facts of our test:
        Table "warenausg_ges" Column   |     Type      | Modifiers 
-----------+---------------+-----------ean       | numeric(13,0) | menge     | numeric(13,3) | lvkumsatz |
numeric(15,3)| vkumsatz  | numeric(15,3) | ekumsatz  | numeric(15,3) | rabatt    | numeric(12,0) | kdnr      |
numeric(10,0)| artnr     | numeric(10,0) | lfnr      | numeric(10,0) | wg        | integer       | aktion    |
character(1) | datum     | date          | status    | integer       | 
 
Indexes: warenausg_ges_inx

Index "warenausg_ges_inx"Column | Type 
--------+------datum  | date
btree

explain select distinct artnr from warenausg_ges;
NOTICE:  QUERY PLAN:

Unique  (cost=224522801.22..225315849.86 rows=31721946 width=12) ->  Sort  (cost=224522801.22..224522801.22
rows=317219456width=12)       ->  Seq Scan on warenausg_ges  (cost=0.00..165793667.00
 
rows=317219456 width=12)

Re: How slow is DISTINCT?

From

Tom Lane

Date:

15 April 2002, 11:41:47

Michael Contzen <mcontzen@dohle.com> writes:
> Does postgres sort the whole table without projection to one column an
> performs a unique on this whole table?

No, it only sorts the column(s) being selected.

My guess is that your performance problem comes from using "numeric"
datatype.  Consider int or bigint instead of numeric(10,0).
        regards, tom lane

Re: How slow is DISTINCT?

From

"Christopher Kings-Lynne"

Date:

15 April 2002, 21:53:24

What happens if you do...

select artnr from warenausg_ges group by artnr;

or even

select distinct (artnr) from warenausg_ges;

or

select distinct on (artnr) from warenausg_ges;

...sort of ideas...

Another idea is to maintain a table of the unique values in the column.
Create a trigger on the main table to maintian the list of unique values...

Chris

> -----Original Message-----
> From: pgsql-sql-owner@postgresql.org
> [mailto:pgsql-sql-owner@postgresql.org]On Behalf Of Michael Contzen
> Sent: Tuesday, 9 April 2002 6:16 PM
> To: pgsql-sql@postgresql.org
> Subject: Re: [SQL] How slow is DISTINCT?
>
>
> Hello,
>
> our problem to 'select distinct' is similar: We have a big table of
> about 320.000.000 rows of an datawarehouse application. 55 GB, yes, it
> works fine!
>
> Createing an index on it quite fast (nearly as fast than oracle on the
> same data and same machine) 1GHz P4, 4GB-Ram -> 30 Minutes.
>
> Then we tried 'select distinct one_field' which would result to about
> 200.000 different values. Postgres needed 6 hours while Oracle managed
> it in about 30 minutes.
>
> Looking into the pgsql_tmp directory of this db while doing this
> selection showed me a lot of tempfiles nearly as big as the table.
>
> Does postgres sort the whole table without projection to one column an
> performs a unique on this whole table?
> This would explain the big amount of disk usage in pgsql_tmp and the big
> amount of time.
>
> The statement, something could be wrong with the data, is not very
> useful: This is data of our electronic cash-desks. Unfortunately our
> customers buy every day nearly the same articles - therefor the
> repeatition of data :-)
>
>
> Kind regards
>
> M.Contzen
> Developer
> Dohle Systemberatung
> Germany
>
>
> Some facts of our test:
>
>          Table "warenausg_ges"
>   Column   |     Type      | Modifiers
> -----------+---------------+-----------
>  ean       | numeric(13,0) |
>  menge     | numeric(13,3) |
>  lvkumsatz | numeric(15,3) |
>  vkumsatz  | numeric(15,3) |
>  ekumsatz  | numeric(15,3) |
>  rabatt    | numeric(12,0) |
>  kdnr      | numeric(10,0) |
>  artnr     | numeric(10,0) |
>  lfnr      | numeric(10,0) |
>  wg        | integer       |
>  aktion    | character(1)  |
>  datum     | date          |
>  status    | integer       |
> Indexes: warenausg_ges_inx
>
> Index "warenausg_ges_inx"
>  Column | Type
> --------+------
>  datum  | date
> btree
>
> explain select distinct artnr from warenausg_ges;
> NOTICE:  QUERY PLAN:
>
> Unique  (cost=224522801.22..225315849.86 rows=31721946 width=12)
>   ->  Sort  (cost=224522801.22..224522801.22 rows=317219456 width=12)
>         ->  Seq Scan on warenausg_ges  (cost=0.00..165793667.00
> rows=317219456 width=12)
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>