Re: How slow is DISTINCT? - Mailing list pgsql-sql

From Michael Contzen
Subject Re: How slow is DISTINCT?
Date
Msg-id 3CB2BF53.77D41E79@dohle.com
Whole thread Raw
In response to How slow is DISTINCT?  (Wei Weng <wweng@kencast.com>)
Responses Re: How slow is DISTINCT?  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: How slow is DISTINCT?  ("Christopher Kings-Lynne" <chriskl@familyhealth.com.au>)
List pgsql-sql
Hello,

our problem to 'select distinct' is similar: We have a big table of
about 320.000.000 rows of an datawarehouse application. 55 GB, yes, it
works fine! 

Createing an index on it quite fast (nearly as fast than oracle on the
same data and same machine) 1GHz P4, 4GB-Ram -> 30 Minutes. 

Then we tried 'select distinct one_field' which would result to about
200.000 different values. Postgres needed 6 hours while Oracle managed
it in about 30 minutes.

Looking into the pgsql_tmp directory of this db while doing this
selection showed me a lot of tempfiles nearly as big as the table.

Does postgres sort the whole table without projection to one column an
performs a unique on this whole table?
This would explain the big amount of disk usage in pgsql_tmp and the big
amount of time.

The statement, something could be wrong with the data, is not very
useful: This is data of our electronic cash-desks. Unfortunately our
customers buy every day nearly the same articles - therefor the
repeatition of data :-)


Kind regards

M.Contzen
Developer 
Dohle Systemberatung
Germany


Some facts of our test:
        Table "warenausg_ges" Column   |     Type      | Modifiers 
-----------+---------------+-----------ean       | numeric(13,0) | menge     | numeric(13,3) | lvkumsatz |
numeric(15,3)| vkumsatz  | numeric(15,3) | ekumsatz  | numeric(15,3) | rabatt    | numeric(12,0) | kdnr      |
numeric(10,0)| artnr     | numeric(10,0) | lfnr      | numeric(10,0) | wg        | integer       | aktion    |
character(1) | datum     | date          | status    | integer       | 
 
Indexes: warenausg_ges_inx

Index "warenausg_ges_inx"Column | Type 
--------+------datum  | date
btree

explain select distinct artnr from warenausg_ges;
NOTICE:  QUERY PLAN:

Unique  (cost=224522801.22..225315849.86 rows=31721946 width=12) ->  Sort  (cost=224522801.22..224522801.22
rows=317219456width=12)       ->  Seq Scan on warenausg_ges  (cost=0.00..165793667.00
 
rows=317219456 width=12)


pgsql-sql by date:

Previous
From: "Dorian Taylor"
Date:
Subject: case statement in group by clause
Next
From: "Pierre-Andre Michel"
Date:
Subject: How to use an index on a bigint column