Thread: How slow is DISTINCT?
I tend to use DISTINCT a lot in my queries to limit the repeated rows. Is that a good habbit? How exactly slow is DISTINCT being processed in SQL engines? (not limited to postgresql, though comments on postgresql would be most relevant) Thanks -- Wei Weng Network Software Engineer KenCast Inc.
Wei Wang, > How exactly slow is DISTINCT being processed in SQL engines? (not > limited to postgresql, though comments on postgresql would be most > relevant) I can only give you a relative result, based exlusively on my anecdotalexperience with 7.1: Fast: SELECT ... Slower: SELECT ... GROUP BY x,y,z or: SELECT DISCTINCT ON (x) ... (Postgres non-standard extension) SLowest: SELECT DISTINCT ... The reason for this is that SELECT DISTINCT is effectively a GROUP BYon all result fields of the query, and if a few of thearen't indexedthat requires a seq scan. If performance is an issue, you may wish to consider restructuring yourqueries and/or data model to eliminate the actualduplicate rows. -Josh
On 27 Feb 2002 at 13:27, Wei Weng wrote: > I tend to use DISTINCT a lot in my queries to limit the repeated rows. Is > that a good habbit? If that's what you need, use it. > How exactly slow is DISTINCT being processed in SQL engines? (not > limited to postgresql, though comments on postgresql would be most > relevant) Try it. Compare the same SQL with and without DISTINCT. -- Dan Langille The FreeBSD Diary - http://freebsddiary.org/ - practical examples
Josh Berkus wrote: > Wei Wang, > > >>How exactly slow is DISTINCT being processed in SQL engines? (not >>limited to postgresql, though comments on postgresql would be most >> relevant) >> > > I can only give you a relative result, based exlusively on my anecdotal > experience with 7.1: > > Fast: SELECT ... > Slower: SELECT ... GROUP BY x,y,z > or: SELECT DISCTINCT ON (x) ... (Postgres non-standard extension) > SLowest: SELECT DISTINCT ... > > The reason for this is that SELECT DISTINCT is effectively a GROUP BY > on all result fields of the query, and if a few of the aren't indexed > that requires a seq scan. What if I do thing like SELECT DISTINCT table1.tid, table1.name, table1.description FROM ... Does it equal to the scenario 2 or 3? I am thinking SELECT DISTINCT table1.tid is just a variation (or the other way around) of SELECT DISTINCT ON (table1.tid), is that right? Thanks -- Wei Weng Network Software Engineer KenCast Inc.
Wei, > SELECT DISTINCT table1.tid, table1.name, table1.description FROM ... > > Does it equal to the scenario 2 or 3? Three. > I am thinking SELECT DISTINCT table1.tid is just a variation (or the > other way around) of SELECT DISTINCT ON (table1.tid), is that right? Wrong. SELECT DISTINCT ON table1.tid takes the table1.tid field andthe first related other data it can find. SELECT DISTINCTlooks forthe unique combinaiton of all fields selected. In a multi-table, manyfield query, the former is fasterthan the latter bceause less data isbeing checked for uniqueness, and extra data is simply discarded. Tom, please correct me if I'm totally out on a limb, here. -Josh ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete informationtechnology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
Josh Berkus wrote: > Wei, > > >>SELECT DISTINCT table1.tid, table1.name, table1.description FROM ... >> >>Does it equal to the scenario 2 or 3? >> > > Three. > > >>I am thinking SELECT DISTINCT table1.tid is just a variation (or the >> other way around) of SELECT DISTINCT ON (table1.tid), is that right? >> > > Wrong. SELECT DISTINCT ON table1.tid takes the table1.tid field and > the first related other data it can find. SELECT DISTINCT looks for > the unique combinaiton of all fields selected. In a multi-table, many > field query, the former is faster than the latter bceause less data is > being checked for uniqueness, and extra data is simply discarded. > > Tom, please correct me if I'm totally out on a limb, here. Is that(DISTINCT ON) SQL standard compliant or a Postgresql extension? -- Wei Weng Network Software Engineer KenCast Inc.
Wei Weng wrote: > Josh Berkus wrote: > >>Wei Wang, >> >> >> >>>How exactly slow is DISTINCT being processed in SQL engines? (not >>>limited to postgresql, though comments on postgresql would be most >>>relevant) >>> >>> >>I can only give you a relative result, based exlusively on my anecdotal >> experience with 7.1: >> >>Fast: SELECT ... >>Slower: SELECT ... GROUP BY x,y,z >> or: SELECT DISCTINCT ON (x) ... (Postgres non-standard extension) >>SLowest: SELECT DISTINCT ... >> >>The reason for this is that SELECT DISTINCT is effectively a GROUP BY >> on all result fields of the query, and if a few of the aren't indexed >> that requires a seq scan. >> > What if I do thing like > > SELECT DISTINCT table1.tid, table1.name, table1.description FROM ... > > Does it equal to the scenario 2 or 3? > > I am thinking SELECT DISTINCT table1.tid is just a variation (or the > other way around) of SELECT DISTINCT ON (table1.tid), is that right? > > Thanks > > Of course, a more basic question is, why so much repeated data? Perhaps you should look at the design in your tables. I'm not assuming they are wrong, only suggesting that you look.
Hello, our problem to 'select distinct' is similar: We have a big table of about 320.000.000 rows of an datawarehouse application. 55 GB, yes, it works fine! Createing an index on it quite fast (nearly as fast than oracle on the same data and same machine) 1GHz P4, 4GB-Ram -> 30 Minutes. Then we tried 'select distinct one_field' which would result to about 200.000 different values. Postgres needed 6 hours while Oracle managed it in about 30 minutes. Looking into the pgsql_tmp directory of this db while doing this selection showed me a lot of tempfiles nearly as big as the table. Does postgres sort the whole table without projection to one column an performs a unique on this whole table? This would explain the big amount of disk usage in pgsql_tmp and the big amount of time. The statement, something could be wrong with the data, is not very useful: This is data of our electronic cash-desks. Unfortunately our customers buy every day nearly the same articles - therefor the repeatition of data :-) Kind regards M.Contzen Developer Dohle Systemberatung Germany Some facts of our test: Table "warenausg_ges" Column | Type | Modifiers -----------+---------------+-----------ean | numeric(13,0) | menge | numeric(13,3) | lvkumsatz | numeric(15,3)| vkumsatz | numeric(15,3) | ekumsatz | numeric(15,3) | rabatt | numeric(12,0) | kdnr | numeric(10,0)| artnr | numeric(10,0) | lfnr | numeric(10,0) | wg | integer | aktion | character(1) | datum | date | status | integer | Indexes: warenausg_ges_inx Index "warenausg_ges_inx"Column | Type --------+------datum | date btree explain select distinct artnr from warenausg_ges; NOTICE: QUERY PLAN: Unique (cost=224522801.22..225315849.86 rows=31721946 width=12) -> Sort (cost=224522801.22..224522801.22 rows=317219456width=12) -> Seq Scan on warenausg_ges (cost=0.00..165793667.00 rows=317219456 width=12)
Michael Contzen <mcontzen@dohle.com> writes: > Does postgres sort the whole table without projection to one column an > performs a unique on this whole table? No, it only sorts the column(s) being selected. My guess is that your performance problem comes from using "numeric" datatype. Consider int or bigint instead of numeric(10,0). regards, tom lane
What happens if you do... select artnr from warenausg_ges group by artnr; or even select distinct (artnr) from warenausg_ges; or select distinct on (artnr) from warenausg_ges; ...sort of ideas... Another idea is to maintain a table of the unique values in the column. Create a trigger on the main table to maintian the list of unique values... Chris > -----Original Message----- > From: pgsql-sql-owner@postgresql.org > [mailto:pgsql-sql-owner@postgresql.org]On Behalf Of Michael Contzen > Sent: Tuesday, 9 April 2002 6:16 PM > To: pgsql-sql@postgresql.org > Subject: Re: [SQL] How slow is DISTINCT? > > > Hello, > > our problem to 'select distinct' is similar: We have a big table of > about 320.000.000 rows of an datawarehouse application. 55 GB, yes, it > works fine! > > Createing an index on it quite fast (nearly as fast than oracle on the > same data and same machine) 1GHz P4, 4GB-Ram -> 30 Minutes. > > Then we tried 'select distinct one_field' which would result to about > 200.000 different values. Postgres needed 6 hours while Oracle managed > it in about 30 minutes. > > Looking into the pgsql_tmp directory of this db while doing this > selection showed me a lot of tempfiles nearly as big as the table. > > Does postgres sort the whole table without projection to one column an > performs a unique on this whole table? > This would explain the big amount of disk usage in pgsql_tmp and the big > amount of time. > > The statement, something could be wrong with the data, is not very > useful: This is data of our electronic cash-desks. Unfortunately our > customers buy every day nearly the same articles - therefor the > repeatition of data :-) > > > Kind regards > > M.Contzen > Developer > Dohle Systemberatung > Germany > > > Some facts of our test: > > Table "warenausg_ges" > Column | Type | Modifiers > -----------+---------------+----------- > ean | numeric(13,0) | > menge | numeric(13,3) | > lvkumsatz | numeric(15,3) | > vkumsatz | numeric(15,3) | > ekumsatz | numeric(15,3) | > rabatt | numeric(12,0) | > kdnr | numeric(10,0) | > artnr | numeric(10,0) | > lfnr | numeric(10,0) | > wg | integer | > aktion | character(1) | > datum | date | > status | integer | > Indexes: warenausg_ges_inx > > Index "warenausg_ges_inx" > Column | Type > --------+------ > datum | date > btree > > explain select distinct artnr from warenausg_ges; > NOTICE: QUERY PLAN: > > Unique (cost=224522801.22..225315849.86 rows=31721946 width=12) > -> Sort (cost=224522801.22..224522801.22 rows=317219456 width=12) > -> Seq Scan on warenausg_ges (cost=0.00..165793667.00 > rows=317219456 width=12) > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >