Home > mailing lists

Re: select count(distinct ...) is slower than select distinct in about 5x - Mailing list pgsql-performance

From	jacket41142
Subject	Re: select count(distinct ...) is slower than select distinct in about 5x
Date	December 12, 2013 06:22:43
Msg-id	CAONnt+6W2GS7BO+fry-2ayehWerVVeKv+THgGVByYsPUrrgwmw@mail.gmail.com Whole thread Raw
In response to	Re: select count(distinct ...) is slower than select distinct in about 5x (Jeff Janes <jeff.janes@gmail.com>)
List	pgsql-performance

Tree view

2013/12/11 Jeff Janes <jeff.janes@gmail.com>

On Tuesday, December 10, 2013, jacket41142 wrote:
Thanks very much.

I think another problem is that the cost estimation isn't good enough to reflex real cost. Since we can see, from "explain analyze ...", count(distinct ...) has smallest cost between the others, but since it uses sorts, the time complexity should be higher especially for large amount of rows.

That one is easy to explain. The cost estimate is not intended to be an absolute estimate, it is just an relative estimate to choose between alternatives. Since the current implementation of count(distinct ...) does not *have* any alternatives for that step in the process, there is no point in estimating a cost for it. So part of providing it with alternatives will have to be providing those cost estimates as well.

I got it. Thanks very much for explain.

Also I think even if we can have multiple count() expressions, the optimizer should also be able to choose between use sort, HashAggregate or maybe something like linear aggregate if sorts are not needed or other methods if exist. Also this may be done as just one job for entire table of interested columns, or for each column separately.

Right. I hope this gets fixed. It's been on my todo list for a while, but at the current rate of going through my todo list, it will takes a few decades to get to if it is left up to me....

Thanks very much for your effort. Also it's still good to know for me that this problem will be fixed in future. :)

And so until now, if someone want to use count(distinct ...), he can use a workaround like subquery if performance is a concern. (Of course, he also needs to take care about NULL values as mentioned in http://www.postgresql.org/message-id/flat/CAPNY-2Utce-c+kNTwsMCbAk58=9mYeAeViTXT9LO7r1k77jukw@mail.gmail.com#CAPNY-2Utce-c+kNTwsMCbAk58=9mYeAeViTXT9LO7r1k77jukw@mail.gmail.com)

best regards,

jacket41142

pgsql-performance by date:

From: Mack Talcott
Date: 12 December 2013, 05:04:15
Subject: Re: Debugging shared memory issues on CentOS

From: Tom Lane
Date: 12 December 2013, 06:40:05
Subject: Re: Debugging shared memory issues on CentOS

Re: select count(distinct ...) is slower than select distinct in about 5x - Mailing list pgsql-performance

Previous

Next