Thread: Counting Distinct Records
I am wondering whether the following two forms of SELECT statements are logically equivalent:

    SELECT COUNT( DISTINCT table.column ) ...

and

    SELECT DISTINCT COUNT( * ) ...

If they are the same, then why is the latter query much slower in postgres when applied to the same FROM and WHERE clauses?

Furthermore, is there a better way of performing this sort of operation in postgres (or just in SQL in general)?

Thanks!

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005
On Tue, 16 Nov 2004, Thomas F.O'Connell wrote:

> I am wondering whether the following two forms of SELECT statements are
> logically equivalent:
>
> SELECT COUNT( DISTINCT table.column ) ...
>
> and
>
> SELECT DISTINCT COUNT( * ) ...

Not in general.

The former counts how many distinct table.column values there are. The distinct in the latter would be basically meaningless unless there's a group by involved.
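The difference described above can be seen on a few rows. This is only a sketch: the table `vals`, its column `col`, and the data are hypothetical and do not come from the thread.

```sql
-- Hypothetical table and data, for illustration only.
CREATE TEMP TABLE vals (col int);
INSERT INTO vals VALUES (1), (1), (2), (NULL);

-- Counts the distinct non-NULL values of col (1 and 2).
SELECT COUNT(DISTINCT col) FROM vals;          -- 2

-- Without GROUP BY, COUNT(*) produces exactly one row (the row count),
-- so DISTINCT has nothing to de-duplicate.
SELECT DISTINCT COUNT(*) FROM vals;            -- 4

-- With GROUP BY, DISTINCT de-duplicates the per-group counts.
-- Group sizes here are 2 (col = 1), 1 (col = 2) and 1 (col IS NULL),
-- so the result has two rows, 1 and 2, in unspecified order.
SELECT DISTINCT COUNT(*) FROM vals GROUP BY col;
```

Note that COUNT(DISTINCT col) skips NULLs, which is one more way the two forms can disagree.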
Is there another way to accomplish what the former is doing, then?

For practical reasons, I'd like to come up with something better. For theoretical curiosity, I'd like to know whether there's a way to combine COUNT and DISTINCT that still allows one to reference * rather than naming specific columns without grouping.

If I resort to GROUP BY, is there an efficient way of counting all the groups, or would it just be something like:

    SELECT COUNT ( * ) FROM ( SELECT ... GROUP BY ... );

-tfo

On Nov 16, 2004, at 2:03 PM, Stephan Szabo wrote:

> Not in general.
>
> The former counts how many distinct table.column values there are. The
> distinct in the latter would be basically meaningless unless there's a
> group by involved.
On Tue, 16 Nov 2004, Thomas F.O'Connell wrote:

> Is there another way to accomplish what the former is doing, then?

The only thing I can think of is a subselect in from that uses distinct:

    select count(*) from (select distinct ...) foo

That also theoretically allows you to use select distinct * inside the subselect.
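A minimal sketch of the subselect forms discussed so far, assuming a hypothetical two-column table `pairs` (the table, columns, and data are invented for illustration):

```sql
-- Hypothetical table and data, for illustration only.
CREATE TEMP TABLE pairs (a int, b int);
INSERT INTO pairs VALUES (1, 10), (1, 10), (1, 20), (2, 20);

-- Count distinct values of one column via a DISTINCT subselect.
-- Older PostgreSQL releases require an alias (foo) on a subselect in FROM.
SELECT COUNT(*) FROM (SELECT DISTINCT a FROM pairs) AS foo;        -- 2

-- The same count, using GROUP BY inside the subselect instead.
SELECT COUNT(*) FROM (SELECT a FROM pairs GROUP BY a) AS foo;      -- 2

-- Count distinct whole rows by using DISTINCT * inside the subselect.
SELECT COUNT(*) FROM (SELECT DISTINCT * FROM pairs) AS foo;        -- 3
```

One caveat: if `a` contained NULLs, the subselect forms would count NULL as one more value, whereas COUNT(DISTINCT a) would ignore it.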
Hmm. I was more interested in using COUNT( * ) than DISTINCT *.

I want a count of all rows, but I want to be able to specify which columns are distinct.

That's definitely an interesting approach, but testing doesn't show it to be appreciably faster.

If I do a DISTINCT *, postgres will attempt to guarantee that there are no duplicate values across all columns rather than a subset of columns? Is that right?

Anyway, I was just wondering if there were any best practices out there for counting distinct values in sets of values that might not themselves be distinct.

Thanks for the tips so far!

-tfo

On Nov 16, 2004, at 4:34 PM, Stephan Szabo wrote:

> The only thing I can think of is a subselect in from that uses
> distinct.
> select count(*) from (select distinct ...) foo
>
> That also theoretically allows you to use select distinct * inside the
> subselect.
On Tue, 16 Nov 2004, Thomas F.O'Connell wrote:

> Hmm. I was more interested in using COUNT( * ) than DISTINCT *.
>
> I want a count of all rows, but I want to be able to specify which
> columns are distinct.

I'm now a bit confused about exactly what you're looking for in the end. Can you give a short example?

> That's definitely an interesting approach, but testing doesn't show it
> to be appreciably faster.
>
> If I do a DISTINCT *, postgres will attempt to guarantee that there are
> no duplicate values across all columns rather than a subset of columns?
> Is that right?

It guarantees one output row for each distinct set of column values across all columns.
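To illustrate "one output row for each distinct set of column values", here is a small sketch on a hypothetical table `events` (names and data are assumptions). The last query uses DISTINCT ON, a PostgreSQL extension that nobody in the thread mentions; it is shown only as one possible way to say which columns should be distinct.

```sql
-- Hypothetical table and data, for illustration only.
CREATE TEMP TABLE events (user_id int, note text);
INSERT INTO events VALUES (1, 'a'), (1, 'a'), (1, 'b'), (2, 'a');

-- DISTINCT * compares all columns: (1,a), (1,b), (2,a) remain.
SELECT DISTINCT * FROM events;                 -- 3 rows

-- DISTINCT over a subset requires listing only those columns.
SELECT DISTINCT user_id FROM events;           -- 2 rows

-- PostgreSQL-specific: keep one full row per user_id, chosen by ORDER BY.
SELECT DISTINCT ON (user_id) *
FROM events
ORDER BY user_id, note;                        -- (1,a) and (2,a)
```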
The specific problem I'm trying to solve involves a user table with some history. Something like this:

    create table user_history (
        user_id int,
        event_time_stamp timestamp
    );

I'd like to be able to count the distinct user_ids in this table, even if it were joined to other tables.

-tfo

On Nov 17, 2004, at 8:52 AM, Stephan Szabo wrote:

> I'm now a bit confused about exactly what you're looking for in the
> end.
> Can you give a short example?
>
> It guarantees one output row for each distinct set of column values
> across
> all columns.
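A sketch of how the approaches from earlier in the thread would apply to this table. The `user_history` definition follows the message above (reading it as two columns); the second table `user_group`, its columns, and all data are hypothetical, added only so there is something to join against.

```sql
-- Table from the thread, read as two columns.
CREATE TEMP TABLE user_history (user_id int, event_time_stamp timestamp);
-- Hypothetical table to join against, for illustration only.
CREATE TEMP TABLE user_group (user_id int, group_name text);

INSERT INTO user_history VALUES
    (1, '2004-11-01 10:00'), (1, '2004-11-02 11:00'), (2, '2004-11-03 12:00');
INSERT INTO user_group VALUES (1, 'admin'), (1, 'staff'), (2, 'staff');

-- The join multiplies rows: 2 x 2 for user 1 plus 1 x 1 for user 2.
SELECT COUNT(*)
FROM user_history h JOIN user_group g ON g.user_id = h.user_id;    -- 5

-- COUNT(DISTINCT ...) is unaffected by the duplication from the join.
SELECT COUNT(DISTINCT h.user_id)
FROM user_history h JOIN user_group g ON g.user_id = h.user_id;    -- 2

-- Equivalent here: de-duplicate in a subselect, then count the rows.
SELECT COUNT(*)
FROM (SELECT DISTINCT h.user_id
      FROM user_history h JOIN user_group g ON g.user_id = h.user_id) AS foo;  -- 2
```

Which form is faster depends on the data and the planner, so comparing both with EXPLAIN ANALYZE on the real tables is the safer guide.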