Thread: Counting Distinct Records

Counting Distinct Records

From

Thomas F. O'Connell

Date:

16 November 2004, 19:46:08

I am wondering whether the following two forms of SELECT statements are 
logically equivalent:

SELECT COUNT( DISTINCT table.column ) ...

and

SELECT DISTINCT COUNT( * ) ...

If they are the same, then why is the latter query much slower in 
postgres when applied to the same FROM and WHERE clauses?

Furthermore, is there a better way of performing this sort of operation 
in postgres (or just in SQL in general)?

Thanks!

-tfo

--
Thomas F. O'Connell
Co-Founder, Information Architect
Sitening, LLC
http://www.sitening.com/
110 30th Avenue North, Suite 6
Nashville, TN 37203-6320
615-260-0005

Re: Counting Distinct Records

From

Stephan Szabo

Date:

16 November 2004, 20:03:46

On Tue, 16 Nov 2004, Thomas F. O'Connell wrote:

> I am wondering whether the following two forms of SELECT statements are
> logically equivalent:
>
> SELECT COUNT( DISTINCT table.column ) ...
>
> and
>
> SELECT DISTINCT COUNT( * ) ...

Not in general.

The former counts how many distinct table.column values there are.  The
distinct in the latter would be basically meaningless unless there's a
group by involved.
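
To make the difference concrete, here is a minimal sketch. It assumes a hypothetical one-column table t(col) and uses Python's sqlite3 module with an in-memory SQLite database as a stand-in for postgres; the semantics of these three queries should be the same in both.

```python
import sqlite3

# Hypothetical table t(col) with duplicate values; SQLite stands in for postgres.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INTEGER)")
conn.executemany("INSERT INTO t (col) VALUES (?)", [(1,), (1,), (2,), (3,), (3,)])

# Counts the distinct values of col (1, 2 and 3).
count_distinct = conn.execute("SELECT COUNT(DISTINCT col) FROM t").fetchall()

# COUNT(*) already yields a single row, so DISTINCT has nothing to deduplicate.
distinct_count = conn.execute("SELECT DISTINCT COUNT(*) FROM t").fetchall()

# With GROUP BY there is one count per group; DISTINCT collapses equal counts.
distinct_group_counts = conn.execute(
    "SELECT DISTINCT COUNT(*) FROM t GROUP BY col ORDER BY 1"
).fetchall()

print(count_distinct, distinct_count, distinct_group_counts)
```

Only the first query answers "how many different values are there"; the second is the plain row count, and the third lists the different group sizes.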

Re: Counting Distinct Records

From

Thomas F. O'Connell

Date:

16 November 2004, 20:09:48

Is there another way to accomplish what the former is doing, then?

For practical reasons, I'd like to come up with something better.

For theoretical curiosity, I'd like to know whether there's a way to 
combine COUNT and DISTINCT that still allows one to reference * rather 
than naming specific columns without grouping.

If I resort to GROUP BY, is there an efficient way of counting all the 
groups, or would it just be something like:

SELECT COUNT( * ) FROM ( SELECT ... GROUP BY ... ) AS g;
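
A minimal sketch of that GROUP BY form, assuming a hypothetical table t(a, b) and using Python's sqlite3 module with in-memory SQLite in place of postgres. The alias g is included because postgres has historically required an alias on a subselect in FROM.

```python
import sqlite3

# Hypothetical table t(a, b); SQLite's in-memory engine stands in for postgres.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO t (a, b) VALUES (?, ?)",
    [(1, "x"), (1, "x"), (1, "y"), (2, "x")],
)

# The inner query returns one row per (a, b) group; the outer query counts
# those rows, which is the number of groups.
group_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT a, b FROM t GROUP BY a, b) AS g"
).fetchone()[0]

# Grouping by a single column counts the distinct values of that column.
groups_by_a = conn.execute(
    "SELECT COUNT(*) FROM (SELECT a FROM t GROUP BY a) AS g"
).fetchone()[0]

print(group_count, groups_by_a)
```

This lets the grouping columns be chosen freely, but whether it is any faster than COUNT( DISTINCT ... ) depends on the planner and the data.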

-tfo

Re: Counting Distinct Records

From

Stephan Szabo

Date:

16 November 2004, 22:45:42

On Tue, 16 Nov 2004, Thomas F. O'Connell wrote:

> Is there another way to accomplish what the former is doing, then?

The only thing I can think of is a subselect in from that uses distinct.
 select count(*) from (select distinct ...) foo

That also theoretically allows you to use select distinct * inside the
subselect.
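
As a rough sketch of the subselect form next to the aggregate form, assuming a hypothetical table t(a) and using Python's sqlite3 module with in-memory SQLite in place of postgres. One caveat worth knowing: COUNT(DISTINCT a) ignores NULLs, while SELECT DISTINCT keeps one NULL row, so the two can differ by one.

```python
import sqlite3

# Hypothetical table t(a) containing duplicates and one NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t (a) VALUES (?)", [(1,), (1,), (2,), (None,)])

# Subselect form: the inner DISTINCT yields the rows 1, 2 and NULL.
by_subselect = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT a FROM t) foo"
).fetchone()[0]

# Aggregate form: NULL is not counted.
by_aggregate = conn.execute("SELECT COUNT(DISTINCT a) FROM t").fetchone()[0]

# Filtering NULLs in the subselect makes the two forms agree.
by_subselect_no_nulls = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT a FROM t WHERE a IS NOT NULL) foo"
).fetchone()[0]

print(by_subselect, by_aggregate, by_subselect_no_nulls)
```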

Re: Counting Distinct Records

From

Thomas F. O'Connell

Date:

16 November 2004, 23:03:12

Hmm. I was more interested in using COUNT( * ) than DISTINCT *.

I want a count of all rows, but I want to be able to specify which 
columns are distinct.

That's definitely an interesting approach, but testing doesn't show it 
to be appreciably faster.

If I do a DISTINCT *, postgres will attempt to guarantee that there are 
no duplicate values across all columns rather than a subset of columns? 
Is that right?

Anyway, I was just wondering if there were any best practices out there 
for counting distinct values in sets of values that might not 
themselves be distinct.

Thanks for the tips so far!

-tfo

Re: Counting Distinct Records

From

Stephan Szabo

Date:

17 November 2004, 14:53:02

On Tue, 16 Nov 2004, Thomas F. O'Connell wrote:

> Hmm. I was more interested in using COUNT( * ) than DISTINCT *.
>
> I want a count of all rows, but I want to be able to specify which
> columns are distinct.

I'm now a bit confused about exactly what you're looking for in the end.
Can you give a short example?

> That's definitely an interesting approach, but testing doesn't show it
> to be appreciably faster.
>
> If I do a DISTINCT *, postgres will attempt to guarantee that there are
> no duplicate values across all columns rather than a subset of columns?
> Is that right?

It guarantees one output row for each distinct set of column values across
all columns.
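
A small sketch of that behavior, assuming a hypothetical table t(a, b) and using Python's sqlite3 module with in-memory SQLite in place of postgres:

```python
import sqlite3

# Hypothetical table t(a, b) with one fully duplicated row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO t (a, b) VALUES (?, ?)",
    [(1, "x"), (1, "x"), (1, "y"), (2, "x")],
)

# DISTINCT * compares whole rows: (1, 'x') appears once, but (1, 'y')
# survives because it differs in column b.
all_columns = conn.execute("SELECT DISTINCT * FROM t ORDER BY a, b").fetchall()

# Naming a subset of columns deduplicates on that subset only.
only_a = conn.execute("SELECT DISTINCT a FROM t ORDER BY a").fetchall()

print(all_columns, only_a)
```

So to count distinct values of particular columns, those columns have to be named in the select list; DISTINCT * answers a different question.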

Re: Counting Distinct Records

From

Thomas F. O'Connell

Date:

17 November 2004, 19:27:28

The specific problem I'm trying to solve involves a user table with 
some history.

Something like this:

create table user_history (
    user_id int,
    event_time_stamp timestamp
);

I'd like to be able to count the distinct user_ids in this table, even 
if it were joined to other tables.
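
One way this could look, as a sketch only: user_group is a hypothetical second table invented here to give the join something to fan out on, and Python's sqlite3 module with in-memory SQLite stands in for postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user_history (user_id INTEGER, event_time_stamp TIMESTAMP);
    -- user_group is a hypothetical second table, invented for this sketch.
    CREATE TABLE user_group (user_id INTEGER, group_name TEXT);
    INSERT INTO user_history VALUES
        (1, '2004-11-16 10:00:00'),
        (1, '2004-11-16 11:00:00'),
        (2, '2004-11-17 09:00:00');
    INSERT INTO user_group VALUES (1, 'admin'), (1, 'staff'), (2, 'staff');
    """
)

join_sql = "FROM user_history h JOIN user_group g ON g.user_id = h.user_id"

# The join multiplies rows: user 1 has 2 events x 2 groups, user 2 has 1 x 1.
joined_rows = conn.execute("SELECT COUNT(*) " + join_sql).fetchone()[0]

# Counting distinct user_ids is unaffected by the fan-out.
distinct_users = conn.execute(
    "SELECT COUNT(DISTINCT h.user_id) " + join_sql
).fetchone()[0]

# The subselect form gives the same answer when user_id has no NULLs.
distinct_users_subselect = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT h.user_id " + join_sql + ") foo"
).fetchone()[0]

print(joined_rows, distinct_users, distinct_users_subselect)
```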

-tfo
