Thread: distinct not working in a multiple join

distinct not working in a multiple join

From
David Rio Deiros
Date:
Hi there,

I have some issues with the query attached at the end of this email.
If I run that query I got this output ( I have removed some of the
fields) despite the distinct clause:

QC Q&A  | www.xxx.com | 44281
QC Q&A  | www.xxx.com | 44281
WhyMAX? |             | 44285

But, and here is the weird thing, if I remove g.group_id from the
selection list then I got what I expected:

QC Q&A  | www.xxx.com | 44281
WhyMAX? |             | 44285
toto    | rufus       | 44286

Can someone explain me what is going on here? What am I missing?

Thanks for you help in advance,

David


            SELECT
                distinct ar.title,
                ar.raw_data,
                ar.upload_dt,
                ar.artifact_id,
                g.group_id,
                acl.read
            FROM
                artifact_acl acl,
                artifacts a,
                artifact_revisions ar,
                revisions_to_types rt,
                artifact_types at,
                groups g
            WHERE
                a.expire_dt > NOW() and
                acl.artifact_id = a.artifact_id and
                a.published_revision = ar.revision_id and
                ar.revision_id = rt.revision_id and
                rt.type_id = at.type_id and
                acl.group_id = g.group_id and
                a.suppress = false and
                at.is_resource = true and
                (
                acl.group_id = (
                                select
                                    group_id
                                from
                                    groups
                                where
                                    group_nm = 'ting'
                               )
                               OR
                acl.group_id = (
                                select
                                    group_id
                                from
                                    groups
                                where
                                    group_nm = 'tors'
                               )
                ) and
                acl.group_id IN (4,17,54,2,1,123) and
                acl.read = true and
                ((g.back_or_front = 'front') or
                 (g.group_nm = 'PR Admin'))
            GROUP BY
               ar.title
            ORDER BY
               ar.upload_dt DESC
            LIMIT 3;

Re: distinct not working in a multiple join

From
Tom Lane
Date:
David Rio Deiros <driodeiros@gmail.com> writes:
> I have some issues with the query attached at the end of this email.
> If I run that query I got this output ( I have removed some of the
> fields) despite the distinct clause:

> QC Q&A  | www.xxx.com | 44281
> QC Q&A  | www.xxx.com | 44281
> WhyMAX? |             | 44285

Since you removed some fields, no one can tell if this output
is wrong or not.

>             SELECT
>                 distinct ar.title,
>                 ar.raw_data,
>                 ar.upload_dt,
>                 ar.artifact_id,
>                 g.group_id,
>                 acl.read
>             FROM

The way you formatted that makes me wonder if you think that the
DISTINCT applies only to the first column.  It does not, it applies
to all the columns together --- that is, it only removes rows that
are identical in all columns to some other row.  So if there were
some rows that were identical except for group_id, you'd get the
behavior you described.

BTW, I'm not sure I believe this is actually the same query you
ran.  The presence of the "GROUP BY ar.title" clause should have
provoked errors about ungrouped columns.  If this is an exact
copy of what you did, what Postgres version is this?

            regards, tom lane

Re: distinct not working in a multiple join

From
David Rio Deiros
Date:
On Thu, Feb 09, 2006 at 07:20:19PM -0500, Tom Lane wrote:
> David Rio Deiros <driodeiros@gmail.com> writes:
> > I have some issues with the query attached at the end of this email.
> > If I run that query I got this output ( I have removed some of the
> > fields) despite the distinct clause:
>
> > QC Q&A  | www.xxx.com | 44281
> > QC Q&A  | www.xxx.com | 44281
> > WhyMAX? |             | 44285
>
> Since you removed some fields, no one can tell if this output
> is wrong or not.

Tom,

Thanks for the answer and apologizes for the lack of information.
I think you found the problem already (see bellow). In anycase, just
to clarify:

This is the information about the machine/OS/Psql version:

1. Psql 8.0.4 - Linux 2.6.13

Now, This is the first query I tried:

            SELECT
                distinct ar.title,
                ar.raw_data,
                ar.upload_dt,
                ar.artifact_id,
                g.group_id,
                acl.read
            FROM
                artifact_acl acl,
                artifacts a,
                artifact_revisions ar,
                revisions_to_types rt,
                artifact_types at,
                groups g
            WHERE
                a.expire_dt > NOW() and
                acl.artifact_id = a.artifact_id and
                a.published_revision = ar.revision_id and
                ar.revision_id = rt.revision_id and
                rt.type_id = at.type_id and
                acl.group_id = g.group_id and
                a.suppress = false and
                at.is_resource = true and
                (
                acl.group_id = (
                                select
                                    group_id
                                from
                                    groups
                                where
                                    group_nm = 'Marketing'
                               )
                               OR
                acl.group_id = (
                                select
                                    group_id
                                from
                                    groups
                                where
                                    group_nm = 'Communicators'
                               )
                ) and
                acl.group_id IN (4,17,54,2,1,123) and
                acl.read = true and
                ((g.back_or_front = 'front') or
                (g.group_nm = 'PR Admin'))
                ORDER BY
                    ar.upload_dt DESC
                LIMIT 3;

and here you have the output:

-[ RECORD 1 ]---------------------------
title       | QC Q&A
raw_data    | www.com
upload_dt   | 2006-02-09 11:15:04.724525
artifact_id | 44281
group_id    | 2
read        | t
-[ RECORD 2 ]---------------------------
title       | QC Q&A
raw_data    | www.com
upload_dt   | 2006-02-09 11:15:04.724525
artifact_id | 44281
group_id    | 54
read        | t
-[ RECORD 3 ]---------------------------
title       | WhyMAX?
raw_data    |
upload_dt   | 2006-02-09 09:25:27.717663
artifact_id | 44061
group_id    | 2
read        | t


Now, I tried this query, which is the first one but removing the g.group_id
from the select clause. This is the output:

-[ RECORD 1 ]-----------------------------
title       | QC Q&A
raw_data    | www.com
upload_dt   | 2006-02-09 11:15:04.724525
artifact_id | 44281
read        | t
-[ RECORD 2 ]----------------------------
title       | WhyMAX?
raw_data    |
upload_dt   | 2006-02-09 09:25:27.717663
artifact_id | 44061
read        | t
-[ RECORD 3 ]---------------------------
title       | Business Assets-test
raw_data    | corpcomm.com/
upload_dt   | 2006-02-08 15:58:06.81578
artifact_id | 44280
read        | t

Which is the desired output.

> >             SELECT
> >                 distinct ar.title,
> >                 ar.raw_data,
> >                 ar.upload_dt,
> >                 ar.artifact_id,
> >                 g.group_id,
> >                 acl.read
> >             FROM
>
> The way you formatted that makes me wonder if you think that the
> DISTINCT applies only to the first column.  It does not, it applies
> to all the columns together --- that is, it only removes rows that
> are identical in all columns to some other row.  So if there were
> some rows that were identical except for group_id, you'd get the
> behavior you described.

Yes, that was the reason, I thought that distinct was only applying
to the first column.

> BTW, I'm not sure I believe this is actually the same query you
> ran.  The presence of the "GROUP BY ar.title" clause should have
> provoked errors about ungrouped columns.  If this is an exact
> copy of what you did, what Postgres version is this?

It wasn't the same query. Forget about the first email, in this email
you have the queries I launched (sorry again about that).

Now I have to redefine my query because I want to get the second
output but keeping the group_id. Ideas and suggestions are welcome.

Thanks again your help,

David



Re: distinct not working in a multiple join

From
Stephen Frost
Date:
* David Rio Deiros (driodeiros@gmail.com) wrote:
> Now I have to redefine my query because I want to get the second
> output but keeping the group_id. Ideas and suggestions are welcome.

You might want to look at 'distinct on'.

    Stephen

Attachment

Re: distinct not working in a multiple join

From
David Rio Deiros
Date:
On Thu, Feb 09, 2006 at 09:51:13PM -0500, Stephen Frost wrote:
> * David Rio Deiros (driodeiros@gmail.com) wrote:
> > Now I have to redefine my query because I want to get the second
> > output but keeping the group_id. Ideas and suggestions are welcome.
>
> You might want to look at 'distinct on'.

I am reading about it right now. It sounds that it will make it.

Thanks guys for your help,

David