Thread: JOIN producing duplicate results

JOIN producing duplicate results

From

Lonni J Friedman

Date:

02 May 2012, 20:52:14

I've got a query that is joining data across 4 tables to provide data
based on test results.  The query is working fine, except for the fact
that its returning two identical records for each row of unique data.
 If I throw a DISTINCT in front of the primary key column (a.id) of
one of the tables in the join, that eliminates all the duplicates.
However, I've read (and found) that DISTINCT tends to introduce a
performance hit, so I'm hoping to find a better performing solution,
if possible.  Hopefully I'm just doing something silly with my JOINS
that is easily fixed.  This is on postgresql-9.0.x, and yes I'm aware
that if i upgraded to 9.1.x then I could likely do a 'group by a.id',
but for now I'm stuck on 9.0.x.

Here's the query:
SELECT a.id,a.suiteid,a.testname,date_trunc('second',a.last_update) AS
last_update,regexp_replace(p.relname,E'tests','','g'),o.osname
FROM smoketests AS a, pg_class AS p, smoke AS t, osversmap AS o
WHERE a.osversion=o.osversion AND a.suiteid=t.id AND a.tableoid=p.oid
AND ( a.current_status='FAILED' ) AND ( a.arch='i386' ) AND (
a.os='Darwin' ) AND a.last_update>'2012-05-01 04:00:00' AND
a.last_update<'2012-05-02 14:20:45'
ORDER BY a.id ;

    id    | suiteid |     testname     |     last_update     |
regexp_replace |   osname
----------+---------+------------------+---------------------+----------------+------------
 32549818 |  668232 | bug377064        | 2012-05-01 08:38:07 | smoke
   | OSX-10.7.x
 32549818 |  668232 | bug377064        | 2012-05-01 08:38:07 | smoke
   | OSX-10.7.x
 32549819 |  668232 | funcmem_resize   | 2012-05-01 08:38:07 | smoke
   | OSX-10.7.x
 32549819 |  668232 | funcmem_resize   | 2012-05-01 08:38:07 | smoke
   | OSX-10.7.x
 32549820 |  668232 | leitest | 2012-05-01 08:38:07 | smoke      | OSX-10.7.x
 32549820 |  668232 | leitest | 2012-05-01 08:38:07 | smoke      | OSX-10.7.x

The problem is visible in the id column, where there are two of each
value returned even though a.id is the unique primary key of the smoke
table and doesn't really have duplicates.  The 'smoke' table has a one
to many relationship with the smoketests table, but I'm still rather
confused why I'm getting the duplicates of everything.

Re: JOIN producing duplicate results

From

Binand Sethumadhavan

Date:

03 May 2012, 01:09:58

On 3 May 2012 05:21, Lonni J Friedman <netllama@gmail.com> wrote:
> I've got a query that is joining data across 4 tables to provide data
> based on test results.  The query is working fine, except for the fact

[snip]

This is your query formatted in my favourite SQL formatting style:

http://pastebin.com/cjGAyZ4U

It looks OK to me, so the only reason I can think of (without knowing
the DDL etc.) is data duplication. I would suggest you remove the
joined tables one by one and see which of them is introducing the
duplication (or alternatively, start by querying only smoketests and
then add the joined tables one by one to see which one introduces the
duplication). Then explore the data in that table.

Incidentally, you are not using the table "smoke" for anything. That
would be the first table I'd remove.

Binand

Re: JOIN producing duplicate results

From

Brent Dombrowski

Date:

03 May 2012, 10:23:02

You are not using a.id in the join. My guess is that all the other columns in a that you are joining against are not
uniqueand that is where the duplicates are coming from. 

Brent.

On May 2, 2012, at 4:51 PM, Lonni J Friedman wrote:

> I've got a query that is joining data across 4 tables to provide data
> based on test results.  The query is working fine, except for the fact
> that its returning two identical records for each row of unique data.
> If I throw a DISTINCT in front of the primary key column (a.id) of
> one of the tables in the join, that eliminates all the duplicates.
> However, I've read (and found) that DISTINCT tends to introduce a
> performance hit, so I'm hoping to find a better performing solution,
> if possible.  Hopefully I'm just doing something silly with my JOINS
> that is easily fixed.  This is on postgresql-9.0.x, and yes I'm aware
> that if i upgraded to 9.1.x then I could likely do a 'group by a.id',
> but for now I'm stuck on 9.0.x.
>
> Here's the query:
> SELECT a.id,a.suiteid,a.testname,date_trunc('second',a.last_update) AS
> last_update,regexp_replace(p.relname,E'tests','','g'),o.osname
> FROM smoketests AS a, pg_class AS p, smoke AS t, osversmap AS o
> WHERE a.osversion=o.osversion AND a.suiteid=t.id AND a.tableoid=p.oid
> AND ( a.current_status='FAILED' ) AND ( a.arch='i386' ) AND (
> a.os='Darwin' ) AND a.last_update>'2012-05-01 04:00:00' AND
> a.last_update<'2012-05-02 14:20:45'
> ORDER BY a.id ;
>
>    id    | suiteid |     testname     |     last_update     |
> regexp_replace |   osname
> ----------+---------+------------------+---------------------+----------------+------------
> 32549818 |  668232 | bug377064        | 2012-05-01 08:38:07 | smoke
>   | OSX-10.7.x
> 32549818 |  668232 | bug377064        | 2012-05-01 08:38:07 | smoke
>   | OSX-10.7.x
> 32549819 |  668232 | funcmem_resize   | 2012-05-01 08:38:07 | smoke
>   | OSX-10.7.x
> 32549819 |  668232 | funcmem_resize   | 2012-05-01 08:38:07 | smoke
>   | OSX-10.7.x
> 32549820 |  668232 | leitest | 2012-05-01 08:38:07 | smoke      | OSX-10.7.x
> 32549820 |  668232 | leitest | 2012-05-01 08:38:07 | smoke      | OSX-10.7.x
>
> The problem is visible in the id column, where there are two of each
> value returned even though a.id is the unique primary key of the smoke
> table and doesn't really have duplicates.  The 'smoke' table has a one
> to many relationship with the smoketests table, but I'm still rather
> confused why I'm getting the duplicates of everything.
>
> --
> Sent via pgsql-novice mailing list (pgsql-novice@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-novice

Re: JOIN producing duplicate results

From

"Oliveiros d'Azevedo Cristina"

Date:

03 May 2012, 10:53:52

Do you have any table inheriting from smoketests?

Best,
Oliveiros

----- Original Message -----
From: "Brent Dombrowski" <brent.dombrowski@gmail.com>
To: "Lonni J Friedman" <netllama@gmail.com>
Cc: <pgsql-novice@postgresql.org>
Sent: Thursday, May 03, 2012 2:22 PM
Subject: Re: [NOVICE] JOIN producing duplicate results


You are not using a.id in the join. My guess is that all the other columns
in a that you are joining against are not unique and that is where the
duplicates are coming from.

Brent.

On May 2, 2012, at 4:51 PM, Lonni J Friedman wrote:

> I've got a query that is joining data across 4 tables to provide data
> based on test results.  The query is working fine, except for the fact
> that its returning two identical records for each row of unique data.
> If I throw a DISTINCT in front of the primary key column (a.id) of
> one of the tables in the join, that eliminates all the duplicates.
> However, I've read (and found) that DISTINCT tends to introduce a
> performance hit, so I'm hoping to find a better performing solution,
> if possible.  Hopefully I'm just doing something silly with my JOINS
> that is easily fixed.  This is on postgresql-9.0.x, and yes I'm aware
> that if i upgraded to 9.1.x then I could likely do a 'group by a.id',
> but for now I'm stuck on 9.0.x.
>
> Here's the query:
> SELECT a.id,a.suiteid,a.testname,date_trunc('second',a.last_update) AS
> last_update,regexp_replace(p.relname,E'tests','','g'),o.osname
> FROM smoketests AS a, pg_class AS p, smoke AS t, osversmap AS o
> WHERE a.osversion=o.osversion AND a.suiteid=t.id AND a.tableoid=p.oid
> AND ( a.current_status='FAILED' ) AND ( a.arch='i386' ) AND (
> a.os='Darwin' ) AND a.last_update>'2012-05-01 04:00:00' AND
> a.last_update<'2012-05-02 14:20:45'
> ORDER BY a.id ;
>
>    id    | suiteid |     testname     |     last_update     |
> regexp_replace |   osname
> ----------+---------+------------------+---------------------+----------------+------------
> 32549818 |  668232 | bug377064        | 2012-05-01 08:38:07 | smoke
>   | OSX-10.7.x
> 32549818 |  668232 | bug377064        | 2012-05-01 08:38:07 | smoke
>   | OSX-10.7.x
> 32549819 |  668232 | funcmem_resize   | 2012-05-01 08:38:07 | smoke
>   | OSX-10.7.x
> 32549819 |  668232 | funcmem_resize   | 2012-05-01 08:38:07 | smoke
>   | OSX-10.7.x
> 32549820 |  668232 | leitest | 2012-05-01 08:38:07 | smoke      |
> OSX-10.7.x
> 32549820 |  668232 | leitest | 2012-05-01 08:38:07 | smoke      |
> OSX-10.7.x
>
> The problem is visible in the id column, where there are two of each
> value returned even though a.id is the unique primary key of the smoke
> table and doesn't really have duplicates.  The 'smoke' table has a one
> to many relationship with the smoketests table, but I'm still rather
> confused why I'm getting the duplicates of everything.
>
> --
> Sent via pgsql-novice mailing list (pgsql-novice@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-novice


--
Sent via pgsql-novice mailing list (pgsql-novice@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-novice

Re: JOIN producing duplicate results

From

Lonni J Friedman

Date:

03 May 2012, 14:04:21

I figured it out.  The problem was that osvermap has an arch column,
such that there are more than one of the same osversion and the arch
differs. Adding this eliminated the duplicates: o.arch=a.arch

Here's the fixed working query:

SELECT
      a.id
    , a.suiteid
    , a.testname
    , date_trunc('second', a.last_update) AS last_update
    , regexp_replace(p.relname, E'tests', '', 'g')
    , o.osname
FROM
    smoketests AS a
    , pg_class AS p
    , smoke AS t
    , osversmap AS o
WHERE
     o.arch=a.arch
    AND a.osversion = o.osversion
    AND a.suiteid = t.id
    AND a.tableoid = p.oid
    AND a.current_status = 'FAILED'
    AND a.arch = 'i386'
    AND a.os = 'Darwin'
    AND a.last_update > '2012-05-01 04:00:00'
    AND a.last_update < '2012-05-02 14:20:45'
ORDER BY
    a.id ;

On Thu, May 3, 2012 at 6:53 AM, Oliveiros d'Azevedo Cristina
<oliveiros.cristina@marktest.pt> wrote:
> Do you have any table inheriting from smoketests?
>
> Best,
> Oliveiros
>
> ----- Original Message ----- From: "Brent Dombrowski"
> <brent.dombrowski@gmail.com>
> To: "Lonni J Friedman" <netllama@gmail.com>
> Cc: <pgsql-novice@postgresql.org>
> Sent: Thursday, May 03, 2012 2:22 PM
> Subject: Re: [NOVICE] JOIN producing duplicate results
>
>
>
> You are not using a.id in the join. My guess is that all the other columns
> in a that you are joining against are not unique and that is where the
> duplicates are coming from.
>
> Brent.
>
> On May 2, 2012, at 4:51 PM, Lonni J Friedman wrote:
>
>> I've got a query that is joining data across 4 tables to provide data
>> based on test results.  The query is working fine, except for the fact
>> that its returning two identical records for each row of unique data.
>> If I throw a DISTINCT in front of the primary key column (a.id) of
>> one of the tables in the join, that eliminates all the duplicates.
>> However, I've read (and found) that DISTINCT tends to introduce a
>> performance hit, so I'm hoping to find a better performing solution,
>> if possible.  Hopefully I'm just doing something silly with my JOINS
>> that is easily fixed.  This is on postgresql-9.0.x, and yes I'm aware
>> that if i upgraded to 9.1.x then I could likely do a 'group by a.id',
>> but for now I'm stuck on 9.0.x.
>>
>> Here's the query:
>> SELECT a.id,a.suiteid,a.testname,date_trunc('second',a.last_update) AS
>> last_update,regexp_replace(p.relname,E'tests','','g'),o.osname
>> FROM smoketests AS a, pg_class AS p, smoke AS t, osversmap AS o
>> WHERE a.osversion=o.osversion AND a.suiteid=t.id AND a.tableoid=p.oid
>> AND ( a.current_status='FAILED' ) AND ( a.arch='i386' ) AND (
>> a.os='Darwin' ) AND a.last_update>'2012-05-01 04:00:00' AND
>> a.last_update<'2012-05-02 14:20:45'
>> ORDER BY a.id ;
>>
>>   id    | suiteid |     testname     |     last_update     |
>> regexp_replace |   osname
>>
>> ----------+---------+------------------+---------------------+----------------+------------
>> 32549818 |  668232 | bug377064        | 2012-05-01 08:38:07 | smoke
>>  | OSX-10.7.x
>> 32549818 |  668232 | bug377064        | 2012-05-01 08:38:07 | smoke
>>  | OSX-10.7.x
>> 32549819 |  668232 | funcmem_resize   | 2012-05-01 08:38:07 | smoke
>>  | OSX-10.7.x
>> 32549819 |  668232 | funcmem_resize   | 2012-05-01 08:38:07 | smoke
>>  | OSX-10.7.x
>> 32549820 |  668232 | leitest | 2012-05-01 08:38:07 | smoke      |
>> OSX-10.7.x
>> 32549820 |  668232 | leitest | 2012-05-01 08:38:07 | smoke      |
>> OSX-10.7.x
>>
>> The problem is visible in the id column, where there are two of each
>> value returned even though a.id is the unique primary key of the smoke
>> table and doesn't really have duplicates.  The 'smoke' table has a one
>> to many relationship with the smoketests table, but I'm still rather
>> confused why I'm getting the duplicates of everything.
>>

Re: JOIN producing duplicate results

From

Binand Sethumadhavan

Date:

04 May 2012, 00:02:46

On 3 May 2012 22:34, Lonni J Friedman <netllama@gmail.com> wrote:
>     o.arch=a.arch AND a.osversion = o.osversion

Note that if you need to do this, it is almost always an indicator of
a problem in your ER model.

Binand