sub-select trouble: wrong SQL or PostgreSQL issue? - Mailing list pgsql-sql

From patrick
Subject sub-select trouble: wrong SQL or PostgreSQL issue?
Date
Msg-id 20021023194414.B3699@3eye.boxsoft.com
Whole thread Raw
Responses Re: sub-select trouble: wrong SQL or PostgreSQL issue?
List pgsql-sql
Greetings,

I have 3 tables.  Two of which (ta and tb) are different
"attributes" the third table (tc) can have.

tc is allowed up to three of each kind of "attributes".

e.g., 3 ta values and 2 tb values.

By assigning ta and tb attributes to each entry in tc you
are also "matching" tb attributes with corresponding ta
ones.

e.g.,
c_id  = 8     ta1   = 1  <-- match -->   tb1   = 3     ta2   = 3  <-- match -->   tb2   = 2     ta3   = 13 <-- match
-->  tb3   = 20
 
c_id  = 9     ta1   = 2  <-- match -->   tb1   = 3     ta2   = 5  <-- match -->   tb2   = 3     ta3   = 9  <-- match
-->  tb3   = 2
 
c_id  = 10     ta1   = 3  <-- match -->   tb1   = 2     ta2   = 1  <-- match -->   tb2   = 3     ta3   = 9  <-- match
-->  tb3   = 2
 

The task at hand is to figure out the number of ta
attributes each tb attribute corresponds with.

For above example I want a result set showing me:
  tb.b_id    ta.a_id count --------------------------     2            2     -- matched w/ta.a_id = 3, 9     3
 3     -- matched w/ta.a_id = 1, 2, 5    20            1     -- only matched w/ta.a_id = 13
 


-- tables
create table ta ( a_id int unique );
create table tb ( b_id int unique );
create table tc  (  c_id        int unique,  ta1         int references ta( a_id ),  ta2         int references ta(
a_id),  ta3         int references ta( a_id ),  tb1         int references tb( b_id ),  tb2         int references tb(
b_id),  tb3         int references tb( b_id )  );
 

-- sequences
create sequence ta_seq;
create sequence tb_seq;
create sequence tc_seq;

-- populate tables with some data
insert into ta values ( NEXTVAL( 'ta_seq' ) );
insert into ta values ( NEXTVAL( 'ta_seq' ) );
insert into ta values ( NEXTVAL( 'ta_seq' ) );
insert into ta values ( NEXTVAL( 'ta_seq' ) );
insert into ta values ( NEXTVAL( 'ta_seq' ) );
insert into ta values ( NEXTVAL( 'ta_seq' ) );
insert into ta values ( NEXTVAL( 'ta_seq' ) );
insert into ta values ( NEXTVAL( 'ta_seq' ) );
insert into ta values ( NEXTVAL( 'ta_seq' ) );
insert into ta values ( NEXTVAL( 'ta_seq' ) );
insert into ta values ( NEXTVAL( 'ta_seq' ) );

insert into tb values ( NEXTVAL( 'tb_seq' ) );
insert into tb values ( NEXTVAL( 'tb_seq' ) );
insert into tb values ( NEXTVAL( 'tb_seq' ) );
insert into tb values ( NEXTVAL( 'tb_seq' ) );
-- ...
insert into tb values ( NEXTVAL( 'tb_seq' ) );
insert into tb values ( NEXTVAL( 'tb_seq' ) );



Don't read too much into the data as i don't think the
values chosen have anything to do with the problem.  I
am just using a set of data sequences from my live
database.

Though, i'm willing to be surprised if someone's willing
to show correlation between the problem and the values
used.

--                                             ta       tb
--                                           1  2  3  1  2  3
insert into tc values ( NEXTVAL( 'tc_seq' ), 1, 1, 1, 1, 3, 2 );
insert into tc values ( NEXTVAL( 'tc_seq' ), 3, 3, 3, 2, 1, 3 );
insert into tc values ( NEXTVAL( 'tc_seq' ), 4, 4, 4, 2, 1, 3 );
insert into tc values ( NEXTVAL( 'tc_seq' ), 5, 5, 5, 1, 2, 3 );
insert into tc values ( NEXTVAL( 'tc_seq' ), 5, 5, 5, 1, 2, 3 );
insert into tc values ( NEXTVAL( 'tc_seq' ), 1, 1, 1, 1, 2, 3 );
insert into tc values ( NEXTVAL( 'tc_seq' ), 4, 4, 4, 2, 1, 3 );
insert into tc values ( NEXTVAL( 'tc_seq' ), 3, 3, 3, 2, 1, 3 );
/*
insert into tc values ( NEXTVAL( 'tc_seq' ), 1, null, null, 1, null, null );
*/

This last insert is to show it is possible to have null
values for either of the ta or tb attributes in table tc.
But it is omitted to simplify the problem for now.


I am no SQL expert, as in I don't do it for a living.
Therefore, I would not doubt that there exist more
efficient ways to do this but this is one way I came up
with and it is quite possible that a SQL mistake is
causing the strange outputs I am observing.

But let me describe in English what i am attempting to
do in the SQL.

As I said earlier the task is to determine the unique
count of ta entries each tb.b_id is associated with in
the tc table.

I'm selecting on the tb table and doing sub selects on
the tc table where I attempt to count unique counts of
tb to ta matches.

First sub-select (aa) does a distinct count of tc.ta1
where tc.tb1 equals tb.b_id and the corresponding tc.ta1
field is not null.

Second sub-select (bb) I'm again doing a count of
distinct tc.ta2 where tc.tb2 is equal to tb.b_id, tc.ta2
is not null and tc.ta2 was not counted in the aa
sub-select.

Next (cc), once again do a count of distinct tc.ta3
where tc.tb3 is equal to tb.b_id, tc.ta3 is not null and
not in either of the two previous sub-selects (aa, bb).

The sum of sub-selects aa, bb and cc should be the value
I'm interested in (not shown in SQL statement).

The problem:
The result of the select is not the expected counts of ta
to tb.

The twist:
If the select statement is provided a where-clause where
each tb.b_id is restricted to individual tb.b_id values
then the correct/expected count of ta to tb is obtained
(See examples below).



-- selects
select  tb.b_id,       (       select   count( distinct tc.ta1 )         from   tc        where   tc.tb1 = tb.b_id
         and                tc.ta1 is not null       ) as aa,       (       select  count( distinct tc.ta2 )
from tc        where  tc.tb2 = tb.b_id               and               tc.ta2 is not null               and
 tc.ta2 not in                 (                 select  distinct tc.ta1                   from  tc
where tc.tb1 = tb.b_id                         and                         tc.ta1 is not null                 )       )
asbb,       (       select  count( distinct tc.ta3 )         from  tc        where  tc.tb3 = tb.b_id               and
            tc.ta3 is not null               and               tc.ta3 not in                 (                 select
distincttc.ta1                   from  tc                  where  tc.tb1 = tb.b_id                         and
              tc.ta1 is not null                 )               and               tc.ta3 not in                 (
          select  distinct tc.ta2                   from  tc                  where  tc.tb2 = tb.b_id
     and                         tc.ta2 is not null                 )       ) as cc from  tb
 
;




-- results from selects

=# select * from tc order by c_id;c_id | ta1 | ta2 | ta3 | tb1 | tb2 | tb3 
------+-----+-----+-----+-----+-----+-----   1 |   1 |   1 |   1 |   1 |   3 |   2   2 |   3 |   3 |   3 |   2 |   1 |
3   3 |   4 |   4 |   4 |   2 |   1 |   3   4 |   5 |   5 |   5 |   1 |   2 |   3   5 |   5 |   5 |   5 |   1 |   2 |
3  6 |   1 |   1 |   1 |   1 |   2 |   3   7 |   4 |   4 |   4 |   2 |   1 |   3   8 |   3 |   3 |   3 |   2 |   1 |
3
(8 rows)


=# select  tb.b_id, ( select   count( distinct tc.ta1 ) from   tc where   tc.tb1 = tb.b_id and tc.ta1 is not null) as
aa,( select  count( distinct tc.ta2 ) from  tc where  tc.tb2 = tb.b_id and tc.ta2 is not null and tc.ta2 not in (
select distinct tc.ta1 from  tc where  tc.tb1 = tb.b_id and tc.ta1 is not null)) as bb, ( select  count( distinct
tc.ta3) from  tc where  tc.tb3 = tb.b_id and tc.ta3 is not null and tc.ta3 not in ( select  distinct tc.ta1 from  tc
where tc.tb1 = tb.b_id and tc.ta1 is not null) and tc.ta3 not in ( select  distinct tc.ta2 from  tc where  tc.tb2 =
tb.b_idand tc.ta2 is not null)) as cc from  tb where tb.b_id < 4;b_id | aa | bb | cc 
 
------+----+----+----   1 |  2 |  2 |  0   2 |  2 |  0 |  0   3 |  0 |  0 |  0
(3 rows)


Obviously not the correct result for rows 2 and 3.


=# select  tb.b_id, ( select   count( distinct tc.ta1 ) from   tc where   tc.tb1 = tb.b_id and tc.ta1 is not null) as
aa,( select  count( distinct tc.ta2 ) from  tc where  tc.tb2 = tb.b_id and tc.ta2 is not null and tc.ta2 not in (
select distinct tc.ta1 from  tc where  tc.tb1 = tb.b_id and tc.ta1 is not null)) as bb, ( select  count( distinct
tc.ta3) from  tc where  tc.tb3 = tb.b_id and tc.ta3 is not null and tc.ta3 not in ( select  distinct tc.ta1 from  tc
where tc.tb1 = tb.b_id and tc.ta1 is not null) and tc.ta3 not in ( select  distinct tc.ta2 from  tc where  tc.tb2 =
tb.b_idand tc.ta2 is not null)) as cc from  tb where tb.b_id = 1;b_id | aa | bb | cc 
 
------+----+----+----   1 |  2 |  2 |  0
(1 row)


=# select  tb.b_id, ( select   count( distinct tc.ta1 ) from   tc where   tc.tb1 = tb.b_id and tc.ta1 is not null) as
aa,( select  count( distinct tc.ta2 ) from  tc where  tc.tb2 = tb.b_id and tc.ta2 is not null and tc.ta2 not in (
select distinct tc.ta1 from  tc where  tc.tb1 = tb.b_id and tc.ta1 is not null)) as bb, ( select  count( distinct
tc.ta3) from  tc where  tc.tb3 = tb.b_id and tc.ta3 is not null and tc.ta3 not in ( select  distinct tc.ta1 from  tc
where tc.tb1 = tb.b_id and tc.ta1 is not null) and tc.ta3 not in ( select  distinct tc.ta2 from  tc where  tc.tb2 =
tb.b_idand tc.ta2 is not null)) as cc from  tb where tb.b_id = 2;b_id | aa | bb | cc 
 
------+----+----+----   2 |  2 |  2 |  0
(1 row)


This is what we should've gotten in the first query for
the second row.



=# select  tb.b_id, ( select   count( distinct tc.ta1 ) from   tc where   tc.tb1 = tb.b_id and tc.ta1 is not null) as
aa,( select  count( distinct tc.ta2 ) from  tc where  tc.tb2 = tb.b_id and tc.ta2 is not null and tc.ta2 not in (
select distinct tc.ta1 from  tc where  tc.tb1 = tb.b_id and tc.ta1 is not null)) as bb, ( select  count( distinct
tc.ta3) from  tc where  tc.tb3 = tb.b_id and tc.ta3 is not null and tc.ta3 not in ( select  distinct tc.ta1 from  tc
where tc.tb1 = tb.b_id and tc.ta1 is not null) and tc.ta3 not in ( select  distinct tc.ta2 from  tc where  tc.tb2 =
tb.b_idand tc.ta2 is not null)) as cc from  tb where tb.b_id = 3;b_id | aa | bb | cc 
 
------+----+----+----   3 |  0 |  1 |  3
(1 row)


This is what we should've gotten in the first query for
the third row.


Further strangeness:
I would think that the following two select statements
should yeild the same results.


=# select  tb.b_id, ( select   count( distinct tc.ta1 ) from   tc where   tc.tb1 = tb.b_id and tc.ta1 is not null) as
aa,( select  count( distinct tc.ta2 ) from  tc where  tc.tb2 = tb.b_id and tc.ta2 is not null and tc.ta2 not in (
select distinct tc.ta1 from  tc where  tc.tb1 = tb.b_id and tc.ta1 is not null)) as bb, ( select  count( distinct
tc.ta3) from  tc where  tc.tb3 = tb.b_id and tc.ta3 is not null and tc.ta3 not in ( select  distinct tc.ta1 from  tc
where tc.tb1 = tb.b_id and tc.ta1 is not null) and tc.ta3 not in ( select  distinct tc.ta2 from  tc where  tc.tb2 =
tb.b_idand tc.ta2 is not null)) as cc from  tb where tb.b_id = 1 or tb.b_id = 2 or tb.b_id = 3;b_id | aa | bb | cc 
 
------+----+----+----   1 |  2 |  2 |  0   2 |  2 |  0 |  0   3 |  0 |  0 |  0
(3 rows)


Incorrect output.  Same as the very first query.


=# select  tb.b_id, ( select   count( distinct tc.ta1 ) from   tc where   tc.tb1 = tb.b_id and tc.ta1 is not null) as
aa,( select  count( distinct tc.ta2 ) from  tc where  tc.tb2 = tb.b_id and tc.ta2 is not null and tc.ta2 not in (
select distinct tc.ta1 from  tc where  tc.tb1 = tb.b_id and tc.ta1 is not null)) as bb, ( select  count( distinct
tc.ta3) from  tc where  tc.tb3 = tb.b_id and tc.ta3 is not null and tc.ta3 not in ( select  distinct tc.ta1 from  tc
where tc.tb1 = tb.b_id and tc.ta1 is not null) and tc.ta3 not in ( select  distinct tc.ta2 from  tc where  tc.tb2 =
tb.b_idand tc.ta2 is not null)) as cc from  tb where tb.b_id = 3 or tb.b_id = 2 or tb.b_id = 1;b_id | aa | bb | cc 
 
------+----+----+----   3 |  0 |  1 |  3   2 |  2 |  2 |  0   1 |  2 |  2 |  0
(3 rows)


Strangely, this is the result we were expecting from our
original query!

Is it possible that the sub-selects are somehow
affecting the result sets?


Seen on:
% postmaster --version
postmaster (PostgreSQL) 7.1.3

and

% postmaster --version
postmaster (PostgreSQL) 7.1.2



-- drops
drop sequence tc_seq;
drop sequence tb_seq;
drop sequence ta_seq;

drop table tc;
drop table tb;
drop table ta;



I'd appreciate any comments you may have regarding this
post :-)

sidster
--
They who would sacrifice freedom for security will have neither.  -Ben Franklin 


pgsql-sql by date:

Previous
From: Josh Berkus
Date:
Subject: Re: Sum of Every Column
Next
From: Stephan Szabo
Date:
Subject: Re: sub-select trouble: wrong SQL or PostgreSQL issue?