Thread: LEFT JOIN and missing values
Hi, I have a problem, like this: I have 2 tables, T1 (id, val1) -- a vector and T2 (id1, id2, val2) -- this is actually a sqare matrix in the form of a very long table. I need to do some operations with them, such as take a dot-product of 2 rows of the matrix (encoded as T2), weighted with T1.. . The query is the following: select c.id1, d.id1, sum(c.val2*d.val2*T1.val1) from T1, T2 c, T2 d where c.ID2 = T1.id AND d.id2 = T1.id AND c.id1< d.id1 group by c.id1, d.id1; This gives me all possible pairs of (weighted) dot-products, as I want them. Now, some 90% of values in the T2 are the same, say, 0.5, so I could thouw them away. But then the query above is not valid anymore, since if there is mo match on the condition c.ID2 = T1.id AND d.id2 = T1.id, nothing is added (it might be that c.id2 is there and d.id2 is not!). I quiess I need a left join + substitute for missing value. Can anyone help me with the query? I cannot figure it out, too bad ad left joins... thanks, Mila
mila wrote: > Hi, > > I have a problem, like this: > > I have 2 tables, T1 (id, val1) -- a vector > and T2 (id1, id2, val2) -- this is actually a sqare matrix > in the form of a very long table. > > I need to do some operations with them, such as take a > dot-product of 2 rows of the matrix (encoded as T2), weighted with T1.. . > > The query is the following: > > select c.id1, d.id1, sum(c.val2*d.val2*T1.val1) > from T1, T2 c, T2 d > where c.ID2 = T1.id AND d.id2 = T1.id AND c.id1 < d.id1 > group by c.id1, d.id1; > > This gives me all possible pairs of (weighted) dot-products, as I want them. > > Now, some 90% of values in the T2 are the same, say, 0.5, so I could > thouw them away. But then the query above is not valid anymore, since > if there is mo match on the condition c.ID2 = T1.id AND d.id2 = > T1.id, nothing is added (it might be that c.id2 is there and d.id2 is > not!). > > I quiess I need a left join + substitute for missing value. Can anyone > help me with the query? I cannot figure it out, too bad ad left > joins... > > thanks, > Mila First let's rewrite your query to explicit joins (I like them very much, because they show the problem more clearly) select c.id1, d.id1, sum(c.val2*d.val2*T1.val1) from T1 join T2 c on (c.ID2 = T1.id) join T2 d on (d.id2 = T1.id and c.id1 < d.id1) group by c.id1, d.id1; If I understand well you want to delete useless values: delete from t2 where val2=0.5 your query should look then: select c.id1, d.id1, sum( case when c.id2 is null then 0.5 else c.val2 end * case when d.id2 is null then 0.5 else d.val2end *T1.val1) from T1 left join T2 c on (c.ID2 = T1.id) left join T2 d on (d.id2 = T1.id and (c.id1 < d.id1 or c.id1 is null)) group by c.id1, d.id1; I'm not sure about this query... There is a problem - is it possible to have missing values in your matrix? This solution replaces all missing values into 0.5. What should happen to "T2 alias d" if there is missing value in "T2 alias c"? What happens then to clause "c.id1<d.id1" if "c" is not found? Regards, Tomasz Myrta
Tomasz, > select c.id1, d.id1, sum( > case when c.id2 is null then 0.5 else c.val2 end * > case when d.id2 is null then 0.5 else d.val2 end *T1.val1) > from > T1 > left join T2 c on (c.ID2 = T1.id) > left join T2 d on (d.id2 = T1.id and (c.id1 < d.id1 or c.id1 is null)) > group by c.id1, d.id1; > I'm not sure about this query... There is a problem - is it possible to > have missing values in your matrix? This solution replaces all missing > values into 0.5. yes, that's what I want to do, remove the tuples with default value and substitute it later if I need it. > What should happen to "T2 alias d" if there is missing > value in "T2 alias c"? if a value, c.val2, is missing, it should be replaced with 0.5, since the only deleted tuples are those that had contained 0.5 in "val2" field. The (other) value, d.val2, that exisits in the matrix, should be used "as is". If they are both missing, then t1.val1 is multiplied twice by 0.5 > What happens then to clause "c.id1<d.id1" if "c" > is not found? I removed this condition "c.id1 < d.id1", since it is indeed unclear, what to compare when a row is "missing" (I can filter it later). Then, I tried the query that you proposed, on a test table T1 - 10 rows, T2 = 10x10 = 100 rows. With the full T2 the query is equivalent to the one that I gave, but when I remove the tuples with default values (call it table T3), then the resulting table is only 90 rows instead of 100, besides, the values are different from the "right" ones from "full" T2. The missing combinations seem to be random. How can that be if T1 contains all values that might be in T2.id2, T2.id1? and what do do next? Mila
I looked at your example once again and I noticed that wrong result comes from "group by" clause. If we delete useless rows with value "0.5", we also remove information about possible values of t2.id1. Let's say your table t2 is full of 0.5 values. It means, we can empty it. If we do this, how Postgres would know which t2.id1 values to return? What size of matrix should it return? I think if you really need to remove your 0.5 values, you need another table with all possible values of id1. This table will look like: create table t3 ( id1 integer; ); your query should look then: select cc.id1, dd.id1, sum( case when c.id2 is null then 0.5 else c.val2 end * case when d.id2 is null then 0.5 else d.val2end *T1.val1) from T1 cross join T3 cc left join T2 c on (c.ID2 = T1.id and c.id1=cc.id1) cross join T3 dd left join T2 d on (d.id2 = T1.idand d.id1=dd.id1) group by cc.id1, dd.id1; Anyway I'm not sure if it is what you really want :-( Regards, Tomasz Myrta
Tomasz, > I think if you really need to remove your 0.5 values, you need another > table with all possible values of id1. This table will look like: > create table t3 > ( > id1 integer; > ); T1 contains values of id's, The T2 matrix is a matrix of some sort of "distances" between id's from T1, so original T2 is a T1 x T1. (that's what you are proposing right?) So there is no id1, id2 in T2 that don't exists in T1, I checked my test tables and this is indeed the case. In other words T1 is dense, I make a left join on it and I get wrong results! And even wrong number of tuples :-( cheers, Mila