Thread: Non Matching Records in Two Tables

Non Matching Records in Two Tables

From

Ken Hill

Date:

08 February 2006, 17:15:06

I need some help with a bit of SQL. I have two tables. I want to find records in one table that don't match records in
anothertable based on a common column in the two tables. Both tables have a column named 'key100'. I was trying
somethinglike:<br /><br /> SELECT count(*)<br /> FROM table1, table2<br /> WHERE (table1.key100 != table2.key100);<br
/><br/> But the query is very slow and I finally just cancel it. Any help is very much appreciated.<br /><br /> -Ken

Re: Non Matching Records in Two Tables

From

Frank Bax

Date:

08 February 2006, 17:27:23

At 04:10 PM 2/8/06, Ken Hill wrote:
>I need some help with a bit of SQL. I have two tables. I want to find 
>records in one table that don't match records in another table based on a 
>common column in the two tables. Both tables have a column named 'key100'. 
>I was trying something like:
>
>SELECT count(*)
>FROM table1, table2
>WHERE (table1.key100 != table2.key100);
>
>But the query is very slow and I finally just cancel it. Any help is very 
>much appreciated.


vacuum analyse table1;

vacuum analyse table2;

select count(*) from table1 full outer join table2 on 
table1.key100=table2.key100 where table1.key100 is null or table2.key100 is 
null;

If this is also slow, post output of "EXPLAIN ANALYSE SELECT ...."

Re: Non Matching Records in Two Tables

From

Markus Schaber

Date:

08 February 2006, 17:31:41

Hi, Ken,

Ken Hill schrieb:
> I need some help with a bit of SQL. I have two tables. I want to find
> records in one table that don't match records in another table based on
> a common column in the two tables. Both tables have a column named
> 'key100'. I was trying something like:
> 
> SELECT count(*)
> FROM table1, table2
> WHERE (table1.key100 != table2.key100);
> 
> But the query is very slow and I finally just cancel it. Any help is
> very much appreciated.

Do you have indices on the key100 columns? Is autovacuum running, or do
you do analyze manually?

Can you send us the output from "EXPLAIN ANALYZE [your query]"?

Btw, I don't think this query will do what you wanted, it basically
creates a cross product, that means if your tables look like:

schabitest=# select * from table1;key100 | valuea | valueb
--------+--------+--------     1 | foo    | bar     2 | blah   | blubb     3 | manga  | mungo

schabitest=# select * from table2;key100 | valuec | valued
--------+--------+--------     1 | monday | euro     2 | sunday | dollar     4 | friday | pounds

Then your query will produce something like:
schabitest=# select * from table1, table2 WHERE (table1.key100 !=
table2.key100);key100 | valuea | valueb | key100 | valuec | valued
--------+--------+--------+--------+--------+--------     1 | foo    | bar    |      2 | sunday | dollar     1 | foo
|bar    |      4 | friday | pounds     2 | blah   | blubb  |      1 | monday | euro     2 | blah   | blubb  |      4 |
friday| pounds     3 | manga  | mungo  |      1 | monday | euro     3 | manga  | mungo  |      2 | sunday | dollar
3| manga  | mungo  |      4 | friday | pounds
 

I suggest you would like to have all records from table1 that don't have
a corresponding record in table2:

schabitest=# select * from table1 where table1.key100 not in (select
key100 from table2);key100 | valuea | valueb
--------+--------+--------     3 | manga  | mungo

HTH,
Markus

Re: Non Matching Records in Two Tables

From

"Owen Jacobson"

Date:

08 February 2006, 17:31:52

Ken Hill wrote:

> I need some help with a bit of SQL. I have two tables. I want
> to find records in one table that don't match records in another
> table based on a common column in the two tables. Both tables
> have a column named 'key100'. I was trying something like:
> 
> SELECT count(*)
> FROM table1, table2
> WHERE (table1.key100 != table2.key100);
> 
> But the query is very slow and I finally just cancel it. Any help
> is very much appreciated.

That's a cartesian join, there, and it'll be huge (on the order of N*M rows, where N and M are the number of rows in
thefirst and second tables respectively).

It sounds like, from your description, you want to find rows in table1 that don't have a corresponding row in table2.
Thisshould work:

SELECT count(*) FROM table1 LEFT JOIN table2 ON table1.key100 = table2.key100 WHERE table2.key100 IS NULL;

This will still be fairly slow unless there are indexes on table1.key100 and table2.key100, but nowhere near as slow as
theoriginal query.

Frank Bax's solution will work if what you want is a count of rows in table1 that don't have a corresponding row in
table2or in table2 that don't have a corresponding row in table1; for that specific requirement you may actually be
betteroff doing two queries (one for each table) and adding the results together.

-Owen

Re: Non Matching Records in Two Tables

From

Ken Hill

Date:

08 February 2006, 18:18:25

On Wed, 2006-02-08 at 16:27 -0500, Frank Bax wrote: <blockquote type="CITE"><pre>
<font color="#000000">At 04:10 PM 2/8/06, Ken Hill wrote:</font>
<font color="#000000">>I need some help with a bit of SQL. I have two tables. I want to find </font>
<font color="#000000">>records in one table that don't match records in another table based on a </font>
<font color="#000000">>common column in the two tables. Both tables have a column named 'key100'. </font>
<font color="#000000">>I was trying something like:</font>
<font color="#000000">></font>
<font color="#000000">>SELECT count(*)</font>
<font color="#000000">>FROM table1, table2</font>
<font color="#000000">>WHERE (table1.key100 != table2.key100);</font>
<font color="#000000">></font>
<font color="#000000">>But the query is very slow and I finally just cancel it. Any help is very </font>
<font color="#000000">>much appreciated.</font>


<font color="#000000">vacuum analyse table1;</font>

<font color="#000000">vacuum analyse table2;</font>

<font color="#000000">select count(*) from table1 full outer join table2 on </font>
<font color="#000000">table1.key100=table2.key100 where table1.key100 is null or table2.key100 is </font>
<font color="#000000">null;</font>

<font color="#000000">If this is also slow, post output of "EXPLAIN ANALYSE SELECT ...." </font>


<font color="#000000">---------------------------(end of broadcast)---------------------------</font>
<font color="#000000">TIP 1: if posting/reading through Usenet, please send an appropriate</font>
<font color="#000000">       subscribe-nomail command to <a
href="mailto:majordomo@postgresql.org">majordomo@postgresql.org</a>so that your</font>
 
<font color="#000000">       message can get through to the mailing list c</font>leanly
</pre></blockquote><br /> OK. I added indexes on the two columns in the two tables:<br /><br /> CREATE INDEX
key100_idex<br/> ON ncccr9 (key100);<br /><br /> CREATE INDEX key100_ncccr10_idex<br /> ON ncccr10 (key100);<br /><br
/>Here is the analysis of the query:<br /><br /> csalgorithm=# EXPLAIN ANALYSE SELECT count(*)<br /> csalgorithm-#  
FROMncccr9 LEFT JOIN ncccr10 ON ncccr9.key100 = ncccr10.key100<br /> csalgorithm-#   WHERE ncccr10.key100 IS NULL;<br
/>                                                               QUERY PLAN <br />
--------------------------------------------------------------------------------
--------------------------------------------------------<br/> Aggregate  (cost=208337.59..208337.59 rows=1 width=0)
(actualtime=255723.212.. 255723.214 rows=1 loops=1)<br />    ->  Hash Left Join  (cost=99523.55..207101.41
rows=494471width=0) (actual ti me=92326.635..255538.447 rows=38144 loops=1)<br />          Hash Cond: ("outer".key100 =
"inner".key100)<br/>          Filter: ("inner".key100 IS NULL)<br />          ->  Seq Scan on ncccr9 
(cost=0.00..59360.71rows=494471 width=104) (ac tual time=171.778..75099.734 rows=494471 loops=1)<br />          -> 
Hash (cost=88438.64..88438.64 rows=611564 width=104) (actual time= 91962.956..91962.956 rows=0 loops=1)<br />
              ->  Seq Scan on ncccr10  (cost=0.00..88438.64 rows=611564 width=1 04) (actual time=11.704..76519.323
rows=611564loops=1)<br /> Total runtime: 255724.219 ms<br /> (8 rows)<br /><br /> The result of 38,144 non-matching
recordsseems too much:<br /><br /> csalgorithm=# SELECT count(*)<br /> csalgorithm-#   FROM ncccr9 LEFT JOIN ncccr10 ON
ncccr9.key100= ncccr10.key100<br /> csalgorithm-#   WHERE ncccr10.key100 IS NULL;<br /> count<br /> -------<br />
38144<br/> (1 row)<br /><br /> Maybe I need to do a RIGHT JOIN to return the count of records in table 'ncccr10' that
don'tmatch records in 'ncccr9'? <br /><br /> Thanks for your help. JOINS are fairly new to me.<br /><br />

Re: Non Matching Records in Two Tables

From

Patrick JACQUOT

Date:

09 February 2006, 05:00:04

Ken Hill wrote:

> I need some help with a bit of SQL. I have two tables. I want to find 
> records in one table that don't match records in another table based 
> on a common column in the two tables. Both tables have a column named 
> 'key100'. I was trying something like:
>
> SELECT count(*)
> FROM table1, table2
> WHERE (table1.key100 != table2.key100);
>
> But the query is very slow and I finally just cancel it. Any help is 
> very much appreciated.
>
> -Ken 

Maybe you could use a NOT EXISTS subquery, as in
SELECT count(*) from table1
WHERE NOT EXISTS(SELECT count(*) from table2 WHERE table1.key100 
=table2.key100)
which gives you the number of records in table1 without corresponding 
records in table2.
That kind of query is quite fast, if there exists an index on table2.key100
hth
P.Jacquot

Re: Non Matching Records in Two Tables

From

"BigSmoke"

Date:

09 February 2006, 12:16:34

You can use an EXCEPT clause.

Re: Non Matching Records in Two Tables

From

Ken Hill

Date:

14 February 2006, 16:54:39

On Wed, 2006-02-08 at 22:31 +0100, Markus Schaber wrote: <blockquote type="CITE"><pre>
<font color="#000000">Hi, Ken,</font>

<font color="#000000">Ken Hill schrieb:</font>
<font color="#000000">> I need some help with a bit of SQL. I have two tables. I want to find</font>
<font color="#000000">> records in one table that don't match records in another table based on</font>
<font color="#000000">> a common column in the two tables. Both tables have a column named</font>
<font color="#000000">> 'key100'. I was trying something like:</font>
<font color="#000000">> </font>
<font color="#000000">> SELECT count(*)</font>
<font color="#000000">> FROM table1, table2</font>
<font color="#000000">> WHERE (table1.key100 != table2.key100);</font>
<font color="#000000">> </font>
<font color="#000000">> But the query is very slow and I finally just cancel it. Any help is</font>
<font color="#000000">> very much appreciated.</font>

<font color="#000000">Do you have indices on the key100 columns? Is autovacuum running, or do</font>
<font color="#000000">you do analyze manually?</font>

<font color="#000000">Can you send us the output from "EXPLAIN ANALYZE [your query]"?</font>

<font color="#000000">Btw, I don't think this query will do what you wanted, it basically</font>
<font color="#000000">creates a cross product, that means if your tables look like:</font>

<font color="#000000">schabitest=# select * from table1;</font>
<font color="#000000"> key100 | valuea | valueb</font>
<font color="#000000">--------+--------+--------</font>
<font color="#000000">      1 | foo    | bar</font>
<font color="#000000">      2 | blah   | blubb</font>
<font color="#000000">      3 | manga  | mungo</font>

<font color="#000000">schabitest=# select * from table2;</font>
<font color="#000000"> key100 | valuec | valued</font>
<font color="#000000">--------+--------+--------</font>
<font color="#000000">      1 | monday | euro</font>
<font color="#000000">      2 | sunday | dollar</font>
<font color="#000000">      4 | friday | pounds</font>

<font color="#000000">Then your query will produce something like:</font>
<font color="#000000">schabitest=# select * from table1, table2 WHERE (table1.key100 !=</font>
<font color="#000000">table2.key100);</font>
<font color="#000000"> key100 | valuea | valueb | key100 | valuec | valued</font>
<font color="#000000">--------+--------+--------+--------+--------+--------</font>
<font color="#000000">      1 | foo    | bar    |      2 | sunday | dollar</font>
<font color="#000000">      1 | foo    | bar    |      4 | friday | pounds</font>
<font color="#000000">      2 | blah   | blubb  |      1 | monday | euro</font>
<font color="#000000">      2 | blah   | blubb  |      4 | friday | pounds</font>
<font color="#000000">      3 | manga  | mungo  |      1 | monday | euro</font>
<font color="#000000">      3 | manga  | mungo  |      2 | sunday | dollar</font>
<font color="#000000">      3 | manga  | mungo  |      4 | friday | pounds</font>

<font color="#000000">I suggest you would like to have all records from table1 that don't have</font>
<font color="#000000">a corresponding record in table2:</font>

<font color="#000000">schabitest=# select * from table1 where table1.key100 not in (select</font>
<font color="#000000">key100 from table2);</font>
<font color="#000000"> key100 | valuea | valueb</font>
<font color="#000000">--------+--------+--------</font>
<font color="#000000">      3 | manga  | mungo</font>

<font color="#000000">HTH,</font>
<font color="#000000">Markus</font>

</pre></blockquote> Here is my query SQL:<br /><br /> SELECT key100 FROM ncccr10<br /> WHERE ncccr10.key100 NOT IN
(SELECTkey100 FROM ncccr9);<br /><br /> It is is running after 30 minutes. Here is the query plan:<br /><br />
                              QUERY PLAN<br />
-------------------------------------------------------------------------<br/> Seq Scan on ncccr10 
(cost=0.00..20417160510.08rows=305782 width=104)<br />    Filter: (NOT (subplan))<br />    SubPlan<br />      -> 
SeqScan on ncccr9  (cost=0.00..65533.71 rows=494471 width=104)<br /> (4 rows)<br /><br /> Any ideas why it is so slow?

Re: Non Matching Records in Two Tables

From

Tom Lane

Date:

14 February 2006, 17:07:33

Ken Hill <ken@scottshill.com> writes:
>  Seq Scan on ncccr10  (cost=0.00..20417160510.08 rows=305782 width=104)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Seq Scan on ncccr9  (cost=0.00..65533.71 rows=494471 width=104)
> (4 rows)

> Any ideas why it is so slow?

"NOT (subplan)" is horrendous (and the system knows it, note the huge
cost estimate).  Try increasing work_mem enough so you get a hashed
subplan instead.
        regards, tom lane

Re: Non Matching Records in Two Tables

From

chester c young

Date:

14 February 2006, 17:08:33

> Here is my query SQL:
> 
> SELECT key100 FROM ncccr10
> WHERE ncccr10.key100 NOT IN (SELECT key100 FROM ncccr9);
> 
> It is is running after 30 minutes. Here is the query plan:
>

I would try an outer join:

select a.key100
from ncccr10 a
left join ncccr9 b on( key100 )
where b.key100 is null;

also (hate to be obvious) have you analyzed lately?


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com

Re: Non Matching Records in Two Tables

From

Ken Hill

Date:

14 February 2006, 17:24:19

On Tue, 2006-02-14 at 13:08 -0800, chester c young wrote: <blockquote type="CITE"><pre>
<font color="#000000">> Here is my query SQL:</font>
<font color="#000000">> </font>
<font color="#000000">> SELECT key100 FROM ncccr10</font>
<font color="#000000">> WHERE ncccr10.key100 NOT IN (SELECT key100 FROM ncccr9);</font>
<font color="#000000">> </font>
<font color="#000000">> It is is running after 30 minutes. Here is the query plan:</font>
<font color="#000000">></font>

<font color="#000000">I would try an outer join:</font>

<font color="#000000">select a.key100</font>
<font color="#000000">from ncccr10 a</font>
<font color="#000000">left join ncccr9 b on( key100 )</font>
<font color="#000000">where b.key100 is null;</font>

<font color="#000000">also (hate to be obvious) have you analyzed lately?</font>


<font color="#000000">__________________________________________________</font>
<font color="#000000">Do You Yahoo!?</font>
<font color="#000000">Tired of spam?  Yahoo! Mail has the best spam protection around </font>
<font color="#000000"><a href="http://mail.yahoo.com">http://mail.yahoo.com</a> </font>

<font color="#000000">---------------------------(end of broadcast)---------------------------</font>
<font color="#000000">TIP 1: if posting/reading through Usenet, please send an appropriate</font>
<font color="#000000">       subscribe-nomail command to <a
href="mailto:majordomo@postgresql.org">majordomo@postgresql.org</a>so that your</font>
 
<font color="#000000">       message can get through to the mailing list c</font>leanly
</pre></blockquote> Yes, I did a VACUUM ANALYZE on the entire database before running this query. Also, I get an error
withyour suggestion:<br /><br /> csalgorithm=# SELECT a.key100 FROM ncccr10 a<br /> csalgorithm-# LEFT JOIN ncccr9 b
ON(a.key100)<br/> csalgorithm-# WHERE b.key100 IS Null;<br /> ERROR:  argument of JOIN/ON must be type boolean, not
typecharacter<br /><br />

Re: Non Matching Records in Two Tables

From

Ken Hill

Date:

14 February 2006, 17:25:06

 On Tue, 2006-02-14 at 16:07 -0500, Tom Lane wrote: <blockquote type="CITE"><pre>
<font color="#000000">Ken Hill <<a href="mailto:ken@scottshill.com">ken@scottshill.com</a>> writes:</font>
<font color="#000000">>  Seq Scan on ncccr10  (cost=0.00..20417160510.08 rows=305782 width=104)</font>
<font color="#000000">>    Filter: (NOT (subplan))</font>
<font color="#000000">>    SubPlan</font>
<font color="#000000">>      ->  Seq Scan on ncccr9  (cost=0.00..65533.71 rows=494471 width=104)</font>
<font color="#000000">> (4 rows)</font>

<font color="#000000">> Any ideas why it is so slow?</font>

<font color="#000000">"NOT (subplan)" is horrendous (and the system knows it, note the huge</font>
<font color="#000000">cost estimate).  Try increasing work_mem enough so you get a hashed</font>
<font color="#000000">subplan instead.</font>

<font color="#000000">            regards, tom lane</font>
</pre></blockquote><br /> How do I ncrease work_mem?

Re: Non Matching Records in Two Tables

From

Ken Hill

Date:

14 February 2006, 21:46:28

On Tue, 2006-02-14 at 13:24 -0800, Ken Hill wrote:<br /><blockquote type="CITE"><font color="#000000">On Tue,
2006-02-14at 13:08 -0800, chester c young wrote: </font><blockquote type="CITE"><pre>
 
<font color="#000000">> Here is my query SQL:</font>
<font color="#000000">> </font>
<font color="#000000">> SELECT key100 FROM ncccr10</font>
<font color="#000000">> WHERE ncccr10.key100 NOT IN (SELECT key100 FROM ncccr9);</font>
<font color="#000000">> </font>
<font color="#000000">> It is is running after 30 minutes. Here is the query plan:</font>
<font color="#000000">></font>

<font color="#000000">I would try an outer join:</font>

<font color="#000000">select a.key100</font>
<font color="#000000">from ncccr10 a</font>
<font color="#000000">left join ncccr9 b on( key100 )</font>
<font color="#000000">where b.key100 is null;</font>

<font color="#000000">also (hate to be obvious) have you analyzed lately?</font>


<font color="#000000">__________________________________________________</font>
<font color="#000000">Do You Yahoo!?</font>
<font color="#000000">Tired of spam?  Yahoo! Mail has the best spam protection around </font>
<font color="#000000"><a href="http://mail.yahoo.com">http://mail.yahoo.com</a> </font>

<font color="#000000">---------------------------(end of broadcast)---------------------------</font>
<font color="#000000">TIP 1: if posting/reading through Usenet, please send an appropriate</font>
<font color="#000000">       subscribe-nomail command to <a
href="mailto:majordomo@postgresql.org">majordomo@postgresql.org</a>so that your</font>
 
<font color="#000000">       message can get through to the mailing list cleanly</font>
</pre></blockquote><font color="#000000">Yes, I did a VACUUM ANALYZE on the entire database before running this query.
Also,I get an error with your suggestion:</font><br /><br /><font color="#000000">csalgorithm=# SELECT a.key100 FROM
ncccr10a</font><br /><font color="#000000">csalgorithm-# LEFT JOIN ncccr9 b ON(a.key100)</font><br /><font
color="#000000">csalgorithm-#WHERE b.key100 IS Null;</font><br /><font color="#000000">ERROR:  argument of JOIN/ON must
betype boolean, not type character</font><br /><br /></blockquote> Well, this works:<br /><br /> SELECT *<br /> FROM
ncccr9a<br /> LEFT JOIN ncccr10 b USING( key100 )<br /> WHERE b.key100 is null;<br /><br /> It still seems slow. It
takesabout the same time to run as in MS Access. I thought PostgreSQL would be faster.

Re: Non Matching Records in Two Tables

From

Bryce Nesbitt

Date:

15 February 2006, 02:58:47

Ken Hill wrote:
>> also (hate to be obvious) have you analyzed lately?
>>     
I'd say that's fair game, not obvious.  Vacuum/Analyze is ar so aparent
to a person moving
to Postgres from other DB's.

Re: Non Matching Records in Two Tables

From

Ken Hill

Date:

15 February 2006, 11:45:40

On Tue, 2006-02-14 at 15:05 -0800, Bryce Nesbitt wrote: <blockquote type="CITE"><pre>
<font color="#000000">Ken Hill wrote:</font>
<font color="#000000">>> also (hate to be obvious) have you analyzed lately?</font>
<font color="#000000">>>     </font>
<font color="#000000">I'd say that's fair game, not obvious.  Vacuum/Analyze is ar so aparent</font>
<font color="#000000">to a person moving</font>
<font color="#000000">to Postgres from other DB's.</font>


<font color="#000000">---------------------------(end of broadcast)---------------------------</font>
<font color="#000000">TIP 4: Have you searched our list archives?</font>

<font color="#000000">               <a
href="http://archives.postgresql.org">http://archives.</a></font>postgresql.org
</pre></blockquote> In my O'Reily "Practical PostgreSQL" book it recommends vacuuming a production database on a daily
basis.I liken this to MS Access' "compact/repair" procedure.