Thread: Non Matching Records in Two Tables
I need some help with a bit of SQL. I have two tables. I want to find records in one table that don't match records in anothertable based on a common column in the two tables. Both tables have a column named 'key100'. I was trying somethinglike:<br /><br /> SELECT count(*)<br /> FROM table1, table2<br /> WHERE (table1.key100 != table2.key100);<br /><br/> But the query is very slow and I finally just cancel it. Any help is very much appreciated.<br /><br /> -Ken
At 04:10 PM 2/8/06, Ken Hill wrote: >I need some help with a bit of SQL. I have two tables. I want to find >records in one table that don't match records in another table based on a >common column in the two tables. Both tables have a column named 'key100'. >I was trying something like: > >SELECT count(*) >FROM table1, table2 >WHERE (table1.key100 != table2.key100); > >But the query is very slow and I finally just cancel it. Any help is very >much appreciated. vacuum analyse table1; vacuum analyse table2; select count(*) from table1 full outer join table2 on table1.key100=table2.key100 where table1.key100 is null or table2.key100 is null; If this is also slow, post output of "EXPLAIN ANALYSE SELECT ...."
Hi, Ken, Ken Hill schrieb: > I need some help with a bit of SQL. I have two tables. I want to find > records in one table that don't match records in another table based on > a common column in the two tables. Both tables have a column named > 'key100'. I was trying something like: > > SELECT count(*) > FROM table1, table2 > WHERE (table1.key100 != table2.key100); > > But the query is very slow and I finally just cancel it. Any help is > very much appreciated. Do you have indices on the key100 columns? Is autovacuum running, or do you do analyze manually? Can you send us the output from "EXPLAIN ANALYZE [your query]"? Btw, I don't think this query will do what you wanted, it basically creates a cross product, that means if your tables look like: schabitest=# select * from table1;key100 | valuea | valueb --------+--------+-------- 1 | foo | bar 2 | blah | blubb 3 | manga | mungo schabitest=# select * from table2;key100 | valuec | valued --------+--------+-------- 1 | monday | euro 2 | sunday | dollar 4 | friday | pounds Then your query will produce something like: schabitest=# select * from table1, table2 WHERE (table1.key100 != table2.key100);key100 | valuea | valueb | key100 | valuec | valued --------+--------+--------+--------+--------+-------- 1 | foo | bar | 2 | sunday | dollar 1 | foo |bar | 4 | friday | pounds 2 | blah | blubb | 1 | monday | euro 2 | blah | blubb | 4 | friday| pounds 3 | manga | mungo | 1 | monday | euro 3 | manga | mungo | 2 | sunday | dollar 3| manga | mungo | 4 | friday | pounds I suggest you would like to have all records from table1 that don't have a corresponding record in table2: schabitest=# select * from table1 where table1.key100 not in (select key100 from table2);key100 | valuea | valueb --------+--------+-------- 3 | manga | mungo HTH, Markus
Ken Hill wrote: > I need some help with a bit of SQL. I have two tables. I want > to find records in one table that don't match records in another > table based on a common column in the two tables. Both tables > have a column named 'key100'. I was trying something like: > > SELECT count(*) > FROM table1, table2 > WHERE (table1.key100 != table2.key100); > > But the query is very slow and I finally just cancel it. Any help > is very much appreciated. That's a cartesian join, there, and it'll be huge (on the order of N*M rows, where N and M are the number of rows in thefirst and second tables respectively). It sounds like, from your description, you want to find rows in table1 that don't have a corresponding row in table2. Thisshould work: SELECT count(*) FROM table1 LEFT JOIN table2 ON table1.key100 = table2.key100 WHERE table2.key100 IS NULL; This will still be fairly slow unless there are indexes on table1.key100 and table2.key100, but nowhere near as slow as theoriginal query. Frank Bax's solution will work if what you want is a count of rows in table1 that don't have a corresponding row in table2or in table2 that don't have a corresponding row in table1; for that specific requirement you may actually be betteroff doing two queries (one for each table) and adding the results together. -Owen
On Wed, 2006-02-08 at 16:27 -0500, Frank Bax wrote: <blockquote type="CITE"><pre> <font color="#000000">At 04:10 PM 2/8/06, Ken Hill wrote:</font> <font color="#000000">>I need some help with a bit of SQL. I have two tables. I want to find </font> <font color="#000000">>records in one table that don't match records in another table based on a </font> <font color="#000000">>common column in the two tables. Both tables have a column named 'key100'. </font> <font color="#000000">>I was trying something like:</font> <font color="#000000">></font> <font color="#000000">>SELECT count(*)</font> <font color="#000000">>FROM table1, table2</font> <font color="#000000">>WHERE (table1.key100 != table2.key100);</font> <font color="#000000">></font> <font color="#000000">>But the query is very slow and I finally just cancel it. Any help is very </font> <font color="#000000">>much appreciated.</font> <font color="#000000">vacuum analyse table1;</font> <font color="#000000">vacuum analyse table2;</font> <font color="#000000">select count(*) from table1 full outer join table2 on </font> <font color="#000000">table1.key100=table2.key100 where table1.key100 is null or table2.key100 is </font> <font color="#000000">null;</font> <font color="#000000">If this is also slow, post output of "EXPLAIN ANALYSE SELECT ...." </font> <font color="#000000">---------------------------(end of broadcast)---------------------------</font> <font color="#000000">TIP 1: if posting/reading through Usenet, please send an appropriate</font> <font color="#000000"> subscribe-nomail command to <a href="mailto:majordomo@postgresql.org">majordomo@postgresql.org</a>so that your</font> <font color="#000000"> message can get through to the mailing list c</font>leanly </pre></blockquote><br /> OK. I added indexes on the two columns in the two tables:<br /><br /> CREATE INDEX key100_idex<br/> ON ncccr9 (key100);<br /><br /> CREATE INDEX key100_ncccr10_idex<br /> ON ncccr10 (key100);<br /><br />Here is the analysis of the query:<br /><br /> csalgorithm=# EXPLAIN ANALYSE SELECT count(*)<br /> csalgorithm-# FROMncccr9 LEFT JOIN ncccr10 ON ncccr9.key100 = ncccr10.key100<br /> csalgorithm-# WHERE ncccr10.key100 IS NULL;<br /> QUERY PLAN <br /> -------------------------------------------------------------------------------- --------------------------------------------------------<br/> Aggregate (cost=208337.59..208337.59 rows=1 width=0) (actualtime=255723.212.. 255723.214 rows=1 loops=1)<br /> -> Hash Left Join (cost=99523.55..207101.41 rows=494471width=0) (actual ti me=92326.635..255538.447 rows=38144 loops=1)<br /> Hash Cond: ("outer".key100 = "inner".key100)<br/> Filter: ("inner".key100 IS NULL)<br /> -> Seq Scan on ncccr9 (cost=0.00..59360.71rows=494471 width=104) (ac tual time=171.778..75099.734 rows=494471 loops=1)<br /> -> Hash (cost=88438.64..88438.64 rows=611564 width=104) (actual time= 91962.956..91962.956 rows=0 loops=1)<br /> -> Seq Scan on ncccr10 (cost=0.00..88438.64 rows=611564 width=1 04) (actual time=11.704..76519.323 rows=611564loops=1)<br /> Total runtime: 255724.219 ms<br /> (8 rows)<br /><br /> The result of 38,144 non-matching recordsseems too much:<br /><br /> csalgorithm=# SELECT count(*)<br /> csalgorithm-# FROM ncccr9 LEFT JOIN ncccr10 ON ncccr9.key100= ncccr10.key100<br /> csalgorithm-# WHERE ncccr10.key100 IS NULL;<br /> count<br /> -------<br /> 38144<br/> (1 row)<br /><br /> Maybe I need to do a RIGHT JOIN to return the count of records in table 'ncccr10' that don'tmatch records in 'ncccr9'? <br /><br /> Thanks for your help. JOINS are fairly new to me.<br /><br />
Ken Hill wrote: > I need some help with a bit of SQL. I have two tables. I want to find > records in one table that don't match records in another table based > on a common column in the two tables. Both tables have a column named > 'key100'. I was trying something like: > > SELECT count(*) > FROM table1, table2 > WHERE (table1.key100 != table2.key100); > > But the query is very slow and I finally just cancel it. Any help is > very much appreciated. > > -Ken Maybe you could use a NOT EXISTS subquery, as in SELECT count(*) from table1 WHERE NOT EXISTS(SELECT count(*) from table2 WHERE table1.key100 =table2.key100) which gives you the number of records in table1 without corresponding records in table2. That kind of query is quite fast, if there exists an index on table2.key100 hth P.Jacquot
You can use an EXCEPT clause.
On Wed, 2006-02-08 at 22:31 +0100, Markus Schaber wrote: <blockquote type="CITE"><pre> <font color="#000000">Hi, Ken,</font> <font color="#000000">Ken Hill schrieb:</font> <font color="#000000">> I need some help with a bit of SQL. I have two tables. I want to find</font> <font color="#000000">> records in one table that don't match records in another table based on</font> <font color="#000000">> a common column in the two tables. Both tables have a column named</font> <font color="#000000">> 'key100'. I was trying something like:</font> <font color="#000000">> </font> <font color="#000000">> SELECT count(*)</font> <font color="#000000">> FROM table1, table2</font> <font color="#000000">> WHERE (table1.key100 != table2.key100);</font> <font color="#000000">> </font> <font color="#000000">> But the query is very slow and I finally just cancel it. Any help is</font> <font color="#000000">> very much appreciated.</font> <font color="#000000">Do you have indices on the key100 columns? Is autovacuum running, or do</font> <font color="#000000">you do analyze manually?</font> <font color="#000000">Can you send us the output from "EXPLAIN ANALYZE [your query]"?</font> <font color="#000000">Btw, I don't think this query will do what you wanted, it basically</font> <font color="#000000">creates a cross product, that means if your tables look like:</font> <font color="#000000">schabitest=# select * from table1;</font> <font color="#000000"> key100 | valuea | valueb</font> <font color="#000000">--------+--------+--------</font> <font color="#000000"> 1 | foo | bar</font> <font color="#000000"> 2 | blah | blubb</font> <font color="#000000"> 3 | manga | mungo</font> <font color="#000000">schabitest=# select * from table2;</font> <font color="#000000"> key100 | valuec | valued</font> <font color="#000000">--------+--------+--------</font> <font color="#000000"> 1 | monday | euro</font> <font color="#000000"> 2 | sunday | dollar</font> <font color="#000000"> 4 | friday | pounds</font> <font color="#000000">Then your query will produce something like:</font> <font color="#000000">schabitest=# select * from table1, table2 WHERE (table1.key100 !=</font> <font color="#000000">table2.key100);</font> <font color="#000000"> key100 | valuea | valueb | key100 | valuec | valued</font> <font color="#000000">--------+--------+--------+--------+--------+--------</font> <font color="#000000"> 1 | foo | bar | 2 | sunday | dollar</font> <font color="#000000"> 1 | foo | bar | 4 | friday | pounds</font> <font color="#000000"> 2 | blah | blubb | 1 | monday | euro</font> <font color="#000000"> 2 | blah | blubb | 4 | friday | pounds</font> <font color="#000000"> 3 | manga | mungo | 1 | monday | euro</font> <font color="#000000"> 3 | manga | mungo | 2 | sunday | dollar</font> <font color="#000000"> 3 | manga | mungo | 4 | friday | pounds</font> <font color="#000000">I suggest you would like to have all records from table1 that don't have</font> <font color="#000000">a corresponding record in table2:</font> <font color="#000000">schabitest=# select * from table1 where table1.key100 not in (select</font> <font color="#000000">key100 from table2);</font> <font color="#000000"> key100 | valuea | valueb</font> <font color="#000000">--------+--------+--------</font> <font color="#000000"> 3 | manga | mungo</font> <font color="#000000">HTH,</font> <font color="#000000">Markus</font> </pre></blockquote> Here is my query SQL:<br /><br /> SELECT key100 FROM ncccr10<br /> WHERE ncccr10.key100 NOT IN (SELECTkey100 FROM ncccr9);<br /><br /> It is is running after 30 minutes. Here is the query plan:<br /><br /> QUERY PLAN<br /> -------------------------------------------------------------------------<br/> Seq Scan on ncccr10 (cost=0.00..20417160510.08rows=305782 width=104)<br /> Filter: (NOT (subplan))<br /> SubPlan<br /> -> SeqScan on ncccr9 (cost=0.00..65533.71 rows=494471 width=104)<br /> (4 rows)<br /><br /> Any ideas why it is so slow?
Ken Hill <ken@scottshill.com> writes: > Seq Scan on ncccr10 (cost=0.00..20417160510.08 rows=305782 width=104) > Filter: (NOT (subplan)) > SubPlan > -> Seq Scan on ncccr9 (cost=0.00..65533.71 rows=494471 width=104) > (4 rows) > Any ideas why it is so slow? "NOT (subplan)" is horrendous (and the system knows it, note the huge cost estimate). Try increasing work_mem enough so you get a hashed subplan instead. regards, tom lane
> Here is my query SQL: > > SELECT key100 FROM ncccr10 > WHERE ncccr10.key100 NOT IN (SELECT key100 FROM ncccr9); > > It is is running after 30 minutes. Here is the query plan: > I would try an outer join: select a.key100 from ncccr10 a left join ncccr9 b on( key100 ) where b.key100 is null; also (hate to be obvious) have you analyzed lately? __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
On Tue, 2006-02-14 at 13:08 -0800, chester c young wrote: <blockquote type="CITE"><pre> <font color="#000000">> Here is my query SQL:</font> <font color="#000000">> </font> <font color="#000000">> SELECT key100 FROM ncccr10</font> <font color="#000000">> WHERE ncccr10.key100 NOT IN (SELECT key100 FROM ncccr9);</font> <font color="#000000">> </font> <font color="#000000">> It is is running after 30 minutes. Here is the query plan:</font> <font color="#000000">></font> <font color="#000000">I would try an outer join:</font> <font color="#000000">select a.key100</font> <font color="#000000">from ncccr10 a</font> <font color="#000000">left join ncccr9 b on( key100 )</font> <font color="#000000">where b.key100 is null;</font> <font color="#000000">also (hate to be obvious) have you analyzed lately?</font> <font color="#000000">__________________________________________________</font> <font color="#000000">Do You Yahoo!?</font> <font color="#000000">Tired of spam? Yahoo! Mail has the best spam protection around </font> <font color="#000000"><a href="http://mail.yahoo.com">http://mail.yahoo.com</a> </font> <font color="#000000">---------------------------(end of broadcast)---------------------------</font> <font color="#000000">TIP 1: if posting/reading through Usenet, please send an appropriate</font> <font color="#000000"> subscribe-nomail command to <a href="mailto:majordomo@postgresql.org">majordomo@postgresql.org</a>so that your</font> <font color="#000000"> message can get through to the mailing list c</font>leanly </pre></blockquote> Yes, I did a VACUUM ANALYZE on the entire database before running this query. Also, I get an error withyour suggestion:<br /><br /> csalgorithm=# SELECT a.key100 FROM ncccr10 a<br /> csalgorithm-# LEFT JOIN ncccr9 b ON(a.key100)<br/> csalgorithm-# WHERE b.key100 IS Null;<br /> ERROR: argument of JOIN/ON must be type boolean, not typecharacter<br /><br />
On Tue, 2006-02-14 at 16:07 -0500, Tom Lane wrote: <blockquote type="CITE"><pre> <font color="#000000">Ken Hill <<a href="mailto:ken@scottshill.com">ken@scottshill.com</a>> writes:</font> <font color="#000000">> Seq Scan on ncccr10 (cost=0.00..20417160510.08 rows=305782 width=104)</font> <font color="#000000">> Filter: (NOT (subplan))</font> <font color="#000000">> SubPlan</font> <font color="#000000">> -> Seq Scan on ncccr9 (cost=0.00..65533.71 rows=494471 width=104)</font> <font color="#000000">> (4 rows)</font> <font color="#000000">> Any ideas why it is so slow?</font> <font color="#000000">"NOT (subplan)" is horrendous (and the system knows it, note the huge</font> <font color="#000000">cost estimate). Try increasing work_mem enough so you get a hashed</font> <font color="#000000">subplan instead.</font> <font color="#000000"> regards, tom lane</font> </pre></blockquote><br /> How do I ncrease work_mem?
On Tue, 2006-02-14 at 13:24 -0800, Ken Hill wrote:<br /><blockquote type="CITE"><font color="#000000">On Tue, 2006-02-14at 13:08 -0800, chester c young wrote: </font><blockquote type="CITE"><pre> <font color="#000000">> Here is my query SQL:</font> <font color="#000000">> </font> <font color="#000000">> SELECT key100 FROM ncccr10</font> <font color="#000000">> WHERE ncccr10.key100 NOT IN (SELECT key100 FROM ncccr9);</font> <font color="#000000">> </font> <font color="#000000">> It is is running after 30 minutes. Here is the query plan:</font> <font color="#000000">></font> <font color="#000000">I would try an outer join:</font> <font color="#000000">select a.key100</font> <font color="#000000">from ncccr10 a</font> <font color="#000000">left join ncccr9 b on( key100 )</font> <font color="#000000">where b.key100 is null;</font> <font color="#000000">also (hate to be obvious) have you analyzed lately?</font> <font color="#000000">__________________________________________________</font> <font color="#000000">Do You Yahoo!?</font> <font color="#000000">Tired of spam? Yahoo! Mail has the best spam protection around </font> <font color="#000000"><a href="http://mail.yahoo.com">http://mail.yahoo.com</a> </font> <font color="#000000">---------------------------(end of broadcast)---------------------------</font> <font color="#000000">TIP 1: if posting/reading through Usenet, please send an appropriate</font> <font color="#000000"> subscribe-nomail command to <a href="mailto:majordomo@postgresql.org">majordomo@postgresql.org</a>so that your</font> <font color="#000000"> message can get through to the mailing list cleanly</font> </pre></blockquote><font color="#000000">Yes, I did a VACUUM ANALYZE on the entire database before running this query. Also,I get an error with your suggestion:</font><br /><br /><font color="#000000">csalgorithm=# SELECT a.key100 FROM ncccr10a</font><br /><font color="#000000">csalgorithm-# LEFT JOIN ncccr9 b ON(a.key100)</font><br /><font color="#000000">csalgorithm-#WHERE b.key100 IS Null;</font><br /><font color="#000000">ERROR: argument of JOIN/ON must betype boolean, not type character</font><br /><br /></blockquote> Well, this works:<br /><br /> SELECT *<br /> FROM ncccr9a<br /> LEFT JOIN ncccr10 b USING( key100 )<br /> WHERE b.key100 is null;<br /><br /> It still seems slow. It takesabout the same time to run as in MS Access. I thought PostgreSQL would be faster.
Ken Hill wrote: >> also (hate to be obvious) have you analyzed lately? >> I'd say that's fair game, not obvious. Vacuum/Analyze is ar so aparent to a person moving to Postgres from other DB's.
On Tue, 2006-02-14 at 15:05 -0800, Bryce Nesbitt wrote: <blockquote type="CITE"><pre> <font color="#000000">Ken Hill wrote:</font> <font color="#000000">>> also (hate to be obvious) have you analyzed lately?</font> <font color="#000000">>> </font> <font color="#000000">I'd say that's fair game, not obvious. Vacuum/Analyze is ar so aparent</font> <font color="#000000">to a person moving</font> <font color="#000000">to Postgres from other DB's.</font> <font color="#000000">---------------------------(end of broadcast)---------------------------</font> <font color="#000000">TIP 4: Have you searched our list archives?</font> <font color="#000000"> <a href="http://archives.postgresql.org">http://archives.</a></font>postgresql.org </pre></blockquote> In my O'Reily "Practical PostgreSQL" book it recommends vacuuming a production database on a daily basis.I liken this to MS Access' "compact/repair" procedure.