Thread: where not exists
Dear everyone, What is the best way of removing rows which are not in another table? I have these two tables each with millions of rows/tuples. They are joined on two fields: CREATE TABLE a ( join1 OID, join2 OID, --a fair number of other fields ..... ) CREATE TABLE b ( join1 OID, join2 OID, --a fair number of other fields ..... ) There are indices on both of them (on "join1, join2"). At the moment, I am doing 1) get items which are orphaned in a. CREATE TEMP TABLE orphans as SELECT join1, join2 FROM a WHERE NOT EXISTS ( SELECT * FROM b WHERE a.join1 = b.join1 AND a.join2 = b.join2 ) 2) DELETE FROM a where orphans.join1 = a.join1 and orphans.join2=a.join2 3) DROP TABLE orphans This is very slow. Is there a better way? Should I first copy all join1. join2 from a and b into temporary tables first? Do I need to index the temporary tables? Surely this is a general enough a problem that optimal sets of solutions exists in people's experience. Thanks a lot. Llew
Llew, Do it all in one query: > DELETE FROM a > WHERE NOT EXISTS > ( > SELECT b.join1 > FROM b > WHERE > a.join1 = b.join1 AND > a.join2 = b.join2 > ) And make sure that join1 and join2 are indexed in both tables. -Josh Berkus
set operation especially NOT EXIST is very slow(for big table), I recommand you use a few queries for your propose: 1. select * into c from a where join1=b.join1 and join2=b.join2; 2. truncate table a; 3. insert into a select * from c; 4. drop table c; You don't need index c since you use full table scan anyway. Jie Liang -----Original Message----- From: Llew [mailto:postgres@lg.ndirect.co.uk] Sent: Wednesday, March 06, 2002 11:28 AM To: pgsql-sql@postgresql.org Subject: [SQL] where not exists Dear everyone, What is the best way of removing rows which are not in another table? I have these two tables each with millions of rows/tuples. They are joined on two fields: CREATE TABLE a ( join1 OID, join2 OID, --a fair number of other fields ..... ) CREATE TABLE b ( join1 OID, join2 OID, --a fair number of other fields ..... ) There are indices on both of them (on "join1, join2"). At the moment, I am doing 1) get items which are orphaned in a. CREATE TEMP TABLE orphans as SELECT join1, join2 FROM a WHERE NOT EXISTS ( SELECT * FROM b WHERE a.join1 = b.join1 AND a.join2 = b.join2 ) 2) DELETE FROM a where orphans.join1 = a.join1 and orphans.join2=a.join2 3) DROP TABLE orphans This is very slow. Is there a better way? Should I first copy all join1. join2 from a and b into temporary tables first? Do I need to index the temporary tables? Surely this is a general enough a problem that optimal sets of solutions exists in people's experience. Thanks a lot. Llew ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly
Hi, did the default behaviour of \d+ and \dd change on 7.2 (?) I'm getting errors: \d+ -> function obj_description(oid, unknown) does not exist \dd -> no such attribute or function 'classoid' TIA ------------------------------------------------------------------------ Thomas Good tomg@admin.nrnet.org Programmer/Analyst phone: 718-818-5528 Residential Services, Behavioral Health Services fax: 718-818-5056 Saint Vincent Catholic Medical Centers mobile: 917-282-7359 -- -- SQL Clinic - An Open Source Clinical Record www.sqlclinic.net ------------------------------------------------------------------------
Dear Jie Liang, The common case is that there will only be a few or no orphans. Won't recreateing and reindicing this massive table (table a) take ages and known my db out of commission for everyone else? One of the reasons why I do things is two steps is that this allows me archive the deleted items. This is, alas, a common (read daily) operation / query. Leo "Jie Liang" <jie@stbernard.com> wrote in message news:7C760DAA511DC74B99E7D22189F786F1906D98@MAIL01.stbernard.com... > set operation especially NOT EXIST is very slow(for big table), > I recommand you use a few queries for your propose: > 1. select * into c from a where join1=b.join1 and join2=b.join2; > 2. truncate table a; > 3. insert into a select * from c; > 4. drop table c; > > You don't need index c since you use full table scan anyway. > > What is the best way of removing rows which are not in another table? > > I have these two tables each with millions of rows/tuples. > They are joined on two fields: > CREATE TABLE a > ( > join1 OID, > join2 OID, > --a fair number of other fields > ..... > ) > CREATE TABLE b > ( > join1 OID, > join2 OID, > --a fair number of other fields > ..... > ) > There are indices on both of them (on "join1, join2"). > At the moment, I am doing > > 1) get items which are orphaned in a. > CREATE TEMP TABLE orphans as > SELECT join1, join2 > FROM a > WHERE NOT EXISTS > ( > SELECT * > FROM b > WHERE > a.join1 = b.join1 AND > a.join2 = b.join2 > ) > 2) DELETE FROM a where orphans.join1 = a.join1 and orphans.join2=a.join2 > 3) DROP TABLE orphans > > This is very slow. Is there a better way?
Dear Josh, The reason why I do things is two steps is that this allows me archive the deleted items. Llew ""Josh Berkus"" <josh@agliodbs.com> wrote in message news:web-819049@davinci.ethosmedia.com... > Llew, > > Do it all in one query: > > > DELETE FROM a > > WHERE NOT EXISTS > > ( > > SELECT b.join1 > > FROM b > > WHERE > > a.join1 = b.join1 AND > > a.join2 = b.join2 > > ) > > And make sure that join1 and join2 are indexed in both tables. > > -Josh Berkus > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org
Sorry for popping in a bit late here, but you could try this: 1) get items which are orphaned in a. CREATE TEMP TABLE orphans as SELECT a.join1, a.join2 FROM a LEFT OUTER JOIN b ON a.join1 = b.join1 AND a.join2 =b.join2 WHERE b.join1 IS NULL AND b.join2 IS NULL D A "Llew" <leo.goodstadt@anat.ox.ac.uk> wrote in message news:a65qm1$2k6g$1@jupiter.hub.org... > Dear everyone, > What is the best way of removing rows which are not in another table? > > I have these two tables each with millions of rows/tuples. > They are joined on two fields: > CREATE TABLE a > ( > join1 OID, > join2 OID, > --a fair number of other fields > ..... > ) > CREATE TABLE b > ( > join1 OID, > join2 OID, > --a fair number of other fields > ..... > ) > There are indices on both of them (on "join1, join2"). > At the moment, I am doing > > 1) get items which are orphaned in a. > CREATE TEMP TABLE orphans as > SELECT join1, join2 > FROM a > WHERE NOT EXISTS > ( > SELECT * > FROM b > WHERE > a.join1 = b.join1 AND > a.join2 = b.join2 > ) > 2) DELETE FROM a where orphans.join1 = a.join1 and orphans.join2=a.join2 > 3) DROP TABLE orphans > > This is very slow. Is there a better way? > Should I first copy all join1. join2 from a and b into temporary tables > first? > Do I need to index the temporary tables? > Surely this is a general enough a problem that optimal sets of solutions > exists in people's experience. > Thanks a lot. > > Llew > >
I ended up by using an external programme. NOT EXISTS is just a set difference. Doing set compares is really quick if both sets are sorted. I use CRC64s for the data and just compare the resulting sorted sets of (large CRC 64-bit) numbers. Because everything hashes to a number, the memory requirements are not that bad either (8 bytes per item ~256000 tuples per Mb). The programme is in C++ but is as fast in something like Perl. I.e. comparing millions of rows of data takes 10s of seconds rather than 10s of minutes. Leo > > 1) get items which are orphaned in a. > CREATE TEMP TABLE orphans as > SELECT a.join1, a.join2 > FROM a LEFT OUTER JOIN b ON a.join1 = b.join1 AND > a.join2 = b.join2 > WHERE b.join1 IS NULL AND b.join2 IS NULL > > D A > > > "Llew" <leo.goodstadt@anat.ox.ac.uk> wrote in message > news:a65qm1$2k6g$1@jupiter.hub.org... > > Dear everyone, > > What is the best way of removing rows which are not in > another table?