Thread: where not exists

where not exists

From

"Llew"

Date:

06 March 2002, 14:47:39

Dear everyone,
What is the best way of removing rows which are not in another table?

I have these two tables each with millions of rows/tuples.
They are joined on two fields:
CREATE TABLE a
(   join1 OID,   join2 OID,   --a fair number of other fields   .....
)
CREATE TABLE b
(   join1 OID,   join2 OID,   --a fair number of other fields   .....
)
There are indices on both of them (on "join1, join2").
At the moment, I am doing

1) get items which are orphaned in a.
CREATE TEMP TABLE orphans as   SELECT join1, join2       FROM a       WHERE NOT EXISTS       (           SELECT *
       FROM b               WHERE               a.join1 = b.join1 AND               a.join2 = b.join2       )
 
2) DELETE FROM a where orphans.join1 = a.join1 and orphans.join2=a.join2
3) DROP TABLE orphans

This is very slow. Is there a better way?
Should I first copy all join1. join2 from a and b into temporary tables
first?
Do I need to index the temporary tables?
Surely this is a general enough a problem that optimal sets of solutions
exists in people's experience.
Thanks a lot.

Llew

Re: where not exists

From

"Josh Berkus"

Date:

06 March 2002, 15:29:34

Llew,

Do it all in one query:

> DELETE FROM a
>         WHERE NOT EXISTS
>         (
>             SELECT b.join1
>                 FROM b
>                 WHERE
>                 a.join1 = b.join1 AND
>                 a.join2 = b.join2
>         )

And make sure that join1 and join2 are indexed in both tables.

-Josh Berkus

Re: where not exists

From

Jie Liang

Date:

06 March 2002, 15:29:34

set operation especially NOT EXIST is very slow(for big table),
I recommand you use a few queries for your propose:
1. select * into c from a where join1=b.join1 and join2=b.join2;
2. truncate table a;
3. insert into a select * from c;
4. drop table c;

You don't need index c since you use full table scan anyway.

Jie Liang

-----Original Message-----
From: Llew [mailto:postgres@lg.ndirect.co.uk]
Sent: Wednesday, March 06, 2002 11:28 AM
To: pgsql-sql@postgresql.org
Subject: [SQL] where not exists


Dear everyone,
What is the best way of removing rows which are not in another table?

I have these two tables each with millions of rows/tuples.
They are joined on two fields:
CREATE TABLE a
(   join1 OID,   join2 OID,   --a fair number of other fields   .....
)
CREATE TABLE b
(   join1 OID,   join2 OID,   --a fair number of other fields   .....
)
There are indices on both of them (on "join1, join2").
At the moment, I am doing

1) get items which are orphaned in a.
CREATE TEMP TABLE orphans as   SELECT join1, join2       FROM a       WHERE NOT EXISTS       (           SELECT *
       FROM b               WHERE               a.join1 = b.join1 AND               a.join2 = b.join2       )
 
2) DELETE FROM a where orphans.join1 = a.join1 and orphans.join2=a.join2
3) DROP TABLE orphans

This is very slow. Is there a better way?
Should I first copy all join1. join2 from a and b into temporary tables
first?
Do I need to index the temporary tables?
Surely this is a general enough a problem that optimal sets of solutions
exists in people's experience.
Thanks a lot.

Llew



---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

\d+ and \dd

From

Thomas Good

Date:

06 March 2002, 15:59:30

Hi, did the default behaviour of \d+ and \dd change on 7.2 (?)
I'm getting errors:

\d+ ->  function obj_description(oid, unknown) does not exist
\dd ->  no such attribute or function 'classoid'

TIA
------------------------------------------------------------------------
Thomas Good                                         tomg@admin.nrnet.org
Programmer/Analyst                                  phone:  718-818-5528
Residential Services, Behavioral Health Services    fax:    718-818-5056
Saint Vincent Catholic Medical Centers              mobile: 917-282-7359
--                                                                    --
SQL Clinic - An Open Source Clinical Record            www.sqlclinic.net
------------------------------------------------------------------------

Re: where not exists

From

"Llew"

Date:

13 March 2002, 08:29:09

Dear Jie Liang,
The common case is that there will only be a few or no orphans. Won't
recreateing and reindicing this massive table (table a) take ages and known
my db out of commission for everyone else? One of the reasons why I do
things is two steps is that this allows me archive the deleted items. This
is, alas, a common (read daily) operation / query.
Leo

"Jie Liang" <jie@stbernard.com> wrote in message
news:7C760DAA511DC74B99E7D22189F786F1906D98@MAIL01.stbernard.com...
> set operation especially NOT EXIST is very slow(for big table),
> I recommand you use a few queries for your propose:
> 1. select * into c from a where join1=b.join1 and join2=b.join2;
> 2. truncate table a;
> 3. insert into a select * from c;
> 4. drop table c;
>
> You don't need index c since you use full table scan anyway.
>
> What is the best way of removing rows which are not in another table?
>
> I have these two tables each with millions of rows/tuples.
> They are joined on two fields:
> CREATE TABLE a
> (
>     join1 OID,
>     join2 OID,
>     --a fair number of other fields
>     .....
> )
> CREATE TABLE b
> (
>     join1 OID,
>     join2 OID,
>     --a fair number of other fields
>     .....
> )
> There are indices on both of them (on "join1, join2").
> At the moment, I am doing
>
> 1) get items which are orphaned in a.
> CREATE TEMP TABLE orphans as
>     SELECT join1, join2
>         FROM a
>         WHERE NOT EXISTS
>         (
>             SELECT *
>                 FROM b
>                 WHERE
>                 a.join1 = b.join1 AND
>                 a.join2 = b.join2
>         )
> 2) DELETE FROM a where orphans.join1 = a.join1 and orphans.join2=a.join2
> 3) DROP TABLE orphans
>
> This is very slow. Is there a better way?

Re: where not exists

From

"Llew"

Date:

13 March 2002, 08:29:30

Dear Josh,
The reason why I do things is two steps is that this allows me archive the
deleted items.
Llew

""Josh Berkus"" <josh@agliodbs.com> wrote in message
news:web-819049@davinci.ethosmedia.com...
> Llew,
>
> Do it all in one query:
>
> > DELETE FROM a
> >         WHERE NOT EXISTS
> >         (
> >             SELECT b.join1
> >                 FROM b
> >                 WHERE
> >                 a.join1 = b.join1 AND
> >                 a.join2 = b.join2
> >         )
>
> And make sure that join1 and join2 are indexed in both tables.
>
> -Josh Berkus
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org

Re: where not exists

From

"Dag Arne Matre"

Date:

13 March 2002, 15:34:47

Sorry for popping in a bit late here, but you could try this:

1) get items which are orphaned in a.
CREATE TEMP TABLE orphans as   SELECT a.join1, a.join2       FROM a LEFT OUTER JOIN b ON a.join1 = b.join1 AND a.join2
=b.join2       WHERE b.join1 IS NULL AND b.join2 IS NULL
 

D A


"Llew" <leo.goodstadt@anat.ox.ac.uk> wrote in message
news:a65qm1$2k6g$1@jupiter.hub.org...
> Dear everyone,
> What is the best way of removing rows which are not in another table?
>
> I have these two tables each with millions of rows/tuples.
> They are joined on two fields:
> CREATE TABLE a
> (
>     join1 OID,
>     join2 OID,
>     --a fair number of other fields
>     .....
> )
> CREATE TABLE b
> (
>     join1 OID,
>     join2 OID,
>     --a fair number of other fields
>     .....
> )
> There are indices on both of them (on "join1, join2").
> At the moment, I am doing
>
> 1) get items which are orphaned in a.
> CREATE TEMP TABLE orphans as
>     SELECT join1, join2
>         FROM a
>         WHERE NOT EXISTS
>         (
>             SELECT *
>                 FROM b
>                 WHERE
>                 a.join1 = b.join1 AND
>                 a.join2 = b.join2
>         )
> 2) DELETE FROM a where orphans.join1 = a.join1 and orphans.join2=a.join2
> 3) DROP TABLE orphans
>
> This is very slow. Is there a better way?
> Should I first copy all join1. join2 from a and b into temporary tables
> first?
> Do I need to index the temporary tables?
> Surely this is a general enough a problem that optimal sets of solutions
> exists in people's experience.
> Thanks a lot.
>
> Llew
>
>

Re: where not exists

From

"Llew Sion Goodstadt"

Date:

20 March 2002, 08:41:12

I ended up by using an external programme.
NOT EXISTS is just a set difference.
Doing set compares is really quick if both sets are sorted.
I use CRC64s for the data and just compare the resulting sorted sets of
(large CRC 64-bit) numbers.
Because everything hashes to a number, the memory requirements are not
that bad either (8 bytes per item ~256000 tuples per Mb).
The programme is in C++ but is as fast in something like Perl.
I.e. comparing millions of rows of data takes 10s of seconds rather than
10s of minutes.


Leo

> 
> 1) get items which are orphaned in a.
> CREATE TEMP TABLE orphans as
>     SELECT a.join1, a.join2
>         FROM a LEFT OUTER JOIN b ON a.join1 = b.join1 AND 
> a.join2 = b.join2
>         WHERE b.join1 IS NULL AND b.join2 IS NULL
> 
> D A
> 
> 
> "Llew" <leo.goodstadt@anat.ox.ac.uk> wrote in message
> news:a65qm1$2k6g$1@jupiter.hub.org...
> > Dear everyone,
> > What is the best way of removing rows which are not in 
> another table?