Thread: Finding orphan records

Finding orphan records

From
Wes
Date:
I'm trying to find/delete all records in table A that are no longer
referenced by tables B or C.  There are about 4 million records in table A,
and several hundred million in tables B and C.

Is there something more efficient than:

select address_key, address from addresses where ( not exists(select 1 from
B where BField=addresses.address_key limit 1) ) and ( not exists(select 1
from C where CField=addresses.address_key limit 1) )

Of course, all fields above are indexed.

There are foreign key references in B and C to A.  Is there some way to
safely leverage that?

Wes



Re: Finding orphan records

From
"Jonel Rienton"
Date:
I would use an outer join and check only those null-value records in the
right table with id's referencing table A

Sample query:

select a.*,b.* from a
left outer join b on a.id = b.a_id  -- assuming a_id is my referencing
column to a
where b.id is null;

This will yield all columns in table a which has a null value on table b

This is just from the top of my head, just a concept, I might have some
syntax error.

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Wes
Sent: Wednesday, January 11, 2006 11:51 PM
To: Postgresql-General
Subject: [GENERAL] Finding orphan records

I'm trying to find/delete all records in table A that are no longer
referenced by tables B or C.  There are about 4 million records in table A,
and several hundred million in tables B and C.

Is there something more efficient than:

select address_key, address from addresses where ( not exists(select 1 from
B where BField=addresses.address_key limit 1) ) and ( not exists(select 1
from C where CField=addresses.address_key limit 1) )

Of course, all fields above are indexed.

There are foreign key references in B and C to A.  Is there some way to
safely leverage that?

Wes



---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org

--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.371 / Virus Database: 267.14.16/225 - Release Date: 1/9/2006



Re: Finding orphan records

From
"Jonel Rienton"
Date:
Resending sample query, darn where clause didn't wrap

select a.*,b.* from a
left outer join b on a.id = b.a_id
where b.id is null;

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Wes
Sent: Wednesday, January 11, 2006 11:51 PM
To: Postgresql-General
Subject: [GENERAL] Finding orphan records

I'm trying to find/delete all records in table A that are no longer
referenced by tables B or C.  There are about 4 million records in table A,
and several hundred million in tables B and C.

Is there something more efficient than:

select address_key, address from addresses where ( not exists(select 1 from
B where BField=addresses.address_key limit 1) ) and ( not exists(select 1
from C where CField=addresses.address_key limit 1) )

Of course, all fields above are indexed.

There are foreign key references in B and C to A.  Is there some way to
safely leverage that?

Wes



---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org

--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.371 / Virus Database: 267.14.16/225 - Release Date: 1/9/2006



Re: Finding orphan records

From
Wes
Date:
On 1/12/06 12:23 AM, "Jonel Rienton" <jonel@rientongroup.com> wrote:

> Resending sample query, darn where clause didn't wrap
>
> select a.*,b.* from a
> left outer join b on a.id = b.a_id
> where b.id is null;

I tried something along those lines a while back, and it was orders of
magnitude slower.  The above produces:

explain select address_key, address from addresses left outer join messages
on addresses.address_key=originator where originator is null;
                                           QUERY PLAN
----------------------------------------------------------------------------
---------------------
 Merge Left Join  (cost=35684870.14..38457295.97 rows=4090203 width=40)
   Merge Cond: ("outer".address_key = "inner".originator)
   Filter: ("inner".originator IS NULL)
   ->  Index Scan using addresses_pkey on addresses  (cost=0.00..97213.17
rows=4090203 width=40)
   ->  Sort  (cost=35684870.14..36129462.74 rows=177837040 width=11)
         Sort Key: messages.originator
         ->  Seq Scan on messages  (cost=0.00..7215718.40 rows=177837040
width=11)


This appears to be very inefficient.  B is almost two orders of magnitude
larger than A.  C is about 3-4 times as big as B (record count).  My
statement (with the same single 'B' table as above) produces:

narc=> explain select address_key, address from addresses where ( not
exists(select 1 from messages where originator=addresses.address_key limit
1) );
                                                 QUERY PLAN
----------------------------------------------------------------------------
--------------------------------
 Seq Scan on addresses  (cost=0.00..3398462.98 rows=2045102 width=40)
   Filter: (NOT (subplan))
   SubPlan
     ->  Limit  (cost=0.00..0.81 rows=1 width=0)
           ->  Index Scan using messages_i_orig_mdate on messages
(cost=0.00..35148.46 rows=43301 width=0)
                 Index Cond: (originator = $0)


Which seems like it should be much more efficient.

Wes



Re: Finding orphan records

From
Greg Stark
Date:
Wes <wespvp@syntegra.com> writes:

> This appears to be very inefficient.  B is almost two orders of magnitude
> larger than A.  C is about 3-4 times as big as B (record count).  My
> statement (with the same single 'B' table as above) produces:

If it's only a factor of 3-4 then the merge join should be faster. If it's
really two orders of magnitude (100x?) then the nested loop below would be
faster. I think in 8.1 (and I think in 8.0 too) the planner is capable of
coming up with both plans for the NOT IN query though.

--
greg

Re: Finding orphan records

From
Wes
Date:
On 1/12/06 10:37 AM, "Greg Stark" <gsstark@mit.edu> wrote:

> If it's only a factor of 3-4 then the merge join should be faster. If it's
> really two orders of magnitude (100x?) then the nested loop below would be
> faster. I think in 8.1 (and I think in 8.0 too) the planner is capable of
> coming up with both plans for the NOT IN query though.

Yep, it's really two orders of magnitude.  A is about 4 million records.  B
is around 200 million, and C is about 800 million records.  That's why I
figured a sequential pass on A with index lookups on B and C would be
fastest.  It took about 3.25 hours to complete the SELECT.

I'm working on the 8.1 upgrade now.  My biggest concern is making sure no
bugs have crept in to ecpg that would affect us (bitten once before, so
being careful).  So far everything else is looking good with 8.1.

I'm pondering dumping the keys for A to a file, and B and C to another file,
sorting with uniqueness on the B/C file, then programmatically determining
which keys have been freed.  I'm pretty sure this will be much faster, but I
had hoped to avoid an external process like this.

Wes