Thread: Finding orphan records
I'm trying to find/delete all records in table A that are no longer referenced by tables B or C. There are about 4 million records in table A, and several hundred million in tables B and C. Is there something more efficient than:

  select address_key, address from addresses
    where ( not exists
      (select 1 from B where BField=addresses.address_key limit 1) )
    and ( not exists
      (select 1 from C where CField=addresses.address_key limit 1) )

Of course, all fields above are indexed. There are foreign key references in B and C to A. Is there some way to safely leverage that?

Wes
I would use an outer join and check for null values on the right-hand table whose id references table A. Sample query:

  select a.*, b.*
  from a
  left outer join b on a.id = b.a_id  -- assuming a_id is my referencing column to a
  where b.id is null;

This will yield all rows in table a that have no matching row in table b. This is just off the top of my head, just a concept; I might have a syntax error.

-----Original Message-----
From: pgsql-general-owner@postgresql.org On Behalf Of Wes
Sent: Wednesday, January 11, 2006 11:51 PM
To: Postgresql-General
Subject: [GENERAL] Finding orphan records
Resending sample query, darn where clause didn't wrap:

  select a.*, b.*
  from a
  left outer join b on a.id = b.a_id
  where b.id is null;
On 1/12/06 12:23 AM, "Jonel Rienton" <jonel@rientongroup.com> wrote:

> Resending sample query, darn where clause didn't wrap
>
> select a.*,b.* from a
> left outer join b on a.id = b.a_id
> where b.id is null;

I tried something along those lines a while back, and it was orders of magnitude slower. The above produces:

explain select address_key, address from addresses
  left outer join messages on addresses.address_key=originator
  where originator is null;

                                            QUERY PLAN
-------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=35684870.14..38457295.97 rows=4090203 width=40)
   Merge Cond: ("outer".address_key = "inner".originator)
   Filter: ("inner".originator IS NULL)
   ->  Index Scan using addresses_pkey on addresses  (cost=0.00..97213.17 rows=4090203 width=40)
   ->  Sort  (cost=35684870.14..36129462.74 rows=177837040 width=11)
         Sort Key: messages.originator
         ->  Seq Scan on messages  (cost=0.00..7215718.40 rows=177837040 width=11)

This appears to be very inefficient. B is almost two orders of magnitude larger than A. C is about 3-4 times as big as B (record count). My statement (with the same single 'B' table as above) produces:

narc=> explain select address_key, address from addresses
  where ( not exists
    (select 1 from messages where originator=addresses.address_key limit 1) );

                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Seq Scan on addresses  (cost=0.00..3398462.98 rows=2045102 width=40)
   Filter: (NOT (subplan))
   SubPlan
     ->  Limit  (cost=0.00..0.81 rows=1 width=0)
           ->  Index Scan using messages_i_orig_mdate on messages  (cost=0.00..35148.46 rows=43301 width=0)
                 Index Cond: (originator = $0)

Which seems like it should be much more efficient.

Wes
Wes <wespvp@syntegra.com> writes:

> This appears to be very inefficient. B is almost two orders of magnitude
> larger than A. C is about 3-4 times as big as B (record count). My
> statement (with the same single 'B' table as above) produces:

If it's only a factor of 3-4 then the merge join should be faster. If it's really two orders of magnitude (100x?) then the nested loop below would be faster. I think in 8.1 (and I think in 8.0 too) the planner is capable of coming up with both plans for the NOT IN query though.

--
greg
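[For reference, the NOT IN formulation Greg mentions would look something like the sketch below, with table and column names borrowed from Wes's earlier query. One caveat worth noting: NOT IN and NOT EXISTS are not equivalent when the subquery can return NULLs.]

```sql
-- Sketch only; table/column names taken from Wes's query above.
-- Caveat: if messages.originator can be NULL, NOT IN returns no rows
-- at all, which is where NOT EXISTS behaves differently.
select address_key, address
from addresses
where address_key not in (select originator from messages);
```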
On 1/12/06 10:37 AM, "Greg Stark" <gsstark@mit.edu> wrote:

> If it's only a factor of 3-4 then the merge join should be faster. If it's
> really two orders of magnitude (100x?) then the nested loop below would be
> faster. I think in 8.1 (and I think in 8.0 too) the planner is capable of
> coming up with both plans for the NOT IN query though.

Yep, it's really two orders of magnitude. A is about 4 million records, B is around 200 million, and C is about 800 million records. That's why I figured a sequential pass on A with index lookups on B and C would be fastest. It took about 3.25 hours to complete the SELECT.

I'm working on the 8.1 upgrade now. My biggest concern is making sure no bugs have crept into ecpg that would affect us (bitten once before, so being careful). So far everything else is looking good with 8.1.

I'm pondering dumping the keys for A to a file, and B and C to another file, sorting with uniqueness on the B/C file, then programmatically determining which keys have been freed. I'm pretty sure this will be much faster, but I had hoped to avoid an external process like this.

Wes
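[The external-sort approach Wes outlines can be sketched with standard Unix tools. The file names below are hypothetical; assume one key per line in each dump file, e.g. produced with COPY.]

```shell
# Hypothetical file names:
#   a_keys.txt  -- all keys in A (addresses)
#   bc_keys.txt -- every key referenced from B and C, duplicates allowed
sort -u a_keys.txt  > a_sorted.txt
sort -u bc_keys.txt > bc_sorted.txt
# comm -23 prints lines unique to the first file: keys in A that are
# no longer referenced by B or C -- the orphans.
comm -23 a_sorted.txt bc_sorted.txt > orphans.txt
```

[Both inputs to comm must be sorted the same way, which is why each file goes through sort -u first.]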