Scrub one large table against another - Mailing list pgsql-performance

From Brendan Curran
Subject Scrub one large table against another
Date
Msg-id 452C1437.9040709@gmail.com
Whole thread Raw
Responses Re: Scrub one large table against another
List pgsql-performance
I currently have a db supporting what is for the most part an OLAP data
warehousing application.

One table (good data) has roughly 120 million rows, divided into roughly
40 different relational groups (logically by foreign key). Every time I
add data to this table, I need to afterwards scrub that group against
known "bad data" which is housed in a second table that has roughly 21
million rows.

The 120 million row good data table is called "email_record"
The 21 million row bad data table is called "suppress"

There are separate btree indexes on 'email_record_id', 'email_list_id'
and 'email' on both tables.

Each time I scrub data I pull out anywhere from 1 to 5 million rows from
the good table (depending on the size of the group i'm scrubbing) and
compare them against the 21 million rows in the 'suppress' table.

So far I've done this using a temporary staging table that stores only
the email_record_id for each row from the relevant group of the good
table. I use a plsql function that does roughly the following (i've
included only sql syntax and inserted the constant '9' where i would
normally use a variable):

The characters: email_record_id int8, email varchar(255), email_list_id int8
-------------------------------------------------------------

CREATE TEMP TABLE temp_list_suppress(email_record_id int8);

INSERT INTO temp_list_suppress
    SELECT email_record_id from ONLY email_record er
    WHERE email_list_id = 9 AND email IN
    (select email from suppress);

CREATE INDEX unique_id_index on temp_list_suppress ( email_record_id );

INSERT INTO er_banned
SELECT * from ONLY email_record er WHERE EXISTS
(SELECT 1 from temp_list_suppress ts where er.email_record_id =
ts.email_record_id)';

DELETE FROM ONLY email_record WHERE email_list_id = 9 AND email_record_id IN
    (SELECT email_record_id from temp_list_suppress);

TRUNCATE TABLE temp_list_suppress;
DROP TABLE temp_list_suppress;
--------------------------------------------------------------

The performance is dreadful, is there a more efficient way to do this?
Would I be better off just grabbing * initially from the good table
instead of just the id to avoid more sequential searches later? Here are
my configs:

Debian
Postgres 8.1.4
dual zeon
ram: 4 gigs
raid 5

# - Memory -
shared_buffers = 3000
work_mem = 92768
maintenance_work_mem = 128384

autovacuum is turned off, and the db is annalyzed and vacuumed regularly.


Regards,
Brendan

pgsql-performance by date:

Previous
From: "Jim C. Nasby"
Date:
Subject: Re: Simple join optimized badly?
Next
From: Tom Lane
Date:
Subject: Re: Scrub one large table against another