Re: Scrub one large table against another - Mailing list pgsql-performance

From Brendan Curran
Subject Re: Scrub one large table against another
Date
Msg-id 452D2195.8010907@gmail.com
In response to Re: Scrub one large table against another  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Scrub one large table against another
List pgsql-performance
Tom Lane wrote:
> Brendan Curran <brendan.curran@gmail.com> writes:
>> So much time is being spent in the Unique and Sort leaves... I would
>> think that it wouldn't need to do the unique portion, since there is no
>> DISTINCT clause...
>
> There's nothing in that query suggesting that suppress.email is unique.
> If you know that it is, try using a plain join instead of an IN.
>
>             regards, tom lane
>


Interestingly, and thank you to Tom and Jim, the explicit JOIN improved
performance tremendously (RESULTS BELOW). I converted the entire query
to use explicit joins instead of IN and EXISTS and discovered acceptable
performance. I think the next place to go from here is RAID1/RAID10 and
possibly partitioning my large table (Welcome to DDL insanity, right?).

I have to add that I'm a little surprised the documentation is so
generous to IN and EXISTS. Is there something amiss in my configuration
that prevents them from performing correctly? If not, I can't imagine a
time when IN or EXISTS would be more performant than an explicit JOIN...
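For the archive, my understanding of why the Unique and Sort steps show up at all with IN, even though my query has no DISTINCT clause (these are the same two queries timed in the results below):

-- IN has membership semantics: the planner must behave as if the
-- subquery output were deduplicated, hence the Unique/Sort over
-- suppress in plan 1 below.
SELECT email_record_id FROM ONLY email_record er
WHERE email_list_id = 13
  AND email IN (SELECT email FROM suppress);

-- A plain join skips that step, but it is only equivalent when
-- suppress.email is unique; a duplicated email in suppress would
-- duplicate the matching email_record rows in the output.
SELECT email_record_id FROM ONLY email_record er
JOIN suppress s USING (email)
WHERE er.email_list_id = 13;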

Additionally, I manually scrub for duplicates at the group level in the
email_record table to keep my records unique. I would like to use a
unique constraint instead, but have found that it makes batching in
JDBC impossible: errors abort the whole batch irrecoverably, even when
I use BEFORE INSERT triggers that simply return NULL if the record
already exists. Has anyone got an elegant solution to the 'insert only
if it doesn't already exist' problem, along the lines of MSSQL's MERGE
command?
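In the meantime, the set-based workaround I am experimenting with looks
like the sketch below (the staging table name and the lock level are my
own choices, not a recommendation from anyone here):

BEGIN;
-- Serialize writers: without a unique constraint, the NOT EXISTS
-- check alone cannot stop a concurrent session from inserting the
-- same row between the check and the insert.
LOCK TABLE email_record IN SHARE ROW EXCLUSIVE MODE;

-- email_record_staging holds the incoming batch; DISTINCT guards
-- against duplicates inside the batch itself.
INSERT INTO email_record (email_list_id, email)
SELECT DISTINCT s.email_list_id, s.email
FROM email_record_staging s
WHERE NOT EXISTS (
    SELECT 1 FROM email_record er
    WHERE er.email_list_id = s.email_list_id
      AND er.email = s.email
);
COMMIT;

It isn't MERGE, but it keeps constraint-violation errors out of the
JDBC batch entirely.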

Just one more thing... I have found that maintaining a btree index on a
varchar(255) column is extremely expensive on insert/update/delete.
Unfortunately, I need that index for queries and reports, so I am
transitioning to importing into an unindexed staging table and then
merging into the larger table. All the docs and posts recommend
dropping the index, importing the data, and then recreating the index,
which is untenable on a daily / bi-weekly basis. Is there a more
elegant solution to this indexing problem?
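Concretely, the per-batch routine I am moving to looks roughly like
this (names and the file path are illustrative; server-side COPY needs
superuser, otherwise psql's \copy does the same from the client):

-- One-time setup: same columns as email_record but no indexes, so
-- the bulk load pays no per-row btree maintenance.
CREATE TABLE email_record_staging (LIKE email_record);

-- Per batch: bulk-load the raw file into the unindexed table...
COPY email_record_staging FROM '/data/import/batch.csv' WITH CSV;

-- ...merge into the big table in one set-based statement, so the
-- varchar(255) btree is maintained once per batch instead of once
-- per row...
INSERT INTO email_record
SELECT s.* FROM email_record_staging s
WHERE NOT EXISTS (
    SELECT 1 FROM email_record er
    WHERE er.email_list_id = s.email_list_id
      AND er.email = s.email
);

-- ...and reset for the next run.
TRUNCATE email_record_staging;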

Thank you for all of your help!

EXPLAIN ANALYZE result comparison...

1. EXPLAIN ANALYZE SELECT email_record_id FROM ONLY email_record er
   WHERE email_list_id = 13
     AND email IN (SELECT email FROM suppress);

Hash Join  (cost=8359220.68..9129843.00 rows=800912 width=8) (actual time=2121601.603..2121601.603 rows=0 loops=1)
   Hash Cond: (("outer".email)::text = ("inner".email)::text)
   ->  Unique  (cost=4414093.19..4522324.49 rows=21646260 width=25) (actual time=1165955.907..1434439.731 rows=21646261 loops=1)
         ->  Sort  (cost=4414093.19..4468208.84 rows=21646260 width=25) (actual time=1165955.903..1384667.715 rows=21646261 loops=1)
               Sort Key: suppress.email
               ->  Seq Scan on suppress  (cost=0.00..393024.60 rows=21646260 width=25) (actual time=37.784..609848.551 rows=21646261 loops=1)
   ->  Hash  (cost=3899868.47..3899868.47 rows=4606808 width=32) (actual time=554522.983..554522.983 rows=3245336 loops=1)
         ->  Bitmap Heap Scan on email_record er  (cost=38464.83..3899868.47 rows=4606808 width=32) (actual time=275640.435..541342.727 rows=3245336 loops=1)
               Recheck Cond: (email_list_id = 13)
               ->  Bitmap Index Scan on list  (cost=0.00..38464.83 rows=4606808 width=0) (actual time=275102.037..275102.037 rows=5172979 loops=1)
                     Index Cond: (email_list_id = 13)
Total runtime: 2,122,693.864 ms
--------------------------------------------------------

2. EXPLAIN ANALYZE SELECT email_record_id FROM ONLY email_record er
   JOIN suppress s USING (email)
   WHERE er.email_list_id = 13;

Hash Join  (cost=3945127.49..5000543.11 rows=800912 width=8) (actual time=808874.088..808874.088 rows=0 loops=1)
   Hash Cond: (("outer".email)::text = ("inner".email)::text)
   ->  Seq Scan on suppress s  (cost=0.00..393024.60 rows=21646260 width=25) (actual time=661.518..216933.399 rows=21646261 loops=1)
   ->  Hash  (cost=3899868.47..3899868.47 rows=4606808 width=32) (actual time=494294.932..494294.932 rows=3245336 loops=1)
         ->  Bitmap Heap Scan on email_record er  (cost=38464.83..3899868.47 rows=4606808 width=32) (actual time=242198.226..485942.542 rows=3245336 loops=1)
               Recheck Cond: (email_list_id = 13)
               ->  Bitmap Index Scan on list  (cost=0.00..38464.83 rows=4606808 width=0) (actual time=241769.786..241769.786 rows=5172979 loops=1)
                     Index Cond: (email_list_id = 13)
Total runtime: 808,884.387 ms


