Bulk loading/merging - Mailing list pgsql-performance

From: Worky Workerson
Subject: Bulk loading/merging
Date:
Msg-id: ce4072df0605261148g4ec22267gc54b6b445e47c855@mail.gmail.com
List: pgsql-performance
I've set up something similar to the 'recommended' way to merge data into
the DB, i.e.

http://www.postgresql.org/docs/current/static/plpgsql-control-structures.html#PLPGSQL-ERROR-TRAPPING

however I did it with a trigger on insert, i.e. (not my schema :) ):

CREATE TABLE db (a INT PRIMARY KEY, b TEXT, c INTEGER, d INET);

CREATE FUNCTION merge_db() RETURNS TRIGGER AS
$$
BEGIN
    -- try to update an existing row with the same merge key (a, c, d),
    -- treating NULLs as equal
    UPDATE db SET b = NEW.b
        WHERE a = NEW.a
              AND NOT (c IS DISTINCT FROM NEW.c)
              AND NOT (d IS DISTINCT FROM NEW.d);
    IF found THEN
        RETURN NULL;    -- merged into an existing row, skip the INSERT
    END IF;
    RETURN NEW;         -- no match, let the INSERT go through
END;
$$
LANGUAGE plpgsql;

CREATE TRIGGER merge_db_tr BEFORE INSERT ON db
FOR EACH ROW EXECUTE PROCEDURE merge_db();
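
Just to make the intended behavior concrete, here's a minimal example
with made-up values (not my real data), assuming the trigger above is
installed:

INSERT INTO db (a, b, c, d) VALUES (1, 'first',  10, '10.0.0.1');
INSERT INTO db (a, b, c, d) VALUES (1, 'second', 10, '10.0.0.1');
-- the second INSERT matches on (a, c, d), so the trigger turns it into
-- an UPDATE of b; db ends up with a single row (1, 'second', 10, '10.0.0.1')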

Is this the best/fastest way to do this sort of thing?  I only get
about 50 inserts/second with the trigger, while without it (inserting
unmerged data) I can get more like 1000/second.  I'm using the
NOT ... IS DISTINCT FROM construct to handle NULL values that might be
in the columns ... I'm only considering two keys equal if (a, c, d) are
all the same (i.e. either the same value or both NULL).
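
For example, the reason for NOT ... IS DISTINCT FROM rather than plain
equality is the NULL handling:

SELECT NULL = NULL;                       -- NULL, so a plain WHERE clause won't match
SELECT NOT (NULL IS DISTINCT FROM NULL);  -- true, i.e. "both NULL" counts as equal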

I read that there is a race condition with the above method as applied
to a normal function ... does this apply to a trigger as well?

Optimization Questions:
-Can I do better with the trigger function itself?

-I realize that I can create indexes on some of the lookup columns
('key' in the above example).  This would speed up locating the record
to update but slow down the actual insert/update, right?  Would this
be a win?  I tested an index on 10000 rows, and it beat the non-indexed
version by about 7% (3:31 with the index, 3:45 without) ... is this all
the benefit that I can expect?
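
For concreteness, the kind of index I mean would be something like this
sketch on the toy schema above (the name is made up; in the toy example
the PRIMARY KEY on a already provides an index on that column):

CREATE INDEX db_merge_key_idx ON db (a, c, d);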

-Will moving pg_xlog to a different disk help all that much, if the
whole DB is currently on a 4 disk RAID10?  What about moving the
indexes?  I've set up my postgresql.conf according to the docs and
Josh Berkus' presentation, i.e. (16GB RAM, quad Opteron machine, not
all settings are relevant):
shared_buffers = 60000
temp_buffers = 10000
work_mem = 131072
maintenance_work_mem = 524288
effective_cache_size = 120000
random_page_cost = 2
wal_buffers = 128
checkpoint_segments = 128
checkpoint_timeout = 3000
max_fsm_pages = 2000000
max_fsm_relations = 1000000

-If I break up my dataset into smaller chunks and parallelize it,
could I get better total performance, or would I most likely be
thrashing the disk?

-If I sort the data in the COPY file by key (i.e. a,c,d) before
inserting it into the database, will this help out the DB at all?

-It's cleaner to just be able to insert everything into the database
and let the DB aggregate the records, however I could use some of our
extra hardware to do aggregation in perl and then output the already
aggregated records to the DB ... this has the advantage of being
easily parallelizable but requires a bit of extra work to get right.
Do you think that this is the best way to go?

Also, as a slight aside, without a trigger, COPY seems to process each
record very quickly (using Perl DBI, about 7000 records/second),
however there is a long pause once the last record has been delivered.
Is this just the backend queuing up the insert commands given by
Perl, or is there extra processing that needs to be done at the end of
the COPY that could be taking a while (10s on a 500K record COPY)?

Thanks!
