Re: Best COPY Performance - Mailing list pgsql-performance

From Jim C. Nasby
Subject Re: Best COPY Performance
Date
Msg-id 20061023215913.GW26892@nasby.net
In response to Re: Best COPY Performance  ("Worky Workerson" <worky.workerson@gmail.com>)
Responses Re: Best COPY Performance  ("Joshua D. Drake" <jd@commandprompt.com>)
Re: Best COPY Performance  ("Craig A. James" <cjames@modgraph-usa.com>)
Re: Best COPY Performance  ("Worky Workerson" <worky.workerson@gmail.com>)
Re: Best COPY Performance  (Alex Stapleton <alexs@advfn.com>)
List pgsql-performance
On Mon, Oct 23, 2006 at 11:10:19AM -0400, Worky Workerson wrote:
> >> I am most interested in loading two tables, one with about 21 (small)
> >> VARCHARs where each record is about 200 bytes, and another with 7
> >> INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350
> >> bytes.
> >
> >indexes/keys?  more memory for sorting during index creation can have
> >a dramatic effect on bulk insert performance.  check for pgsql_tmp
> >folders popping up during the copy run.
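(For reference, the setting that controls sort memory for index builds is
maintenance_work_mem, and the spill directories to watch for are named
pgsql_tmp. A hypothetical example, with a made-up value and made-up
table/index names:

    SET maintenance_work_mem = '512MB';   -- example value only
    CREATE UNIQUE INDEX facts_src_ip_idx ON facts (src_ip);
)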
>
> The only index on load is a single IP4 btree primary key, which I
> figure should function about like an INTEGER.
>
> >for a table light on indexes, 10-15k rows/sec for copy is pretty poor.  you
> >can get pretty close to that with raw inserts on good hardware. I would
> >suggest configuring your perl script to read from stdin and write to
> >stdout, and pipe it to psql using copy from stdin.  then just
> >benchmark your perl script redirecting output to a file.
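(Concretely, that pipeline looks something like the following, with the
script emitting COPY text format (tab-delimited) on stdout; the script,
database, and table names here are made up:

    ./transform.pl < raw_data.csv | psql -d mydb -c "COPY facts FROM STDIN"

To time just the perl side, redirect its output to /dev/null instead of
piping it into psql.)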
>
> So simple, and I hadn't thought of that ... thanks.  When I pre-create a
> COPY file, I can load it at about 45K inserts/sec (the file was 1.8GB,
> or 14.5 million records, loaded in 331 seconds), which works out to
> about 5.5 MB/s.  I'm loading from a local 15K SCSI320 RAID10 (which also
> contains the PG log files) to a 10K SCSI320 RAID10 on an FC SAN.  Does
> this look more consistent with "decent" performance, or should I start
> looking into hardware issues, i.e. the SAN configuration?  I'm currently
> wearing several hats, including hardware/systems/security admin as well
> as DBA and programmer, and my SAN setup skills could definitely use
> some more work.
>
> Hardware aside, my perl can definitely use some work, and the bottleneck
> seems to be mostly the CSV handling, which I'm using mainly for
> convenience.  I'll see if I can't redo some of that to eliminate the CSV
> processing, or, barring that, multithread the process to use more of the
> CPUs.  Part of the reason I hadn't used psql in the first place is that
> I'm loading the data into partitioned tables, and the loader keeps
> several COPY connections open at a time to load the data into the
> right table.  I guess I could just as easily keep several psql pipes
> open, but it seemed cleaner to go through DBI.

http://stats.distributed.net used to use a perl script to do some
transformations before loading data into the database. IIRC, when we
switched to C we saw about a 100x improvement in speed, so I suspect that
if you want performance, perl isn't the way to go. I think you can
compile perl into C, so maybe that would help some.

Ultimately, you might be best off using triggers instead of rules for the
partitioning, since then you could use COPY. Or go to raw INSERT commands
wrapped in a transaction.
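
For reference, a minimal sketch of the trigger approach (table, column,
and partition names are invented; adjust to your schema). COPY into the
parent fires the per-row BEFORE INSERT trigger, so rows get routed to the
right partition, whereas rules never fire for COPY:

    CREATE TABLE facts (
        logtime  timestamptz NOT NULL,
        src_ip   inet NOT NULL,
        payload  bytea
    );

    CREATE TABLE facts_2006_10 (
        CHECK (logtime >= '2006-10-01' AND logtime < '2006-11-01')
    ) INHERITS (facts);

    CREATE FUNCTION facts_route() RETURNS trigger AS $$
    BEGIN
        IF NEW.logtime >= '2006-10-01' AND NEW.logtime < '2006-11-01' THEN
            INSERT INTO facts_2006_10 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'no partition for timestamp %', NEW.logtime;
        END IF;
        RETURN NULL;  -- suppress the insert into the parent table
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER facts_insert_route
        BEFORE INSERT ON facts
        FOR EACH ROW EXECUTE PROCEDURE facts_route();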
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
