Thread: Improve COPY performance for large data sets

Improve COPY performance for large data sets

From: Ryan Hansen
Greetings,

I'm relatively new to PostgreSQL but I've been in the IT applications
industry for a long time, mostly in the LAMP world.

One thing I'm experiencing some trouble with is running a COPY of a
large file (20+ million records) into a table in a reasonable amount of
time.  Currently it's taking about 12 hours to complete on a 64 bit
server with 3 GB memory allocated (shared_buffer), single SATA 320 GB
drive.  I don't seem to get any improvement running the same operation
on a dual opteron dual-core, 16 GB server.

I'm not asking for someone to solve my problem, just some direction in
the best ways to tune for faster bulk loading, since this will be a
fairly regular operation for our application (assuming it can work this
way).  I've toyed with the maintenance_work_mem and some of the other
params, but it's still way slower than it seems like it should be.
So any contributions are much appreciated.

Thanks!

P.S. Assume I've done a ton of reading and research into PG tuning,
which I have.  I just can't seem to find anything beyond the basics that
talks about really speeding up bulk loads.

Re: Improve COPY performance for large data sets

From: Ryan Hansen
NEVERMIND!!

I found it.  Turns out there was still a constraint on the table.  Once
I dropped that, the time went down to 44 minutes.
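
In case it helps anyone searching the archives later, a minimal sketch of
the fix (table and constraint names are placeholders, and the foreign key
is shown only as an example -- the exact constraint type doesn't change
the idea):

ALTER TABLE big_table DROP CONSTRAINT big_table_fk;

COPY big_table FROM '/path/to/data.csv' WITH CSV;

-- Re-add it afterward so it is validated once over the whole table
-- instead of once per row during the COPY.
ALTER TABLE big_table ADD CONSTRAINT big_table_fk
  FOREIGN KEY (other_id) REFERENCES other_table (id);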

Maybe I am an idiot after all. :)

-Ryan


Re: Improve COPY performance for large data sets

From: Alan Hodgson
On Wednesday 10 September 2008, Ryan Hansen <ryan.hansen@brightbuilders.com>
wrote:
> Currently it's taking about 12 hours to complete on a 64 bit
> server with 3 GB memory allocated (shared_buffer), single SATA 320 GB
> drive.  I don't seem to get any improvement running the same operation
> on a dual opteron dual-core, 16 GB server.
>
> I'm not asking for someone to solve my problem, just some direction in
> the best ways to tune for faster bulk loading, since this will be a
> fairly regular operation for our application (assuming it can work this
> way).  I've toyed with the maintenance_work_mem and some of the other
> params, but it's still way slower than it seems like it should be.
> So any contributions are much appreciated.

Your drive subsystem, such as it is, is inappropriate for a database. Your
bottleneck is your drive.

Turning fsync off might help. You should also drop all indexes on the table
before the COPY and add them back after (which would eliminate a lot of
random I/O during the COPY).
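
As a rough sketch, with made-up table and index names (note that fsync is
a server-wide setting in postgresql.conf, not something you can change per
session, so only turn it off for the duration of the load):

-- postgresql.conf, during the bulk load only (a crash while it's off can lose data):
--   fsync = off

DROP INDEX big_table_col_idx;

COPY big_table FROM '/path/to/data.csv' WITH CSV;

-- Rebuilding once after the load avoids the random I/O of updating the
-- index row by row during the COPY.
CREATE INDEX big_table_col_idx ON big_table (col);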

--
Alan

Re: Improve COPY performance for large data sets

From: Bill Moran
In response to Ryan Hansen <ryan.hansen@brightbuilders.com>:
>
> I'm relatively new to PostgreSQL but I've been in the IT applications
> industry for a long time, mostly in the LAMP world.
>
> One thing I'm experiencing some trouble with is running a COPY of a
> large file (20+ million records) into a table in a reasonable amount of
> time.  Currently it's taking about 12 hours to complete on a 64 bit
> server with 3 GB memory allocated (shared_buffer), single SATA 320 GB
> drive.  I don't seem to get any improvement running the same operation
> on a dual opteron dual-core, 16 GB server.
>
> I'm not asking for someone to solve my problem, just some direction in
> the best ways to tune for faster bulk loading, since this will be a
> fairly regular operation for our application (assuming it can work this
> way).  I've toyed with the maintenance_work_mem and some of the other
> params, but it's still way slower than it seems like it should be.
> So any contributions are much appreciated.

There's a program called pgloader which supposedly is faster than copy.
I've not used it so I can't say definitively how much faster it is.

A single 320G drive isn't going to get you much on speed.  How many
RPM?  Watch iostat on your platform to see if you're saturating the
drive.  If you are, the only way you're going to get it faster is to
add more disks in a RAID-10 or similar, or somehow get a faster disk.

You always have the option to turn off fsync, but be sure you understand
the consequences of doing that and have an appropriate failure plan
before doing so.

--
Bill Moran
Collaborative Fusion Inc.
http://people.collaborativefusion.com/~wmoran/

wmoran@collaborativefusion.com
Phone: 412-422-3463x4023

Re: Improve COPY performance for large data sets

From: Dimitri Fontaine
Hi,

On Wednesday 10 September 2008, Ryan Hansen wrote:
> One thing I'm experiencing some trouble with is running a COPY of a
> large file (20+ million records) into a table in a reasonable amount of
> time.  Currently it's taking about 12 hours to complete on a 64 bit
> server with 3 GB memory allocated (shared_buffer), single SATA 320 GB
> drive.  I don't seem to get any improvement running the same operation
> on a dual opteron dual-core, 16 GB server.

Your single SATA disk is probably very busy switching between reading the
source file and writing the data. You could try raising checkpoint_segments
to 64 or more, but a single SATA disk won't give you high I/O performance.
You're getting what you paid for...

You could maybe ease the disk load by launching the COPY from a remote (local
network) machine, and, while you're at it, if the file is big, try parallel
loading with pgloader.
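
A rough sketch of both ideas, with placeholder names and values:

-- postgresql.conf on the server (reload the config to apply):
--   checkpoint_segments = 64

-- From psql on another machine on the local network; \copy streams the
-- file over the connection instead of making the server read it from its
-- own disk:
\copy big_table from 'data.csv' with csv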

Regards,
--
dim


Re: Improve COPY performance for large data sets

From: Scott Carey
A single SATA drive may not be the best performer, but:

1. It won't make a load take 12 hours unless we're talking about a load that is, in total, similar to the size of the disk.  A slow, newer SATA drive will read and write at ~50MB/sec at minimum, so the whole 320GB can be scanned at 3GB per minute.  That's ~5 hours.  It is not likely that 20M records is over 20GB, and at that size there is no way the disk is the bottleneck.

2. To figure out whether the disk or the CPU is the bottleneck, don't assume.  Check iostat or top and look at the disk utilization % and I/O wait times.  Check the backend process CPU utilization.  In my experience, there are many things that can cause COPY to be completely CPU bound even with slow disks -- I have seen it bound to a 5MB/sec write rate on a 3GHz CPU, which a drive from 1998 could handle.

It seems like this case is resolved, but there are some other good tuning recommendations.  Don't blame the disk until the disk is actually showing high utilization though. 

COPY is typically bound by the disk or a single CPU.  It is usually CPU bound if there are indexes or constraints on the table, and sometimes even when there are none.

In almost all cases, the pg_bulkload tool will be significantly faster, but it has limitations that make it inappropriate for some uses.




Re: Improve COPY performance for large data sets

From: Scott Carey
Correction --
 2 hours to read the whole disk.

> 1. It won't make a load take 12 hours unless we're talking about a load that is, in total, similar to the size of the disk.  A slow, newer SATA drive will read and write at ~50MB/sec at minimum, so the whole 320GB can be scanned at 3GB per minute.  That's ~5 hours.  It is not likely that 20M records is over 20GB, and at that size there is no way the disk is the bottleneck.

Re: Improve COPY performance for large data sets

From: Dimitri Fontaine

On Sep 10, 2008, at 19:16, Bill Moran wrote:
> There's a program called pgloader which supposedly is faster than copy.
> I've not used it so I can't say definitively how much faster it is.

In fact pgloader uses COPY under the hood, and does so via a network
connection (which could be a unix domain socket), whereas COPY on the
server reads the file content directly from the local file. So no,
pgloader is not inherently faster than COPY.

That said, pgloader is able to split the workload between as many
threads as you want, and so could saturate the I/O when the disk
subsystem performs well enough that a single CPU cannot keep it busy.
Two parallel loading modes are supported: pgloader will either have N
parts of the file processed by N threads, or have one thread read and
parse the file and then fill up queues for N threads to send COPY
commands to the server.

Now, it could be that using pgloader with a parallel setup performs
better than plain COPY on the server. This remains to be tested; the
use case at hand is said to involve data files of hundreds of GB or
some TB. I don't have any facilities to test-drive such a setup...

Note that those pgloader parallel options were requested by PostgreSQL
hackers in order to testbed some ideas with respect to a parallel
pg_restore; maybe re-explaining what has been implemented will reopen
this can of worms :)

Regards,
--
dim


Re: Improve COPY performance for large data sets

From: Scott Marlowe
On Wed, Sep 10, 2008 at 11:16 AM, Bill Moran
<wmoran@collaborativefusion.com> wrote:
> There's a program called pgloader which supposedly is faster than copy.
> I've not used it so I can't say definitively how much faster it is.

I think you are thinking of pg_bulkloader...

Re: [PERFORM] Improve COPY performance for large data sets

From: jay
I suspect your table has an index, or checkpoint_segments is small and
leads PG to checkpoint frequently.
If the table has an index or constraint, drop it and do the copy; after the
copy finishes, create the index or constraint again.
If checkpoint_segments is small, enlarge it.
You can also turn fsync off while you do the copy, and turn it on again
afterward.
You can also enlarge maintenance_work_mem.

If you do all of the above, the load time will drop significantly.
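
A rough sketch of the whole sequence, with placeholder table, index, and
constraint names and made-up values:

-- postgresql.conf, during the bulk load only (fsync = off risks data loss on a crash):
--   fsync = off
--   checkpoint_segments = 64

SET maintenance_work_mem = '1GB';  -- speeds up the index rebuild below

ALTER TABLE big_table DROP CONSTRAINT big_table_fk;
DROP INDEX big_table_col_idx;

COPY big_table FROM '/path/to/data.csv' WITH CSV;

CREATE INDEX big_table_col_idx ON big_table (col);
ALTER TABLE big_table ADD CONSTRAINT big_table_fk
  FOREIGN KEY (other_id) REFERENCES other_table (id);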

莫建祥 (Mo Jianxiang)
Alibaba Software (Shanghai) Co., Ltd.
R&D Center, IM Server Development Department
Contact: 86-0571-85022088-13072
TradeManager ID: jaymo  Taobao ID: jackem
Company website: www.alisoft.com
wiki: http://10.0.32.21:1688/confluence/pages/viewpage.action?pageId=10338

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Ryan Hansen
Sent: September 11, 2008 1:14
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Improve COPY performance for large data sets

NEVERMIND!!

I found it.  Turns out there was still a constraint on the table.  Once
I dropped that, the time went down to 44 minutes.

Maybe I am an idiot after all. :)

-Ryan