Re: Issues with \copy from file - Mailing list pgsql-performance

From Scott Marlowe
Subject Re: Issues with \copy from file
Date
Msg-id dcc563d10910172310p4fb06188t68129c23eaa43ba3@mail.gmail.com
Whole thread Raw
In response to Issues with \copy from file  (Sigurgeir Gunnarsson <sgunnars@gmail.com>)
Responses Re: Issues with \copy from file
List pgsql-performance
On Mon, Oct 12, 2009 at 4:05 PM, Sigurgeir Gunnarsson
<sgunnars@gmail.com> wrote:
> I'm doing \copy from file into table. There are two files one with 7 million
> lines and the other with around 24 million and the data goes into separate
> table. There are only three columns in each file and four in each table (the
> primary key, id serial is the fourt). The data is about 150 MB and 450 MB
> and takes from 5 to 20 minutes to load into the database.

You can only write data then commit it so fast to one drive, and that
speed is usually somewhere in the megabyte per second range.  450+150
in 5 minutes is 120 Megs per second, that's pretty fast, but is likely
the max speed of a modern super fast 15k rpm drive.  If it's taking 20
minutes then it's 30 Megs per second which is still really good if
you're in the middle of a busy afternoon and the db has other things
to do.

The key here is monitoring your machine to see what you're maxing out.
 If you're at 100% IO then cpu tricks and tuning aren't likely to
help.  Unless you can reduce the IO load to do the same thing (things
like turning off fsync might help streamline some writes.)

To really tell what the numbers bi / bo / wa mean you really need to
run some artificial tests to see what your machine can do at
quiescence. If you can get 120Meg per second streamed, and 20 Meg per
second random on 8k blocks, then 5 minutes is the top side of what you
can ever expect to get.  If you can get 600Meg per sec then you're
there yet, and might need multiple threads to load data fast.

pg_restore supports the -j switch for this.  But it only works on
separate tables so you'd be limited to two at once right now since
there's two tables.

> What I'm wondering about is what parameters to tweak to improve the
> operation and shorten the time of the \copy ?

copy has a minimum cost in time per megabyte that you can't get out
of.   The trick is knowing when you've gotten there (or damned close)
and quit banging your head on the wall about it.


> I think I have tweaked most of
> the available in postgresql.conf, that is shared_buffer, temp_buffers,
> work_mem, maintenance_work_mem, max_fsm_pages. Maybe someone could point out
> the one really related to \copy ?

Try cranking up your checkpoint segments to several hundred.  Note
this may delay  restart on a crash.  If you crash a lot you have other
problems, but still, it lets you know that if someone trips over a
cord in the afternoon you're gonna have to wait 10 or 20 or 30 minutes
for the machine to come back up as it replays the log files.


> I would hope that there is some way for me to improve the operation. I used
> to do the operation on a MySQL server and simple time measurements gives me
> a difference of a multiple 3 to 4, where the MySQL is faster.

With innodb tables?  If it's myisam tables it doesn't really count,
unless your data is unimportant.  In which case myisam may be the
better choice.

> I would also be satisfied to know if this is an expected difference.

I'm not entirely sure it's a difference really.  I can believe one if
I see it on my hardware, where I run both dbs.  Pgsql is much faster
on my machines that mysql for this type of stuff.

Note that reading the file from the same file system that you're
writing to is gonna be slow.  It'd likely be fastest to read from one
drive that is FAST but not the main storage drive.

Next, are pg_xlog files on the same partition as the main db?

I can copy files into my big servers in teh 350 to 450
megabytes/second range if the machines are otherwise quiet (early am)
and sustain 150 to 200 even during moderately high loads during the
day.

pgsql-performance by date:

Previous
From: Euler Taveira de Oliveira
Date:
Subject: Re: Issues with \copy from file
Next
From: Joe Uhl
Date:
Subject: Re: Partitioned Tables and ORDER BY