Thread: Feature proposal
I'm currently playing with a very large data import using COPY from a file. As this can be an extremely long operation (hours in my case), a nice feature would be some option to show the operation's progress - how many rows have already been imported. Or maybe there is already some way to do it? As long as Postgres has no read uncommitted, I think I can only estimate it by the destination table size??
On 2010-08-25, wstrzalka wrote:
> I'm currently playing with a very large data import using COPY from a file.
>
> As this can be an extremely long operation (hours in my case), a nice feature would be some option to show the operation's progress - how many rows have already been imported.
>
> Or maybe there is already some way to do it? As long as Postgres has no read uncommitted, I think I can only estimate it by the destination table size??

Hi,

You can use a tool like pv for a progress bar and pipe its output into psql.

HTH
-- Robert
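For example, a pv-based invocation might look something like this (the database, table, and file names, and the CSV format, are assumptions for illustration):

    # pv prints a progress bar, throughput and ETA based on how much of the
    # input file has been consumed so far
    pv /data/import.csv | psql -d mydb -c "COPY big_import_table FROM STDIN CSV"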
On 25.08.2010 09:15, wstrzalka wrote:
> I'm currently playing with a very large data import using COPY from a file.
>
> As this can be an extremely long operation (hours in my case), a nice feature would be some option to show the operation's progress - how many rows have already been imported.
>
> Or maybe there is already some way to do it? As long as Postgres has no read uncommitted, I think I can only estimate it by the destination table size??

By the way, did you try to optimize your PostgreSQL server? In my case I was able to reduce a big data update from 1 hour 15 minutes to 5 minutes, without changing a single line of data or SQL. Incredible, isn't it?

Denis
On Wed, 2010-08-25 at 17:06 +0200, Denis BUCHER wrote:
> By the way, did you try to optimize your PostgreSQL server? In my case I was able to reduce a big data update from 1 hour 15 minutes to 5 minutes, without changing a single line of data or SQL. Incredible, isn't it?

Curious - what postgresql.conf settings did you change to improve it?
On Wed, 2010-08-25 at 12:20 -0400, Eric Comeau wrote:
> Curious - what postgresql.conf settings did you change to improve it?

The most obvious would be to turn fsync off, turn synchronous_commit off, increase work_mem, increase checkpoint_timeout, and increase checkpoint_segments (more WAL segments).

JD

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
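As a rough sketch, such bulk-load settings might look like this in postgresql.conf (the values are illustrative only, and fsync = off is safe only for data you can reload from scratch):

    # bulk-load settings - revert fsync/synchronous_commit once the import is done
    fsync = off                    # UNSAFE for data you cannot afford to lose
    synchronous_commit = off       # a crash may lose the last few transactions
    checkpoint_segments = 64       # fewer, larger checkpoints during the load
    checkpoint_timeout = 30min
    maintenance_work_mem = 512MB   # speeds up index builds after the load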
On 08/25/2010 12:30 PM, Joshua D. Drake wrote:
> The most obvious would be to turn fsync off, turn synchronous_commit off, increase work_mem, increase checkpoint_timeout, and increase checkpoint_segments (more WAL segments).

Can these be changed on the fly via SET commands, or does the config file have to be changed and Postgres stopped and restarted? PostgreSQL 8.3.7 on FreeBSD.

--
Stephen Clark
NetWolves
Sr. Software Engineer III
Phone: 813-579-3200 Fax: 813-882-0209
Email: steve.clark@netwolves.com
www.netwolves.com
Yes - I'll try to optimize, as I had planned to write to pgsql-performance for rescue anyway. I don't know the exact hardware specification yet - the known facts at the moment are:

Sun Turgo?? (SPARC) with 32 cores, 17GB RAM (1GB for shared buffers), hdd - ?
OS: Solaris 10 - the system is running in a zone (Solaris virtualization), but during the test nothing else is utilizing the machine.
PostgreSQL 8.4.4 64-bit

The data set is 9 million rows with about 250 columns. The resulting database size is ~9GB and the load time is ~2h 20min. CPU utilization is 1.2% (half of one core) and iostat shows writes of ~6MB/s at 20% busy. When I run 2 loads in parallel, the CPU splits into 2 x 0.6% and hdd writes stay at ~7MB/s (almost the same).

postgresql.conf changes:
checkpoint_segments - 128
checkpoint_timeout - 30min
shared_buffers - 1GB
maintenance_work_mem - 128MB

Does it look like my HDD is the problem? Or maybe the Solaris virtualization?

What's also interesting - the table is empty when I start (by TRUNCATE), but while the COPY is working I see it grow (by \d+ or pg_total_relation_size) at about 1MB per second. I'd expect it to grow at checkpoints only, not all the time - am I wrong?

> The most obvious would be to turn fsync off, turn synchronous_commit off, increase work_mem, increase checkpoint_timeout, and increase checkpoint_segments (more WAL segments).

--
Regards,
Wojciech Strzałka
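For reference, the kind of query used above to watch the destination table grow while the COPY runs (the table name is illustrative):

    -- total on-disk size of the table, including TOAST and any indexes
    SELECT pg_size_pretty(pg_total_relation_size('big_import_table'));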
On 08/25/10 11:47 AM, Wojciech Strzałka wrote:
> The data set is 9 million rows - about 250 columns

Having 250 columns in a single table sets off the 'normalization' alarm in my head.
On 26/08/2010 1:06 AM, Steve Clark wrote:
> Can these be changed on the fly via SET commands, or does the config file have to be changed and Postgres stopped and restarted?

First: many options can be changed by editing the config file and then telling the postmaster to reload its configuration, rather than restarting it. See pg_ctl.

As for the specific options: checkpoint and WAL tuning is necessary and important in any real PostgreSQL instance under load, and it's quite safe to adjust the checkpoint timeouts and WAL segment counts to suit your needs. You'll need a restart to change the number of WAL segments; I'm not so sure about the checkpoint timeout.

You can't change fsync without a config file edit and a restart. You should **NEVER** be using fsync=off with data you cannot afford to lose, so in a way the restart requirement is a good thing. You might use fsync=off to help initially load a database with bulk data, but fsync should be turned back on and the database restarted before you start actually using it. fsync=off is **NOT SAFE**.

synchronous_commit also affects data safety. Turning it off permits the loss of transactions committed within the commit-delay interval if the server crashes, so you need to decide how much recent work you can afford to lose. I'm not sure if it can be applied with a reload or whether it requires a full server restart.

So: if you don't know exactly what you're doing, leave fsync alone.

--
Craig Ringer
Tech-related writing at http://soapyfrogs.blogspot.com/
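A sketch of the reload path mentioned above, assuming postgresql.conf has already been edited (the data directory path is an assumption for illustration):

    # signal the postmaster to re-read postgresql.conf without a restart
    pg_ctl reload -D /usr/local/pgsql/data

    # confirm from psql which values actually took effect
    psql -c "SHOW checkpoint_timeout"
    psql -c "SHOW synchronous_commit"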
On Wed, Aug 25, 2010 at 08:47:10PM +0200, Wojciech Strzałka wrote:
> The data set is 9 million rows - about 250 columns

250 columns sounds very strange to me as well! I start getting worried when I hit a tenth of that.

> CPU utilization - 1.2% (half of one core)
> iostat shows writes ~6MB/s, 20% busy
> when I run 2 loads in parallel the CPU splits into 2 x 0.6%, hdd write ~7MB/s (almost the same)

If you've got indexes set up on the table then I'd expect this sort of behavior; you could try dropping them before the COPY and then recreating them afterward.

It would be great if PG could do these sorts of bulk index updates automatically! Maybe run the first few tens/hundreds of changes in the main index, then start logging the rows that will need indexing, and bulk process and merge them at the end. Concurrent access seems a bit more complicated, but shouldn't be too bad. The case of a UNIQUE index seems to require a change in behavior. For example, the following are executed concurrently:

Client A: COPY foo (id) FROM stdin;
Client B: INSERT INTO foo (id) VALUES (1);

with A starting before and finishing after B, and A sending a row with id=1. At the moment A's data is indexed immediately, and hence B's conflicting change would fail. If PG did bulk index merging at the end, this would change: B would succeed and A would fail when the index was brought up to date. These semantics are still compatible with SQL, just different from before, so some code may be (incorrectly) relying on the current behavior.

I've read the discussions at:
http://archives.postgresql.org/pgsql-hackers/2008-02/msg00811.php
and
http://archives.postgresql.org/pgsql-general/2008-01/msg01048.php
but have not found much that is more recent. It seems to hold together better than the first suggestion. The second post notes that you may be better off working in work_mem-sized batches to help prevent spilling to disk. That sounds reasonable, and if it's OK to assume the new rows will be physically close to each other, then they can be recorded as ranges / run-length encoded to reduce the chance of spilling to disk even for very large inserts. As per the second post, I'm struggling with BEFORE INSERT triggers as well; their semantics seem to preclude most optimizations.

> what's also interesting - the table is empty when I start (by TRUNCATE), but while the COPY is working I see it grow (by \d+ or pg_total_relation_size) at about 1MB per second. I'd expect it to grow at checkpoints only, not all the time - am I wrong?

AFAIU, it'll grow constantly.

--
Sam  http://samason.me.uk/
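A sketch of the drop-and-recreate approach suggested above (the table, index, column, and file names are assumptions for illustration):

    -- drop secondary indexes so COPY only has to write heap pages
    DROP INDEX IF EXISTS big_import_table_customer_idx;

    COPY big_import_table FROM '/data/import.csv' CSV;

    -- one bulk index build is far cheaper than maintaining the index row by row
    CREATE INDEX big_import_table_customer_idx ON big_import_table (customer_id);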
On Wed, Aug 25, 2010 at 8:48 PM, Craig Ringer <craig@postnewspapers.com.au> wrote:
> synchronous_commit also affects data safety. Turning it off permits the loss of transactions committed within the commit-delay interval if the server crashes, so you need to decide how much recent work you can afford to lose.

I routinely set synchronous_commit = off on a per-connection or per-transaction basis. The beauty of it is that it still honors transaction boundaries. That is, if there is a server crash, the transaction will either be there or not as a whole; it will never be partially applied. This works great for bulk imports and changes to the DB for me, since I can always just re-run my programs after such a failure and everything will pick up where it left off. It takes some planning but is worth it.

> So: if you don't know exactly what you're doing, leave fsync alone.

I agree - leave fsync alone. You get the benefit of synchronous_commit without the corruption risk.

The other advice on boosting checkpoint segments and timeout is spot on. Make them pretty big and it will make your import go way faster. If you have a spare disk on which to put the WAL segments, so that you eliminate the seek time on them, move them there to get even more speed. After your import, you can make the number of segments smaller again if that suits your workload.
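A sketch of the per-transaction form (the table and file names are assumptions):

    -- asynchronous commit for this transaction only: a crash may lose the whole
    -- transaction, but can never leave it half-applied
    BEGIN;
    SET LOCAL synchronous_commit = off;
    COPY big_import_table FROM '/data/import.csv' CSV;
    COMMIT;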
Heyho!

On Wednesday 25 August 2010 09:15:33, wstrzalka wrote:
> As this can be an extremely long operation (hours in my case), a nice feature would be some option to show operation progress - how many rows were already imported.

Recently I've found /proc/<pid>/fdinfo/<fd> extremely helpful (on Linux; I don't know if other OSs export this information). It tells you the position of the file pointer of file number <fd> in process <pid> (for a COPY import this would be the PostgreSQL backend handling your import session). Unlike other options, you can also use this if you only notice that the process is long-running after you've already started it. Of course, it probably will not work if the file is mmapped or otherwise not read in a sequential fashion.

cheers
-- vb

--
All Hail Discordia!
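A rough illustration of that approach (the PID and fd numbers are made up for the example; COPY FROM a server-side file is assumed):

    # find the backend PID running the COPY (the column is procpid in 8.x)
    psql -c "SELECT procpid, current_query FROM pg_stat_activity"

    # list the backend's open file descriptors to spot the data file
    ls -l /proc/12345/fd

    # the pos: field is the current byte offset into that file; compare it
    # with the file's total size to estimate progress
    cat /proc/12345/fdinfo/14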
On Wed, 2010-08-25 at 00:15 -0700, wstrzalka wrote:
> As this can be an extremely long operation (hours in my case), a nice feature would be some option to show operation progress - how many rows were already imported.

A feature like this is being worked on:
https://commitfest.postgresql.org/action/patch_view?id=368
On 26 Aug, 01:28, pie...@hogranch.com (John R Pierce) wrote:
> On 08/25/10 11:47 AM, Wojciech Strzałka wrote:
> > The data set is 9 million rows - about 250 columns
>
> Having 250 columns in a single table sets off the 'normalization' alarm in my head.

Yeah - but it is what it is. I need to migrate to PG first, then start thinking about schema changes.
On 26 Aug, 08:06, wstrzalka <wstrza...@gmail.com> wrote:
> Yeah - but it is what it is. I need to migrate to PG first, then start thinking about schema changes.

So after turning off fsync & synchronous_commit (which I can afford, as I'm populating the database from scratch), I'm stuck at 43 minutes for the mentioned table. There are no PKs, constraints, indexes, ... - nothing except the data.

The behaviour changed - I'm now utilizing one core at 100%, iostat shows write peaks of about 70MB/s, and the table shown by \d+ is growing all the time, just as it grew before.

Is there anything else I can look at? Anyway, the load into PG is already much faster than the dump from the old database, and the current load time is acceptable for me.
Excerpts from wstrzalka's message of Thu Aug 26 03:18:36 -0400 2010:
> So after turning off fsync & synchronous_commit (which I can afford, as I'm populating the database from scratch), I'm stuck at 43 minutes for the mentioned table. There are no PKs, constraints, indexes, ... - nothing except the data.

Are you truncating the table in the same transaction that copies the data into it? If you do that, an optimization that skips WAL logging kicks in, giving you a nice performance boost. You need to have WAL archiving turned off, though.

Also, if you do that, perhaps there's no point in turning off fsync and synchronous_commit, because an fsync will be done only once, when the copy is complete.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
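A sketch of that pattern, assuming WAL archiving is off (the table and file names are assumptions):

    -- COPY into a table truncated in the same transaction can skip WAL entirely
    BEGIN;
    TRUNCATE big_import_table;
    COPY big_import_table FROM '/data/import.csv' CSV;
    COMMIT;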
No, I don't, but I'll definitely try that tomorrow.

> Are you truncating the table in the same transaction that copies the data into it? If you do that, an optimization that skips WAL logging kicks in, giving you a nice performance boost. You need to have WAL archiving turned off, though.
>
> Also, if you do that, perhaps there's no point in turning off fsync and synchronous_commit, because an fsync will be done only once, when the copy is complete.

--
Regards,
Wojciech Strzałka