Thread: Feature proposal

Feature proposal

From
wstrzalka
Date:
I'm currently playing with a very large data import using COPY from
file.

As this can be an extremely long operation (hours in my case), a nice
feature would be some option to show operation progress - how many
rows have already been imported.

Or maybe there is already some way to do it? As Postgres has no
read-uncommitted isolation, I think I can only estimate progress from the
destination table size?
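
Something along these lines is what I have in mind (just a sketch - the
table name and the guessed final size are placeholders):

  -- fraction done, assuming you know roughly how big the table will end up
  -- (here I guess 9 GB)
  SELECT pg_total_relation_size('target_table') / (9.0 * 1024 * 1024 * 1024)
         AS estimated_fraction_done;

It only tracks bytes written, not rows, but it's better than nothing.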

Re: Feature proposal

From
"Wappler, Robert"
Date:
On 2010-08-25, wstrzalka wrote:

> I'm currently playing with very large data import using COPY from
> file.
>
> As this can be extremely long operation (hours in my case) the nice
> feature would be some option to show operation progress - how many
> rows were already imported.
>
> Or maybe there is some way to do it? As long as postgres have no read-
> uncommited I think I can estimate it only by destination table size ??

Hi,

You can use tools like pv for a progress bar and pipe the output into
psql.
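
Something along these lines (just a sketch - the file, database and table
names are whatever yours are, and the COPY options have to match your file):

  pv /path/to/data.csv | psql -d mydb -c "COPY mytable FROM STDIN WITH CSV"

pv then shows how much of the file has been piped so far, the throughput,
and an ETA based on the file size.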

HTH
--
Robert...



Re: Feature proposal

From
Denis BUCHER
Date:
On 25.08.2010 09:15, wstrzalka wrote:
> I'm currently playing with very large data import using COPY from
> file.
>
> As this can be extremely long operation (hours in my case) the nice
> feature would be some option to show operation progress - how many
> rows were already imported.
>
> Or maybe there is some way to do it? As long as postgres have no read-
> uncommited I think I can estimate it only by destination table size ??

By the way, did you try to optimize your PostgreSQL server?

In my case I was able to reduce a big data update from
1 hour 15 minutes
to
5 minutes

without even changing a single line of data or SQL code!

Incredible, isn't it?

Denis

Re: Feature proposal

From
Eric Comeau
Date:
On Wed, 2010-08-25 at 17:06 +0200, Denis BUCHER wrote:
> On 25.08.2010 09:15, wstrzalka wrote:
> > I'm currently playing with very large data import using COPY from
> > file.
> >
> > As this can be extremely long operation (hours in my case) the nice
> > feature would be some option to show operation progress - how many
> > rows were already imported.
> >
> > Or maybe there is some way to do it? As long as postgres have no read-
> > uncommited I think I can estimate it only by destination table size ??
>
> By the way, did you try to optimize your postgresql server ?
>
> In my case I was able to reduce a big data update from :
> 1 hour 15 minutes
> to :
> 5 minutes
>
> Without even changing any line of data or code in sql !
>
> Incredible, isn't it ?
>

Curious- what postgresql.conf settings did you change to improve it?



Re: Feature proposal

From
"Joshua D. Drake"
Date:
On Wed, 2010-08-25 at 12:20 -0400, Eric Comeau wrote:
>
> > Without even changing any line of data or code in sql !
> >
> > Incredible, isn't it ?
> >
>
> Curious- what postgresql.conf settings did you change to improve it?

The most obvious would be to turn fsync off, turn synchronous_commit off,
increase work_mem, increase checkpoint_timeout, and increase checkpoint_segments.
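
In postgresql.conf that would look something like this (values are only
illustrative, and fsync = off is only sane for a throwaway bulk load):

  fsync = off                   # bulk-load only - NOT safe for data you can't lose
  synchronous_commit = off
  work_mem = 256MB
  checkpoint_timeout = 30min
  checkpoint_segments = 64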

JD

>
>
>

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt

Re: Feature proposal

From
Steve Clark
Date:
On 08/25/2010 12:30 PM, Joshua D. Drake wrote:
> On Wed, 2010-08-25 at 12:20 -0400, Eric Comeau wrote:
>>
>>> Without even changing any line of data or code in sql !
>>>
>>> Incredible, isn't it ?
>>>
>>
>> Curious- what postgresql.conf settings did you change to improve it?
>
> The most obvious would be to turn fsync off, sychronous_commit off,
> increase work_mem, increase checkpoint_timeout, increase wal_segments.
>
> JD
>
>>
>>
>>
>
Can these be changed on the fly via SET commands, or does the config file have to be
changed and Postgres stopped and restarted?

PostgreSQL 8.3.7 on FreeBSD.

--
Stephen Clark
NetWolves
Sr. Software Engineer III
Phone: 813-579-3200
Fax: 813-882-0209
Email: steve.clark@netwolves.com
www.netwolves.com

Re: Feature proposal

From
Wojciech Strzałka
Date:
 Yeah - I'll try to optimize, as I was planning to write to
 pgsql-performance for rescue anyway.

 I don't know the exact hardware specification yet - the known facts at the
 moment are:
 Sun Turgo?? (SPARC) with 32 cores
 17GB RAM (1GB for shared buffers)
 hdd - ?
 OS - Solaris 10 - the system is running in a zone (Solaris
 virtualization) - however during the test nothing else is utilizing the
 machine.
 PostgreSQL 8.4.4 64bit

 The data set is 9 million rows - about 250 columns
 The resulting database size is ~9GB
 Load time ~2h 20min
 CPU utilization - 1.2% (half of one core)
 iostat shows writes of ~6MB/s, 20% busy
 When I run 2 loads in parallel the CPU is split to 2*0.6%, and hdd writes
 stay at ~7MB/s (almost the same)

 postgresql.conf changes:
 checkpoint_segments - 128
 checkpoint_timeout - 30min
 shared_buffers - 1GB
 maintenance_work_mem - 128MB


 Does it look like my HDD is the problem? Or maybe the Solaris
 virtualization?

 What's also interesting - the table is empty when I start (truncated),
 but while the COPY is working I can see it growing (via \d+ or
 pg_total_relation_size) by about 1MB per second.
 I'd expect it to grow at checkpoints only, not all the
 time - am I wrong?




> On Wed, 2010-08-25 at 12:20 -0400, Eric Comeau wrote:
>>
>> > Without even changing any line of data or code in sql !
>> >
>> > Incredible, isn't it ?
>> >
>>
>> Curious- what postgresql.conf settings did you change to improve it?

> The most obvious would be to turn fsync off, sychronous_commit off,
> increase work_mem, increase checkpoint_timeout, increase wal_segments.

> JD

>>
>>
>>




--
Regards,
 Wojciech Strzałka


Re: Feature proposal

From
John R Pierce
Date:
  On 08/25/10 11:47 AM, Wojciech Strzałka wrote:
>   The data set is 9mln rows - about 250 columns

Having 250 columns in a single table sets off the 'normalization' alarm
in my head.







Re: Feature proposal

From
Craig Ringer
Date:
On 26/08/2010 1:06 AM, Steve Clark wrote:
> On 08/25/2010 12:30 PM, Joshua D. Drake wrote:
>> On Wed, 2010-08-25 at 12:20 -0400, Eric Comeau wrote:
>>>
>>>> Without even changing any line of data or code in sql !
>>>>
>>>> Incredible, isn't it ?
>>>>
>>>
>>> Curious- what postgresql.conf settings did you change to improve it?
>>
>> The most obvious would be to turn fsync off, sychronous_commit off,
>> increase work_mem, increase checkpoint_timeout, increase wal_segments.
>>
>> JD
>>
>>>
>>>
>>>
>>
> can these be changed on the fly via set commands or does the config file
> have to be changed and postgres stopped and restarted.

First: many options can be changed by editing the config file and then
telling the postmaster to reload its configuration, rather than
restarting the postmaster. See pg_ctl.
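
For example (the data directory path is whatever yours is):

  pg_ctl reload -D /path/to/data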

As for the specific options:

Checkpoint and WAL tuning is necessary and important for any real
PostgreSQL instance under load, and it's quite safe to adjust the
checkpoint timeout and WAL segment count to suit your needs. Both
checkpoint_timeout and checkpoint_segments can be changed with a config
file edit and a reload.

You can't change fsync from a client session; it takes a config file edit
and a reload. You should **NEVER** be using fsync=off with data you cannot
afford to lose, so in a way that's a good thing. You might use it to help
initially load a database with bulk data, but fsync should be turned back
on (and the config reloaded) before you start actually using the database.
fsync=off is **NOT SAFE**.

synchronous_commit also has effects on data safety. It permits the loss
of transactions committed within the commit delay interval if the server
crashes. If you turn it off, you need to decide how much recent work you
can afford to lose if the database crashes. Not sure if it can be
applied with a reload or whether it requires a full server restart.

So: if you don't know exactly what you're doing, leave fsync alone.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

Re: Feature proposal

From
Sam Mason
Date:
On Wed, Aug 25, 2010 at 08:47:10PM +0200, Wojciech Strzałka wrote:
>  The data set is 9mln rows - about 250 columns

250 columns sounds very strange to me as well! I start getting
worried when I hit a tenth of that.

>  CPU utilization - 1,2% (half of the one core)
>  iostat shows writes ~6MB/s,  20% busy
>  when I run 2 loads in parallel the CPU is split to 2*0,6%, hdd write
>  ~7MB (almost the same)

If you've got indexes set up on the table then I'd expect this sort
of behavior; you could try dropping them before the COPY and then
recreating them afterwards.
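
Roughly (just a sketch - the index, table and column names here are made up):

  DROP INDEX foo_col_idx;
  COPY foo FROM '/path/to/data.csv' WITH CSV;
  CREATE INDEX foo_col_idx ON foo (col);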

It would be great if PG could do these sorts of bulk index updates
automatically!  Maybe run the first few tens or hundreds of changes through
the main index as usual, then start logging the rows that will need
indexing, and bulk process and merge them at the end.  Concurrent access
seems a bit more complicated, but shouldn't be too bad.  The case of a
UNIQUE index seems to require a change in behavior.  For example, the
following are executed concurrently:

  Client A: COPY foo (id) FROM stdin;
  Client B: INSERT INTO foo (id) VALUES (1);

with A starting before and finishing after B, and A sends a row with
id=1.

At the moment the behavior would be for A's data to be indexed
immediately, and hence B's conflicting change would fail.  If PG did
bulk index merging at the end, this would change to B succeeding and
A failing when the index was brought up to date.  These semantics are
still compatible with SQL, just different from before, so some code may
be (incorrectly) relying on the old behavior.

I've read discussions from:
  http://archives.postgresql.org/pgsql-hackers/2008-02/msg00811.php
and
  http://archives.postgresql.org/pgsql-general/2008-01/msg01048.php

but haven't found much more recent.  The second seems to hold together
better than the first.  The second post also notes that you may be better
off working in work_mem-sized batches to help prevent spilling to disk.
Sounds reasonable, and if it's OK to assume the new rows will be physically
close to each other then they can be recorded as ranges/run-length
encoded to reduce the chance of spilling to disk even for very large
inserts.  As per the second post, I'm struggling with BEFORE INSERT
triggers as well; their semantics seem to preclude most optimizations.

>  what's also interesting - table is empty when I start (by truncate)
>  but while the COPY is working, I see it grows (by \d+ or
>  pg_total_relation_size) about 1MB per second
>  what I'd expect it should grow at checkpoints only, not all the
>  time - am I wrong?

AFAIU, it'll grow constantly - the backend extends the table file as it
fills each new page, while checkpoints only control when dirty pages get
flushed, not when the file grows.

--
  Sam  http://samason.me.uk/

Re: Feature proposal

From
Vick Khera
Date:
On Wed, Aug 25, 2010 at 8:48 PM, Craig Ringer
<craig@postnewspapers.com.au> wrote:
> synchronous_commit also has effects on data safety. It permits the loss of
> transactions committed within the commit delay interval if the server
> crashes. If you turn it off, you need to decide how much recent work you can
> afford to lose if the database crashes. Not sure if it can be applied with a
> reload or whether it requires a full server restart.
>

I routinely set synchronous_commit = off on a per-connection or
per-transaction basis.  The beauty of it is that it still honors
transaction boundaries: if there is a server crash, the transaction
will either be there as a whole or not at all; it will not be
partially applied.  This works great for bulk imports and changes to
the DB for me, since I can always just re-run my programs after such a
failure and everything picks up where it left off.  It takes some
planning but is worth it.
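
For example (just a sketch):

  -- per connection:
  SET synchronous_commit = off;

  -- or for a single transaction only:
  BEGIN;
  SET LOCAL synchronous_commit = off;
  -- ... the bulk COPY / INSERT work goes here ...
  COMMIT;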

> So: if you don't know exactly what you're doing, leave fsync alone.

I agree -- leave fsync alone. You get the benefit from synchronous_commit
without the corruption risk.

The other advice on boosting checkpoint segments and the checkpoint timeout
is spot on.  Make them pretty big and it will make your import go way
faster.  If you have a spare disk you can move the WAL segments (pg_xlog)
to, so that you eliminate the seek contention on them, move them to get
even more speed.  After your import, you can make the number of segments
smaller again if that suits your workload.
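
The usual way to move them is a symlink (a sketch - the paths are made up,
and the cluster has to be shut down while you move the directory):

  pg_ctl stop -D /path/to/data
  mv /path/to/data/pg_xlog /fastdisk/pg_xlog
  ln -s /fastdisk/pg_xlog /path/to/data/pg_xlog
  pg_ctl start -D /path/to/data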

Re: Feature proposal

From
Adrian von Bidder
Date:
Heyho!

On Wednesday 25 August 2010 09:15:33 wstrzalka wrote:
> I'm currently playing with very large data import using COPY from
> file.
>
> As this can be extremely long operation (hours in my case) the nice
> feature would be some option to show operation progress - how many
> rows were already imported.

Recently, I've found (on Linux, don't know if other OSs export this
information) /proc/<pid>/fdinfo/<fd> extremely helpful.  It tells you the
position of the file pointer of file number <fd> in process <pid> (I guess
for a COPY import this would be the postgresql backend handling your import
session.)
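
For example (a sketch - the backend pid and fd number below are made up;
you can get the pid from pg_stat_activity or ps):

  ls -l /proc/12345/fd             # find the fd pointing at your data file
  cat /proc/12345/fdinfo/14        # the "pos:" line is the byte offset read so far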

Unlike other options, you can also use this if you only notice that the
process is long-running after you've already started it.

Of course it probably will not work if the file is mmapped or otherwise not
read in a sequential fashion.

cheers
-- vb

--
All Hail Discordia!


Re: Feature proposal

From
Peter Eisentraut
Date:
On Wed, 2010-08-25 at 00:15 -0700, wstrzalka wrote:
> I'm currently playing with very large data import using COPY from
> file.
>
> As this can be extremely long operation (hours in my case) the nice
> feature would be some option to show operation progress - how many
> rows were already imported.

A feature like this is being worked on:
https://commitfest.postgresql.org/action/patch_view?id=368


Re: Feature proposal

From
wstrzalka
Date:
On 26 Aug, 01:28, pie...@hogranch.com (John R Pierce) wrote:
>   On 08/25/10 11:47 AM, Wojciech Strzałka wrote:
>
> >   The data set is 9mln rows - about 250 columns
>
> Having 250 columns in a single table sets off the 'normalization' alarm
> in my head.
>

Yeah - but it is what it is.
I need to migrate to PG first - then start thinking about schema changes.

Re: Feature proposal

From
wstrzalka
Date:
On 26 Aug, 08:06, wstrzalka <wstrza...@gmail.com> wrote:
> On 26 Aug, 01:28, pie...@hogranch.com (John R Pierce) wrote:
>
> >   On 08/25/10 11:47 AM, Wojciech Strzałka wrote:
>
> > >   The data set is 9mln rows - about 250 columns
>
> > Having 250 columns in a single table sets off the 'normalization' alarm
> > in my head.
>
> Yeap - but it is as it is.
> I need to migrate PG first - then start thinking about schema changes


So after turning off fsync & synchronous_commit (which I can afford, as
I'm populating the database from scratch),
I'm stuck at 43 minutes for the mentioned table. There are no PKs,
constraints, indexes, ... - nothing except the data.

The behaviour has changed - I'm now utilizing one core at 100%, iostat
shows write peaks of about 70MB/s, and the table shown by \d+ is growing
all the time, as it did before.
Is there anything else I can look at?
Anyway, the load into PG is already much faster than the dump from the old
database, and the current load time is acceptable for me.

Re: Feature proposal

From
Alvaro Herrera
Date:
Excerpts from wstrzalka's message of Thu Aug 26 03:18:36 -0400 2010:

> So after turning off fsync & synchronous_commit (which I can afford as
> I'm populating database from scratch)
> I've stucked at 43 minutes for the mentioned table. There is no PK,
> constrains, indexes, ... - nothing except for data.

Are you truncating the table in the same transaction that copies the
data into it?  If you do that, an optimization to skip WAL fires, getting
you a nice performance boost.  You need to have WAL archiving turned off,
though.

Also, if you do that, there's perhaps no point in turning off fsync and
synchronous_commit, because an fsync will be done only once, when the copy
is complete.
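
Something like this, all in one transaction (a sketch - the table and file
names are made up):

  BEGIN;
  TRUNCATE my_big_table;
  COPY my_big_table FROM '/path/to/dump.csv' WITH CSV;
  COMMIT;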

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Feature proposal

From
Wojciech Strzałka
Date:
No, I don't, but I'll definitely try it tomorrow.

> Excerpts from wstrzalka's message of Thu Aug 26 03:18:36 -0400 2010:

>> So after turning off fsync & synchronous_commit (which I can afford as
>> I'm populating database from scratch)
>> I've stucked at 43 minutes for the mentioned table. There is no PK,
>> constrains, indexes, ... - nothing except for data.

> Are you truncating the table in the same transaction that copies the
> data into it?  If you do that, an optimization to skip WAL fires getting
> you a nice performance boost.  You need to have WAL archiving turned off
> though.

> Also, if you do that, perhaps there's no point in turning off fsync and
> synch_commit because an fsync will be done only once when the copy is
> complete.




--
Regards,
 Wojciech Strzałka