Thread: Pg_upgrade speed for many tables

Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
Magnus reported that a customer with a million tables was finding
pg_upgrade slow.  I had never considered many tables to be a problem, but
decided to test it.  I created a database with 2k tables like this:

    CREATE TABLE test1990 (x SERIAL);
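
For reference, one quick way to generate such a schema is a DO block
(a sketch; the loop bound and name pattern are illustrative):

    DO $$
    BEGIN
        -- create 2000 one-column tables named test1 .. test2000
        FOR i IN 1..2000 LOOP
            EXECUTE format('CREATE TABLE test%s (x SERIAL)', i);
        END LOOP;
    END $$;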

Running the git version of pg_upgrade on that took 203 seconds.  Using
synchronous_commit=off dropped the time to 78 seconds.  This was tested
on magnetic disks with a write-through cache.  (No change on an SSD with
a super-capacitor.)

I don't see anything unsafe about having pg_upgrade use
synchronous_commit=off.  I could set it just for the pg_dump reload, but
it seems safe to just use it always.  We don't write to the old cluster,
and if pg_upgrade fails, you have to re-initdb the new cluster anyway.
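
(In the patch this amounts to adding the setting to the options pg_ctl
passes when starting the new cluster's server, along the lines of:

    pg_ctl -w -l upgrade_log -D /path/to/new/data \
        -o "-c synchronous_commit=off" start

where the paths and log file name are placeholders.)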

Patch attached.  I think it should be applied to 9.2 as well.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Attachment

Re: Pg_upgrade speed for many tables

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Magnus reported that a customer with a million tables was finding
> pg_upgrade slow.

You sure there's not an O(N^2) issue in there somewhere?

> I don't see anything unsafe about having pg_upgrade use
> synchronous_commit=off.

No objection, but this seems unlikely to be better than linear speedup,
with a not-terribly-large constant factor.

BTW, does pg_upgrade run pg_restore in --single-transaction mode?
That would probably make synchronous_commit moot, at least for that
step.
        regards, tom lane



Re: Pg_upgrade speed for many tables

From
Magnus Hagander
Date:

On Mon, Nov 5, 2012 at 9:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Bruce Momjian <bruce@momjian.us> writes:
> Magnus reported that a customer with a million tables was finding
> pg_upgrade slow.

You sure there's not an O(N^2) issue in there somewhere? 

> I don't see anything unsafe about having pg_upgrade use
> synchronous_commit=off.

No objection, but this seems unlikely to be better than linear speedup,
with a not-terribly-large constant factor.

BTW, does pg_upgrade run pg_restore in --single-transaction mode?
That would probably make synchronous_commit moot, at least for that
step.

 
It doesn't use pg_restore at all - it uses the dump from pg_dumpall, which you can't reload with pg_restore. 


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: Pg_upgrade speed for many tables

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Mon, Nov 5, 2012 at 9:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, does pg_upgrade run pg_restore in --single-transaction mode?
>> That would probably make synchronous_commit moot, at least for that
>> step.

> It doesn't use pg_restore at all - it uses the dump from pg_dumpall, which
> you can't reload with pg_restore.

Sorry, I should've said psql --single-transaction.  Although that isn't
going to work either given the presence of \connect commands in the
script.  I wonder whether pg_dumpall ought to have some sort of "one
transaction per database please" option.
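
(To illustrate the problem: the dumpall script is laid out roughly like
this, with hypothetical names,

    CREATE ROLE alice;
    CREATE DATABASE db1;
    \connect db1
    CREATE TABLE t1 (x int);
    CREATE DATABASE db2;
    \connect db2
    CREATE TABLE t2 (x int);

and each \connect starts a fresh session, so a transaction opened at the
top cannot survive past it.)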
        regards, tom lane



Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
On Mon, Nov  5, 2012 at 03:30:32PM -0500, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > On Mon, Nov 5, 2012 at 9:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> BTW, does pg_upgrade run pg_restore in --single-transaction mode?
> >> That would probably make synchronous_commit moot, at least for that
> >> step.
> 
> > It doesn't use pg_restore at all - it uses the dump from pg_dumpall, which
> > you can't reload with pg_restore.
> 
> Sorry, I should've said psql --single-transaction.  Although that isn't
> going to work either given the presence of \connect commands in the
> script.  I wonder whether pg_dumpall ought to have some sort of "one
> transaction per database please" option.

pg_dumpall is already doing lots of gymnastics with SQL, and pg_upgrade
splits the output file into db/user creation and object creation, so I
am hesitant to add anything more in there.

I was surprised by the scale of the performance improvement, but a
simple table creation test confirmed that improvement, regardless of
pg_upgrade.  Perhaps we should suggest synchronous_commit=off for
pg_dumpall restores, particularly when using --schema-only.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Pg_upgrade speed for many tables

From
Magnus Hagander
Date:
On Mon, Nov 5, 2012 at 9:49 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Nov  5, 2012 at 03:30:32PM -0500, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
> > On Mon, Nov 5, 2012 at 9:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> BTW, does pg_upgrade run pg_restore in --single-transaction mode?
> >> That would probably make synchronous_commit moot, at least for that
> >> step.
>
> > It doesn't use pg_restore at all - it uses the dump from pg_dumpall, which
> > you can't reload with pg_restore.
>
> Sorry, I should've said psql --single-transaction.  Although that isn't
> going to work either given the presence of \connect commands in the
> script.  I wonder whether pg_dumpall ought to have some sort of "one
> transaction per database please" option.

pg_dumpall is already doing lots of gymnastics with SQL, and pg_upgrade
splits the output file into db/user creation and object creation, so I
am hesitant to add anything more in there.

What about running pg_dump in a loop instead of pg_dumpall?
 


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
On Mon, Nov  5, 2012 at 10:01:22PM +0100, Magnus Hagander wrote:
> On Mon, Nov 5, 2012 at 9:49 PM, Bruce Momjian <bruce@momjian.us> wrote:
> 
>     On Mon, Nov  5, 2012 at 03:30:32PM -0500, Tom Lane wrote:
>     > Magnus Hagander <magnus@hagander.net> writes:
>     > > On Mon, Nov 5, 2012 at 9:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>     > >> BTW, does pg_upgrade run pg_restore in --single-transaction mode?
>     > >> That would probably make synchronous_commit moot, at least for that
>     > >> step.
>     >
>     > > It doesn't use pg_restore at all - it uses the dump from pg_dumpall,
>     which
>     > > you can't reload with pg_restore.
>     >
>     > Sorry, I should've said psql --single-transaction.  Although that isn't
>     > going to work either given the presence of \connect commands in the
>     > script.  I wonder whether pg_dumpall ought to have some sort of "one
>     > transaction per database please" option.
> 
>     pg_dumpall is already doing lots of gymnastics with SQL, and pg_upgrade
>     splits the output file into db/user creation and object creation, so I
>     am hesitant to add anything more in there.
> 
> 
> What about running pg_dump in a loop instead of pg_dumpall?

Well, I could certainly do pg_dumpall --globals-only, and then I have to
create a pg_dump file for every database, and then add the \connect in
there;  it just seemed easier to use pg_dumpall, though the file split
thing is certainly something I would like to get rid of.
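
Roughly, such a loop would be (a sketch; the file names are mine):

    pg_dumpall --globals-only > globals.sql
    for db in $(psql -At -c "SELECT datname FROM pg_database WHERE datallowconn" postgres)
    do
        pg_dump --schema-only "$db" > "dump_$db.sql"
    done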

I think I used pg_dumpall because it was an existing tool that I assumed
would be maintained to dump a full cluster.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Pg_upgrade speed for many tables

From
Jeff Janes
Date:
On Mon, Nov 5, 2012 at 12:49 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Nov  5, 2012 at 03:30:32PM -0500, Tom Lane wrote:
>> Magnus Hagander <magnus@hagander.net> writes:
>> > On Mon, Nov 5, 2012 at 9:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> BTW, does pg_upgrade run pg_restore in --single-transaction mode?
>> >> That would probably make synchronous_commit moot, at least for that
>> >> step.
>>
>> > It doesn't use pg_restore at all - it uses the dump from pg_dumpall, which
>> > you can't reload with pg_restore.
>>
>> Sorry, I should've said psql --single-transaction.  Although that isn't
>> going to work either given the presence of \connect commands in the
>> script.  I wonder whether pg_dumpall ought to have some sort of "one
>> transaction per database please" option.
>
> pg_dumpall is already doing lots of gymnastics with SQL, and pg_upgrade
> splits the output file into db/user creation and object creation, so I
> am hesitant to add anything more in there.
>
> I was surprised by the scale of the performance improvement, but a
> > simple table creation test confirmed that improvement, regardless of
> pg_upgrade.  Perhaps we should suggest synchronous_commit=off for
> pg_dumpall restores, particularly when using --schema-only.

Or have options for pg_dump and pg_restore to insert "set
synchronous_commit=off" into the SQL stream?

Cheers,

Jeff



Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
On Mon, Nov  5, 2012 at 01:07:45PM -0800, Jeff Janes wrote:
> On Mon, Nov 5, 2012 at 12:49 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Mon, Nov  5, 2012 at 03:30:32PM -0500, Tom Lane wrote:
> >> Magnus Hagander <magnus@hagander.net> writes:
> >> > On Mon, Nov 5, 2012 at 9:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> >> BTW, does pg_upgrade run pg_restore in --single-transaction mode?
> >> >> That would probably make synchronous_commit moot, at least for that
> >> >> step.
> >>
> >> > It doesn't use pg_restore at all - it uses the dump from pg_dumpall, which
> >> > you can't reload with pg_restore.
> >>
> >> Sorry, I should've said psql --single-transaction.  Although that isn't
> >> going to work either given the presence of \connect commands in the
> >> script.  I wonder whether pg_dumpall ought to have some sort of "one
> >> transaction per database please" option.
> >
> > pg_dumpall is already doing lots of gymnastics with SQL, and pg_upgrade
> > splits the output file into db/user creation and object creation, so I
> > am hesitant to add anything more in there.
> >
> > I was surprised by the scale of the performance improvement, but a
> > > simple table creation test confirmed that improvement, regardless of
> > pg_upgrade.  Perhaps we should suggest synchronous_commit=off for
> > pg_dumpall restores, particularly when using --schema-only.
> 
> Or have options for pg_dump and pg_restore to insert "set
> synchronous_commit=off" into the SQL stream?

You can already do that with PGOPTIONS:

    PGOPTIONS="-c synchronous_commit=off" pg_restore ...
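
The same thing works when restoring a pg_dumpall script through psql,
e.g.:

    PGOPTIONS="-c synchronous_commit=off" psql -f dump.sql postgres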

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Pg_upgrade speed for many tables

From
Robert Haas
Date:
On Mon, Nov 5, 2012 at 4:07 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> Or have options for pg_dump and pg_restore to insert "set
> synchronous_commit=off" into the SQL stream?

It would be kind of neat if we had a command that would force all
previously-asynchronous commits to complete.  It seems likely that
very, very few people would care about intermediate pg_dump states, so
we could do the whole dump asynchronously and then do "FORCE ALL
COMMITS;" or whatever at the end.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
On Mon, Nov  5, 2012 at 04:14:47PM -0500, Robert Haas wrote:
> On Mon, Nov 5, 2012 at 4:07 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > Or have options for pg_dump and pg_restore to insert "set
> > synchronous_commit=off" into the SQL stream?
> 
> It would be kind of neat if we had a command that would force all
> previously-asynchronous commits to complete.  It seems likely that
> very, very few people would care about intermediate pg_dump states, so
> we could do the whole dump asynchronously and then do "FORCE ALL
> COMMITS;" or whatever at the end.

Actually, I had assumed that a session disconnection forced a WAL fsync
flush, but now I doubt that.  Seems only server shutdown does that, or a
checkpoint.  Would this work?

    SET synchronous_commit=on;
    CREATE TABLE dummy(x int);
    DROP TABLE dummy;

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Pg_upgrade speed for many tables

From
Jeff Janes
Date:
On Mon, Nov 5, 2012 at 1:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 5, 2012 at 4:07 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Or have options for pg_dump and pg_restore to insert "set
>> synchronous_commit=off" into the SQL stream?
>
> It would be kind of neat if we had a command that would force all
> previously-asynchronous commits to complete.  It seems likely that
> very, very few people would care about intermediate pg_dump states, so
> we could do the whole dump asynchronously and then do "FORCE ALL
> COMMITS;" or whatever at the end.

Yeah, I was wondering what a fool-proof way of doing that would be,
without implementing a new feature.  Turning synchronous_commit back
on and then running and committing a transaction guaranteed to generate
WAL would do it.

Would a simple 'select pg_switch_xlog();' always accomplish the desired flush?

Cheers,

Jeff



Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
On Mon, Nov  5, 2012 at 01:23:58PM -0800, Jeff Janes wrote:
> On Mon, Nov 5, 2012 at 1:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Mon, Nov 5, 2012 at 4:07 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> >> Or have options for pg_dump and pg_restore to insert "set
> >> synchronous_commit=off" into the SQL stream?
> >
> > It would be kind of neat if we had a command that would force all
> > previously-asynchronous commits to complete.  It seems likely that
> > very, very few people would care about intermediate pg_dump states, so
> > we could do the whole dump asynchronously and then do "FORCE ALL
> > COMMITS;" or whatever at the end.
> 
> Yeah, I was wondering what a fool-proof way of doing that would be,
> without implementing a new feature.  Turning synchronous_commits back
> on and then doing and committing a transaction guaranteed to generate
> WAL would do it.
> 
> Would a simple 'select pg_switch_xlog();' always accomplish the desired flush?

That could generate a lot of WAL files if used regularly.  :-(  Does
SELECT txid_current() generate WAL?  I think it does.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Pg_upgrade speed for many tables

From
Alvaro Herrera
Date:
Bruce Momjian wrote:
> On Mon, Nov  5, 2012 at 04:14:47PM -0500, Robert Haas wrote:
> > On Mon, Nov 5, 2012 at 4:07 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > > Or have options for pg_dump and pg_restore to insert "set
> > > synchronous_commit=off" into the SQL stream?
> >
> > It would be kind of neat if we had a command that would force all
> > previously-asynchronous commits to complete.  It seems likely that
> > very, very few people would care about intermediate pg_dump states, so
> > we could do the whole dump asynchronously and then do "FORCE ALL
> > COMMITS;" or whatever at the end.
>
> Actually, I had assumed that a session disconnection forced a WAL fsync
> flush, but now I doubt that.  Seems only server shutdown does that, or a
> checkpoint.  Would this work?
>
>     SET synchronous_commit=on;
>     CREATE TABLE dummy(x int);
>     DROP TABLE dummy;

AFAIR any transaction that modifies catalogs gets sync commit forcibly,
regardless of the setting.  And sync commit means you get to wait for
all previous transactions to be flushed as well.  So simply creating a
temp table ought to do the trick ...

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
On Mon, Nov  5, 2012 at 06:33:16PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > On Mon, Nov  5, 2012 at 04:14:47PM -0500, Robert Haas wrote:
> > > On Mon, Nov 5, 2012 at 4:07 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> > > > Or have options for pg_dump and pg_restore to insert "set
> > > > synchronous_commit=off" into the SQL stream?
> > > 
> > > It would be kind of neat if we had a command that would force all
> > > previously-asynchronous commits to complete.  It seems likely that
> > > very, very few people would care about intermediate pg_dump states, so
> > > we could do the whole dump asynchronously and then do "FORCE ALL
> > > COMMITS;" or whatever at the end.
> > 
> > Actually, I had assumed that a session disconnection forced a WAL fsync
> > flush, but now I doubt that.  Seems only server shutdown does that, or a
> > checkpoint.  Would this work?
> > 
> >     SET synchronous_commit=on;
> >     CREATE TABLE dummy(x int);
> >     DROP TABLE dummy;
> 
> AFAIR any transaction that modifies catalogs gets sync commit forcibly,
> regardless of the setting.  And sync commit means you get to wait for

Uh, I am not seeing that in my testing, because I was only doing CREATE
TABLE and it was affected by the synchronous_commit value.

> all previous transactions to be flushed as well.  So simply creating a
> temp table ought to do the trick ...

I don't think TEMP tables write to WAL, for performance reasons.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Pg_upgrade speed for many tables

From
Robert Haas
Date:
On Mon, Nov 5, 2012 at 4:33 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> AFAIR any transaction that modifies catalogs gets sync commit forcibly,
> regardless of the setting.  And sync commit means you get to wait for
> all previous transactions to be flushed as well.  So simply creating a
> temp table ought to do the trick ...

I don't think there's a carve-out for system tables ... but creating a
temp table with synchronous_commit=on will certainly do the trick.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
On Mon, Nov  5, 2012 at 04:39:27PM -0500, Robert Haas wrote:
> On Mon, Nov 5, 2012 at 4:33 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > AFAIR any transaction that modifies catalogs gets sync commit forcibly,
> > regardless of the setting.  And sync commit means you get to wait for
> > all previous transactions to be flushed as well.  So simply creating a
> > temp table ought to do the trick ...
> 
> I don't think there's a carve-out for system tables ... but creating a
> temp table with synchronous_commit=on will certainly do the trick.

What is a temp table writing to WAL?  The pg_class/pg_attribute changes?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Pg_upgrade speed for many tables

From
Robert Haas
Date:
On Mon, Nov 5, 2012 at 4:42 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Nov  5, 2012 at 04:39:27PM -0500, Robert Haas wrote:
>> On Mon, Nov 5, 2012 at 4:33 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> > AFAIR any transaction that modifies catalogs gets sync commit forcibly,
>> > regardless of the setting.  And sync commit means you get to wait for
>> > all previous transactions to be flushed as well.  So simply creating a
>> > temp table ought to do the trick ...
>>
>> I don't think there's a carve-out for system tables ... but creating a
>> temp table with synchronous_commit=on will certainly do the trick.
>
> What is a temp table writing to WAL?  The pg_class/pg_attribute changes?

Yes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Pg_upgrade speed for many tables

From
Andres Freund
Date:
On Mon, Nov 05, 2012 at 04:42:56PM -0500, Bruce Momjian wrote:
> On Mon, Nov  5, 2012 at 04:39:27PM -0500, Robert Haas wrote:
> > On Mon, Nov 5, 2012 at 4:33 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > > AFAIR any transaction that modifies catalogs gets sync commit forcibly,
> > > regardless of the setting.  And sync commit means you get to wait for
> > > all previous transactions to be flushed as well.  So simply creating a
> > > temp table ought to do the trick ...

SET synchronous_commit = on;
SELECT txid_current();

Should be enough.

> > I don't think there's a carve-out for system tables ... but creating a
> > temp table with synchronous_commit=on will certainly do the trick.
>
> What is a temp table writing to WAL?  The pg_class/pg_attribute changes?

Yes.

Andres



Re: Pg_upgrade speed for many tables

From
Jeff Janes
Date:
On Mon, Nov 5, 2012 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 5, 2012 at 4:33 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> AFAIR any transaction that modifies catalogs gets sync commit forcibly,
>> regardless of the setting.  And sync commit means you get to wait for
>> all previous transactions to be flushed as well.  So simply creating a
>> temp table ought to do the trick ...
>
> I don't think there's a carve-out for system tables ... but creating a
> temp table with synchronous_commit=on will certainly do the trick.

But that seems like something that might be optimized away in the
future (for example, so that temp tables can be used on hot standbys)
resulting in action-at-a-distance breakage.

Is txid_current() more fundamental, i.e. less likely to change?

Cheers,

Jeff



Re: Pg_upgrade speed for many tables

From
Jeff Janes
Date:
On Mon, Nov 5, 2012 at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>> Magnus reported that a customer with a million tables was finding
>> pg_upgrade slow.
>
> You sure there's not an O(N^2) issue in there somewhere?

There certainly will be before he gets to a million, but it probably
doesn't show up yet when testing at 2000.

He will probably have to hack pg_dump, as discussed here:

http://archives.postgresql.org/pgsql-performance/2012-09/msg00003.php


>
>> I don't see anything unsafe about having pg_upgrade use
>> synchronous_commit=off.
>
> No objection, but this seems unlikely to be better than linear speedup,
> with a not-terribly-large constant factor.
>
> BTW, does pg_upgrade run pg_restore in --single-transaction mode?
> That would probably make synchronous_commit moot, at least for that
> step.

Doing that might make the sync problem better, but would make the N^2
problem worse if upgrading to <= 9.2.

Cheers,

Jeff



Re: Pg_upgrade speed for many tables

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> That could generate a lot of WAL files if used regularly.  :-(  Does
> SELECT txid_current() generate WAL?  I think it does.

Well, it assigns a XID.  I'm not sure it'd be a good idea to assume that
the mere act of doing that, without actually writing anything to tables,
would result in a synchronous commit.  (For example, if the transaction
were to abort not commit, I'm pretty sure we'd not bother to fsync its
abort record.  There might be, today or in the future, a similar
optimization for successful xacts that created no WAL records.)

I thought the idea of creating a temp table was the most robust one.
A regular table would be even more certain to generate an fsync, but
it has the disadvantages that you can't easily guarantee no name
collision against a user table, nor guarantee that the table wouldn't
be left behind after a crash at the wrong instant.
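
(A sketch of that flush transaction, with an arbitrary table name:

    SET synchronous_commit = on;
    BEGIN;
    CREATE TEMP TABLE flush_dummy (x int);
    COMMIT;

Committing the catalog changes writes a WAL commit record, and flushing
it also flushes every earlier asynchronous commit along with it.)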
        regards, tom lane



Re: Pg_upgrade speed for many tables

From
Josh Berkus
Date:
> Sorry, I should've said psql --single-transaction.  Although that isn't
> going to work either given the presence of \connect commands in the
> script.  I wonder whether pg_dumpall ought to have some sort of "one
> transaction per database please" option.

pg_dumpall ought to support -Fc output ...


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
On Mon, Nov  5, 2012 at 05:39:40PM -0800, Josh Berkus wrote:
> 
> > Sorry, I should've said psql --single-transaction.  Although that isn't
> > going to work either given the presence of \connect commands in the
> > script.  I wonder whether pg_dumpall ought to have some sort of "one
> > transaction per database please" option.
> 
> pg_dumpall ought to support -Fc output ...

That is already a TODO:
Add pg_dumpall custom format dumps?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Pg_upgrade speed for many tables

From
Andrew Dunstan
Date:
On 11/05/2012 08:52 PM, Bruce Momjian wrote:
> On Mon, Nov  5, 2012 at 05:39:40PM -0800, Josh Berkus wrote:
>>> Sorry, I should've said psql --single-transaction.  Although that isn't
>>> going to work either given the presence of \connect commands in the
>>> script.  I wonder whether pg_dumpall ought to have some sort of "one
>>> transaction per database please" option.
>> pg_dumpall ought to support -Fc output ...
> That is already a TODO:
>
>     Add pg_dumpall custom format dumps?


That '?' isn't an accident. Custom format is currently inherently 
single-database. Unless you're going to make pg_dumpall produce multiple 
custom format archives, that would involve a major change that nobody 
has designed AFAIK.


cheers

andrew



Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
On Mon, Nov  5, 2012 at 03:08:17PM -0500, Bruce Momjian wrote:
> Magnus reported that a customer with a million tables was finding
> pg_upgrade slow.  I had never considered many tables to be a problem, but
> decided to test it.  I created a database with 2k tables like this:
>
>     CREATE TABLE test1990 (x SERIAL);
>
> Running the git version of pg_upgrade on that took 203 seconds.  Using
> synchronous_commit=off dropped the time to 78 seconds.  This was tested
> on magnetic disks with a write-through cache.  (No change on an SSD with
> a super-capacitor.)
>
> I don't see anything unsafe about having pg_upgrade use
> synchronous_commit=off.  I could set it just for the pg_dump reload, but
> it seems safe to just use it always.  We don't write to the old cluster,
> and if pg_upgrade fails, you have to re-initdb the new cluster anyway.
>
> Patch attached.  I think it should be applied to 9.2 as well.

Modified patch attached and applied to head and 9.2.  I decided to use
synchronous_commit=off only on the new cluster, just in case we ever do
make a modification of the old cluster.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Attachment

Re: Pg_upgrade speed for many tables

From
Jeff Janes
Date:
On Mon, Nov 5, 2012 at 12:08 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Magnus reported that a customer with a million tables was finding
> pg_upgrade slow.  I had never considered many tables to be a problem, but
> decided to test it.  I created a database with 2k tables like this:
>
>         CREATE TABLE test1990 (x SERIAL);
>
> Running the git version of pg_upgrade on that took 203 seconds.  Using
> synchronous_commit=off dropped the time to 78 seconds.  This was tested
> on magnetic disks with a write-through cache.  (No change on an SSD with
> a super-capacitor.)
>
> I don't see anything unsafe about having pg_upgrade use
> synchronous_commit=off.  I could set it just for the pg_dump reload, but
> it seems safe to just use it always.  We don't write to the old cluster,
> and if pg_upgrade fails, you have to re-initdb the new cluster anyway.
>
> Patch attached.  I think it should be applied to 9.2 as well.

Is turning off synchronous_commit enough?  What about turning off fsync?

When I'm doing a pg_upgrade with thousands of tables, the shutdown
checkpoint after restoring the dump to the new cluster takes a very
long time, as the writer drains its operation table by opening and
individually fsync-ing thousands of files.  This takes about 40 ms per
file, which I assume is a combination of slow lap-top disk drive, and
a strange deal with ext4 which makes fsyncing a recently created file
very slow.   But even with faster hdd, this would still be a problem
if it works the same way, with every file needing 4 rotations to be
fsynced and this happens in serial.

Worse, pg_ctl only waits the default of 60 seconds for the
shutdown to complete before it throws an error and the entire
pg_upgrade gives up.  It seems to me that either the -t setting should
be increased, or it should be an option to pg_upgrade.

My work around was to invoke a system-wide "sync" a couple seconds
after the 'pg_ctl stop' is initiated.  Flushing the files wholesale
seems to work to make the checkpoint writer rapidly find it has
nothing to do when it tries to flush them retail.
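
In shell terms the workaround is roughly ($NEWDATA standing in for the
new cluster's data directory):

    pg_ctl -D "$NEWDATA" stop &    # shutdown checkpoint fsyncs retail
    sleep 2
    sync                           # flush dirty kernel buffers wholesale
    wait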

Anyway, the reason I think turning fsync off might be reasonable is
that as soon as the new cluster is shut down, pg_upgrade starts
overwriting most of those just-fsynced files with other files from the
old cluster, and AFAICT makes no effort to fsync them.  So until there
is a system-wide sync after the pg_upgrade finishes, your new cluster
is already in mortal danger anyway.

Cheers,

Jeff



Re: Pg_upgrade speed for many tables

From
Jeff Davis
Date:
On Mon, 2012-11-12 at 10:29 -0800, Jeff Janes wrote:
> When I'm doing a pg_upgrade with thousands of tables, the shutdown
> checkpoint after restoring the dump to the new cluster takes a very
> long time, as the writer drains its operation table by opening and
> individually fsync-ing thousands of files.

This reminds me of the fix I did for initdb to sync the files. I think
we do need to make sure they are sync'd, because ext4 can keep buffers
around for quite a long time without cleaning them.

I ended up using sync_file_range(..., SYNC_FILE_RANGE_WRITE) on linux,
and posix_fadvise(..., POSIX_FADV_DONTNEED) on everything else, and that
made subsequent fsyncs more efficient.
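
A simplified sketch of that two-pass pattern (the function name is mine;
assumes Linux for sync_file_range, with the posix_fadvise fallback, and
error handling omitted):

    #define _GNU_SOURCE             /* for sync_file_range on glibc */
    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Pass 1, run over every file: start kernel writeback cheaply, so
     * that the pass-2 fsync() over the same files has little to wait on.
     */
    static void
    pre_sync_file(const char *fname)
    {
        int fd = open(fname, O_RDONLY);

        if (fd < 0)
            return;                 /* sketch: ignore errors */
    #if defined(__linux__)
        sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
    #else
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    #endif
        close(fd);
    }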

Regards,

Jeff Davis





Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
On Mon, Nov 12, 2012 at 10:29:39AM -0800, Jeff Janes wrote:
> On Mon, Nov 5, 2012 at 12:08 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > Magnus reported that a customer with a million tables was finding
> > pg_upgrade slow.  I had never considered many tables to be a problem, but
> > decided to test it.  I created a database with 2k tables like this:
> >
> >         CREATE TABLE test1990 (x SERIAL);
> >
> > Running the git version of pg_upgrade on that took 203 seconds.  Using
> > synchronous_commit=off dropped the time to 78 seconds.  This was tested
> > on magnetic disks with a write-through cache.  (No change on an SSD with
> > a super-capacitor.)
> >
> > I don't see anything unsafe about having pg_upgrade use
> > synchronous_commit=off.  I could set it just for the pg_dump reload, but
> > it seems safe to just use it always.  We don't write to the old cluster,
> > and if pg_upgrade fails, you have to re-initdb the new cluster anyway.
> >
> > Patch attached.  I think it should be applied to 9.2 as well.
>
> Is turning off synchronous_commit enough?  What about turning off fsync?

I did some testing with the attached patch, which turns off fsync, on a
magnetic disk with no BBU;  I got these results:

                 sync_com=off  fsync=off
         1          15.90       13.51
      1000          26.09       24.56
      2000          33.41       31.20
      4000          57.39       57.74
      8000         102.84      116.28
     16000         189.43      207.84

It shows fsync=off faster for < 4k tables, and slower for > 4k.  I am
not sure of the cause, but perhaps buffering the fsyncs is actually
faster than doing no-op fsyncs.

I don't think fsync=off makes sense, except for testing;  let me know if
I should test something else.

> When I'm doing a pg_upgrade with thousands of tables, the shutdown
> checkpoint after restoring the dump to the new cluster takes a very
> long time, as the writer drains its operation table by opening and
> individually fsync-ing thousands of files.  This takes about 40 ms per
> file, which I assume is a combination of slow lap-top disk drive, and
> a strange deal with ext4 which makes fsyncing a recently created file
> very slow.   But even with faster hdd, this would still be a problem
> if it works the same way, with every file needing 4 rotations to be
> fsynced and this happens in serial.

Is this with the current code that does synchronous_commit=off?  If not,
can you test to see if this is still a problem?

> Worse, the shutdown only waits for the default of 60 seconds for the
> shutdown to take place before it throws an error and the entire
> pg_upgrade gives up.  It seems to me that either the -t setting should
> be increased, or should be an option to pg_upgrade.
>
> My work around was to invoke a system-wide "sync" a couple seconds
> after the 'pg_ctl stop' is initiated.  Flushing the files wholesale
> seems to work to make the checkpoint writer rapidly find it has
> nothing to do when it tries to flush them retail.
>
> Anyway, the reason I think turning fsync off might be reasonable is
> that as soon as the new cluster is shut down, pg_upgrade starts
> overwriting most of those just-fsynced file with other files from the
> old cluster, and AFAICT makes no effort to fsync them.  So until there
> is a system-wide sync after the pg_upgrade finishes, your new cluster
> is already in mortal danger anyway.

pg_upgrade does a cluster shutdown before overwriting those files.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Attachment

Re: Pg_upgrade speed for many tables

From
Jeff Janes
Date:
On Wed, Nov 14, 2012 at 3:55 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Mon, Nov 12, 2012 at 10:29:39AM -0800, Jeff Janes wrote:
>>
>> Is turning off synchronous_commit enough?  What about turning off fsync?
>
> I did some testing with the attached patch on a magnetic disk with no
> BBU that turns off fsync;

With which file system?  I wouldn't expect you to see a benefit with
ext2 or ext3; it seems to be a peculiarity of ext4 that it inhibits
"group fsync" of new file creations and instead does each one serially.
Whether it is worth applying a fix that is only needed for that one
file system, I don't know.  The trade-offs are not all that clear to
me yet.

>  I got these results
>
>                  sync_com=off  fsync=off
>             1        15.90     13.51
>          1000        26.09     24.56
>          2000        33.41     31.20
>          4000        57.39     57.74
>          8000       102.84    116.28
>         16000       189.43    207.84
>
> It shows fsync faster for < 4k, and slower for > 4k.  Not sure why this
> is the cause but perhaps the buffering of the fsync is actually faster
> than doing a no-op fsync.

synchronous_commit=off turns off not only the fsync at each commit,
but also the write-to-kernel at each commit; so it is not surprising
that it is faster at large scale.  I would specify both
synchronous_commit=off and fsync=off.


>> When I'm doing a pg_upgrade with thousands of tables, the shutdown
>> checkpoint after restoring the dump to the new cluster takes a very
>> long time, as the writer drains its operation table by opening and
>> individually fsync-ing thousands of files.  This takes about 40 ms per
>> file, which I assume is a combination of slow lap-top disk drive, and
>> a strange deal with ext4 which makes fsyncing a recently created file
>> very slow.   But even with faster hdd, this would still be a problem
>> if it works the same way, with every file needing 4 rotations to be
>> fsynced and this happens in serial.
>
> Is this with the current code that does synchronous_commit=off?  If not,
> can you test to see if this is still a problem?

Yes, it is with synchronous_commit=off. (or if it wasn't originally,
it is now, with the same result)

Applying your fsync patch does solve the problem for me on ext4.
Having the new cluster be on ext3 rather than ext4 also solves the
problem, without the need for a patch; but it would be nice to be more
friendly to ext4, which is popular even though not recommended.

>>
>> Anyway, the reason I think turning fsync off might be reasonable is
>> that as soon as the new cluster is shut down, pg_upgrade starts
>> overwriting most of those just-fsynced file with other files from the
>> old cluster, and AFAICT makes no effort to fsync them.  So until there
>> is a system-wide sync after the pg_upgrade finishes, your new cluster
>> is already in mortal danger anyway.
>
> pg_upgrade does a cluster shutdown before overwriting those files.

Right.  So as far as the cluster is concerned, those files have been
fsynced.  But then the next step is go behind the cluster's back and
replace those fsynced files with different files, which may or may not
have been fsynced.  This is what makes me think the new cluster is in
mortal danger.  Not only have the new files perhaps not been fsynced,
but the cluster is not even aware of this fact, so you can start it
up, and then shut it down, and it still won't bother to fsync them,
because as far as it is concerned they already have been.

Given that, how much extra danger would be added by having the new
cluster schema restore run with fsync=off?

In any event, I think the documentation should caution that the
upgrade should not be deemed to be a success until after a system-wide
sync has been done.  Even if we use the link rather than copy method,
are we sure that that is safe if the directories recording those links
have not been fsynced?

Cheers,

Jeff



Use of fsync; was Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
On Mon, Nov 19, 2012 at 12:11:26PM -0800, Jeff Janes wrote:

[ Sorry for the delay in replying.]

> On Wed, Nov 14, 2012 at 3:55 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Mon, Nov 12, 2012 at 10:29:39AM -0800, Jeff Janes wrote:
> >>
> >> Is turning off synchronous_commit enough?  What about turning off fsync?
> >
> > I did some testing with the attached patch on a magnetic disk with no
> > BBU that turns off fsync;
> 
> With which file system? I wouldn't expect you to see a benefit with
> ext2 or ext3; it seems to be a peculiarity of ext4 that it inhibits
> "group fsync" of new file creations and instead does each one serially.
>  Whether it is worth applying a fix that is only needed for that one
> file system, I don't know.  The trade-offs are not all that clear to
> me yet.

It seems plausible that only ext4 shows the difference.

> >  I got these results
> >
> >                  sync_com=off  fsync=off
> >             1        15.90     13.51
> >          1000        26.09     24.56
> >          2000        33.41     31.20
> >          4000        57.39     57.74
> >          8000       102.84    116.28
> >         16000       189.43    207.84
> >
> > It shows fsync faster for < 4k, and slower for > 4k.  Not sure why this
> > is the cause but perhaps the buffering of the fsync is actually faster
> > than doing a no-op fsync.
> 
> synchronous-commit=off turns off not only the fsync at each commit,
> but also the write-to-kernel at each commit; so it is not surprising
> that it is faster at large scale.  I would specify both
> synchronous-commit=off and fsync=off.

I would like to see actual numbers showing synchronous_commit=off is
also useful if we use fsync=off.

> >> When I'm doing a pg_upgrade with thousands of tables, the shutdown
> >> checkpoint after restoring the dump to the new cluster takes a very
> >> long time, as the writer drains its operation table by opening and
> >> individually fsync-ing thousands of files.  This takes about 40 ms per
> >> file, which I assume is a combination of slow lap-top disk drive, and
> >> a strange deal with ext4 which makes fsyncing a recently created file
> >> very slow.   But even with faster hdd, this would still be a problem
> >> if it works the same way, with every file needing 4 rotations to be
> >> fsynced and this happens in serial.
> >
> > Is this with the current code that does synchronous_commit=off?  If not,
> > can you test to see if this is still a problem?
> 
> Yes, it is with synchronous_commit=off. (or if it wasn't originally,
> it is now, with the same result)
> 
> Applying your fsync patch does solve the problem for me on ext4.
> Having the new cluster be on ext3 rather than ext4 also solves the
> problem, without the need for a patch; but it would be nice to be more
> friendly to ext4, which is popular even though not recommended.

Do you have numbers with synchronous_commit=off, fsync=off, and both, on
ext4?

> >> Anyway, the reason I think turning fsync off might be reasonable is
> >> that as soon as the new cluster is shut down, pg_upgrade starts
> >> overwriting most of those just-fsynced file with other files from the
> >> old cluster, and AFAICT makes no effort to fsync them.  So until there
> >> is a system-wide sync after the pg_upgrade finishes, your new cluster
> >> is already in mortal danger anyway.
> >
> > pg_upgrade does a cluster shutdown before overwriting those files.
> 
> Right.  So as far as the cluster is concerned, those files have been
> fsynced.  But then the next step is go behind the cluster's back and
> replace those fsynced files with different files, which may or may not
> have been fsynced.  This is what makes me thing the new cluster is in
> mortal danger.  Not only have the new files perhaps not been fsynced,
> but the cluster is not even aware of this fact, so you can start it
> up, and then shut it down, and it still won't bother to fsync them,
> because as far as it is concerned they already have been.
> 
> Given that, how much extra danger would be added by having the new
> cluster schema restore run with fsync=off?
> 
> In any event, I think the documentation should caution that the
> upgrade should not be deemed to be a success until after a system-wide
> sync has been done.  Even if we use the link rather than copy method,
> are we sure that that is safe if the directories recording those links
> have not been fsynced?

OK, the above is something I have been thinking about, and obviously you
have too.  If you change fsync from off to on in a cluster, and restart
it, there is no guarantee that the dirty pages you read from the kernel
are actually on disk, because Postgres doesn't know they are dirty. 
They probably will be pushed to disk by the kernel in less than one
minute, but still, it doesn't seem reliable. Should this be documented
in the fsync section?

Again, another reason not to use fsync=off, though your example of the
file copy is a good one.  As you stated, this is a problem with the file
copy/link, independent of how Postgres handles the files.  We can tell
people to use 'sync' as root on Unix, but what about Windows?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Use of fsync; was Re: Pg_upgrade speed for many tables

From
Jeff Janes
Date:
On Fri, Nov 23, 2012 at 7:22 PM, Bruce Momjian <bruce@momjian.us> wrote:
>  On Mon, Nov 19, 2012 at 12:11:26PM -0800, Jeff Janes wrote:
>
>>
>> Yes, it is with synchronous_commit=off. (or if it wasn't originally,
>> it is now, with the same result)
>>
>> Applying your fsync patch does solve the problem for me on ext4.
>> Having the new cluster be on ext3 rather than ext4 also solves the
>> problem, without the need for a patch; but it would be nice to be more
>> friendly to ext4, which is popular even though not recommended.
>
> Do you have numbers with synchronous-commit=off, fsync=off, and both, on
> ext4?

for 5,000 tables like "create table fooNNNNN (x serial)", upgrading
from 9.3dev to 9.3dev:

Timings are in seconds, done twice.  I had to hack pg_upgrade so that
the pg_ctl stop command did -w -t 3600, otherwise
I'd get a "database did not shut down" error for the first two.

    both on                   648.29  608.42
    synchronous_commit off    250.24  366.50
    fsync off                  46.91   43.96
    both off                   41.44   44.81

Also, I did a manual sync as soon as "Removing support functions from
new cluster OK" appears, with synchronous_commit off but fsync on:

    45.96   46.46


>>
>> In any event, I think the documentation should caution that the
>> upgrade should not be deemed to be a success until after a system-wide
>> sync has been done.  Even if we use the link rather than copy method,
>> are we sure that that is safe if the directories recording those links
>> have not been fsynced?
>
> OK, the above is something I have been thinking about, and obviously you
> have too.  If you change fsync from off to on in a cluster, and restart
> it, there is no guarantee that the dirty pages you read from the kernel
> are actually on disk, because Postgres doesn't know they are dirty.
> They probably will be pushed to disk by the kernel in less than one
> minute, but still, it doesn't seem reliable. Should this be documented
> in the fsync section?
>
> Again, another reason not to use fsync=off, though your example of the
> file copy is a good one.  As you stated, this is a problem with the file
> copy/link, independent of how Postgres handles the files.  We can tell
> people to use 'sync' as root on Unix, but what about Windows?

I'm pretty sure someone mentioned the way to do that on Windows in
this list in the last few months, but I can't seem to find it.  I
thought it was the initdb fsync thread.

Cheers,

Jeff



Re: Use of fsync; was Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
On Sat, Nov 24, 2012 at 09:42:08PM -0800, Jeff Janes wrote:
> On Fri, Nov 23, 2012 at 7:22 PM, Bruce Momjian <bruce@momjian.us> wrote:
> >  On Mon, Nov 19, 2012 at 12:11:26PM -0800, Jeff Janes wrote:
> >
> >>
> >> Yes, it is with synchronous_commit=off. (or if it wasn't originally,
> >> it is now, with the same result)
> >>
> >> Applying your fsync patch does solve the problem for me on ext4.
> >> Having the new cluster be on ext3 rather than ext4 also solves the
> >> problem, without the need for a patch; but it would be nice to be more
> >> friendly to ext4, which is popular even though not recommended.
> >
> > Do you have numbers with synchronous-commit=off, fsync=off, and both, on
> > ext4?
> 
> for 5,000 tables like "create table fooNNNNN (x serial)", upgrading
> from 9.3dev to 9.3dev:
> 
> Timings are in seconds, done twice.  I had to hack pg_upgrade so that
> the pg_ctl stop command did -w -t 3600, otherwise
> I'd get an "database did not shut down" error for the first two.
> 
> both on                            648.29  608.42
> synchronous_commit off   250.24  366.50
> fsync off                           46.91   43.96
> both off                             41.44   44.81
> 
> Also, I did a manual sync as soon as "Removing support functions from
> new cluster OK" appears, with synchronous_commit off bug fsync on:
> 45.96   46.46

OK, these are very convincing numbers.  I am going to modify initdb to
have a --sync-only option, and have pg_upgrade use that.  This is 9.3
material.

> >>
> >> In any event, I think the documentation should caution that the
> >> upgrade should not be deemed to be a success until after a system-wide
> >> sync has been done.  Even if we use the link rather than copy method,
> >> are we sure that that is safe if the directories recording those links
> >> have not been fsynced?
> >
> > OK, the above is something I have been thinking about, and obviously you
> > have too.  If you change fsync from off to on in a cluster, and restart
> > it, there is no guarantee that the dirty pages you read from the kernel
> > are actually on disk, because Postgres doesn't know they are dirty.
> > They probably will be pushed to disk by the kernel in less than one
> > minute, but still, it doesn't seem reliable. Should this be documented
> > in the fsync section?
> >
> > Again, another reason not to use fsync=off, though your example of the
> > file copy is a good one.  As you stated, this is a problem with the file
> > copy/link, independent of how Postgres handles the files.  We can tell
> > people to use 'sync' as root on Unix, but what about Windows?
> 
> I'm pretty sure someone mentioned the way to do that on Windows in
> this list in the last few months, but I can't seem to find it.  I
> thought it was the initdb fsync thread.

Yep, the code is already in initdb to fsync a directory --- we just need
a way for pg_upgrade to access it.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Use of fsync; was Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
On Mon, Nov 26, 2012 at 02:43:19PM -0500, Bruce Momjian wrote:
> > >> In any event, I think the documentation should caution that the
> > >> upgrade should not be deemed to be a success until after a system-wide
> > >> sync has been done.  Even if we use the link rather than copy method,
> > >> are we sure that that is safe if the directories recording those links
> > >> have not been fsynced?
> > >
> > > OK, the above is something I have been thinking about, and obviously you
> > > have too.  If you change fsync from off to on in a cluster, and restart
> > > it, there is no guarantee that the dirty pages you read from the kernel
> > > are actually on disk, because Postgres doesn't know they are dirty.
> > > They probably will be pushed to disk by the kernel in less than one
> > > minute, but still, it doesn't seem reliable. Should this be documented
> > > in the fsync section?
> > >
> > > Again, another reason not to use fsync=off, though your example of the
> > > file copy is a good one.  As you stated, this is a problem with the file
> > > copy/link, independent of how Postgres handles the files.  We can tell
> > > people to use 'sync' as root on Unix, but what about Windows?
> >
> > I'm pretty sure someone mentioned the way to do that on Windows in
> > this list in the last few months, but I can't seem to find it.  I
> > thought it was the initdb fsync thread.
>
> Yep, the code is already in initdb to fsync a directory --- we just need
> a way for pg_upgrade to access it.

I have developed the attached patch that does this.  It basically adds
a --sync-only option to initdb, then turns off all durability in
pg_upgrade and has pg_upgrade run initdb --sync-only;  this gives us
another nice speedup!

             ------ SSD ---- -- magnetic ---
                git    patch    git    patch
        1      11.11   11.11   11.10   11.13
     1000      20.57   19.89   20.72   19.30
     2000      28.02   25.81   28.50   27.53
     4000      42.00   43.59   46.71   46.84
     8000      89.66   74.16   89.10   73.67
    16000     157.66  135.98  159.97  153.48
    32000     316.24  296.90  334.74  308.59
    64000     814.97  715.53  797.34  727.94

(I am very happy with these times.  Thanks to Jeff Janes for his
suggestions.)
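
Run by hand, the flush step is simply (path is a placeholder):

    initdb --sync-only /path/to/new/data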

I have also added documentation to the 'fsync' configuration variable
warning about dirty kernel buffers and recommending that they be flushed
to disk before the cluster is considered crash-recovery safe.

I consider this patch ready for 9.3 application (meaning it is not a
prototype).

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Attachment

Re: Use of fsync; was Re: Pg_upgrade speed for many tables

From
Bruce Momjian
Date:
Applied.

---------------------------------------------------------------------------

On Fri, Nov 30, 2012 at 10:43:29PM -0500, Bruce Momjian wrote:
> On Mon, Nov 26, 2012 at 02:43:19PM -0500, Bruce Momjian wrote:
> > > >> In any event, I think the documentation should caution that the
> > > >> upgrade should not be deemed to be a success until after a system-wide
> > > >> sync has been done.  Even if we use the link rather than copy method,
> > > >> are we sure that that is safe if the directories recording those links
> > > >> have not been fsynced?
> > > >
> > > > OK, the above is something I have been thinking about, and obviously you
> > > > have too.  If you change fsync from off to on in a cluster, and restart
> > > > it, there is no guarantee that the dirty pages you read from the kernel
> > > > are actually on disk, because Postgres doesn't know they are dirty.
> > > > They probably will be pushed to disk by the kernel in less than one
> > > > minute, but still, it doesn't seem reliable. Should this be documented
> > > > in the fsync section?
> > > >
> > > > Again, another reason not to use fsync=off, though your example of the
> > > > file copy is a good one.  As you stated, this is a problem with the file
> > > > copy/link, independent of how Postgres handles the files.  We can tell
> > > > people to use 'sync' as root on Unix, but what about Windows?
> > > 
> > > I'm pretty sure someone mentioned the way to do that on Windows in
> > > this list in the last few months, but I can't seem to find it.  I
> > > thought it was the initdb fsync thread.
> > 
> > Yep, the code is already in initdb to fsync a directory --- we just need
> > a way for pg_upgrade to access it.
> 
> I have developed the attached patch that does this.  It basically adds
> a --sync-only option to initdb, then turns off all durability in
> pg_upgrade and has pg_upgrade run initdb --sync-only;  this gives us
> another nice speedup!
> 
>              ------ SSD ---- -- magnetic ---
>                 git    patch    git    patch
>         1      11.11   11.11   11.10   11.13
>      1000      20.57   19.89   20.72   19.30
>      2000      28.02   25.81   28.50   27.53
>      4000      42.00   43.59   46.71   46.84
>      8000      89.66   74.16   89.10   73.67
>     16000     157.66  135.98  159.97  153.48
>     32000     316.24  296.90  334.74  308.59
>     64000     814.97  715.53  797.34  727.94
> 
> (I am very happy with these times.  Thanks to Jeff Janes for his
> suggestions.)
> 
> I have also added documentation to the 'fsync' configuration variable
> warning about dirty kernel buffers and recommending that they be flushed
> to disk before the cluster is considered crash-recovery safe.
> 
> I consider this patch ready for 9.3 application (meaning it is not a
> prototype).
> 
> -- 
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com
> 
>   + It's impossible for everything to be true. +

> diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
> new file mode 100644
> index c12f15b..63df529
> *** a/contrib/pg_upgrade/pg_upgrade.c
> --- b/contrib/pg_upgrade/pg_upgrade.c
> *************** main(int argc, char **argv)
> *** 150,155 ****
> --- 150,161 ----
>                 new_cluster.pgdata);
>       check_ok();
>   
> +     prep_status("Sync data directory to disk");
> +     exec_prog(UTILITY_LOG_FILE, NULL, true,
> +               "\"%s/initdb\" --sync-only \"%s\"", new_cluster.bindir,
> +               new_cluster.pgdata);
> +     check_ok();
> + 
>       create_script_for_cluster_analyze(&analyze_script_file_name);
>       create_script_for_old_cluster_deletion(&deletion_script_file_name);
>   
> diff --git a/contrib/pg_upgrade/server.c b/contrib/pg_upgrade/server.c
> new file mode 100644
> index 49d4c8f..05d8cc0
> *** a/contrib/pg_upgrade/server.c
> --- b/contrib/pg_upgrade/server.c
> *************** start_postmaster(ClusterInfo *cluster)
> *** 209,217 ****
>        * a gap of 2000000000 from the current xid counter, so autovacuum will
>        * not touch them.
>        *
> !      *    synchronous_commit=off improves object creation speed, and we only
> !      *    modify the new cluster, so only use it there.  If there is a crash,
> !      *    the new cluster has to be recreated anyway.
>        */
>       snprintf(cmd, sizeof(cmd),
>                "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" -o \"-p %d%s%s%s%s\" start",
> --- 209,217 ----
>        * a gap of 2000000000 from the current xid counter, so autovacuum will
>        * not touch them.
>        *
> !      * Turn off durability requirements to improve object creation speed, and
> !      * we only modify the new cluster, so only use it there.  If there is a
> !      * crash, the new cluster has to be recreated anyway.
>        */
>       snprintf(cmd, sizeof(cmd),
>                "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" -o \"-p %d%s%s%s%s\" start",
> *************** start_postmaster(ClusterInfo *cluster)
> *** 219,225 ****
>                (cluster->controldata.cat_ver >=
>                 BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ? " -b" :
>                " -c autovacuum=off -c autovacuum_freeze_max_age=2000000000",
> !              (cluster == &new_cluster) ? " -c synchronous_commit=off" : "",
>                cluster->pgopts ? cluster->pgopts : "", socket_string);
>   
>       /*
> --- 219,226 ----
>                (cluster->controldata.cat_ver >=
>                 BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ? " -b" :
>                " -c autovacuum=off -c autovacuum_freeze_max_age=2000000000",
> !              (cluster == &new_cluster) ?
> !                 " -c synchronous_commit=off -c fsync=off -c full_page_writes=off" : "",
>                cluster->pgopts ? cluster->pgopts : "", socket_string);
>   
>       /*
> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> new file mode 100644
> index b56070b..b7df8ce
> *** a/doc/src/sgml/config.sgml
> --- b/doc/src/sgml/config.sgml
> *************** include 'filename'
> *** 1697,1702 ****
> --- 1697,1711 ----
>          </para>
>   
>          <para>
> +         For reliable recovery when changing <varname>fsync</varname>
> +         off to on, it is necessary to force all modified buffers in the
> +         kernel to durable storage.  This can be done while the cluster
> +         is shutdown or while fsync is on by running <command>initdb
> +         --sync-only</command>, running <command>sync</>, unmounting the
> +         file system, or rebooting the server.
> +        </para>
> + 
> +        <para>
>           In many situations, turning off <xref linkend="guc-synchronous-commit">
>           for noncritical transactions can provide much of the potential
>           performance benefit of turning off <varname>fsync</varname>, without
> diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
> new file mode 100644
> index 08ee37e..a1e46eb
> *** a/doc/src/sgml/ref/initdb.sgml
> --- b/doc/src/sgml/ref/initdb.sgml
> *************** PostgreSQL documentation
> *** 245,250 ****
> --- 245,261 ----
>        </varlistentry>
>   
>        <varlistentry>
> +       <term><option>-S</option></term>
> +       <term><option>--sync-only</option></term>
> +       <listitem>
> +        <para>
> +         Safely write all database files to disk and exit.  This does not
> +         perform any of the normal <application>initdb</> operations.
> +        </para>
> +       </listitem>
> +      </varlistentry>
> + 
> +      <varlistentry>
>         <term><option>-T <replaceable>CFG</></option></term>
>         <term><option>--text-search-config=<replaceable>CFG</></option></term>
>         <listitem>
> diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
> new file mode 100644
> index 402504b..8c0a9f4
> *** a/src/bin/initdb/initdb.c
> --- b/src/bin/initdb/initdb.c
> *************** static const char *authmethodlocal = "";
> *** 118,123 ****
> --- 118,124 ----
>   static bool debug = false;
>   static bool noclean = false;
>   static bool do_sync = true;
> + static bool sync_only = false;
>   static bool show_setting = false;
>   static char *xlog_dir = "";
>   
> *************** usage(const char *progname)
> *** 2796,2801 ****
> --- 2797,2803 ----
>       printf(_("  -n, --noclean             do not clean up after errors\n"));
>       printf(_("  -N, --nosync              do not wait for changes to be written safely to disk\n"));
>       printf(_("  -s, --show                show internal settings\n"));
> +     printf(_("  -S, --sync-only           only sync data directory\n"));
>       printf(_("\nOther options:\n"));
>       printf(_("  -V, --version             output version information, then exit\n"));
>       printf(_("  -?, --help                show this help, then exit\n"));
> *************** main(int argc, char *argv[])
> *** 3445,3450 ****
> --- 3447,3453 ----
>           {"show", no_argument, NULL, 's'},
>           {"noclean", no_argument, NULL, 'n'},
>           {"nosync", no_argument, NULL, 'N'},
> +         {"sync-only", no_argument, NULL, 'S'},
>           {"xlogdir", required_argument, NULL, 'X'},
>           {NULL, 0, NULL, 0}
>       };
> *************** main(int argc, char *argv[])
> *** 3476,3482 ****
>   
>       /* process command-line options */
>   
> !     while ((c = getopt_long(argc, argv, "dD:E:L:nNU:WA:sT:X:", long_options, &option_index)) != -1)
>       {
>           switch (c)
>           {
> --- 3479,3485 ----
>   
>       /* process command-line options */
>   
> !     while ((c = getopt_long(argc, argv, "dD:E:L:nNU:WA:sST:X:", long_options, &option_index)) != -1)
>       {
>           switch (c)
>           {
> *************** main(int argc, char *argv[])
> *** 3522,3527 ****
> --- 3525,3533 ----
>               case 'N':
>                   do_sync = false;
>                   break;
> +             case 'S':
> +                 sync_only = true;
> +                 break;
>               case 'L':
>                   share_path = pg_strdup(optarg);
>                   break;
> *************** main(int argc, char *argv[])
> *** 3589,3594 ****
> --- 3595,3608 ----
>           exit(1);
>       }
>   
> +     /* If we only need to fsync, just do it and exit */
> +     if (sync_only)
> +     {
> +         setup_pgdata();
> +         perform_fsync();
> +         return 0;
> +     }
> +     
>       if (pwprompt && pwfilename)
>       {
>           fprintf(stderr, _("%s: password prompt and password file cannot be specified together\n"), progname);

> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +