Thread: Further pg_upgrade analysis for many tables

Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
As a followup to Magnus's report that pg_upgrade was slow for many
tables, I did some more testing with many tables, e.g.:

    CREATE TABLE test991 (x SERIAL);

I ran it for 0, 1k, 2k, ... 16k tables, and got these results:

    tables    pg_dump     restore     pg_upgrade(increase)
        0       0.30        0.24       11.73(-)
     1000       6.46        6.55       28.79(2.45x)
     2000      29.82       20.96       69.75(2.42x)
     4000      95.70      115.88      289.82(4.16x)
     8000     405.38      505.93     1168.60(4.03x)
    16000    1702.23     2197.56     5022.82(4.30x)

Things look fine through 2k, but at 4k the duration of pg_dump, restore,
and pg_upgrade (which is mostly a combination of these two) is 4x,
rather than the 2x as predicted by the growth in the number of tables.
To see how bad it is, 16k tables is 1.3 hours, and 32k tables would be
5.6 hours by my estimates.

You can see the majority of pg_upgrade duration is made up of the
pg_dump and the schema restore, so I can't really speed up pg_upgrade
without speeding those up, and the 4x increase is in _both_ of those
operations, not just one.

Also, for 16k, I had to increase max_locks_per_transaction or the dump
would fail, which kind of surprised me.

I tested 9.2 and git head, but they produced identical numbers.  I did
use synchronous_commit=off.

Any ideas?  I am attaching my test script.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Wed, Nov  7, 2012 at 09:17:29PM -0500, Bruce Momjian wrote:
> Things look fine through 2k, but at 4k the duration of pg_dump, restore,
> and pg_upgrade (which is mostly a combination of these two) is 4x,
> rather than the 2x as predicted by the growth in the number of tables. 
> To see how bad it is, 16k tables is 1.3 hours, and 32k tables would be
> 5.6 hours by my estimates.
> 
> You can see the majority of pg_upgrade duration is made up of the
> pg_dump and the schema restore, so I can't really speed up pg_upgrade
> without speeding those up, and the 4x increase is in _both_ of those
> operations, not just one.
> 
> Also, for 16k, I had to increase max_locks_per_transaction or the dump
> would fail, which kind of surprised me.
> 
> I tested 9.2 and git head, but they produced identical numbers.  I did
> use synchronous_commit=off.
> 
> Any ideas?  I am attaching my test script.

Thinking this might be related to some server setting, I increased
shared buffers, work_mem, and maintenance_work_mem, but this produced
almost no improvement:
    tables    pg_dump     restore     pg_upgrade
         1       0.30        0.24       11.73(-)
      1000       6.46        6.55       28.79(2.45)
      2000      29.82       20.96       69.75(2.42)
      4000      95.70      115.88      289.82(4.16)
      8000     405.38      505.93     1168.60(4.03)

    shared_buffers=1GB

    tables    pg_dump     restore
         1       0.26        0.23
      1000       6.22        7.00
      2000      23.92       22.51
      4000      88.44      111.99
      8000     376.20      531.07

    shared_buffers=1GB, work_mem/maintenance_work_mem=500MB

    tables    pg_dump     restore
         1       0.27        0.23
      1000       6.39        8.27
      2000      26.34       20.53
      4000      89.47      104.59
      8000     397.13      486.99

Any ideas what else I should test?  Is this O(2n) or O(n^2) behavior?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Peter Eisentraut
Date:
On 11/7/12 9:17 PM, Bruce Momjian wrote:
> As a followup to Magnus's report that pg_upgrade was slow for many
> tables, I did some more testing with many tables, e.g.:
> 
>     CREATE TABLE test991 (x SERIAL);
> 
> I ran it for 0, 1k, 2k, ... 16k tables, and got these results:
> 
>     tables    pg_dump     restore     pg_upgrade(increase)
>         0       0.30        0.24       11.73(-)
>      1000       6.46        6.55       28.79(2.45x)
>      2000      29.82       20.96       69.75(2.42x)
>      4000      95.70      115.88      289.82(4.16x)
>      8000     405.38      505.93     1168.60(4.03x)
>     16000    1702.23     2197.56     5022.82(4.30x)

I can reproduce these numbers, more or less.  (Additionally, it ran out
of shared memory with the default setting when dumping the 8000 tables.)

But this issue seems to be entirely the fault of sequences being
present.  When I replace the serial column with an int, everything
finishes within seconds and scales seemingly linearly.



Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Wed, Nov 7, 2012 at 6:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
> As a followup to Magnus's report that pg_upgrade was slow for many
> tables, I did some more testing with many tables, e.g.:
>
...
>
> Any ideas?  I am attaching my test script.

Have you reviewed the thread at:
http://archives.postgresql.org/pgsql-performance/2012-09/msg00003.php
?

There is a known N^2 behavior when using pg_dump against pre-9.3 servers.

There was a proposed patch to pg_dump to work around the problem when
it is used against older servers, but it was not accepted and not
entered into a commitfest.  For one thing, there were doubts about how
stable it would be at very large scale and it wasn't tested all that
thoroughly, and for another, it would be a temporary improvement, since
once the server itself is upgraded to 9.3, the kludge in pg_dump would
no longer be an improvement.

The most recent version (that I can find) of that work-around patch is at:

http://archives.postgresql.org/pgsql-performance/2012-06/msg00071.php

I don't know if that will solve your particular case, but it is
probably worth a try.

Cheers,

Jeff



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Thu, Nov  8, 2012 at 03:46:09PM -0800, Jeff Janes wrote:
> On Wed, Nov 7, 2012 at 6:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > As a followup to Magnus's report that pg_upgrade was slow for many
> > tables, I did some more testing with many tables, e.g.:
> >
> ...
> >
> > Any ideas?  I am attaching my test script.
> 
> Have you reviewed the thread at:
> http://archives.postgresql.org/pgsql-performance/2012-09/msg00003.php
> ?
> 
> There is a known N^2 behavior when using pg_dump against pre-9.3 servers.

I am actually now dumping git head/9.3, so I assume all the problems we
know about should be fixed.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Thu, Nov  8, 2012 at 12:30:11PM -0500, Peter Eisentraut wrote:
> On 11/7/12 9:17 PM, Bruce Momjian wrote:
> > As a followup to Magnus's report that pg_upgrade was slow for many
> > tables, I did some more testing with many tables, e.g.:
> > 
> >     CREATE TABLE test991 (x SERIAL);
> > 
> > I ran it for 0, 1k, 2k, ... 16k tables, and got these results:
> > 
> >     tables    pg_dump     restore     pg_upgrade(increase)
> >         0       0.30        0.24       11.73(-)
> >      1000       6.46        6.55       28.79(2.45x)
> >      2000      29.82       20.96       69.75(2.42x)
> >      4000      95.70      115.88      289.82(4.16x)
> >      8000     405.38      505.93     1168.60(4.03x)
> >     16000    1702.23     2197.56     5022.82(4.30x)
> 
> I can reproduce these numbers, more or less.  (Additionally, it ran out
> of shared memory with the default setting when dumping the 8000 tables.)
> 
> But this issue seems to be entirely the fault of sequences being
> present.  When I replace the serial column with an int, everything
> finishes within seconds and scales seemingly linearly.

I did some more research and realized that I was not using --schema-only
like pg_upgrade uses.  With that setting, things look like this:
    --schema-only
    tables    pg_dump     restore     pg_upgrade
         1       0.27        0.23       11.73(-)
      1000       3.64        5.18       28.79(2.45)
      2000      13.07       14.63       69.75(2.42)
      4000      43.93       66.87      289.82(4.16)
      8000     190.63      326.67     1168.60(4.03)
     16000     757.80     1402.82     5022.82(4.30)

You can still see the 4x increase, but it is now there for all tests ---
basically, every time the number of tables doubles, the time to dump or
restore a _single_ table doubles, e.g. for 1k tables, a single table
takes 0.00364 seconds to dump, while for 16k tables, a single table
takes 0.04736 seconds to dump, a 13x slowdown.

Second, with --schema-only, you can see the dump/restore is only 50% of
the duration of pg_upgrade, and you can also see that pg_upgrade itself
is slowing down as the number of tables increases, even ignoring the
dump/reload time.

This is all bad news.  :-(  I will keep digging.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Thu, Nov  8, 2012 at 03:46:09PM -0800, Jeff Janes wrote:
>> On Wed, Nov 7, 2012 at 6:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> > As a followup to Magnus's report that pg_upgrade was slow for many
>> > tables, I did some more testing with many tables, e.g.:
>> >
>> ...
>> >
>> > Any ideas?  I am attaching my test script.
>>
>> Have you reviewed the thread at:
>> http://archives.postgresql.org/pgsql-performance/2012-09/msg00003.php
>> ?
>>
>> There is a known N^2 behavior when using pg_dump against pre-9.3 servers.
>
> I am actually now dumping git head/9.3, so I assume all the problems we
> know about should be fixed.

Are you sure the server you are dumping out of is head?

Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to
dump 16,000 tables (schema only) like your example, and it is
definitely quadratic.


But using head's pg_dump to dump tables out of head's server, it only
took 24.95 seconds, and the quadratic term is not yet important,
things still look linear.


But even the 179.11 seconds is several times faster than your report
of 757.8, so I'm not sure what is going on there.  I don't think my
laptop is particularly fast:

Intel(R) Pentium(R) CPU B960 @ 2.20GHz

Is the next value, increment, etc. for a sequence stored in a catalog,
or are they stored in the 8kb file associated with each sequence?  If
they are stored in the file, then it is a shame that pg_dump goes to the
effort of extracting that info if pg_upgrade is just going to
overwrite it anyway.


Cheers,

Jeff



Re: Further pg_upgrade analysis for many tables

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> Are you sure the server you are dumping out of is head?

I experimented a bit with dumping/restoring 16000 tables matching
Bruce's test case (ie, one serial column apiece).  The pg_dump profile
seems fairly flat, without any easy optimization targets.  But
restoring the dump script shows a rather interesting backend profile:

samples  %        image name               symbol name
30861    39.6289  postgres                 AtEOXact_RelationCache
9911     12.7268  postgres                 hash_seq_search
2682      3.4440  postgres                 init_sequence
2218      2.8482  postgres                 _bt_compare
2120      2.7223  postgres                 hash_search_with_hash_value
1976      2.5374  postgres                 XLogInsert
1429      1.8350  postgres                 CatalogCacheIdInvalidate
1282      1.6462  postgres                 LWLockAcquire
973       1.2494  postgres                 LWLockRelease
702       0.9014  postgres                 hash_any

The hash_seq_search time is probably mostly associated with
AtEOXact_RelationCache, which is run during transaction commit and scans
the relcache hashtable looking for tables created in the current
transaction.  So that's about 50% of the runtime going into that one
activity.

There are at least three ways we could whack that mole:

* Run the psql script in --single-transaction mode, as I was mumbling
about the other day.  If we were doing AtEOXact_RelationCache only once,
rather than once per CREATE TABLE statement, it wouldn't be a problem.
Easy but has only a narrow scope of applicability.

* Keep a separate list (or data structure of your choice) so that
relcache entries created in the current xact could be found directly
rather than having to scan the whole relcache.  That'd add complexity
though, and could perhaps be a net loss for cases where the relcache
isn't so bloated.

* Limit the size of the relcache (eg by aging out
not-recently-referenced entries) so that we aren't incurring O(N^2)
costs for scripts touching N tables.  Again, this adds complexity and
could be counterproductive in some scenarios.
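
As a rough standalone sketch of the second idea (the names and
structures below are illustrative stand-ins, not the actual relcache
code), commit-time cleanup would walk a side list of entries created in
the current transaction, so its cost is proportional to the number of
new entries rather than to the total size of the relcache:

    #include <stdlib.h>

    /* Stand-in for a relcache entry; only the cleanup flag matters here. */
    typedef struct Relation
    {
        int     created_in_xact;    /* nonzero if built in current xact */
    } Relation;

    /* Side list of entries created in the current transaction. */
    static Relation **created_rels = NULL;
    static int      n_created = 0;
    static int      created_cap = 0;

    /* Called whenever the relcache builds an entry for a new table. */
    static void
    remember_created_rel(Relation *rel)
    {
        if (n_created == created_cap)
        {
            created_cap = created_cap ? created_cap * 2 : 16;
            created_rels = realloc(created_rels,
                                   created_cap * sizeof(Relation *));
            /* (out-of-memory handling omitted in this sketch) */
        }
        created_rels[n_created++] = rel;
    }

    /*
     * End-of-transaction cleanup: O(entries created in this xact),
     * instead of a hash_seq_search over the entire relcache.
     */
    static void
    at_eoxact_created_rels(void)
    {
        int     i;

        for (i = 0; i < n_created; i++)
            created_rels[i]->created_in_xact = 0;   /* per-entry cleanup */
        n_created = 0;                              /* reset for next xact */
    }

The trade-off mentioned above still applies: the side list adds
bookkeeping to every relcache build, which could be a net loss when the
relcache is not bloated.
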
        regards, tom lane



Re: Further pg_upgrade analysis for many tables

From
Ants Aasma
Date:
On Fri, Nov 9, 2012 at 6:59 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> On Thu, Nov  8, 2012 at 03:46:09PM -0800, Jeff Janes wrote:
>>> On Wed, Nov 7, 2012 at 6:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
>>> > As a followup to Magnus's report that pg_upgrade was slow for many
>>> > tables, I did some more testing with many tables, e.g.:
>>> >
>>> ...
>>> >
>>> > Any ideas?  I am attaching my test script.
>>>
>>> Have you reviewed the thread at:
>>> http://archives.postgresql.org/pgsql-performance/2012-09/msg00003.php
>>> ?
>>>
>>> There is a known N^2 behavior when using pg_dump against pre-9.3 servers.
>>
>> I am actually now dumping git head/9.3, so I assume all the problems we
>> know about should be fixed.
>
> Are you sure the server you are dumping out of is head?
>
> Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to
> dump 16,000 tables (schema only) like your example, and it is
> definitely quadratic.
>
>
> But using head's pg_dump to dump tables out of head's server, it only
> took 24.95 seconds, and the quadratic term is not yet important,
> things still look linear.

I also ran a couple of experiments with git head. From 8k to 16k I'm
seeing slightly super-linear scaling (2.25x), from 32k to 64k a
quadratic term has taken over (3.74x).

I ran the experiments on a slightly beefier machine (Intel i5 @ 4GHz,
Intel SSD 320, Linux 3.2, ext4). For 16k, pg_dump took 29s, pg_upgrade
111s. At 64k the times were 150s/1237s. I didn't measure it, but
an occasional peek at top suggested that most of the time was spent doing
server side processing of restore.

I also took two profiles (attached). AtEOXact_RelationCache seems to
be the culprit for the quadratic growth.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


Re: Further pg_upgrade analysis for many tables

From
Ants Aasma
Date:
On Fri, Nov 9, 2012 at 7:53 AM, Ants Aasma <ants@cybertec.at> wrote:
> I also took two profiles (attached). AtEOXact_RelationCache seems to
> be the culprit for the quadratic growth.

One more thing that jumps out as quadratic from the profiles is
transfer_all_new_dbs from pg_upgrade (20% of total CPU time at 64k).
Searching for non-primary files loops over the whole file list for
each relation. This would be a lot faster if we would sort the file
list first and use binary search to find the related files.
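
A minimal sketch of that suggestion (hypothetical code, not
pg_upgrade's actual implementation): sort the listing once per
directory with qsort(), then locate each relation's files with a
lower-bound binary search on the filenode prefix, after which all
matches are contiguous:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int
    cmp_names(const void *a, const void *b)
    {
        return strcmp(*(const char *const *) a, *(const char *const *) b);
    }

    /* Lower bound: index of the first entry >= key. */
    static int
    first_at_or_after(char **names, int n, const char *key)
    {
        int     lo = 0, hi = n;

        while (lo < hi)
        {
            int     mid = (lo + hi) / 2;

            if (strcmp(names[mid], key) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    int
    main(void)
    {
        char       *files[] = {"12346", "12345_fsm", "12345",
                               "12345.1", "12345_vm"};
        int         n = 5;
        const char *prefix = "12345";   /* one relation's filenode */
        size_t      len = strlen(prefix);
        int         i;

        /* Sort once per directory listing... */
        qsort(files, n, sizeof(char *), cmp_names);

        /* ...then each relation needs only one O(log n) search. */
        for (i = first_at_or_after(files, n, prefix);
             i < n && strncmp(files[i], prefix, len) == 0;
             i++)
            printf("transfer %s\n", files[i]);

        return 0;
    }

Real code would also need to check that the character following the
prefix is '.', '_', or end-of-string, so that filenode 1234 does not
match the files of filenode 12345.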

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> Are you sure the server you are dumping out of is head?
>
> I experimented a bit with dumping/restoring 16000 tables matching
> Bruce's test case (ie, one serial column apiece).  The pg_dump profile
> seems fairly flat, without any easy optimization targets.  But
> restoring the dump script shows a rather interesting backend profile:
>
> samples  %        image name               symbol name
> 30861    39.6289  postgres                 AtEOXact_RelationCache
> 9911     12.7268  postgres                 hash_seq_search
...
>
> There are at least three ways we could whack that mole:
>
> * Run the psql script in --single-transaction mode, as I was mumbling
> about the other day.  If we were doing AtEOXact_RelationCache only once,
> rather than once per CREATE TABLE statement, it wouldn't be a problem.
> Easy but has only a narrow scope of applicability.

That is effective when loading into 9.3 (assuming you make
max_locks_per_transaction large enough).  But when loading into  <9.3,
using --single-transaction will evoke the quadratic behavior in the
resource owner/lock table and make things worse rather than better.


But there is still the question of how people can start using 9.3 if
they can't use pg_upgrade, or use the pg_dump half of the dump/restore
in order to get there.

It seems to me that pg_upgrade takes some pains to ensure that no one
else attaches to the database during its operation.  In that case, is
it necessary to run the entire dump in a single transaction in order
to get a consistent picture?  The attached crude patch allows pg_dump
to not use a single transaction (and thus not accumulate a huge number
of locks) by using the --pg_upgrade flag.

This seems to remove the quadratic behavior of running pg_dump against
pre-9.3 servers.  It is linear up to 30,000 tables with a single
serial column, at about 1.5 msec per table.

I have no evidence other than a gut feeling that this is a safe thing to do.

I've also tested Tatsuo-san's group-"LOCK TABLE" patch against this
case, and it is of minimal help.  The problem is that there is no syntax
for locking sequences, so they cannot be explicitly locked as a group
but rather are implicitly locked one by one and so still suffer from
the quadratic behavior.

Cheers,

Jeff


Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Thu, Nov  8, 2012 at 08:59:21PM -0800, Jeff Janes wrote:
> On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Thu, Nov  8, 2012 at 03:46:09PM -0800, Jeff Janes wrote:
> >> On Wed, Nov 7, 2012 at 6:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
> >> > As a followup to Magnus's report that pg_upgrade was slow for many
> >> > tables, I did some more testing with many tables, e.g.:
> >> >
> >> ...
> >> >
> >> > Any ideas?  I am attaching my test script.
> >>
> >> Have you reviewed the thread at:
> >> http://archives.postgresql.org/pgsql-performance/2012-09/msg00003.php
> >> ?
> >>
> >> There is a known N^2 behavior when using pg_dump against pre-9.3 servers.
> >
> > I am actually now dumping git head/9.3, so I assume all the problems we
> > know about should be fixed.
>
> Are you sure the server you are dumping out of is head?

Well, I tested again with 9.2 dumping/loading 9.2 and the same for git
head, and got these results:

                   pg_dump            restore
               9.2       git       9.2       git

        1      0.13      0.11      0.07      0.07
     1000      4.37      3.98      4.32      5.28
     2000     12.98     12.19     13.64     14.25
     4000     47.85     50.14     61.31     70.97
     8000    210.39    183.00    302.67    294.20
    16000    901.53    769.83   1399.25   1359.09

As you can see, there is very little difference between 9.2 and git
head, except maybe at the 16k level for pg_dump.

Is there some slowdown with a mismatched version dump/reload?  I am
attaching my test script.

> Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to
> dump 16,000 tables (schema only) like your example, and it is
> definitely quadratic.

Are you using a SERIAL column for the tables?  I am, and Peter
Eisentraut reported that was a big slowdown.

> But using head's pg_dump to dump tables out of head's server, it only
> took 24.95 seconds, and the quadratic term is not yet important,
> things still look linear.

Again, using SERIAL?

> But even the 179.11 seconds is several times faster than your report
> of 757.8, so I'm not sure what is going on there.  I don't think my
> laptop is particularly fast:
>
> Intel(R) Pentium(R) CPU B960 @ 2.20GHz

I am using server-grade hardware, Xeon E5620 2.4GHz:

    http://momjian.us/main/blogs/pgblog/2012.html#January_20_2012

> Is the next value, increment, etc. for a sequence stored in a catalog,
> or are they stored in the 8kb file associated with each sequence?  If

Each sequence is stored in its own 1-row 8k table:

    test=> CREATE SEQUENCE seq;
    CREATE SEQUENCE

    test=> SELECT * FROM seq;
    -[ RECORD 1 ]-+--------------------
    sequence_name | seq
    last_value    | 1
    start_value   | 1
    increment_by  | 1
    max_value     | 9223372036854775807
    min_value     | 1
    cache_value   | 1
    log_cnt       | 0
    is_cycled     | f
    is_called     | f

> they are stored in the file, then it is a shame that pg_dump goes to the
> effort of extracting that info if pg_upgrade is just going to
> overwrite it anyway.

Actually, pg_upgrade needs pg_dump to restore all those sequence values.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Further pg_upgrade analysis for many tables

From
Andres Freund
Date:
On 2012-11-08 12:30:11 -0500, Peter Eisentraut wrote:
> On 11/7/12 9:17 PM, Bruce Momjian wrote:
> > As a followup to Magnus's report that pg_upgrade was slow for many
> > tables, I did some more testing with many tables, e.g.:
> >
> >     CREATE TABLE test991 (x SERIAL);
> >
> > I ran it for 0, 1k, 2k, ... 16k tables, and got these results:
> >
> >     tables    pg_dump     restore     pg_upgrade(increase)
> >         0       0.30        0.24       11.73(-)
> >      1000       6.46        6.55       28.79(2.45x)
> >      2000      29.82       20.96       69.75(2.42x)
> >      4000      95.70      115.88      289.82(4.16x)
> >      8000     405.38      505.93     1168.60(4.03x)
> >     16000    1702.23     2197.56     5022.82(4.30x)
>
> I can reproduce these numbers, more or less.  (Additionally, it ran out
> of shared memory with the default setting when dumping the 8000 tables.)
>
> But this issue seems to be entirely the fault of sequences being
> present.  When I replace the serial column with an int, everything
> finishes within seconds and scales seemingly linearly.

I don't know the pg_dump code at all but I would guess that without the
serial there are no dependencies, so the whole dependency sorting
business doesn't need to do very much...

Greetings,

Andres Freund



Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Thu, Nov 8, 2012 at 7:25 PM, Bruce Momjian <bruce@momjian.us> wrote:
>
> I did some more research and realized that I was not using --schema-only
> like pg_upgrade uses.  With that setting, things look like this:
>
...

For profiling pg_dump in isolation, you should also specify
--binary-upgrade.  I was surprised that it makes a big difference,
slowing it down by about 2 fold.

Cheers,

Jeff



Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Thu, Nov  8, 2012 at 08:59:21PM -0800, Jeff Janes wrote:
>> On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> >
>> > I am actually now dumping git head/9.3, so I assume all the problems we
>> > know about should be fixed.
>>
>> Are you sure the server you are dumping out of is head?
>
> Well, I tested again with 9.2 dumping/loading 9.2 and the same for git
> head, and got these results:
>
>                        pg_dump            restore
>                    9.2       git       9.2       git
>
>             1      0.13      0.11      0.07      0.07
>          1000      4.37      3.98      4.32      5.28
>          2000     12.98     12.19     13.64     14.25
>          4000     47.85     50.14     61.31     70.97
>          8000    210.39    183.00    302.67    294.20
>         16000    901.53    769.83   1399.25   1359.09

For pg_dump, there are 4 possible combinations, not just two.  You can
use 9.2's pg_dump to dump from a 9.2 server, use git's pg_dump to dump
from a 9.2 server, use git's pg_dump to dump from a git server, or use
9.2's pg_dump to dump from a git server (although that last one isn't
very relevant)

>
> As you can see, there is very little difference between 9.2 and git
> head, except maybe at the 16k level for pg_dump.
>
> Is there some slowdown with a mismatched version dump/reload?  I am
> attaching my test script.

Sorry, from the script I can't really tell what versions are being
used for what.


>
>> Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to
>> dump 16,000 tables (schema only) like your example, and it is
>> definitely quadratic.
>
> Are you using a SERIAL column for the tables?  I am, and Peter
> Eisentraut reported that was a big slowdown.

Yes, I'm using the same table definition as your example.


>
>> But using head's pg_dump to dump tables out of head's server, it only
>> took 24.95 seconds, and the quadratic term is not yet important,
>> things still look linear.
>
> Again, using SERIAL?

Yep.

>> Is the next value, increment, etc. for a sequence stored in a catalog,
>> or are they stored in the 8kb file associated with each sequence?  If
>
> Each sequence is stored in its own 1-row 8k table:
>
>         test=> CREATE SEQUENCE seq;
>         CREATE SEQUENCE
>
>         test=> SELECT * FROM seq;
>         -[ RECORD 1 ]-+--------------------
>         sequence_name | seq
>         last_value    | 1
>         start_value   | 1
>         increment_by  | 1
>         max_value     | 9223372036854775807
>         min_value     | 1
>         cache_value   | 1
>         log_cnt       | 0
>         is_cycled     | f
>         is_called     | f
>
>> they are stored in the file, then it is a shame that pg_dump goes to the
>> effort of extracting that info if pg_upgrade is just going to
>> overwrite it anyway.
>
> Actually, pg_upgrade needs pg_dump to restore all those sequence values.

I did an experiment where I had pg_dump just output dummy values
rather than hitting the database.  Once pg_upgrade moves the relation
files over, the dummy values disappear and are set back to their
originals.  So I think that pg_upgrade depends on pg_dump only in a
trivial way--they need to be there, but it doesn't matter what they
are.

Cheers,

Jeff



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Fri, Nov  9, 2012 at 08:20:59AM +0200, Ants Aasma wrote:
> On Fri, Nov 9, 2012 at 7:53 AM, Ants Aasma <ants@cybertec.at> wrote:
> > I also took two profiles (attached). AtEOXact_RelationCache seems to
> > be the culprit for the quadratic growth.
> 
> One more thing that jumps out as quadratic from the profiles is
> transfer_all_new_dbs from pg_upgrade (20% of total CPU time at 64k).
> Searching for non-primary files loops over the whole file list for
> each relation. This would be a lot faster if we would sort the file
> list first and use binary search to find the related files.

I am confused why you see a loop.  transfer_all_new_dbs() does a
merge-join of old/new database names, then calls gen_db_file_maps(),
which loops over the relations and calls create_rel_filename_map(),
which adds to the map via array indexing.   I don't see any file loops
in there --- can you be more specific?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Fri, Nov  9, 2012 at 04:23:40PM -0800, Jeff Janes wrote:
> On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Thu, Nov  8, 2012 at 08:59:21PM -0800, Jeff Janes wrote:
> >> On Thu, Nov 8, 2012 at 4:33 PM, Bruce Momjian <bruce@momjian.us> wrote:
> >> >
> >> > I am actually now dumping git head/9.3, so I assume all the problems we
> >> > know about should be fixed.
> >>
> >> Are you sure the server you are dumping out of is head?
> >
> > Well, I tested again with 9.2 dumping/loading 9.2 and the same for git
> > head, and got these results:
> >
> >                        pg_dump            restore
> >                    9.2       git       9.2       git
> >
> >             1      0.13      0.11      0.07      0.07
> >          1000      4.37      3.98      4.32      5.28
> >          2000     12.98     12.19     13.64     14.25
> >          4000     47.85     50.14     61.31     70.97
> >          8000    210.39    183.00    302.67    294.20
> >         16000    901.53    769.83   1399.25   1359.09
> 
> For pg_dump, there are 4 possible combinations, not just two.  You can
> use 9.2's pg_dump to dump from a 9.2 server, use git's pg_dump to dump
> from a 9.2 server, use git's pg_dump to dump from a git server, or use
> 9.2's pg_dump to dump from a git server (although that last one isn't
> very relevant)

True, but I thought doing matching versions was a sufficient test.

> >> Using head's pg_dump, but 9.2.1 server, it takes me 179.11 seconds to
> >> dump 16,000 tables (schema only) like your example, and it is
> >> definitely quadratic.
> >
> > Are you using a SERIAL column for the tables?  I am, and Peter
> > Eisentraut reported that was a big slowdown.
> 
> Yes, I'm using the same table definition as your example.

OK.

> >> But using head's pg_dump to dump tables out of head's server, it only
> >> took 24.95 seconds, and the quadratic term is not yet important,
> >> things still look linear.
> >
> > Again, using SERIAL?
> 
> Yep.

Odd why yours is so much faster.

> >> they are stored in the file, then it is a shame that pg_dump goes to the
> >> effort of extracting that info if pg_upgrade is just going to
> >> overwrite it anyway.
> >
> > Actually, pg_upgrade needs pg_dump to restore all those sequence values.
> 
> I did an experiment where I had pg_dump just output dummy values
> rather than hitting the database.  Once pg_upgrade moves the relation
> files over, the dummy values disappear and are set back to their
> originals.  So I think that pg_upgrade depends on pg_dump only in a
> trivial way--they need to be there, but it doesn't matter what they
> are.

Oh, wow, I had not thought of that.  Once we move the sequence files
into place from the old cluster, whatever was assigned to the sequence
counter by the pg_dump restore is thrown away.  Good point.

I am hesitant to add an optimization to pg_dump to fix this unless we
decide that pg_dump uses sequences in some non-optimal way that would
not warrant us improving general sequence creation performance.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Fri, Nov  9, 2012 at 04:23:40PM -0800, Jeff Janes wrote:
> > Actually, pg_upgrade needs pg_dump to restore all those sequence values.
> 
> I did an experiment where I had pg_dump just output dummy values
> rather than hitting the database.  Once pg_upgrade moves the relation
> files over, the dummy values disappear and are set back to their
> originals.  So I think that pg_upgrade depends on pg_dump only in a
> trivial way--they need to be there, but it doesn't matter what they
> are.

FYI, thanks everyone for testing this.  I will keep going on my tests
--- seems I have even more things to try in my benchmarks.  I will
publish my results soon.

In general, I think we are getting some complaints about dump/restore
performance with a large number of tables, regardless of pg_upgrade,
so it seems worthwhile to try to find the cause.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Ants Aasma
Date:
On Sat, Nov 10, 2012 at 7:10 PM, Bruce Momjian <bruce@momjian.us> wrote:
> I am confused why you see a loop.  transfer_all_new_dbs() does a
> merge-join of old/new database names, then calls gen_db_file_maps(),
> which loops over the relations and calls create_rel_filename_map(),
> which adds to the map via array indexing.   I don't see any file loops
> in there --- can you be more specific?

Sorry, I was too tired when posting that. I actually meant
transfer_single_new_db(). More specifically the profile clearly showed
that most of the time was spent in the two loops starting on lines 193
and 228.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Sat, Nov 10, 2012 at 07:17:34PM +0200, Ants Aasma wrote:
> On Sat, Nov 10, 2012 at 7:10 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > I am confused why you see a loop.  transfer_all_new_dbs() does a
> > merge-join of old/new database names, then calls gen_db_file_maps(),
> > which loops over the relations and calls create_rel_filename_map(),
> > which adds to the map via array indexing.   I don't see any file loops
> > in there --- can you be more specific?
> 
> Sorry, I was too tired when posting that. I actually meant
> transfer_single_new_db(). More specifically the profile clearly showed
> that most of the time was spent in the two loops starting on lines 193
> and 228.

Wow, you are right on target.  I was so focused on making logical
lookups linear that I did not consider file system vm/fsm and file
extent lookups.  Let me think a little and I will report back.
Thanks.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Fri, Nov  9, 2012 at 04:06:38PM -0800, Jeff Janes wrote:
> On Thu, Nov 8, 2012 at 7:25 PM, Bruce Momjian <bruce@momjian.us> wrote:
> >
> > I did some more research and realized that I was not using --schema-only
> > like pg_upgrade uses.  With that setting, things look like this:
> >
> ...
> 
> For profiling pg_dump in isolation, you should also specify
> --binary-upgrade.  I was surprised that it makes a big difference,
> slowing it down by about 2 fold.

Yes, I see that now:
                  pg_dump vs. pg_dump --binary-upgrade
               9.2      w/ b-u     git      w/ b-u    pg_upgrade
        1      0.13      0.13      0.11      0.13       11.73
     1000      4.37      8.18      3.98      8.08       28.79
     2000     12.98     33.29     12.19     28.11       69.75
     4000     47.85    140.62     50.14    138.02      289.82
     8000    210.39    604.95    183.00    517.35     1168.60
    16000    901.53   2373.79    769.83   1975.94     5022.82

I didn't show the restore numbers yet because I haven't gotten automated
pg_dump --binary-upgrade restore to work yet, but a normal restore for
16k takes 2197.56, so adding that to 1975.94, you get 4173.5, which is
83% of 5022.82.  That is a big chunk of the total time for pg_upgrade.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Sat, Nov 10, 2012 at 9:15 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Nov  9, 2012 at 04:23:40PM -0800, Jeff Janes wrote:
>> On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> >
>> > Again, using SERIAL?
>>
>> Yep.
>
> Odd why yours is so much faster.

You didn't build git head under --enable-cassert, did you?

Any chance you can do an oprofile or gprof of head's pg_dump dumping
out of head's server?  That really should be a lot faster (since
commit eeb6f37d89fc60c6449ca12ef9e) than dumping out of a 9.2 server.
If it is not for you, I don't see how to figure it out without a
profile of the slow system.

Cheers,

Jeff



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Sat, Nov 10, 2012 at 05:20:55PM -0500, Bruce Momjian wrote:
> On Fri, Nov  9, 2012 at 04:06:38PM -0800, Jeff Janes wrote:
> > On Thu, Nov 8, 2012 at 7:25 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > >
> > > I did some more research and realized that I was not using --schema-only
> > > like pg_upgrade uses.  With that setting, things look like this:
> > >
> > ...
> > 
> > For profiling pg_dump in isolation, you should also specify
> > --binary-upgrade.  I was surprised that it makes a big difference,
> > slowing it down by about 2 fold.
> 
> Yes, I see that now:
> 
>                   pg_dump vs. pg_dump --binary-upgrade
>                9.2      w/ b-u     git      w/ b-u    pg_upgrade
>         1      0.13      0.13      0.11      0.13       11.73
>      1000      4.37      8.18      3.98      8.08       28.79
>      2000     12.98     33.29     12.19     28.11       69.75
>      4000     47.85    140.62     50.14    138.02      289.82
>      8000    210.39    604.95    183.00    517.35     1168.60
>     16000    901.53   2373.79    769.83   1975.94     5022.82
> 
> I didn't show the restore numbers yet because I haven't gotten automated
> pg_dump --binary-upgrade restore to work yet, but a normal restore for
> 16k takes 2197.56, so adding that to 1975.94, you get 4173.5, which is
> 83% of 5022.82.  That is a big chunk of the total time for pg_upgrade.

What I am seeing here is the same 4x increase for a 2x increase in the
number of tables.  Something must be going on there.  I have oprofile
set up, so I will try to run oprofile and try to find which functions
are taking up most of the time, though I am confused why Tom didn't see
any obvious causes.  I will keep going, and will focus on git head, and
schema-only, non-binary-upgrade mode, for simplicity.  I am just not
seeing 9.2 or --binary-upgrade causing any fundamental effects ---
pg_dump --schema-only itself has the same problems, and probably the
same cause.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Sat, Nov 10, 2012 at 02:45:54PM -0800, Jeff Janes wrote:
> On Sat, Nov 10, 2012 at 9:15 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Fri, Nov  9, 2012 at 04:23:40PM -0800, Jeff Janes wrote:
> >> On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian <bruce@momjian.us> wrote:
> >> >
> >> > Again, using SERIAL?
> >>
> >> Yep.
> >
> > Odd why yours is so much faster.
> 
> You didn't build git head under --enable-cassert, did you?

Yikes, you got me!  I have not done performance testing in so long, I
had forgotten I changed my defaults.  New numbers to follow.  Sorry.

> Any chance you can do an oprofile or gprof of head's pg_dump dumping
> out of head's server?  That really should be a lot faster (since
> commit eeb6f37d89fc60c6449ca12ef9e) than dumping out of a 9.2 server.
> If it is not for you, I don't see how to figure it out without a
> profile of the slow system.

Yes, coming.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Sat, Nov 10, 2012 at 05:59:54PM -0500, Bruce Momjian wrote:
> On Sat, Nov 10, 2012 at 02:45:54PM -0800, Jeff Janes wrote:
> > On Sat, Nov 10, 2012 at 9:15 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > > On Fri, Nov  9, 2012 at 04:23:40PM -0800, Jeff Janes wrote:
> > >> On Fri, Nov 9, 2012 at 3:06 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > >> >
> > >> > Again, using SERIAL?
> > >>
> > >> Yep.
> > >
> > > Odd why yours is so much faster.
> > 
> > You didn't build git head under --enable-cassert, did you?
> 
> Yikes, you got me!  I have not done performance testing in so long, I
> had forgotten I changed my defaults.  New numbers to follow.  Sorry.
> 
> > Any chance you can do an oprofile or gprof of head's pg_dump dumping
> > out of head's server?  That really should be a lot faster (since
> > commit eeb6f37d89fc60c6449ca12ef9e) than dumping out of a 9.2 server.
> > If it is not for you, I don't see how to figure it out without a
> > profile of the slow system.
> 
> Yes, coming.

OK, here are my results.  Again, apologies for posting non-linear
results based on assert builds:
             ----------  9.2 ----------  ------------ 9.3 --------
             -- normal -- -- bin-up --   -- normal -- -- bin-up --
             dump   rest   dump   rest   dump   rest   dump   rest  pg_upgrade
      1      0.12   0.06   0.12   0.06   0.11   0.07   0.11   0.07   11.11
   1000      7.22   2.40   4.74   2.78   2.20   2.43   4.04   2.86   19.60
   2000      5.67   5.10   8.82   5.57   4.50   4.97   8.07   5.69   30.55
   4000     13.34  11.13  25.16  12.52   8.95  11.24  16.75  12.16   60.70
   8000     29.12  25.98  59.60  28.08  16.68  24.02  30.63  27.08  123.05
  16000     87.36  53.16 189.38  62.72  31.38  55.37  61.55  62.66  365.71


You can see the non-linear dump at 16k in 9.2, and the almost-linear in
9.3.  :-)

pg_upgrade shows non-linear, but that is probably because of the
non-linear behavior of 9.2, and because of the two non-linear loops that
Ants found, which I will address in a separate email.

Thanks for the feedback.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Fri, Nov  9, 2012 at 12:50:34AM -0500, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > Are you sure the server you are dumping out of is head?
> 
> I experimented a bit with dumping/restoring 16000 tables matching
> Bruce's test case (ie, one serial column apiece).  The pg_dump profile
> seems fairly flat, without any easy optimization targets.  But
> restoring the dump script shows a rather interesting backend profile:
> 
> samples  %        image name               symbol name
> 30861    39.6289  postgres                 AtEOXact_RelationCache
> 9911     12.7268  postgres                 hash_seq_search
> 2682      3.4440  postgres                 init_sequence
> 2218      2.8482  postgres                 _bt_compare
> 2120      2.7223  postgres                 hash_search_with_hash_value
> 1976      2.5374  postgres                 XLogInsert
> 1429      1.8350  postgres                 CatalogCacheIdInvalidate
> 1282      1.6462  postgres                 LWLockAcquire
> 973       1.2494  postgres                 LWLockRelease
> 702       0.9014  postgres                 hash_any
> 
> The hash_seq_search time is probably mostly associated with
> AtEOXact_RelationCache, which is run during transaction commit and scans
> the relcache hashtable looking for tables created in the current
> transaction.  So that's about 50% of the runtime going into that one
> activity.

Thanks for finding this.  What is odd is that I am not seeing non-linear
restores at 16k in git head, so I am confused how something that
consumes ~50% of backend time could still perform linearly.  Would this
consume 50% at lower table counts?

I agree we should do something, even if this is a rare case, because 50%
is a large percentage.

> There are at least three ways we could whack that mole:
> 
> * Run the psql script in --single-transaction mode, as I was mumbling
> about the other day.  If we were doing AtEOXact_RelationCache only once,
> rather than once per CREATE TABLE statement, it wouldn't be a problem.
> Easy but has only a narrow scope of applicability.
> 
> * Keep a separate list (or data structure of your choice) so that
> relcache entries created in the current xact could be found directly
> rather than having to scan the whole relcache.  That'd add complexity
> though, and could perhaps be a net loss for cases where the relcache
> isn't so bloated.

I like this one.  Could we do it only when the cache gets to be above a
certain size, to avoid any penalty?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> On Fri, Nov  9, 2012 at 12:50:34AM -0500, Tom Lane wrote:
>> The hash_seq_search time is probably mostly associated with
>> AtEOXact_RelationCache, which is run during transaction commit and scans
>> the relcache hashtable looking for tables created in the current
>> transaction.  So that's about 50% of the runtime going into that one
>> activity.

> Thanks for finding this.  What is odd is that I am not seeing non-linear
> restores at 16k in git head, so I am confused how something that
> consumes ~50% of backend time could still perform linearly.  Would this
> consume 50% at lower table counts?

No, the cost from that is O(N^2), though with a pretty small multiplier.
16K tables is evidently where the cost reaches the point of being
significant --- if you went up from there, you'd probably start to
notice an overall O(N^2) behavior.
        regards, tom lane



Re: Further pg_upgrade analysis for many tables

From
Robert Haas
Date:
On Fri, Nov 9, 2012 at 12:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> Are you sure the server you are dumping out of is head?
>
> I experimented a bit with dumping/restoring 16000 tables matching
> Bruce's test case (ie, one serial column apiece).  The pg_dump profile
> seems fairly flat, without any easy optimization targets.  But
> restoring the dump script shows a rather interesting backend profile:
>
> samples  %        image name               symbol name
> 30861    39.6289  postgres                 AtEOXact_RelationCache
> 9911     12.7268  postgres                 hash_seq_search
> 2682      3.4440  postgres                 init_sequence
> 2218      2.8482  postgres                 _bt_compare
> 2120      2.7223  postgres                 hash_search_with_hash_value
> 1976      2.5374  postgres                 XLogInsert
> 1429      1.8350  postgres                 CatalogCacheIdInvalidate
> 1282      1.6462  postgres                 LWLockAcquire
> 973       1.2494  postgres                 LWLockRelease
> 702       0.9014  postgres                 hash_any
>
> The hash_seq_search time is probably mostly associated with
> AtEOXact_RelationCache, which is run during transaction commit and scans
> the relcache hashtable looking for tables created in the current
> transaction.  So that's about 50% of the runtime going into that one
> activity.
>
> There are at least three ways we could whack that mole:
>
> * Run the psql script in --single-transaction mode, as I was mumbling
> about the other day.  If we were doing AtEOXact_RelationCache only once,
> rather than once per CREATE TABLE statement, it wouldn't be a problem.
> Easy but has only a narrow scope of applicability.
>
> * Keep a separate list (or data structure of your choice) so that
> relcache entries created in the current xact could be found directly
> rather than having to scan the whole relcache.  That'd add complexity
> though, and could perhaps be a net loss for cases where the relcache
> isn't so bloated.
>
> * Limit the size of the relcache (eg by aging out
> not-recently-referenced entries) so that we aren't incurring O(N^2)
> costs for scripts touching N tables.  Again, this adds complexity and
> could be counterproductive in some scenarios.

Although there may be some workloads that access very large numbers of
tables repeatedly, I bet that's not typical.  Rather, I bet that a
session which accesses 10,000 tables is most likely to access them
just once each - and right now we don't handle that case very well;
this is not the first complaint about big relcaches causing problems.
On the flip side, we don't want workloads that exceed some baked-in
cache size to fall off a cliff.  So I think we should be looking for a
solution that doesn't put a hard limit on the size of the relcache,
but does provide at least some latitude to get rid of old entries.

So maybe something like this.  Add a flag to each relcache entry
indicating whether or not it has been used.  After adding 1024 entries
to the relcache, scan all the entries: clear the flag if it's set,
flush the entry if it's already clear.  This allows the size of the
relcache to grow without bound, but only if we're continuing to access
the old tables in between adding new ones to the mix.  As an
additional safeguard, we could count the number of toplevel SQL
commands that have been executed and require that a flush not be
performed more often than, say, every 64 toplevel SQL commands.  That
way, if a single operation on an inheritance parent with many children
sucks a lot of stuff into the relcache, we'll avoid cleaning it out
too quickly.
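
A rough sketch of that scheme, using stand-in types rather than the
real relcache structures (it is essentially a second-chance/clock
eviction policy):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct CacheEntry
    {
        bool    valid;      /* stand-in for "entry is present" */
        bool    used;       /* referenced since the last sweep? */
    } CacheEntry;

    static int  inserts_since_sweep = 0;
    static int  commands_since_sweep = 0;

    /* Call on every cache lookup hit. */
    static void
    cache_mark_used(CacheEntry *e)
    {
        e->used = true;
    }

    /* Call once per toplevel SQL command. */
    static void
    cache_note_command(void)
    {
        commands_since_sweep++;
    }

    /*
     * Call after inserting a new entry.  Sweeps only after 1024
     * insertions, and no more often than every 64 toplevel commands,
     * so one statement that pulls in many entries (e.g. an inheritance
     * parent with many children) cannot flush them right away.
     */
    static void
    cache_maybe_sweep(CacheEntry *entries, size_t n)
    {
        size_t  i;

        if (++inserts_since_sweep < 1024 || commands_since_sweep < 64)
            return;
        inserts_since_sweep = 0;
        commands_since_sweep = 0;

        for (i = 0; i < n; i++)
        {
            if (!entries[i].valid)
                continue;
            if (entries[i].used)
                entries[i].used = false;    /* give it a second chance */
            else
                entries[i].valid = false;   /* idle for a full cycle: evict */
        }
    }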

Maybe this is all too ad-hoc, but I feel like we don't need to
overengineer this.  The existing system is fine in 99% of the cases,
so we really only need to find a way to detect the really egregious
case where we are doing a neverending series of one-time table
accesses and apply a very light tap to avoid the pain point in that
case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Sat, Nov 10, 2012 at 12:41:44PM -0500, Bruce Momjian wrote:
> On Sat, Nov 10, 2012 at 07:17:34PM +0200, Ants Aasma wrote:
> > On Sat, Nov 10, 2012 at 7:10 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > > I am confused why you see a loop.  transfer_all_new_dbs() does a
> > > merge-join of old/new database names, then calls gen_db_file_maps(),
> > > which loops over the relations and calls create_rel_filename_map(),
> > > which adds to the map via array indexing.   I don't see any file loops
> > > in there --- can you be more specific?
> > 
> > Sorry, I was too tired when posting that. I actually meant
> > transfer_single_new_db(). More specifically the profile clearly showed
> > that most of the time was spent in the two loops starting on lines 193
> > and 228.
> 
> Wow, you are right on target.  I was so focused on making logical
> lookups linear that I did not consider file system vm/fsm and file
> extent lookups.  Let me think a little and I will report back.
> Thanks.

OK, I have had some time to think about this.  What the current code
does is, for each database, get a directory listing to know about any
vm, fsm, and >1gig extents that exist in the directory.  It caches the
directory listing and does full array scans looking for matches.  If the
tablespace changes, it creates a new directory cache and throws away the
old one.  This code certainly needs improvement!

I can think of two solutions.  The first would be to scan the database
directory, and any tablespaces used by the database, sort it, then allow
binary search of the directory listing looking for file prefixes that
match the current relation.

The second approach would be to simply try to copy the fsm, vm, and
extent files, and ignore any ENOENT errors.  This allows code
simplification.  The downside is that it doesn't pull all files with
matching prefixes --- it requires pg_upgrade to _know_ what suffixes
might exist in that directory.  Second, it assumes there can be no
number gaps in the file extent numbering (is that safe?).
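
A crude sketch of what that second approach could look like, where
copy_file() is a hypothetical helper rather than pg_upgrade's actual
code:

    #include <errno.h>
    #include <stdio.h>

    /* Hypothetical helper: returns 0 on success, -1 with errno set. */
    static int
    copy_file(const char *src, const char *dst)
    {
        char    buf[8192];
        size_t  nread;
        FILE   *in,
               *out;

        if ((in = fopen(src, "rb")) == NULL)
            return -1;              /* ENOENT here just means "absent" */
        if ((out = fopen(dst, "wb")) == NULL)
        {
            int     save_errno = errno;

            fclose(in);
            errno = save_errno;
            return -1;
        }
        while ((nread = fread(buf, 1, sizeof(buf), in)) > 0)
            fwrite(buf, 1, nread, out);
        fclose(in);
        fclose(out);
        return 0;
    }

    /* Copy the main fork plus optional fsm/vm forks and >1gig extents. */
    static void
    transfer_relfiles(const char *old_path, const char *new_path)
    {
        const char *suffixes[] = {"", "_fsm", "_vm"};
        char        src[1024], dst[1024];
        int         i, segno;

        for (i = 0; i < 3; i++)
        {
            snprintf(src, sizeof(src), "%s%s", old_path, suffixes[i]);
            snprintf(dst, sizeof(dst), "%s%s", new_path, suffixes[i]);
            if (copy_file(src, dst) < 0 && errno != ENOENT)
                perror(src);        /* a real error, not a missing fork */
        }

        /* extents .1, .2, ...; assumes no gaps in segment numbering */
        for (segno = 1;; segno++)
        {
            snprintf(src, sizeof(src), "%s.%d", old_path, segno);
            snprintf(dst, sizeof(dst), "%s.%d", new_path, segno);
            if (copy_file(src, dst) < 0)
            {
                if (errno != ENOENT)
                    perror(src);
                break;              /* first missing segment ends the loop */
            }
        }
    }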

I need recommendations on which direction to pursue; this would only be
for 9.3.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Mon, Nov 12, 2012 at 12:09:08PM -0500, Bruce Momjian wrote:
> OK, I have had some time to think about this.  What the current code
> does is, for each database, get a directory listing to know about any
> vm, fsm, and >1gig extents that exist in the directory.  It caches the
> directory listing and does full array scans looking for matches.  If the
> tablespace changes, it creates a new directory cache and throws away the
> old one.  This code certainly needs improvement!
>
> I can think of two solutions.  The first would be to scan the database
> directory, and any tablespaces used by the database, sort it, then allow
> binary search of the directory listing looking for file prefixes that
> match the current relation.
>
> The second approach would be to simply try to copy the fsm, vm, and
> extent files, and ignore any ENOENT errors.  This allows code
> simplification.  The downside is that it doesn't pull all files with
> matching prefixes --- it requires pg_upgrade to _know_ what suffixes
> might exist in that directory.  Second, it assumes there can be no
> number gaps in the file extent numbering (is that safe?).
>
> I need recommendations on which direction to pursue; this would only be
> for 9.3.

I went with the second idea, patch attached.  Here are the times:

             ----------  9.2 ----------  ------------ 9.3 --------
             -- normal -- -- bin-up --   -- normal -- -- bin-up --  pg_upgrade
             dump   rest   dump   rest   dump   rest   dump   rest   git   patch
      1      0.12   0.06   0.12   0.06   0.11   0.07   0.11   0.07  11.11  11.02
   1000      7.22   2.40   4.74   2.78   2.20   2.43   4.04   2.86  19.60  19.25
   2000      5.67   5.10   8.82   5.57   4.50   4.97   8.07   5.69  30.55  26.67
   4000     13.34  11.13  25.16  12.52   8.95  11.24  16.75  12.16  60.70  52.31
   8000     29.12  25.98  59.60  28.08  16.68  24.02  30.63  27.08 123.05 102.78
  16000     87.36  53.16 189.38  62.72  31.38  55.37  61.55  62.66 365.71 286.00

You can see a significant speedup with those loops removed.  The 16k
case is improved, but still not linear.  The 16k dump/restore scale
looks fine, so it must be something in pg_upgrade, or in the kernel.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Mon, Nov 12, 2012 at 03:59:27PM -0500, Bruce Momjian wrote:
> > The second approach would be to simply try to copy the fsm, vm, and
> > extent files, and ignore any ENOENT errors.  This allows code
> > simplification.  The downside is that it doesn't pull all files with
> > matching prefixes --- it requires pg_upgrade to _know_ what suffixes
> > might exist in that directory.  Second, it assumes there can be no
> > number gaps in the file extent numbering (is that safe?).
> > 
> > I need recommendations on which direction to pursue; this would only be
> > for 9.3.
> 
> I went with the second idea, patch attached.  Here are the times:
> 
>              ----------  9.2 ----------  ------------ 9.3 --------
>              -- normal -- -- bin-up --   -- normal -- -- bin-up --  pg_upgrade
>              dump   rest   dump   rest   dump   rest   dump   rest   git   patch
>       1      0.12   0.06   0.12   0.06   0.11   0.07   0.11   0.07  11.11  11.02
>    1000      7.22   2.40   4.74   2.78   2.20   2.43   4.04   2.86  19.60  19.25
>    2000      5.67   5.10   8.82   5.57   4.50   4.97   8.07   5.69  30.55  26.67
>    4000     13.34  11.13  25.16  12.52   8.95  11.24  16.75  12.16  60.70  52.31
>    8000     29.12  25.98  59.60  28.08  16.68  24.02  30.63  27.08 123.05 102.78
>   16000     87.36  53.16 189.38  62.72  31.38  55.37  61.55  62.66 365.71 286.00
> 
> You can see a significant speedup with those loops removed.  The 16k
> case is improved, but still not linear.  The 16k dump/restore scale
> looks fine, so it must be something in pg_upgrade, or in the kernel.

It is possible that the poor 16k pg_upgrade value is caused by the poor
9.2 binary-upgrade number (189.38).  Perhaps I need to hack up
pg_upgrade to allow a 9.3 to 9.3 upgrade to test this.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Alvaro Herrera
Date:
Bruce Momjian wrote:

> --- 17,24 ----
>
>   static void transfer_single_new_db(pageCnvCtx *pageConverter,
>                          FileNameMap *maps, int size);
> ! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
> !                              const char *suffix);

Uh, does this code assume that forks other than the main one are not
split in segments?  I think that's a bug, is it not?

--



Re: Further pg_upgrade analysis for many tables

From
Alvaro Herrera
Date:
Bruce Momjian wrote:

> It is possible that the poor 16k pg_upgrade value is caused by the poor
> 9.2 binary-upgrade number (189.38).  Perhaps I need to hack up
> pg_upgrade to allow a 9.3 to 9.3 upgrade to test this.

Hmm?  This already works, since "make check" uses it, right?

--



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Mon, Nov 12, 2012 at 06:14:59PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> 
> > --- 17,24 ----
> >   
> >   static void transfer_single_new_db(pageCnvCtx *pageConverter,
> >                          FileNameMap *maps, int size);
> > ! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
> > !                              const char *suffix);
> 
> Uh, does this code assume that forks other than the main one are not
> split in segments?  I think that's a bug, is it not?

Oh, yeah, I must have fixed this long ago.  It only fails if you use
tablespaces:
    if (os_info.num_tablespaces > 0 &&
        strcmp(old_cluster.tablespace_suffix, new_cluster.tablespace_suffix) == 0)
        pg_log(PG_FATAL,
               "Cannot upgrade to/from the same system catalog version when\n"
               "using tablespaces.\n");

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Mon, Nov 12, 2012 at 12:09:08PM -0500, Bruce Momjian wrote:
> The second approach would be to simply try to copy the fsm, vm, and
> extent files, and ignore any ENOENT errors.  This allows code
> simplification.  The downside is that it doesn't pull all files with
> matching prefixes --- it requires pg_upgrade to _know_ what suffixes
> might exist in that directory.  Second, it assumes there can be no
> number gaps in the file extent numbering (is that safe?).

Seems our code does the same kind of segment number looping I was
suggesting for pg_upgrade, so I think I am safe:
    /*
     * Note that because we loop until getting ENOENT, we will correctly
     * remove all inactive segments as well as active ones.
     */
    for (segno = 1;; segno++)
    {
        sprintf(segpath, "%s.%u", path, segno);
        if (unlink(segpath) < 0)
        {
            /* ENOENT is expected after the last segment... */
            if (errno != ENOENT)
                ereport(WARNING,
                        (errcode_for_file_access(),
                         errmsg("could not remove file \"%s\": %m", segpath)));
            break;
        }
    }

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Fri, Nov 9, 2012 at 10:50 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> Are sure the server you are dumping out of is head?
>>
>> I experimented a bit with dumping/restoring 16000 tables matching
>> Bruce's test case (ie, one serial column apiece).  The pg_dump profile
>> seems fairly flat, without any easy optimization targets.  But
>> restoring the dump script shows a rather interesting backend profile:
>>
>> samples  %        image name               symbol name
>> 30861    39.6289  postgres                 AtEOXact_RelationCache
>> 9911     12.7268  postgres                 hash_seq_search
> ...
>>
>> There are at least three ways we could whack that mole:
>>
>> * Run the psql script in --single-transaction mode, as I was mumbling
>> about the other day.  If we were doing AtEOXact_RelationCache only once,
>> rather than once per CREATE TABLE statement, it wouldn't be a problem.
>> Easy but has only a narrow scope of applicability.
>
> That is effective when loading into 9.3 (assuming you make
> max_locks_per_transaction large enough).  But when loading into  <9.3,
> using --single-transaction will evoke the quadratic behavior in the
> resource owner/lock table and make things worse rather than better.

Using --single-transaction gets around the quadratic behavior in
AtEOXact_RelationCache, but it activates another quadratic behavior,
this one in "get_tabstat_entry".  That is a good trade-off because the
latter has a lower constant, but it is still going to bite.
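
For context, the shape of that bottleneck is roughly the following (a simplification for illustration, not the actual pgstat.c code): each table touched in the transaction is found, or appended, by a linear scan of the backend's table-stats list, so touching N tables in one transaction costs O(N^2) comparisons before the stats are flushed at commit.

    #include <stdlib.h>

    typedef struct TabStat
    {
        unsigned int t_id;              /* relation OID */
        long        tuples_inserted;    /* ...and other counters... */
        struct TabStat *next;
    } TabStat;

    static TabStat *tabstat_list = NULL;

    /* find-or-create the stats entry for one relation: O(list length) */
    static TabStat *
    get_tabstat_entry(unsigned int rel_id)
    {
        TabStat    *entry;

        for (entry = tabstat_list; entry != NULL; entry = entry->next)
            if (entry->t_id == rel_id)
                return entry;

        entry = calloc(1, sizeof(TabStat));     /* sketch: no error check */
        entry->t_id = rel_id;
        entry->next = tabstat_list;
        tabstat_list = entry;
        return entry;
    }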

Cheers,

Jeff



Re: Further pg_upgrade analysis for many tables

From
Simon Riggs
Date:
On 12 November 2012 16:51, Robert Haas <robertmhaas@gmail.com> wrote:

> Although there may be some workloads that access very large numbers of
> tables repeatedly, I bet that's not typical.

Transactions with large numbers of DDL statements are typical at
upgrade (application or database release level) and the execution time
of those is critical to availability.

I'm guessing you mean large numbers of tables and accessing each one
multiple times?

> Rather, I bet that a
> session which accesses 10,000 tables is most likely to access them
> just once each - and right now we don't handle that case very well;
> this is not the first complaint about big relcaches causing problems.

pg_restore frequently accesses tables more than once as it runs, but
not more than a dozen times each, counting all types of DDL.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Further pg_upgrade analysis for many tables

From
Robert Haas
Date:
On Mon, Nov 12, 2012 at 5:17 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 12 November 2012 16:51, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> Although there may be some workloads that access very large numbers of
>> tables repeatedly, I bet that's not typical.
>
> Transactions with large numbers of DDL statements are typical at
> upgrade (application or database release level) and the execution time
> of those is critical to availability.
>
> I'm guessing you mean large numbers of tables and accessing each one
> multiple times?

Yes, that is what I meant.

>> Rather, I bet that a
>> session which accesses 10,000 tables is most likely to access them
>> just once each - and right now we don't handle that case very well;
>> this is not the first complaint about big relcaches causing problems.
>
> pg_restore frequently accesses tables more than once as it runs, but
> not more than a dozen times each, counting all types of DDL.

Hmm... yeah.  Some of those accesses are probably one right after
another so any cache-flushing behavior would be fine; but index
creations for example might happen quite a bit later in the file,
IIRC.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Further pg_upgrade analysis for many tables

From
Ants Aasma
Date:
On Mon, Nov 12, 2012 at 10:59 PM, Bruce Momjian <bruce@momjian.us> wrote:
> You can see a significant speedup with those loops removed.  The 16k
> case is improved, but still not linear.  The 16k dump/restore scale
> looks fine, so it must be something in pg_upgrade, or in the kernel.

I can confirm the speedup. Profiling results for 9.3 to 9.3 upgrade
for 8k and 64k tables are attached. pg_upgrade itself is now taking
negligible time.

The 64k profile shows the AtEOXact_RelationCache scaling problem. For
the 8k profile nothing really pops out as a clear bottleneck. CPU time
distributes 83.1% to postgres, 4.9% to pg_dump, 7.4% to psql and 0.7%
to pg_upgrade.

Postgres time itself breaks down with 10% for shutdown checkpoint and
90% for regular running, consisting of 16% parsing, 13% analyze, 20%
plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network.

It looks to me that most benefit could be had from introducing more
parallelism. Are there any large roadblocks to pipelining the dump and
restore to have them happen in parallel?

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

Attachment

Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Tue, Nov 13, 2012 at 05:44:54AM +0200, Ants Aasma wrote:
> On Mon, Nov 12, 2012 at 10:59 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > You can see a significant speedup with those loops removed.  The 16k
> > case is improved, but still not linear.  The 16k dump/restore scale
> > looks fine, so it must be something in pg_upgrade, or in the kernel.
>
> I can confirm the speedup. Profiling results for 9.3 to 9.3 upgrade
> for 8k and 64k tables are attached. pg_upgrade itself is now taking
> negligible time.

I generated these timings from the attached test script.

                 -------------------------- 9.3 ------------------------
                 ---- normal ----  -- binary_upgrade --  -- pg_upgrade -
                - dmp -   - res -   - dmp -   - res -    git       patch
        1         0.12      0.07      0.13      0.07     11.06     11.02
     1000         2.20      2.46      3.57      2.82     19.15     18.61
     2000         4.51      5.01      8.22      5.80     29.12     26.89
     4000         8.97     10.88     14.76     12.43     45.87     43.08
     8000        15.30     24.72     30.57     27.10    100.31     79.75
    16000        36.14     54.88     62.27     61.69    248.03    167.94
    32000        55.29    162.20    115.16    179.15    695.05    376.84
    64000       149.86    716.46    265.77    724.32   2323.73   1122.38

You can see the speedup of the patch, particularly for a greater number
of tables, e.g. 2x faster for 64k tables.

> The 64k profile shows the AtEOXact_RelationCache scaling problem. For
> the 8k profile nothing really pops out as a clear bottleneck. CPU time
> distributes 83.1% to postgres, 4.9% to pg_dump, 7.4% to psql and 0.7%
> to pg_upgrade.

At 64k I see pg_upgrade taking 12% of the duration time, if I subtract
out the dump/restore times.

I am attaching an updated pg_upgrade patch, which I believe is ready for
application for 9.3.

> Postgres time itself breaks down with 10% for shutdown checkpoint and
> 90% for regular running, consisting of 16% parsing, 13% analyze, 20%
> plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network.

That SVG graph was quite impressive.

> It looks to me that most benefit could be had from introducing more
> parallelism. Are there any large roadblocks to pipelining the dump and
> restore to have them happen in parallel?

I talked to Andrew Dunstan about parallelization in pg_restore.  First,
we currently use pg_dumpall, which isn't in the custom format required
for parallel restore, but if we changed to custom format, create table
isn't done in parallel, only create index/check constraints, and trigger
creation, etc.  Not sure if it is worth pursuing this just for pg_upgrade.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Attachment

Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Tue, Nov 13, 2012 at 07:03:51PM -0500, Bruce Momjian wrote:
> I am attaching an updated pg_upgrade patch, which I believe is ready for
> application for 9.3.

Correction, here is the proper patch.  The previously posted version
had pending merges from the master branch.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Attachment

Re: Further pg_upgrade analysis for many tables

From
Ants Aasma
Date:
On Wed, Nov 14, 2012 at 2:03 AM, Bruce Momjian <bruce@momjian.us> wrote:
> At 64k I see pg_upgrade taking 12% of the duration time, if I subtract
> out the dump/restore times.

My percentage numbers only included CPU time and I used SSD storage.
For the most part there was no IO wait to speak of, but it's
completely expected that thousands of link calls are not free.

>> Postgres time itself breaks down with 10% for shutdown checkpoint and
>> 90% for regular running, consisting of 16% parsing, 13% analyze, 20%
>> plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network.
>
> That SVG graph was quite impressive.

I used perf and Gprof2Dot for this. I will probably do a blog post on
how to generate these graphs. It's much more useful for me than a
plain flat profile as I don't know by heart which functions are called
by which.

>> It looks to me that most benefit could be had from introducing more
>> parallelism. Are there any large roadblocks to pipelining the dump and
>> restore to have them happen in parallel?
>
> I talked to Andrew Dunstan about parallelization in pg_restore.  First,
> we currently use pg_dumpall, which isn't in the custom format required
> for parallel restore, but if we changed to custom format, create table
> isn't done in parallel, only create index/check constraints, and trigger
> creation, etc.  Not sure if it is worth pursuing this just for pg_upgrade.

I agree that parallel restore for schemas is a hard problem. But I
didn't mean parallelism within the restore, I meant that we could
start both postmasters and pipe the output from dump directly to
restore. This way the times for dumping and restoring can overlap.
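
In its simplest form this is just a shell pipeline; the sketch below uses illustrative ports (old cluster on 5432, new one on 5433) and assumes the dump actually streams its output early:

    #include <stdlib.h>

    /*
     * The shell runs both sides of a pipeline concurrently, so the
     * restore can consume statements as soon as the dump emits them.
     */
    int
    pipe_dump_to_restore(void)
    {
        return system("pg_dumpall --schema-only --binary-upgrade -p 5432 "
                      "| psql -p 5433 postgres");
    }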

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Wed, Nov 14, 2012 at 06:11:27AM +0200, Ants Aasma wrote:
> On Wed, Nov 14, 2012 at 2:03 AM, Bruce Momjian <bruce@momjian.us> wrote:
> > At 64k I see pg_upgrade taking 12% of the duration time, if I subtract
> > out the dump/restore times.
> 
> My percentage numbers only included CPU time and I used SSD storage.
> For the most part there was no IO wait to speak of, but it's
> completely expected that thousands of link calls are not free.

Agreed.  I was looking at wall clock time so I could see the total
impact of everything pg_upgrade does.

> >> Postgres time itself breaks down with 10% for shutdown checkpoint and
> >> 90% for regular running, consisting of 16% parsing, 13% analyze, 20%
> >> plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network.
> >
> > That SVG graph was quite impressive.
> 
> I used perf and Gprof2Dot for this. I will probably do a blog post on
> how to generate these graphs. It's much more useful for me than a
> plain flat profile as I don't know by heart which functions are called
> by which.

Yes, please share that information.

> >> It looks to me that most benefit could be had from introducing more
> >> parallelism. Are there any large roadblocks to pipelining the dump and
> >> restore to have them happen in parallel?
> >
> > I talked to Andrew Dunstan about parallelization in pg_restore.  First,
> > we currently use pg_dumpall, which isn't in the custom format required
> > for parallel restore, but if we changed to custom format, create table
> > isn't done in parallel, only create index/check constraints, and trigger
> > creation, etc.  Not sure if it is worth pursuing this just for pg_upgrade.
> 
> I agree that parallel restore for schemas is a hard problem. But I
> didn't mean parallelism within the restore, I meant that we could
> start both postmasters and pipe the output from dump directly to
> restore. This way the times for dumping and restoring can overlap.

Wow, that is a very creative idea.  The current code doesn't do that,
but this has the potential of doubling pg_upgrade's speed, without
adding a lot of complexity.  Here are the challenges of this approach:

*  I would need to log the output of pg_dumpall as it is passed to psql
so users can debug problems

*  pg_upgrade never runs the old and new clusters at the same time for
fear that it will run out of resources, e.g. shared memory, or if they
are using the same port number.  We can make this optional and force
different port numbers.

Let me work up a prototype in the next few days and see how it performs.
Thanks for the great idea.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Andrew Dunstan
Date:
On 11/14/2012 10:08 AM, Bruce Momjian wrote:
> On Wed, Nov 14, 2012 at 06:11:27AM +0200, Ants Aasma wrote:
>>
>> I agree that parallel restore for schemas is a hard problem. But I
>> didn't mean parallelism within the restore, I meant that we could
>> start both postmasters and pipe the output from dump directly to
>> restore. This way the times for dumping and restoring can overlap.
> Wow, that is a very creative idea.  The current code doesn't do that,
> but this has the potential of doubling pg_upgrade's speed, without
> adding a lot of complexity.  Here are the challenges of this approach:
>
> *  I would need to log the output of pg_dumpall as it is passed to psql
> so users can debug problems


Instead of piping it directly, have pg_upgrade work as a tee, pumping 
bytes both to psql and a file. This doesn't seem terribly hard.
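
A minimal sketch of that loop, assuming pg_upgrade already holds the read end of the pg_dumpall pipe, the write end of the psql pipe, and an open log file (the function name is made up):

    #include <stdio.h>
    #include <unistd.h>

    /* Pump the dump into psql while keeping a copy for debugging. */
    static int
    tee_dump_to_psql(int dump_fd, int psql_fd, FILE *logfile)
    {
        char        buf[8192];
        ssize_t     n;

        while ((n = read(dump_fd, buf, sizeof(buf))) > 0)
        {
            if (write(psql_fd, buf, (size_t) n) != n)
                return -1;              /* psql died; the log shows where */
            fwrite(buf, 1, (size_t) n, logfile);
        }
        return (n < 0) ? -1 : 0;
    }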

>
> *  pg_upgrade never runs the old and new clusters at the same time for
> fear that it will run out of resources, e.g. shared memory, or if they
> are using the same port number.  We can make this optional and force
> different port numbers.


Right.

cheers

andrew




Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Wed, Nov 14, 2012 at 10:25:24AM -0500, Andrew Dunstan wrote:
> 
> On 11/14/2012 10:08 AM, Bruce Momjian wrote:
> >On Wed, Nov 14, 2012 at 06:11:27AM +0200, Ants Aasma wrote:
> >>
> >>I agree that parallel restore for schemas is a hard problem. But I
> >>didn't mean parallelism within the restore, I meant that we could
> >>start both postmasters and pipe the output from dump directly to
> >>restore. This way the times for dumping and restoring can overlap.
> >Wow, that is a very creative idea.  The current code doesn't do that,
> >but this has the potential of doubling pg_upgrade's speed, without
> >adding a lot of complexity.  Here are the challenges of this approach:
> >
> >*  I would need to log the output of pg_dumpall as it is passed to psql
> >so users can debug problems
> 
> 
> Instead of piping it directly, have pg_upgrade work as a tee,
> pumping bytes both to psql and a file. This doesn't seem terribly
> hard.

Right.  It isn't hard.

> >*  pg_upgrade never runs the old and new clusters at the same time for
> >fear that it will run out of resources, e.g. shared memory, or if they
> >are using the same port number.  We can make this optional and force
> >different port numbers.
> 
> 
> Right.

OK.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> Are sure the server you are dumping out of is head?
>
> I experimented a bit with dumping/restoring 16000 tables matching
> Bruce's test case (ie, one serial column apiece).  The pg_dump profile
> seems fairly flat, without any easy optimization targets.  But
> restoring the dump script shows a rather interesting backend profile:
>
> samples  %        image name               symbol name
> 30861    39.6289  postgres                 AtEOXact_RelationCache
> 9911     12.7268  postgres                 hash_seq_search
...
>
> The hash_seq_search time is probably mostly associated with
> AtEOXact_RelationCache, which is run during transaction commit and scans
> the relcache hashtable looking for tables created in the current
> transaction.  So that's about 50% of the runtime going into that one
> activity.
>
> There are at least three ways we could whack that mole:
>
> * Run the psql script in --single-transaction mode, as I was mumbling
> about the other day.  If we were doing AtEOXact_RelationCache only once,
> rather than once per CREATE TABLE statement, it wouldn't be a problem.
> Easy but has only a narrow scope of applicability.
>
> * Keep a separate list (or data structure of your choice) so that
> relcache entries created in the current xact could be found directly
> rather than having to scan the whole relcache.  That'd add complexity
> though, and could perhaps be a net loss for cases where the relcache
> isn't so bloated.

Maybe a static list that can overflow, like the ResourceOwner/Lock
table one recently added.  The overhead of that should be very low.

Are the three places where "need_eoxact_work = true;" the only places
where things need to be added to the new structure?  It seems like
there is no need to remove things from the list, because the things
done in AtEOXact_RelationCache are idempotent.


> * Limit the size of the relcache (eg by aging out
> not-recently-referenced entries) so that we aren't incurring O(N^2)
> costs for scripts touching N tables.  Again, this adds complexity and
> could be counterproductive in some scenarios.

I made the crude hack of just dumping the relcache whenever it had more
than 1000 entries at end of transaction.  The time to load 100,000 tables
went from 62 minutes without the patch to 12 minutes with it.  (Loading
with "-1 -f" took 23 minutes.)

The next quadratic behavior is in init_sequence.

Cheers,

Jeff

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8c9ebe0..3941c98 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2260,6 +2260,8 @@ AtEOXact_RelationCache(bool isCommit)
         )
         return;
 
+    if (hash_get_num_entries(RelationIdCache) > 1000)
+        {RelationCacheInvalidate();}



Re: Further pg_upgrade analysis for many tables

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> There are at least three ways we could whack that mole: ...
>>
>> * Keep a separate list (or data structure of your choice) so that
>> relcache entries created in the current xact could be found directly
>> rather than having to scan the whole relcache.  That'd add complexity
>> though, and could perhaps be a net loss for cases where the relcache
>> isn't so bloated.

> Maybe a static list that can overflow, like the ResourceOwner/Lock
> table one recently added.  The overhead of that should be very low.

> Are the three places where "need_eoxact_work = true;" the only places
> where things need to be added to the new structure?

Yeah.  The problem is not so much the number of places that do that,
as that places that flush entries from the relcache would need to know
to remove them from the separate list, else you'd have dangling
pointers.  It's certainly not impossible, I was just unsure how much
of a pain in the rear it might be.

> The next quadratic behavior is in init_sequence.

Yeah, that's another place that is using a linear list that perhaps
should be a hashtable.  OTOH, probably most sessions don't touch enough
different sequences for that to be a win.
        regards, tom lane



Re: Further pg_upgrade analysis for many tables

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:

> > The next quadratic behavior is in init_sequence.
>
> Yeah, that's another place that is using a linear list that perhaps
> should be a hashtable.  OTOH, probably most sessions don't touch enough
> different sequences for that to be a win.

Could we use some adaptive mechanism here?  Say we use a list for the
first ten entries, and if an eleventh one comes in, we create a hash
table for that one and all subsequent ones.  All future calls would
have to examine both the list for the first few and then the hash table.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
Patch applied to git head.  Thanks to Ants Aasma for the analysis that led
to the patch.

---------------------------------------------------------------------------

On Tue, Nov 13, 2012 at 07:03:51PM -0500, Bruce Momjian wrote:
> On Tue, Nov 13, 2012 at 05:44:54AM +0200, Ants Aasma wrote:
> > On Mon, Nov 12, 2012 at 10:59 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > > You can see a significant speedup with those loops removed.  The 16k
> > > case is improved, but still not linear.  The 16k dump/restore scale
> > > looks fine, so it must be something in pg_upgrade, or in the kernel.
> > 
> > I can confirm the speedup. Profiling results for 9.3 to 9.3 upgrade
> > for 8k and 64k tables are attached. pg_upgrade itself is now taking
> > negligible time.
> 
> I generated these timings from the attached test script.
> 
>                  -------------------------- 9.3 ------------------------
>                  ---- normal ----  -- binary_upgrade --  -- pg_upgrade -
>                 - dmp -   - res -   - dmp -   - res -    git       patch
>         1         0.12      0.07      0.13      0.07     11.06     11.02
>      1000         2.20      2.46      3.57      2.82     19.15     18.61
>      2000         4.51      5.01      8.22      5.80     29.12     26.89
>      4000         8.97     10.88     14.76     12.43     45.87     43.08
>      8000        15.30     24.72     30.57     27.10    100.31     79.75
>     16000        36.14     54.88     62.27     61.69    248.03    167.94
>     32000        55.29    162.20    115.16    179.15    695.05    376.84
>     64000       149.86    716.46    265.77    724.32   2323.73   1122.38
> 
> You can see the speedup of the patch, particularly for a greater number
> of tables, e.g. 2x faster for 64k tables.
> 
> > The 64k profile shows the AtEOXact_RelationCache scaling problem. For
> > the 8k profile nothing really pops out as a clear bottleneck. CPU time
> > distributes 83.1% to postgres, 4.9% to pg_dump, 7.4% to psql and 0.7%
> > to pg_upgrade.
> 
> At 64k I see pg_upgrade taking 12% of the duration time, if I subtract
> out the dump/restore times.
> 
> I am attaching an updated pg_upgrade patch, which I believe is ready for
> application for 9.3.
> 
> > Postgres time itself breaks down with 10% for shutdown checkpoint and
> > 90% for regular running, consisting of 16% parsing, 13% analyze, 20%
> > plan, 30% execute, 11% commit (AtEOXact_RelationCache) and 6% network.
> 
> That SVG graph was quite impressive.
> 
> > It looks to me that most benefit could be had from introducing more
> > parallelism. Are there any large roadblocks to pipelining the dump and
> > restore to have them happen in parallel?
> 
> I talked to Andrew Dustan about parallelization in pg_restore.  First,
> we currently use pg_dumpall, which isn't in the custom format required
> for parallel restore, but if we changed to custom format, create table
> isn't done in parallel, only create index/check constraints, and trigger
> creation, etc.  Not sure if it worth perusing this just for pg_upgrade.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Dimitri Fontaine
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Could we use some adaptive mechanism here?  Say we use a list for the
> first ten entries, and if an eleventh one comes in, we create a hash
> table for that one and all subsequent ones.  All future calls would
> have to examine both the list for the first few and then the hash table.

Is it necessary to do so?  Do we know for sure that a 10-element hash
table is slower than a 10-element list when doing only key-based
lookups, for the object data type we're interested in here?

-- 
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support



Re: Further pg_upgrade analysis for many tables

From
Tom Lane
Date:
Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Could we use some adaptive mechanism here?  Say we use a list for the
>> first ten entries, and if an eleventh one comes in, we create a hash
>> table for that one and all subsequent ones.  All future calls would
>> have to examine both the list for the first few and then the hash table.

> Is it necessary to do so? Do we know for sure that a 10 elements hash
> table is slower than a 10 elements list when only doing key based
> lookups, for the object data type we're interested into here?

Well, we'd want to do some testing to choose the cutover point.
Personally I'd bet on that point being quite a bit higher than ten,
for the case that sequence.c is using where the key being compared is
just an OID.  You can compare a lot of OIDs in the time it takes
dynahash.c to do something.

(I think the above sketch is wrong in detail, btw.  What we should do
once we decide to create a hash table is move all the existing entries
into the hash table, not continue to scan a list for them.  There's a
similar case in the planner for tracking join RelOptInfos.)
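
For concreteness, a sketch of that cutover (all names here are made up; the hash_* helpers stand in for a dynahash-style table, and the cutover constant is exactly the thing that would need benchmarking):

    typedef struct SeqEntry
    {
        unsigned int relid;             /* sequence OID */
        /* ...cached sequence state... */
    } SeqEntry;

    /* hypothetical stand-ins for a dynahash-style hash table */
    extern void *hash_new(void);
    extern void hash_put(void *hash, unsigned int key, SeqEntry *entry);
    extern SeqEntry *hash_get_or_put(void *hash, unsigned int key);

    #define SEQ_CUTOVER 32              /* the constant to benchmark */

    static SeqEntry seqtab[SEQ_CUTOVER];
    static int      seqtab_used = 0;
    static void    *seqhash = NULL;     /* NULL until we cut over */

    static SeqEntry *
    seq_find_or_add(unsigned int relid)
    {
        int         i;

        if (seqhash == NULL)
        {
            for (i = 0; i < seqtab_used; i++)   /* cheap linear scan */
                if (seqtab[i].relid == relid)
                    return &seqtab[i];

            if (seqtab_used < SEQ_CUTOVER)
            {
                seqtab[seqtab_used].relid = relid;
                return &seqtab[seqtab_used++];
            }

            /* overflow: build the hash table and move everything over */
            seqhash = hash_new();
            for (i = 0; i < seqtab_used; i++)
                hash_put(seqhash, seqtab[i].relid, &seqtab[i]);
            /* from here on, lookups never touch the array again */
        }
        return hash_get_or_put(seqhash, relid);
    }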
        regards, tom lane



Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Wed, Nov 14, 2012 at 11:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> There are at least three ways we could whack that mole: ...
>>>
>>> * Keep a separate list (or data structure of your choice) so that
>>> relcache entries created in the current xact could be found directly
>>> rather than having to scan the whole relcache.  That'd add complexity
>>> though, and could perhaps be a net loss for cases where the relcache
>>> isn't so bloated.
>
>> Maybe a static list that can overflow, like the ResourceOwner/Lock
>> table one recently added.  The overhead of that should be very low.
>
>> Are the three places where "need_eoxact_work = true;" the only places
>> where things need to be added to the new structure?
>
> Yeah.  The problem is not so much the number of places that do that,
> as that places that flush entries from the relcache would need to know
> to remove them from the separate list, else you'd have dangling
> pointers.

If the list is of hash-tags rather than pointers, all we would have to
do is ignore entries that are not still in the hash table, right?


On a related thought, it is a shame that "create temp table on commit
drop" sets "need_eoxact_work", because by the time we get to
AtEOXact_RelationCache upon commit, the entry is already gone and so
there is no actual work to do (unless a non-temp table was also
created).  But on abort, the entry is still there.  I don't know if
there is an opportunity for optimization there for people who use temp
tables a lot.  If we go with a caching list, that would render it moot
unless they use so many as to routinely overflow the cache.

Cheers,

Jeff



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Thu, Nov 15, 2012 at 07:05:00PM -0800, Jeff Janes wrote:
> On Wed, Nov 14, 2012 at 11:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Jeff Janes <jeff.janes@gmail.com> writes:
> >> On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> There are at least three ways we could whack that mole: ...
> >>>
> >>> * Keep a separate list (or data structure of your choice) so that
> >>> relcache entries created in the current xact could be found directly
> >>> rather than having to scan the whole relcache.  That'd add complexity
> >>> though, and could perhaps be a net loss for cases where the relcache
> >>> isn't so bloated.
> >
> >> Maybe a static list that can overflow, like the ResourceOwner/Lock
> >> table one recently added.  The overhead of that should be very low.
> >
> >> Are the three places where "need_eoxact_work = true;" the only places
> >> where things need to be added to the new structure?
> >
> > Yeah.  The problem is not so much the number of places that do that,
> > as that places that flush entries from the relcache would need to know
> > to remove them from the separate list, else you'd have dangling
> > pointers.
>
> If the list is of hash-tags rather than pointers, all we would have to
> do is ignore entries that are not still in the hash table, right?
>
>
> On a related thought, it is a shame that "create temp table on commit
> drop" sets "need_eoxact_work", because by the time we get to
> AtEOXact_RelationCache upon commit, the entry is already gone and so
> there is no actual work to do (unless a non-temp table was also
> created).  But on abort, the entry is still there.  I don't know if
> there is an opportunity for optimization there for people who use temp
> tables a lot.  If we go with a caching list, that would render it moot
> unless they use so many as to routinely overflow the cache.

I added the attached C comment last year to mention why temp tables are
not as isolated as we think, and can't be optimized as much as you would
think.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Attachment

Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Thu, Nov 15, 2012 at 7:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Nov 14, 2012 at 11:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Jeff Janes <jeff.janes@gmail.com> writes:
>>> On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> There are at least three ways we could whack that mole: ...
>>>>
>>>> * Keep a separate list (or data structure of your choice) so that
>>>> relcache entries created in the current xact could be found directly
>>>> rather than having to scan the whole relcache.  That'd add complexity
>>>> though, and could perhaps be a net loss for cases where the relcache
>>>> isn't so bloated.
>>
>>> Maybe a static list that can overflow, like the ResourceOwner/Lock
>>> table one recently added.  The overhead of that should be very low.
>>
>>> Are the three places where "need_eoxact_work = true;" the only places
>>> where things need to be added to the new structure?
>>
>> Yeah.  The problem is not so much the number of places that do that,
>> as that places that flush entries from the relcache would need to know
>> to remove them from the separate list, else you'd have dangling
>> pointers.
>
> If the list is of hash-tags rather than pointers, all we would have to
> do is ignore entries that are not still in the hash table, right?
>

I've attached a proof-of-concept patch to implement this.

I got rid of need_eoxact_work entirely and replaced it with a short
list that fulfills the functions of indicating that work is needed,
and suggesting which rels might need that work.  There is no attempt
to prevent duplicates, nor to remove invalidated entries from the
list.   Invalid entries are skipped when the hash entry is not found,
and processing is idempotent so duplicates are not a problem.

Formally speaking, if MAX_EOXACT_LIST were 0, so that the list
overflowed the first time it was accessed, then it would be identical
to the current behavior of having only a flag.  So formally all I did
was increase the max from 0 to 10.
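
In rough sketch form (not the patch itself; the lookup and work helpers below are placeholder names for the existing relcache machinery), the mechanism looks like this:

    #define MAX_EOXACT_LIST 10

    typedef struct RelationData *Relation;      /* opaque for this sketch */
    extern Relation RelationIdCacheLookupOid(unsigned int relid);
    extern void do_eoxact_work(Relation rel, int isCommit);
    extern void scan_whole_relcache(int isCommit);  /* the old slow path */

    static unsigned int eoxact_list[MAX_EOXACT_LIST];   /* hash tags (OIDs) */
    static int  eoxact_list_len = 0;
    static int  eoxact_list_overflowed = 0;

    /* used at the three places that formerly set need_eoxact_work */
    #define EOXactListAdd(relid) \
        do { \
            if (eoxact_list_len < MAX_EOXACT_LIST) \
                eoxact_list[eoxact_list_len++] = (relid); \
            else \
                eoxact_list_overflowed = 1; \
        } while (0)

    void
    AtEOXact_RelationCache(int isCommit)
    {
        int         i;

        if (eoxact_list_overflowed)
            scan_whole_relcache(isCommit);
        else
            for (i = 0; i < eoxact_list_len; i++)
            {
                Relation    rel = RelationIdCacheLookupOid(eoxact_list[i]);

                /* a flushed entry just fails the lookup and is skipped */
                if (rel != NULL)
                    do_eoxact_work(rel, isCommit);  /* idempotent */
            }

        eoxact_list_len = 0;
        eoxact_list_overflowed = 0;
    }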

I wasn't so sure about the idempotent nature of subtransaction
processing, so I chickened out and left that part alone.  I know of no
workflow for which that was a bottleneck.

AtEOXact_release is oddly indented because that makes the patch
smaller and easier to read.

This makes the non "-1" restore of large dumps very much faster (and
makes them faster than "-1" restores, as well).

I added a "create temp table foo (x integer) on commit drop;" line to
the default pgbench transaction and tested that.  I was hoping to see
a performance improvement there as well (the transaction has ~110
entries in the RelationIdCache at eoxact each time), but the
performance was too variable (probably due to the intense IO it
causes) to detect any changes.  At least it is not noticeably slower.
If I hack pgbench to bloat the RelationIdCache by touching 20,000
useless tables as part of the connection start up process, then this
patch does show a win.

It is not obvious what value to set the MAX list size to.  Since this
array is only allocated once per back-end, and since it is not groveled
through to invalidate relations at each invalidation, there is no
particular reason it must be small.  But if the same table is assigned
new filenodes (or forced index lists, whatever those are) repeatedly
within a transaction, the list could become bloated with replicate
entries, potentially becoming even larger than the hash table whose
scan it is intended to short-cut.

In any event, 10 seems to be large enough to overcome the currently
known bottle-neck.  Maybe 100 would be a more principled number, as
that is about where the list could start to become as big as the basal
size of the RelationIdCache table.

I don't think this patch replaces having some mechanism for
restricting how large RelationIdCache can get or how LRU entries in it
can get as Robert suggested.  But this approach seems like it is
easier to implement and agree upon; and doesn't preclude doing other
optimizations later.

Cheers,

Jeff

Attachment

Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
Added to TODO:

    Improve cache lookup speed for sessions accessing many relations
    http://archives.postgresql.org/pgsql-hackers/2012-11/msg00356.php

---------------------------------------------------------------------------

On Fri, Nov  9, 2012 at 12:50:34AM -0500, Tom Lane wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
> > Are sure the server you are dumping out of is head?
> 
> I experimented a bit with dumping/restoring 16000 tables matching
> Bruce's test case (ie, one serial column apiece).  The pg_dump profile
> seems fairly flat, without any easy optimization targets.  But
> restoring the dump script shows a rather interesting backend profile:
> 
> samples  %        image name               symbol name
> 30861    39.6289  postgres                 AtEOXact_RelationCache
> 9911     12.7268  postgres                 hash_seq_search
> 2682      3.4440  postgres                 init_sequence
> 2218      2.8482  postgres                 _bt_compare
> 2120      2.7223  postgres                 hash_search_with_hash_value
> 1976      2.5374  postgres                 XLogInsert
> 1429      1.8350  postgres                 CatalogCacheIdInvalidate
> 1282      1.6462  postgres                 LWLockAcquire
> 973       1.2494  postgres                 LWLockRelease
> 702       0.9014  postgres                 hash_any
> 
> The hash_seq_search time is probably mostly associated with
> AtEOXact_RelationCache, which is run during transaction commit and scans
> the relcache hashtable looking for tables created in the current
> transaction.  So that's about 50% of the runtime going into that one
> activity.
> 
> There are at least three ways we could whack that mole:
> 
> * Run the psql script in --single-transaction mode, as I was mumbling
> about the other day.  If we were doing AtEOXact_RelationCache only once,
> rather than once per CREATE TABLE statement, it wouldn't be a problem.
> Easy but has only a narrow scope of applicability.
> 
> * Keep a separate list (or data structure of your choice) so that
> relcache entries created in the current xact could be found directly
> rather than having to scan the whole relcache.  That'd add complexity
> though, and could perhaps be a net loss for cases where the relcache
> isn't so bloated.
> 
> * Limit the size of the relcache (eg by aging out
> not-recently-referenced entries) so that we aren't incurring O(N^2)
> costs for scripts touching N tables.  Again, this adds complexity and
> could be counterproductive in some scenarios.
> 
>             regards, tom lane

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Mon, Nov 12, 2012 at 06:14:59PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> 
> > --- 17,24 ----
> >   
> >   static void transfer_single_new_db(pageCnvCtx *pageConverter,
> >                          FileNameMap *maps, int size);
> > ! static int transfer_relfile(pageCnvCtx *pageConverter, FileNameMap *map,
> > !                              const char *suffix);
> 
> Uh, does this code assume that forks other than the main one are not
> split in segments?  I think that's a bug, is it not?

Actually, the segment scanning now happens inside transfer_relfile().

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Robert Haas
Date:
On Fri, Nov 23, 2012 at 5:34 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Thu, Nov 15, 2012 at 7:05 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> On Wed, Nov 14, 2012 at 11:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Jeff Janes <jeff.janes@gmail.com> writes:
>>>> On Thu, Nov 8, 2012 at 9:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>>> There are at least three ways we could whack that mole: ...
>>>>>
>>>>> * Keep a separate list (or data structure of your choice) so that
>>>>> relcache entries created in the current xact could be found directly
>>>>> rather than having to scan the whole relcache.  That'd add complexity
>>>>> though, and could perhaps be a net loss for cases where the relcache
>>>>> isn't so bloated.
>>>
>>>> Maybe a static list that can overflow, like the ResourceOwner/Lock
>>>> table one recently added.  The overhead of that should be very low.
>>>
>>>> Are the three places where "need_eoxact_work = true;" the only places
>>>> where things need to be added to the new structure?
>>>
>>> Yeah.  The problem is not so much the number of places that do that,
>>> as that places that flush entries from the relcache would need to know
>>> to remove them from the separate list, else you'd have dangling
>>> pointers.
>>
>> If the list is of hash-tags rather than pointers, all we would have to
>> do is ignore entries that are not still in the hash table, right?
>>
>
> I've attached a proof-of-concept patch to implement this.
>
> I got rid of need_eoxact_work entirely and replaced it with a short
> list that fulfills the functions of indicating that work is needed,
> and suggesting which rels might need that work.  There is no attempt
> to prevent duplicates, nor to remove invalidated entries from the
> list.   Invalid entries are skipped when the hash entry is not found,
> and processing is idempotent so duplicates are not a problem.
>
> Formally speaking, if MAX_EOXACT_LIST were 0, so that the list
> overflowed the first time it was accessed, then it would be identical
> to the current behavior of having only a flag.  So formally all I did
> was increase the max from 0 to 10.
>
> I wasn't so sure about the idempotent nature of subtransaction
> processing, so I chickened out and left that part alone.  I know of no
> workflow for which that was a bottleneck.
>
> AtEOXact_release is oddly indented because that makes the patch
> smaller and easier to read.
>
> This makes the non "-1" restore of large dumps very much faster (and
> makes them faster than "-1" restores, as well).
>
> I added a "create temp table foo (x integer) on commit drop;" line to
> the default pgbench transaction and tested that.  I was hoping to see
> a performance improvement there as well (the transaction has ~110
> entries in the RelationIdCache at eoxact each time), but the
> performance was too variable (probably due to the intense IO it
> causes) to detect any changes.  At least it is not noticeably slower.
> If I hack pgbench to bloat the RelationIdCache by touching 20,000
> useless tables as part of the connection start up process, then this
> patch does show a win.
>
> It is not obvious what value to set the MAX list size to.  Since this
> array is only allocated once per back-end, and since it is not groveled
> through to invalidate relations at each invalidation, there is no
> particular reason it must be small.  But if the same table is assigned
> new filenodes (or forced index lists, whatever those are) repeatedly
> within a transaction, the list could become bloated with duplicate
> entries, potentially becoming even larger than the hash table whose
> scan it is intended to short-cut.
>
> In any event, 10 seems to be large enough to overcome the currently
> known bottle-neck.  Maybe 100 would be a more principled number, as
> that is about where the list could start to become as big as the basal
> size of the RelationIdCache table.
>
> I don't think this patch replaces having some mechanism for
> restricting how large RelationIdCache can get or how LRU entries in it
> can get as Robert suggested.  But this approach seems like it is
> easier to implement and agree upon; and doesn't preclude doing other
> optimizations later.

I haven't reviewed this terribly closely, but I think this is likely
worth pursuing.  I see you already added it to the next CommitFest,
which is good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Wed, Nov 14, 2012 at 10:08:15AM -0500, Bruce Momjian wrote:
> > I agree that parallel restore for schemas is a hard problem. But I
> > didn't mean parallelism within the restore, I meant that we could
> > start both postmasters and pipe the output from dump directly to
> > restore. This way the times for dumping and restoring can overlap.
>
> Wow, that is a very creative idea.  The current code doesn't do that,
> but this has the potential of doubling pg_upgrade's speed, without
> adding a lot of complexity.  Here are the challenges of this approach:
>
> *  I would need to log the output of pg_dumpall as it is passed to psql
> so users can debug problems
>
> *  pg_upgrade never runs the old and new clusters at the same time for
> fear that it will run out of resources, e.g. shared memory, or if they
> are using the same port number.  We can make this optional and force
> different port numbers.
>
> Let me work up a prototype in the next few days and see how it performs.
> Thanks for the great idea.

I have developed the attached proof-of-concept patch to test this idea.
Unfortunately, I got poor results:

                                            ---- pg_upgrade ----
                  dump     restore  dmp|res     git     dmp/res
        1         0.12      0.07      0.13     11.16     13.03
     1000         3.80      2.83      5.46     18.78     20.27
     2000         5.39      5.65     13.99     26.78     28.54
     4000        16.08     12.40     28.34     41.90     44.03
     8000        32.77     25.70     57.97     78.61     80.09
    16000        57.67     63.42    134.43    158.49    165.78
    32000       131.84    176.27    302.85    380.11    389.48
    64000       270.37    708.30   1004.39   1085.39   1094.70

The last two columns show the patch didn't help at all, and the third
column shows it is just executing the pg_dump, then the restore, not in
parallel, i.e. column 1 + column 2 ~= column 3.

Testing pg_dump for 4k tables (16 seconds) shows the first row is not
output by pg_dump until 15 seconds, meaning there can't be any
parallelism with a pipe.  (Test script attached.)  Does anyone know how
to get pg_dump to send some output earlier?  In summary, it doesn't seem
pg_dump makes any attempt to output its data early.  pg_dump.c has some
details:

    /*
     * And finally we can do the actual output.
     *
     * Note: for non-plain-text output formats, the output file is written
     * inside CloseArchive().  This is, um, bizarre; but not worth changing
     * right now.
     */
    if (plainText)
        RestoreArchive(fout);

    CloseArchive(fout);

FYI, log_min_duration_statement shows queries taking 11.2 seconds, even
without the network overhead --- not sure how that can be optimized.

I will now test using PRIMARY KEY and custom dump format with pg_restore
--jobs to see if I can get parallelism that way.

A further parallelism would be to allow multiple database to be
dump/restored at the same time.  I will test for that once this is done.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Attachment

Re: Further pg_upgrade analysis for many tables

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Testing pg_dump for 4k tables (16 seconds) shows the first row is not
> output by pg_dump until 15 seconds, meaning there can't be any
> parallelism with a pipe.  (Test script attached.)  Does anyone know how
> to get pg_dump to send some output earlier?

You can't.  By the time it knows what order to emit the objects in,
it's done all the preliminary work you're griping about.

(In a dump with data, there would be a meaningful amount of computation
remaining, but not in a schema-only dump.)

> I will now test using PRIMARY KEY and custom dump format with pg_restore
> --jobs to see if I can get parallelism that way.

This seems likely to be a waste of effort for the same reason: you only
get meaningful parallelism when there's a substantial data component to
be restored.
        regards, tom lane



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Mon, Nov 26, 2012 at 05:26:42PM -0500, Bruce Momjian wrote:
> I have developed the attached proof-of-concept patch to test this idea.
> Unfortunately, I got poor results:
>
>                                             ---- pg_upgrade ----
>                   dump     restore  dmp|res     git     dmp/res
>         1         0.12      0.07      0.13     11.16     13.03
>      1000         3.80      2.83      5.46     18.78     20.27
>      2000         5.39      5.65     13.99     26.78     28.54
>      4000        16.08     12.40     28.34     41.90     44.03
>      8000        32.77     25.70     57.97     78.61     80.09
>     16000        57.67     63.42    134.43    158.49    165.78
>     32000       131.84    176.27    302.85    380.11    389.48
>     64000       270.37    708.30   1004.39   1085.39   1094.70
>
> The last two columns show the patch didn't help at all, and the third
> column shows it is just executing the pg_dump, then the restore, not in
> parallel, i.e. column 1 + column 2 ~= column 3.
...
> I will now test using PRIMARY KEY and custom dump format with pg_restore
> --jobs to see if I can get parallelism that way.

I have some new interesting results (in seconds, test script attached):

            ---- -Fc ----  ------- dump | pg_restore/psql ------  - pg_upgrade -
            dump  restore   -Fc    -Fc|-1  -Fc|-j   -Fp    -Fp|-1   git    patch
    1       0.14    0.08    0.14    0.16    0.19    0.13    0.15   11.04   13.07
 1000       3.08    3.65    6.53    6.60    5.39    6.37    6.54   21.05   22.18
 2000       6.06    6.52   12.15   11.78   10.52   12.89   12.11   31.93   31.65
 4000      11.07   14.68   25.12   24.47   22.07   26.77   26.77   56.03   47.03
 8000      20.85   32.03   53.68   45.23   45.10   59.20   51.33  104.99   85.19
16000      40.28   88.36  127.63   96.65  106.33  136.68  106.64  221.82  157.36
32000      93.78  274.99  368.54  211.30  294.76  376.36  229.80  544.73  321.19
64000     197.79 1109.22 1336.83  577.83 1117.55 1327.98  567.84 1766.12  763.02

I tested custom format with pg_restore -j and -1, as well as text
restore.  The winner was pg_dump -Fc | pg_restore -1;  even -j could not
beat it.  (FYI, Andrew Dunstan told me that indexes can be restored in
parallel with -j.)  That is actually helpful because we can use process
parallelism to restore multiple databases at the same time without
having to use processes for -j parallelism.

Attached is my pg_upgrade patch for this.  I am going to polish it up
for 9.3 application.

> A further parallelism would be to allow multiple database to be
> dump/restored at the same time.  I will test for that once this is done.

I will work on this next.
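
A sketch of where that could go (the function name, ports, and flags are illustrative; real code would cap the number of concurrent children and check each child's exit status):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* One dump|restore pipeline per database, all running concurrently. */
    static void
    upgrade_all_databases(char **dbnames, int ndbs)
    {
        int         i;

        for (i = 0; i < ndbs; i++)
        {
            pid_t       pid = fork();

            if (pid == 0)               /* child: one database's schema */
            {
                char        cmd[4096];

                snprintf(cmd, sizeof(cmd),
                         "pg_dump --schema-only --binary-upgrade -Fc -p 5432 \"%s\" "
                         "| pg_restore --single-transaction -p 5433 -d \"%s\"",
                         dbnames[i], dbnames[i]);
                exit(system(cmd) == 0 ? 0 : 1);
            }
            else if (pid < 0)
            {
                perror("fork");
                exit(1);
            }
        }

        for (i = 0; i < ndbs; i++)      /* reap all the children */
            if (wait(NULL) < 0)
                break;
    }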

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Attachment

Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Tue, Nov 27, 2012 at 8:13 PM, Bruce Momjian <bruce@momjian.us> wrote:
>
> I have some new interesting results (in seconds, test script attached):
>
>             ---- -Fc ----  ------- dump | pg_restore/psql ------  - pg_upgrade -
>             dump  restore   -Fc    -Fc|-1  -Fc|-j   -Fp    -Fp|-1   git    patch
>     1       0.14    0.08    0.14    0.16    0.19    0.13    0.15   11.04   13.07
>  1000       3.08    3.65    6.53    6.60    5.39    6.37    6.54   21.05   22.18
>  2000       6.06    6.52   12.15   11.78   10.52   12.89   12.11   31.93   31.65
>  4000      11.07   14.68   25.12   24.47   22.07   26.77   26.77   56.03   47.03
>  8000      20.85   32.03   53.68   45.23   45.10   59.20   51.33  104.99   85.19
> 16000      40.28   88.36  127.63   96.65  106.33  136.68  106.64  221.82  157.36
> 32000      93.78  274.99  368.54  211.30  294.76  376.36  229.80  544.73  321.19
> 64000     197.79 1109.22 1336.83  577.83 1117.55 1327.98  567.84 1766.12  763.02
>
> I tested custom format with pg_restore -j and -1, as well as text
> restore.  The winner was pg_dump -Fc | pg_restore -1;

I don't have the numbers at hand, but if my relcache patch is
accepted, then "-1" stops being faster.

-1 gets rid of the AtEOXact relcache N^2 behavior, but at the cost of
invoking a different N^2, that one in the stats system.


Cheers,

Jeff



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Tue, Nov 27, 2012 at 09:35:10PM -0800, Jeff Janes wrote:
> On Tue, Nov 27, 2012 at 8:13 PM, Bruce Momjian <bruce@momjian.us> wrote:
> >
> > I have some new interesting results (in seconds, test script attached):
> >
> >             ---- -Fc ----  ------- dump | pg_restore/psql ------  - pg_upgrade -
> >             dump  restore   -Fc    -Fc|-1  -Fc|-j   -Fp    -Fp|-1   git    patch
> >     1       0.14    0.08    0.14    0.16    0.19    0.13    0.15   11.04   13.07
> >  1000       3.08    3.65    6.53    6.60    5.39    6.37    6.54   21.05   22.18
> >  2000       6.06    6.52   12.15   11.78   10.52   12.89   12.11   31.93   31.65
> >  4000      11.07   14.68   25.12   24.47   22.07   26.77   26.77   56.03   47.03
> >  8000      20.85   32.03   53.68   45.23   45.10   59.20   51.33  104.99   85.19
> > 16000      40.28   88.36  127.63   96.65  106.33  136.68  106.64  221.82  157.36
> > 32000      93.78  274.99  368.54  211.30  294.76  376.36  229.80  544.73  321.19
> > 64000     197.79 1109.22 1336.83  577.83 1117.55 1327.98  567.84 1766.12  763.02
> >
> > I tested custom format with pg_restore -j and -1, as well as text
> > restore.  The winner was pg_dump -Fc | pg_restore -1;
> 
> I don't have the numbers at hand, but if my relcache patch is
> accepted, then "-1" stops being faster.
> 
> -1 gets rid of the AtEOXact relcache N^2 behavior, but at the cost of
> invoking a different N^2, that one in the stats system.

I was going to ask you that.  :-)  Let me run a test with your patch
now.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Tue, Nov 27, 2012 at 09:35:10PM -0800, Jeff Janes wrote:
> > I tested custom format with pg_restore -j and -1, as well as text
> > restore.  The winner was pg_dump -Fc | pg_restore -1;
> 
> I don't have the numbers at hand, but if my relcache patch is
> accepted, then "-1" stops being faster.
> 
> -1 gets rid of the AtEOXact relcache N^2 behavior, but at the cost of
> invoking a different N^2, that one in the stats system.

OK, here are the testing results:
    #tbls       git     -1    AtEOXact  both
        1      11.06   13.06   10.99   13.20
     1000      21.71   22.92   22.20   22.51
     2000      32.86   31.09   32.51   31.62
     4000      55.22   49.96   52.50   49.99
     8000     105.34   82.10   95.32   82.94
    16000     223.67  164.27  187.40  159.53
    32000     543.93  324.63  366.44  317.93
    64000    1697.14  791.82  767.32  752.57

Up to 2k, they are all similar.  4k & 8k have the -1 patch as a win, and
16k+ really need both patches.

I will continue working on the -1 patch, and hopefully we can get your
AtEOXact patch in soon.  Is someone reviewing that?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Wed, Nov 28, 2012 at 03:22:32PM -0500, Bruce Momjian wrote:
> On Tue, Nov 27, 2012 at 09:35:10PM -0800, Jeff Janes wrote:
> > > I tested custom format with pg_restore -j and -1, as well as text
> > > restore.  The winner was pg_dump -Fc | pg_restore -1;
> >
> > I don't have the numbers at hand, but if my relcache patch is
> > accepted, then "-1" stops being faster.
> >
> > -1 gets rid of the AtEOXact relcache N^2 behavior, but at the cost of
> > invoking a different N^2, that one in the stats system.
>
> OK, here are the testing results:
>
>     #tbls       git     -1    AtEOXact  both
>         1      11.06   13.06   10.99   13.20
>      1000      21.71   22.92   22.20   22.51
>      2000      32.86   31.09   32.51   31.62
>      4000      55.22   49.96   52.50   49.99
>      8000     105.34   82.10   95.32   82.94
>     16000     223.67  164.27  187.40  159.53
>     32000     543.93  324.63  366.44  317.93
>     64000    1697.14  791.82  767.32  752.57
>
> Up to 2k, they are all similar.  4k & 8k have the -1 patch as a win, and
> 16k+ really need both patches.
>
> I will continue working on the -1 patch, and hopefully we can get your
> AtEOXact patch in soon.  Is someone reviewing that?

I have polished up the patch (attached) and it is ready for application
to 9.3.

Since there is no pg_dump/pg_restore pipe parallelism, I had the old
cluster create per-database dump files, so I don't need to have the old
and new clusters running at the same time, which would have required two
port numbers and made shared memory exhaustion more likely.

We now create a dump file per database, so thousands of database dump
files might cause a performance problem.

This also adds status output so you can see the database names as their
schemas are dumped and restored.  This was requested by users.
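
To make the shape of this concrete, here is a minimal C sketch of the
per-database pass (illustrative only: the function name, file naming,
and direct system() calls are my assumptions, not the patch's actual
helpers, which also pass more options such as --binary-upgrade):

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Sketch: dump each database's schema to its own file, printing the
     * database name as status output.  Because the dumps go to files,
     * the old and new clusters never have to run at the same time.
     */
    static void
    dump_old_databases(const char *dbnames[], int ndbs, int old_port)
    {
        char        cmd[1024];

        for (int i = 0; i < ndbs; i++)
        {
            printf("dumping database \"%s\"\n", dbnames[i]);
            snprintf(cmd, sizeof(cmd),
                     "pg_dump -p %d --schema-only --format=custom "
                     "--file=\"pg_upgrade_dump_%s.custom\" \"%s\"",
                     old_port, dbnames[i], dbnames[i]);
            if (system(cmd) != 0)
            {
                fprintf(stderr, "dump of \"%s\" failed\n", dbnames[i]);
                exit(1);
            }
        }
    }

    int
    main(void)
    {
        const char *dbs[] = {"postgres", "template1"};

        dump_old_databases(dbs, 2, 50432);      /* example port only */
        return 0;
    }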

I retained custom mode for pg_dump because it is measurably faster than
text mode (not sure why, psql overhead?):

                git     -Fc     -Fp
        1      11.04   11.08   11.02
     1000      22.37   19.68   21.64
     2000      32.39   28.62   31.40
     4000      56.18   48.53   51.15
     8000     105.15   81.23   91.84
    16000     227.64  156.72  177.79
    32000     542.80  323.19  371.81
    64000    1711.77  789.17  865.03

Text dump files are slightly easier to debug, but probably not by much.

Single-transaction restores were recommended to me over a year ago (by
Magnus?), but I wanted to get pg_upgrade rock-solid before doing
optimization, and now is the right time to optimize.

One risk of single-transaction restores is max_locks_per_transaction
exhaustion, but you will need to increase that on the old cluster for
pg_dump anyway because that runs in a single transaction, so the only
new thing is that the new cluster might also need a higher
max_locks_per_transaction.
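
For example (the value here is illustrative only), both clusters would
get something like this in postgresql.conf:

    max_locks_per_transaction = 128

The default is 64, and the shared lock table holds roughly
max_locks_per_transaction * (max_connections + max_prepared_transactions)
locks in total, so the setting is a per-backend average rather than a
hard per-transaction cap.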

I was able to remove split_old_dump() because pg_dumpall now produces a
full global restore file and we do database dumps separately.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Attachment

Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Thu, Nov 29, 2012 at 12:59:19PM -0500, Bruce Momjian wrote:
> I have polished up the patch (attached) and it is ready for application
> to 9.3.

Applied.

---------------------------------------------------------------------------

> Since there is no pg_dump/pg_restore pipe parallelism, I had the old
> cluster create per-database dump files, so I don't need to have the old
> and new clusters running at the same time, which would have required two
> port numbers and made shared memory exhaustion more likely.
> 
> We now create a dump file per database, so thousands of database dump
> files might cause a performance problem.
> 
> This also adds status output so you can see the database names as their
> schemas are dumped and restored.  This was requested by users.
> 
> I retained custom mode for pg_dump because it is measurably faster than
> text mode (not sure why, psql overhead?):
> 
>                 git     -Fc     -Fp
>         1      11.04   11.08   11.02
>      1000      22.37   19.68   21.64
>      2000      32.39   28.62   31.40
>      4000      56.18   48.53   51.15
>      8000     105.15   81.23   91.84
>     16000     227.64  156.72  177.79
>     32000     542.80  323.19  371.81
>     64000    1711.77  789.17  865.03
> 
> Text dump files are slightly easier to debug, but probably not by much.
> 
> Single-transaction restores were recommended to me over a year ago (by
> Magnus?), but I wanted to get pg_upgrade rock-solid before doing
> optimization, and now is the right time to optimize.
> 
> One risk of single-transaction restores is max_locks_per_transaction
> exhaustion, but you will need to increase that on the old cluster for
> pg_dump anyway because that runs in a single transaction, so the only
> new thing is that the new cluster might also need a higher
> max_locks_per_transaction.
> 
> I was able to remove split_old_dump() because pg_dumpall now produces a
> full global restore file and we do database dumps separately.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: Further pg_upgrade analysis for many tables

From
Simon Riggs
Date:
On 23 November 2012 22:34, Jeff Janes <jeff.janes@gmail.com> wrote:

> I got rid of need_eoxact_work entirely and replaced it with a short
> list that fulfills the functions of indicating that work is needed,
> and suggesting which rels might need that work.  There is no attempt
> to prevent duplicates, nor to remove invalidated entries from the
> list.   Invalid entries are skipped when the hash entry is not found,
> and processing is idempotent so duplicates are not a problem.
>
> Formally speaking, if MAX_EOXACT_LIST were 0, so that the list
> overflowed the first time it was accessed, then it would be identical
> to the current behavior of having only a flag.  So formally all I did
> was increase the max from 0 to 10.

...

> It is not obvious what value to set the MAX list size to.

A few questions, that may help you...

Why did you pick 10, when your create temp table example needs 110?

Why does the list not grow as needed?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Further pg_upgrade analysis for many tables

From
Simon Riggs
Date:
On 9 November 2012 18:50, Jeff Janes <jeff.janes@gmail.com> wrote:

> quadratic behavior in the resource owner/lock table

I didn't want to let that particular phrase go by without saying
"exactly what behaviour is that?", so we can discuss fixing that also.

This may be something I already know about, but it's worth asking about.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Wed, Jan 9, 2013 at 3:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 9 November 2012 18:50, Jeff Janes <jeff.janes@gmail.com> wrote:
>
>> quadratic behavior in the resource owner/lock table
>
> I didn't want to let that particular phrase go by without saying
> "exactly what behaviour is that?", so we can discuss fixing that also.

It is the thing that was fixed in commit eeb6f37d89fc60c6449ca1, "Add
a small cache of locks owned by a resource owner in ResourceOwner."
But that fix is only in 9.3devel.

Cheers,

Jeff



Re: Further pg_upgrade analysis for many tables

From
Simon Riggs
Date:
On 9 January 2013 17:50, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Jan 9, 2013 at 3:59 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 9 November 2012 18:50, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>>> quadratic behavior in the resource owner/lock table
>>
>> I didn't want to let that particular phrase go by without saying
>> "exactly what behaviour is that?", so we can discuss fixing that also.
>
> It is the thing that was fixed in commit eeb6f37d89fc60c6449ca1, "Add
> a small cache of locks owned by a resource owner in ResourceOwner."
> But that fix is only in 9.3devel.

That's good, it fixes the problem I reported in 2010, under
"SAVEPOINTs and COMMIT performance".

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Wednesday, January 9, 2013, Simon Riggs wrote:
> On 23 November 2012 22:34, Jeff Janes <jeff.janes@gmail.com> wrote:
>
>> I got rid of need_eoxact_work entirely and replaced it with a short
>> list that fulfills the functions of indicating that work is needed,
>> and suggesting which rels might need that work.  There is no attempt
>> to prevent duplicates, nor to remove invalidated entries from the
>> list.   Invalid entries are skipped when the hash entry is not found,
>> and processing is idempotent so duplicates are not a problem.
>>
>> Formally speaking, if MAX_EOXACT_LIST were 0, so that the list
>> overflowed the first time it was accessed, then it would be identical
>> to the current behavior of having only a flag.  So formally all I did
>> was increase the max from 0 to 10.
>
> ...
>
>> It is not obvious what value to set the MAX list size to.
>
> A few questions, that may help you...
>
> Why did you pick 10, when your create temp table example needs 110?

The 110 is the size of RelationIdCache at the end of my augmented pgbench transaction.  But only one of those entries needs any work, so for that example a MAX of 1 would suffice.  Since 1 seems to be cutting it rather close, I picked the next power of 10.

The downsides of making the MAX larger are:

1) For ordinary workloads, each backend needs a little bit more memory for the static array.  (This would change if we want to extend this to EOsubXACT as well as EOXACT, because there can be only 1 XACT but an unlimited number of SubXACTs.)

2) Pathological workloads that add the same relation to the list over and over again, thousands of times, have to
grovel through that list at EOX, which in theory could be more work than going through the entire non-redundant RelationIdCache hash.  (I have no idea what such a pathological workload might actually look like in practice, but it seems like a good idea to assume that one might exist.)

We could prevent duplicates from being added to the list in the first place, but the overhead needed to do that seems like a sure loser for ordinary workloads.

By making the list over-flowable, we fix a demonstrated pathological workload (restore of huge schemas); we impose no detectable penalty to normal workloads; and we fail to improve, but also fail to make worse, a hypothetical pathological workload.  All at the expense of a few bytes per backend.
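
To make that concrete, here is a minimal standalone sketch of the mechanism (the identifiers and stubbed-out cleanup are illustrative, not the actual relcache code):

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_EOXACT_LIST 10

    static unsigned int eoxact_list[MAX_EOXACT_LIST];
    static int  eoxact_list_len = 0;
    static bool eoxact_list_overflowed = false;

    /* Stand-in for hash_search() plus the real cleanup: if the rel is
     * no longer cached we simply skip it, and the work is idempotent,
     * so duplicate list entries are harmless. */
    static void
    cleanup_if_present(unsigned int relid)
    {
        printf("clean up rel %u if still cached\n", relid);
    }

    /* Stand-in for the old behavior: hash_seq_search over the cache. */
    static void
    scan_entire_relcache(void)
    {
        printf("overflowed: scan the entire hash\n");
    }

    /* Called whenever a rel might need end-of-transaction work. */
    static void
    remember_eoxact_work(unsigned int relid)
    {
        if (eoxact_list_len < MAX_EOXACT_LIST)
            eoxact_list[eoxact_list_len++] = relid;    /* dups allowed */
        else
            eoxact_list_overflowed = true;             /* fall back */
    }

    static void
    at_eoxact_cleanup(void)
    {
        if (eoxact_list_overflowed)
            scan_entire_relcache();
        else
        {
            for (int i = 0; i < eoxact_list_len; i++)
                cleanup_if_present(eoxact_list[i]);
        }
        eoxact_list_len = 0;
        eoxact_list_overflowed = false;
    }

    int
    main(void)
    {
        remember_eoxact_work(42);
        remember_eoxact_work(42);       /* duplicate: harmless */
        at_eoxact_cleanup();
        return 0;
    }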

If the list overflowed at 100 rather than 10, the only cost would probably be the extra bytes used per process.  (Because the minimum realistic size of RelationIdCache is 100, and I assume iterating over 100 hash tags which may or may not exist and/or be redundant is about the same amount of work as iterating over a hash which has at least 100 entries)

If we increase the overflow above 100, we might be making things worse for some pathological workload whose existence is entirely hypothetical--but the workload that benefits from setting it above 100 is also hypothetical.  So while 10 might be too small, above 100 doesn't seem to be defensible in the absence of known cases.
 

> Why does the list not grow as needed?

It would increase the code complexity for no concretely-known benefit.

If we are concerned about space, the extra bytes of compiled code needed to implement dynamic growth would certainly exceed the bytes needed to just jack the MAX setting up to a static 100 or 500 or so.

For dynamic growth to be a win, we would have to have a workload that satisfies all of these conditions:

1) It would have to have some transactions that cause >10 or >100 relations to need cleanup.
2) It would have to have many hundreds more relations in RelationIdCache which don't need cleanup (otherwise, if most of RelationIdCache needs cleanup, then iterating over that hash would be just as efficient as iterating over a list which contains most of said hash).
3) The above-described transaction would have to happen over and over again, because if it only happens once there is no point in worrying about a little inefficiency.

Cheers,

Jeff 

Re: Further pg_upgrade analysis for many tables

From
Stephen Frost
Date:
* Jeff Janes (jeff.janes@gmail.com) wrote:
> By making the list over-flowable, we fix a demonstrated pathological
> workload (restore of huge schemas); we impose no detectable penalty to
> normal workloads; and we fail to improve, but also fail to make worse, a
> hypothetical pathological workload.  All at the expense of a few bytes per
> backend.
[...]
> > Why does the list not grow as needed?
>
> It would increase the code complexity for no concretely-known benefit.

I'm curious if this is going to help with rollbacks of transactions
which created lots of tables..?  We've certainly seen that take much
longer than we'd like, although I've generally attributed it to doing
all of the unlink'ing and truncating of files.

I also wonder about making this a linked list or something which can
trivially grow as we go and then be walked later.  That would also keep
its size small instead of reserving a static/fixed amount.

> 1) It would have to have some transactions that cause >10 or >100
> relations to need cleanup.

That doesn't seem hard.

> 2) It would have to have many hundreds more relations
> in RelationIdCache which don't need cleanup (otherwise, if most
> of RelationIdCache needs cleanup, then iterating over that hash would be
> just as efficient as iterating over a list which contains most of said
> hash).

Good point.

> 3) The above-described transaction would have to happen over and over
> again, because if it only happens once there is no point in worrying about
> a little inefficiency.

We regularly do builds where we have lots of created tables which are
later either committed or dropped (much of that is due to our
hand-crafted partitioning system..).

Looking through the patch itself, it looks pretty clean to me.

Thanks,
    Stephen

Re: Further pg_upgrade analysis for many tables

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> [ patch for AtEOXact_RelationCache ]

I've reviewed and committed this with some mostly-cosmetic adjustments,
notably:

* Applied it to AtEOSubXact cleanup too.  AFAICS that's just as
idempotent, and it seemed weird to not use the same technique both
places.

* Dropped the hack to force a full-table scan in Assert mode.  Although
that's a behavioral change that I suspect Jeff felt was above his pay
grade, it seemed to me that not exercising the now-normal hash_search
code path in assert-enabled testing was a bad idea.  Also, the value of
exhaustive checking for relcache reference leaks is vastly lower than it
once was, because those refcounts are managed mostly automatically now.

* Redid the representation of the overflowed state a bit --- the way
that n_eoxact_list worked seemed a bit too cute/complicated for my
taste.

> On Wednesday, January 9, 2013, Simon Riggs wrote:
>> Why does the list not grow as needed?

> It would increase the code complexity for no concretely-known benefit.

Actually there's a better argument for that: at some point a long list
is actively counterproductive, because N hash_search lookups will cost
more than the full-table scan would.

I did some simple measurements that told me that with 100-odd entries
in the hashtable (which seems to be about the minimum for an active
backend), the hash_seq_search() traversal is about 40x more expensive
than one hash_search() lookup.  (I find this number slightly
astonishing, but that's the answer I got.)  So the crossover point
is at least 40 and probably quite a bit more, since (1) my measurement
did not count the cost of uselessly doing the actual relcache-entry
cleanup logic on non-targeted entries, and (2) if the list is that
long there are probably more than 100-odd entries in the hash table,
and hash table growth hurts the seqscan approach much more than the
search approach.
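
In back-of-envelope form (writing c for the cost of one hash_search()
lookup, per the measurement above):

    t_{list}(N) = N \cdot c, \qquad t_{scan} \approx 40\,c
    \;\Longrightarrow\; \text{crossover at } N \approx 40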

Now on the other side, simple single-command transactions are very
unlikely to have created more than a few list entries anyway.  So
it's probably not worth getting very tense about the exact limit
as long as it's at least a couple dozen.  I set the limit to 32
as committed, because that seemed like a nice round number in the
right general area.

BTW, this measurement also convinced me that the patch is a win
even when the hashtable is near minimum size, even though there's
no practical way to isolate the cost of AtEOXact_RelationCache in
vivo in such cases.  It's good to know that we're not penalizing
simple cases to speed up the huge-number-of-relations case, even
if the penalty would be small.
        regards, tom lane



Re: Further pg_upgrade analysis for many tables

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> !          *    Using pg_restore --single-transaction is faster than other
> !          *    methods, like --jobs.

Is this still the case now that Jeff's AtEOXact patch is in?  The risk
of locktable overflow with --single-transaction makes me think that
pg_upgrade should avoid it unless there is a *really* strong performance
case for it, and I fear your old measurements are now invalidated.
        regards, tom lane



Re: Further pg_upgrade analysis for many tables

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> I'm curious if this is going to help with rollbacks of transactions
> which created lots of tables..?  We've certainly seen that take much
> longer than we'd like, although I've generally attributed it to doing
> all of the unlink'ing and truncating of files.

If a single transaction creates lots of tables and then rolls back,
this patch won't change anything because we'll long since have
overflowed the eoxact list.  But you weren't seeing an O(N^2) penalty
in such cases anyway: that penalty came from doing O(N) work in each
of N transactions.  I'm sure you're right that you're mostly looking
at the filesystem cleanup work, which we can't do much about.
        regards, tom lane



Re: Further pg_upgrade analysis for many tables

From
Jeff Janes
Date:
On Sunday, January 20, 2013, Stephen Frost wrote:
> * Jeff Janes (jeff.janes@gmail.com) wrote:
>> By making the list over-flowable, we fix a demonstrated pathological
>> workload (restore of huge schemas); we impose no detectable penalty to
>> normal workloads; and we fail to improve, but also fail to make worse, a
>> hypothetical pathological workload.  All at the expense of a few bytes per
>> backend.
> [...]
>>> Why does the list not grow as needed?
>>
>> It would increase the code complexity for no concretely-known benefit.
>
> I'm curious if this is going to help with rollbacks of transactions
> which created lots of tables..?  We've certainly seen that take much
> longer than we'd like, although I've generally attributed it to doing
> all of the unlink'ing and truncating of files.

If you are using large shared_buffers, then you will probably get more benefit from a different recent commit:

279628a  Accelerate end-of-transaction dropping of relations.

Cheers,

Jeff

Re: Further pg_upgrade analysis for many tables

From
Bruce Momjian
Date:
On Sun, Jan 20, 2013 at 02:11:48PM -0500, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > !          *    Using pg_restore --single-transaction is faster than other
> > !          *    methods, like --jobs.
>
> Is this still the case now that Jeff's AtEOXact patch is in?  The risk
> of locktable overflow with --single-transaction makes me think that
> pg_upgrade should avoid it unless there is a *really* strong performance
> case for it, and I fear your old measurements are now invalidated.

I had thought that the AtEOXact patch only helped single transactions
with many tables, but I now remember it mostly helps backends that have
accessed many tables.

With max_locks_per_transaction set high, I tested with the attached
patch that removes --single-transaction from pg_restore.  I saw a 4%
improvement by removing that option, and 15% at 64k.  (Test script
attached.)  I have applied the patch.  This is good news not just for
pg_upgrade but for other backends that access many tables.

                 git     patch
        1       11.06    11.03
     1000       19.97    20.86
     2000       28.50    27.61
     4000       46.90    45.65
     8000       79.38    80.68
    16000      153.33   147.13
    32000      317.40   302.96
    64000      782.94   659.52

FYI, this is better than the tests I did on the original patch that
showed --single-transaction was still a win then:

    http://www.postgresql.org/message-id/20121128202232.GA31741@momjian.us

>     #tbls       git     -1    AtEOXact  both
>         1      11.06   13.06   10.99   13.20
>      1000      21.71   22.92   22.20   22.51
>      2000      32.86   31.09   32.51   31.62
>      4000      55.22   49.96   52.50   49.99
>      8000     105.34   82.10   95.32   82.94
>     16000     223.67  164.27  187.40  159.53
>     32000     543.93  324.63  366.44  317.93
>     64000    1697.14  791.82  767.32  752.57

Keep in mind this doesn't totally avoid the requirement to increase
max_locks_per_transaction.  There are cases above 6k tables where pg_dump runs
out of locks, but I don't see how we can improve that.  Hopefully users
have already seen pg_dump fail and have adjusted
max_locks_per_transaction.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Attachment