Thread: performance regression in 9.2 when loading lots of small tables

performance regression in 9.2 when loading lots of small tables

From
Jeff Janes
Date:
There was a regression introduced in 9.2 that affects the creation and
loading of lots of small tables in a single transaction.

It affects the loading of a pg_dump file which has a large number of
small tables (10,000 schemas, one table per schema, 10 rows per
table).  I did not test other schema configurations, so these
specifics might not be needed to invoke the problem.

It causes the loading of a dump with "psql -1 -f " to run at half the
previous speed.  Speed of loading without the -1 is not changed.

The regression was introduced in 39a68e5c6ca7b41b, "Fix toast table
creation".  Perhaps the slowdown is an inevitable result of fixing the
bug.

The regression was removed from 9_1_STABLE at commit dff178f8017e4412,
"More cleanup after failed reduced-lock-levels-for-DDL feature".  It
is still present in 9_2_STABLE.

I don't really understand what is going on in these patches, but it
seems that either 9_1_STABLE now has a bug that was fixed and then
unfixed, or that 9_2_STABLE is slower than it needs to be.


The dump file I used can be obtained like this:

perl -le 'print "set client_min_messages=warning;"; print "create schema foo$_; create table foo$_.foo$_ (k integer, v integer); insert into foo$_.foo$_ select * from generate_series(1,10); " foreach $ARGV[0]..$ARGV[0]+$ARGV[1]-1' 0 10000 | psql > /dev/null ; pg_dump > 10000.dump
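For readers who would rather not decode the perl, here is a rough Python
equivalent that emits the same SQL.  The `make_sql` helper is just a name
invented for this sketch, and its output has not been diffed against the
perl version:

```python
def make_sql(start, count):
    """Emit the same SQL as the perl one-liner above: `count` schemas
    named foo<start>..foo<start+count-1>, one ten-row table apiece."""
    lines = ["set client_min_messages=warning;"]
    for i in range(start, start + count):
        lines.append(
            f"create schema foo{i}; "
            f"create table foo{i}.foo{i} (k integer, v integer); "
            f"insert into foo{i}.foo{i} select * from generate_series(1,10);"
        )
    return "\n".join(lines)

# Pipe print(make_sql(0, 10000)) to psql, then run pg_dump > 10000.dump.
```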


Cheers,

Jeff


Re: performance regression in 9.2 when loading lots of small tables

From
Robert Haas
Date:
On Mon, Jun 18, 2012 at 8:42 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> There was a regression introduced in 9.2 that affects the creation and
> loading of lots of small tables in a single transaction.
>
> It affects the loading of a pg_dump file which has a large number of
> small tables (10,000 schemas, one table per schema, 10 rows per
> table).  I did not test other schema configurations, so these
> specifics might not be needed to invoke the problem.
>
> It causes the loading of a dump with "psql -1 -f " to run at half the
> previous speed.  Speed of loading without the -1 is not changed.

I tried to figure out why this was happening.  I tried it out on the
IBM POWER7 box after building from
f5297bdfe4c4a47376c41b96161fb55c2294a0b1 and it ran for about 14
minutes.  'time' reports that psql used 38 seconds of user time and 11
seconds of system time.  The remaining time was presumably spent
waiting for the backend and/or the kernel.

I ran the backend under strace -cf and it had this to say:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 45.31   39.888639          10   3836379           write
 18.80   16.553878           3   4929697           lseek
 13.51   11.890826           6   1906179     12819 read
 10.24    9.011690           7   1372354      5995 recv
  6.27    5.517929           5   1107766           brk
  2.07    1.818345          91     20068           fsync
  1.46    1.281999          15     87260           send
  1.15    1.015621          24     42527     11334 open

I did a separate run using perf and it had this to say:

Events: 1M cycles
+  13.18%         postgres  postgres               [.] FlushRelationBuffers
+   9.65%         postgres  postgres               [.] comparetup_index_btree
+   9.34%          swapper  [kernel.kallsyms]      [k] .pseries_dedicated_idle_sleep
+   4.41%         postgres  postgres               [.] CopyReadLine
+   4.23%         postgres  postgres               [.] tuplesort_heap_siftup
+   4.17%         postgres  postgres               [.] LockReassignCurrentOwner
+   3.88%         postgres  postgres               [.] pg_mbstrlen_with_len
+   2.74%         postgres  postgres               [.] pg_verify_mbstr_len
+   2.46%         postgres  postgres               [.] NextCopyFromRawFields
+   2.44%         postgres  postgres               [.] btint4cmp
+   2.08%         postgres  libc-2.14.90.so        [.] memcpy
+   2.06%         postgres  postgres               [.] hash_seq_search
+   1.55%         postgres  [kernel.kallsyms]      [k] .__copy_tofrom_user
+   1.29%         postgres  libc-2.14.90.so        [.] __GI_____strtoll_l_internal
+   1.16%         postgres  postgres               [.] pg_utf_mblen

There are certainly opportunities for optimization there but it's sure
not obvious to me what it has to do with either of the commits you
mention.  I haven't actually verified that there's a regression on
this box as a result of either of those commits, though: I just profiled
master.  Maybe I better go check that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: performance regression in 9.2 when loading lots of small tables

From
Robert Haas
Date:
On Tue, Jun 19, 2012 at 4:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 18, 2012 at 8:42 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> There was a regression introduced in 9.2 that affects the creation and
>> loading of lots of small tables in a single transaction.
>>
>> It affects the loading of a pg_dump file which has a large number of
>> small tables (10,000 schemas, one table per schema, 10 rows per
>> table).  I did not test other schema configurations, so these
>> specifics might not be needed to invoke the problem.
>>
>> It causes the loading of a dump with "psql -1 -f " to run at half the
>> previous speed.  Speed of loading without the -1 is not changed.
>
> I tried to figure out why this was happening.  I tried it out on the
> IBM POWER7 box after building from
> f5297bdfe4c4a47376c41b96161fb55c2294a0b1 and it ran for about 14
> minutes.  'time' reports that psql used 38 seconds of user time and 11
> seconds of system time.  The remaining time was presumably spent
> waiting for the backend and/or the kernel.
>
> I ran the backend under strace -cf and it had this to say:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  45.31   39.888639          10   3836379           write
>  18.80   16.553878           3   4929697           lseek
>  13.51   11.890826           6   1906179     12819 read
>  10.24    9.011690           7   1372354      5995 recv
>  6.27    5.517929           5   1107766           brk
>  2.07    1.818345          91     20068           fsync
>  1.46    1.281999          15     87260           send
>  1.15    1.015621          24     42527     11334 open
>
> I did a separate run using perf and it had this to say:
>
> Events: 1M cycles
> +  13.18%         postgres  postgres               [.] FlushRelationBuffers
> +   9.65%         postgres  postgres               [.] comparetup_index_btree
> +   9.34%          swapper  [kernel.kallsyms]      [k] .pseries_dedicated_idle_sleep
> +   4.41%         postgres  postgres               [.] CopyReadLine
> +   4.23%         postgres  postgres               [.] tuplesort_heap_siftup
> +   4.17%         postgres  postgres               [.] LockReassignCurrentOwner
> +   3.88%         postgres  postgres               [.] pg_mbstrlen_with_len
> +   2.74%         postgres  postgres               [.] pg_verify_mbstr_len
> +   2.46%         postgres  postgres               [.] NextCopyFromRawFields
> +   2.44%         postgres  postgres               [.] btint4cmp
> +   2.08%         postgres  libc-2.14.90.so        [.] memcpy
> +   2.06%         postgres  postgres               [.] hash_seq_search
> +   1.55%         postgres  [kernel.kallsyms]      [k] .__copy_tofrom_user
> +   1.29%         postgres  libc-2.14.90.so        [.] __GI_____strtoll_l_internal
> +   1.16%         postgres  postgres               [.] pg_utf_mblen
>
> There are certainly opportunities for optimization there but it's sure
> not obvious to me what it has to do with either of the commits you
> mention.  I haven't actually verified that there's a regression on
> this box as a result of either of those commits, though: I just profiled
> master.  Maybe I better go check that.

I built REL9_1_STABLE from commit
1643031e5fe02d2de9ae6b8f86ef9ffd09fe7d3f and it took 19m45.250s, even
slower than master.  So something is different in my environment
versus yours.

I ran this build with perf also, and got this:
   38.77%         postgres  postgres               [.] FlushRelationBuffers
    6.80%          swapper  [kernel.kallsyms]      [k] .pseries_dedicated_idle_sleep
    4.84%         postgres  postgres               [.] comparetup_index_btree
    3.23%         postgres  postgres               [.] CopyReadLine
    2.24%         postgres  postgres               [.] tuplesort_heap_siftup
    2.23%         postgres  postgres               [.] pg_mbstrlen_with_len
    2.08%         postgres  postgres               [.] LockReassignCurrentOwner
    2.01%         postgres  postgres               [.] pg_verify_mbstr_len
    1.90%         postgres  postgres               [.] NextCopyFromRawFields
    1.62%         postgres  postgres               [.] btint4cmp
    1.41%         postgres  libc-2.14.90.so        [.] memcpy
    1.39%         postgres  postgres               [.] LWLockAcquire
    1.22%         postgres  postgres               [.] LWLockRelease
    1.19%         postgres  postgres               [.] pg_utf_mblen
    1.05%         postgres  [kernel.kallsyms]      [k] .__copy_tofrom_user

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: performance regression in 9.2 when loading lots of small tables

From
Jeff Janes
Date:
On Tue, Jun 19, 2012 at 2:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 19, 2012 at 4:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Jun 18, 2012 at 8:42 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> There was a regression introduced in 9.2 that affects the creation and
>>> loading of lots of small tables in a single transaction.
>>>
>>> It affects the loading of a pg_dump file which has a large number of
>>> small tables (10,000 schemas, one table per schema, 10 rows per
>>> table).  I did not test other schema configurations, so these
>>> specifics might not be needed to invoke the problem.
>>>
>>> It causes the loading of a dump with "psql -1 -f " to run at half the
>>> previous speed.  Speed of loading without the -1 is not changed.
>>
>> I tried to figure out why this was happening.  I tried it out on the
>> IBM POWER7 box after building from
>> f5297bdfe4c4a47376c41b96161fb55c2294a0b1 and it ran for about 14
>> minutes.  'time' reports that psql used 38 seconds of user time and 11
>> seconds of system time.  The remaining time was presumably spent
>> waiting for the backend and/or the kernel.
>>
...

>> I did a separate run using perf and it had this to say:
>>
>> Events: 1M cycles
>> +  13.18%         postgres  postgres               [.] FlushRelationBuffers
>> +   9.65%         postgres  postgres               [.] comparetup_index_btree
...
>
> I built REL9_1_STABLE from commit
> 1643031e5fe02d2de9ae6b8f86ef9ffd09fe7d3f and it took 19m45.250s, even
> slower than master.  So something is different in my environment
> versus yours.

Yes, I forgot perhaps the most important part.  FlushRelationBuffers
is a real pig under "-1 -f -" if you are running with a large
shared_buffers, and might swamp the other issue.  I run with
shared_buffers=8MB to get around that problem, and with fsync=off if
you don't have a BBU or something.  The unlocked test in
FlushRelationBuffers recently added to HEAD is probably what makes
HEAD faster if you are using a large shared_buffers.

With small shared_buffers and fsync off I get 1 minute to load 10,000
tables/schemas in the fast phenotype, or 2 minutes in the slow
phenotype.

It looks like the phenotype actually has come and gone more than once
(for different but related reasons), which is why I couldn't easily
find the place in REL9_2_STABLE where it happened with git bisect.

The fundamental problem seems to be that the slow phenotype takes
twice as many locks, which makes the local hash table twice as big,
which in turn makes the reassignment of resource owners twice as slow.
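To make the arithmetic concrete, here is a toy model (not PostgreSQL
code; the names and units are invented) of why scanning an ever-growing
local lock hash at each reassignment makes the total work quadratic in
the number of tables, and why doubling the locks per table doubles it:

```python
# Toy model of LockReassignCurrentOwner during a single-transaction
# restore: each CREATE TABLE adds locks_per_table entries to the
# session's local lock hash, and each reassignment scans the whole hash
# (hash_seq_search).  Total work is quadratic in the number of tables,
# so doubling the locks taken per table doubles the total.
def total_reassign_work(n_tables, locks_per_table):
    held = 0   # entries currently in the local lock hash
    work = 0   # total hash entries visited across all scans
    for _ in range(n_tables):
        held += locks_per_table   # locks taken by this CREATE TABLE
        work += held              # one full scan of the hash
    return work
```

On this model the slow phenotype (two locks per table) does exactly
twice the scanning of the fast one, matching the roughly 2x slowdown
observed.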

The original regression seems to be because each table had both an
AccessExclusiveLock and a ShareUpdateExclusiveLock taken on it when it
was checked to see if a toast table was needed.  I now see that that
was reverted in both branches.

But in the 9.2 branch, the slow phenotype was re-introduced in
1575fbcb795fc331f4, although perhaps the details of who is locking
what differs.  I haven't yet sorted that out.

Since the fundamental problem will hopefully be fixed in 9.3, can 9.2
live with this regression in what is probably a rather rare case
(loading a huge number of very small tables, with shared_buffers small
enough to avoid the other limitation)?

Cheers,

Jeff


Re: performance regression in 9.2 when loading lots of small tables

From
Robert Haas
Date:
On Tue, Jun 19, 2012 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> But in the 9.2 branch, the slow phenotype was re-introduced in
> 1575fbcb795fc331f4, although perhaps the details of who is locking
> what differs.  I haven't yet sorted that out.

It very much does.  That commit prevents people from creating a
relation in - or renaming a relation into - a schema that is being
concurrently dropped, which in previous releases would have resulted
in inconsistent catalog contents.  I admit that it harms your test
case, but how likely is it that someone is going to put every single
table into its own schema?  And have shared_buffers low enough for
this to be the dominant cost?  I think in real-world scenarios this
isn't going to be a problem - although, of course, making the lock
manager faster would be nifty if we can do it, and this might be a
good test case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: performance regression in 9.2 when loading lots of small tables

From
Jeff Janes
Date:
On Tue, Jun 19, 2012 at 8:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 19, 2012 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> But in the 9.2 branch, the slow phenotype was re-introduced in
>> 1575fbcb795fc331f4, although perhaps the details of who is locking
>> what differs.  I haven't yet sorted that out.
>
> It very much does.  That commit prevents people from creating a
> relation in - or renaming a relation into - a schema that is being
> concurrently dropped, which in previous releases would have resulted
> in inconsistent catalog contents.  I admit that it harms your test
> case, but how likely is it that someone is going to put every single
> table into its own schema?

Yep, I see that even having 10 tables per schema (which I think was
the case for the person who started this line of inquiry) rather than
just 1 greatly reduces this regression.  It basically goes from about
2x slower to about 1.1x slower.  So I think that pretty much settles
the issue for 9.2.


> And have shared_buffers low enough for
> this to be the dominant cost?

That one is not so unlikely.  After all, lowering shared_buffers is a
defensive move to get around the FlushRelationBuffers problem, so
hopefully people faced with the task of restoring such a dump would
take advantage of it, if they have a maintenance window in which to do
so.  Also, the FlushRelationBuffers issue is linear while the locks
are quadratic, so eventually the locking would regain dominance even
if shared_buffers is large.
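In other words (with invented cost units, purely to illustrate the
shapes of the two curves), the crossover point scales with
shared_buffers but always exists:

```python
# Invented cost model, only to show the shapes of the two curves:
# FlushRelationBuffers scans shared_buffers once per table (linear in
# table count), while lock reassignment grows quadratically with it.
def flush_cost(n_tables, shared_buffers):
    return n_tables * shared_buffers      # linear in n_tables

def lock_cost(n_tables):
    return n_tables * n_tables            # quadratic in n_tables

def crossover(shared_buffers):
    """Smallest table count at which locking overtakes flushing."""
    n = 1
    while lock_cost(n) <= flush_cost(n, shared_buffers):
        n += 1
    return n
```

With these units the crossover lands at shared_buffers + 1 tables;
raising shared_buffers pushes it out but never eliminates it.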

> I think in real-world scenarios this
> isn't going to be a problem - although, of course, making the lock
> manager faster would be nifty if we can do it, and this might be a
> good test case.

The resource owner reassign lock patch does effectively solve the lock
manager problem of restoring dumps of such databases, as well as the
similar problem in creating such dumps.

Cheers,

Jeff