Thread: performance regression in 9.2 when loading lots of small tables
There was a regression introduced in 9.2 that affects the creation and
loading of lots of small tables in a single transaction.

It affects the loading of a pg_dump file which has a large number of
small tables (10,000 schemas, one table per schema, 10 rows per
table). I did not test other schema configurations, so these specifics
might not be needed to invoke the problem.

It causes the loading of a dump with "psql -1 -f " to run at half the
previous speed. The speed of loading without the -1 is unchanged.

The regression was introduced in 39a68e5c6ca7b41b, "Fix toast table
creation". Perhaps the slowdown is an inevitable result of fixing the
bug.

The regression was removed from 9_1_STABLE at commit dff178f8017e4412,
"More cleanup after failed reduced-lock-levels-for-DDL feature". It is
still present in 9_2_STABLE.

I don't really understand what is going on in these patches, but it
seems that either 9_1_STABLE now has a bug that was fixed and then
unfixed, or that 9_2_STABLE is slower than it needs to be.

The dump file I used can be obtained like this:

perl -le 'print "set client_min_messages=warning;"; print "create schema foo$_; create table foo$_.foo$_ (k integer, v integer); insert into foo$_.foo$_ select * from generate_series(1,10); " foreach $ARGV[0]..$ARGV[0]+$ARGV[1]-1' 0 10000 | psql > /dev/null ; pg_dump > 10000.dump

Cheers,

Jeff
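[Editorial note: for readers who prefer not to untangle the perl one-liner, here is an equivalent generator sketched in Python. It emits the same SQL statements; the function name `make_load_sql` is introduced here and is not part of the thread.]

```python
# Sketch: emit the same SQL as Jeff's perl one-liner -- `count` schemas,
# one two-column table per schema, 10 rows inserted into each table.
# The output is meant to be piped to psql, then dumped with pg_dump.
def make_load_sql(start, count):
    lines = ["set client_min_messages=warning;"]
    for i in range(start, start + count):
        lines.append(f"create schema foo{i};")
        lines.append(f"create table foo{i}.foo{i} (k integer, v integer);")
        lines.append(f"insert into foo{i}.foo{i} select * from generate_series(1,10);")
    return "\n".join(lines)

# Small demo: 1 header line plus 3 statements per schema.
print(make_load_sql(0, 3).count("\n") + 1)  # → 10
```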
On Mon, Jun 18, 2012 at 8:42 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> There was a regression introduced in 9.2 that effects the creation and
> loading of lots of small tables in a single transaction.
>
> It affects the loading of a pg_dump file which has a large number of
> small tables (10,000 schemas, one table per schema, 10 rows per
> table). I did not test other schema configurations, so these
> specifics might not be needed to invoke the problem.
>
> It causes the loading of a dump with "psql -1 -f " to run at half the
> previous speed. Speed of loading without the -1 is not changed.

I tried to figure out why this was happening. I tried it out on the
IBM POWER7 box after building from
f5297bdfe4c4a47376c41b96161fb55c2294a0b1 and it ran for about 14
minutes. 'time' reports that psql used 38 seconds of user time and 11
seconds of system time. The remaining time was presumably spent
waiting for the backend and/or the kernel.

I ran the backend under strace -cf and it had this to say:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 45.31   39.888639          10   3836379           write
 18.80   16.553878           3   4929697           lseek
 13.51   11.890826           6   1906179     12819 read
 10.24    9.011690           7   1372354      5995 recv
  6.27    5.517929           5   1107766           brk
  2.07    1.818345          91     20068           fsync
  1.46    1.281999          15     87260           send
  1.15    1.015621          24     42527     11334 open

I did a separate run using perf and it had this to say:

Events: 1M cycles
+  13.18%  postgres  postgres           [.] FlushRelationBuffers
+   9.65%  postgres  postgres           [.] comparetup_index_btree
+   9.34%  swapper   [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
+   4.41%  postgres  postgres           [.] CopyReadLine
+   4.23%  postgres  postgres           [.] tuplesort_heap_siftup
+   4.17%  postgres  postgres           [.] LockReassignCurrentOwner
+   3.88%  postgres  postgres           [.] pg_mbstrlen_with_len
+   2.74%  postgres  postgres           [.] pg_verify_mbstr_len
+   2.46%  postgres  postgres           [.] NextCopyFromRawFields
+   2.44%  postgres  postgres           [.] btint4cmp
+   2.08%  postgres  libc-2.14.90.so    [.] memcpy
+   2.06%  postgres  postgres           [.] hash_seq_search
+   1.55%  postgres  [kernel.kallsyms]  [k] .__copy_tofrom_user
+   1.29%  postgres  libc-2.14.90.so    [.] __GI_____strtoll_l_internal
+   1.16%  postgres  postgres           [.] pg_utf_mblen

There are certainly opportunities for optimization there but it's sure
not obvious to me what it has to do with either of the commits you
mention. I haven't actually verified that there's a regression on
this box as a result of either of those commits, though: I just
profiled master. Maybe I better go check that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 19, 2012 at 4:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Jun 18, 2012 at 8:42 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> There was a regression introduced in 9.2 that effects the creation and
>> loading of lots of small tables in a single transaction.
>>
>> It affects the loading of a pg_dump file which has a large number of
>> small tables (10,000 schemas, one table per schema, 10 rows per
>> table). I did not test other schema configurations, so these
>> specifics might not be needed to invoke the problem.
>>
>> It causes the loading of a dump with "psql -1 -f " to run at half the
>> previous speed. Speed of loading without the -1 is not changed.
>
> I tried to figure out why this was happening. I tried it out on the
> IBM POWER7 box after building from
> f5297bdfe4c4a47376c41b96161fb55c2294a0b1 and it ran for about 14
> minutes. 'time' reports that psql used 38 seconds of user time and 11
> seconds of system time. The remaining time was presumably spent
> waiting for the backend and/or the kernel.
>
> I ran the backend under strace -cf and it had this to say:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  45.31   39.888639          10   3836379           write
>  18.80   16.553878           3   4929697           lseek
>  13.51   11.890826           6   1906179     12819 read
>  10.24    9.011690           7   1372354      5995 recv
>   6.27    5.517929           5   1107766           brk
>   2.07    1.818345          91     20068           fsync
>   1.46    1.281999          15     87260           send
>   1.15    1.015621          24     42527     11334 open
>
> I did a separate run using perf and it had this to say:
>
> Events: 1M cycles
> +  13.18%  postgres  postgres           [.] FlushRelationBuffers
> +   9.65%  postgres  postgres           [.] comparetup_index_btree
> +   9.34%  swapper   [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
> +   4.41%  postgres  postgres           [.] CopyReadLine
> +   4.23%  postgres  postgres           [.] tuplesort_heap_siftup
> +   4.17%  postgres  postgres           [.] LockReassignCurrentOwner
> +   3.88%  postgres  postgres           [.] pg_mbstrlen_with_len
> +   2.74%  postgres  postgres           [.] pg_verify_mbstr_len
> +   2.46%  postgres  postgres           [.] NextCopyFromRawFields
> +   2.44%  postgres  postgres           [.] btint4cmp
> +   2.08%  postgres  libc-2.14.90.so    [.] memcpy
> +   2.06%  postgres  postgres           [.] hash_seq_search
> +   1.55%  postgres  [kernel.kallsyms]  [k] .__copy_tofrom_user
> +   1.29%  postgres  libc-2.14.90.so    [.] __GI_____strtoll_l_internal
> +   1.16%  postgres  postgres           [.] pg_utf_mblen
>
> There are certainly opportunities for optimization there but it's sure
> not obvious to me what it has to do with either of the commits you
> mention. I haven't actually verified that there's a regression on
> this box as result of either of those commits, though: I just profiled
> master. Maybe I better go check that.

I built REL9_1_STABLE from commit
1643031e5fe02d2de9ae6b8f86ef9ffd09fe7d3f and it took 19m45.250s, even
slower than master. So something is different in my environment
versus yours.

I ran this build with perf also, and got this:

 38.77%  postgres  postgres           [.] FlushRelationBuffers
  6.80%  swapper   [kernel.kallsyms]  [k] .pseries_dedicated_idle_sleep
  4.84%  postgres  postgres           [.] comparetup_index_btree
  3.23%  postgres  postgres           [.] CopyReadLine
  2.24%  postgres  postgres           [.] tuplesort_heap_siftup
  2.23%  postgres  postgres           [.] pg_mbstrlen_with_len
  2.08%  postgres  postgres           [.] LockReassignCurrentOwner
  2.01%  postgres  postgres           [.] pg_verify_mbstr_len
  1.90%  postgres  postgres           [.] NextCopyFromRawFields
  1.62%  postgres  postgres           [.] btint4cmp
  1.41%  postgres  libc-2.14.90.so    [.] memcpy
  1.39%  postgres  postgres           [.] LWLockAcquire
  1.22%  postgres  postgres           [.] LWLockRelease
  1.19%  postgres  postgres           [.] pg_utf_mblen
  1.05%  postgres  [kernel.kallsyms]  [k] .__copy_tofrom_user

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 19, 2012 at 2:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 19, 2012 at 4:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Jun 18, 2012 at 8:42 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>> There was a regression introduced in 9.2 that effects the creation and
>>> loading of lots of small tables in a single transaction.
>>>
>>> It affects the loading of a pg_dump file which has a large number of
>>> small tables (10,000 schemas, one table per schema, 10 rows per
>>> table). I did not test other schema configurations, so these
>>> specifics might not be needed to invoke the problem.
>>>
>>> It causes the loading of a dump with "psql -1 -f " to run at half the
>>> previous speed. Speed of loading without the -1 is not changed.
>>
>> I tried to figure out why this was happening. I tried it out on the
>> IBM POWER7 box after building from
>> f5297bdfe4c4a47376c41b96161fb55c2294a0b1 and it ran for about 14
>> minutes. 'time' reports that psql used 38 seconds of user time and 11
>> seconds of system time. The remaining time was presumably spent
>> waiting for the backend and/or the kernel.
>> ...
>> I did a separate run using perf and it had this to say:
>>
>> Events: 1M cycles
>> +  13.18%  postgres  postgres  [.] FlushRelationBuffers
>> +   9.65%  postgres  postgres  [.] comparetup_index_btree
> ...
>
> I built REL9_1_STABLE from commit
> 1643031e5fe02d2de9ae6b8f86ef9ffd09fe7d3f and it took 19m45.250s, even
> slower than master. So something is different in my environment
> versus yours.

Yes, I forgot perhaps the most important part. FlushRelationBuffers
is a real pig under "-1 -f -" if you are running with large
shared_buffers, and might swamp the other issue. I run with
shared_buffers=8MB to get around that problem. And with fsync=off if
you don't have a BBU or something.

The unlocked test in FlushRelationBuffers recently added to HEAD is
probably what makes HEAD faster if you are using large shared_buffers.
With small shared_buffers and fsync off, I get 1 minute to load 10,000
tables/schemas in the fast phenotype, or 2 minutes in the slow
phenotype.

It looks like the phenotype actually has come and gone more than once
(for different but related reasons), which is why I couldn't find the
place in REL9_2_STABLE where it happened very easily with git bisect.

The fundamental problem seems to be that the slow phenotype takes
twice as many locks, which makes the local lock hash table twice as
big, which in turn makes the reassignment of resource owners twice as
slow.

The original regression seems to be because each table had both an
AccessExclusiveLock and a ShareUpdateExclusiveLock taken on it when it
was checked to see if a toast table was needed. I now see that that
was reverted in both branches. But in the 9.2 branch, the slow
phenotype was re-introduced in 1575fbcb795fc331f4, although perhaps
the details of who is locking what differ. I haven't yet sorted that
out.

Since the fundamental problem will hopefully be fixed in 9.3, can 9.2
live with this regression in what is probably a rather rare case
(loading a huge number of very small tables, with shared_buffers small
enough to avoid the other limitation)?

Cheers,

Jeff
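[Editorial note: the doubled-lock cost Jeff describes can be illustrated with a toy model (this is not PostgreSQL code). If the end-of-statement resource-owner reassignment scans the whole backend-local lock hash table, and that table keeps growing as the transaction creates more tables, total work is quadratic in the number of tables, and doubling the locks taken per table doubles the total cost:]

```python
# Toy model: each statement adds some locks to the backend-local lock
# hash table, and the end-of-statement resource-owner reassignment
# scans the entire (ever-growing) table.
def reassignment_work(n_tables, locks_per_table):
    held = 0       # entries currently in the local lock hash table
    scanned = 0    # total entries visited by reassignment scans
    for _ in range(n_tables):
        held += locks_per_table   # locks taken for this table
        scanned += held           # scan the whole table at statement end
    return scanned

fast = reassignment_work(10_000, 1)  # one lock per table
slow = reassignment_work(10_000, 2)  # slow phenotype: twice the locks
print(slow / fast)  # → 2.0: twice the locks, twice the reassignment cost
```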
On Tue, Jun 19, 2012 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> But in the 9.2 branch, the slow phenotype was re-introduced in
> 1575fbcb795fc331f4, although perhaps the details of who is locking
> what differs. I haven't yet sorted that out.

It very much does. That commit prevents people from creating a
relation in - or renaming a relation into - a schema that is being
concurrently dropped, which in previous releases would have resulted
in inconsistent catalog contents. I admit that it harms your test
case, but how likely is it that someone is going to put every single
table into its own schema? And have shared_buffers low enough for
this to be the dominant cost? I think in real-world scenarios this
isn't going to be a problem - although, of course, making the lock
manager faster would be nifty if we can do it, and this might be a
good test case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 19, 2012 at 8:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 19, 2012 at 10:56 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> But in the 9.2 branch, the slow phenotype was re-introduced in
>> 1575fbcb795fc331f4, although perhaps the details of who is locking
>> what differs. I haven't yet sorted that out.
>
> It very much does. That commit prevents people from creating a
> relation in - or renaming a relation into - a schema that is being
> concurrently dropped, which in previous releases would have resulted
> in inconsistent catalog contents. I admit that it harms your test
> case, but how likely is it that someone is going to put every single
> table into its own schema?

Yep, I see that even having 10 tables per schema (which I think was
the case for the person who started this line of inquiry) rather than
just 1 greatly reduces this regression. It basically goes from about
2x slower to about 1.1x slower. So I think that pretty much settles
the issue for 9.2.

> And have shared_buffers low enough for
> this to be the dominant cost?

That one is not so unlikely. After all, lowering shared_buffers is a
defensive move to get around the FlushRelationBuffers problem, so
hopefully people faced with the task of restoring such a dump would
take advantage of it, if they have a maintenance window in which to do
so. Also, the FlushRelationBuffers issue is linear while the locks
are quadratic, so eventually the locking would regain dominance even
if shared_buffers is large.

> I think in real-world scenarios this
> isn't going to be a problem - although, of course, making the lock
> manager faster would be nifty if we can do it, and this might be a
> good test case.

The resource owner reassign lock patch does effectively solve the
lock manager problem of restoring dumps of such databases, as well as
the similar problem in creating such dumps.

Cheers,

Jeff
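[Editorial note: Jeff's linear-vs-quadratic point can be sketched with another toy model. The constants below are arbitrary illustrative values, not measurements: whatever they are, the O(N) flush cost dominates for small dumps while the O(N^2) locking cost eventually overtakes it:]

```python
# Illustrative only -- constants are made up, not measured.
def flush_cost(n_tables, per_table=1000):
    # FlushRelationBuffers-style cost: linear in the number of tables
    return per_table * n_tables

def lock_cost(n_tables, locks_per_table=1):
    # Lock-reassignment cost: statement i scans the ~i locks taken so
    # far, giving roughly locks_per_table * n*(n+1)/2 total work
    return locks_per_table * n_tables * (n_tables + 1) // 2

print(flush_cost(100) > lock_cost(100))        # → True: flushing dominates
print(lock_cost(10_000) > flush_cost(10_000))  # → True: locking dominates
```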