Thread: slow commits with heavy temp table usage in 8.4.0
Hi,

I've noticed that on 8.4.0, commits can take a long time when a temp table is
repeatedly filled and truncated within a loop. A very contrived example is:

begin;

create or replace function commit_test_with_truncations()
returns void
language 'plpgsql' as $_func_$
declare
    i integer;
begin
    create temp table t1 (x integer) on commit drop;
    for i in 1 .. 22000 loop
        insert into t1 select s from generate_series(1,1000) s;
        truncate t1;
    end loop;
end;
$_func_$;

select commit_test_with_truncations();
commit;

On my laptop (Core2 Duo with 3.5GB and a disk dedicated to PG), the function
call takes about 124 seconds, and the commit takes about 43 seconds. The
function execution generates a lot of I/O activity, but the commit is
entirely CPU bound.

By contrast, the same test on an 8.2.13 system (2 older Xeons and 8GB) had
times of 495 and 19 seconds. In this case, both the function execution and
the commit were entirely CPU bound.

The overall process in 8.4 is much faster than 8.2.13, but the commit time is
somewhat surprising to me. Is that to be expected?

8.4 version(): PostgreSQL 8.4.0 on x86_64-redhat-linux-gnu, compiled by GCC
gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-27), 64-bit

8.2.13 version(): PostgreSQL 8.2.13 on x86_64-suse-linux-gnu, compiled by GCC
gcc (GCC) 3.3.5 20050117 (prerelease) (SUSE Linux)

-- todd
"Todd A. Cook" <tcook@blackducksoftware.com> writes: > I've noticed that on 8.4.0, commits can take a long time when a temp table is repeatedly > filled and truncated within a loop. A very contrived example is Hmm. I tweaked the function to allow varying the number of truncates: regression=# begin; BEGIN Time: 1.037 ms regression=# select commit_test_with_truncations(10000) ;commit_test_with_truncations ------------------------------ (1 row) Time: 9466.060 ms regression=# commit; COMMIT Time: 1095.946 ms regression=# begin; BEGIN Time: 1.002 ms regression=# select commit_test_with_truncations(30000) ;commit_test_with_truncations ------------------------------ (1 row) Time: 93492.874 ms regression=# commit; COMMIT Time: 3184.248 ms The commit time doesn't seem tremendously out of line, but it looks like there's something O(N^2)-ish in the function execution. Do you see a similar pattern? With so many temp files there could well be some blame on the kernel side. (This is a Fedora 10 box.) regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Todd A. Cook" <tcook@blackducksoftware.com> writes:
>> I've noticed that on 8.4.0, commits can take a long time when a
>> temp table is repeatedly filled and truncated within a loop.

> The commit time doesn't seem tremendously out of line, but it looks
> like there's something O(N^2)-ish in the function execution. Do
> you see a similar pattern? With so many temp files there could well
> be some blame on the kernel side. (This is a Fedora 10 box.)

This sounds very similar to my experience here:

http://archives.postgresql.org/pgsql-hackers/2008-11/msg01472.php

Depending on what sort of RAID controller caching is present, a BBU cache
might be containing the problem up to some threshold. Perhaps it's not so
much O(N^2) as O(N)-<someconstant>, with a min of zero?

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> "Todd A. Cook" <tcook@blackducksoftware.com> writes: >>> I've noticed that on 8.4.0, commits can take a long time when a >>> temp table is repeatedly filled and truncated within a loop. >> The commit time doesn't seem tremendously out of line, but it looks >> like there's something O(N^2)-ish in the function execution. Do >> you see a similar pattern? With so many temp files there could well >> be some blame on the kernel side. (This is a Fedora 10 box.) > This sounds very similar to my experience here: > http://archives.postgresql.org/pgsql-hackers/2008-11/msg01472.php I did some more poking with oprofile, and got this: samples % image name symbol name 559375 39.9848 postgres index_getnext 167626 11.9821 postgres TransactionIdIsCurrentTransactionId 107421 7.6786 postgres HeapTupleSatisfiesNow 65689 4.6955 postgres HeapTupleHeaderGetCmin 47220 3.3753 postgres HeapTupleHeaderGetCmax 46799 3.3452 postgres hash_search_with_hash_value 29331 2.0966 postgres heap_hot_search_buffer 23737 1.6967 postgres CatalogCacheFlushRelation 20562 1.4698 postgres LWLockAcquire 19838 1.4180 postgres heap_page_prune_opt 19044 1.3613 postgres _bt_checkkeys 17400 1.2438 postgres LWLockRelease 12993 0.9288 postgres PinBuffer So what I'm seeing is entirely explained by the buildup of dead versions of the temp table's pg_class row --- the index_getnext time is spent scanning over dead HOT-chain members. It might be possible to avoid that by special-casing temp tables in TRUNCATE to recycle the existing file instead of assigning a new one. However, there is no reason to think that 8.3 would be any better than 8.4 on that score. Also, I'm not seeing the very long CPU-bound commit phase that Todd is seeing. So I think there's something happening on his box that's different from what I'm measuring. I'm actually testing CVS HEAD, not 8.4.0, but I don't recall that we've done anything in the past month that would be likely to affect this ... regards, tom lane
Tom Lane wrote:
> So what I'm seeing is entirely explained by the buildup of dead versions
> of the temp table's pg_class row --- the index_getnext time is spent
> scanning over dead HOT-chain members. It might be possible to avoid
> that by special-casing temp tables in TRUNCATE to recycle the existing
> file instead of assigning a new one. However, there is no reason to
> think that 8.3 would be any better than 8.4 on that score. Also, I'm
> not seeing the very long CPU-bound commit phase that Todd is seeing.

The commit looks CPU-bound when I let the residual I/O from the function
execution die out before I issue the commit.

> I'm actually testing CVS HEAD, not 8.4.0, but I don't recall that we've
> done anything in the past month that would be likely to affect this ...

Tom's theory may explain the different commit results I get when testing on
two different databases (times are in ms):

db          truncations    function      commit
----------  -----------  ----------  ----------
test              10000   29603.624    6054.889
test              10000   34740.167   14551.177
test              10000   30608.260   11144.503
test              10000   32239.049    9846.676
test              30000  227115.850   50206.947
test              30000  201859.698   46083.222
test              30000  231926.642   46681.009
test              30000  235665.970   47113.137
production        10000   32982.069   17654.772
production        10000   33297.524   17396.792
production        10000   35503.185   18343.045
production        10000   34251.753   18284.725
production        30000  200899.786   75480.448
production        30000  206793.209   73316.405
production        30000  260491.759   72570.297
production        30000  191363.168   66659.129

The "test" DB is nearly empty with 251 entries in pg_class, whereas
"production" has real data with 9981 entries in pg_class.

-- todd
"Todd A. Cook" <tcook@blackducksoftware.com> writes: > Tom Lane wrote: >> I'm not seeing the very long CPU-bound commit phase that Todd is seeing. > The commit looks CPU-bound when I let the residual I/O from the function > execution die out before I issue the commit. Well, mine is CPU-bound too, it just is much shorter relative to the function execution time than you're showing. My test showed about a 9x ratio at 10000 truncates and a 30x ratio at 30000; you've got numbers in the range of 2x to 4x. So something is behaving differently between your machines and mine. I took a look through the CVS history and verified that there were no post-8.4 commits that looked like they'd affect performance in this area. So I think it's got to be a platform difference not a PG version difference. In particular I think we are probably looking at a filesystem issue: how fast can you delete 10000 or 30000 files. (I'm testing an ext3 filesystem on a plain ol consumer-grade drive that is probably lying about write complete, so I'd not be surprised if deletions go a lot faster than they "ought to" ... except that the disk issue shouldn't affect things if it's CPU-bound anyway.) As I said, my inclination for improving this area, if someone wanted to work on it, would be to find a way to do truncate-in-place on temp tables. ISTM that in the case you're showing --- truncate that's not within a subtransaction, on a table that's drop-on-commit anyway --- we should not need to keep around the pre-truncation data. So we could just do ftruncate instead of creating a new file, and we'd not need a new copy of the pg_class row either. So that should make both the function time and the commit time a lot better. But I'm not sure if the use-case is popular enough to deserve such a hack. regards, tom lane
I wrote:
> As I said, my inclination for improving this area, if someone wanted
> to work on it, would be to find a way to do truncate-in-place on
> temp tables. ISTM that in the case you're showing --- truncate that's
> not within a subtransaction, on a table that's drop-on-commit anyway
> --- we should not need to keep around the pre-truncation data. So we
> could just do ftruncate instead of creating a new file, and we'd not
> need a new copy of the pg_class row either. So that should make both
> the function time and the commit time a lot better. But I'm not sure
> if the use-case is popular enough to deserve such a hack.

Actually, this is easier than I thought, because there is already bookkeeping
being done that (in effect) tracks whether a table has already been truncated
in the current transaction. So we can rely on that, and with only a very few
lines of code added, ensure that a situation like this does only one
full-scale transaction-safe truncation per transaction. The attached
prototype patch does this and seems to fix the speed problem nicely. It's not
tremendously well tested, but perhaps you'd like to test? Should work in 8.4.

        regards, tom lane

Index: src/backend/catalog/heap.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/catalog/heap.c,v
retrieving revision 1.357
diff -c -r1.357 heap.c
*** src/backend/catalog/heap.c	2 Aug 2009 22:14:52 -0000	1.357
--- src/backend/catalog/heap.c	6 Aug 2009 06:15:05 -0000
***************
*** 2342,2359 ****
  	{
  		Oid			rid = lfirst_oid(cell);
  		Relation	rel;
- 		Oid			toastrelid;
  
  		rel = heap_open(rid, AccessExclusiveLock);
  		relations = lappend(relations, rel);
- 
- 		/* If there is a toast table, add it to the list too */
- 		toastrelid = rel->rd_rel->reltoastrelid;
- 		if (OidIsValid(toastrelid))
- 		{
- 			rel = heap_open(toastrelid, AccessExclusiveLock);
- 			relations = lappend(relations, rel);
- 		}
  	}
  
  	/* Don't allow truncate on tables that are referenced by foreign keys */
--- 2342,2350 ----
***************
*** 2364,2383 ****
  	{
  		Relation	rel = lfirst(cell);
  
! 		/* Truncate the actual file (and discard buffers) */
! 		RelationTruncate(rel, 0);
  
! 		/* If this relation has indexes, truncate the indexes too */
! 		RelationTruncateIndexes(rel);
! 
! 		/*
! 		 * Close the relation, but keep exclusive lock on it until commit.
! 		 */
  		heap_close(rel, NoLock);
  	}
  }
  
  /*
   * heap_truncate_check_FKs
   *		Check for foreign keys referencing a list of relations that
   *		are to be truncated, and raise error if there are any
--- 2355,2402 ----
  	{
  		Relation	rel = lfirst(cell);
  
! 		/* Truncate the relation */
! 		heap_truncate_one_rel(rel);
  
! 		/* Close the relation, but keep exclusive lock on it until commit */
  		heap_close(rel, NoLock);
  	}
  }
  
  /*
+  * heap_truncate_one_rel
+  *
+  *		This routine deletes all data within the specified relation.
+  *
+  * This is not transaction-safe, because the truncation is done immediately
+  * and cannot be rolled back later.  Caller is responsible for having
+  * checked permissions etc, and must have obtained AccessExclusiveLock.
+  */
+ void
+ heap_truncate_one_rel(Relation rel)
+ {
+ 	Oid			toastrelid;
+ 
+ 	/* Truncate the actual file (and discard buffers) */
+ 	RelationTruncate(rel, 0);
+ 
+ 	/* If the relation has indexes, truncate the indexes too */
+ 	RelationTruncateIndexes(rel);
+ 
+ 	/* If there is a toast table, truncate that too */
+ 	toastrelid = rel->rd_rel->reltoastrelid;
+ 	if (OidIsValid(toastrelid))
+ 	{
+ 		Relation	toastrel = heap_open(toastrelid, AccessExclusiveLock);
+ 
+ 		RelationTruncate(toastrel, 0);
+ 		RelationTruncateIndexes(toastrel);
+ 		/* keep the lock... */
+ 		heap_close(toastrel, NoLock);
+ 	}
+ }
+ 
+ /*
   * heap_truncate_check_FKs
   *		Check for foreign keys referencing a list of relations that
   *		are to be truncated, and raise error if there are any

Index: src/backend/commands/tablecmds.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/commands/tablecmds.c,v
retrieving revision 1.295
diff -c -r1.295 tablecmds.c
*** src/backend/commands/tablecmds.c	2 Aug 2009 22:14:52 -0000	1.295
--- src/backend/commands/tablecmds.c	6 Aug 2009 06:15:06 -0000
***************
*** 775,780 ****
--- 775,781 ----
  	EState	   *estate;
  	ResultRelInfo *resultRelInfos;
  	ResultRelInfo *resultRelInfo;
+ 	SubTransactionId mySubid;
  	ListCell   *cell;
  
  	/*
***************
*** 944,979 ****
  	/*
  	 * OK, truncate each table.
  	 */
  	foreach(cell, rels)
  	{
  		Relation	rel = (Relation) lfirst(cell);
- 		Oid			heap_relid;
- 		Oid			toast_relid;
  
  		/*
! 		 * Create a new empty storage file for the relation, and assign it as
! 		 * the relfilenode value. The old storage file is scheduled for
! 		 * deletion at commit.
  		 */
! 		setNewRelfilenode(rel, RecentXmin);
! 
! 		heap_relid = RelationGetRelid(rel);
! 		toast_relid = rel->rd_rel->reltoastrelid;
! 
! 		/*
! 		 * The same for the toast table, if any.
! 		 */
! 		if (OidIsValid(toast_relid))
  		{
! 			rel = relation_open(toast_relid, AccessExclusiveLock);
! 			setNewRelfilenode(rel, RecentXmin);
! 			heap_close(rel, NoLock);
  		}
! 
! 		/*
! 		 * Reconstruct the indexes to match, and we're done.
! 		 */
! 		reindex_relation(heap_relid, true);
  	}
  
  	/*
--- 945,1002 ----
  	/*
  	 * OK, truncate each table.
  	 */
+ 	mySubid = GetCurrentSubTransactionId();
+ 
  	foreach(cell, rels)
  	{
  		Relation	rel = (Relation) lfirst(cell);
  
  		/*
! 		 * Normally, we need a transaction-safe truncation here.  However,
! 		 * if the table was either created in the current (sub)transaction
! 		 * or has a new relfilenode in the current (sub)transaction, then
! 		 * we can just truncate it in-place, because a rollback would
! 		 * cause the whole table or the current physical file to be
! 		 * thrown away anyway.
  		 */
! 		if (rel->rd_createSubid == mySubid ||
! 			rel->rd_newRelfilenodeSubid == mySubid)
  		{
! 			/* Immediate, non-rollbackable truncation is OK */
! 			heap_truncate_one_rel(rel);
  		}
+ 		else
+ 		{
+ 			Oid			heap_relid;
+ 			Oid			toast_relid;
  
! 			/*
! 			 * Need the full transaction-safe pushups.
! 			 *
! 			 * Create a new empty storage file for the relation, and assign it
! 			 * as the relfilenode value. The old storage file is scheduled for
! 			 * deletion at commit.
! 			 */
! 			setNewRelfilenode(rel, RecentXmin);
! 
! 			heap_relid = RelationGetRelid(rel);
! 			toast_relid = rel->rd_rel->reltoastrelid;
! 
! 			/*
! 			 * The same for the toast table, if any.
! 			 */
! 			if (OidIsValid(toast_relid))
! 			{
! 				rel = relation_open(toast_relid, AccessExclusiveLock);
! 				setNewRelfilenode(rel, RecentXmin);
! 				heap_close(rel, NoLock);
! 			}
! 
! 			/*
! 			 * Reconstruct the indexes to match, and we're done.
! 			 */
! 			reindex_relation(heap_relid, true);
! 		}
  	}
  
  	/*

Index: src/include/catalog/heap.h
===================================================================
RCS file: /cvsroot/pgsql/src/include/catalog/heap.h,v
retrieving revision 1.91
diff -c -r1.91 heap.h
*** src/include/catalog/heap.h	11 Jun 2009 14:49:09 -0000	1.91
--- src/include/catalog/heap.h	6 Aug 2009 06:15:06 -0000
***************
*** 62,67 ****
--- 62,69 ----
  
  extern void heap_truncate(List *relids);
  
+ extern void heap_truncate_one_rel(Relation rel);
+ 
  extern void heap_truncate_check_FKs(List *relations, bool tempTables);
  
  extern List *heap_truncate_find_FKs(List *relationIds);
Tom Lane wrote:
> I took a look through the CVS history and verified that there were
> no post-8.4 commits that looked like they'd affect performance in
> this area. So I think it's got to be a platform difference not a
> PG version difference. In particular I think we are probably looking
> at a filesystem issue: how fast can you delete [...] 30000 files.

I'm still on Fedora 7, so maybe this will be motivation to upgrade.

FYI, on my 8.2.13 system, the test created 30001 files which were all deleted
during the commit. On my 8.4.0 system, the test created 60001 files, of which
30000 were deleted at commit and 30001 disappeared later (presumably during a
checkpoint?).

> But I'm not sure
> if the use-case is popular enough to deserve such a hack.

FWIW, the full app was looping over a set of datasets. On each iteration, it
computed some intermediate results into a temp table, generated several
reports from those intermediate results, and finally truncated the table for
the next iteration.

-- todd
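A rough sketch of that workload shape (every table and column name below is
invented for illustration):

create or replace function report_over_datasets()
returns void
language 'plpgsql' as $_func_$
declare
    ds integer;
begin
    create temp table intermediate (val integer) on commit drop;
    for ds in select dataset_id from datasets loop      -- hypothetical table
        insert into intermediate
            select measure from observations            -- hypothetical table
            where dataset_id = ds;
        perform count(*) from intermediate;             -- stand-in for the reports
        truncate intermediate;                          -- reset for the next dataset
    end loop;
end;
$_func_$;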
Tom Lane wrote:
> The attached prototype patch does this
> and seems to fix the speed problem nicely. It's not tremendously
> well tested, but perhaps you'd like to test? Should work in 8.4.

I'll give it a try and report back (though probably not until tomorrow).

-- todd
On Thu, Aug 6, 2009 at 11:32, Todd A. Cook <tcook@blackducksoftware.com> wrote:
> Tom Lane wrote:
>> I took a look through the CVS history and verified that there were
>> no post-8.4 commits that looked like they'd affect performance in
>> this area. So I think it's got to be a platform difference not a
>> PG version difference. In particular I think we are probably looking
>> at a filesystem issue: how fast can you delete [...] 30000 files.
>
> I'm still on Fedora 7, so maybe this will be motivation to upgrade.
>
> FYI, on my 8.2.13 system, the test created 30001 files which were all
> deleted during the commit. On my 8.4.0 system, the test created 60001
> files, of which 30000 were deleted at commit and 30001 disappeared
> later (presumably during a checkpoint?).

Smells like fsm? With double the number of files, maybe something simple like
turning on dir_index (if you are on ext3) will help?
Tom Lane wrote:
> Actually, this is easier than I thought, because there is already
> bookkeeping being done that (in effect) tracks whether a table has
> already been truncated in the current transaction. So we can rely
> on that, and with only a very few lines of code added, ensure that
> a situation like this does only one full-scale transaction-safe
> truncation per transaction. The attached prototype patch does this
> and seems to fix the speed problem nicely. It's not tremendously
> well tested, but perhaps you'd like to test? Should work in 8.4.

I downloaded the 8.4 source, built it unmodified, created a new cluster, and
ran the test in an empty DB there. Function execution took about 230 seconds,
and commit took about 6 seconds.

With the patch applied, the test only took 35 seconds, and the commit was
practically instant (30ms). I monitored the database directory, and the test
execution only created 2 files (down from 60000).

Thanks for the patch; it looks great. :)  Is there any chance that it will be
backpatched to 8.4?

-- todd
Alex Hunsaker wrote:
>> FYI, on my 8.2.13 system, the test created 30001 files which were all
>> deleted during the commit. On my 8.4.0 system, the test created 60001
>> files, of which 30000 were deleted at commit and 30001 disappeared
>> later (presumably during a checkpoint?).
>
> Smells like fsm?

Yes, that was it. 30000 of the filenames ended with "_fsm".

> With double the number of files maybe something
> simple like turning on dir_index if you are ext3 will help?

Thanks for the tip. Doing "tune2fs -O +dir_index" didn't seem to make a
difference, which is kinda expected for an existing directory. When I get a
chance, I'll try to recreate the filesystem.

-- todd
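For reference, 8.4 stores the free space map as a separate per-relation fork
on disk, which is where the doubled file count comes from. A quick way to tie
the extra files back to a relation, using a scratch table:

create temp table t1 (x integer);
insert into t1 select s from generate_series(1,1000) s;
vacuum t1;  -- the FSM fork is created once it is first needed
select relfilenode from pg_class where oid = 't1'::regclass;
-- on disk, the main fork's file name ends in that relfilenode, and the
-- free space map fork sits alongside it with an "_fsm" suffix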
"Todd A. Cook" <tcook@blackducksoftware.com> writes: > Tom Lane wrote: >> The attached prototype patch does this >> and seems to fix the speed problem nicely. It's not tremendously >> well tested, but perhaps you'd like to test? Should work in 8.4. > With the patch applied, the test only took 35 seconds, and the commit > was practically instant (30ms). I know it's faster, what I meant by testing was does it *work*. If you roll back a truncate, do you get the expected state? How about after a database crash? > Is there any chance that it will be backpatched to 8.4? Not a lot. regards, tom lane
On Fri, Aug 7, 2009 at 10:10, Todd A. Cook <tcook@blackducksoftware.com> wrote:
> Alex Hunsaker wrote:
>> With double the number of files maybe something
>> simple like turning on dir_index if you are ext3 will help?
>
> Thanks for the tip. Doing "tune2fs -O +dir_index" didn't seem to make
> a difference, which is kinda expected for an existing directory. When
> I get a chance, I'll try to recreate the filesystem.

e2fsck -D should also do the trick.
Tom Lane wrote:
> If you roll back a truncate, do you get the expected state?

I did a number of variations on the test below, with and without "on commit
drop", and similar tests where the "create table" is done before the "begin".
After the checkpoint, the number of files in the database directory always
returned to the value before the "begin" (210 in this case). Everything
behaved as expected.

test=# begin;
BEGIN
test=# create temp table t1 (x integer);
CREATE TABLE
test=# insert into t1 select s from generate_series(1,1000) s;
INSERT 0 1000
test=# select count(*) from t1;
 count
-------
  1000
(1 row)

test=# savepoint s1;
SAVEPOINT
test=# truncate t1;
TRUNCATE TABLE
test=# select count(*) from t1;
 count
-------
     0
(1 row)

test=# insert into t1 select s from generate_series(10000,11000) s;
INSERT 0 1001
test=# select count(*) from t1;
 count
-------
  1001
(1 row)

test=# rollback to savepoint s1;
ROLLBACK
test=# select count(*) from t1;
 count
-------
  1000
(1 row)

test=# commit;
COMMIT
test=# select count(*) from t1;
 count
-------
  1000
(1 row)

test=# checkpoint;
CHECKPOINT

> How about after a database crash?

Repeating the same test as above, after the second insert, I did
"killall -9 postgres". Restarting generated the expected messages in the log:

2009-08-07 13:09:56 EDT LOG:  database system was interrupted; last known up at 2009-08-07 13:06:01 EDT
2009-08-07 13:09:56 EDT LOG:  database system was not properly shut down; automatic recovery in progress
2009-08-07 13:09:56 EDT LOG:  redo starts at 0/1F8D6D0
2009-08-07 13:09:56 EDT LOG:  invalid magic number 0000 in log file 0, segment 1, offset 16392192
2009-08-07 13:09:56 EDT LOG:  redo done at 0/1F9F3B8
2009-08-07 13:09:56 EDT LOG:  autovacuum launcher started
2009-08-07 13:09:56 EDT LOG:  database system is ready to accept connections

However, the DB directory now has 214 files (up from 210); I have no idea
whether this means anything or not. Repeating the previous tests gives the
expected results.

-- todd
Tom Lane wrote:
> the function time and the commit time a lot better. But I'm not sure
> if the use-case is popular enough to deserve such a hack.

For some OLTP workloads, it makes a lot of sense to spool tuples of primary
key plus new fields into a temp table and then do a single update or delete
operation referencing the temp table. Perhaps not so much for code designed
for Postgres, where there is some extra flexibility with array params and
the like, but for code that targets other systems as well. Having temp
tables be as fast as possible is quite a big win in this case.
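For example, a minimal sketch of that pattern (the orders table and its
columns here are hypothetical):

begin;
create temp table upd_batch (id integer primary key, new_qty integer) on commit drop;
-- spool the keys and new values in bulk...
insert into upd_batch values (101, 5), (207, 3), (341, 8);
-- ...then apply the whole batch in a single statement
update orders o
   set qty = b.new_qty
  from upd_batch b
 where o.id = b.id;
commit;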
"Todd A. Cook" <tcook@blackducksoftware.com> writes: > Tom Lane wrote: >>> If you roll back a truncate, do you get the expected state? > I did a number of variations on the test below, with and without "on drop commit", > and similar tests where the "create table" is done before the "begin". After the > checkpoint, the number of files in the database directory always returned to the > value before the "begin" (210 in this case). Everything behaved as expected. Thanks for doing the testing. I've applied this patch to CVS HEAD. regards, tom lane