Thread: RAM-only temporary tables
An idea for a possible enhancement to PostgreSQL: allow creation of a temporary table without generating any disk I/O. (Creating and dropping a three-column temporary table within a database transaction currently generates about 150 disk writes.) If it isn't feasible in all circumstances to do this as a RAM-only operation, perhaps a clause could be added to the CREATE TEMPORARY TABLE syntax to specify this behavior, along with whatever limitations on the temporary table are required for this to work. (For example, maybe this is only feasible if the table will be dropped by the end of the transaction?) -Kevin
On Wed, Nov 5, 2008 at 16:00, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > An idea for a possible enhancement to PostgreSQL: allow creation of a > temporary table without generating any disk I/O. (Creating and > dropping a three-column temporary table within a database transaction > currently generates about 150 disk writes). Well, arguably you can already do this:

create tablespace ram location '/dev/shm/';
create temp table junk (junk int) tablespace ram;

In theory, because only one backend needs the temp table, we should be able to use mmap() to fake it being a file as well without too much trouble. But I'm sure I am missing something...
Kevin Grittner wrote: > An idea for a possible enhancement to PostgreSQL: allow creation of a > temporary table without generating any disk I/O. (Creating and > dropping a three-column temporary table within a database transaction > currently generates about 150 disk writes). > > If some circumstances don't make it feasible to always do this as a > RAM-only operation, perhaps a clause could be added to the CREATE > TEMPORARY TABLE syntax to specify this behavior along with whatever > limitations on the temporary table are required for this to work. > (For example, maybe this is only feasible if the table will be dropped > by the end of the transaction?) As someone else already pointed out, you can put temp tables on a RAM disk, but the larger issue is that temp tables still cause system table churn, and the system tables will always need to be on stable media.
Kevin Grittner wrote: > An idea for a possible enhancement to PostgreSQL: allow creation of a > temporary table without generating any disk I/O. (Creating and > dropping a three-column temporary table within a database transaction > currently generates about 150 disk writes). Most of these are catalog updates. A trace of WAL logs including only heap inserts says that to create a temp table with 3 columns (2 int, 1 text) and no indexes there are this many inserts:

 3  1247 (pg_type)
20  1249 (pg_attribute)
 3  1259 (pg_class)
 7  2608 (pg_depend)
 1  2610 (pg_index)

Note the excess of pg_attribute entries! There are 3 in the table, 3 in the toast table, and then there are 14 extra attrs which are for system columns (7 for the main table, 7 for the toast table). Just getting rid of pg_attribute entries for those would probably prove to be an important gain. (Don't forget the index updates for each of those heap inserts; for pg_type it's 2 btree inserts for each heap insert.) If you do this, you've shaved 42 of those 150 writes. Perhaps another gain is getting rid of array types for temp tables, a new feature of 8.3. Who wants those anyway? That also removes one or two of the pg_depend entries. Maybe these extra array types are the additional 8.3 overhead. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
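[Editorial aside: the pg_attribute count above can be checked with simple arithmetic. A sketch; the seven system columns assumed here are the pre-8.4 set, including oid.]

```python
# Back-of-the-envelope check of the pg_attribute insert count in the WAL trace.
user_columns = 3        # 2 int + 1 text
toast_columns = 3       # chunk_id, chunk_seq, chunk_data
system_columns = 7      # ctid, oid, xmin, cmin, xmax, cmax, tableoid (pre-8.4)

# One pg_attribute row per column, for both the main table and its toast table:
total_rows = user_columns + toast_columns + 2 * system_columns
print(total_rows)  # 20, matching the 20 pg_attribute inserts above
```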
>>> Alvaro Herrera <alvherre@commandprompt.com> wrote: > Kevin Grittner wrote: >> An idea for a possible enhancement to PostgreSQL: allow creation of a >> temporary table without generating any disk I/O. (Creating and >> dropping a three-column temporary table within a database transaction >> currently generates about 150 disk writes). > > Most of these are catalog updates. A trace of WAL logs including only > heap inserts says that to create a temp table with 3 columns (2 int, 1 > text) and no indexes there are this many inserts: > >  3  1247 (pg_type) > 20  1249 (pg_attribute) >  3  1259 (pg_class) >  7  2608 (pg_depend) >  1  2610 (pg_index) > > Note the excess of pg_attribute entries! There are 3 in the table, 3 in > the toast table, and then there are 14 extra attrs which are for system > columns (7 for the main table, 7 for the toast table). Just getting rid > of pg_attribute entries for those would probably prove to be an > important gain. (Don't forget the index updates for each of those heap > inserts; for pg_type it's 2 btree inserts for each heap insert.) If > you do this, you've shaved 42 of those 150 writes. Note that the 150 disk writes were for the CREATE and the DROP. Does that mean that we'd actually shave 84 of 150 writes? Also, if you're looking to account for all the writes, it's worth noting that my test declared a one-column primary key (on an integer column) in the CREATE TEMPORARY TABLE statement. In suggesting this enhancement, my hope is that each session could check for a referenced table as a temporary in RAM before going to the system tables, in a manner vaguely similar to how space reserved by the temp_buffers GUC is used for temp table data. I, of course, am suggesting this from a position of blissful ignorance of the actual complexity of making such a change. -Kevin
Kevin Grittner wrote: > Note that the 150 disk writes were for the CREATE and the DROP. Does > that mean that we'd actually shave 84 of 150 writes? Hmm, you'd shave more than 42 but not 84, because index entries are not deleted until a later vacuum. (I'd say about 56 -- 42 plus the 14 heap deletions). > Also, if you're looking to account for all the writes, it's worth > noting that my test declared a one-column primary key (on an integer > column) in the CREATE TEMPORARY TABLE statement. That probably makes up for the extra few writes that I didn't see in my quick test. > In suggesting this enhancement, my hope is that each session could > check for a referenced table as a temporary in RAM before going to the > system tables, in a manner vaguely similar to how space reserved by > the temp_buffers GUC is used for temp table data. I, of course, am > suggesting this from a position of blissful ignorance of the actual > complexity of making such a change. Right -- I don't expect we can make use of such an idea readily. Not creating unnecessary pg_attribute entries for system columns is probably a lot easier to do. The idea of uncatalogued temp tables has been suggested and rejected several times in the past. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
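[Editorial aside: the 42-vs-56 estimate above can be laid out explicitly. A sketch; it assumes pg_attribute carries two indexes, so each heap insert implies two index inserts.]

```python
extra_attr_rows = 14             # 7 system columns x (main table + toast table)
pg_attribute_indexes = 2         # assumed index count on pg_attribute

# CREATE: one heap insert plus one insert per index, per unneeded row.
create_savings = extra_attr_rows * (1 + pg_attribute_indexes)

# DROP: only the heap deletions happen right away; the index entries
# linger until a later vacuum, so they don't count here.
drop_savings = extra_attr_rows

print(create_savings, create_savings + drop_savings)  # 42 56
```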
>>> Alvaro Herrera <alvherre@commandprompt.com> wrote: > Kevin Grittner wrote: > >> Note that the 150 disk writes were for the CREATE and the DROP. Does >> that mean that we'd actually shave 84 of 150 writes? > > Hmm, you'd shave more than 42 but not 84, because index entries are not > deleted until a later vacuum. (I'd say about 56 -- 42 plus the 14 heap > deletions). > >> Also, if you're looking to account for all the writes, it's worth >> noting that my test declared a one-column primary key (on an integer >> column) in the CREATE TEMPORARY TABLE statement. > > That probably makes up for the extra few writes that I didn't see in my > quick test. It sounds like you were counting the 8kB pages pushed from the PostgreSQL cache to the OS cache, and I was counting the 1kB blocks pushed from the OS cache to the RAID controller cache. By watching vmstat results after pushing this to a more-or-less steady state, I was probably picking up the results of autovacuum runs, but multiple writes to a single page were often combined by the OS. If we match, it's really just a coincidence. -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Note that the 150 disk writes were for the CREATE and the DROP. Does > that mean that we'd actually shave 84 of 150 writes? It really shouldn't be the case that each system catalog tuple insertion generates a separate write --- especially not for multiple insertions into the same catalog, which we could expect to go into the same page or few pages. I think a large fraction of the writes you're measuring are coming from the file create/unlink operations. It would certainly be important to identify where the bulk of the cost *really* is before we start expending effort on a solution. > In suggesting this enhancement, my hope is that each session could > check for a referenced table as a temporary in RAM before going to the > system tables, in a manner vaguely similar to how space reserved by > the temp_buffers GUC is used for temp table data. This isn't very workable. For one thing, client-side operations such as psql's \dt still need to see catalog entries for temp tables. There's been some handwaving about keeping catalog entries for temp tables in temp children of the main system catalogs, but it hasn't got past the handwaving stage. regards, tom lane
Kevin Grittner wrote: > It sounds like you were counting the 8kB pages pushed from the > PostgreSQL cache to the OS cache, and I was counting the 1kB blocks > pushed from the OS cache to the RAID controller cache. By watching > vmstat results after pushing this to a more-or-less steady state, I > was probably picking up the results of autovacuum runs, but multiple > writes to a single page were often combined by the OS. If we match, > it's really just a coincidence. I was counting WAL inserts actually, so "logical database operations". I left out "storage" items (filesystem actions), so it was mainly just catalog changes. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Right -- I don't expect we can make use of such an idea readily. Not > creating unnecessary pg_attribute entries for system columns is probably > a lot easier to do. I seem to recall having proposed that in the past, and getting shot down on the basis that clients might be depending on those pg_attribute entries being there. I'm not sure how big a risk there really is --- most of the code I've seen explicitly selects attnum > 0 --- but it's a consideration. regards, tom lane
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think a large fraction of the writes you're measuring are coming from > the file create/unlink operations. It would certainly be important to > identify where the bulk of the cost *really* is before we start > expending effort on a solution. Any ideas on a good way to gather that information? Given the temp_buffers space, would it make sense to defer the creation of the actual file until there is actually a need to spill data to the disk? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think a large fraction of the writes you're measuring are coming from >> the file create/unlink operations. It would certainly be important to >> identify where the bulk of the cost *really* is before we start >> expending effort on a solution. > Any ideas on a good way to gather that information? I had done some preliminary trials using strace (you need to trace the active backend, the bgwriter, and the wal writer process to be sure you see everything going on). However it's difficult to tell how much physical I/O results from the create or unlink syscalls. It might be interesting to make a test program that just creates 4000 files and then removes them again, and see what sort of load you see from that. > Given the temp_buffers space, would it make sense to defer the > creation of the actual file until there is actually a need to spill > data to the disk? No, because that opens us to problems with reuse of relfilenode numbers. One place that I've always wanted to look at was suppressing the creation of a btree metapage until there's some useful data in the table. We managed to avoid creating a root page until there's data, but at the time avoiding the metapage seemed too invasive. (Admittedly, though, if one assumes that your real world case does involve putting some data in the tables, this wouldn't actually save anything...) regards, tom lane
Tom Lane wrote: > One place that I've always wanted to look at was suppressing the > creation of a btree metapage until there's some useful data in the > table. We managed to avoid creating a root page until there's data, > but at the time avoiding the metapage seemed too invasive. (Admittedly, > though, if one assumes that your real world case does involve putting > some data in the tables, this wouldn't actually save anything...) Agreed on the parenthesized comment -- it'd be just benchmark optimization. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: > it's difficult to tell how much > physical I/O results from the create or unlink syscalls. It might be > interesting to make a test program that just creates 4000 files We use xfs with noatime for our databases. In a fresh subdirectory of such a mountpoint:

for ((i=0 ; i < 4000 ; ++i)) ; do touch $i ; done

causes 44,969 block writes.

> and then > removes them again, and see what sort of load you see from that.

rm *

causes 26,820 block writes. That would make the file creation and unlink just under half the load. -Kevin
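[Editorial aside: the test above can be reproduced in a self-contained form, sketched here; the block-write counts must still be observed externally, e.g. with vmstat in another terminal, and will vary by filesystem and mount options.]

```python
import os
import tempfile

# Create and then remove 4000 empty files in a scratch directory,
# mirroring the touch/rm test above.
d = tempfile.mkdtemp()
for i in range(4000):
    open(os.path.join(d, str(i)), "w").close()

for name in os.listdir(d):
    os.unlink(os.path.join(d, name))
os.rmdir(d)
print("done")
```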
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > That would make the file creation and unlink just under half the load. Worst possible case :-( ... means that we wouldn't get much improvement without addressing both aspects. It strikes me however that this does put some urgency into the question of how much per-relation FSM is going to cost us. For short-lived temp tables the FSM is never going to have any usefulness at all, but in the current HEAD code it'll double the create/unlink load. Heikki, would it be reasonable to fix things so that a nonexistent FSM fork is semantically the same as an empty one, and not create FSM until there's actually something to put in it? regards, tom lane
Tom Lane wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> That would make the file creation and unlink just under half the load. > > Worst possible case :-( ... means that we wouldn't get much improvement > without addressing both aspects. > > It strikes me however that this does put some urgency into the question > of how much per-relation FSM is going to cost us. For short-lived temp > tables the FSM is never going to have any usefulness at all, but in the > current HEAD code it'll double the create/unlink load. Agreed. > Heikki, would it be reasonable to fix things so that a nonexistent FSM > fork is semantically the same as an empty one, and not create FSM until > there's actually something to put in it? Possibly, but I'd like to understand what exactly the problem is. I tried running this:

CREATE TEMPORARY TABLE footemp (id int4);
DROP TABLE footemp;

with pgbench -f, but can't see any meaningful difference between 8.3 and CVS HEAD. Both can do about 300 tpm, or 700-800 with fsync=off. There probably is a measurable difference there if you run longer tests, but doesn't seem like the extra file creation+unlink is worth worrying about. With the caveat that this is on reiserfs, on my laptop. Does someone see a difference on other filesystems? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Tom Lane wrote: >> Heikki, would it be reasonable to fix things so that a nonexistent FSM >> fork is semantically the same as an empty one, and not create FSM until >> there's actually something to put in it? > Possibly, but I'd like to understand what exactly the problem is. I > tried running this: > CREATE TEMPORARY TABLE footemp (id int4); > DROP TABLE footemp; > with pgbench -f, but can't see any meaningful difference between 8.3 and > CVS HEAD. Both can do about 300 tpm, or 700-800 with fsync=off. Try several thousand temp tables within one transaction. regards, tom lane
Heikki Linnakangas wrote: > I tried running this: > > CREATE TEMPORARY TABLE footemp (id int4); > DROP TABLE footemp; > > with pgbench -f, but can't see any meaningful difference between 8.3 and > CVS HEAD. Both can do about 300 tpm, or 700-800 with fsync=off. There > probably is a measurable difference there if you run longer tests, but > doesn't seem like the extra file creation+unlink is worth worrying > about. With the caveat that this is on reiserfs, on my laptop. Does > someone see a difference on other filesystems? I just tried that on a test server with the data directory on a small RAID array with ext3. Results are similar, at ~120 tps with both CVS HEAD and 8.3, and with fsync=off, around ~1700 tps on both versions. oprofile does suggest that more time is spent in the kernel in CVS HEAD.

CVS HEAD:

37387  13.9383  no-vmlinux   postgres    (no symbols)
20604   7.6814  postgres     postgres    CatalogCacheFlushRelation
16075   5.9929  postgres     postgres    hash_search_with_hash_value
10109   3.7688  postgres     postgres    LWLockAcquire
 9225   3.4392  postgres     postgres    _bt_compare
 8973   3.3452  postgres     postgres    XLogInsert
 8368   3.1197  postgres     postgres    LWLockRelease
 8009   2.9858  postgres     postgres    hash_any
 6944   2.5888  no-vmlinux   no-vmlinux  (no symbols)
 4432   1.6523  no-vmlinux   pgbench     (no symbols)
 4197   1.5647  postgres     postgres    AllocSetAlloc
 3773   1.4066  libc-2.7.so  postgres    memcmp
 3359   1.2523  postgres     postgres    _bt_checkkeys
 3315   1.2359  postgres     postgres    FunctionCall2
 3135   1.1688  postgres     postgres    PinBuffer

8.3:

27264   9.3254  no-vmlinux   postgres    (no symbols)
20673   7.0710  postgres     postgres    CatalogCacheFlushRelation
18576   6.3537  postgres     postgres    hash_search_with_hash_value
12795   4.3764  postgres     postgres    LWLockAcquire
11565   3.9557  postgres     postgres    _bt_compare
11538   3.9465  postgres     postgres    hash_any
10414   3.5620  postgres     postgres    XLogInsert
10100   3.4546  postgres     postgres    LWLockRelease
 6306   2.1569  postgres     postgres    _bt_checkkeys
 5096   1.7430  postgres     postgres    AllocSetAlloc
 4835   1.6538  no-vmlinux   no-vmlinux  (no symbols)
 4090   1.3989  postgres     postgres    PrepareToInvalidateCacheTuple
 4021   1.3753  postgres     postgres    FunctionCall2
 3965   1.3562  no-vmlinux   pgbench     (no symbols)
 3869   1.3234  libc-2.7.so  postgres    memcmp

but frankly I can't get too excited about that. I also quickly tried a test case with an ON COMMIT DELETE ROWS temp table. I modified pgbench slightly, so that it creates a temp table with ON COMMIT DELETE ROWS after connecting, and then ran a script with a simple one-row INSERT to the temp table. The results look similar; I'm getting ~11000-12000 tps on both 8.3 and CVS HEAD. oprofile suggests that about ~50% of the time is spent in the kernel, so I'm actually a bit surprised that the new FSM file isn't hurting more there. So, I'm not convinced we need to do anything, based on these tests. Maybe I missed the point of the OP; if so, a repeatable test case would be nice. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Tom Lane wrote: >>> Heikki, would it be reasonable to fix things so that a nonexistent FSM >>> fork is semantically the same as an empty one, and not create FSM until >>> there's actually something to put in it? > >> Possibly, but I'd like to understand what exactly the problem is. I >> tried running this: > >> CREATE TEMPORARY TABLE footemp (id int4); >> DROP TABLE footemp; > >> with pgbench -f, but can't see any meaningful difference between 8.3 and >> CVS HEAD. Both can do about 300 tpm, or 700-800 with fsync=off. > > Try several thousand temp tables within one transaction. I ran into an interesting problem while doing that. I created a SQL script with 10000 CREATE TEMPORARY TABLE statements. After testing with that a few times, I got this:

WARNING: out of shared memory
ERROR: out of shared memory
HINT: You might need to increase max_locks_per_transaction.

Not that surprising, but when I then just tried to run a single CREATE TEMPORARY TABLE in a new psql session, I got the same error. I then stopped and started postmaster, and I still got the same error! Testing with trace_locks=on, it looks like the table creation takes a lock on all the temp tables that are already gone, and runs out of memory doing that. I'll hunt that down, and try benchmarking the "thousands of temp tables in one transaction" case again after that... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
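[Editorial aside: the failure is consistent with the size of the shared lock table. A sketch with default settings assumed; the server sizes the lock table as roughly max_locks_per_transaction times the allowed transaction count, plus some slack.]

```python
# Why ~10000 temp tables blow out the shared lock table under defaults.
max_locks_per_transaction = 64   # default GUC value (assumed here)
max_connections = 100            # default GUC value (assumed here)

lock_slots = max_locks_per_transaction * max_connections  # ~6400 shared slots
temp_tables = 10000              # each holds at least one lock until commit

print(temp_tables > lock_slots)  # True -> "out of shared memory"
```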
Heikki Linnakangas wrote: > Tom Lane wrote: >> Try several thousand temp tables within one transaction. > > I ran into an interesting problem while doing that. I created a SQL > script with 10000 CREATE TEMPORARY TABLE statements. After testing with > that a few times, I got this: > > WARNING: out of shared memory > ERROR: out of shared memory > HINT: You might need to increase max_locks_per_transaction. > > Not that surprising, but when I then just tried to run a single CREATE > TEMPORARY TABLE in a new psql session, I got the same error. I then > stopped and started postmaster, and I still get the same error! Testing > with trace_locks=on, looks like the table creation takes a lock on all > the temp tables that are already gone, and runs out of memory doing that. What's happening is that there are simply so many temporary tables in pg_class that when the new backend tries to clear them with RemoveTempRelations, it runs out of lock space. One idea would be to modify RemoveTempRelations to not acquire locks on the temp objects, but that's pretty ugly and I'm not sure how it would interact with concurrent DROP TYPE CASCADE or similar. Any better ideas? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > What's happening is that there are simply so many temporary tables in > pg_class that when the new backend tries to clear them with > RemoveTempRelations, it runs out of lock space. What happened to the original owner of the tables? It must have had locks on all those tables ... regards, tom lane
Heikki Linnakangas wrote: > oprofile does suggest that more time is spent in the kernel in CVS HEAD. > CVS HEAD: > > 37387 13.9383 no-vmlinux postgres (no symbols) > 8.3: > > 27264 9.3254 no-vmlinux postgres (no symbols) What this means is that there are a lot more syscalls in HEAD than 8.3 ... -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> What's happening is that there are simply so many temporary tables in >> pg_class that when the new backend tries to clear them with >> RemoveTempRelations, it runs out of lock space. > > What happened to the original owner of the tables? It must have had > locks on all those tables ... It's pretty easily reproducible with this:

CREATE OR REPLACE FUNCTION createtemps (iters int4) RETURNS VOID
LANGUAGE plpgsql AS $$
DECLARE
  i int4;
BEGIN
  i := 0;
  WHILE i < iters LOOP
    EXECUTE 'CREATE TEMPORARY TABLE "dummytemp' || i || '_' || random() || '" (id int4)';
    i := i + 1;
  END LOOP;
END;
$$;

postgres=# SELECT createtemps(7000);
 createtemps
-------------

(1 row)

postgres=# SELECT createtemps(7000);
 createtemps
-------------

(1 row)

postgres=# \q
$ psql postgres
psql (8.4devel)
Type "help" for help.

postgres=# CREATE TEMPORARY TABLE footemp(id int4);
WARNING: out of shared memory
ERROR: out of shared memory
HINT: You might need to increase max_locks_per_transaction.

In this case, the temp tables are created in two separate transactions, which is why the creations don't run out of shared memory. But when they're later dropped, the new backend tries to grab a lock on all of them at the same time. I'm not quite sure how I got into that situation with my test case. I might have run the test case without a BEGIN/COMMIT at first, while developing the test script, so that all the thousands of temp tables were created in separate transactions. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Alvaro Herrera wrote: > Heikki Linnakangas wrote: > >> oprofile does suggest that more time is spent in the kernel in CVS HEAD. >> CVS HEAD: >> >> 37387 13.9383 no-vmlinux postgres (no >> symbols) > >> 8.3: >> >> 27264 9.3254 no-vmlinux postgres (no >> symbols) > > What this means is that there are a lot more syscalls in HEAD than 8.3 ... Yeah, sure. My point was that it doesn't seem to be significant in the big scheme of things. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Tom Lane wrote: >>> Heikki, would it be reasonable to fix things so that a nonexistent FSM >>> fork is semantically the same as an empty one, and not create FSM until >>> there's actually something to put in it? > >> Possibly, but I'd like to understand what exactly the problem is. I >> tried running this: > >> CREATE TEMPORARY TABLE footemp (id int4); >> DROP TABLE footemp; > >> with pgbench -f, but can't see any meaningful difference between 8.3 and >> CVS HEAD. Both can do about 300 tpm, or 700-800 with fsync=off. > > Try several thousand temp tables within one transaction. After increasing max_locks_per_transaction to avoid the issue I mentioned elsewhere in this thread, using this test script:

#!/bin/sh
echo "BEGIN;"
for ((i=0;i<=10000;i+=1)); do
  echo "CREATE TEMPORARY TABLE footemp$i (id int4);"
done
echo "COMMIT;"

And repeating a few times with:

$ time sh temptest.sh | ~/installations/cvshead/bin/psql postgres -q

I'm again seeing no big difference between 8.3 and CVS HEAD. I'm getting values around both sides of 6 seconds, with 5.8s being the minimum for CVS HEAD, and 5.6s the minimum for PG 8.3.
oprofile for 8.3:

samples  %        image name    app name    symbol name
23844    11.1625  postgres      postgres    LockReassignCurrentOwner
22299    10.4392  no-vmlinux    postgres    (no symbols)
15461     7.2380  postgres      postgres    hash_seq_search
12309     5.7624  postgres      postgres    CatalogCacheFlushRelation
10981     5.1407  postgres      postgres    hash_search_with_hash_value
7139      3.3421  postgres      postgres    get_tabstat_entry
7133      3.3393  postgres      postgres    XLogInsert
7112      3.3294  no-vmlinux    no-vmlinux  (no symbols)
6294      2.9465  postgres      postgres    _bt_compare
6027      2.8215  postgres      postgres    LWLockAcquire
5189      2.4292  postgres      postgres    hash_any
4744      2.2209  postgres      postgres    LWLockRelease
3355      1.5706  bash          bash        (no symbols)

oprofile for CVS HEAD:

samples  %        image name    app name    symbol name
30167    11.0321  postgres      postgres    LockReassignCurrentOwner
27345    10.0001  no-vmlinux    postgres    (no symbols)
19537     7.1447  postgres      postgres    hash_seq_search
17376     6.3544  postgres      postgres    CatalogCacheFlushRelation
14910     5.4526  postgres      postgres    findDependentObjects
14187     5.1882  postgres      postgres    hash_search_with_hash_value
9268      3.3893  postgres      postgres    get_tabstat_entry
7478      2.7347  postgres      postgres    _bt_compare
6861      2.5091  no-vmlinux    no-vmlinux  (no symbols)
6779      2.4791  postgres      postgres    LWLockAcquire
6764      2.4736  postgres      postgres    XLogInsert
5122      1.8731  postgres      postgres    LWLockRelease
4838      1.7693  postgres      postgres    hash_any
3501      1.2803  bash          bash        (no symbols)

I'm quite surprised to see LockReassignCurrentOwner being called at all.
gdb shows where the call is coming from:

#0  LockReassignCurrentOwner () at lock.c:1653
#1  0x0000000000662865 in ResourceOwnerReleaseInternal (owner=0xab21f8, phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=0 '\0') at resowner.c:261
#2  0x00000000006628f0 in ResourceOwnerRelease (owner=0xab21f8, phase=RESOURCE_RELEASE_LOCKS, isCommit=1 '\001', isTopLevel=0 '\0') at resowner.c:171
#3  0x000000000066139b in PortalDrop (portal=0xacc060, isTopCommit=<value optimized out>) at portalmem.c:418
#4  0x00000000005aa162 in exec_simple_query (query_string=0xa91350 "CREATE TEMPORARY TABLE footemp7013 (id int4);") at postgres.c:972

I'm also a bit surprised that the higher number of syscalls in CVS HEAD isn't visible in this oprofile report. The top percentages seemed to vary by a couple percentage points from run to run, though, so perhaps it's just drowned by noise. Kevin, what was your original scenario like that led you to investigate this? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Sorry for the late response; I was on vacation. >>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Kevin, what was your original scenario like that led you to investigate > this? We noticed a performance degradation in application code which, within a database transaction, looped through large numbers of iterations where each iteration created, used, and dropped a temporary table. The temporary table always had several columns (typically ten to twenty, many of which were varchar) and had a primary key. We initially thought this was because of an upgrade of PostgreSQL from 8.2.5 to 8.3.4, but subsequent testing showed that it was because of the concurrent update of the Linux kernel from one which defaulted to not using write barriers to one which did default to using write barriers. The file system was XFS on five spindles in RAID 5 on a good BBU controller. Updating the relevant /etc/fstab entry with nobarrier brought performance back to an acceptable level. -Kevin
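[Editorial aside: the fix described amounts to a single mount-option change. A hypothetical /etc/fstab line is sketched below; the device, mount point, and other options are placeholders, and nobarrier is only appropriate when, as here, a battery-backed controller cache stands in for write barriers.]

```
/dev/sdb1  /var/lib/pgsql  xfs  noatime,nobarrier  0 0
```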