Thread: ExclusiveLock
Recent runs of DBT-2 show very occasional ExclusiveLock(s) being held by transactions, sometimes waiting to be granted.

On Sat, Nov 06, 2004 at 11:40:49AM +0000, Simon Riggs wrote:
> The lockstats just show there's all those Exclusive Locks on order_line, right?:
> http://www.osdl.org/projects/dbt2dev/results/dev4-010/191/db/lockstats.out
>
> The output is...
>     relname    |  pid  |       mode       | granted
> ---------------+-------+------------------+---------
>  new_order     | 21735 | AccessShareLock  | t
>  new_order     | 21735 | RowExclusiveLock | t
>  orders        | 21715 | AccessShareLock  | t
>  orders        | 21715 | RowExclusiveLock | t
>  pg_class      | 23254 | AccessShareLock  | t
>  order_line    | 21715 | AccessShareLock  | t
>  order_line    | 21715 | RowExclusiveLock | t
>  order_line    | 21735 | ExclusiveLock    | f
>  new_order     | 21715 | AccessShareLock  | t
> ...
>
> which shows a non-granted lock, waiting for a Table-level ExclusiveLock
> on order_line.

This is unexpected (by me, that is...)

According to the manual, ExclusiveLock is not normally held by SQL statements. There are no LOCK TABLE statements in DBT-2.

My digging reveals that ExclusiveLock is held on user relations by _bt_getbuf(), when we extend a btree relation by one page.

I also find ExclusiveLock is held by:
- LISTEN/NOTIFY
- XactLockTableInsert()/XactLockTableDelete()
but those don't look like they lock user relations.

LockAcquire() says its locks show in lock tables, so is index extension the source of the ExclusiveLocks shown in the lock output? Presumably they would be short duration, so you wouldn't see them unless you caught it at just the right moment... unless we start to queue up on the leading edge of the index.

I expect index extension to be a source of contention anyway, but are we actually *seeing* it? Or is it another issue, and is this an 8.0 problem?

--
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> Recent runs of DBT-2 show very occasional ExclusiveLock(s) being held
> by transactions, sometimes waiting to be granted.

I think you are right that these reflect heap or btree-index extension operations. Those do not actually take locks on the *table*, however, but locks on a single page within it (which are completely orthogonal to table locks and don't conflict). The pg_locks output leaves something to be desired, because you can't tell the difference between table and page locks.

It's odd that your example does not appear to show someone else holding a conflicting lock.

			regards, tom lane
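To make the table-lock/page-lock distinction concrete, here is a simplified sketch in C. The types and fields are hypothetical illustrations, not the actual PostgreSQL lock-manager structures: the point is only that a relation-level request and a page-level request carry different lock tags, so the lock manager never treats them as the same lockable object, even though pg_locks reports both under the same relname.

    /*
     * Illustrative sketch only -- not the real PostgreSQL lock-manager code.
     * A table lock and a page lock are taken on different lock tags, so they
     * identify different lock objects and can never conflict with each other.
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { TAG_RELATION, TAG_PAGE } TagType;   /* hypothetical */

    typedef struct {
        TagType  type;      /* whole relation, or one page within it */
        uint32_t relid;     /* relation OID */
        uint32_t blkno;     /* page number; unused for TAG_RELATION */
    } LockTagSketch;

    /* Two requests can only conflict if they name the same lock object. */
    static bool same_lock_object(const LockTagSketch *a, const LockTagSketch *b)
    {
        return a->type == b->type &&
               a->relid == b->relid &&
               (a->type == TAG_RELATION || a->blkno == b->blkno);
    }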
On Mon, 2004-11-08 at 21:37, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Recent runs of DBT-2 show very occasional ExclusiveLock(s) being held
> > by transactions, sometimes waiting to be granted.
>
> I think you are right that these reflect heap or btree-index extension
> operations. Those do not actually take locks on the *table* however,
> but locks on a single page within it (which are completely orthogonal to
> table locks and don't conflict). The pg_locks output leaves something
> to be desired, because you can't tell the difference between table and
> page locks.

Good. Thought it was worth discussion...

> It's odd that your example does not appear to show someone else holding
> a conflicting lock.

There is... I didn't copy the whole lock table output... here it is...

    relname    |  pid  |       mode       | granted
---------------+-------+------------------+---------
 new_order     | 21735 | AccessShareLock  | t
 new_order     | 21735 | RowExclusiveLock | t
 orders        | 21715 | AccessShareLock  | t
 orders        | 21715 | RowExclusiveLock | t
 pg_class      | 23254 | AccessShareLock  | t
 order_line    | 21715 | AccessShareLock  | t
 order_line    | 21715 | RowExclusiveLock | t
 order_line    | 21735 | ExclusiveLock    | f
 new_order     | 21715 | AccessShareLock  | t
 new_order     | 21715 | RowExclusiveLock | t
 customer      | 21715 | AccessShareLock  | t
 pk_order_line | 21735 | AccessShareLock  | t
 pk_order_line | 21735 | RowExclusiveLock | t
 item          | 21715 | AccessShareLock  | t
 orders        | 21735 | AccessShareLock  | t
 orders        | 21735 | RowExclusiveLock | t
 order_line    | 21735 | AccessShareLock  | t
 order_line    | 21735 | RowExclusiveLock | t
 stock         | 21715 | AccessShareLock  | t
 stock         | 21715 | RowExclusiveLock | t
 order_line    | 21715 | ExclusiveLock    | t
 pk_order_line | 21715 | RowExclusiveLock | t
 pg_locks      | 23254 | AccessShareLock  | t
 district      | 21715 | AccessShareLock  | t
 district      | 21715 | RowShareLock     | t
 district      | 21715 | RowExclusiveLock | t
 warehouse     | 21715 | AccessShareLock  | t
 customer      | 21735 | AccessShareLock  | t
 customer      | 21735 | RowExclusiveLock | t
(29 rows)

Pids 21715 and 21735 are conflicting. There's also another example where the lock table output is > 1400 rows, with two lock requests pending.

--
Best Regards, Simon Riggs
The oprofile for this run looks like this (but is not, of course, a snapshot at a point in time, like the lock list):

CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        app name                 symbol name
170746   42.7220  vmlinux-2.6.8.1-osdl2    ia64_pal_call_static
18934     4.7374  libc-2.3.2.so            (no symbols)
10691     2.6750  postgres                 FunctionCall2
9814      2.4555  postgres                 hash_seq_search
8654      2.1653  postgres                 SearchCatCache
7389      1.8488  postgres                 AllocSetAlloc
6122      1.5318  postgres                 hash_search
5707      1.4279  postgres                 OpernameGetCandidates
4901      1.2263  postgres                 StrategyDirtyBufferList
4627      1.1577  postgres                 XLogInsert
4424      1.1069  postgres                 pglz_decompress
4371      1.0937  vmlinux-2.6.8.1-osdl2    __copy_user
3796      0.9498  vmlinux-2.6.8.1-osdl2    finish_task_switch
3483      0.8715  postgres                 LWLockAcquire
3458      0.8652  postgres                 eqjoinsel
3001      0.7509  vmlinux-2.6.8.1-osdl2    get_exec_dcookie
2824      0.7066  postgres                 AtEOXact_CatCache
2745      0.6868  postgres                 _bt_compare
2730      0.6831  postgres                 nocachegetattr
2715      0.6793  postgres                 SearchCatCacheList
2659      0.6653  postgres                 MemoryContextAllocZeroAligned
2604      0.6515  postgres                 yyparse
2553      0.6388  postgres                 eqsel
2127      0.5322  postgres                 deconstruct_array
1921      0.4806  postgres                 hash_any
1919      0.4801  postgres                 int4eq
1855      0.4641  postgres                 LWLockRelease
1839      0.4601  postgres                 StrategyBufferLookup
1777      0.4446  postgres                 GetSnapshotData
1729      0.4326  postgres                 heap_getsysattr
1595      0.3991  postgres                 DLMoveToFront
1586      0.3968  postgres                 MemoryContextAlloc
1485      0.3716  vmlinux-2.6.8.1-osdl2    try_atomic_semop
1455      0.3641  postgres                 anonymous symbol from section .plt
1409      0.3525  postgres                 lappend
1352      0.3383  postgres                 heap_release_fetch
1270      0.3178  postgres                 PinBuffer
1141      0.2855  postgres                 DirectFunctionCall1
1132      0.2832  postgres                 base_yylex
982       0.2457  postgres                 pgstat_initstats
957       0.2394  vmlinux-2.6.8.1-osdl2    __make_request
926       0.2317  postgres                 AllocSetFree
892       0.2232  vmlinux-2.6.8.1-osdl2    try_to_wake_up
874       0.2187  postgres                 _bt_checkkeys
870       0.2177  postgres                 fmgr_isbuiltin
853       0.2134  postgres                 ReadBufferInternal
852       0.2132  postgres                 pfree
850       0.2127  postgres                 _bt_moveright
848       0.2122  vmlinux-2.6.8.1-osdl2    do_cciss_request
766       0.1917  postgres                 ExecTargetList
734       0.1837  postgres                 SearchSysCache
730       0.1827  postgres                 PGSemaphoreLock
706       0.1766  postgres                 expression_tree_walker
684       0.1711  postgres                 ExecEvalVar
674       0.1686  postgres                 StrategyGetBuffer
669       0.1674  postgres                 ResourceOwnerForgetCatCacheRef
660       0.1651  postgres                 lcons
614       0.1536  vmlinux-2.6.8.1-osdl2    find_get_page
586       0.1466  postgres                 _bt_restscan
582       0.1456  postgres                 MemoryContextAllocZero
551       0.1379  postgres                 LockRelease
551       0.1379  postgres                 heap_formtuple
540       0.1351  postgres                 OidFunctionCall3
537       0.1344  postgres                 check_stack_depth
527       0.1319  postgres                 ExecutePlan
521       0.1304  postgres                 CatalogCacheComputeHashValue
510       0.1276  postgres                 buildRelationAliases
508       0.1271  vmlinux-2.6.8.1-osdl2    find_get_pages_tag
504       0.1261  postgres                 btgettuple
499       0.1249  postgres                 IndexNext
454       0.1136  postgres                 ExecInitExpr
453       0.1133  postgres                 ExecProcNode
447       0.1118  postgres                 LockAcquire

I note that an important one has dropped down the list:

1         2.5e-04 postgres                 AtEOXact_Buffers

and this is nowhere now...

          UnlockBuffers

StrategyDirtyBufferList is too high, so we can change that.

As a follow-on: we've got freelists for reuse of space. Do freelists work for index/heap extension also, or does everybody read the same info to get the next block... i.e. we are space conservative, rather than emphasising concurrency? It would be good to have one freelist per CPU...

--
Best Regards, Simon Riggs
Tom,

> I think you are right that these reflect heap or btree-index extension
> operations. Those do not actually take locks on the *table* however,
> but locks on a single page within it (which are completely orthogonal to
> table locks and don't conflict). The pg_locks output leaves something
> to be desired, because you can't tell the difference between table and
> page locks.

Aside from foreign keys, though, is there any way in which INSERT page locks could block other inserts? I have another system (Lyris) where that appears to be happening with 32 concurrent INSERT streams. It's possible that the problem is somewhere else, but I'm disturbed by the possibility.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes: > Aside from foriegn keys, though, is there any way in which INSERT page locks > could block other inserts? Not for longer than the time needed to physically add a tuple to a page. regards, tom lane
On Thu, 2004-11-18 at 22:12, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > > Aside from foriegn keys, though, is there any way in which INSERT page locks > > could block other inserts? > > Not for longer than the time needed to physically add a tuple to a page. The main problem on INSERTs is that it is usually the same few pages: the lead data block and the lead index block. There are ways of spreading the load out across an index, but I'm not sure what happens on the leading edge of the data relation, but I think it hits the same block each time. Only an issue if you have more than one CPU... -- Best Regards, Simon Riggs
Simon, Tom, > The main problem on INSERTs is that it is usually the same few pages: > the lead data block and the lead index block. There are ways of > spreading the load out across an index, but I'm not sure what happens on > the leading edge of the data relation, but I think it hits the same > block each time. I actually have several test cases for this, can you give me a trace or profile suggestion that would show if this is happening? -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Simon Riggs <simon@2ndquadrant.com> writes: > The main problem on INSERTs is that it is usually the same few pages: > the lead data block and the lead index block. There are ways of > spreading the load out across an index, but I'm not sure what happens on > the leading edge of the data relation, but I think it hits the same > block each time. FSM does what it can to spread the insertion load across multiple pages, but of course this is not going to help much unless your table has lots of embedded free space. I think it would work pretty well on a table with lots of update turnover, but not on an INSERT-only workload. regards, tom lane
Josh Berkus <josh@agliodbs.com> writes: >> The main problem on INSERTs is that it is usually the same few pages: >> the lead data block and the lead index block. There are ways of >> spreading the load out across an index, but I'm not sure what happens on >> the leading edge of the data relation, but I think it hits the same >> block each time. > I actually have several test cases for this, can you give me a trace or > profile suggestion that would show if this is happening? If it is a problem, the LockBuffer calls in RelationGetBufferForTuple would be the places showing contention delays. It could also be that the contention is for the WALInsertLock, ie, the right to stuff a WAL record into the shared buffers. This effect would be the same even if you were inserting into N separate tables. regards, tom lane
On Thu, 2004-11-18 at 22:51, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The main problem on INSERTs is that it is usually the same few pages:
> > the lead data block and the lead index block. There are ways of
> > spreading the load out across an index, but I'm not sure what happens on
> > the leading edge of the data relation, but I think it hits the same
> > block each time.
>
> FSM does what it can to spread the insertion load across multiple pages,
> but of course this is not going to help much unless your table has lots
> of embedded free space. I think it would work pretty well on a table
> with lots of update turnover, but not on an INSERT-only workload.

OK, that's what I thought. So for a table with an INSERT-only workload, the FSM is always empty, so there is only ever one block that gets locked. That means we can't ever go faster than one CPU can go - any other CPUs will just wait for the block lock.

[In Josh's case, 32 INSERT streams won't go significantly faster than about 4 streams, allowing for some overlap of other operations.]

Would it be possible to: when a new block is allocated from the relation file (rather than reused), we check the FSM - if it is empty, then we allocate 8 new blocks and add them all to the FSM. The next few INSERTers will then use the FSM blocks normally.

Doing that will definitely speed up DBT-2 and many other workloads. Many tables have SERIAL defined, or use a monotonically increasing unique key.

--
Best Regards, Simon Riggs
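As a minimal sketch of that batch-extension idea (every helper name here, such as fsm_get_free_page and relation_extend_one_block, is a hypothetical stand-in, not an actual backend function):

    /*
     * Sketch of the proposal above, with hypothetical helpers.  When the FSM
     * has nothing to offer and the relation must be physically extended,
     * extend by several blocks at once and publish the extras through the
     * FSM, so concurrent inserters spread out instead of queueing on a
     * single new page.
     */
    #define EXTEND_BATCH   8
    #define INVALID_BLOCK  ((BlockNumber) 0xFFFFFFFF)

    typedef struct Relation Relation;              /* opaque, hypothetical */
    typedef unsigned int BlockNumber;

    extern BlockNumber relation_extend_one_block(Relation *rel);   /* stand-in */
    extern void        fsm_record_free_page(Relation *rel, BlockNumber blk);
    extern BlockNumber fsm_get_free_page(Relation *rel);  /* INVALID_BLOCK if empty */

    static BlockNumber
    get_insert_block(Relation *rel)
    {
        BlockNumber blk = fsm_get_free_page(rel);
        if (blk != INVALID_BLOCK)
            return blk;                            /* reuse recorded free space */

        /* FSM empty: extend by a batch, keep one page, publish the rest. */
        blk = relation_extend_one_block(rel);
        for (int i = 1; i < EXTEND_BATCH; i++)
            fsm_record_free_page(rel, relation_extend_one_block(rel));
        return blk;
    }

The trade-off is a small amount of transiently unused space at the end of the relation in exchange for letting concurrent inserters target different pages.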
Simon Riggs <simon@2ndquadrant.com> writes: > Would it be possible to: when a new block is allocated from the relation > file (rather than reused), we check the FSM - if it is empty, then we > allocate 8 new blocks and add them all to the FSM. The next few > INSERTers will then use the FSM blocks normally. Most likely that would just shift the contention to the WALInsertLock. regards, tom lane
On Thu, 2004-11-18 at 23:19, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Would it be possible to: when a new block is allocated from the relation
> > file (rather than reused), we check the FSM - if it is empty, then we
> > allocate 8 new blocks and add them all to the FSM. The next few
> > INSERTers will then use the FSM blocks normally.
>
> Most likely that would just shift the contention to the WALInsertLock.

Well, removing any performance bottleneck shifts the bottleneck to another place, though that is not an argument against removing it.

Can we subdivide the WALInsertLock so there are multiple entry points to wal_buffers, based upon hashing the xid? That would allow WAL to be written sequentially by each transaction, though slightly out of order for different transactions. Commit/Abort would all go through the same lock to guarantee serializability.

--
Best Regards, Simon Riggs
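Purely to illustrate the shape of that partitioning idea (the names below are hypothetical stand-ins, not PostgreSQL code, and Tom's reply that follows explains why a linear log makes this much harder than the sketch suggests):

    /*
     * Illustration only: partition the WAL-insert lock by transaction id.
     * All identifiers here are hypothetical; the global ordering problems
     * raised in the reply below are exactly what this sketch glosses over.
     */
    #define N_WAL_INSERT_LOCKS 8

    typedef struct LWLockSketch LWLockSketch;          /* opaque stand-in */
    extern LWLockSketch *wal_insert_locks[N_WAL_INSERT_LOCKS];
    extern void lwlock_acquire(LWLockSketch *lock);
    extern void lwlock_release(LWLockSketch *lock);
    extern void copy_record_into_wal_buffers(const void *rec, unsigned len);

    static void
    wal_insert_partitioned(unsigned int xid, const void *rec, unsigned len)
    {
        LWLockSketch *lock = wal_insert_locks[xid % N_WAL_INSERT_LOCKS];

        lwlock_acquire(lock);                   /* blocks only same-partition xids */
        copy_record_into_wal_buffers(rec, len); /* hand-waves away log ordering   */
        lwlock_release(lock);
    }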
Simon Riggs <simon@2ndquadrant.com> writes:
> Can we subdivide the WALInsertLock so there are multiple entry points to
> wal_buffers, based upon hashing the xid?

I don't think so; WAL is inherently a linear log. (Awhile ago there was some talk of nonlinear log writing to get around the one-commit-per-disk-revolution syndrome, but the idea basically got rejected as unworkably complicated.) What's more, there are a lot of entries that must remain time-ordered independently of transaction ownership. Consider btree index page splits and sequence nextvals for two examples.

Certainly I'd not buy into any such project without incontrovertible proof that it would solve a major bottleneck --- and right now we are only speculating with no evidence.

			regards, tom lane
On Thu, 2004-11-18 at 22:55, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> >> The main problem on INSERTs is that it is usually the same few pages:
> >> the lead data block and the lead index block. There are ways of
> >> spreading the load out across an index, but I'm not sure what happens on
> >> the leading edge of the data relation, but I think it hits the same
> >> block each time.
>
> > I actually have several test cases for this, can you give me a trace or
> > profile suggestion that would show if this is happening?
>
> If it is a problem, the LockBuffer calls in RelationGetBufferForTuple
> would be the places showing contention delays.

You say this as if we can easily check that. My understanding is that this would require a scripted gdb session to instrument the executable at that point. Is that what you mean? That isn't typically regarded as a great thing to do on a production system.

You've cautioned against performance speculation, which I agree with, but what are the alternatives? Compile-time changes aren't usually able to be enabled, since many people work from RPMs.

> It could also be that the contention is for the WALInsertLock, ie, the
> right to stuff a WAL record into the shared buffers. This effect would
> be the same even if you were inserting into N separate tables.

...and how do we check that also? Are we back to simulated workloads and fully rigged executables?

--
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes: > On Thu, 2004-11-18 at 22:55, Tom Lane wrote: >> If it is a problem, the LockBuffer calls in RelationGetBufferForTuple >> would be the places showing contention delays. > You say this as if we can easily check that. I think this can be done with oprofile ... regards, tom lane
On Sat, 2004-11-20 at 16:14, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Thu, 2004-11-18 at 22:55, Tom Lane wrote:
> >> If it is a problem, the LockBuffer calls in RelationGetBufferForTuple
> >> would be the places showing contention delays.
>
> > You say this as if we can easily check that.
>
> I think this can be done with oprofile ...

OK, well that's where this thread started. oprofile only tells us aggregate information. It doesn't tell us how much time is spent waiting because of contention issues; it just tells us how much time is spent, and even that is skewed.

There really ought to be a better way to instrument things from inside, based upon knowledge of the code.

--
Best Regards, Simon Riggs
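One low-tech way to instrument from inside, sketched here only as an illustration (the wrapped call and the counters are hypothetical stand-ins; only the timing pattern matters), is to wrap the suspect acquisition and accumulate wait time in the backend itself:

    /*
     * Sketch of in-backend wait instrumentation, as an alternative to
     * aggregate profiling.  acquire_suspect_lock() stands in for whatever
     * call is under suspicion (e.g. a buffer-lock acquisition).
     */
    #include <sys/time.h>

    extern void acquire_suspect_lock(void *lock);   /* hypothetical stand-in */

    static double total_wait_secs;                  /* per-process counters  */
    static long   wait_count;

    static void
    acquire_with_timing(void *lock)
    {
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        acquire_suspect_lock(lock);
        gettimeofday(&t1, NULL);

        total_wait_secs += (t1.tv_sec - t0.tv_sec) +
                           (t1.tv_usec - t0.tv_usec) / 1e6;
        wait_count++;       /* report both totals at backend or transaction exit */
    }

Reporting the accumulated totals at backend exit would give per-call-site wait figures that an aggregate profiler such as oprofile cannot provide.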
On Thu, 2004-11-18 at 23:54, Tom Lane wrote:
> I don't think so; WAL is inherently a linear log. (Awhile ago there was
> some talk of nonlinear log writing to get around the one-commit-per-disk-revolution
> syndrome, but the idea basically got rejected as unworkably complicated.)

...this appears to still be on the TODO list... should it be removed?

- Find a way to reduce rotational delay when repeatedly writing last WAL page

  Currently fsync of WAL requires the disk platter to perform a full
  rotation to fsync again. One idea is to write the WAL to different
  offsets that might reduce the rotational delay.

--
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> - Find a way to reduce rotational delay when repeatedly writing last WAL
>   page
>
>   Currently fsync of WAL requires the disk platter to perform a full
>   rotation to fsync again. One idea is to write the WAL to different
>   offsets that might reduce the rotational delay.

Once upon a time when you formatted hard drives you actually gave them an interleave factor for a similar reason. These days you invariably use an interleave of 1, ie, store the blocks continuously. Whether that's because controllers have become fast enough to keep up with the burst rate or because the firmware is smart enough to handle the block interleaving invisibly isn't clear to me.

I wonder if formatting the drive to have an interleave >1 would actually improve performance of the WAL log.

It would depend a lot on the usage pattern though. A heavily used system might be able to generate enough WAL traffic to keep up with the burst rate of the drive. And a less used system might benefit, but it might also lose.

Probably now the less-than-saturated system gets close to the average half-rotation-time latency. This idea would only really help if you have a system that happens to be triggering pessimal results worse than that due to unfortunate timing.

--
greg
On Mon, 2004-11-22 at 23:37, Greg Stark wrote:
> I wonder if formatting the drive to have an interleave >1 would actually
> improve performance of the WAL log.
> [snip]

I was asking whether that topic should be removed, since Tom had said it had been rejected...

If you could tell me how to instrument the system to (better) show whether such plans as you suggest are workable, I would be greatly interested. Anything we do needs to be able to be monitored for success/failure.

--
Best Regards, Simon Riggs
Simon Riggs wrote:
> I was asking whether that topic should be removed, since Tom had said it
> had been rejected...

The method used to fix it was rejected, but the goal of making it better is still a valid one.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
Greg Stark <gsstark@mit.edu> writes: > Once upon a time when you formatted hard drives you actually gave them an > interleave factor for a similar reason. These days you invariably use an > interleave of 1, ie, store the blocks continuously. Whether that's because > controllers have become fast enough to keep up with the burst rate or because > the firmware is smart enough to handle the block interleaving invisibly isn't > clear to me. The impression I had was that disk drives no longer pay the slightest attention to interleave specs, because the logical model implied by the concept is too far removed from modern reality (on-disk buffering, variable numbers of sectors per track, transparently remapped bad sectors, yadda yadda). And that's just at the hardware level ... who knows where the filesystem is putting your data, or what the kernel I/O scheduler is doing with your requests :-( Basically I see the TODO item as a blue-sky research topic, not something we have any idea how to implement. That doesn't mean it can't be on the TODO list ... regards, tom lane
On Tue, Nov 23, 2004 at 12:04:17AM +0000, Simon Riggs wrote:
> I was asking whether that topic should be removed, since Tom had said it
> had been rejected...
>
> If you could tell me how to instrument the system to (better) show
> whether such plans as you suggest are workable, I would be greatly
> interested. Anything we do needs to be able to be monitored for
> success/failure.

Disk performance has increased so much that the reasons for having an interleave factor other than 1 (no interleaving) have all but disappeared. CPU speed has also increased so much relative to disk speed that using some CPU cycles to improve I/O is a reasonable approach. I have been considering how this might be accomplished. As Simon so aptly pointed out, we need to show that it materially affects performance or it is not worth doing.

The simplest idea I had was to lay out the WAL logs ahead of time in a contiguous fashion on the disk. Solaris has this ability given appropriate FS parameters, and we should be able to get close on most other OSes. Once that has happened, use something like the FSM map to track the allocated blocks. The CPU can keep track of the disk's current rotational position (approximate is okay); then, when we need to write a WAL block, start writing at the next area that the disk head will be sweeping over. Give it a little leeway for latency in the system and we should be able to get very low latency for the writes. Obviously there would be wasted space, but you could intersperse writes to whatever granularity of space overhead you would like to see.

As far as implementation goes, I was reading an interesting article that used a simple theoretical model to estimate disk head position to avoid latency.

Yours truly,
Ken Marshall
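As a back-of-the-envelope sketch of that head-position estimate (the RPM, sector count, and safety margin below are assumed numbers chosen purely for illustration; real drives remap sectors and cache writes, as Tom noted earlier, so this is the model behind the idea rather than a workable driver):

    /*
     * Back-of-the-envelope sketch: estimate where the platter is and pick a
     * write offset a little ahead of the head.  All constants are assumptions.
     */
    #include <stdint.h>

    #define RPM               10000
    #define USEC_PER_REV      (60.0 * 1e6 / RPM)     /* 6000 us at 10k RPM */
    #define SECTORS_PER_TRACK 512                    /* assumed            */
    #define SAFETY_MARGIN     16                     /* sectors of slack   */

    /* Estimate the sector currently under the head, given microseconds
     * elapsed since a known reference point (e.g. an index mark), and
     * return a target sector slightly ahead of it. */
    static uint32_t next_write_sector(double usec_since_reference)
    {
        double   revs   = usec_since_reference / USEC_PER_REV;
        double   frac   = revs - (double)(uint64_t)revs;       /* fractional rev */
        uint32_t sector = (uint32_t)(frac * SECTORS_PER_TRACK);

        /* aim a little ahead of the head so the write lands this revolution */
        return (sector + SAFETY_MARGIN) % SECTORS_PER_TRACK;
    }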