Thread: 'COPY ... FROM' inserts to btree, blocks on buffer writeout
Hi,

two concurrent clients try to 'COPY ... FROM ...' to the same table,
"feature_link".

The second one (pid 17983) is waiting for an ExclusiveLock on the table's
primary key index, "key__idpk__flink".

The first one (pid 17980) is inserting into the index's btree, waiting
for a buffer lock. This locking state seems to persist.

Do you have any idea what could cause this scenario and what kind of
debugging information I could extract (the processes are still running)
to find out who is currently holding the buffer lock in question?

Thanks for your help,
best wishes,
Mike

[fuchs:~/build/postgresql-8.0.0beta2] vi +1004 ./src/backend/storage/buffer/bufmgr.c

 999     /*
1000      * Protect buffer content against concurrent update.  (Note that
1001      * hint-bit updates can still occur while the write is in progress,
1002      * but we assume that that will not invalidate the data written.)
1003      */
1004     LockBuffer(buffer, BUFFER_LOCK_SHARE);
1005

[fuchs:~/build/postgresql-8.0.0beta2] gdb src/backend/postgres
[...]
(gdb) attach 17980
[...]
(gdb) where
#0  0x0000002a96181279 in semop () from /lib64/tls/libc.so.6
#1  0x0000000000511707 in PGSemaphoreLock (sema=0x2aa16de2f0, interruptOK=0 '\0') at pg_sema.c:418
#2  0x000000000053549a in LWLockAcquire (lockid=32741, mode=LW_SHARED) at lwlock.c:315
#3  0x0000000000528a3d in FlushBuffer (buf=0x2a9953ea40, reln=0x900640) at bufmgr.c:1004
#4  0x000000000052806d in BufferAlloc (reln=0x2768021, blockNum=3220402560, foundPtr=0x7fbff3728f "") at bufmgr.c:365
#5  0x0000000000527edc in ReadBufferInternal (reln=0x2aa24786b8, blockNum=20055, bufferLockHeld=0 '\0') at bufmgr.c:153
#6  0x000000000044a930 in _bt_getbuf (rel=0x2aa24786b8, blkno=3220402560, access=2) at nbtpage.c:492
#7  0x0000000000448c87 in _bt_split (rel=0x2aa24786b8, buf=673, firstright=89, newitemoff=164, newitemsz=40, newitem=0x911d40, newitemonleft=0 '\0', itup_off=0x7fbff374c6, itup_blkno=0x7fbff374c8) at nbtinsert.c:683
#8  0x00000000004486ba in _bt_insertonpg (rel=0x2aa24786b8, buf=673, stack=0x9924f0, keysz=1, scankey=0x911d90, btitem=0x911d40, afteritem=0, split_only_page=0 '\0') at nbtinsert.c:500
#9  0x00000000004481eb in _bt_doinsert (rel=0x2aa24786b8, btitem=0x911d40, index_is_unique=1 '\001', heapRel=0x2aa2477c38) at nbtinsert.c:141
#10 0x000000000044bb71 in btinsert (fcinfo=0x2768021) at nbtree.c:257
#11 0x00000000005a44aa in OidFunctionCall6 (functionId=41320481, arg1=183111222968, arg2=548681250960, arg3=548681250928, arg4=9515332, arg5=183111220280, arg6=18446744073709551612) at fmgr.c:1487
#12 0x00000000004476a5 in index_insert (indexRelation=0x2aa24786b8, datums=0x7fbff37890, nulls=0x7fbff37870 " î", heap_t_ctid=0x913144, heapRelation=0x2aa2477c38, check_uniqueness=1 '\001') at indexam.c:226
#13 0x00000000004cf3cf in ExecInsertIndexTuples (slot=0x0, tupleid=0x7fbff37180, estate=0x990450, is_vacuum=0 '\0') at execUtils.c:859
#14 0x000000000049c1d1 in CopyFrom (rel=0x2aa2477c38, attnumlist=0x990060, binary=0 '\0', oids=0 '\0', delim=0x63e4cd "\t", null_print=0x608614 "\\N", csv_mode=0 '\0', quote=0x0, escape=0x0, force_notnull_atts=0x911cb0) at copy.c:1958
#15 0x000000000049a34f in DoCopy (stmt=0x2768021) at copy.c:1043
#16 0x000000000053d7b9 in PortalRunUtility (portal=0x906b40, query=0x8fa5f0, dest=0x8fa910, completionTag=0x7fbff37f60 "") at pquery.c:839
#17 0x000000000053da41 in PortalRunMulti (portal=0x906b40, dest=0x8fa910, altdest=0x8fa910, completionTag=0x7fbff37f60 "") at pquery.c:902
#18 0x000000000053d230 in PortalRun (portal=0x906b40, count=9223372036854775807, dest=0x8fa910, altdest=0x8fa910, completionTag=0x7fbff37f60 "") at pquery.c:543
#19 0x0000000000539c59 in exec_simple_query (query_string=0x8fa700 "COPY feature_link from '/anniedev1/impseb/datastore/results/2004/12/29/22/38/bio:query:ncbi-blast:10562062.out_featurelink'") at postgres.c:924
#20 0x000000000053bf6d in PostgresMain (argc=9414400, argv=0x86f028, username=0x86eff0 "annieseb") at postgres.c:2970
#21 0x0000000000514db7 in BackendRun (port=0x89c650) at postmaster.c:2848
#22 0x0000000000514850 in BackendStartup (port=0x89c650) at postmaster.c:2470
#23 0x0000000000512fde in ServerLoop () at postmaster.c:1215
#24 0x0000000000512446 in PostmasterMain (argc=1, argv=0x804850) at postmaster.c:898
#25 0x00000000004e3206 in main (argc=1, argv=0x804850) at main.c:265

[fuchs:/anniedev1/impseb/work/annie] ps -ef | grep impseb2 | grep -v grep | sort
anniedb2 17976 21254  0 Dec29 ?  00:03:50 postgres: annieseb impseb2 10.44.1.20(33855) idle in transaction
anniedb2 17977 21254  0 Dec29 ?  00:01:38 postgres: annieseb impseb2 10.44.1.20(33856) idle in transaction
anniedb2 17979 21254  0 Dec29 ?  00:03:21 postgres: annieseb impseb2 10.44.1.20(33857) idle in transaction
anniedb2 17980 21254  0 Dec29 ?  00:02:26 postgres: annieseb impseb2 10.44.1.20(33858) COPY
anniedb2 17981 21254  0 Dec29 ?  00:00:53 postgres: annieseb impseb2 10.44.1.20(33859) idle in transaction
anniedb2 17982 21254  0 Dec29 ?  00:03:31 postgres: annieseb impseb2 10.44.1.20(33860) idle in transaction
anniedb2 17983 21254  0 Dec29 ?  00:03:34 postgres: annieseb impseb2 10.44.1.20(33861) COPY waiting
anniedb2 18084 21254  0 Dec29 ?  00:03:15 postgres: annieseb impseb2 10.44.1.20(33863) idle in transaction
anniedb2 23066 21254  0 17:45 ?  00:00:00 postgres: annieseb impseb2 10.44.1.21(45361) idle

[fuchs:/anniedev1/impseb/work/annie] bin/pg-dump-locks
pid 17979
pid 23130
        granted: AccessShareLock on table pg_locks (16837)
        granted: AccessShareLock on table pg_class (1259)
        granted: AccessShareLock on table pg_index (16390)
pid 17981
pid 17983
        waiting: ExclusiveLock on table key__idpk__flink (2124233471)
        granted: RowExclusiveLock on table ncbi_blast_result (2112117337)
        granted: AccessShareLock on table nmt_parameter_set (2112116466)
        granted: RowExclusiveLock on table ncbi_blast_round (2112085544)
        granted: AccessShareLock on table gpi_learning_set (2112116456)
        granted: RowExclusiveLock on table ncbi_blast_alignment (2112117357)
        granted: AccessShareLock on table gpi3_learning_set (2112116461)
        granted: RowExclusiveLock on table feature_link (2112116486)
        granted: AccessShareLock on table managed_file_group (2112116530)
        granted: AccessShareLock on table algorithm_index_file (2112116535)
        granted: AccessShareLock on table def_parser_type (2112116451)
        granted: AccessShareLock on table sequence_type (2112116441)
        granted: AccessShareLock on table file_type (2112116515)
        granted: AccessShareLock on table managed_file (2112116525)
        granted: AccessShareLock on table ptsone_function (2112116471)
        granted: AccessShareLock on table toppred_function (2112116476)
        granted: RowExclusiveLock on table ncbi_blast_subject (2112085549)
        granted: AccessShareLock on table set_type (2112116436)
        granted: AccessShareLock on table data_store (2112116520)
        granted: AccessShareLock on table sequence_alphabet (2112116431)
pid 17980
        granted: AccessShareLock on table data_store (2112116520)
        granted: RowExclusiveLock on table ncbi_blast_subject (2112085549)
        granted: AccessShareLock on table managed_file_group (2112116530)
        granted: RowExclusiveLock on table ncbi_blast_round (2112085544)
        granted: AccessShareLock on table ptsone_function (2112116471)
        granted: ExclusiveLock on table key__idpk__flink (2124233471)
        granted: AccessShareLock on table algorithm_index_file (2112116535)
        granted: AccessShareLock on table toppred_function (2112116476)
        granted: AccessShareLock on table def_parser_type (2112116451)
        granted: AccessShareLock on table file_type (2112116515)
        granted: RowExclusiveLock on table feature_link (2112116486)
        granted: AccessShareLock on table sequence_type (2112116441)
        granted: AccessShareLock on table managed_file (2112116525)
        granted: AccessShareLock on table sequence_alphabet (2112116431)
        granted: AccessShareLock on table gpi_learning_set (2112116456)
        granted: AccessShareLock on table set_type (2112116436)
        granted: RowExclusiveLock on table ncbi_blast_result (2112117337)
        granted: AccessShareLock on table gpi3_learning_set (2112116461)
        granted: RowExclusiveLock on table ncbi_blast_alignment (2112117357)
        granted: AccessShareLock on table nmt_parameter_set (2112116466)
pid 17982
pid 18084
        granted: AccessShareLock on table changeable_bio_set (2112116407)
        granted: AccessShareLock on table ojb_splice_list (2112117699)
        granted: AccessShareLock on table set_type (2112116436)
        granted: AccessShareLock on table sequence_alphabet (2112116431)
pid 17976
pid 17977

impseb2=# select * from pg_locks;
  relation  |  database  | transaction |  pid  |       mode       | granted
------------+------------+-------------+-------+------------------+---------
 2112117337 | 2112078897 |             | 17983 | RowExclusiveLock | t
            |            |    26991056 | 17979 | ExclusiveLock    | t
 2112116466 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112085544 | 2112078897 |             | 17983 | RowExclusiveLock | t
            |            |    26991864 | 17983 | ExclusiveLock    | t
 2112116520 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116407 | 2112078897 |             | 18084 | AccessShareLock  | t
 2112116456 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112085549 | 2112078897 |             | 17980 | RowExclusiveLock | t
 2112117357 | 2112078897 |             | 17983 | RowExclusiveLock | t
 2112116461 | 2112078897 |             | 17983 | AccessShareLock  | t
            |            |    26990698 | 17981 | ExclusiveLock    | t
 2112116530 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112085544 | 2112078897 |             | 17980 | RowExclusiveLock | t
            |            |    26991878 | 17980 | ExclusiveLock    | t
 2112116486 | 2112078897 |             | 17983 | RowExclusiveLock | t
 2112117699 | 2112078897 |             | 18084 | AccessShareLock  | t
 2112116530 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116535 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116471 | 2112078897 |             | 17980 | AccessShareLock  | t
 2124233471 | 2112078897 |             | 17980 | ExclusiveLock    | t
 2112116436 | 2112078897 |             | 18084 | AccessShareLock  | t
 2112116535 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116476 | 2112078897 |             | 17980 | AccessShareLock  | t
            |            |    26990811 | 17977 | ExclusiveLock    | t
 2112116451 | 2112078897 |             | 17980 | AccessShareLock  | t
      16837 | 2112078897 |             | 23060 | AccessShareLock  | t
 2112116451 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116441 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116515 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116525 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116515 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116471 | 2112078897 |             | 17983 | AccessShareLock  | t
            |            |    27016842 | 18084 | ExclusiveLock    | t
 2112116476 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116486 | 2112078897 |             | 17980 | RowExclusiveLock | t
 2112116441 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116525 | 2112078897 |             | 17980 | AccessShareLock  | t
            |            |    27018036 | 17976 | ExclusiveLock    | t
 2112116431 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116456 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112085549 | 2112078897 |             | 17983 | RowExclusiveLock | t
 2112116436 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116436 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112117337 | 2112078897 |             | 17980 | RowExclusiveLock | t
 2112116461 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112117357 | 2112078897 |             | 17980 | RowExclusiveLock | t
 2112116466 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116520 | 2112078897 |             | 17983 | AccessShareLock  | t
            |            |    26991357 | 17982 | ExclusiveLock    | t
            |            |    27018054 | 23060 | ExclusiveLock    | t
 2124233471 | 2112078897 |             | 17983 | ExclusiveLock    | f
 2112116431 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116431 | 2112078897 |             | 18084 | AccessShareLock  | t
(54 rows)

-- 
                              Do not feel safe. The poet remembers.
DI Michael Wildpaner          You can kill one, but another is born.
Ph.D. Student                 The words are written down, the deed, the date.
                                                      (Czeslaw Milosz)
Michael Wildpaner <mike@rainbow.studorg.tuwien.ac.at> writes:
> two concurrent clients try to 'COPY ... FROM ...' to the same table,
> "feature_link".

> The second one (pid 17983) is waiting for an ExclusiveLock on the table's
> primary key index, "key__idpk__flink".

You didn't show a stack trace for this one ...

			regards, tom lane
Michael Wildpaner <mike@rainbow.studorg.tuwien.ac.at> writes:
> two concurrent clients try to 'COPY ... FROM ...' to the same table,
> "feature_link".

> The second one (pid 17983) is waiting for an ExclusiveLock on the table's
> primary key index, "key__idpk__flink".

> The first one (pid 17980) is inserting into the index's btree, waiting
> for a buffer lock. This locking state seems to persist.

After staring at this for a little bit I have a theory.  The stack trace
for 17980 indicates that it is trying to flush a dirty buffer so that it
can take over the buffer for a new index page.  It's blocked trying to
acquire a shared buffer lock on that buffer, which means that someone else
must have an exclusive buffer lock, which the code is not expecting because
the buffer just came off the free list and therefore was not pinned by
anyone.

However ... FlushBuffer releases the BufMgrLock before trying to acquire
the per-buffer lock.  If 17980 lost its time slice during that interval,
it'd be possible for someone else to come in and re-pin the chosen victim
page and then lock the buffer before 17980 could.  Normally this wouldn't
be any big problem, but if the someone else later blocked on some lock held
by 17980, you'd have a deadlock.  I think that's exactly what we're seeing
here.  The victim buffer page must be the same one that 17983 is trying to
split; so it's got exclusive lock (and pin) on that page, and is now stuck
waiting for the lock that would give it the privilege to extend the index.

A possible fix for this is to reorder the operations in FlushBuffer so that
we share-lock the buffer before releasing BufMgrLock ... but I'm not sure
that doesn't introduce other deadlock risks.  It needs some thought.

If this is the correct explanation, the bug has been around for a good
while; but it's got to be a very low-probability scenario.  Congrats to
Michael for taking the time to dig into it when he saw it.

			regards, tom lane
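The cycle described above crosses two lock types -- 17980 holds the heavyweight
lock that grants the right to extend the index and waits on the victim buffer's
per-buffer LWLock, while 17983 holds that buffer exclusively and waits on the
extension lock -- so the regular deadlock detector, which only watches
heavyweight locks, never sees it.  The following standalone toy is NOT
PostgreSQL code: the names backend_17980, backend_17983, extension_lock and
buffer_lock are illustrative stand-ins, and try-locks are used so the demo
prints the circular wait instead of hanging.  It only models the two-lock
ordering from the report.

/* toy_deadlock.c -- a sketch; build with: cc -pthread toy_deadlock.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-ins, not PostgreSQL objects:
 *   extension_lock -- the heavyweight lock giving the right to extend the index
 *   buffer_lock    -- the per-buffer content lock on the victim page
 */
static pthread_mutex_t extension_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_rwlock_t buffer_lock = PTHREAD_RWLOCK_INITIALIZER;

/* models pid 17980: already owns the extension right, then tries to
 * share-lock the victim buffer inside FlushBuffer */
static void *backend_17980(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&extension_lock);    /* ExclusiveLock on key__idpk__flink */
    sleep(1);                               /* BufMgrLock dropped, time slice lost */
    if (pthread_rwlock_tryrdlock(&buffer_lock) != 0)
        printf("17980: blocked on share lock of the victim buffer\n");
    else
        pthread_rwlock_unlock(&buffer_lock);
    sleep(2);                               /* still holding the extension lock */
    pthread_mutex_unlock(&extension_lock);
    return NULL;
}

/* models pid 17983: re-pinned and exclusive-locked the same buffer for its
 * own page split, then asks for the extension right */
static void *backend_17983(void *arg)
{
    (void) arg;
    pthread_rwlock_wrlock(&buffer_lock);    /* exclusive lock on the page being split */
    sleep(2);
    if (pthread_mutex_trylock(&extension_lock) != 0)
        printf("17983: blocked on the extension lock held by 17980\n");
    else
        pthread_mutex_unlock(&extension_lock);
    pthread_rwlock_unlock(&buffer_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, backend_17980, NULL);
    pthread_create(&b, NULL, backend_17983, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;       /* both threads report blocking: the circular wait */
}

(In the real trace, 17980's PGSemaphoreLock call shows interruptOK=0, so its
LWLock wait cannot even be broken by a query cancel.)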
I wrote:
> A possible fix for this is to reorder the operations in FlushBuffer
> so that we share-lock the buffer before releasing BufMgrLock ... but
> I'm not sure that doesn't introduce other deadlock risks.

That's too simplistic, since at least one caller of FlushBuffer is trying
to write pages that may be in active use.  If the target page is already
exclusive-locked by another process then a deadlock between the per-buffer
lock and BufMgrLock is entirely possible.

I think that it would work for BufferAlloc to share-lock the victim buffer
before calling FlushBuffer; we'd have to add a bool parameter to
FlushBuffer telling it the lock was already acquired.  This is safe in this
particular path because a buffer that was just taken from the freelist
should not be exclusive-locked by anyone.  (It could possibly be
share-locked by the bgwriter, but that won't cause a deadlock.)  A process
that wanted exclusive lock would first have to pin the buffer, which can't
happen before we release the BufMgrLock.  By adding the parameter we'd
avoid changing the behavior for other callers.

Comments, better ideas?

BTW, it looks to me like this deadlock potential has existed at least
since 7.0.  I seem to recall one or two reports of unexplainable apparent
deadlocks, which perhaps are now explained.

			regards, tom lane
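A toy version of the proposed ordering, again not the actual bufmgr code
(bufmgr_lock, flusher and splitter are made-up names for illustration):
because the share lock on the victim is taken while the BufMgrLock stand-in
is still held, a backend that wants the page exclusively can no longer slip
into the gap; it simply waits out the write.

/* toy_fix.c -- a sketch; build with: cc -pthread toy_fix.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t bufmgr_lock = PTHREAD_MUTEX_INITIALIZER;   /* BufMgrLock stand-in */
static pthread_rwlock_t buffer_lock = PTHREAD_RWLOCK_INITIALIZER; /* victim's content lock */

/* models BufferAlloc with the fix: share-lock the victim, then drop BufMgrLock */
static void *flusher(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&bufmgr_lock);      /* victim comes off the freelist here */
    pthread_rwlock_rdlock(&buffer_lock);   /* cannot block: nobody has it pinned yet */
    pthread_mutex_unlock(&bufmgr_lock);
    printf("flusher: writing the victim page out\n");
    sleep(2);                              /* the write */
    pthread_rwlock_unlock(&buffer_lock);
    printf("flusher: write done, buffer free for reuse\n");
    return NULL;
}

/* models a backend that wants that page exclusively, e.g. to split it */
static void *splitter(void *arg)
{
    (void) arg;
    sleep(1);                              /* shows up while the write is in progress */
    pthread_mutex_lock(&bufmgr_lock);      /* pin the buffer */
    pthread_mutex_unlock(&bufmgr_lock);
    pthread_rwlock_wrlock(&buffer_lock);   /* just waits for the write to finish */
    printf("splitter: got the exclusive lock -- waited, but no deadlock\n");
    pthread_rwlock_unlock(&buffer_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, flusher, NULL);
    pthread_create(&b, NULL, splitter, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

The real change additionally needs the bool parameter so FlushBuffer knows
the lock is already held; the toy glosses over that detail.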
On Fri, 31 Dec 2004, Tom Lane wrote:

> Michael Wildpaner <mike@rainbow.studorg.tuwien.ac.at> writes:
> > two concurrent clients try to 'COPY ... FROM ...' to the same table,
> > "feature_link".
>
> > The second one (pid 17983) is waiting for an ExclusiveLock on the table's
> > primary key index, "key__idpk__flink".
>
> You didn't show a stack trace for this one ...

Here it is:

[fuchs:/people/mike/build/postgresql-8.0.0beta2] gdb src/backend/postgres
[...]
(gdb) attach 17983
[...]
(gdb) where
#0  0x0000002a96181279 in semop () from /lib64/tls/libc.so.6
#1  0x0000000000511707 in PGSemaphoreLock (sema=0x2aa16d8f90, interruptOK=1 '\001') at pg_sema.c:418
#2  0x0000000000533e76 in ProcSleep (lockMethodTable=0x274801d, lockmode=7, lock=0x2aa1773878, proclock=0x2aa18ee398) at proc.c:725
#3  0x0000000000532a7c in WaitOnLock (lockmethodid=32797, locallock=0x894180, owner=0x975130) at lock.c:1037
#4  0x000000000053246f in LockAcquire (lockmethodid=1, locktag=0x1, xid=2710496152, lockmode=7, dontWait=0 '\0') at lock.c:754
#5  0x0000000000531aa3 in LockPage (relation=0x274801d, blkno=3220402448, lockmode=7) at lmgr.c:267
#6  0x000000000044a98d in _bt_getbuf (rel=0x2aa2466168, blkno=3220402448, access=2) at nbtpage.c:490
#7  0x0000000000448c87 in _bt_split (rel=0x2aa2466168, buf=16355, firstright=89, newitemoff=119, newitemsz=40, newitem=0x985828, newitemonleft=0 '\0', itup_off=0x7fbff374c6, itup_blkno=0x7fbff374c8) at nbtinsert.c:683
#8  0x00000000004486ba in _bt_insertonpg (rel=0x2aa2466168, buf=16355, stack=0x9a2a70, keysz=1, scankey=0x984a60, btitem=0x985828, afteritem=0, split_only_page=0 '\0') at nbtinsert.c:500
#9  0x00000000004481eb in _bt_doinsert (rel=0x2aa2466168, btitem=0x985828, index_is_unique=1 '\001', heapRel=0x2aa24656e8) at nbtinsert.c:141
#10 0x000000000044bb71 in btinsert (fcinfo=0x274801d) at nbtree.c:257
#11 0x00000000005a44aa in OidFunctionCall6 (functionId=41189405, arg1=183111147880, arg2=548681250960, arg3=548681250928, arg4=10030788, arg5=183111145192, arg6=18446744073709551612) at fmgr.c:1487
#12 0x00000000004476a5 in index_insert (indexRelation=0x2aa2466168, datums=0x7fbff37890, nulls=0x7fbff37870 " î", heap_t_ctid=0x990ec4, heapRelation=0x2aa24656e8, check_uniqueness=1 '\001') at indexam.c:226
#13 0x00000000004cf3cf in ExecInsertIndexTuples (slot=0x0, tupleid=0x7fbff37110, estate=0x9828a0, is_vacuum=0 '\0') at execUtils.c:859
#14 0x000000000049c1d1 in CopyFrom (rel=0x2aa24656e8, attnumlist=0x9a25f0, binary=0 '\0', oids=0 '\0', delim=0x63e4cd "\t", null_print=0x608614 "\\N", csv_mode=0 '\0', quote=0x0, escape=0x0, force_notnull_atts=0x9849d0) at copy.c:1958
#15 0x000000000049a34f in DoCopy (stmt=0x274801d) at copy.c:1043
#16 0x000000000053d7b9 in PortalRunUtility (portal=0x906b70, query=0x8fa620, dest=0x8fa940, completionTag=0x7fbff37f60 "") at pquery.c:839
#17 0x000000000053da41 in PortalRunMulti (portal=0x906b70, dest=0x8fa940, altdest=0x8fa940, completionTag=0x7fbff37f60 "") at pquery.c:902
#18 0x000000000053d230 in PortalRun (portal=0x906b70, count=9223372036854775807, dest=0x8fa940, altdest=0x8fa940, completionTag=0x7fbff37f60 "") at pquery.c:543
#19 0x0000000000539c59 in exec_simple_query (query_string=0x8fa730 "COPY feature_link from '/anniedev1/impseb/datastore/results/2004/12/29/22/38/bio:query:ncbi-blast:10562060.out_featurelink'") at postgres.c:924
#20 0x000000000053bf6d in PostgresMain (argc=9414448, argv=0x86f028, username=0x86eff0 "annieseb") at postgres.c:2970
#21 0x0000000000514db7 in BackendRun (port=0x89c650) at postmaster.c:2848
#22 0x0000000000514850 in BackendStartup (port=0x89c650) at postmaster.c:2470
#23 0x0000000000512fde in ServerLoop () at postmaster.c:1215
#24 0x0000000000512446 in PostmasterMain (argc=1, argv=0x804850) at postmaster.c:898
#25 0x00000000004e3206 in main (argc=1, argv=0x804850) at main.c:265

Happy new year! (nearly 2 AM MET here ;)

Best wishes,
Mike

-- 
                              Do not feel safe. The poet remembers.
DI Michael Wildpaner          You can kill one, but another is born.
Ph.D. Student                 The words are written down, the deed, the date.
                                                      (Czeslaw Milosz)
On Sat, 2005-01-01 at 00:42, Tom Lane wrote:
> I wrote:
> > A possible fix for this is to reorder the operations in FlushBuffer
> > so that we share-lock the buffer before releasing BufMgrLock ... but
> > I'm not sure that doesn't introduce other deadlock risks.
>
> That's too simplistic, since at least one caller of FlushBuffer is
> trying to write pages that may be in active use.  If the target page
> is already exclusive-locked by another process then a deadlock between
> the per-buffer lock and BufMgrLock is entirely possible.
>
> I think that it would work for BufferAlloc to share-lock the victim
> buffer before calling FlushBuffer; we'd have to add a bool parameter to
> FlushBuffer telling it the lock was already acquired.  This is safe in
> this particular path because a buffer that was just taken from the
> freelist should not be exclusive-locked by anyone.  (It could possibly
> be share-locked by the bgwriter, but that won't cause a deadlock.)
> A process that wanted exclusive lock would first have to pin the buffer,
> which can't happen before we release the BufMgrLock.  By adding the
> parameter we'd avoid changing the behavior for other callers.
>
> Comments, better ideas?

I think the proposal sounds safe, though I worry about performance.

This situation occurs because an index block in cache was thought to be a
block replacement candidate, then found to be useful by another backend
which attempts to start using it before the flush occurs. ...which hints
at shared_buffers being set too low.

AFAICS we need not consider losing timeslices as the explanation for
concurrent activity. This situation is much more easily possible on
multi-CPU systems, especially when many single-core systems have
hyperthreaded virtual CPUs and the default shared_buffers of 8MB has not
been changed.

The effect of the proposed change would be to favour the removal of the
block from the cache rather than favouring the reuse of it. I think the
favour in this situation should go towards the reuse, forcing the backend
searching for a free buffer to search again. There are already two other
cases in BufferAlloc which cater for when two backends want the same
block, which hints towards the importance of favouring the reuse: this is
just one more case.

My suggestion: LockBuffer in FlushBuffer should return as unsuccessful if
there is an LW_EXCLUSIVE lock already held, causing another iteration of
the do-while loop in BufferAlloc. Doing that would be faster for both
backends, since neither has to wait for I/O, as well as avoiding the
deadlock situation that has been uncovered.

If the backend searches for a block replacement candidate and fails more
than once, we should consider warning "shared buffers too low...".

Later, I'd like to consider changing the way StrategyGetBuffer works,
since it always stops at the first unpinned buffer and doesn't consider
whether it is dirty or not in its decision to use it. It seems like it
would be worth reading a *few* buffers further on to see if a more
appropriate choice could have been made, now that we have the bgwriter to
eventually clean the buffers.

-- 
Best Regards, Simon Riggs
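A sketch of the retry idea above as a toy, not the actual BufferAlloc loop:
pick_flushable_victim, NBUFFERS and the buffer array are invented for
illustration, and in the server this would presumably be a conditional
LWLock acquire rather than pthread try-locks.

/* toy_retry.c -- a sketch; build with: cc -pthread toy_retry.c */
#include <pthread.h>
#include <stdio.h>

#define NBUFFERS 4

static pthread_rwlock_t buffer_locks[NBUFFERS];

/* models the retry: try-lock the candidate's content lock, and if someone
 * already holds it exclusively, go around the selection loop again */
static int pick_flushable_victim(void)
{
    for (;;)
    {
        for (int i = 0; i < NBUFFERS; i++)
        {
            if (pthread_rwlock_tryrdlock(&buffer_locks[i]) == 0)
                return i;          /* caller flushes this one, then unlocks */
            /* exclusively locked: skip it instead of waiting (and deadlocking) */
        }
    }
}

int main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        pthread_rwlock_init(&buffer_locks[i], NULL);

    pthread_rwlock_wrlock(&buffer_locks[0]);   /* pretend another backend is splitting buffer 0 */

    int victim = pick_flushable_victim();
    printf("victim chosen: buffer %d (buffer 0 was skipped)\n", victim);
    pthread_rwlock_unlock(&buffer_locks[victim]);
    pthread_rwlock_unlock(&buffer_locks[0]);
    return 0;
}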
Simon Riggs <simon@2ndquadrant.com> writes:
> I think the proposal sounds safe, though I worry about performance.

There is no performance loss; we are just changing the order in which we
acquire two locks.  If there were some risk of blocking for a measurable
time while holding the BufMgrLock, then that would be bad for concurrent
performance --- but in fact the per-buffer lock is guaranteed free at that
point.

I don't think there's any value in trying to avoid the I/O.  This is a
corner case of such rarity that it's only been seen perhaps half a dozen
times in the history of the project.  "Optimizing" it is not the proper
concern.  The case where the I/O is wasted because someone re-pins the
buffer during the write is far more likely, simply because of the relative
widths of the windows involved; and we can't avoid that.

> My suggestion: LockBuffer in FlushBuffer should return as unsuccessful
> if there is an LW_EXCLUSIVE lock already held, causing another iteration
> of the do-while loop in BufferAlloc.

This would break the other callers of FlushBuffer.  We could redefine
FlushBuffer as taking either a conditional or unconditional lock, but I
think that's a weirder API than a flag to say the lock is already taken.

Bottom line is that I don't think it's useful to consider this as a
performance issue.  What we need is correctness with minimum extra
complication of the logic.

			regards, tom lane
On Sat, 2005-01-01 at 18:55, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I think the proposal sounds safe, though I worry about performance.
>
> There is no performance loss; we are just changing the order in which
> we acquire two locks.  If there were some risk of blocking for a
> measurable time while holding the BufMgrLock, then that would be bad for
> concurrent performance --- but in fact the per-buffer lock is guaranteed
> free at that point.
>
> I don't think there's any value in trying to avoid the I/O.  This is a
> corner case of such rarity that it's only been seen perhaps half a dozen
> times in the history of the project.  "Optimizing" it is not the proper
> concern.  The case where the I/O is wasted because someone re-pins the
> buffer during the write is far more likely, simply because of the
> relative widths of the windows involved; and we can't avoid that.

The deadlock is incredibly rare, I agree. That was not my point.

The situation where another backend requests the block immediately before
the I/O is fairly common AFAICS, especially since StrategyGetBuffer
ignores the BM_DIRTY flag in selecting victims. ISTM making the code
deadlock-safe will affect cases where there never would have been a
deadlock, slowing both backends down while they wait for the I/O to
complete.

> Bottom line is that I don't think it's useful to consider this as a
> performance issue.  What we need is correctness with minimum extra
> complication of the logic.

I'll step aside and look more closely for 8.1.

-- 
Best Regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
> The situation where another backend requests the block immediately
> before the I/O is fairly common AFAICS, especially since
> StrategyGetBuffer ignores the BM_DIRTY flag in selecting victims.

How do you figure that?  StrategyGetBuffer won't return the same buffer
again (because dirty or not, it'll be pinned by the time anyone else gets
to run StrategyGetBuffer).  The case we are interested in is where someone
suddenly wants the original page again --- that is, a page that was just
about to fall off the back end of the freelist is wanted again.  I don't
see that that case is common, especially not with a reasonably large
shared_buffers setting, and most especially not when the bgwriter is doing
its job and keeping the back end of the freelist clean.

> ISTM making the code deadlock-safe will affect cases where there never
> would have been a deadlock, slowing both backends down while they wait
> for the I/O to complete.

The other backend will have to wait for the I/O to complete before he can
gain an exclusive lock on the page ... but so what?  If he'd come along a
microsecond later, he'd have had to wait, too.  Basically we are
eliminating a very narrow window by causing it to behave the same as what
happens in the larger window where the I/O is occurring.

(BTW, "I/O" in this case actually just represents transferring the data to
kernel buffers, so the amount of delay involved is likely not to be all
that large...)

			regards, tom lane
I wrote:
> I think that it would work for BufferAlloc to share-lock the victim
> buffer before calling FlushBuffer; we'd have to add a bool parameter to
> FlushBuffer telling it the lock was already acquired.

I've applied a patch for this.

> BTW, it looks to me like this deadlock potential has existed at least
> since 7.0.  I seem to recall one or two reports of unexplainable
> apparent deadlocks, which perhaps are now explained.

On closer investigation the deadlock does not seem to exist in 7.4 and
before, because BufferReplace didn't acquire the buffer sharelock.
(There is a comment in the 7.4 code claiming that we didn't need to,
but I'm unconvinced that it's correct...)

			regards, tom lane
On Mon, 2005-01-03 at 17:14, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The situation where another backend requests the block immediately
> > before the I/O is fairly common AFAICS, especially since
> > StrategyGetBuffer ignores the BM_DIRTY flag in selecting victims.
>
> How do you figure that?  StrategyGetBuffer won't return the same buffer
> again (because dirty or not, it'll be pinned by the time anyone else
> gets to run StrategyGetBuffer).  The case we are interested in is where
> someone suddenly wants the original page again --- that is, a page that
> was just about to fall off the back end of the freelist is wanted again.
> I don't see that that case is common, especially not with a reasonably
> large shared_buffers setting, and most especially not when the bgwriter
> is doing its job and keeping the back end of the freelist clean.

Yes, what I was effectively arguing for was to tune for the case where
shared_buffers is still at the default... which is of course fairly
pointless, since the way to tune is just to increase shared_buffers.

...Fully agree with your original suggestion now.

-- 
Best Regards, Simon Riggs