Thread: 'COPY ... FROM' inserts to btree, blocks on buffer writeout

'COPY ... FROM' inserts to btree, blocks on buffer writeout

From
Michael Wildpaner
Date:
Hi,

two concurrent clients try to 'COPY ... FROM ...' to the same table,
"feature_link".

The second one (pid 17983) is waiting for an ExclusiveLock on the table's
primary key index, "key__idpk__flink".

The first one (pid 17980) is inserting into the index's btree, waiting
for a buffer lock. This locking state seems to persist.

Do you have any idea what could cause this scenario and what kind of
debugging information I could extract (the processes are still running) to
find out who is currently holding the buffer lock in question?

Thanks for your help,
best wishes, Mike


[fuchs:~/build/postgresql-8.0.0beta2] vi +1004 ./src/backend/storage/buffer/bufmgr.c
  999    /*
 1000     * Protect buffer content against concurrent update.  (Note that
 1001     * hint-bit updates can still occur while the write is in progress,
 1002     * but we assume that that will not invalidate the data written.)
 1003     */
 1004    LockBuffer(buffer, BUFFER_LOCK_SHARE);
 1005


[fuchs:~/build/postgresql-8.0.0beta2] gdb src/backend/postgres
[...]
(gdb) attach 17980
[...]
(gdb) where
#0  0x0000002a96181279 in semop () from /lib64/tls/libc.so.6
#1  0x0000000000511707 in PGSemaphoreLock (sema=0x2aa16de2f0, interruptOK=0 '\0') at pg_sema.c:418
#2  0x000000000053549a in LWLockAcquire (lockid=32741, mode=LW_SHARED) at lwlock.c:315
#3  0x0000000000528a3d in FlushBuffer (buf=0x2a9953ea40, reln=0x900640) at bufmgr.c:1004
#4  0x000000000052806d in BufferAlloc (reln=0x2768021, blockNum=3220402560, foundPtr=0x7fbff3728f "") at bufmgr.c:365
#5  0x0000000000527edc in ReadBufferInternal (reln=0x2aa24786b8, blockNum=20055, bufferLockHeld=0 '\0') at bufmgr.c:153
#6  0x000000000044a930 in _bt_getbuf (rel=0x2aa24786b8, blkno=3220402560, access=2) at nbtpage.c:492
#7  0x0000000000448c87 in _bt_split (rel=0x2aa24786b8, buf=673, firstright=89, newitemoff=164, newitemsz=40, newitem=0x911d40, newitemonleft=0 '\0', itup_off=0x7fbff374c6, itup_blkno=0x7fbff374c8) at nbtinsert.c:683
#8  0x00000000004486ba in _bt_insertonpg (rel=0x2aa24786b8, buf=673, stack=0x9924f0, keysz=1, scankey=0x911d90, btitem=0x911d40, afteritem=0, split_only_page=0 '\0') at nbtinsert.c:500
#9  0x00000000004481eb in _bt_doinsert (rel=0x2aa24786b8, btitem=0x911d40, index_is_unique=1 '\001', heapRel=0x2aa2477c38) at nbtinsert.c:141
#10 0x000000000044bb71 in btinsert (fcinfo=0x2768021) at nbtree.c:257
#11 0x00000000005a44aa in OidFunctionCall6 (functionId=41320481, arg1=183111222968, arg2=548681250960, arg3=548681250928, arg4=9515332, arg5=183111220280, arg6=18446744073709551612) at fmgr.c:1487
#12 0x00000000004476a5 in index_insert (indexRelation=0x2aa24786b8, datums=0x7fbff37890, nulls=0x7fbff37870 " î", heap_t_ctid=0x913144, heapRelation=0x2aa2477c38, check_uniqueness=1 '\001') at indexam.c:226
#13 0x00000000004cf3cf in ExecInsertIndexTuples (slot=0x0, tupleid=0x7fbff37180, estate=0x990450, is_vacuum=0 '\0') at execUtils.c:859
#14 0x000000000049c1d1 in CopyFrom (rel=0x2aa2477c38, attnumlist=0x990060, binary=0 '\0', oids=0 '\0', delim=0x63e4cd "\t", null_print=0x608614 "\\N", csv_mode=0 '\0', quote=0x0, escape=0x0, force_notnull_atts=0x911cb0) at copy.c:1958
#15 0x000000000049a34f in DoCopy (stmt=0x2768021) at copy.c:1043
#16 0x000000000053d7b9 in PortalRunUtility (portal=0x906b40, query=0x8fa5f0, dest=0x8fa910, completionTag=0x7fbff37f60 "") at pquery.c:839
#17 0x000000000053da41 in PortalRunMulti (portal=0x906b40, dest=0x8fa910, altdest=0x8fa910, completionTag=0x7fbff37f60 "") at pquery.c:902
#18 0x000000000053d230 in PortalRun (portal=0x906b40, count=9223372036854775807, dest=0x8fa910, altdest=0x8fa910, completionTag=0x7fbff37f60 "") at pquery.c:543
#19 0x0000000000539c59 in exec_simple_query (query_string=0x8fa700 "COPY feature_link from '/anniedev1/impseb/datastore/results/2004/12/29/22/38/bio:query:ncbi-blast:10562062.out_featurelink'") at postgres.c:924
#20 0x000000000053bf6d in PostgresMain (argc=9414400, argv=0x86f028, username=0x86eff0 "annieseb") at postgres.c:2970
#21 0x0000000000514db7 in BackendRun (port=0x89c650) at postmaster.c:2848
#22 0x0000000000514850 in BackendStartup (port=0x89c650) at postmaster.c:2470
#23 0x0000000000512fde in ServerLoop () at postmaster.c:1215
#24 0x0000000000512446 in PostmasterMain (argc=1, argv=0x804850) at postmaster.c:898
#25 0x00000000004e3206 in main (argc=1, argv=0x804850) at main.c:265


[fuchs:/anniedev1/impseb/work/annie] ps -ef | grep impseb2 | grep -v grep | sort
anniedb2 17976 21254  0 Dec29 ?        00:03:50 postgres: annieseb impseb2 10.44.1.20(33855) idle in transaction
anniedb2 17977 21254  0 Dec29 ?        00:01:38 postgres: annieseb impseb2 10.44.1.20(33856) idle in transaction
anniedb2 17979 21254  0 Dec29 ?        00:03:21 postgres: annieseb impseb2 10.44.1.20(33857) idle in transaction
anniedb2 17980 21254  0 Dec29 ?        00:02:26 postgres: annieseb impseb2 10.44.1.20(33858) COPY
anniedb2 17981 21254  0 Dec29 ?        00:00:53 postgres: annieseb impseb2 10.44.1.20(33859) idle in transaction
anniedb2 17982 21254  0 Dec29 ?        00:03:31 postgres: annieseb impseb2 10.44.1.20(33860) idle in transaction
anniedb2 17983 21254  0 Dec29 ?        00:03:34 postgres: annieseb impseb2 10.44.1.20(33861) COPY waiting
anniedb2 18084 21254  0 Dec29 ?        00:03:15 postgres: annieseb impseb2 10.44.1.20(33863) idle in transaction
anniedb2 23066 21254  0 17:45 ?        00:00:00 postgres: annieseb impseb2 10.44.1.21(45361) idle

[fuchs:/anniedev1/impseb/work/annie] bin/pg-dump-locks
pid 17979
pid 23130
   granted: AccessShareLock on table pg_locks (16837)
   granted: AccessShareLock on table pg_class (1259)
   granted: AccessShareLock on table pg_index (16390)
pid 17981
pid 17983
   waiting: ExclusiveLock on table key__idpk__flink (2124233471)
   granted: RowExclusiveLock on table ncbi_blast_result (2112117337)
   granted: AccessShareLock on table nmt_parameter_set (2112116466)
   granted: RowExclusiveLock on table ncbi_blast_round (2112085544)
   granted: AccessShareLock on table gpi_learning_set (2112116456)
   granted: RowExclusiveLock on table ncbi_blast_alignment (2112117357)
   granted: AccessShareLock on table gpi3_learning_set (2112116461)
   granted: RowExclusiveLock on table feature_link (2112116486)
   granted: AccessShareLock on table managed_file_group (2112116530)
   granted: AccessShareLock on table algorithm_index_file (2112116535)
   granted: AccessShareLock on table def_parser_type (2112116451)
   granted: AccessShareLock on table sequence_type (2112116441)
   granted: AccessShareLock on table file_type (2112116515)
   granted: AccessShareLock on table managed_file (2112116525)
   granted: AccessShareLock on table ptsone_function (2112116471)
   granted: AccessShareLock on table toppred_function (2112116476)
   granted: RowExclusiveLock on table ncbi_blast_subject (2112085549)
   granted: AccessShareLock on table set_type (2112116436)
   granted: AccessShareLock on table data_store (2112116520)
   granted: AccessShareLock on table sequence_alphabet (2112116431)
pid 17980
   granted: AccessShareLock on table data_store (2112116520)
   granted: RowExclusiveLock on table ncbi_blast_subject (2112085549)
   granted: AccessShareLock on table managed_file_group (2112116530)
   granted: RowExclusiveLock on table ncbi_blast_round (2112085544)
   granted: AccessShareLock on table ptsone_function (2112116471)
   granted: ExclusiveLock on table key__idpk__flink (2124233471)
   granted: AccessShareLock on table algorithm_index_file (2112116535)
   granted: AccessShareLock on table toppred_function (2112116476)
   granted: AccessShareLock on table def_parser_type (2112116451)
   granted: AccessShareLock on table file_type (2112116515)
   granted: RowExclusiveLock on table feature_link (2112116486)
   granted: AccessShareLock on table sequence_type (2112116441)
   granted: AccessShareLock on table managed_file (2112116525)
   granted: AccessShareLock on table sequence_alphabet (2112116431)
   granted: AccessShareLock on table gpi_learning_set (2112116456)
   granted: AccessShareLock on table set_type (2112116436)
   granted: RowExclusiveLock on table ncbi_blast_result (2112117337)
   granted: AccessShareLock on table gpi3_learning_set (2112116461)
   granted: RowExclusiveLock on table ncbi_blast_alignment (2112117357)
   granted: AccessShareLock on table nmt_parameter_set (2112116466)
pid 17982
pid 18084
   granted: AccessShareLock on table changeable_bio_set (2112116407)
   granted: AccessShareLock on table ojb_splice_list (2112117699)
   granted: AccessShareLock on table set_type (2112116436)
   granted: AccessShareLock on table sequence_alphabet (2112116431)
pid 17976
pid 17977

impseb2=# select * from pg_locks;
  relation  |  database  | transaction |  pid  |       mode       | granted
------------+------------+-------------+-------+------------------+---------
 2112117337 | 2112078897 |             | 17983 | RowExclusiveLock | t
            |            |    26991056 | 17979 | ExclusiveLock    | t
 2112116466 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112085544 | 2112078897 |             | 17983 | RowExclusiveLock | t
            |            |    26991864 | 17983 | ExclusiveLock    | t
 2112116520 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116407 | 2112078897 |             | 18084 | AccessShareLock  | t
 2112116456 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112085549 | 2112078897 |             | 17980 | RowExclusiveLock | t
 2112117357 | 2112078897 |             | 17983 | RowExclusiveLock | t
 2112116461 | 2112078897 |             | 17983 | AccessShareLock  | t
            |            |    26990698 | 17981 | ExclusiveLock    | t
 2112116530 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112085544 | 2112078897 |             | 17980 | RowExclusiveLock | t
            |            |    26991878 | 17980 | ExclusiveLock    | t
 2112116486 | 2112078897 |             | 17983 | RowExclusiveLock | t
 2112117699 | 2112078897 |             | 18084 | AccessShareLock  | t
 2112116530 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116535 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116471 | 2112078897 |             | 17980 | AccessShareLock  | t
 2124233471 | 2112078897 |             | 17980 | ExclusiveLock    | t
 2112116436 | 2112078897 |             | 18084 | AccessShareLock  | t
 2112116535 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116476 | 2112078897 |             | 17980 | AccessShareLock  | t
            |            |    26990811 | 17977 | ExclusiveLock    | t
 2112116451 | 2112078897 |             | 17980 | AccessShareLock  | t
      16837 | 2112078897 |             | 23060 | AccessShareLock  | t
 2112116451 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116441 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116515 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116525 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116515 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116471 | 2112078897 |             | 17983 | AccessShareLock  | t
            |            |    27016842 | 18084 | ExclusiveLock    | t
 2112116476 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116486 | 2112078897 |             | 17980 | RowExclusiveLock | t
 2112116441 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116525 | 2112078897 |             | 17980 | AccessShareLock  | t
            |            |    27018036 | 17976 | ExclusiveLock    | t
 2112116431 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116456 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112085549 | 2112078897 |             | 17983 | RowExclusiveLock | t
 2112116436 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116436 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112117337 | 2112078897 |             | 17980 | RowExclusiveLock | t
 2112116461 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112117357 | 2112078897 |             | 17980 | RowExclusiveLock | t
 2112116466 | 2112078897 |             | 17980 | AccessShareLock  | t
 2112116520 | 2112078897 |             | 17983 | AccessShareLock  | t
            |            |    26991357 | 17982 | ExclusiveLock    | t
            |            |    27018054 | 23060 | ExclusiveLock    | t
 2124233471 | 2112078897 |             | 17983 | ExclusiveLock    | f
 2112116431 | 2112078897 |             | 17983 | AccessShareLock  | t
 2112116431 | 2112078897 |             | 18084 | AccessShareLock  | t
(54 rows)

--
Do not feel safe. The poet remembers.           DI Michael Wildpaner
You can kill one, but another is born.                 Ph.D. Student
The words are written down,
the deed, the date.  (Czeslaw Milosz)


Re: 'COPY ... FROM' inserts to btree, blocks on buffer writeout

From
Tom Lane
Date:
Michael Wildpaner <mike@rainbow.studorg.tuwien.ac.at> writes:
> two concurrent clients try to 'COPY ... FROM ...' to the same table,
> "feature_link".

> The second one (pid 17983) is waiting for an ExclusiveLock on the table's
> primary key index, "key__idpk__flink".

You didn't show a stack trace for this one ...
        regards, tom lane


Re: 'COPY ... FROM' inserts to btree, blocks on buffer writeout

From
Tom Lane
Date:
Michael Wildpaner <mike@rainbow.studorg.tuwien.ac.at> writes:
> two concurrent clients try to 'COPY ... FROM ...' to the same table,
> "feature_link".

> The second one (pid 17983) is waiting for an ExclusiveLock on the table's
> primary key index, "key__idpk__flink".

> The first one (pid 17980) is inserting into the index's btree, waiting
> for a buffer lock. This locking state seems to persist.

After staring at this for a little bit I have a theory.  The stack trace
for 17980 indicates that it is trying to flush a dirty buffer so that it
can take over the buffer for a new index page.  It's blocked trying to
acquire a shared buffer lock on that buffer, which means that someone
else must have an exclusive buffer lock, which the code is not expecting
because the buffer just came off the free list and therefore was not
pinned by anyone.

However ... FlushBuffer releases the BufMgrLock before trying to acquire
the per-buffer lock.  If 17980 lost its time slice during that interval,
it'd be possible for someone else to come in and re-pin the chosen
victim page and then lock the buffer before 17980 could.

Normally this wouldn't be any big problem, but if the someone else later
blocked on some lock held by 17980, you'd have a deadlock.  I think
that's exactly what we're seeing here.  The victim buffer page must be
the same one that 17983 is trying to split; so it's got exclusive lock
(and pin) on that page, and is now stuck waiting for the lock that would
give it the privilege to extend the index.

A possible fix for this is to reorder the operations in FlushBuffer
so that we share-lock the buffer before releasing BufMgrLock ... but
I'm not sure that doesn't introduce other deadlock risks.  It needs
some thought.

If this is the correct explanation, the bug has been around for a good
while; but it's got to be a very low-probability scenario.  Congrats
to Michael for taking the time to dig into it when he saw it.
        regards, tom lane


Re: 'COPY ... FROM' inserts to btree, blocks on buffer writeout

From
Tom Lane
Date:
I wrote:
> A possible fix for this is to reorder the operations in FlushBuffer
> so that we share-lock the buffer before releasing BufMgrLock ... but
> I'm not sure that doesn't introduce other deadlock risks.

That's too simplistic, since at least one caller of FlushBuffer is
trying to write pages that may be in active use.  If the target page
is already exclusive-locked by another process then a deadlock between
the per-buffer lock and BufMgrLock is entirely possible.

I think that it would work for BufferAlloc to share-lock the victim
buffer before calling FlushBuffer; we'd have to add a bool parameter to
FlushBuffer telling it the lock was already acquired.  This is safe in
this particular path because a buffer that was just taken from the
freelist should not be exclusive-locked by anyone.  (It could possibly
be share-locked by the bgwriter, but that won't cause a deadlock.)
A process that wanted exclusive lock would first have to pin the buffer,
which can't happen before we release the BufMgrLock.  By adding the
parameter we'd avoid changing the behavior for other callers.

Comments, better ideas?

BTW, it looks to me like this deadlock potential has existed at least
since 7.0.  I seem to recall one or two reports of unexplainable
apparent deadlocks, which perhaps are now explained.
        regards, tom lane


Re: 'COPY ... FROM' inserts to btree, blocks on buffer

From
Michael Wildpaner
Date:
On Fri, 31 Dec 2004, Tom Lane wrote:
> Michael Wildpaner <mike@rainbow.studorg.tuwien.ac.at> writes:
> > two concurrent clients try to 'COPY ... FROM ...' to the same table,
> > "feature_link".
>
> > The second one (pid 17983) is waiting for an ExclusiveLock on the table's
> > primary key index, "key__idpk__flink".
>
> You didn't show a stack trace for this one ...

Here it is:

[fuchs:/people/mike/build/postgresql-8.0.0beta2] gdb src/backend/postgres
[...]
(gdb) attach 17983
[...]
(gdb) where
#0  0x0000002a96181279 in semop () from /lib64/tls/libc.so.6
#1  0x0000000000511707 in PGSemaphoreLock (sema=0x2aa16d8f90, interruptOK=1 '\001') at pg_sema.c:418
#2  0x0000000000533e76 in ProcSleep (lockMethodTable=0x274801d, lockmode=7, lock=0x2aa1773878, proclock=0x2aa18ee398) at proc.c:725
#3  0x0000000000532a7c in WaitOnLock (lockmethodid=32797, locallock=0x894180, owner=0x975130) at lock.c:1037
#4  0x000000000053246f in LockAcquire (lockmethodid=1, locktag=0x1, xid=2710496152, lockmode=7, dontWait=0 '\0') at lock.c:754
#5  0x0000000000531aa3 in LockPage (relation=0x274801d, blkno=3220402448, lockmode=7) at lmgr.c:267
#6  0x000000000044a98d in _bt_getbuf (rel=0x2aa2466168, blkno=3220402448, access=2) at nbtpage.c:490
#7  0x0000000000448c87 in _bt_split (rel=0x2aa2466168, buf=16355, firstright=89, newitemoff=119, newitemsz=40, newitem=0x985828, newitemonleft=0 '\0', itup_off=0x7fbff374c6, itup_blkno=0x7fbff374c8) at nbtinsert.c:683
#8  0x00000000004486ba in _bt_insertonpg (rel=0x2aa2466168, buf=16355, stack=0x9a2a70, keysz=1, scankey=0x984a60, btitem=0x985828, afteritem=0, split_only_page=0 '\0') at nbtinsert.c:500
#9  0x00000000004481eb in _bt_doinsert (rel=0x2aa2466168, btitem=0x985828, index_is_unique=1 '\001', heapRel=0x2aa24656e8) at nbtinsert.c:141
#10 0x000000000044bb71 in btinsert (fcinfo=0x274801d) at nbtree.c:257
#11 0x00000000005a44aa in OidFunctionCall6 (functionId=41189405, arg1=183111147880, arg2=548681250960, arg3=548681250928, arg4=10030788, arg5=183111145192, arg6=18446744073709551612) at fmgr.c:1487
#12 0x00000000004476a5 in index_insert (indexRelation=0x2aa2466168, datums=0x7fbff37890, nulls=0x7fbff37870 " î", heap_t_ctid=0x990ec4, heapRelation=0x2aa24656e8, check_uniqueness=1 '\001') at indexam.c:226
#13 0x00000000004cf3cf in ExecInsertIndexTuples (slot=0x0, tupleid=0x7fbff37110, estate=0x9828a0, is_vacuum=0 '\0') at execUtils.c:859
#14 0x000000000049c1d1 in CopyFrom (rel=0x2aa24656e8, attnumlist=0x9a25f0, binary=0 '\0', oids=0 '\0', delim=0x63e4cd "\t", null_print=0x608614 "\\N", csv_mode=0 '\0', quote=0x0, escape=0x0, force_notnull_atts=0x9849d0) at copy.c:1958
#15 0x000000000049a34f in DoCopy (stmt=0x274801d) at copy.c:1043
#16 0x000000000053d7b9 in PortalRunUtility (portal=0x906b70, query=0x8fa620, dest=0x8fa940, completionTag=0x7fbff37f60 "") at pquery.c:839
#17 0x000000000053da41 in PortalRunMulti (portal=0x906b70, dest=0x8fa940, altdest=0x8fa940, completionTag=0x7fbff37f60 "") at pquery.c:902
#18 0x000000000053d230 in PortalRun (portal=0x906b70, count=9223372036854775807, dest=0x8fa940, altdest=0x8fa940, completionTag=0x7fbff37f60 "") at pquery.c:543
#19 0x0000000000539c59 in exec_simple_query (query_string=0x8fa730 "COPY feature_link from '/anniedev1/impseb/datastore/results/2004/12/29/22/38/bio:query:ncbi-blast:10562060.out_featurelink'") at postgres.c:924
#20 0x000000000053bf6d in PostgresMain (argc=9414448, argv=0x86f028, username=0x86eff0 "annieseb") at postgres.c:2970
#21 0x0000000000514db7 in BackendRun (port=0x89c650) at postmaster.c:2848
#22 0x0000000000514850 in BackendStartup (port=0x89c650) at postmaster.c:2470
#23 0x0000000000512fde in ServerLoop () at postmaster.c:1215
#24 0x0000000000512446 in PostmasterMain (argc=1, argv=0x804850) at postmaster.c:898
#25 0x00000000004e3206 in main (argc=1, argv=0x804850) at main.c:265

Happy new year! (nearly 2 AM MET here ;)
Best wishes, Mike

--
Do not feel safe. The poet remembers.           DI Michael Wildpaner
You can kill one, but another is born.                 Ph.D. Student
The words are written down,
the deed, the date.  (Czeslaw Milosz)


Re: 'COPY ... FROM' inserts to btree, blocks on buffer

From
Simon Riggs
Date:
On Sat, 2005-01-01 at 00:42, Tom Lane wrote:
> I wrote:
> > A possible fix for this is to reorder the operations in FlushBuffer
> > so that we share-lock the buffer before releasing BufMgrLock ... but
> > I'm not sure that doesn't introduce other deadlock risks.
> 
> That's too simplistic, since at least one caller of FlushBuffer is
> trying to write pages that may be in active use.  If the target page
> is already exclusive-locked by another process then a deadlock between
> the per-buffer lock and BufMgrLock is entirely possible.
> 
> I think that it would work for BufferAlloc to share-lock the victim
> buffer before calling FlushBuffer; we'd have to add a bool parameter to
> FlushBuffer telling it the lock was already acquired.  This is safe in
> this particular path because a buffer that was just taken from the
> freelist should not be exclusive-locked by anyone.  (It could possibly
> be share-locked by the bgwriter, but that won't cause a deadlock.)
> A process that wanted exclusive lock would first have to pin the buffer,
> which can't happen before we release the BufMgrLock.  By adding the
> parameter we'd avoid changing the behavior for other callers.
> 
> Comments, better ideas?

I think the proposal sounds safe, though I worry about performance.

This situation occurs because an index block in cache was thought to be
a block replacement candidate, then found to be useful by another
backend which attempts to start using it before the flush occurs.
...which hints at the shared_buffers being set too low.

AFAICS we need not consider losing timeslices as the explanation for
concurrent activity. This situation is much more easily possible with
multi-CPU systems, especially when many single core systems have
hyperthreaded virtual CPUs and the default shared_buffers of 8Mb has not
been changed.

The effect of the proposed change would be to favour removing the block
from the cache rather than reusing it. I think the favour in this
situation should go towards reuse, forcing the backend searching for a
free buffer to search again. There are already 2 other cases in
BufferAlloc which cater for two backends wanting the same block, which
hints at the importance of favouring reuse: this is just one more case.

My suggestion: LockBuffer in FlushBuffer should return as unsuccessful
if there is an LW_EXCLUSIVE lock already held, causing another iteration
of the do while loop in BufferAlloc. Doing that would be faster for both
backends since neither has to wait for I/O, as well as avoiding the
deadlock situation that has been uncovered.

If the backend searches for a block replacement candidate and fails more
than once, we should consider warning "shared buffers too low..."

Later, I'd like to consider changing the way StrategyGetBuffer works,
since it always stops at the first unpinned buffer and doesn't consider
whether it is dirty or not in its decision to use it. It seems like it
would be worth reading a *few* buffers further on to see if a more
appropriate choice could have been made now that we have bgwriter to
eventually clean the buffers.

-- 
Best Regards, Simon Riggs



Re: 'COPY ... FROM' inserts to btree, blocks on buffer writeout

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> I think the proposal sounds safe, though I worry about performance.

There is no performance loss; we are just changing the order in which
we acquire two locks.  If there were some risk of blocking for a
measurable time while holding the BufMgrLock, then that would be bad for
concurrent performance --- but in fact the per-buffer lock is guaranteed
free at that point.

I don't think there's any value in trying to avoid the I/O.  This is a
corner case of such rarity that it's only been seen perhaps half a dozen
times in the history of the project.  "Optimizing" it is not the proper
concern.  The case where the I/O is wasted because someone re-pins the
buffer during the write is far more likely, simply because of the
relative widths of the windows involved; and we can't avoid that.

> My suggestion: LockBuffer in FlushBuffer should return as unsuccessful
> if there is an LW_EXCLUSIVE lock already held, causing another iteration
> of the do while loop in BufferAlloc.

This would break the other callers of FlushBuffer.  We could redefine
FlushBuffer as taking either a conditional or unconditional lock, but
I think that's a weirder API than a flag to say the lock is already
taken.

Bottom line is that I don't think it's useful to consider this as a
performance issue.  What we need is correctness with minimum extra
complication of the logic.
        regards, tom lane


Re: 'COPY ... FROM' inserts to btree, blocks on buffer

From
Simon Riggs
Date:
On Sat, 2005-01-01 at 18:55, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > I think the proposal sounds safe, though I worry about performance.
> 
> There is no performance loss; we are just changing the order in which
> we acquire two locks.  If there were some risk of blocking for a
> measurable time while holding the BufMgrLock, then that would be bad for
> concurrent performance --- but in fact the per-buffer lock is guaranteed
> free at that point.
> 
> I don't think there's any value in trying to avoid the I/O.  This is a
> corner case of such rarity that it's only been seen perhaps half a dozen
> times in the history of the project.  "Optimizing" it is not the proper
> concern.  The case where the I/O is wasted because someone re-pins the
> buffer during the write is far more likely, simply because of the
> relative widths of the windows involved; and we can't avoid that.

The deadlock is incredibly rare, I agree. That was not my point.

The situation where another backend requests the block immediately
before the I/O is fairly common AFAICS, especially since
StrategyGetBuffer ignores the BM_DIRTY flag in selecting victims.

ISTM making the code deadlock-safe will affect cases where there never
would have been a deadlock, slowing both backends down while waiting for
the I/O to complete.

> Bottom line is that I don't think it's useful to consider this as a
> performance issue.  What we need is correctness with minimum extra
> complication of the logic.

I'll step aside and look more closely for 8.1.

-- 
Best Regards, Simon Riggs



Re: 'COPY ... FROM' inserts to btree, blocks on buffer writeout

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> The situation where another backend requests the block immediately
> before the I/O is fairly common AFAICS, especially since
> StrategyGetBuffer ignores the BM_DIRTY flag in selecting victims.

How do you figure that?  StrategyGetBuffer won't return the same buffer
again (because dirty or not, it'll be pinned by the time anyone else
gets to run StrategyGetBuffer).  The case we are interested in is where
someone suddenly wants the original page again --- that is, a page that
was just about to fall off the back end of the freelist is wanted again.
I don't see that that case is common, especially not with a reasonably
large shared_buffer setting, and most especially not when the bgwriter
is doing its job and keeping the back end of the freelist clean.

> ISTM making the code deadlock-safe will effect cases where there never
> would have been a deadlock, slowing both backends down while waiting for
> the I/O to complete.

The other backend will have to wait for the I/O to complete before he
can gain an exclusive lock on the page ... but so what?  If he'd come
along a microsecond later, he'd have had to wait, too.  Basically we are
eliminating a very narrow window by causing it to behave the same as
what happens in the larger window where the I/O is occurring.

(BTW, "I/O" in this case actually just represents transferring the data
to kernel buffers, so the amount of delay involved is likely not to be
all that large...)
        regards, tom lane


Re: 'COPY ... FROM' inserts to btree, blocks on buffer writeout

From
Tom Lane
Date:
I wrote:
> I think that it would work for BufferAlloc to share-lock the victim
> buffer before calling FlushBuffer; we'd have to add a bool parameter to
> FlushBuffer telling it the lock was already acquired.

I've applied a patch for this.

> BTW, it looks to me like this deadlock potential has existed at least
> since 7.0.  I seem to recall one or two reports of unexplainable
> apparent deadlocks, which perhaps are now explained.

On closer investigation the deadlock does not seem to exist in 7.4 and
before, because BufferReplace didn't acquire the buffer sharelock.
(There is a comment in the 7.4 code claiming that we didn't need to,
but I'm unconvinced that it's correct...)
        regards, tom lane


Re: 'COPY ... FROM' inserts to btree, blocks on buffer

From
Simon Riggs
Date:
On Mon, 2005-01-03 at 17:14, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > The situation where another backend requests the block immediately
> > before the I/O is fairly common AFAICS, especially since
> > StrategyGetBuffer ignores the BM_DIRTY flag in selecting victims.
> 
> How do you figure that?  StrategyGetBuffer won't return the same buffer
> again (because dirty or not, it'll be pinned by the time anyone else
> gets to run StrategyGetBuffer).  The case we are interested in is where
> someone suddenly wants the original page again --- that is, a page that
> was just about to fall off the back end of the freelist is wanted again.
> I don't see that that case is common, especially not with a reasonably
> large shared_buffer setting, and most especially not when the bgwriter
> is doing its job and keeping the back end of the freelist clean.

Yes, what I was effectively arguing for was to tune for the case where
shared_buffers is still at the default...which is of course fairly
pointless, since the way to tune is just to increase shared_buffers.

...Fully agree with your original suggestion now.

-- 
Best Regards, Simon Riggs