Re: Bug in new buffer freelist code - Mailing list pgsql-hackers

From Jan Wieck
Subject Re: Bug in new buffer freelist code
Date
Msg-id 3FE8A88D.10309@Yahoo.com
In response to Bug in new buffer freelist code  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Bug in new buffer freelist code
Re: Bug in new buffer freelist code
List pgsql-hackers
Tom Lane wrote:
> I just had the parallel regression tests hang up due to what appears to
> be a bug in the new ARC code.  The CLUSTER test gets into an infinite
> loop trying to do "CLUSTER clstr_1;".  The loop is in
> StrategyInvalidateBuffer's check that the buffer is already in the
> freelist; it isn't, and the freelist is circular.

It seems to me that buffers thrown away via StrategyInvalidateBuffer()
do not get their relnode and blocknum cleared. As a result,
FlushRelationBuffers(), which does a full scan of the whole buffer pool,
again finds buffers that once contained the block.

Suppose buffer 839 once contained that block and was given up that way,
and buffer 850 later contains it, so there is a CDB for it. When
FlushRelationBuffers() now scans the buffer pool, it finds buffer 839
first and calls StrategyInvalidateBuffer() for it. That finds the CDB for
buffer 850 and adds buffer 839 to the freelist again. Later on,
FlushRelationBuffers() calls StrategyInvalidateBuffer() for buffer 850,
and we have the situation at hand.
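
The scenario can be reproduced in a toy model. The sketch below is not the
freelist.c code; every name in it (ToyBuf, ToyPool, toy_invalidate, and so
on) is hypothetical, the "directory" is a plain array standing in for the
CDB lookup, and buffer ids 1 and 2 play the roles of 839 and 850. It only
assumes what is described above: invalidation looks up the buffer's tag,
drops whatever directory entry it finds, and pushes the buffer onto the
freelist head.

```c
#include <assert.h>
#include <stdbool.h>

#define NBUF      4
#define NBLOCK    2
#define NO_BLOCK  (-1)   /* cleared tag */
#define NO_BUF    (-1)   /* end of freelist / empty directory slot */

typedef struct {
    int block;      /* toy stand-in for the buffer tag (relnode + blocknum) */
    int next_free;  /* toy stand-in for bufNext */
} ToyBuf;

typedef struct {
    ToyBuf buf[NBUF];
    int    dir[NBLOCK];  /* toy stand-in for the CDB lookup: block -> buffer id */
    int    free_head;    /* toy stand-in for listFreeBuffers */
} ToyPool;

static void toy_init(ToyPool *p)
{
    for (int i = 0; i < NBUF; i++) {
        p->buf[i].block = NO_BLOCK;
        p->buf[i].next_free = NO_BUF;
    }
    for (int b = 0; b < NBLOCK; b++)
        p->dir[b] = NO_BUF;
    p->free_head = NO_BUF;
}

/* Buffer `id` now holds `block`; record it in the directory. */
static void toy_load(ToyPool *p, int id, int block)
{
    assert(id >= 0 && id < NBUF && block >= 0 && block < NBLOCK);
    p->buf[id].block = block;
    p->dir[block] = id;
}

/* Assumed behaviour: look up by tag, drop any entry found (even one that
 * belongs to a different buffer), then push the buffer on the freelist. */
static void toy_invalidate(ToyPool *p, int id, bool clear_tag)
{
    int block = p->buf[id].block;

    if (block != NO_BLOCK && p->dir[block] != NO_BUF)
        p->dir[block] = NO_BUF;
    p->buf[id].next_free = p->free_head;
    p->free_head = id;
    if (clear_tag)
        p->buf[id].block = NO_BLOCK;  /* the cleanup suspected to be missing */
}

/* A sane freelist has at most NBUF entries; more steps means a cycle. */
static bool toy_freelist_is_circular(const ToyPool *p)
{
    int steps = 0;

    for (int id = p->free_head; id != NO_BUF; id = p->buf[id].next_free)
        if (++steps > NBUF)
            return true;
    return false;
}

/* Replays the sequence described above; returns true if the freelist
 * ends up circular. */
bool toy_scenario(bool clear_tag)
{
    ToyPool p;

    toy_init(&p);
    toy_load(&p, 1, 0);                /* buffer 1 holds block 0 ...      */
    toy_invalidate(&p, 1, clear_tag);  /* ... and is given up             */
    toy_load(&p, 2, 0);                /* later buffer 2 holds block 0    */
    for (int id = 0; id < NBUF; id++)  /* full scan of the pool by tag    */
        if (p.buf[id].block == 0)
            toy_invalidate(&p, id, clear_tag);
    return toy_freelist_is_circular(&p);
}
```

In this model, toy_scenario(false) leaves buffer 1 pointing at itself on
the freelist, much like bufNext = 839 in the dump of buffer 839 quoted
below, while toy_scenario(true) leaves a properly terminated list. Whether
clearing the tag is the right fix for the real code is a separate question.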


Does that make sense?

Jan

> 
> (gdb) bt
> #0  0x1fe8a8 in StrategyInvalidateBuffer (buf=0xc3a56f60) at freelist.c:733
> #1  0x1fbf08 in FlushRelationBuffers (rel=0x400fa298, firstDelBlock=0)
>     at bufmgr.c:1596
> #2  0x1479fc in swap_relfilenodes (r1=143786, r2=143915) at cluster.c:736
> #3  0x147458 in rebuild_relation (OldHeap=0x2322b, indexOid=143788)
>     at cluster.c:455
> #4  0x1473b0 in cluster_rel (rvtc=0x7b03bed8, recheck=0 '\000')
>     at cluster.c:395
> #5  0x146ff4 in cluster (stmt=0x400b88a8) at cluster.c:232
> #6  0x21c60c in ProcessUtility (parsetree=0x400b88a8, dest=0x400b88e8,
>     completionTag=0x7b03bbe8 "") at utility.c:1033
> ... etc ...
> 
> (gdb) p *buf
> $5 = {bufNext = -1, data = 7211904, tag = {rnode = {tblNode = 17142,
>       relNode = 143906}, blockNum = 0}, buf_id = 850, flags = 14,
>   refcount = 0, io_in_progress_lock = 1721, cntx_lock = 1722,
>   cntxDirty = 0 '\000', wait_backend_id = 0}
> (gdb) p *StrategyControl
> $1 = {target_T1_size = 423, listUnusedCDB = 249, listHead = {464, 967, 1692,
>     1227}, listTail = {968, 645, 1528, 1694}, listSize = {364, 413, 584, 636},
>   listFreeBuffers = 839, num_lookup = 546939, num_hit = {1378, 246896, 282639,
>     3935}, stat_report = 0, cdb = {{prev = 386, next = 23, list = 3,
>       buf_tag = {rnode = {tblNode = 17142, relNode = 19080}, blockNum = 30},
>       buf_id = -1, t1_xid = 3402}}}
> (gdb) p BufferDescriptors[839]
> $2 = {bufNext = 839, data = 7121792, tag = {rnode = {tblNode = 17142,
>       relNode = 143906}, blockNum = 0}, buf_id = 839, flags = 14,
>   refcount = 0, io_in_progress_lock = 1699, cntx_lock = 1700,
>   cntxDirty = 0 '\000', wait_backend_id = 0}
> 
> So we've got a couple of problems here: buffers 839 and 850 both claim
> to contain block 0 of rel 143906 (which is clstr_1), and the freelist
> is circular.
> 
> This doesn't seem to be super reproducible, but there's definitely a
> problem in there somewhere.
> 
>             regards, tom lane


-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #


