slru.c race condition (was Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)",) - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | slru.c race condition (was Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)",) |
Date | |
Msg-id | 15543.1130714273@sss.pgh.pa.us Whole thread Raw |
In response to | Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)", ("Jim Nasby" <jnasby@pervasive.com>) |
Responses |
Re: slru.c race condition (was Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)",)
Re: slru.c race condition (was Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)",) Re: slru.c race condition (was Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)",) |
List | pgsql-hackers |
OK, I think I see it. The problem is that the code in slru.c is careful about not modifying state when it doesn't hold the proper lock, but not so careful about not *inspecting* state without the proper lock. In particular consider these lines in SimpleLruReadPage (line numbers are as in CVS tip): /* Mark the slot read-busy (no-op if it already was) */ 277 shared->page_number[slotno] = pageno; 278 shared->page_status[slotno] = SLRU_PAGE_READ_IN_PROGRESS; ... /* Release shared lock, grab per-buffer lock instead */ 287 LWLockRelease(shared->ControlLock); 288 LWLockAcquire(shared->buffer_locks[slotno], LW_EXCLUSIVE); /* * Check to see if someone else already did the read, or took * the buffer away from us. If so, restart fromthe top. */ 294 if (shared->page_number[slotno] != pageno || 295 shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS) ... Suppose that we arrive at line 277 when the page is currently being faulted in by another process (ie, its state is already read-in-progress). The assignments at lines 277 & 278 are then no-ops, and we'll block at line 288 waiting for the other guy to finish the I/O and release the per-buffer lock. Suppose that after the I/O finishes and before we get to run again, this buffer sinks back to the bottom of the LRU list and gets chosen to be replaced with another page. Some other process will then start executing this code and will change the page number (line 277) and change the state from clean to read-in-progress (line 278). The race condition is that this could happen between the tests at lines 294 and 295 in our original process. In that case the original process would think that the page still needed to be read in, and would set about doing so. It would discover its error at the Assert at line 308, exactly where Jim reports a problem. The association we thought we'd noted to recursion through SlruSelectLRUPage is in part a red herring: the race can occur without that. However, it's perhaps a bit more probable to occur in that path, because when SlruSelectLRUPage recurses back to SimpleLruReadPage, we *know* that there is another process currently doing read-in, and so the first coincidence is already satisfied. Still, the race condition window is extremely narrow, only a couple of instructions. I looked at the assembly output for the compiler Jim is using, and it looks like this: cmpl %r13d, 104(%rbp,%r12,4)je .L155 .L116:... code for if() body here... .L155:cmpl $1, 72(%rbp,%r12,4)jne .L116... code for subsequent lines here However, if the processor predicts the forward branch not to be taken, it could waste a few cycles recovering from its mistake, so the window maybe is a little wider than it appears. I'd like Jim to test this theory by seeing if it helps to reverse the order of the if-test elements at lines 294/295, ie make it look like if (shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS || shared->page_number[slotno] != pageno) This won't do as a permanent patch, because it isn't guaranteed to fix the problem on machines that don't strongly order writes, but it should work on Opterons, at least well enough to confirm the diagnosis. I'm still thinking about how to make a real fix without introducing another lock/unlock cycle --- we can do that if we have to, but I think maybe it's possible to fix it without. SimpleLruWritePage has an identical problem BTW :-( regards, tom lane
pgsql-hackers by date: