slru.c race condition (was Re: TRAP: FailedAssertion("!((itemid)->lp_flags & 0x01)",) - Mailing list pgsql-hackers

OK, I think I see it.  The problem is that the code in slru.c is careful
about not modifying state when it doesn't hold the proper lock, but not
so careful about not *inspecting* state without the proper lock.  In
particular consider these lines in SimpleLruReadPage (line numbers are
as in CVS tip):
    /* Mark the slot read-busy (no-op if it already was) */
277        shared->page_number[slotno] = pageno;
278        shared->page_status[slotno] = SLRU_PAGE_READ_IN_PROGRESS;
    ...
    /* Release shared lock, grab per-buffer lock instead */
287        LWLockRelease(shared->ControlLock);
288        LWLockAcquire(shared->buffer_locks[slotno], LW_EXCLUSIVE);
    /*     * Check to see if someone else already did the read, or took     * the buffer away from us.  If so, restart
fromthe top.     */
 
294        if (shared->page_number[slotno] != pageno ||
295            shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS)    ...

Suppose that we arrive at line 277 when the page is currently being
faulted in by another process (ie, its state is already
read-in-progress).  The assignments at lines 277 & 278 are then no-ops,
and we'll block at line 288 waiting for the other guy to finish the I/O
and release the per-buffer lock.

Suppose that after the I/O finishes and before we get to run again, this
buffer sinks back to the bottom of the LRU list and gets chosen to be
replaced with another page.  Some other process will then start
executing this code and will change the page number (line 277) and
change the state from clean to read-in-progress (line 278).

The race condition is that this could happen between the tests at lines
294 and 295 in our original process.  In that case the original process
would think that the page still needed to be read in, and would set
about doing so.  It would discover its error at the Assert at line
308, exactly where Jim reports a problem.

The association we thought we'd noted to recursion through
SlruSelectLRUPage is in part a red herring: the race can occur without
that.  However, it's perhaps a bit more probable to occur in that path,
because when SlruSelectLRUPage recurses back to SimpleLruReadPage, we
*know* that there is another process currently doing read-in, and so
the first coincidence is already satisfied.

Still, the race condition window is extremely narrow, only a couple of
instructions.  I looked at the assembly output for the compiler
Jim is using, and it looks like this:
cmpl    %r13d, 104(%rbp,%r12,4)je    .L155
.L116:... code for if() body here...
.L155:cmpl    $1, 72(%rbp,%r12,4)jne    .L116... code for subsequent lines here

However, if the processor predicts the forward branch not to be taken,
it could waste a few cycles recovering from its mistake, so the window
maybe is a little wider than it appears.

I'd like Jim to test this theory by seeing if it helps to reverse the
order of the if-test elements at lines 294/295, ie make it look like
       if (shared->page_status[slotno] != SLRU_PAGE_READ_IN_PROGRESS ||           shared->page_number[slotno] !=
pageno)

This won't do as a permanent patch, because it isn't guaranteed to fix
the problem on machines that don't strongly order writes, but it should
work on Opterons, at least well enough to confirm the diagnosis.

I'm still thinking about how to make a real fix without introducing
another lock/unlock cycle --- we can do that if we have to, but I think
maybe it's possible to fix it without.

SimpleLruWritePage has an identical problem BTW :-(
        regards, tom lane


pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: pg_dump option to dump only functions
Next
From: Tom Lane
Date:
Subject: Re: add_missing_from breaks existing views