Re: assertion failure 9.3.4 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: assertion failure 9.3.4
Date
Msg-id 20140421185422.GA13906@alap3.anarazel.de
In response to assertion failure 9.3.4  (Andrew Dunstan <andrew.dunstan@pgexperts.com>)
Responses Re: assertion failure 9.3.4  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: assertion failure 9.3.4  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-hackers
Hi,

I spent the last two hours poking around in the environment Andrew
provided. I was able to reproduce the issue, find an assert that
reproduces it much faster, and find a possible root cause.

Since the symptom of the problem seems to be multixacts with more than
one updating xid, I added a check to MultiXactIdCreateFromMembers()
preventing that. That requires moving ISUPDATE_from_mxstatus() to a
header, but I think we should definitely add such an assert.
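
For illustration, a minimal standalone sketch of what such a check could look like. The types below are simplified stand-ins rather than the real PostgreSQL definitions, the status values and the ISUPDATE_from_mxstatus() macro are reproduced from memory of the 9.3 multixact code and should be verified, and count_updaters() and assert_single_updater() are hypothetical names, not functions from the tree or from the actual patch.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the PostgreSQL types; not the real definitions. */
typedef uint32_t TransactionId;

typedef enum
{
    MultiXactStatusForKeyShare = 0x00,
    MultiXactStatusForShare = 0x01,
    MultiXactStatusForNoKeyUpdate = 0x02,
    MultiXactStatusForUpdate = 0x03,
    MultiXactStatusNoKeyUpdate = 0x04,  /* update that doesn't touch keys */
    MultiXactStatusUpdate = 0x05        /* other updates, and delete */
} MultiXactStatus;

/* Statuses above ForUpdate are real updates rather than locks. */
#define ISUPDATE_from_mxstatus(status) ((status) > MultiXactStatusForUpdate)

typedef struct MultiXactMember
{
    TransactionId   xid;
    MultiXactStatus status;
} MultiXactMember;

/* Hypothetical helper: number of members that are updaters. */
static int
count_updaters(int nmembers, const MultiXactMember *members)
{
    int         nupdaters = 0;

    for (int i = 0; i < nmembers; i++)
    {
        if (ISUPDATE_from_mxstatus(members[i].status))
            nupdaters++;
    }
    return nupdaters;
}

/* Hypothetical check: a multixact may contain at most one updater. */
static void
assert_single_updater(int nmembers, const MultiXactMember *members)
{
    assert(count_updaters(nmembers, members) <= 1);
}
```

A member list with one key-share locker and one updater passes this check. A list with two updating members trips the assert, which is the state the failing testcase ends up creating.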

As it turns out the problem is in the
    else if (result == HeapTupleBeingUpdated && wait)
branch in (at least) heap_update(). When the problem is hit, the
to-be-updated tuple originally has HEAP_XMIN_COMMITTED |
HEAP_XMAX_LOCK_ONLY | HEAP_XMAX_KEYSHR_LOCK set. So we release the
buffer lock, acquire the tuple lock, and reacquire the buffer lock. But
in between, the locking backend has actually updated the tuple.
The code tries to protect against that with:

                /*
                 * recheck the locker; if someone else changed the tuple while
                 * we weren't looking, start over.
                 */
                if ((oldtup.t_data->t_infomask & HEAP_XMAX_IS_MULTI) ||
                    !TransactionIdEquals(
                                HeapTupleHeaderGetRawXmax(oldtup.t_data),
                                         xwait))
                    goto l2;

                can_continue = true;
                locker_remains = true;

and similar. The problem is that in Andrew's case the infomask changes
from 0x2192 to 0x2102 (i.e. it's a normal update afterwards), while xmax
stays the same. Ooops.
A bit later there's:
    result = can_continue ? HeapTupleMayBeUpdated : HeapTupleUpdated;
So, from there on we happily continue to update the tuple, thinking
there's no previous updater. Which obviously causes problems.
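
To make the failure concrete, here is a small standalone model of the quoted recheck. It is a sketch, not the heap_update() code: TupleModel, old_recheck_restarts() and xmax_is_lock_only() are hypothetical names, the xid value is made up, and the flag values are written from memory of htup_details.h in 9.3, so they should be checked against the header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Flag values as recalled from htup_details.h (9.3); verify before use. */
#define HEAP_XMAX_KEYSHR_LOCK   0x0010
#define HEAP_XMAX_LOCK_ONLY     0x0080
#define HEAP_XMAX_IS_MULTI      0x1000

/* Hypothetical model holding only the fields the recheck looks at. */
typedef struct TupleModel
{
    uint16_t        t_infomask;
    TransactionId   raw_xmax;
} TupleModel;

/* True if xmax is marked as a locker only, not an updater. */
static bool
xmax_is_lock_only(uint16_t infomask)
{
    return (infomask & HEAP_XMAX_LOCK_ONLY) != 0;
}

/* Mirrors the quoted test: true means heap_update() would "goto l2". */
static bool
old_recheck_restarts(const TupleModel *tup, TransactionId xwait)
{
    return (tup->t_infomask & HEAP_XMAX_IS_MULTI) != 0 ||
        tup->raw_xmax != xwait;
}

/* The reported case: same xmax, but the locker became an updater. */
static void
example_old_recheck(void)
{
    TransactionId xwait = 1234;         /* made-up xid */
    TupleModel  after = {0x2102, 1234}; /* infomask after the update */

    assert(xmax_is_lock_only(0x2192));  /* before: lock only */
    assert(!xmax_is_lock_only(0x2102)); /* after: a real update */
    assert(!old_recheck_restarts(&after, xwait));   /* change goes unnoticed */
}
```

The two infomask values differ exactly in HEAP_XMAX_LOCK_ONLY | HEAP_XMAX_KEYSHR_LOCK (0x0090). Neither bit is examined by the recheck, and the raw xmax is unchanged, so the code falls through to can_continue = true.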

I've hacked^Wfixed this by changing the infomask test above into
infomask != oldtup.t_data->t_infomask in a couple of places. That seems
to be sufficient to survive the testcase a couple of times.
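
As a rough illustration of that changed test, the sketch below compares the infomask saved before the buffer lock was released against the current one. It assumes that `infomask` in the description above refers to such a saved copy; TupleModel and new_recheck_restarts() are hypothetical names used only for this standalone model, and the xid value is made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical model holding only the fields the recheck looks at. */
typedef struct TupleModel
{
    uint16_t        t_infomask;
    TransactionId   raw_xmax;
} TupleModel;

/*
 * Modified recheck: restart if the infomask differs in any bit from the
 * copy saved before the buffer lock was released, or if xmax changed.
 */
static bool
new_recheck_restarts(const TupleModel *tup, uint16_t saved_infomask,
                     TransactionId xwait)
{
    return tup->t_infomask != saved_infomask ||
        tup->raw_xmax != xwait;
}

/* The reported case is now detected; an untouched tuple still passes. */
static void
example_new_recheck(void)
{
    TupleModel  updated = {0x2102, 1234};
    TupleModel  untouched = {0x2192, 1234};

    assert(new_recheck_restarts(&updated, 0x2192, 1234));
    assert(!new_recheck_restarts(&untouched, 0x2192, 1234));
}
```

Comparing the whole infomask is deliberately blunt: any bit change, including ones that would be harmless here, forces a restart. That makes it a safe stopgap but not necessarily the proper fix.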

I am too hungry right now to think about a proper fix for this and
whether there are further problematic areas.

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


