Re: BUG #13681: Serialization failures caused by new multixact code of 9.3 (back-patch request) - Mailing list pgsql-bugs

From Alvaro Herrera
Subject Re: BUG #13681: Serialization failures caused by new multixact code of 9.3 (back-patch request)
Date
Msg-id 20151218172413.GS2618@alvherre.pgsql
Whole thread Raw
In response to Re: BUG #13681: Serialization failures caused by new multixact code of 9.3 (back-patch request)  (Kevin Grittner <kgrittn@gmail.com>)
Responses Re: BUG #13681: Serialization failures caused by new multixact code of 9.3 (back-patch request)  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-bugs
Kevin Grittner wrote:
> On Thu, Dec 17, 2015 at 12:31 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > we have a transaction that takes a lock-only multi in table
> > users, and then when we do the second update we don't look it up
> > because ...??
>
> The referencing column value did not change.  (We would not have
> looked up on the first update either, since it also didn't change
> there.)
>
> >  And then this causes the test case not to fail because ..?
>
> The concurrent update of the referencing table is not seen as a
> write conflict (because it didn't actually change).

Okay, but I don't understand why that particular patch fixes the
problem.  That patch was only supposed to skip looking up members of
multixacts when they were not interesting; so it's just a performance
optimization which shouldn't change the behavior of anything.  However
in this case, it is changing behavior and I'm concerned that it might
have introduced other bugs.

Tracing through the test case, what happens is that s1's second update
calls SELECT FOR SHARE of the users tuple (via the RI code); this calls
ExecLockRows which calls heap_lock_tuple which calls
HeapTupleSatisfiesUpdate.  The latter returns HeapTupleUpdated, because
it sees that the tuple has been modified by a transaction that already
committed (s2).  But we already hold a for-key-share lock on the old
version of the tuple -- that HeapTupleUpdated result value should be
examined carefully by heap_lock_tuple to determine that, yes, it can
acquire the row lock without raising an error.

So, if I patch heap_lock_rows like below, the test passes too:

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 559970f..aaf8e8e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4005,7 +4005,7 @@ l3:
         UnlockReleaseBuffer(*buffer);
         elog(ERROR, "attempted to lock invisible tuple");
     }
-    else if (result == HeapTupleBeingUpdated)
+    else if (result == HeapTupleBeingUpdated || result == HeapTupleUpdated)
     {
         TransactionId xwait;
         uint16        infomask;

I think heap_lock_rows had that shape (only consider BeingUpdated as a
reason to check/wait) only because it was possible to lock a row that
was being locked by someone else, but it wasn't possible to lock a row
that had been updated by someone else -- which became possible in 9.3.
So this patch is necessary, and not just to fix this one bug.

I need to study the implications of this patch more closely, but I think
in theory it is correct.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-bugs by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Known issues on PostgreSQL server 8.1.19
Next
From: "Millepied, Pascal (GE Healthcare)"
Date:
Subject: Re: Known issues on PostgreSQL server 8.1.19