Thread: [HACKERS] bug in locking an update tuple chain

[HACKERS] bug in locking an update tuple chain

From: Alvaro Herrera
A customer of ours reported a problem in 9.3.14 while inserting tuples
in a table with a foreign key, with many concurrent transactions doing
the same: after a few insertions worked successfully, a later one would
return failure indicating that the primary key value was not present in
the referenced table.  It worked fine for them on 9.3.4.

After some research, we determined that the problem disappeared if
this commit was reverted:

Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Branch: master Release: REL9_6_BR [533e9c6b0] 2016-07-15 14:17:20 -0400
Branch: REL9_5_STABLE Release: REL9_5_4 [649dd1b58] 2016-07-15 14:17:20 -0400
Branch: REL9_4_STABLE Release: REL9_4_9 [166873dd0] 2016-07-15 14:17:20 -0400
Branch: REL9_3_STABLE Release: REL9_3_14 [6c243f90a] 2016-07-15 14:17:20 -0400

    Avoid serializability errors when locking a tuple with a committed update

I spent some time writing an isolationtester spec to reproduce the
problem.  It turned out that this required six concurrent sessions in
order for the problem to show up at all, but once I had that, figuring
out what was going on was simple: a transaction wants to lock the
updated version of some tuple, and it does so; and some other
transaction is also locking the same tuple concurrently in a compatible
way.  So both are okay to proceed concurrently.  The problem is that if
one of them detects that anything changed in the process of doing this
(such as the other session updating the multixact to include itself,
both having compatible lock modes), it loops back to ensure xmax/
infomask are still sane; but heap_lock_updated_tuple_rec is not prepared
to deal with the situation of "the current transaction has the lock
already", so it returns a failure and the tuple is returned as "not
visible" causing the described problem.

I *think* that the problem did not show up before the commit cited above
because the bug fixed by that commit reduced concurrency, effectively
masking this bug.

In assertion-enabled builds, this happens instead (about 2 out of 3
times I run the script on my laptop):

TRAP: FailedAssertion("!(TransactionIdIsCurrentTransactionId(((!((tup)->t_infomask & 0x0800) && ((tup)->t_infomask & 0x1000)
&& !((tup)->t_infomask & 0x0080)) ? HeapTupleGetUpdateXid(tup) : ( (tup)->t_choice.t_heap.t_xmax) )))",
File: "/pgsql/source/REL9_3_STABLE/src/backend/utils/time/combocid.c", Line: 122)

with this backtrace:

(gdb) bt
#0  0x00007f651cd66067 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f651cd67448 in __GI_abort () at abort.c:89
#2  0x0000000000754ac1 in ExceptionalCondition (
    conditionName=conditionName@entry=0x900d68 "!(TransactionIdIsCurrentTransactionId(( (!((tup)->t_infomask & 0x0800)
    && ((tup)->t_infomask & 0x1000) && !((tup)->t_infomask & 0x0080)) ? HeapTupleGetUpdateXid(tup) : ( (tup)->t_choice.t_heap.t_xmax "...,
    errorType=errorType@entry=0x78cdb4 "FailedAssertion",
    fileName=fileName@entry=0x900ca8 "/pgsql/source/REL9_3_STABLE/src/backend/utils/time/combocid.c",
    lineNumber=lineNumber@entry=122)
    at /pgsql/source/REL9_3_STABLE/src/backend/utils/error/assert.c:54
#3  0x0000000000781cf8 in HeapTupleHeaderGetCmax (tup=0x7f6513f289d0)
    at /pgsql/source/REL9_3_STABLE/src/backend/utils/time/combocid.c:122
#4  0x0000000000495911 in heap_lock_tuple (relation=0x7f651e0d9138, tuple=0x7ffe928a7de0, cid=0,
    mode=LockTupleKeyShare, nowait=0 '\000', follow_updates=1 '\001', buffer=0x7ffe928a7dcc, hufd=0x7ffe928a7dd0)
    at /pgsql/source/REL9_3_STABLE/src/backend/access/heap/heapam.c:4439
#5  0x00000000005b2f74 in ExecLockRows (node=0x290f070)
    at /pgsql/source/REL9_3_STABLE/src/backend/executor/nodeLockRows.c:134

The attached patch fixes the problem.  When locking some old tuple version of
the chain, if we detect that we already hold that lock
(test_lockmode_for_conflict returns HeapTupleSelfUpdated), do not try to lock
it again but instead skip ahead to the next version.  This fixes the synthetic
case in my isolationtester as well as our customer's production case.
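
For clarity, here is a minimal, self-contained C sketch of that idea (not
the actual patch; ToyVersion, check_lock and lock_chain are invented
stand-ins, with check_lock playing the role of test_lockmode_for_conflict):
when a version in the chain turns out to be locked by the current
transaction already, the walk skips ahead instead of bailing out.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef struct ToyVersion
{
    uint32_t locker_xid;        /* 0 = unlocked, otherwise the locking xid */
    struct ToyVersion *next;    /* next version in the update chain */
} ToyVersion;

typedef enum { LOCK_OK, LOCK_SELF, LOCK_CONFLICT } ToyLockResult;

static const uint32_t current_xid = 10;

/* Stand-in for test_lockmode_for_conflict in this toy model. */
static ToyLockResult
check_lock(const ToyVersion *v)
{
    if (v->locker_xid == 0)
        return LOCK_OK;
    if (v->locker_xid == current_xid)
        return LOCK_SELF;       /* "we already hold this lock" */
    return LOCK_CONFLICT;
}

/* Walk the update chain; with the fix, LOCK_SELF means "skip ahead". */
static int
lock_chain(ToyVersion *v)
{
    for (; v != NULL; v = v->next)
    {
        switch (check_lock(v))
        {
            case LOCK_SELF:
                /* the fix: already locked by us, move on to the next version */
                printf("version already locked by us, skipping ahead\n");
                continue;
            case LOCK_CONFLICT:
                return -1;      /* the real code would sleep or report failure */
            case LOCK_OK:
                v->locker_xid = current_xid;
                printf("locked one version\n");
                break;
        }
    }
    return 0;
}

int
main(void)
{
    /* Three versions; the middle one already carries our lock from an
     * earlier pass, as in the looped-back case described above. */
    ToyVersion v3 = {0, NULL};
    ToyVersion v2 = {current_xid, &v3};
    ToyVersion v1 = {0, &v2};

    return lock_chain(&v1) == 0 ? 0 : 1;
}

The real code of course deals with heap tuples, multixacts and lock modes
rather than a toy linked list, but the control-flow change is the same.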


I attach the isolationtester spec file.  In its present form it's rather
noisy, because I added calls to report the XIDs for each transaction, so
that I could exactly replicate the WAL sequence that I obtained from the
customer in order to reproduce the issue.  That would have to be removed
in order for the test to be included in the repository, of course.  This
spec doesn't work at all with 9.3's isolationtester, because back then
isolationtester had the limitation that only one session could be
waiting; to verify this bug, I backpatched 38f8bdcac498 ("Modify the
isolation tester so that multiple sessions can wait.") so that it'd work
at all.  This means that we cannot include the test in branches older
than 9.6.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] bug in locking an update tuple chain

From: Amit Kapila
On Sat, Jul 15, 2017 at 2:30 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> A customer of ours reported a problem in 9.3.14 while inserting tuples
> in a table with a foreign key, with many concurrent transactions doing
> the same: after a few insertions worked successfully, a later one would
> return failure indicating that the primary key value was not present in
> the referenced table.  It worked fine for them on 9.3.4.
>
> After some research, we determined that the problem disappeared if
> this commit was reverted:
>
> Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
> Branch: master Release: REL9_6_BR [533e9c6b0] 2016-07-15 14:17:20 -0400
> Branch: REL9_5_STABLE Release: REL9_5_4 [649dd1b58] 2016-07-15 14:17:20 -0400
> Branch: REL9_4_STABLE Release: REL9_4_9 [166873dd0] 2016-07-15 14:17:20 -0400
> Branch: REL9_3_STABLE Release: REL9_3_14 [6c243f90a] 2016-07-15 14:17:20 -0400
>
>     Avoid serializability errors when locking a tuple with a committed update
>
> I spent some time writing an isolationtester spec to reproduce the
> problem.  It turned out that this required six concurrent sessions in
> order for the problem to show up at all, but once I had that, figuring
> out what was going on was simple: a transaction wants to lock the
> updated version of some tuple, and it does so; and some other
> transaction is also locking the same tuple concurrently in a compatible
> way.  So both are okay to proceed concurrently.  The problem is that if
> one of them detects that anything changed in the process of doing this
> (such as the other session updating the multixact to include itself,
> both having compatible lock modes), it loops back to ensure xmax/
> infomask are still sane; but heap_lock_updated_tuple_rec is not prepared
> to deal with the situation of "the current transaction has the lock
> already", so it returns a failure and the tuple is returned as "not
> visible" causing the described problem.
>

Your fix seems logical to me, though I have not tested it yet.
However, I wonder why heap_lock_tuple needs to restart from the
beginning of the update chain in this case?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] bug in locking an update tuple chain

From: Alvaro Herrera
Amit Kapila wrote:
> On Sat, Jul 15, 2017 at 2:30 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > a transaction wants to lock the
> > updated version of some tuple, and it does so; and some other
> > transaction is also locking the same tuple concurrently in a compatible
> > way.  So both are okay to proceed concurrently.  The problem is that if
> > one of them detects that anything changed in the process of doing this
> > (such as the other session updating the multixact to include itself,
> > both having compatible lock modes), it loops back to ensure xmax/
> > infomask are still sane; but heap_lock_updated_tuple_rec is not prepared
> > to deal with the situation of "the current transaction has the lock
> > already", so it returns a failure and the tuple is returned as "not
> > visible" causing the described problem.
> 
> Your fix seems logical to me, though I have not tested it yet.
> However, I wonder why heap_lock_tuple needs to restart from the
> beginning of the update chain in this case?

Well, it's possible that we could change things so that it doesn't need
to re-start from the same spot where it initially began, but I think it
requires changing too much code; I'd rather not touch it in a
back-patchable bug fix.  If we really wanted, we could perhaps change
things to avoid repeated walks of the chain, but I'd see that as a pg11
(or future) change only.  (You would be forgiven for thinking that the
interactions between EvalPlanQualFetch, heap_lock_tuple and
heap_lock_updated_tuple are rather Rube Goldbergian, to use Tom's term.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] bug in locking an update tuple chain

From: Alvaro Herrera
> The attached patch fixes the problem.  When locking some old tuple version of
> the chain, if we detect that we already hold that lock
> (test_lockmode_for_conflict returns HeapTupleSelfUpdated), do not try to lock
> it again but instead skip ahead to the next version.  This fixes the synthetic
> case in my isolationtester as well as our customer's production case.

Pushed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services