Thread: On conflict update & hint bits
Hi,

I am faced with a rarely reproduced problem in our multimaster (and never in vanilla Postgres). We are using our own customized transaction manager, so the problem may well be in our multimaster code. But the stack trace looks suspicious, and this is why I want to consult with people familiar with this code about whether it is a bug in ExecOnConflictUpdate or not.

Briefly: ExecOnConflictUpdate tries to set a hint bit without holding a lock on the buffer and so gets an assertion failure in MarkBufferDirtyHint.

Here is the stack trace:

#0  0x00007fe3b940acc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007fe3b940e0d8 in __GI_abort () at abort.c:89
#2  0x000000000097b996 in ExceptionalCondition (conditionName=0xb4d970 "!(LWLockHeldByMe(((LWLock*) (&(bufHdr)->content_lock))))", errorType=0xb4d2e9 "FailedAssertion", fileName=0xb4d2e0 "bufmgr.c", lineNumber=3380) at assert.c:54
#3  0x00000000007e365b in MarkBufferDirtyHint (buffer=946, buffer_std=1 '\001') at bufmgr.c:3380
#4  0x00000000009c3660 in SetHintBits (tuple=0x7fe396a9d858, buffer=946, infomask=256, xid=1398) at tqual.c:136
#5  0x00000000009c5194 in HeapTupleSatisfiesMVCC (htup=0x7ffc00169030, snapshot=0x2e79778, buffer=946) at tqual.c:1065
#6  0x00000000006ace83 in ExecCheckHeapTupleVisible (estate=0x2e81ae8, tuple=0x7ffc00169030, buffer=946) at nodeModifyTable.c:197
#7  0x00000000006ae343 in ExecOnConflictUpdate (mtstate=0x2e81d50, resultRelInfo=0x2e81c38, conflictTid=0x7ffc001690c0, planSlot=0x2e82428, excludedSlot=0x2e82428, estate=0x2e81ae8, canSetTag=1 '\001', returning=0x7ffc001690c8) at nodeModifyTable.c:1173
#8  0x00000000006ad256 in ExecInsert (mtstate=0x2e81d50, slot=0x2e82428, planSlot=0x2e82428, arbiterIndexes=0x2e7eeb0, onconflict=ONCONFLICT_UPDATE, estate=0x2e81ae8, canSetTag=1 '\001') at nodeModifyTable.c:395
#9  0x00000000006aebe3 in ExecModifyTable (node=0x2e81d50) at nodeModifyTable.c:1496

In ExecOnConflictUpdate the buffer is pinned but not locked:

    /*
     * Lock tuple for update.  Don't follow updates when tuple cannot be
     * locked without doing so.  A row locking conflict here means our
     * previous conclusion that the tuple is conclusively committed is not
     * true anymore.
     */
    tuple.t_self = *conflictTid;
    test = heap_lock_tuple(relation, &tuple, estate->es_output_cid,
                           lockmode, LockWaitBlock, false, &buffer,
                           &hufd);

heap_lock_tuple pins the buffer but does not lock it:

 *	*buffer: set to buffer holding tuple (pinned but not locked at exit)

Later we try to check tuple visibility:

    ExecCheckHeapTupleVisible(estate, &tuple, buffer);

and inside HeapTupleSatisfiesMVCC we try to set a hint bit. MarkBufferDirtyHint assumes that the buffer is locked:

 * 2. The caller might have only share-lock instead of exclusive-lock on the
 *    buffer's content lock.

and we get an assertion failure at

    /* here, either share or exclusive lock is OK */
    Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));

So the question is whether it is correct that ExecOnConflictUpdate tries to access and update the tuple without holding a lock on the buffer?

Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
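For context, the rule the failed assertion enforces is that a buffer pin only keeps the page resident; examining tuple visibility, which may set hint bits via SetHintBits() -> MarkBufferDirtyHint(), additionally requires at least a share content lock. A minimal sketch of the expected pattern, using the stock bufmgr/tqual APIs rather than the actual nodeModifyTable.c code:

    bool    valid;

    /* caller already holds a pin on "buffer", e.g. from heap_lock_tuple() */
    LockBuffer(buffer, BUFFER_LOCK_SHARE);     /* a pin alone is not enough */

    /* this may set hint bits via SetHintBits() -> MarkBufferDirtyHint() */
    valid = HeapTupleSatisfiesVisibility(&tuple, estate->es_snapshot, buffer);

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);    /* drop the content lock, keep the pin */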
On Fri, Sep 30, 2016 at 5:33 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> Later we try to check tuple visibility:
>
>     ExecCheckHeapTupleVisible(estate, &tuple, buffer);
>
> and inside HeapTupleSatisfiesMVCC we try to set a hint bit.

So, you're using repeatable read or serializable isolation level?

--
Peter Geoghegan
On 30.09.2016 19:37, Peter Geoghegan wrote:
> On Fri, Sep 30, 2016 at 5:33 PM, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> Later we try to check tuple visibility:
>>
>>     ExecCheckHeapTupleVisible(estate, &tuple, buffer);
>>
>> and inside HeapTupleSatisfiesMVCC we try to set a hint bit.
>
> So, you're using repeatable read or serializable isolation level?

Repeatable read.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Sep 30, 2016 at 5:33 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> So the question is whether it is correct that ExecOnConflictUpdate tries to
> access and update the tuple without holding a lock on the buffer?

You're right -- this is a bug in Postgres. I'm travelling from Ireland to
the USA this weekend, but will work on this early next week. I don't
think it's a particularly tricky fix -- as you say, it is necessary to
have at least a shared buffer lock to call HeapTupleSatisfiesVisibility(),
and we quite simply fail to ensure that.

We must have a shared buffer lock in the visibility-check path for ON
CONFLICT DO UPDATE where isolation level > READ COMMITTED -- a buffer pin
is not enough. It also looks like the DO NOTHING variant is similarly
affected, even when the isolation level is READ COMMITTED. :-(

Thanks for the detailed report.

--
Peter Geoghegan
On Sat, Oct 1, 2016 at 1:15 PM, Peter Geoghegan <pg@heroku.com> wrote:
> It also looks like the DO NOTHING variant is similarly affected, even
> when the isolation level is READ COMMITTED. :-(

Actually, the DO NOTHING variant is also only affected in isolation
levels greater than READ COMMITTED. Not sure why I thought otherwise.

--
Peter Geoghegan
On Sat, Oct 1, 2016 at 5:15 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Fri, Sep 30, 2016 at 5:33 PM, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> So the question is whether it is correct that ExecOnConflictUpdate tries to
>> access and update the tuple without holding a lock on the buffer?
>
> You're right -- this is a bug in Postgres.

(Thomas Munro and Kevin Grittner added to CC list.)

Attached is a revision of Thomas' patch to fix a related issue; it now
also fixes this no-buffer-lock-held issue. He posted his version of this
to a -general mailing list thread a week ago [1]. That was a thread about
a spurious serialization failure with ON CONFLICT DO NOTHING. I
understand that Kevin is still not happy with the behavior under SSI even
with our fix, since serialization failures will still occur more often
than necessary (see the other thread for details of what I'm talking
about). I believe this patch to be a strict improvement on master, and I
suggest it be applied in advance of the deadline to get fixes in for the
next set of point releases. Note that this revision includes Thomas'
isolation test, which is completely unchanged.

I still intend to work with Kevin to do better than this under SSI,
though that will probably be for a future point release. The behavior
proposed by my fix here is the right thing for the REPEATABLE READ
isolation level, which has nothing to do with Kevin's proposed more
careful handling under SSI. Besides, the buffer pin bug reported by
Konstantin on this thread really should be addressed ASAP.

[1] https://www.postgresql.org/message-id/CAEepm=3Ra9NgDHocDBtB4iiB7MWdavQybNS3F47SvKh1Mk-mFQ@mail.gmail.com

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> Attached is a revision of Thomas' patch to fix a related issue; it now
> also fixes this no-buffer-lock-held issue.

I think this needs more work.

1. Why are you replacing ExecOnConflictUpdate's ExecCheckHeapTupleVisible
call with ExecCheckTIDVisible? That appears to demand a re-fetch of the
tuple ExecOnConflictUpdate already has its hands on. Wouldn't it be
better to just put a buffer lock/unlock around the existing code?

2. Your proposed coding

+	if (!heap_fetch(rel, estate->es_snapshot, &tuple, &buffer, true, NULL))
+ ...
+	if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple.t_data)))

will dump core in the case where heap_fetch returns false with
tuple.t_data set to null. If that case is impossible here, it's not
apparent why, and it'd surely deserve at least a comment, maybe an
Assert. But I'd rather just assume it's possible.

I'm not taking a position on whether this is OK at the higher level of
whether the SSI behavior could be better. However, unless Kevin has
objections to committing something like this now, I think we should fix
the above-mentioned problems and push it. The existing behavior is
clearly bad.

			regards, tom lane
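For illustration, a defensive version of that fragment could look like the sketch below. This is hypothetical code, not taken from the attached patch; the errcode and message simply follow the wording PostgreSQL already uses for snapshot conflicts, and a NULL t_data is treated the same as a tuple we cannot attribute to our own transaction:

    tuple.t_self = *conflictTid;
    if (!heap_fetch(rel, estate->es_snapshot, &tuple, &buffer, true, NULL))
    {
        /*
         * heap_fetch is allowed to return false with tuple.t_data set to
         * NULL, so only consult xmin when we actually got a tuple header.
         */
        if (tuple.t_data == NULL ||
            !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple.t_data)))
            ereport(ERROR,
                    (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                     errmsg("could not serialize access due to concurrent update")));
    }
    if (BufferIsValid(buffer))
        ReleaseBuffer(buffer);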
On Sat, Oct 22, 2016 at 9:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 1. Why are you replacing ExecOnConflictUpdate's ExecCheckHeapTupleVisible
> call with ExecCheckTIDVisible? That appears to demand a re-fetch of the
> tuple ExecOnConflictUpdate already has its hands on. Wouldn't it be
> better to just put a buffer lock/unlock around the existing code?

I thought that that was less than idiomatic within nodeModifyTable.c --
executor modules rarely directly acquire buffer locks. More importantly,
while the lighter-weight ExecCheckHeapTupleVisible() variant seemed to
make sense when there was no buffer lock acquisition, now that it's clear
that a buffer lock acquisition is necessary, I am skeptical. I now doubt
that the added complexity pays for itself, which is why I proposed to
remove ExecCheckHeapTupleVisible(). Besides, ON CONFLICT seems like a
feature that is significantly less compelling at higher isolation levels;
that must be why it took this long to hear a problem report like
Konstantin's.

> 2. Your proposed coding
>
> +	if (!heap_fetch(rel, estate->es_snapshot, &tuple, &buffer, true, NULL))
> + ...
> +	if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple.t_data)))
>
> will dump core in the case where heap_fetch returns false with
> tuple.t_data set to null. If that case is impossible here, it's not
> apparent why, and it'd surely deserve at least a comment, maybe an
> Assert. But I'd rather just assume it's possible.

You could say the same thing about at least one other existing place that
more or less assumes a conflictTid heap_fetch() or similar cannot allow
that to happen, I think. Maybe this is just as good a place to talk about
it as any other, though. I defer to you. (It's safe because our snapshot
is a kind of interlock against vacuum killing the conflictTid tuple.)

> I'm not taking a position on whether this is OK at the higher level
> of whether the SSI behavior could be better. However, unless Kevin
> has objections to committing something like this now, I think we
> should fix the above-mentioned problems and push it. The existing
> behavior is clearly bad.

I don't have any strong feelings on it myself just yet; I trust that
Kevin's judgement that we should do better under SSI is sound. I should
have mentioned that Kevin encouraged me to do this much first in my
description of the patch. I'm pretty confident that he will have no
objection to doing something like this for now.

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Sat, Oct 22, 2016 at 9:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 1. Why are you replacing ExecOnConflictUpdate's ExecCheckHeapTupleVisible
>> call with ExecCheckTIDVisible? That appears to demand a re-fetch of the
>> tuple ExecOnConflictUpdate already has its hands on. Wouldn't it be
>> better to just put a buffer lock/unlock around the existing code?

> I thought that that was less than idiomatic within nodeModifyTable.c
> -- executor modules rarely directly acquire buffer locks.

"Rarely" is not "never". A bigger problem though is that heap_fetch does
more than just lock the buffer: there are also PredicateLockTuple and
CheckForSerializableConflictOut calls in there. It's possible that those
are no-ops in this usage (because after all we already fetched the tuple
once), or maybe they're even desirable because they would help resolve
Kevin's concerns. But it hasn't been analyzed and so I'm not prepared to
risk a behavioral change in this already under-tested area the day before
a release wrap.

AFAICT, it's very hard to get to an actual failure from the lack of
buffer lock here. You would need to have a situation where the tuple was
not hintable when originally fetched but has become so by the time ON
CONFLICT is rechecking it. That's possible, eg if we're using async
commit and the previous transaction's commit record has gotten flushed to
disk in between, but it's not likely. Even then, no actual problem would
ensue (in non-assert builds) from setting a hint bit without buffer lock,
except in very hard-to-hit race conditions. So the buffer lock issue
doesn't seem urgent enough to me to justify making a fix under time
pressure.

The business about not throwing a serialization failure due to actions of
our own xact does seem worth fixing now, but if I understand correctly,
we can deal with that by adding a test for xmin-is-our-own-xact into the
existing code structure. I propose doing that much (and adding Munro's
regression test case) and calling it good for today.

			regards, tom lane
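Concretely, the minimal change being proposed would amount to something like this inside the existing visibility test -- a sketch only, assuming the ExecCheckHeapTupleVisible() arguments visible in the stack trace upthread:

    if (!HeapTupleSatisfiesVisibility(tuple, estate->es_snapshot, buffer))
    {
        /*
         * A conflicting tuple inserted by our own transaction -- e.g. two
         * conflicting keys proposed within a single command -- is not a
         * serialization failure; only complain about other transactions.
         */
        if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
            ereport(ERROR,
                    (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                     errmsg("could not serialize access due to concurrent update")));
    }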
On Sun, Oct 23, 2016 at 1:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Rarely" is not "never". A bigger problem though is that heap_fetch
> does more than just lock the buffer: there are also PredicateLockTuple
> and CheckForSerializableConflictOut calls in there. It's possible that
> those are no-ops in this usage (because after all we already fetched
> the tuple once), or maybe they're even desirable because they would help
> resolve Kevin's concerns. But it hasn't been analyzed and so I'm not
> prepared to risk a behavioral change in this already under-tested area
> the day before a release wrap.

The heap_fetch() contract doesn't ask its callers to consider any of
that. Besides, those actions (PredicateLockTuple +
CheckForSerializableConflictOut) are very probably redundant no-ops as
you say. (They won't help with Kevin's additional concerns, BTW, because
Kevin is concerned about excessive serialization failures with SSI.)

> AFAICT, it's very hard to get to an actual failure from the lack of
> buffer lock here. You would need to have a situation where the tuple
> was not hintable when originally fetched but has become so by the time
> ON CONFLICT is rechecking it. That's possible, eg if we're using
> async commit and the previous transaction's commit record has gotten
> flushed to disk in between, but it's not likely. Even then, no actual
> problem would ensue (in non-assert builds) from setting a hint bit without
> buffer lock, except in very hard-to-hit race conditions. So the buffer
> lock issue doesn't seem urgent enough to me to justify making a fix under
> time pressure.
>
> The business about not throwing a serialization failure due to actions
> of our own xact does seem worth fixing now, but if I understand correctly,
> we can deal with that by adding a test for xmin-is-our-own-xact into
> the existing code structure. I propose doing that much (and adding
> Munro's regression test case) and calling it good for today.

I'm surprised at how you've assessed the risk here. I think that the risk
of not proceeding with fixing the buffer lock issue is greater than the
risk of breaking something else with the proposed fix. Even if the former
risk isn't such a big one.

If there are a non-trivial number of users that are particularly attached
to the precise behavior of higher isolation levels in master (assuming it
would be altered by the proposed fix), then it's surprising that it took
so many months for someone to complain about the ON CONFLICT DO NOTHING
behavior being blatantly broken.

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Sun, Oct 23, 2016 at 1:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> "Rarely" is not "never". A bigger problem though is that heap_fetch
>> does more than just lock the buffer: there are also PredicateLockTuple
>> and CheckForSerializableConflictOut calls in there. It's possible that
>> those are no-ops in this usage (because after all we already fetched
>> the tuple once), or maybe they're even desirable because they would help
>> resolve Kevin's concerns. But it hasn't been analyzed and so I'm not
>> prepared to risk a behavioral change in this already under-tested area
>> the day before a release wrap.

> I'm surprised at how you've assessed the risk here.

What's bothering me is (a) it's less than 24 hours to release wrap and
(b) this patch changes SSI-relevant behavior and hasn't been approved by
Kevin. I'm not familiar enough with that logic to commit a change in it
on my own authority, especially with so little time for problems to be
uncovered.

I'm okay with adding an explicit buffer lock/unlock pair, and in fact
plan to go have a look at that in a bit. I'm not okay with doing a
refactoring that might change the behavior in more ways than that under
these circumstances.

			regards, tom lane
On Sun, Oct 23, 2016 at 2:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What's bothering me is (a) it's less than 24 hours to release wrap and
> (b) this patch changes SSI-relevant behavior and hasn't been approved
> by Kevin. I'm not familiar enough with that logic to commit a change
> in it on my own authority, especially with so little time for problems
> to be uncovered.

I should point out that I found out that the next set of point releases
had been moved forward much later than you did. I had to work on this fix
during the week, which was pretty far from ideal for me for my own
reasons.

> I'm okay with adding an explicit buffer lock/unlock pair, and in fact
> plan to go have a look at that in a bit. I'm not okay with doing a
> refactoring that might change the behavior in more ways than that
> under these circumstances.

Fair enough. As long as we do that much, I'm comfortable.

--
Peter Geoghegan
On 24.10.2016 00:49, Peter Geoghegan wrote:
> On Sun, Oct 23, 2016 at 2:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What's bothering me is (a) it's less than 24 hours to release wrap and
>> (b) this patch changes SSI-relevant behavior and hasn't been approved
>> by Kevin. I'm not familiar enough with that logic to commit a change
>> in it on my own authority, especially with so little time for problems
>> to be uncovered.
>
> I should point out that I found out that the next set of point releases
> had been moved forward much later than you did. I had to work on this
> fix during the week, which was pretty far from ideal for me for my own
> reasons.

Just for information: I know that you are working on this issue, but
since we need to proceed further with our testing of multimaster, I have
made the following obvious changes, and they fix the problem (at least
this assertion failure does not happen any more):

src/backend/executor/nodeModifyTable.c
@@ -1087,6 +1087,13 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	test = heap_lock_tuple(relation, &tuple, estate->es_output_cid,
 						   lockmode, LockWaitBlock, false, &buffer,
 						   &hufd);
+	/*
+	 * We must hold share lock on the buffer content while examining tuple
+	 * visibility.  Afterwards, however, the tuples we have found to be
+	 * visible are guaranteed good as long as we hold the buffer pin.
+	 */
+	LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
 	switch (test)
 	{
 		case HeapTupleMayBeUpdated:
@@ -1142,6 +1149,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 			 * loop here, as the new version of the row might not conflict
 			 * anymore, or the conflicting tuple has actually been deleted.
 			 */
+			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			return false;
@@ -1175,6 +1183,8 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	/* Store target's existing tuple in the state's dedicated slot */
 	ExecStoreTuple(&tuple, mtstate->mt_existing, buffer, false);
 
+	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
 	 * ExecProject.  The EXCLUDED tuple is installed in ecxt_innertuple, while

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Konstantin Knizhnik <k.knizhnik@postgrespro.ru> writes:
> Just for information: I know that you are working on this issue, but
> since we need to proceed further with our testing of multimaster, I
> have made the following obvious changes, and they fix the problem (at
> least this assertion failure does not happen any more):

This seems kind of the hard way --- why didn't you put the buffer lock
calls into ExecCheckHeapTupleVisible?

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8f1fb7d621b0e6bd2eb0ba2ac9634c5b5a03564b

			regards, tom lane
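The shape the linked commit describes is roughly the following, sketched here from the discussion rather than quoted from the tree (comments and error text may differ): both the lock/unlock pair and the xmin test live inside the helper, so every caller gets them.

    static void
    ExecCheckHeapTupleVisible(EState *estate, HeapTuple tuple, Buffer buffer)
    {
        /* only higher isolation levels need this check at all */
        if (!IsolationUsesXactSnapshot())
            return;

        /* the caller holds a pin; take the content lock for the visibility test */
        LockBuffer(buffer, BUFFER_LOCK_SHARE);

        if (!HeapTupleSatisfiesVisibility(tuple, estate->es_snapshot, buffer))
        {
            /* conflicts with tuples our own transaction inserted are not an error */
            if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
                ereport(ERROR,
                        (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
                         errmsg("could not serialize access due to concurrent update")));
        }

        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
    }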
On Sun, Oct 23, 2016 at 3:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The business about not throwing a serialization failure due to actions
> of our own xact does seem worth fixing now, but if I understand correctly,
> we can deal with that by adding a test for xmin-is-our-own-xact into
> the existing code structure. I propose doing that much (and adding
> Munro's regression test case) and calling it good for today.

Thanks. This is the only part of it that I consider an actual *bug*
(since you can retry after the serialization failure forever and never
move on, because there is no other transaction involved which can finish
to clear the problem), as opposed to an opportunity to optimize (by
reducing false-positive serialization failures actually involving other
transactions).

I never saw anything in the literature addressing the intersection of
UPSERT and SSI, and I think we do need to think about and discuss this a
bit more before we can be sure of the best fix. This is probably not the
thread on which to have that discussion.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company