Thread: Re: pgsql: Fix a couple of bugs in MultiXactId freezing
On Sat, Nov 30, 2013 at 01:06:09AM +0000, Alvaro Herrera wrote:
> Fix a couple of bugs in MultiXactId freezing
>
> Both heap_freeze_tuple() and heap_tuple_needs_freeze() neglected to look
> into a multixact to check the members against cutoff_xid.

> !	/*
> !	 * This is a multixact which is not marked LOCK_ONLY, but which
> !	 * is newer than the cutoff_multi.  If the update_xid is below the
> !	 * cutoff_xid point, then we can just freeze the Xmax in the
> !	 * tuple, removing it altogether.  This seems simple, but there
> !	 * are several underlying assumptions:
> !	 *
> !	 * 1. A tuple marked by an multixact containing a very old
> !	 * committed update Xid would have been pruned away by vacuum; we
> !	 * wouldn't be freezing this tuple at all.
> !	 *
> !	 * 2. There cannot possibly be any live locking members remaining
> !	 * in the multixact.  This is because if they were alive, the
> !	 * update's Xid would had been considered, via the lockers'
> !	 * snapshot's Xmin, as part the cutoff_xid.

READ COMMITTED transactions can reset MyPgXact->xmin between commands,
defeating that assumption; see SnapshotResetXmin().  I have attached an
isolationtester spec demonstrating the problem.  The test spec additionally
covers a (probably-related) assertion failure, new in 9.3.2.

> !	 *
> !	 * 3. We don't create new MultiXacts via MultiXactIdExpand() that
> !	 * include a very old aborted update Xid: in that function we only
> !	 * include update Xids corresponding to transactions that are
> !	 * committed or in-progress.
> !	 */
> !	update_xid = HeapTupleGetUpdateXid(tuple);
> !	if (TransactionIdPrecedes(update_xid, cutoff_xid))
> !		freeze_xmax = true;

That was the only concrete runtime problem I found during a study of the
newest heap_freeze_tuple() and heap_tuple_needs_freeze() code.  One thing
that leaves me unsure is the fact that vacuum_set_xid_limits() does no
locking to ensure a consistent result between GetOldestXmin() and
GetOldestMultiXactId().  Transactions may start or end between those calls,
making the GetOldestMultiXactId() result represent a later set of
transactions than the GetOldestXmin() result.  I suspect that's fine.  New
transactions have no immediate effect on either cutoff, and transaction end
can only increase a cutoff.  Using a slightly-lower cutoff than the maximum
safe cutoff is always okay; consider vacuum_defer_cleanup_age.

Thanks,
nm

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
Attachment
On 2013-12-03 00:47:07 -0500, Noah Misch wrote: > On Sat, Nov 30, 2013 at 01:06:09AM +0000, Alvaro Herrera wrote: > > Fix a couple of bugs in MultiXactId freezing > > > > Both heap_freeze_tuple() and heap_tuple_needs_freeze() neglected to look > > into a multixact to check the members against cutoff_xid. > > > ! /* > > ! * This is a multixact which is not marked LOCK_ONLY, but which > > ! * is newer than the cutoff_multi. If the update_xid is below the > > ! * cutoff_xid point, then we can just freeze the Xmax in the > > ! * tuple, removing it altogether. This seems simple, but there > > ! * are several underlying assumptions: > > ! * > > ! * 1. A tuple marked by an multixact containing a very old > > ! * committed update Xid would have been pruned away by vacuum; we > > ! * wouldn't be freezing this tuple at all. > > ! * > > ! * 2. There cannot possibly be any live locking members remaining > > ! * in the multixact. This is because if they were alive, the > > ! * update's Xid would had been considered, via the lockers' > > ! * snapshot's Xmin, as part the cutoff_xid. > > READ COMMITTED transactions can reset MyPgXact->xmin between commands, > defeating that assumption; see SnapshotResetXmin(). I have attached an > isolationtester spec demonstrating the problem. Any idea how to cheat our way out of that one given the current way heap_freeze_tuple() works (running on both primary and standby)? My only idea was to MultiXactIdWait() if !InRecovery but that's extremly grotty. We can't even realistically create a new multixact with fewer members with the current format of xl_heap_freeze. > The test spec additionally > covers a (probably-related) assertion failure, new in 9.3.2. Too bad it's too late to do anthing about it for 9.3.2. :(. At least the last seems actually unrelated, I am not sure why it's 9.3.2 only. Alvaro, are you looking? > That was the only concrete runtime problem I found during a study of the > newest heap_freeze_tuple() and heap_tuple_needs_freeze() code. I'd even be interested in fuzzy problems ;). If 9.3. wouldn't have been released the interactions between cutoff_xid/multi would have caused me to say "back to the drawing" board... I'm not suprised if further things are lurking there. > One thing that > leaves me unsure is the fact that vacuum_set_xid_limits() does no locking to > ensure a consistent result between GetOldestXmin() and GetOldestMultiXactId(). > Transactions may start or end between those calls, making the > GetOldestMultiXactId() result represent a later set of transactions than the > GetOldestXmin() result. I suspect that's fine. New transactions have no > immediate effect on either cutoff, and transaction end can only increase a > cutoff. Using a slightly-lower cutoff than the maximum safe cutoff is always > okay; consider vacuum_defer_cleanup_age. Yes, that seems fine to me, with the same reasoning. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 03, 2013 at 11:56:07AM +0100, Andres Freund wrote:
> On 2013-12-03 00:47:07 -0500, Noah Misch wrote:
> > On Sat, Nov 30, 2013 at 01:06:09AM +0000, Alvaro Herrera wrote:
> > > Fix a couple of bugs in MultiXactId freezing
> > >
> > > Both heap_freeze_tuple() and heap_tuple_needs_freeze() neglected to look
> > > into a multixact to check the members against cutoff_xid.
> >
> > > !	/*
> > > !	 * This is a multixact which is not marked LOCK_ONLY, but which
> > > !	 * is newer than the cutoff_multi.  If the update_xid is below the
> > > !	 * cutoff_xid point, then we can just freeze the Xmax in the
> > > !	 * tuple, removing it altogether.  This seems simple, but there
> > > !	 * are several underlying assumptions:
> > > !	 *
> > > !	 * 1. A tuple marked by an multixact containing a very old
> > > !	 * committed update Xid would have been pruned away by vacuum; we
> > > !	 * wouldn't be freezing this tuple at all.
> > > !	 *
> > > !	 * 2. There cannot possibly be any live locking members remaining
> > > !	 * in the multixact.  This is because if they were alive, the
> > > !	 * update's Xid would had been considered, via the lockers'
> > > !	 * snapshot's Xmin, as part the cutoff_xid.
> >
> > READ COMMITTED transactions can reset MyPgXact->xmin between commands,
> > defeating that assumption; see SnapshotResetXmin().  I have attached an
> > isolationtester spec demonstrating the problem.
>
> Any idea how to cheat our way out of that one given the current way
> heap_freeze_tuple() works (running on both primary and standby)? My only
> idea was to MultiXactIdWait() if !InRecovery but that's extremly grotty.
> We can't even realistically create a new multixact with fewer members
> with the current format of xl_heap_freeze.

Perhaps set HEAP_XMAX_LOCK_ONLY on the tuple?  We'd then ensure all update
XID consumers check HEAP_XMAX_IS_LOCKED_ONLY() first, much like xmax
consumers are already expected to check HEAP_XMAX_INVALID first.  Seems
doable, albeit yet another injection of complexity.

> > The test spec additionally
> > covers a (probably-related) assertion failure, new in 9.3.2.
>
> Too bad it's too late to do anthing about it for 9.3.2. :(. At least the
> last seems actually unrelated, I am not sure why it's 9.3.2
> only. Alvaro, are you looking?

(For clarity, the other problem demonstrated by the test spec is also a
9.3.2 regression.)

> > That was the only concrete runtime problem I found during a study of the
> > newest heap_freeze_tuple() and heap_tuple_needs_freeze() code.
>
> I'd even be interested in fuzzy problems ;). If 9.3. wouldn't have been
> released the interactions between cutoff_xid/multi would have caused me
> to say "back to the drawing" board... I'm not suprised if further things
> are lurking there.

heap_freeze_tuple() of 9.2 had an XXX comment about the possibility of
getting spurious lock contention due to wraparound of the multixact space.
The comment is gone, and that mechanism no longer poses a threat.  However,
a non-wrapped multixact containing wrapped locker XIDs (we don't freeze
locker XIDs, just updater XIDs) may cause similar spurious contention.

> +	/*
> +	 * The multixact has an update hidden within.  Get rid of it.
> +	 *
> +	 * If the update_xid is below the cutoff_xid, it necessarily
> +	 * must be an aborted transaction.  In a primary server, such
> +	 * an Xmax would have gotten marked invalid by
> +	 * HeapTupleSatisfiesVacuum, but in a replica that is not
> +	 * called before we are, so deal with it in the same way.
> +	 *
> +	 * If not below the cutoff_xid, then the tuple would have been
> +	 * pruned by vacuum, if the update committed long enough ago,
> +	 * and we wouldn't be freezing it; so it's either recently
> +	 * committed, or in-progress.  Deal with this by setting the
> +	 * Xmax to the update Xid directly and remove the IS_MULTI
> +	 * bit.  (We know there cannot be running lockers in this
> +	 * multi, because it's below the cutoff_multi value.)
> +	 */
> +
> +	if (TransactionIdPrecedes(update_xid, cutoff_xid))
> +	{
> +		Assert(InRecovery || TransactionIdDidAbort(update_xid));
> +		freeze_xmax = true;
> +	}
> +	else
> +	{
> +		Assert(InRecovery || !TransactionIdIsInProgress(update_xid));

This assertion is at odds with the comment, but the assertion is okay for
now.  If the updater is still in progress, its OldestMemberMXactId[] entry
will have held back cutoff_multi, and we won't be here.  Therefore, if we
get here, the tuple will always be HEAPTUPLE_RECENTLY_DEAD
(recently-committed updater) or HEAPTUPLE_LIVE (aborted updater, recent or
not).

Numerous comments in the vicinity (e.g. ones at MultiXactStateData) reflect
a pre-9.3 world.  Most or all of that isn't new with the patch at hand, but
it does complicate study.

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com
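To make the check-ordering Noah proposes concrete, here is a minimal sketch
of a hypothetical xmax consumer. The function name is made up and this is
not code from the attached patch; HEAP_XMAX_INVALID, HEAP_XMAX_IS_LOCKED_ONLY()
and the other macros are the existing ones from htup_details.h:

#include "postgres.h"
#include "access/htup_details.h"

/*
 * Hypothetical consumer of a tuple's update XID (illustration only).  The
 * point is the ordering: HEAP_XMAX_INVALID first, HEAP_XMAX_IS_LOCKED_ONLY()
 * second, and only then is xmax interpreted as an updater.
 */
static TransactionId
sketch_get_updater(HeapTupleHeader tup)
{
	if (tup->t_infomask & HEAP_XMAX_INVALID)
		return InvalidTransactionId;	/* no xmax to speak of */

	if (HEAP_XMAX_IS_LOCKED_ONLY(tup->t_infomask))
		return InvalidTransactionId;	/* lockers only; any update XID is void */

	if (tup->t_infomask & HEAP_XMAX_IS_MULTI)
		return HeapTupleGetUpdateXid(tup);		/* dig the updater out of the multi */

	return HeapTupleHeaderGetRawXmax(tup);		/* plain updater */
}

Noah's later illustration patch builds on the same idea: if freezing removes
an updater while lockers may remain, the tuple can be marked
HEAP_XMAX_LOCK_ONLY so that readers taking this path ignore the stale update
XID.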
Andres Freund <andres@2ndquadrant.com> writes: > On 2013-12-03 00:47:07 -0500, Noah Misch wrote: >> The test spec additionally >> covers a (probably-related) assertion failure, new in 9.3.2. > Too bad it's too late to do anthing about it for 9.3.2. :(. At least the > last seems actually unrelated, I am not sure why it's 9.3.2 > only. Alvaro, are you looking? Is this bad enough that we need to re-wrap the release? regards, tom lane
On 2013-12-03 09:48:23 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2013-12-03 00:47:07 -0500, Noah Misch wrote: > >> The test spec additionally > >> covers a (probably-related) assertion failure, new in 9.3.2. > > > Too bad it's too late to do anthing about it for 9.3.2. :(. At least the > > last seems actually unrelated, I am not sure why it's 9.3.2 > > only. Alvaro, are you looking? > > Is this bad enough that we need to re-wrap the release? Tentatively I'd say no, the only risk is loosing locks afaics. Thats much bettter than corrupting rows as in 9.3.1. But I'll look into it in a bit more detail as soon as I am of the phone call I am on. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-03 09:16:18 -0500, Noah Misch wrote: > On Tue, Dec 03, 2013 at 11:56:07AM +0100, Andres Freund wrote: > > On 2013-12-03 00:47:07 -0500, Noah Misch wrote: > > > On Sat, Nov 30, 2013 at 01:06:09AM +0000, Alvaro Herrera wrote: > > Any idea how to cheat our way out of that one given the current way > > heap_freeze_tuple() works (running on both primary and standby)? My only > > idea was to MultiXactIdWait() if !InRecovery but that's extremly grotty. > > We can't even realistically create a new multixact with fewer members > > with the current format of xl_heap_freeze. > > Perhaps set HEAP_XMAX_LOCK_ONLY on the tuple? We'd then ensure all update XID > consumers check HEAP_XMAX_IS_LOCKED_ONLY() first, much like xmax consumers are > already expected to check HEAP_XMAX_INVALID first. Seems doable, albeit yet > another injection of complexity. I think its pretty much checked that way already, but the problem seems to be how to avoid checks on xid commit/abort in that case. I've complained in 20131121200517.GM7240@alap2.anarazel.de that the old pre-condition that multixacts aren't checked when they can't be relevant (via OldestVisibleM*) isn't observed anymore. So, if we re-introduce that condition again, we should be on the safe side with that, right? > > > The test spec additionally > > > covers a (probably-related) assertion failure, new in 9.3.2. > > > > Too bad it's too late to do anthing about it for 9.3.2. :(. At least the > > last seems actually unrelated, I am not sure why it's 9.3.2 > > only. Alvaro, are you looking? > > (For clarity, the other problem demonstrated by the test spec is also a 9.3.2 > regression.) Yea, I just don't see why yet... Looking now. > heap_freeze_tuple() of 9.2 had an XXX comment about the possibility of getting > spurious lock contention due to wraparound of the multixact space. The > comment is gone, and that mechanism no longer poses a threat. However, a > non-wrapped multixact containing wrapped locker XIDs (we don't freeze locker > XIDs, just updater XIDs) may cause similar spurious contention. Yea, I noticed that that comment was missing as well. I think what we should do now is to rework freezing in HEAD to make all this more reasonable. > Numerous comments in the vicinity (e.g. ones at MultiXactStateData) reflect a > pre-9.3 world. Most or all of that isn't new with the patch at hand, but it > does complicate study. Yea, Alvaro sent a patch for that somewhere, it seems a patch in the series got lost when foreign key locks were originally applied. I think we seriously need to put a good amount of work into the multixact.c stuff in the next months. Otherwise it will be a maintenance nightmore for a fair bit more time. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-03 09:16:18 -0500, Noah Misch wrote:
> > > The test spec additionally
> > > covers a (probably-related) assertion failure, new in 9.3.2.
> >
> > Too bad it's too late to do anthing about it for 9.3.2. :(. At least the
> > last seems actually unrelated, I am not sure why it's 9.3.2
> > only. Alvaro, are you looking?
>
> (For clarity, the other problem demonstrated by the test spec is also a 9.3.2
> regression.)

The backtrace for the Assert() you found is:

#4  0x00000000004f1da5 in CreateMultiXactId (nmembers=2, members=0x1ce17d8)
    at /home/andres/src/postgresql/src/backend/access/transam/multixact.c:708
#5  0x00000000004f1aeb in MultiXactIdExpand (multi=6241831, xid=6019366, status=MultiXactStatusUpdate)
    at /home/andres/src/postgresql/src/backend/access/transam/multixact.c:462
#6  0x00000000004a5d8e in compute_new_xmax_infomask (xmax=6241831, old_infomask=4416, old_infomask2=16386,
    add_to_xmax=6019366, mode=LockTupleExclusive, is_update=1 '\001', result_xmax=0x7fffca02a700,
    result_infomask=0x7fffca02a6fe, result_infomask2=0x7fffca02a6fc)
    at /home/andres/src/postgresql/src/backend/access/heap/heapam.c:4651
#7  0x00000000004a2d27 in heap_update (relation=0x7f9fc45cc828, otid=0x7fffca02a8d0, newtup=0x1ce1740, cid=0,
    crosscheck=0x0, wait=1 '\001', hufd=0x7fffca02a850, lockmode=0x7fffca02a82c)
    at /home/andres/src/postgresql/src/backend/access/heap/heapam.c:3304
#8  0x0000000000646f04 in ExecUpdate (tupleid=0x7fffca02a8d0, oldtuple=0x0, slot=0x1ce12c0, planSlot=0x1ce0740,
    epqstate=0x1ce0120, estate=0x1cdfe98, canSetTag=1 '\001')
    at /home/andres/src/postgresql/src/backend/executor/nodeModifyTable.c:690

So imo it isn't really a new problem, it existed all along :/. We only
don't hit it in your testcase before because we spuriously thought that a
tuple was in-progress if *any* member of the old multi were still running
in some cases instead of just the updater. But I am pretty sure it can also
be reproduced in 9.3.1.

Imo the MultiXactIdSetOldestMember() call in heap_update() needs to be
moved outside of the if (satisfies_key). Everything else is vastly more
complex.

Alvaro, correct?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
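For reference, a sketch of what moving that call would look like in
heap_update(); the variable names follow the 9.3 code, but this is only an
illustration of the suggestion, not the actual commit:

	/*
	 * Sketch: set our OldestMemberMXactId[] entry before choosing the lock
	 * strength, not only in the key-columns-unchanged branch.  Otherwise a
	 * multixact possibly created further down (via MultiXactIdExpand) trips
	 * the assertion in CreateMultiXactId because no oldest member was set.
	 */
	MultiXactIdSetOldestMember();

	if (satisfies_key)
	{
		*lockmode = LockTupleNoKeyExclusive;
		mxact_status = MultiXactStatusNoKeyUpdate;
		key_intact = true;
	}
	else
	{
		*lockmode = LockTupleExclusive;
		mxact_status = MultiXactStatusUpdate;
		key_intact = false;
	}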
On Tue, Dec 03, 2013 at 04:08:23PM +0100, Andres Freund wrote: > On 2013-12-03 09:16:18 -0500, Noah Misch wrote: > > On Tue, Dec 03, 2013 at 11:56:07AM +0100, Andres Freund wrote: > > > On 2013-12-03 00:47:07 -0500, Noah Misch wrote: > > > > On Sat, Nov 30, 2013 at 01:06:09AM +0000, Alvaro Herrera wrote: > > > Any idea how to cheat our way out of that one given the current way > > > heap_freeze_tuple() works (running on both primary and standby)? My only > > > idea was to MultiXactIdWait() if !InRecovery but that's extremly grotty. > > > We can't even realistically create a new multixact with fewer members > > > with the current format of xl_heap_freeze. > > > > Perhaps set HEAP_XMAX_LOCK_ONLY on the tuple? We'd then ensure all update XID > > consumers check HEAP_XMAX_IS_LOCKED_ONLY() first, much like xmax consumers are > > already expected to check HEAP_XMAX_INVALID first. Seems doable, albeit yet > > another injection of complexity. > > I think its pretty much checked that way already, but the problem seems > to be how to avoid checks on xid commit/abort in that case. I've > complained in 20131121200517.GM7240@alap2.anarazel.de that the old > pre-condition that multixacts aren't checked when they can't be relevant > (via OldestVisibleM*) isn't observed anymore. > So, if we re-introduce that condition again, we should be on the safe > side with that, right? What specific commit/abort checks do you have in mind? > > > > The test spec additionally > > > > covers a (probably-related) assertion failure, new in 9.3.2. > > > > > > Too bad it's too late to do anthing about it for 9.3.2. :(. At least the > > > last seems actually unrelated, I am not sure why it's 9.3.2 > > > only. Alvaro, are you looking? > > > > (For clarity, the other problem demonstrated by the test spec is also a 9.3.2 > > regression.) > > Yea, I just don't see why yet... Looking now. Sorry, my original report was rather terse. I speak of the scenario exercised by the second permutation in that isolationtester spec. The multixact is later than VACUUM's cutoff_multi, so 9.3.1 does not freeze it at all. 9.3.2 does freeze it to InvalidTransactionId per the code I cited in my first response on this thread, which wrongly removes a key lock. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On 2013-12-03 10:29:54 -0500, Noah Misch wrote: > Sorry, my original report was rather terse. I speak of the scenario exercised > by the second permutation in that isolationtester spec. The multixact is > later than VACUUM's cutoff_multi, so 9.3.1 does not freeze it at all. 9.3.2 > does freeze it to InvalidTransactionId per the code I cited in my first > response on this thread, which wrongly removes a key lock. That one is clear, I was only confused about the Assert() you reported. But I think I've explained that elsewhere. I currently don't see fixing the errorneous freezing of lockers (not the updater though) without changing the wal format or synchronously waiting for all lockers to end. Which both see like a no-go? While it's still a major bug it seems to still be much better than the previous case of either inaccessible or reappearing rows. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2013-12-03 10:29:54 -0500, Noah Misch wrote: > On Tue, Dec 03, 2013 at 04:08:23PM +0100, Andres Freund wrote: > > On 2013-12-03 09:16:18 -0500, Noah Misch wrote: > > > On Tue, Dec 03, 2013 at 11:56:07AM +0100, Andres Freund wrote: > > > > On 2013-12-03 00:47:07 -0500, Noah Misch wrote: > > > > > On Sat, Nov 30, 2013 at 01:06:09AM +0000, Alvaro Herrera wrote: > > > > Any idea how to cheat our way out of that one given the current way > > > > heap_freeze_tuple() works (running on both primary and standby)? My only > > > > idea was to MultiXactIdWait() if !InRecovery but that's extremly grotty. > > > > We can't even realistically create a new multixact with fewer members > > > > with the current format of xl_heap_freeze. > > > > > > Perhaps set HEAP_XMAX_LOCK_ONLY on the tuple? We'd then ensure all update XID > > > consumers check HEAP_XMAX_IS_LOCKED_ONLY() first, much like xmax consumers are > > > already expected to check HEAP_XMAX_INVALID first. Seems doable, albeit yet > > > another injection of complexity. > > > > I think its pretty much checked that way already, but the problem seems > > to be how to avoid checks on xid commit/abort in that case. I've > > complained in 20131121200517.GM7240@alap2.anarazel.de that the old > > pre-condition that multixacts aren't checked when they can't be relevant > > (via OldestVisibleM*) isn't observed anymore. > > So, if we re-introduce that condition again, we should be on the safe > > side with that, right? > > What specific commit/abort checks do you have in mind? MultiXactIdIsRunning() does a TransactionIdIsInProgress() for each member which in turn does TransactionIdDidCommit(). Similar when locking a tuple that's locked/updated without a multixact where we go for a TransactionIdIsInProgress() in XactLockTableWait(). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
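For orientation, roughly what those per-member checks amount to; this is a
conceptual sketch rather than the actual multixact.c code, and it assumes
the 9.3-era three-argument form of GetMultiXactIdMembers():

/*
 * Conceptual sketch of "is this multixact still running?": fetch the
 * members and probe each one with TransactionIdIsInProgress(), which can
 * end up consulting the procarray and the commit log.  This is the per-XID
 * commit/abort traffic that should be avoided for multis that cannot be
 * relevant in the first place.
 */
static bool
sketch_multixact_is_running(MultiXactId multi)
{
	MultiXactMember *members;
	int			nmembers;
	int			i;
	bool		running = false;

	nmembers = GetMultiXactIdMembers(multi, &members, false);
	for (i = 0; i < nmembers; i++)
	{
		if (TransactionIdIsInProgress(members[i].xid))
		{
			running = true;
			break;
		}
	}
	if (members != NULL)
		pfree(members);
	return running;
}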
Hi Alvaro, Noah, On 2013-12-03 15:57:10 +0100, Andres Freund wrote: > On 2013-12-03 09:48:23 -0500, Tom Lane wrote: > > Andres Freund <andres@2ndquadrant.com> writes: > > > On 2013-12-03 00:47:07 -0500, Noah Misch wrote: > > >> The test spec additionally > > >> covers a (probably-related) assertion failure, new in 9.3.2. > > > > > Too bad it's too late to do anthing about it for 9.3.2. :(. At least the > > > last seems actually unrelated, I am not sure why it's 9.3.2 > > > only. Alvaro, are you looking? > > > > Is this bad enough that we need to re-wrap the release? > > Tentatively I'd say no, the only risk is loosing locks afaics. Thats > much bettter than corrupting rows as in 9.3.1. But I'll look into it in > a bit more detail as soon as I am of the phone call I am on. After looking, I think I am revising my opinion. The broken locking behaviour (outside of freeze, which I don't see how we can fix in time), is actually bad. Would that stop us from making the release date with packages? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: >> On 2013-12-03 09:48:23 -0500, Tom Lane wrote: >>> Is this bad enough that we need to re-wrap the release? > After looking, I think I am revising my opinion. The broken locking > behaviour (outside of freeze, which I don't see how we can fix in time), > is actually bad. > Would that stop us from making the release date with packages? That's hardly answerable when you haven't specified how long you think it'd take to fix. In general, though, I'm going to be exceedingly unhappy if this release introduces new regressions. If we have to put off the release to fix something, maybe we'd better do so. And we'd damn well better get it right this time. regards, tom lane
On 2013-12-03 12:22:33 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> >> On 2013-12-03 09:48:23 -0500, Tom Lane wrote:
> >>> Is this bad enough that we need to re-wrap the release?
>
> > After looking, I think I am revising my opinion. The broken locking
> > behaviour (outside of freeze, which I don't see how we can fix in time),
> > is actually bad.
> > Would that stop us from making the release date with packages?
>
> That's hardly answerable when you haven't specified how long you
> think it'd take to fix.

There are two things that are broken as-is:

1) The freezing of multixacts: The new state is much better than the old
   one because the old one corrupted data, while the new one sometimes
   removes locks when you explicitly FREEZE.

2) The broken locking behaviour in Noah's test without the FREEZE.

I don't see a realistic chance of fixing 1) in 9.3. Not even sure if it can
be done without changing the freeze WAL format. But 2) should be fixed and
basically is a oneliner + comments + test. Alvaro?

> In general, though, I'm going to be exceedingly unhappy if this release
> introduces new regressions.  If we have to put off the release to fix
> something, maybe we'd better do so.  And we'd damn well better get it
> right this time.

I think that's really hard for the multixacts stuff. There's lots of stuff
not really okay in there, but we can't do much about that now :(

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > Any idea how to cheat our way out of that one given the current way > heap_freeze_tuple() works (running on both primary and standby)? My only > idea was to MultiXactIdWait() if !InRecovery but that's extremly grotty. > We can't even realistically create a new multixact with fewer members > with the current format of xl_heap_freeze. Maybe we should just bite the bullet and change the WAL format for heap_freeze (inventing an all-new record type, not repurposing the old one, and allowing WAL replay to continue to accept the old one). The implication for users would be that they'd have to update slave servers before the master when installing the update; which is unpleasant, but better than living with a known data corruption case. regards, tom lane
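To make the compatibility story concrete, the redo routine would keep a
branch for the old record while adding the new one, along the lines of this
sketch; the new record and function names here are placeholders, not a
settled design:

/*
 * Illustrative only: a new freeze record is added alongside the old one,
 * and redo keeps handling both so that WAL written by an unpatched primary
 * can still be replayed.  XLOG_HEAP2_FREEZE is the pre-existing record;
 * XLOG_HEAP2_FREEZE_PAGE stands in for whatever the new one ends up being.
 */
void
heap2_redo(XLogRecPtr lsn, XLogRecord *record)
{
	uint8		info = record->xl_info & ~XLR_INFO_MASK;

	switch (info & XLOG_HEAP_OPMASK)
	{
		case XLOG_HEAP2_FREEZE:
			heap_xlog_freeze(lsn, record);		/* old format, kept for replay */
			break;
		case XLOG_HEAP2_FREEZE_PAGE:
			heap_xlog_freeze_page(lsn, record); /* new format */
			break;
		/* ... other heap2 record types ... */
		default:
			elog(PANIC, "heap2_redo: unknown op code %u", info);
	}
}

An updated standby replays both; an old standby fed the new record falls
into the default case, which is the PANIC-until-upgraded behaviour discussed
below.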
On Tue, Dec 3, 2013 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
>> Any idea how to cheat our way out of that one given the current way
>> heap_freeze_tuple() works (running on both primary and standby)? My only
>> idea was to MultiXactIdWait() if !InRecovery but that's extremly grotty.
>> We can't even realistically create a new multixact with fewer members
>> with the current format of xl_heap_freeze.
>
> Maybe we should just bite the bullet and change the WAL format for
> heap_freeze (inventing an all-new record type, not repurposing the old
> one, and allowing WAL replay to continue to accept the old one).  The
> implication for users would be that they'd have to update slave servers
> before the master when installing the update; which is unpleasant, but
> better than living with a known data corruption case.

Agreed. It may suck, but it sucks less.

How badly will it break if they do the upgrade in the wrong order though.
Will the slaves just stop (I assume this?) or is there a risk of a
wrong-order upgrade causing extra breakage? And if they do shut down, would
just upgrading the slave fix it, or would they then have to rebuild the
slave? (actually, don't we recommend they always rebuild the slave
*anyway*? In which case the problem is even smaller..)

I think we've always told people to upgrade the slave first, and it's the
logical thing that AFAIK most other systems require as well, so that's not
an unreasonable requirement at all.

I assume we'd then get rid of the old record type completely in 9.4, right?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Tue, Dec 03, 2013 at 04:37:58PM +0100, Andres Freund wrote: > On 2013-12-03 10:29:54 -0500, Noah Misch wrote: > > Sorry, my original report was rather terse. I speak of the scenario exercised > > by the second permutation in that isolationtester spec. The multixact is > > later than VACUUM's cutoff_multi, so 9.3.1 does not freeze it at all. 9.3.2 > > does freeze it to InvalidTransactionId per the code I cited in my first > > response on this thread, which wrongly removes a key lock. > > That one is clear, I was only confused about the Assert() you > reported. But I think I've explained that elsewhere. > > I currently don't see fixing the errorneous freezing of lockers (not the > updater though) without changing the wal format or synchronously waiting > for all lockers to end. Which both see like a no-go? Not fixing it at all is the real no-go. We'd take both of those undesirables before just tolerating the lost locks in 9.3. The attached patch illustrates the approach I was describing earlier. It fixes the test case discussed above. I haven't verified that everything else in the system is ready for it, so this is just for illustration purposes. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Attachment
Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > Any idea how to cheat our way out of that one given the current way > > heap_freeze_tuple() works (running on both primary and standby)? My only > > idea was to MultiXactIdWait() if !InRecovery but that's extremly grotty. > > We can't even realistically create a new multixact with fewer members > > with the current format of xl_heap_freeze. > > Maybe we should just bite the bullet and change the WAL format for > heap_freeze (inventing an all-new record type, not repurposing the old > one, and allowing WAL replay to continue to accept the old one). The > implication for users would be that they'd have to update slave servers > before the master when installing the update; which is unpleasant, but > better than living with a known data corruption case. That was my suggestion too (modulo, I admit, the bit about it being a new, separate record type.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Magnus Hagander <magnus@hagander.net> writes: > On Tue, Dec 3, 2013 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Maybe we should just bite the bullet and change the WAL format for >> heap_freeze (inventing an all-new record type, not repurposing the old >> one, and allowing WAL replay to continue to accept the old one). The >> implication for users would be that they'd have to update slave servers >> before the master when installing the update; which is unpleasant, but >> better than living with a known data corruption case. > Agreed. It may suck, but it sucks less. > How badly will it break if they do the upgrade in the wrong order though. > Will the slaves just stop (I assume this?) or is there a risk of a > wrong-order upgrade causing extra breakage? I assume what would happen is the slave would PANIC upon seeing a WAL record code it didn't recognize. Installing the updated version should allow it to resume functioning. Would be good to test this, but if it doesn't work like that, that'd be another bug to fix IMO. We've always foreseen the possible need to do something like this, so it ought to work reasonably cleanly. > I assume we'd then get rid of the old record type completely in 9.4, right? Yeah, we should be able to drop it in 9.4, since we'll surely have other WAL format changes anyway. regards, tom lane
On Tue, Dec 3, 2013 at 7:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Tue, Dec 3, 2013 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Maybe we should just bite the bullet and change the WAL format for
>>> heap_freeze (inventing an all-new record type, not repurposing the old
>>> one, and allowing WAL replay to continue to accept the old one).  The
>>> implication for users would be that they'd have to update slave servers
>>> before the master when installing the update; which is unpleasant, but
>>> better than living with a known data corruption case.
>
>> Agreed. It may suck, but it sucks less.
>
>> How badly will it break if they do the upgrade in the wrong order though.
>> Will the slaves just stop (I assume this?) or is there a risk of a
>> wrong-order upgrade causing extra breakage?
>
> I assume what would happen is the slave would PANIC upon seeing a WAL
> record code it didn't recognize.  Installing the updated version should
> allow it to resume functioning.  Would be good to test this, but if it
> doesn't work like that, that'd be another bug to fix IMO.  We've always
> foreseen the possible need to do something like this, so it ought to
> work reasonably cleanly.

Yeah, as long as that's tested and actually works, that sounds like an
acceptable thing to deal with.

>> I assume we'd then get rid of the old record type completely in 9.4, right?
>
> Yeah, we should be able to drop it in 9.4, since we'll surely have other
> WAL format changes anyway.

And even if not, there's no point in keeping it unless we actually support
replication from 9.3 -> 9.4, I think, and I don't believe anybody has even
considered working on that yet :)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On 2013-12-03 13:14:38 -0500, Noah Misch wrote:
> On Tue, Dec 03, 2013 at 04:37:58PM +0100, Andres Freund wrote:
> > On 2013-12-03 10:29:54 -0500, Noah Misch wrote:
> > > Sorry, my original report was rather terse.  I speak of the scenario exercised
> > > by the second permutation in that isolationtester spec.  The multixact is
> > > later than VACUUM's cutoff_multi, so 9.3.1 does not freeze it at all.  9.3.2
> > > does freeze it to InvalidTransactionId per the code I cited in my first
> > > response on this thread, which wrongly removes a key lock.
> >
> > That one is clear, I was only confused about the Assert() you
> > reported. But I think I've explained that elsewhere.
> >
> > I currently don't see fixing the errorneous freezing of lockers (not the
> > updater though) without changing the wal format or synchronously waiting
> > for all lockers to end. Which both see like a no-go?
>
> Not fixing it at all is the real no-go.  We'd take both of those undesirables
> before just tolerating the lost locks in 9.3.

I think it's changing the wal format then.

> The attached patch illustrates the approach I was describing earlier.  It
> fixes the test case discussed above.  I haven't verified that everything else
> in the system is ready for it, so this is just for illustration purposes.

That might be better than the current state because the potential damage
from such not frozen locks would be to get "could not access status of
transaction ..." errors (*).

But the problem I see with it is that a FOR UPDATE/NO KEY UPDATE lock taken
out by UPDATE is different than the respective lock taken out by SELECT FOR
SHARE:

typedef enum
{
	MultiXactStatusForKeyShare = 0x00,
	MultiXactStatusForShare = 0x01,
	MultiXactStatusForNoKeyUpdate = 0x02,
	MultiXactStatusForUpdate = 0x03,
	/* an update that doesn't touch "key" columns */
	MultiXactStatusNoKeyUpdate = 0x04,
	/* other updates, and delete */
	MultiXactStatusUpdate = 0x05
} MultiXactStatus;

Ignoring the difference that way isn't going to fly nicely.

*: which already are possible because we do not check multis properly
against OldestVisibleMXactId[] anymore.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch wrote: > > On 2013-12-03 10:29:54 -0500, Noah Misch wrote: > > > Sorry, my original report was rather terse. I speak of the scenario exercised > > > by the second permutation in that isolationtester spec. The multixact is > > > later than VACUUM's cutoff_multi, so 9.3.1 does not freeze it at all. 9.3.2 > > > does freeze it to InvalidTransactionId per the code I cited in my first > > > response on this thread, which wrongly removes a key lock. > > > > That one is clear, I was only confused about the Assert() you > > reported. But I think I've explained that elsewhere. > > > > I currently don't see fixing the errorneous freezing of lockers (not the > > updater though) without changing the wal format or synchronously waiting > > for all lockers to end. Which both see like a no-go? > > Not fixing it at all is the real no-go. We'd take both of those undesirables > before just tolerating the lost locks in 9.3. Well, unless I misunderstand, this is only a problem in the case that cutoff_multi is not yet past but cutoff_xid is; and that there are locker transactions still running. So it's really a corner case. Not saying it's impossible to hit, mind. > The attached patch illustrates the approach I was describing earlier. It > fixes the test case discussed above. I haven't verified that everything else > in the system is ready for it, so this is just for illustration purposes. Wow, this is scary. I don't oppose it in principle, but we'd better go over the whole thing once more to ensure every place that cares is prepared to deal with this. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-03 13:11:13 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > Any idea how to cheat our way out of that one given the current way > > heap_freeze_tuple() works (running on both primary and standby)? My only > > idea was to MultiXactIdWait() if !InRecovery but that's extremly grotty. > > We can't even realistically create a new multixact with fewer members > > with the current format of xl_heap_freeze. > > Maybe we should just bite the bullet and change the WAL format for > heap_freeze (inventing an all-new record type, not repurposing the old > one, and allowing WAL replay to continue to accept the old one). The > implication for users would be that they'd have to update slave servers > before the master when installing the update; which is unpleasant, but > better than living with a known data corruption case. I wondered about that myself. How would you suggest the format to look like? ISTM per tuple we'd need: * OffsetNumber off * uint16 infomask * TransactionId xmin * TransactionId xmax I don't see why we'd need infomask2, but so far being overly skimpy in that place hasn't shown itself to be the greatest idea? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote: > I wondered about that myself. How would you suggest the format to look > like? > ISTM per tuple we'd need: > > * OffsetNumber off > * uint16 infomask > * TransactionId xmin > * TransactionId xmax > > I don't see why we'd need infomask2, but so far being overly skimpy in > that place hasn't shown itself to be the greatest idea? I don't see that we need the xmin; a simple bit flag indicating whether the Xmin was frozen should be enough. For xmax we need more detail, as you propose. In infomask, are you proposing to store the complete infomask, or just the flags being changed? Note we have a set of bits used in other wal records, XLHL_XMAX_IS_MULTI and friends, which perhaps we can use here. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
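Putting the fields just discussed together, a per-tuple payload for the new
record could look roughly like this; the names and layout are illustrative
only, since the thread has not settled the format at this point:

/*
 * Illustrative per-tuple entry for a redesigned freeze record: enough to
 * redo the change on a standby without re-deciding anything.  A flag bit
 * saying "xmin was frozen" replaces shipping the xmin itself, per the
 * comment above; the new xmax and the complete new infomask are carried
 * explicitly.
 */
typedef struct xl_heap_freeze_tuple
{
	TransactionId xmax;			/* new xmax, possibly InvalidTransactionId */
	OffsetNumber offset;		/* tuple's line pointer number on the page */
	uint16		t_infomask;		/* complete new infomask */
	uint8		frzflags;		/* flag bits: freeze xmin, xvac handling, ... */
} xl_heap_freeze_tuple;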
On Tue, Dec 03, 2013 at 07:26:38PM +0100, Andres Freund wrote: > On 2013-12-03 13:14:38 -0500, Noah Misch wrote: > > On Tue, Dec 03, 2013 at 04:37:58PM +0100, Andres Freund wrote: > > > I currently don't see fixing the errorneous freezing of lockers (not the > > > updater though) without changing the wal format or synchronously waiting > > > for all lockers to end. Which both see like a no-go? > > > > Not fixing it at all is the real no-go. We'd take both of those undesirables > > before just tolerating the lost locks in 9.3. > > I think it's changing the wal format then. I'd rather have an readily-verifiable fix that changes WAL format than a tricky fix that avoids doing so. So, modulo not having seen the change, +1. > > The attached patch illustrates the approach I was describing earlier. It > > fixes the test case discussed above. I haven't verified that everything else > > in the system is ready for it, so this is just for illustration purposes. > > That might be better than the current state because the potential damage > from such not frozen locks would be to get "could not access status of > transaction ..." errors (*). > *: which already are possible because we do not check multis properly > against OldestVisibleMXactId[] anymore. Separate issue. That patch adds to the ways we can end up with an extant multixact referencing an locker XID no longer found it CLOG, but it doesn't introduce that problem. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On 2013-12-03 13:49:49 -0500, Noah Misch wrote: > On Tue, Dec 03, 2013 at 07:26:38PM +0100, Andres Freund wrote: > > On 2013-12-03 13:14:38 -0500, Noah Misch wrote: > > > Not fixing it at all is the real no-go. We'd take both of those undesirables > > > before just tolerating the lost locks in 9.3. > > > > I think it's changing the wal format then. > > I'd rather have an readily-verifiable fix that changes WAL format than a > tricky fix that avoids doing so. So, modulo not having seen the change, +1. Well, who's going to write that then? I can write something up, but I really would not like not to be solely responsible for it. That means we cannot release 9.3 on schedule anyway, right? > > > The attached patch illustrates the approach I was describing earlier. It > > > fixes the test case discussed above. I haven't verified that everything else > > > in the system is ready for it, so this is just for illustration purposes. > > > > > That might be better than the current state because the potential damage > > from such not frozen locks would be to get "could not access status of > > transaction ..." errors (*). > > > > *: which already are possible because we do not check multis properly > > against OldestVisibleMXactId[] anymore. > > Separate issue. That patch adds to the ways we can end up with an extant > multixact referencing an locker XID no longer found it CLOG, but it doesn't > introduce that problem. Sure, that was an argument in favor of your idea, not against it ;). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-03 15:40:44 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > > I wondered about that myself. How would you suggest the format to look > > like? > > ISTM per tuple we'd need: > > > > * OffsetNumber off > > * uint16 infomask > > * TransactionId xmin > > * TransactionId xmax > > > > I don't see why we'd need infomask2, but so far being overly skimpy in > > that place hasn't shown itself to be the greatest idea? > I don't see that we need the xmin; a simple bit flag indicating whether > the Xmin was frozen should be enough. Yea, that would work as well. > For xmax we need more detail, as you propose. In infomask, are you > proposing to store the complete infomask, or just the flags being > changed? Note we have a set of bits used in other wal records, > XLHL_XMAX_IS_MULTI and friends, which perhaps we can use here. Tentatively the complete one. I don't think we'd win enough by using compute_infobits/fix_infomask_from_infobits and we'd need to extend the bits stored in there unless we are willing to live with not transporting XMIN/XMAX_COMMITTED which doesn't seem like a good idea. Btw, why is it currently ok to modify the tuple in heap_freeze_tuple() without being in a critical section? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
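On the critical-section question, the usual shape of a WAL-logged buffer
change is the following; this is a generic sketch of the existing
convention, not the freeze code itself.  Once the buffer is actually
modified, everything up to and including XLogInsert() runs inside a
critical section so any error escalates to PANIC rather than leaving a
dirty-but-unlogged page behind:

	/* generic sketch; the rdata chain would have been assembled beforehand */
	START_CRIT_SECTION();

	/* ... modify the tuple/page in the shared buffer ... */
	MarkBufferDirty(buffer);

	if (RelationNeedsWAL(relation))
	{
		XLogRecPtr	recptr;

		recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE, rdata);
		PageSetLSN(BufferGetPage(buffer), recptr);
	}

	END_CRIT_SECTION();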
Andres Freund wrote: > On 2013-12-03 13:49:49 -0500, Noah Misch wrote: > > On Tue, Dec 03, 2013 at 07:26:38PM +0100, Andres Freund wrote: > > > On 2013-12-03 13:14:38 -0500, Noah Misch wrote: > > > > Not fixing it at all is the real no-go. We'd take both of those undesirables > > > > before just tolerating the lost locks in 9.3. > > > > > > I think it's changing the wal format then. > > > > I'd rather have an readily-verifiable fix that changes WAL format than a > > tricky fix that avoids doing so. So, modulo not having seen the change, +1. > > Well, who's going to write that then? I can write something up, but I > really would not like not to be solely responsible for it. I will give this a shot. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch <noah@leadboat.com> writes: > On Tue, Dec 03, 2013 at 07:26:38PM +0100, Andres Freund wrote: >> On 2013-12-03 13:14:38 -0500, Noah Misch wrote: >>> On Tue, Dec 03, 2013 at 04:37:58PM +0100, Andres Freund wrote: >>>> I currently don't see fixing the errorneous freezing of lockers (not the >>>> updater though) without changing the wal format or synchronously waiting >>>> for all lockers to end. Which both see like a no-go? > >>> Not fixing it at all is the real no-go. We'd take both of those undesirables >>> before just tolerating the lost locks in 9.3. >> I think it's changing the wal format then. > I'd rather have an readily-verifiable fix that changes WAL format than a > tricky fix that avoids doing so. So, modulo not having seen the change, +1. Yeah, same here. After some discussion, the core committee has concluded that we should go ahead with the already-wrapped releases. 9.2.6 and below are good anyway, and despite this issue 9.3.2 is an improvement over 9.3.1. We'll plan to do a 9.3.3 as soon as the multixact situation can be straightened out; but let's learn from experience and not try to fix it in a panic. regards, tom lane
Tom Lane wrote: > After some discussion, the core committee has concluded that we should go > ahead with the already-wrapped releases. 9.2.6 and below are good anyway, > and despite this issue 9.3.2 is an improvement over 9.3.1. We'll plan to > do a 9.3.3 as soon as the multixact situation can be straightened out; > but let's learn from experience and not try to fix it in a panic. I would suggest we include this one fix in 9.3.2a. This bug is more serious than the others because it happens because of checking HTSU on a tuple containing running locker-only transactions and an aborted update. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> After some discussion, the core committee has concluded that we should go >> ahead with the already-wrapped releases. 9.2.6 and below are good anyway, >> and despite this issue 9.3.2 is an improvement over 9.3.1. We'll plan to >> do a 9.3.3 as soon as the multixact situation can be straightened out; >> but let's learn from experience and not try to fix it in a panic. > I would suggest we include this one fix in 9.3.2a. This bug is more > serious than the others because it happens because of checking HTSU on a > tuple containing running locker-only transactions and an aborted update. The effect is just that the lockers could lose their locks early, right? While that's annoying, it's not *directly* a data corruption problem. And I've lost any enthusiasm I might've had for quick fixes in this area. I think it'd be better to wait a few days, think this over, and get the other problem fixed as well. In any case, I think we're already past the point where we could re-wrap 9.3.2; the tarballs have been in the hands of packagers for > 24 hours. We'd have to call it 9.3.3. regards, tom lane
Noah Misch wrote: > On Tue, Dec 03, 2013 at 04:08:23PM +0100, Andres Freund wrote: > > > (For clarity, the other problem demonstrated by the test spec is also a 9.3.2 > > > regression.) > > > > Yea, I just don't see why yet... Looking now. > > Sorry, my original report was rather terse. I speak of the scenario exercised > by the second permutation in that isolationtester spec. The multixact is > later than VACUUM's cutoff_multi, so 9.3.1 does not freeze it at all. 9.3.2 > does freeze it to InvalidTransactionId per the code I cited in my first > response on this thread, which wrongly removes a key lock. Attached is a patch to fix it. It's a simple fix, really, but it reverts the delete-abort-savept test changes we did in 1ce150b7bb. (This is a more complete version of a patch I posted elsewhere in this thread as a reply to Tom.) I added a new isolation spec to test this specific case, and noticed something that seems curious to me when that test is run in REPEATABLE READ mode: when the UPDATE is aborted, the concurrent FOR UPDATE gets a "can't serialize due to concurrent update", but when the UPDATE is committed, FOR UPDATE works fine. Shouldn't it happen pretty much exactly the other way around? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-03 15:46:09 -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > I'd rather have an readily-verifiable fix that changes WAL format than a
> > tricky fix that avoids doing so.  So, modulo not having seen the change, +1.
>
> Yeah, same here.

I am afraid it won't be *that* simple. We still need code to look into
multis, check whether all members are ok wrt. cutoff_xid and replace them,
either by the contained xid, or by a new multi with the still living
members. Ugly.

There's currently also the issue that heap_freeze_tuple() modifies the
tuple inplace without a critical section. We're executing a
HeapTupleSatisfiesVacuum() before we get to WAL logging things, that has
plenty of rope to hang itself on. So that doesn't really seem ok to me?

Attached is a pre-pre alpha patch for this. To fix the issue with the
missing critical section it splits freezing into
heap_prepare_freeze_tuple() and heap_execute_freeze_tuple(). The former
doesn't touch tuples and is executed on the primary, and the second
actually performs the modifications and is executed both during normal
processing and recovery.

Needs a fair bit of work:

* Should move parts of the multixact processing into multixact.c,
  specifically it shouldn't require CreateMultiXactId() to be exported.
* It relies on forward-declaring a struct in heapam.h that's actually
  defined in heapam_xlog.h - that's pretty ugly.
* Any form of testing but make check/isolationcheck across SR.
* Lots of the comments around need to be added/reworked.
* Has a simpler version of Alvaro's patch to HTSV in there.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
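The shape of the split, as described; the signatures are simplified and not
necessarily identical to the attached patch:

/*
 * Sketch of the two-phase freezing API.  The prepare step only examines the
 * tuple and records what should happen in a small per-tuple "plan"; it runs
 * on the primary and is where the cutoff_xid/cutoff_multi decisions are
 * made.  The execute step applies a previously prepared plan to the tuple
 * and is shared between normal operation and WAL replay.
 */
extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
						  TransactionId cutoff_xid,
						  MultiXactId cutoff_multi,
						  struct xl_heap_freeze_tuple *frz);	/* true if anything to do */

extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
						  struct xl_heap_freeze_tuple *frz);	/* also called from redo */

Because the execute step makes no visibility decisions of its own, the WAL
record only needs to carry the prepared plans, which is what makes the
approach usable during recovery.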
Alvaro Herrera wrote: > Attached is a patch to fix it. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2013-12-03 19:55:40 -0300, Alvaro Herrera wrote:
> I added a new isolation spec to test this specific case, and noticed
> something that seems curious to me when that test is run in REPEATABLE
> READ mode: when the UPDATE is aborted, the concurrent FOR UPDATE gets a
> "can't serialize due to concurrent update", but when the UPDATE is
> committed, FOR UPDATE works fine.  Shouldn't it happen pretty much
> exactly the other way around?

That's 247c76a989097f1b4ab6fae898f24e75aa27fc1b . Specifically the
DidCommit() branch in test_lockmode_for_conflict(). You forgot something
akin to

	/* locker has finished, all good to go */
	if (!ISUPDATE_from_mxstatus(status))
		return HeapTupleMayBeUpdated;

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
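In context, that early return sits in the TransactionIdDidCommit() branch of
test_lockmode_for_conflict(), roughly as sketched below; this is a
paraphrase rather than the committed diff, and wantedstatus is the
MultiXactStatus derived from the lock mode being requested:

	else if (TransactionIdDidCommit(xid))
	{
		/*
		 * A committed locker is gone entirely, so only a committed *update*
		 * can still conflict with the lock we want; without the early
		 * return below, a plain locker wrongly produces a serialization
		 * failure in REPEATABLE READ.
		 */
		if (!ISUPDATE_from_mxstatus(status))
			return HeapTupleMayBeUpdated;

		if (DoLockModesConflict(LOCKMODE_from_mxstatus(status),
								LOCKMODE_from_mxstatus(wantedstatus)))
			return HeapTupleUpdated;	/* caller reports the conflict */

		return HeapTupleMayBeUpdated;
	}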
On Tue, Dec 3, 2013 at 7:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Tue, Dec 3, 2013 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Maybe we should just bite the bullet and change the WAL format for
>>> heap_freeze (inventing an all-new record type, not repurposing the old
>>> one, and allowing WAL replay to continue to accept the old one).  The
>>> implication for users would be that they'd have to update slave servers
>>> before the master when installing the update; which is unpleasant, but
>>> better than living with a known data corruption case.
>
>> Agreed. It may suck, but it sucks less.
>
>> How badly will it break if they do the upgrade in the wrong order though.
>> Will the slaves just stop (I assume this?) or is there a risk of a
>> wrong-order upgrade causing extra breakage?
>
> I assume what would happen is the slave would PANIC upon seeing a WAL
> record code it didn't recognize.  Installing the updated version should
> allow it to resume functioning.  Would be good to test this, but if it
> doesn't work like that, that'd be another bug to fix IMO.  We've always
> foreseen the possible need to do something like this, so it ought to
> work reasonably cleanly.

I wonder if we should for the future have the START_REPLICATION command (or
the IDENTIFY_SYSTEM would probably make more sense - or even adding a new
command like IDENTIFY_CLIENT. The point is, something in the replication
protocol) have walreceiver include it's version sent to the master. That
way we could have the walsender identify a walreceiver that's too old and
disconnect it right away - with a much nicer error message than a PANIC.

Right now, walreceiver knows the version of the walsender (through
pqserverversion), but AFAICT there is no way for the walsender to know
which version of the receiver is connected.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes: > On Tue, Dec 3, 2013 at 7:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I assume what would happen is the slave would PANIC upon seeing a WAL >> record code it didn't recognize. > I wonder if we should for the future have the START_REPLICATION command (or > the IDENTIFY_SYSTEM would probably make more sense - or even adding a new > command like IDENTIFY_CLIENT. The point is, something in the replication > protocol) have walreceiver include it's version sent to the master. That > way we could have the walsender identify a walreceiver that's too old and > disconnect it right away - with a much nicer error message than a PANIC. Meh. That only helps for the case of streaming replication, and not for the thirty-seven other ways that some WAL might arrive at something that wants to replay it. It might be worth doing anyway, but I can't get excited about it for this scenario. regards, tom lane
On Wed, Dec 4, 2013 at 8:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Tue, Dec 3, 2013 at 7:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I assume what would happen is the slave would PANIC upon seeing a WAL
>>> record code it didn't recognize.
>
>> I wonder if we should for the future have the START_REPLICATION command (or
>> the IDENTIFY_SYSTEM would probably make more sense - or even adding a new
>> command like IDENTIFY_CLIENT. The point is, something in the replication
>> protocol) have walreceiver include it's version sent to the master. That
>> way we could have the walsender identify a walreceiver that's too old and
>> disconnect it right away - with a much nicer error message than a PANIC.
>
> Meh.  That only helps for the case of streaming replication, and not for
> the thirty-seven other ways that some WAL might arrive at something that
> wants to replay it.
>
> It might be worth doing anyway, but I can't get excited about it for this
> scenario.

It does, but I bet it's one of the by far most common cases. I'd say it's
that one and restore-from-backup that would cover a huge majority of all
cases. If we can cover those, we don't have to be perfect - so unless it
turns out to be ridiculously complicated, I think it would be worthwhile
having.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Andres Freund wrote: > On 2013-12-03 19:55:40 -0300, Alvaro Herrera wrote: > > I added a new isolation spec to test this specific case, and noticed > > something that seems curious to me when that test is run in REPEATABLE > > READ mode: when the UPDATE is aborted, the concurrent FOR UPDATE gets a > > "can't serialize due to concurrent update", but when the UPDATE is > > committed, FOR UPDATE works fine. Shouldn't it happen pretty much > > exactly the other way around? > > That's 247c76a989097f1b4ab6fae898f24e75aa27fc1b . Specifically the > DidCommit() branch in test_lockmode_for_conflict(). You forgot something > akin to > /* locker has finished, all good to go */ > if (!ISUPDATE_from_mxstatus(status)) > return HeapTupleMayBeUpdated; So I did. Here are two patches, one to fix this issue, and the other to fix the issue above. I intend to apply these two to 9.3 and master, and then apply your freeze fix on top (which I'm cleaning up a bit -- will resend later.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi, On 2013-12-05 10:42:35 -0300, Alvaro Herrera wrote: > I intend to apply these two to 9.3 and master, and > then apply your freeze fix on top (which I'm cleaning up a bit -- will > resend later.) I sure hope it gets cleaned up - it's an evening's hack, with a good glass of wine on top. Do you agree with the general direction? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Here's a revamped version of this patch. One thing I didn't do here is revert the exporting of CreateMultiXactId, but I don't see any way to avoid that. Andres mentioned the idea of sharing some code between heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't explored that. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2013-12-09 19:14:58 -0300, Alvaro Herrera wrote: > Here's a revamped version of this patch. One thing I didn't do here is > revert the exporting of CreateMultiXactId, but I don't see any way to > avoid that. I don't so much have a problem with exporting CreateMultiXactId(), just with exporting it under its current name. It's already quirky to have both MultiXactIdCreate and CreateMultiXactId() in multixact.c but exporting it imo goes to far. > Andres mentioned the idea of sharing some code between > heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't > explored that. My idea would just be to have heap_tuple_needs_freeze() call heap_prepare_freeze_tuple() and check whether it returns true. Yes, that's slightly more expensive than the current heap_tuple_needs_freeze(), but it's only called when we couldn't get a cleanup lock on a page, so that seems ok. > ! static TransactionId > ! FreezeMultiXactId(MultiXactId multi, uint16 t_infomask, > ! TransactionId cutoff_xid, MultiXactId cutoff_multi, > ! uint16 *flags) > ! { > ! if (!MultiXactIdIsValid(multi)) > ! { > ! /* Ensure infomask bits are appropriately set/reset */ > ! *flags |= FRM_INVALIDATE_XMAX; > ! return InvalidTransactionId; > ! } Maybe comment that we're sure to be only called when IS_MULTI is set, which will be unset by FRM_INVALIDATE_XMAX? I wondered twice whether we wouldn't just continually mark the buffer dirty this way. > ! else if (MultiXactIdPrecedes(multi, cutoff_multi)) > ! { > ! /* > ! * This old multi cannot possibly have members still running. If it > ! * was a locker only, it can be removed without any further > ! * consideration; but if it contained an update, we might need to > ! * preserve it. > ! */ > ! if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask)) > ! { > ! *flags |= FRM_INVALIDATE_XMAX; > ! return InvalidTransactionId; Cna you place an Assert(!MultiXactIdIsRunning(multi)) here? > ! if (ISUPDATE_from_mxstatus(members[i].status) && > ! !TransactionIdDidAbort(members[i].xid))# It makes me wary to see a DidAbort() without a previous InProgress() call. Also, after we crashed, doesn't DidAbort() possibly return false for transactions that were in progress before we crashed? At least that's how I always understood it, and how tqual.c is written. > ! /* We only keep lockers if they are still running */ > ! if (TransactionIdIsCurrentTransactionId(members[i].xid) || I don't think there's a need to check for TransactionIdIsCurrentTransactionId() - vacuum can explicitly *not* be run inside a transaction. > *************** > *** 5443,5458 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid, > * xvac transaction succeeded. > */ > if (tuple->t_infomask & HEAP_MOVED_OFF) > ! HeapTupleHeaderSetXvac(tuple, InvalidTransactionId); > else > ! HeapTupleHeaderSetXvac(tuple, FrozenTransactionId); > > /* > * Might as well fix the hint bits too; usually XMIN_COMMITTED > * will already be set here, but there's a small chance not. > */ > Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID)); > ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; > changed = true; > } > } > --- 5571,5586 ---- > * xvac transaction succeeded. > */ > if (tuple->t_infomask & HEAP_MOVED_OFF) > ! frz->frzflags |= XLH_FREEZE_XVAC; > else > ! frz->frzflags |= XLH_INVALID_XVAC; > Hm. Isn't this case inverted? I.e. shouldn't you set XLH_FREEZE_XVAC and XLH_INVALID_XVAC exactly the other way round? I really don't understand the moved in/off, since the code has been gone longer than I've followed the code... 
*** a/src/backend/access/rmgrdesc/mxactdesc.c > --- b/src/backend/access/rmgrdesc/mxactdesc.c > *************** > *** 41,47 **** out_member(StringInfo buf, MultiXactMember *member) > appendStringInfoString(buf, "(upd) "); > break; > default: > ! appendStringInfoString(buf, "(unk) "); > break; > } > } > --- 41,47 ---- > appendStringInfoString(buf, "(upd) "); > break; > default: > ! appendStringInfo(buf, "(unk) ", member->status); > break; > } > } That change doesn't look like it will do anything? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
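The code-sharing idea mentioned above would amount to something like the sketch below; the real functions take additional arguments (for example the buffer), so treat the signatures as assumptions rather than the actual tree:

    bool
    heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
                            MultiXactId cutoff_multi, Buffer buf)
    {
        xl_heap_freeze_tuple frz;   /* struct name taken from the patch */

        /* reuse the full freeze decision logic; discard the prepared plan */
        return heap_prepare_freeze_tuple(tuple, cutoff_xid, cutoff_multi, &frz);
    }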
Andres Freund wrote: > On 2013-12-09 19:14:58 -0300, Alvaro Herrera wrote: > > Here's a revamped version of this patch. One thing I didn't do here is > > revert the exporting of CreateMultiXactId, but I don't see any way to > > avoid that. > > I don't so much have a problem with exporting CreateMultiXactId(), just > with exporting it under its current name. It's already quirky to have > both MultiXactIdCreate and CreateMultiXactId() in multixact.c but > exporting it imo goes to far. MultiXactidCreateFromMembers(int, MultiXactMembers *) ? > > Andres mentioned the idea of sharing some code between > > heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't > > explored that. > > My idea would just be to have heap_tuple_needs_freeze() call > heap_prepare_freeze_tuple() and check whether it returns true. Yes, > that's slightly more expensive than the current > heap_tuple_needs_freeze(), but it's only called when we couldn't get a > cleanup lock on a page, so that seems ok. Doesn't seem a completely bad idea, but let's leave it for a separate patch. This should be changed in master only IMV anyway, while the rest of this patch is to be backpatched to 9.3. > > ! if (!MultiXactIdIsValid(multi)) > > ! { > > ! /* Ensure infomask bits are appropriately set/reset */ > > ! *flags |= FRM_INVALIDATE_XMAX; > > ! return InvalidTransactionId; > > ! } > > Maybe comment that we're sure to be only called when IS_MULTI is set, > which will be unset by FRM_INVALIDATE_XMAX? I wondered twice whether we > wouldn't just continually mark the buffer dirty this way. Done. > > ! else if (MultiXactIdPrecedes(multi, cutoff_multi)) > > ! { > > ! /* > > ! * This old multi cannot possibly have members still running. If it > Cna you place an Assert(!MultiXactIdIsRunning(multi)) here? Done. > > ! if (ISUPDATE_from_mxstatus(members[i].status) && > > ! !TransactionIdDidAbort(members[i].xid))# > > It makes me wary to see a DidAbort() without a previous InProgress() > call. Also, after we crashed, doesn't DidAbort() possibly return > false for transactions that were in progress before we crashed? At > least that's how I always understood it, and how tqual.c is written. Yes, that's correct. But note that here we're not doing a tuple liveliness test, which is what tqual.c is doing. What we do with this info is to keep the Xid as part of the multi if it's still running or committed. We also keep it if the xact crashed, which is fine because the Xid will be removed by some later step. If we know for certain that the update Xid is aborted, then we can ignore it, but this is just an optimization and not needed for correctness. That loop had a bug, so I restructured it. (If the update xact had aborted we wouldn't get to the "continue" and thus would treat it as a locker-only. I don't think that behavior would cause any visible misbehavior but it's wrong nonetheless.) One interesting bit is that we might end up creating singleton MultiXactIds when freezing, if there's no updater and there's a running locker. We could avoid this (i.e. mark the tuple as locked by a single Xid) but it would complicate FreezeMultiXactId's API and it's unlikely to occur with any frequency anyway. > > ! /* We only keep lockers if they are still running */ > > ! if (TransactionIdIsCurrentTransactionId(members[i].xid) || > > I don't think there's a need to check for > TransactionIdIsCurrentTransactionId() - vacuum can explicitly *not* be > run inside a transaction. 
Keep in mind that freezing can also happen for tuples handled during a table-rewrite operation such as CLUSTER. I wouldn't place a bet that you can't have a multi created by a transaction and later run cluster in the same table in the same transaction. Maybe this is fine because of the fact that at that point we're holding an exclusive lock in the table, but it seems fragile. And the test is cheap anyway. > > --- 5571,5586 ---- > > * xvac transaction succeeded. > > */ > > if (tuple->t_infomask & HEAP_MOVED_OFF) > > ! frz->frzflags |= XLH_FREEZE_XVAC; > > else > > ! frz->frzflags |= XLH_INVALID_XVAC; > > > > Hm. Isn't this case inverted? I.e. shouldn't you set XLH_FREEZE_XVAC and > XLH_INVALID_XVAC exactly the other way round? I really don't understand > the moved in/off, since the code has been gone longer than I've followed > the code... Yep, fixed. > > --- b/src/backend/access/rmgrdesc/mxactdesc.c > > *************** > > *** 41,47 **** out_member(StringInfo buf, MultiXactMember *member) > > appendStringInfoString(buf, "(upd) "); > > break; > > default: > > ! appendStringInfoString(buf, "(unk) "); > > break; > > } > > } > > --- 41,47 ---- > > appendStringInfoString(buf, "(upd) "); > > break; > > default: > > ! appendStringInfo(buf, "(unk) ", member->status); > > break; > > } > > } > > That change doesn't look like it will do anything? Meh. That was a leftover --- removed. (I was toying with the "desc" code because it misbehaves when applied on records as they are created, as opposed to being applied on records as they are replayed. I'm pretty sure everyone already knows about this, and it's the reason why everybody has skimped from examining arrays of things stored in followup data records. I was naive enough to write code that tries to decode the followup record that contains the members of the multiaxct we're creating, which works fine during replay but gets them completely wrong during regular operation. This is the third time I'm surprised by this misbehavior; blame my bad memory for not remembering that it's not supposed to work in the first place.) Right now there is one case in this code that returns FRM_INVALIDATE_XMAX when it's not strictly necessary, i.e. it would also work to keep the Multi as is and return FRM_NOOP instead; and it also returns FRM_NOOP in one case when we could return FRM_INVALIDATE_XMAX instead. Neither does any great damage, but there is a consideration that future examiners of the tuple would have to resolve the MultiXact by themselves (==> performance hit). On the other hand, returning INVALIDATE causes the block to be dirtied, which is undesirable if not already dirty. Maybe this can be optimized so that we return a separate flag from FreezeMultiXactId when Xmax invalidation is optional, so that we execute all such operations if and only if the block is already dirty or being dirtied for other reasons. That would provide the cleanup for later onlookers, while not causing an unnecessary dirtying. Attached are patches for this, for both 9.3 and master. The 9.3 patch keeps the original FREEZE record; I have tested that an unpatched replica dies with: PANIC: heap2_redo: unknown op code 32 CONTEXTO: xlog redo UNKNOWN LOG: proceso de inicio (PID 316) fue terminado por una señal 6: Aborted when the master is running the new code. The message is ugly, but I don't see any way to fix that. For the master branch, I have removed the original FREEZE record definition completely and bumped XLOG_PAGE_MAGIC. 
This doesn't pose a problem given that we have no replication between different major versions. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
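For reference while reading the FRM_* discussion above, a sketch of the result flags FreezeMultiXactId() reports back to heap_prepare_freeze_tuple(); the names come from the patch, the bit values here are assumed:

    #define FRM_NOOP             0x0001  /* keep the existing xmax as it is */
    #define FRM_INVALIDATE_XMAX  0x0002  /* xmax can be cleared entirely */
    #define FRM_RETURN_IS_XID    0x0004  /* new xmax is the returned plain Xid */
    #define FRM_RETURN_IS_MULTI  0x0008  /* new xmax is a newly created multi */
    #define FRM_MARK_COMMITTED   0x0010  /* also set HEAP_XMAX_COMMITTED */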
On 2013-12-11 14:00:05 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > On 2013-12-09 19:14:58 -0300, Alvaro Herrera wrote: > > I don't so much have a problem with exporting CreateMultiXactId(), just > > with exporting it under its current name. It's already quirky to have > > both MultiXactIdCreate and CreateMultiXactId() in multixact.c but > > exporting it imo goes to far. > > MultiXactidCreateFromMembers(int, MultiXactMembers *) ? Works for me. > > > Andres mentioned the idea of sharing some code between > > > heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't > > > explored that. > > > > My idea would just be to have heap_tuple_needs_freeze() call > > heap_prepare_freeze_tuple() and check whether it returns true. Yes, > > that's slightly more expensive than the current > > heap_tuple_needs_freeze(), but it's only called when we couldn't get a > > cleanup lock on a page, so that seems ok. > > Doesn't seem a completely bad idea, but let's leave it for a separate > patch. This should be changed in master only IMV anyway, while the rest > of this patch is to be backpatched to 9.3. I am not so sure it shouldn't be backpatched together with this. We now have similar complex logic in both functions. > > > ! if (ISUPDATE_from_mxstatus(members[i].status) && > > > ! !TransactionIdDidAbort(members[i].xid))# > > > > It makes me wary to see a DidAbort() without a previous InProgress() > > call. Also, after we crashed, doesn't DidAbort() possibly return > > false for transactions that were in progress before we crashed? At > > least that's how I always understood it, and how tqual.c is written. > > Yes, that's correct. But note that here we're not doing a tuple > liveliness test, which is what tqual.c is doing. What we do with this > info is to keep the Xid as part of the multi if it's still running or > committed. We also keep it if the xact crashed, which is fine because > the Xid will be removed by some later step. If we know for certain that > the update Xid is aborted, then we can ignore it, but this is just an > optimization and not needed for correctness. But why deviate that way? It doesn't seem to save us much? > One interesting bit is that we might end up creating singleton > MultiXactIds when freezing, if there's no updater and there's a running > locker. We could avoid this (i.e. mark the tuple as locked by a single > Xid) but it would complicate FreezeMultiXactId's API and it's unlikely > to occur with any frequency anyway. Yea, that seems completely fine. > > I don't think there's a need to check for > > TransactionIdIsCurrentTransactionId() - vacuum can explicitly *not* be > > run inside a transaction. > > Keep in mind that freezing can also happen for tuples handled during a > table-rewrite operation such as CLUSTER. Good point. > > > if (tuple->t_infomask & HEAP_MOVED_OFF) > > > ! frz->frzflags |= XLH_FREEZE_XVAC; > > > else > > > ! frz->frzflags |= XLH_INVALID_XVAC; > > > > > > > Hm. Isn't this case inverted? I.e. shouldn't you set XLH_FREEZE_XVAC and > > XLH_INVALID_XVAC exactly the other way round? I really don't understand > > the moved in/off, since the code has been gone longer than I've followed > > the code... > > Yep, fixed. I wonder how many of the HEAP_MOVED_* cases around are actually correct... What was the last version those were generated? 8.4? > (I was toying with the "desc" > code because it misbehaves when applied on records as they are created, > as opposed to being applied on records as they are replayed. 
I'm pretty > sure everyone already knows about this, and it's the reason why > everybody has skimped from examining arrays of things stored in followup > data records. I was naive enough to write code that tries to decode the > followup record that contains the members of the multiaxct we're > creating, which works fine during replay but gets them completely wrong > during regular operation. This is the third time I'm surprised by this > misbehavior; blame my bad memory for not remembering that it's not > supposed to work in the first place.) I am not really sure what you are talking about. That you cannot properly decode records before they have been processed by XLogInsert()? If so, yes, that's pretty clear and I am pretty sure it will break in lots of places if you try? > Right now there is one case in this code that returns > FRM_INVALIDATE_XMAX when it's not strictly necessary, i.e. it would also > work to keep the Multi as is and return FRM_NOOP instead; and it also > returns FRM_NOOP in one case when we could return FRM_INVALIDATE_XMAX > instead. Neither does any great damage, but there is a consideration > that future examiners of the tuple would have to resolve the MultiXact > by themselves (==> performance hit). On the other hand, returning > INVALIDATE causes the block to be dirtied, which is undesirable if not > already dirty. Otherwise it will be marked dirty the next time reads the page, so I don't think this is problematic. > ! { > ! if (ISUPDATE_from_mxstatus(members[i].status)) > ! { > ! /* > ! * It's an update; should we keep it? If the transaction is known > ! * aborted then it's okay to ignore it, otherwise not. (Note this > ! * is just an optimization and not needed for correctness, so it's > ! * okay to get this test wrong; for example, in case an updater is > ! * crashed, or a running transaction in the process of aborting.) > ! */ > ! if (!TransactionIdDidAbort(members[i].xid)) > ! { > ! newmembers[nnewmembers++] = members[i]; > ! Assert(!TransactionIdIsValid(update_xid)); > ! > ! /* > ! * Tell caller to set HEAP_XMAX_COMMITTED hint while we have > ! * the Xid in cache. Again, this is just an optimization, so > ! * it's not a problem if the transaction is still running and > ! * in the process of committing. > ! */ > ! if (TransactionIdDidCommit(update_xid)) > ! update_committed = true; > ! > ! update_xid = newmembers[i].xid; > ! } I don't think the conclusions here are correct - we might be setting HEAP_XMAX_COMMITTED a smudge to early that way. If the updating transaction is in progress, there's the situation that we have updated the clog, but have not yet removed ourselves from the procarray. I.e. a situation in which TransactionIdIsInProgress() and TransactionIdDidCommit() both return true. Afaik it is only correct to set HEAP_XMAX_COMMITTED once TransactionIdIsInProgress() returns false. > ! /* > ! * Checking for very old update Xids is critical: if the update > ! * member of the multi is older than cutoff_xid, we must remove > ! * it, because otherwise a later liveliness check could attempt > ! * pg_clog access for a page that was truncated away by the > ! * current vacuum. Note that if the update had committed, we > ! * wouldn't be freezing this tuple because it would have gotten > ! * removed (HEAPTUPLE_DEAD) by HeapTupleSatisfiesVacuum; so it > ! * either aborted or crashed. Therefore, ignore this update_xid. > ! */ > ! if (TransactionIdPrecedes(update_xid, cutoff_xid)) > ! { > ! update_xid = InvalidTransactionId; > ! 
update_committed = false; I vote for an Assert(!TransactionIdDidCommit(update_xid)) here. > ! else > ! { > ! /* > ! * Create a new multixact with the surviving members of the previous > ! * one, to set as new Xmax in the tuple. > ! * > ! * If this is the first possibly-multixact-able operation in the > ! * current transaction, set my per-backend OldestMemberMXactId > ! * setting. We can be certain that the transaction will never become a > ! * member of any older MultiXactIds than that. > ! */ > ! MultiXactIdSetOldestMember(); > ! xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers); > ! *flags |= FRM_RETURN_IS_MULTI; > ! } I worry that this MultiXactIdSetOldestMember() will be problematic in longrunning vacuums like a anti-wraparound vacuum of a huge table. There's no real need to set MultiXactIdSetOldestMember() here, since we will not become the member of a multi. So I think you should either move the Assert() in MultiXactIdCreateFromMembers() to it's other callers, or add a parameter to skip it. > ! /* > ! * heap_prepare_freeze_tuple > * > * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac) > ! * are older than the specified cutoff XID and cutoff MultiXactId. If so, > ! * setup enough state (in the *frz output argument) to later execute and > ! * WAL-log what we would need to do, and return TRUE. Return FALSE if nothing > ! * is to be changed. > ! * > ! * Caller is responsible for setting the offset field, if appropriate. This > ! * is only necessary if the freeze is to be WAL-logged. I'd leave of that second sentence, if you want to freeze a whole page but not WAL log it, you'd need to set offset as well... > if (tuple->t_infomask & HEAP_XMAX_IS_MULTI) > { > ! else if (flags & FRM_RETURN_IS_MULTI) > { > ! frz->t_infomask &= ~HEAP_XMAX_BITS; > ! frz->xmax = newxmax; > ! > ! GetMultiXactIdHintBits(newxmax, > ! &frz->t_infomask, > ! &frz->t_infomask2); > ! changed = true; > } I worry that all these multixact accesses will create huge performance problems due to the inefficiency of the multixactid cache. If you scan a huge table there very well might be millions of different multis we touch and afaics most of them will end up in the multixactid cache. That can't end well. I think we need to either regularly delete that cache when it goes past, say, 100 entries, or just bypass it entirely. Greetings, Andres Freund
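The tqual.c-style ordering being asked for in the hint-bit case would look roughly like this; the function names are real, the surrounding loop is the one quoted from the patch:

    /*
     * Only hint HEAP_XMAX_COMMITTED once the updater is both gone from the
     * procarray and committed in pg_clog, mirroring the tqual.c rule.
     */
    if (!TransactionIdIsInProgress(members[i].xid) &&
        TransactionIdDidCommit(members[i].xid))
        update_committed = true;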
Andres Freund wrote: > On 2013-12-11 14:00:05 -0300, Alvaro Herrera wrote: > > Andres Freund wrote: > > > On 2013-12-09 19:14:58 -0300, Alvaro Herrera wrote: > > > > Andres mentioned the idea of sharing some code between > > > > heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't > > > > explored that. > > > > > > My idea would just be to have heap_tuple_needs_freeze() call > > > heap_prepare_freeze_tuple() and check whether it returns true. Yes, > > > that's slightly more expensive than the current > > > heap_tuple_needs_freeze(), but it's only called when we couldn't get a > > > cleanup lock on a page, so that seems ok. > > > > Doesn't seem a completely bad idea, but let's leave it for a separate > > patch. This should be changed in master only IMV anyway, while the rest > > of this patch is to be backpatched to 9.3. > > I am not so sure it shouldn't be backpatched together with this. We now > have similar complex logic in both functions. Any other opinions on this? > > > > ! if (ISUPDATE_from_mxstatus(members[i].status) && > > > > ! !TransactionIdDidAbort(members[i].xid))# > > > > > > It makes me wary to see a DidAbort() without a previous InProgress() > > > call. Also, after we crashed, doesn't DidAbort() possibly return > > > false for transactions that were in progress before we crashed? At > > > least that's how I always understood it, and how tqual.c is written. > > > > Yes, that's correct. But note that here we're not doing a tuple > > liveliness test, which is what tqual.c is doing. What we do with this > > info is to keep the Xid as part of the multi if it's still running or > > committed. We also keep it if the xact crashed, which is fine because > > the Xid will be removed by some later step. If we know for certain that > > the update Xid is aborted, then we can ignore it, but this is just an > > optimization and not needed for correctness. > > But why deviate that way? It doesn't seem to save us much? Well, it does save something -- there not being a live update means we are likely to be able to invalidate the Xmax completely if there are no lockers; and even in the case where there are lockers, we will be able to set LOCK_ONLY which means faster access in several places. > > > > if (tuple->t_infomask & HEAP_MOVED_OFF) > > > > ! frz->frzflags |= XLH_FREEZE_XVAC; > > > > else > > > > ! frz->frzflags |= XLH_INVALID_XVAC; > > > > > > > > > > Hm. Isn't this case inverted? I.e. shouldn't you set XLH_FREEZE_XVAC and > > > XLH_INVALID_XVAC exactly the other way round? I really don't understand > > > the moved in/off, since the code has been gone longer than I've followed > > > the code... > > > > Yep, fixed. > > I wonder how many of the HEAP_MOVED_* cases around are actually > correct... What was the last version those were generated? 8.4? 8.4, yeah, before VACUUM FULL got rewritten. I don't think anybody tests these code paths, because it involves databases that were upgraded straight from 8.4 and which in their 8.4 time saw VACUUM FULL executed. I think we should be considering removing these things, or at least have some mechanism to ensure they don't survive from pre-9.0 installs. > > (I was toying with the "desc" > > code because it misbehaves when applied on records as they are created, > > as opposed to being applied on records as they are replayed. I'm pretty > > sure everyone already knows about this, and it's the reason why > > everybody has skimped from examining arrays of things stored in followup > > data records. 
I was naive enough to write code that tries to decode the > > followup record that contains the members of the multiaxct we're > > creating, which works fine during replay but gets them completely wrong > > during regular operation. This is the third time I'm surprised by this > > misbehavior; blame my bad memory for not remembering that it's not > > supposed to work in the first place.) > > I am not really sure what you are talking about. That you cannot > properly decode records before they have been processed by XLogInsert()? > If so, yes, that's pretty clear and I am pretty sure it will break in > lots of places if you try? Well, not sure about "lots of places". The only misbahavior I have seen is in those desc routines. Of course, the redo routines might also fail, but then there's no way for them to be running ... > > Right now there is one case in this code that returns > > FRM_INVALIDATE_XMAX when it's not strictly necessary, i.e. it would also > > work to keep the Multi as is and return FRM_NOOP instead; and it also > > returns FRM_NOOP in one case when we could return FRM_INVALIDATE_XMAX > > instead. Neither does any great damage, but there is a consideration > > that future examiners of the tuple would have to resolve the MultiXact > > by themselves (==> performance hit). On the other hand, returning > > INVALIDATE causes the block to be dirtied, which is undesirable if not > > already dirty. > > Otherwise it will be marked dirty the next time reads the page, so I > don't think this is problematic. Not necessarily. I mean, if somebody sees a multi, they might just resolve it to its members and otherwise leave the page alone. Or in some cases not even resolve to members (if it's LOCK_ONLY and old enough to be behind the oldest visible multi). > > ! { > > ! if (ISUPDATE_from_mxstatus(members[i].status)) > > ! { > > ! /* > > ! * It's an update; should we keep it? If the transaction is known > > ! * aborted then it's okay to ignore it, otherwise not. (Note this > > ! * is just an optimization and not needed for correctness, so it's > > ! * okay to get this test wrong; for example, in case an updater is > > ! * crashed, or a running transaction in the process of aborting.) > > ! */ > > ! if (!TransactionIdDidAbort(members[i].xid)) > > ! { > > ! newmembers[nnewmembers++] = members[i]; > > ! Assert(!TransactionIdIsValid(update_xid)); > > ! > > ! /* > > ! * Tell caller to set HEAP_XMAX_COMMITTED hint while we have > > ! * the Xid in cache. Again, this is just an optimization, so > > ! * it's not a problem if the transaction is still running and > > ! * in the process of committing. > > ! */ > > ! if (TransactionIdDidCommit(update_xid)) > > ! update_committed = true; > > ! > > ! update_xid = newmembers[i].xid; > > ! } > > I don't think the conclusions here are correct - we might be setting > HEAP_XMAX_COMMITTED a smudge to early that way. If the updating > transaction is in progress, there's the situation that we have updated > the clog, but have not yet removed ourselves from the procarray. I.e. a > situation in which TransactionIdIsInProgress() and > TransactionIdDidCommit() both return true. Afaik it is only correct to > set HEAP_XMAX_COMMITTED once TransactionIdIsInProgress() returns false. Hmm ... Is there an actual difference? I mean, a transaction that marked itself as committed in pg_clog cannot return to any other state, regardless of what happens elsewhere. > > ! /* > > ! * Checking for very old update Xids is critical: if the update > > ! 
* member of the multi is older than cutoff_xid, we must remove > > ! * it, because otherwise a later liveliness check could attempt > > ! * pg_clog access for a page that was truncated away by the > > ! * current vacuum. Note that if the update had committed, we > > ! * wouldn't be freezing this tuple because it would have gotten > > ! * removed (HEAPTUPLE_DEAD) by HeapTupleSatisfiesVacuum; so it > > ! * either aborted or crashed. Therefore, ignore this update_xid. > > ! */ > > ! if (TransactionIdPrecedes(update_xid, cutoff_xid)) > > ! { > > ! update_xid = InvalidTransactionId; > > ! update_committed = false; > > I vote for an Assert(!TransactionIdDidCommit(update_xid)) here. Will add. > > ! else > > ! { > > ! /* > > ! * Create a new multixact with the surviving members of the previous > > ! * one, to set as new Xmax in the tuple. > > ! * > > ! * If this is the first possibly-multixact-able operation in the > > ! * current transaction, set my per-backend OldestMemberMXactId > > ! * setting. We can be certain that the transaction will never become a > > ! * member of any older MultiXactIds than that. > > ! */ > > ! MultiXactIdSetOldestMember(); > > ! xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers); > > ! *flags |= FRM_RETURN_IS_MULTI; > > ! } > > I worry that this MultiXactIdSetOldestMember() will be problematic in > longrunning vacuums like a anti-wraparound vacuum of a huge > table. There's no real need to set MultiXactIdSetOldestMember() here, > since we will not become the member of a multi. So I think you should > either move the Assert() in MultiXactIdCreateFromMembers() to it's other > callers, or add a parameter to skip it. I would like to have the Assert() work automatically, that is, check the PROC_IN_VACUUM flag in MyProc->vacuumflags ... but this probably won't work with CLUSTER. That said, I think we *should* call SetOldestMember in CLUSTER. So maybe both things should be conditional on PROC_IN_VACUUM. (Either way it will be ugly.) > > ! /* > > ! * heap_prepare_freeze_tuple > > * > > * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac) > > ! * are older than the specified cutoff XID and cutoff MultiXactId. If so, > > ! * setup enough state (in the *frz output argument) to later execute and > > ! * WAL-log what we would need to do, and return TRUE. Return FALSE if nothing > > ! * is to be changed. > > ! * > > ! * Caller is responsible for setting the offset field, if appropriate. This > > ! * is only necessary if the freeze is to be WAL-logged. > > I'd leave of that second sentence, if you want to freeze a whole page > but not WAL log it, you'd need to set offset as well... I can buy that. > I worry that all these multixact accesses will create huge performance > problems due to the inefficiency of the multixactid cache. If you scan a > huge table there very well might be millions of different multis we > touch and afaics most of them will end up in the multixactid cache. That > can't end well. > I think we need to either regularly delete that cache when it goes past, > say, 100 entries, or just bypass it entirely. Delete the whole cache, or just prune it of the least recently used entries? Maybe the cache should be a dlist instead of the open-coded stuff that's there now; that would enable pruning of the oldest entries. I think a blanket deletion might be a cure worse than the disease. I see your point anyhow. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-11 22:08:41 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > On 2013-12-11 14:00:05 -0300, Alvaro Herrera wrote: > > > Andres Freund wrote: > > > > On 2013-12-09 19:14:58 -0300, Alvaro Herrera wrote: > > > > > ! if (ISUPDATE_from_mxstatus(members[i].status) && > > > > > ! !TransactionIdDidAbort(members[i].xid))# > > > > > > > > It makes me wary to see a DidAbort() without a previous InProgress() > > > > call. Also, after we crashed, doesn't DidAbort() possibly return > > > > false for transactions that were in progress before we crashed? At > > > > least that's how I always understood it, and how tqual.c is written. > > > > > > Yes, that's correct. But note that here we're not doing a tuple > > > liveliness test, which is what tqual.c is doing. What we do with this > > > info is to keep the Xid as part of the multi if it's still running or > > > committed. We also keep it if the xact crashed, which is fine because > > > the Xid will be removed by some later step. If we know for certain that > > > the update Xid is aborted, then we can ignore it, but this is just an > > > optimization and not needed for correctness. > > > > But why deviate that way? It doesn't seem to save us much? > > Well, it does save something -- there not being a live update means we > are likely to be able to invalidate the Xmax completely if there are no > lockers; and even in the case where there are lockers, we will be able > to set LOCK_ONLY which means faster access in several places. What I mean is that we could just query TransactionIdIsInProgress() like usual. I most of the cases it will be very cheap because of the RecentXmin() check at its beginning. > > I am not really sure what you are talking about. That you cannot > > properly decode records before they have been processed by XLogInsert()? > > If so, yes, that's pretty clear and I am pretty sure it will break in > > lots of places if you try? > > Well, not sure about "lots of places". The only misbahavior I have seen > is in those desc routines. Of course, the redo routines might also > fail, but then there's no way for them to be running ... Hm. I would guess that e.g. display xl_xact_commit fails majorly. > > > Right now there is one case in this code that returns > > > FRM_INVALIDATE_XMAX when it's not strictly necessary, i.e. it would also > > > work to keep the Multi as is and return FRM_NOOP instead; and it also > > > returns FRM_NOOP in one case when we could return FRM_INVALIDATE_XMAX > > > instead. Neither does any great damage, but there is a consideration > > > that future examiners of the tuple would have to resolve the MultiXact > > > by themselves (==> performance hit). On the other hand, returning > > > INVALIDATE causes the block to be dirtied, which is undesirable if not > > > already dirty. > > > > Otherwise it will be marked dirty the next time reads the page, so I > > don't think this is problematic. > > Not necessarily. I mean, if somebody sees a multi, they might just > resolve it to its members and otherwise leave the page alone. Or in > some cases not even resolve to members (if it's LOCK_ONLY and old enough > to be behind the oldest visible multi). But the work has to be done anyway, even if possibly slightly later? > > > ! * Tell caller to set HEAP_XMAX_COMMITTED hint while we have > > > ! * the Xid in cache. Again, this is just an optimization, so > > > ! * it's not a problem if the transaction is still running and > > > ! * in the process of committing. > > > ! */ > > > ! 
if (TransactionIdDidCommit(update_xid)) > > > ! update_committed = true; > > > ! > > > ! update_xid = newmembers[i].xid; > > > ! } > > > > I don't think the conclusions here are correct - we might be setting > > HEAP_XMAX_COMMITTED a smudge to early that way. If the updating > > transaction is in progress, there's the situation that we have updated > > the clog, but have not yet removed ourselves from the procarray. I.e. a > > situation in which TransactionIdIsInProgress() and > > TransactionIdDidCommit() both return true. Afaik it is only correct to > > set HEAP_XMAX_COMMITTED once TransactionIdIsInProgress() returns false. > > Hmm ... Is there an actual difference? I mean, a transaction that > marked itself as committed in pg_clog cannot return to any other state, > regardless of what happens elsewhere. But it could lead to other transactions seeing the row as committed, even though it isn't really yet. tqual.c sayeth:
 * NOTE: must check TransactionIdIsInProgress (which looks in PGXACT array)
 * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
 * pg_clog).  Otherwise we have a race condition: we might decide that a
 * just-committed transaction crashed, because none of the tests succeed.
 * xact.c is careful to record commit/abort in pg_clog before it unsets
 * MyPgXact->xid in PGXACT array.  That fixes that problem, but it also
 * means there is a window where TransactionIdIsInProgress and
 * TransactionIdDidCommit will both return true.  If we check only
 * TransactionIdDidCommit, we could consider a tuple committed when a
 * later GetSnapshotData call will still think the originating transaction
 * is in progress, which leads to application-level inconsistency.  The
 * upshot is that we gotta check TransactionIdIsInProgress first in all
 * code paths, except for a few cases where we are looking at
 * subtransactions of our own main transaction and so there can't be any
 * race condition.
I don't think there's any reason to deviate from this pattern here. For old xids TransactionIdIsInProgress() should be really cheap. > > I worry that this MultiXactIdSetOldestMember() will be problematic in > > longrunning vacuums like a anti-wraparound vacuum of a huge > > table. There's no real need to set MultiXactIdSetOldestMember() here, > > since we will not become the member of a multi. So I think you should > > either move the Assert() in MultiXactIdCreateFromMembers() to it's other > > callers, or add a parameter to skip it. > > I would like to have the Assert() work automatically, that is, check the > PROC_IN_VACUUM flag in MyProc->vacuumflags ... but this probably won't > work with CLUSTER. That said, I think we *should* call SetOldestMember > in CLUSTER. So maybe both things should be conditional on > PROC_IN_VACUUM. Why should it be dependent on cluster? SetOldestMember() defines the oldest multi we can be a member of. Even in cluster, the freezing will not make us a member of a multi. If the transaction does something else requiring SetOldestMember(), that will do it? > > I worry that all these multixact accesses will create huge performance > > problems due to the inefficiency of the multixactid cache. If you scan a > > huge table there very well might be millions of different multis we > > touch and afaics most of them will end up in the multixactid cache. That > > can't end well. > > I think we need to either regularly delete that cache when it goes past, > > say, 100 entries, or just bypass it entirely. > > Delete the whole cache, or just prune it of the least recently used > entries?
Maybe the cache should be a dlist instead of the open-coded > stuff that's there now; that would enable pruning of the oldest entries. > I think a blanket deletion might be a cure worse than the disease. I > see your point anyhow. I was thinking of just deleting the whole thing. Revamping the cache mechanism to be more efficient, is an important goal, but it imo shouldn't be lumped together with this. Now you could argue that purging the cache shouldn't be either - but 9.3.2+ the worst case essentially is O(n^2) in the number of rows in a table. Don't think that can be acceptable. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote: > On 2013-12-11 22:08:41 -0300, Alvaro Herrera wrote: > > Andres Freund wrote: > > > I worry that all these multixact accesses will create huge performance > > > problems due to the inefficiency of the multixactid cache. If you scan a > > > huge table there very well might be millions of different multis we > > > touch and afaics most of them will end up in the multixactid cache. That > > > can't end well. > > > I think we need to either regularly delete that cache when it goes past, > > > say, 100 entries, or just bypass it entirely. > > > > Delete the whole cache, or just prune it of the least recently used > > entries? Maybe the cache should be a dlist instead of the open-coded > > stuff that's there now; that would enable pruning of the oldest entries. > > I think a blanket deletion might be a cure worse than the disease. I > > see your point anyhow. > > I was thinking of just deleting the whole thing. Revamping the cache > mechanism to be more efficient, is an important goal, but it imo > shouldn't be lumped together with this. Now you could argue that purging > the cache shouldn't be either - but 9.3.2+ the worst case essentially is > O(n^2) in the number of rows in a table. Don't think that can be > acceptable. So I think this is the only remaining issue to make this patch committable (I will address the other points in Andres' email.) Since there has been no other feedback on this thread, Andres and I discussed the cache issue a bit over IM and we seem to agree that a patch to revamp the cache should be a fairly localized change that could be applied on both 9.3 and master, separately from this fix. Doing cache deletion seems more invasive, and not provide better performance anyway. Since having a potentially O(n^2) cache behavior but with working freeze seems better than no O(n^2) but broken freeze, I'm going to apply this patch shortly and then work on reworking the cache. Are there other opinions? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote: > On 2013-12-11 22:08:41 -0300, Alvaro Herrera wrote: > > Andres Freund wrote: > > > I worry that this MultiXactIdSetOldestMember() will be problematic in > > > longrunning vacuums like a anti-wraparound vacuum of a huge > > > table. There's no real need to set MultiXactIdSetOldestMember() here, > > > since we will not become the member of a multi. So I think you should > > > either move the Assert() in MultiXactIdCreateFromMembers() to it's other > > > callers, or add a parameter to skip it. > > > > I would like to have the Assert() work automatically, that is, check the > > PROC_IN_VACUUM flag in MyProc->vacuumflags ... but this probably won't > > work with CLUSTER. That said, I think we *should* call SetOldestMember > > in CLUSTER. So maybe both things should be conditional on > > PROC_IN_VACUUM. > > Why should it be dependent on cluster? SetOldestMember() defines the > oldest multi we can be a member of. Even in cluster, the freezing will > not make us a member of a multi. If the transaction does something else > requiring SetOldestMember(), that will do it? One last thing (I hope). It's not real easy to disable this check, because it actually lives in GetNewMultiXactId. It would uglify the API a lot if we were to pass a flag down two layers of routines; and moving it to higher-level routines doesn't seem all that appropriate either. I'm thinking we can have a new flag in MyPgXact->vacuumFlags, so heap_prepare_freeze_tuple does this: PG_TRY(); { /* set flag to let multixact.c know what we're doing */ MyPgXact->vacuumFlags |= PROC_FREEZING_MULTI; newxmax = FreezeMultiXactId(xid, tuple->t_infomask, cutoff_xid, cutoff_multi, &flags); } PG_CATCH(); { MyPgXact->vacuumFlags &= ~PROC_FREEZING_MULTI; PG_RE_THROW(); } PG_END_TRY(); MyPgXact->vacuumFlags &= ~PROC_FREEZING_MULTI; and GetNewMultiXactId tests it to avoid the assert: /* * MultiXactIdSetOldestMember() must have been called already, but don't * check while freezing MultiXactIds. */ Assert((MyPggXact->vacuumFlags & PROC_FREEZING_MULTI) || MultiXactIdIsValid(OldestMemberMXactId[MyBackendId])); This avoids the API uglification issues, but introduces a setjmp call for every tuple to be frozen. I don't think this is an excessive cost to pay; after all, this is going to happen only for tuples for which heap_tuple_needs_freeze already returned true; and for those we're already going to do a lot of other work. Attached is the whole series of patches for 9.3. (master is the same, only with an additional patch that removes the legacy WAL record.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Alvaro Herrera wrote: > One last thing (I hope). It's not real easy to disable this check, > because it actually lives in GetNewMultiXactId. It would uglify the API > a lot if we were to pass a flag down two layers of routines; and moving > it to higher-level routines doesn't seem all that appropriate either. > I'm thinking we can have a new flag in MyPgXact->vacuumFlags, so > heap_prepare_freeze_tuple does this: > > PG_TRY(); > { > /* set flag to let multixact.c know what we're doing */ > MyPgXact->vacuumFlags |= PROC_FREEZING_MULTI; > newxmax = FreezeMultiXactId(xid, tuple->t_infomask, > cutoff_xid, cutoff_multi, &flags); > } Uhm, actually we don't need a PG_TRY block at all for this to work: we can rely on the flag being reset at transaction abort, if anything wrong happens and we lose control. So just set the flag, call FreezeMultiXactId, reset flag. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
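In code form, the simplification described above is just the following (PROC_FREEZING_MULTI being the hypothetical flag proposed in the previous message, not an existing one):

    /* Tell multixact.c what we are doing; rely on transaction abort to
     * clear vacuumFlags if we lose control before the explicit reset. */
    MyPgXact->vacuumFlags |= PROC_FREEZING_MULTI;
    newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
                                cutoff_xid, cutoff_multi, &flags);
    MyPgXact->vacuumFlags &= ~PROC_FREEZING_MULTI;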
Hi, On 2013-12-12 18:39:43 -0300, Alvaro Herrera wrote: > Alvaro Herrera wrote: > > One last thing (I hope). It's not real easy to disable this check, > > because it actually lives in GetNewMultiXactId. It would uglify the API > > a lot if we were to pass a flag down two layers of routines; and moving > > it to higher-level routines doesn't seem all that appropriate > > either. Unfortunately I find that too ugly. Adding a flag in the procarray because of an Assert() in a lowlevel routine? That's overkill. What's the problem with moving the check to MultiXactIdCreate() and MultiXactIdExpand() instead? Since those are the ones where it's required to have called SetOldest() before, I don't see why that would be inappropriate? > > I'm thinking we can have a new flag in MyPgXact->vacuumFlags, so > > heap_prepare_freeze_tuple does this: > > > > PG_TRY(); > > { > > /* set flag to let multixact.c know what we're doing */ > > MyPgXact->vacuumFlags |= PROC_FREEZING_MULTI; > > newxmax = FreezeMultiXactId(xid, tuple->t_infomask, > > cutoff_xid, cutoff_multi, &flags); > > } > > Uhm, actually we don't need a PG_TRY block at all for this to work: we > can rely on the flag being reset at transaction abort, if anything wrong > happens and we lose control. So just set the flag, call > FreezeMultiXactId, reset flag. I don't think that'd be correct for a CLUSTER in a subtransaction? A subtransaction's abort afaics doesn't call ProcArrayEndTransaction() and thus don't clear vacuumFlags... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
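The relocation being suggested would put the assertion at the entry points that actually add the current backend as a member, rather than in GetNewMultiXactId(); roughly, with the body elided and assumed:

    MultiXactId
    MultiXactIdCreate(TransactionId xid1, MultiXactStatus status1,
                      TransactionId xid2, MultiXactStatus status2)
    {
        /* callers adding our own xid must have called MultiXactIdSetOldestMember() */
        Assert(MultiXactIdIsValid(OldestMemberMXactId[MyBackendId]));

        /* ... build and return the multi as before ... */
    }

and likewise in MultiXactIdExpand().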
Andres Freund <andres@2ndquadrant.com> writes: > Unfortunately I find that too ugly. Adding a flag in the procarray > because of an Assert() in a lowlevel routine? That's overkill. If this flag doesn't need to be visible to other backends, it absolutely does not belong in the procarray. regards, tom lane
On 2013-12-12 18:24:34 -0300, Alvaro Herrera wrote: > + /* > + * It's an update; should we keep it? If the transaction is known > + * aborted then it's okay to ignore it, otherwise not. (Note this > + * is just an optimization and not needed for correctness, so it's > + * okay to get this test wrong; for example, in case an updater is > + * crashed, or a running transaction in the process of aborting.) > + */ > + if (!TransactionIdDidAbort(members[i].xid)) > + { > + newmembers[nnewmembers++] = members[i]; > + Assert(!TransactionIdIsValid(update_xid)); > + > + /* > + * Tell caller to set HEAP_XMAX_COMMITTED hint while we have > + * the Xid in cache. Again, this is just an optimization, so > + * it's not a problem if the transaction is still running and > + * in the process of committing. > + */ > + if (TransactionIdDidCommit(update_xid)) > + update_committed = true; > + > + update_xid = newmembers[i].xid; > + } I still don't think this is ok. Freezing shouldn't set hint bits earlier than tqual.c does. What's the problem with adding a !TransactionIdIsInProgress()? You also wrote: On 2013-12-11 22:08:41 -0300, Alvaro Herrera wrote: > Hmm ... Is there an actual difference? I mean, a transaction that > marked itself as committed in pg_clog cannot return to any other state, > regardless of what happens elsewhere. Hm, that's not actually true, I missed that so far: Think of async commits and what we do in tqual.c:SetHintBits(). But I think we're safe in this scenario, at least for the current callers. vacuumlazy.c will WAL log the freezing and set the LSN while holding an exclusive lock, therefor providing an LSN interlock. VACUUM FULL/CLUSTER will be safe, even with wal_level=minimal, because the relation won't be visible until it commits and it will contain a smgr pending delete forcing a synchronous commit. But that should be documented. > + if (TransactionIdPrecedes(update_xid, cutoff_xid)) > + { > + update_xid = InvalidTransactionId; > + update_committed = false; > + > + } Deserves an Assert(). > + else if (TransactionIdIsValid(update_xid) && !has_lockers) > + { > + /* > + * If there's a single member and it's an update, pass it back alone > + * without creating a new Multi. (XXX we could do this when there's a > + * single remaining locker, too, but that would complicate the API too > + * much; moreover, the case with the single updater is more > + * interesting, because those are longer-lived.) > + */ > + Assert(nnewmembers == 1); > + *flags |= FRM_RETURN_IS_XID; > + if (update_committed) > + *flags |= FRM_MARK_COMMITTED; > + xid = update_xid; > + } Afaics this will cause HEAP_KEYS_UPDATED to be reset, is that problematic? I don't really remember what it's needed for TBH... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
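Folding in the requested assertion, the old-update branch of the quoted code would read roughly as below (a sketch over the quoted fragment, not the committed text):

    if (TransactionIdPrecedes(update_xid, cutoff_xid))
    {
        /*
         * The update is older than the freeze cutoff; had it committed, the
         * tuple would already have been removed as dead, so it must have
         * aborted or crashed.  Forget it, and assert that belief.
         */
        Assert(!TransactionIdDidCommit(update_xid));
        update_xid = InvalidTransactionId;
        update_committed = false;
    }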
Andres Freund wrote: > On 2013-12-12 18:24:34 -0300, Alvaro Herrera wrote: > > + /* > > + * It's an update; should we keep it? If the transaction is known > > + * aborted then it's okay to ignore it, otherwise not. (Note this > > + * is just an optimization and not needed for correctness, so it's > > + * okay to get this test wrong; for example, in case an updater is > > + * crashed, or a running transaction in the process of aborting.) > > + */ > > + if (!TransactionIdDidAbort(members[i].xid)) > > + { > > + newmembers[nnewmembers++] = members[i]; > > + Assert(!TransactionIdIsValid(update_xid)); > > + > > + /* > > + * Tell caller to set HEAP_XMAX_COMMITTED hint while we have > > + * the Xid in cache. Again, this is just an optimization, so > > + * it's not a problem if the transaction is still running and > > + * in the process of committing. > > + */ > > + if (TransactionIdDidCommit(update_xid)) > > + update_committed = true; > > + > > + update_xid = newmembers[i].xid; > > + } > > I still don't think this is ok. Freezing shouldn't set hint bits earlier > than tqual.c does. What's the problem with adding a > !TransactionIdIsInProgress()? I was supposed to tell you, and evidently forgot, that patch 0001 was the same as previously submitted, and was modified by the subsequent patches modify per review comments. These comments should already be handled in the later patches in the series I just posted. The idea was to spare you reading the whole thing all over again, but evidently that backfired. I think the new code doesn't suffer from the problem you mention; and neither the other one that I trimmed out. > > + else if (TransactionIdIsValid(update_xid) && !has_lockers) > > + { > > + /* > > + * If there's a single member and it's an update, pass it back alone > > + * without creating a new Multi. (XXX we could do this when there's a > > + * single remaining locker, too, but that would complicate the API too > > + * much; moreover, the case with the single updater is more > > + * interesting, because those are longer-lived.) > > + */ > > + Assert(nnewmembers == 1); > > + *flags |= FRM_RETURN_IS_XID; > > + if (update_committed) > > + *flags |= FRM_MARK_COMMITTED; > > + xid = update_xid; > > + } > > Afaics this will cause HEAP_KEYS_UPDATED to be reset, is that > problematic? I don't really remember what it's needed for TBH... Hmm, will check that out. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera wrote: > So I think this is the only remaining issue to make this patch > committable (I will address the other points in Andres' email.) Since > there has been no other feedback on this thread, Andres and I discussed > the cache issue a bit over IM and we seem to agree that a patch to > revamp the cache should be a fairly localized change that could be > applied on both 9.3 and master, separately from this fix. Doing cache > deletion seems more invasive, and not provide better performance anyway. Here's cache code with LRU superpowers (ahem.) I settled on 256 as number of entries because it's in the same ballpark as MaxHeapTuplesPerPage which seems a reasonable guideline to follow. I considered the idea of avoiding palloc/pfree for cache entries entirely, instead storing them in a static array which is referenced from the dlist; unfortunately that doesn't work because each cache entry is variable size, depending on number of members. We could try to work around that and allocate a large shared array for members, but that starts to smell of over-engineering, so I punted. I was going to 'perf' this, but then found out that I need to compile my own linux-tools package for a home-compiled kernel ATM. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2013-12-13 13:39:20 -0300, Alvaro Herrera wrote: > Here's cache code with LRU superpowers (ahem.) Heh. > I settled on 256 as number of entries because it's in the same ballpark > as MaxHeapTuplesPerPage which seems a reasonable guideline to follow. Sounds ok. > I considered the idea of avoiding palloc/pfree for cache entries > entirely, instead storing them in a static array which is referenced > from the dlist; unfortunately that doesn't work because each cache entry > is variable size, depending on number of members. We could try to work > around that and allocate a large shared array for members, but that > starts to smell of over-engineering, so I punted. Good plan imo. > *** 1326,1331 **** mXactCacheGetBySet(int nmembers, MultiXactMember *members) > --- 1331,1337 ---- > if (memcmp(members, entry->members, nmembers * sizeof(MultiXactMember)) == 0) > { > debug_elog3(DEBUG2, "CacheGet: found %u", entry->multi); > + dlist_move_head(&MXactCache, iter.cur); > return entry->multi; > } > } That's only possible because we immediately abort the loop, otherwise we'd corrupt the iterator. Maybe that deserves a comment. > + > + dlist_move_head(&MXactCache, iter.cur); > + Heh. I forgot that we already had that bit; I was wondering whether you had to forgot to include it in the patch ;) > static char * > --- 1420,1435 ---- > /* mXactCacheGetBySet assumes the entries are sorted, so sort them */ > qsort(entry->members, nmembers, sizeof(MultiXactMember), mxactMemberComparator); > > ! dlist_push_head(&MXactCache, &entry->node); > ! if (MXactCacheMembers++ >= MAX_CACHE_ENTRIES) > ! { > ! dlist_node *node; > ! > ! node = dlist_tail_node(&MXactCache); > ! dlist_delete(dlist_tail_node(&MXactCache)); > ! MXactCacheMembers--; > ! pfree(dlist_container(mXactCacheEnt, node, node)); > ! } > } Duplicate dlist_tail_node(). Maybe add a debug_elog3(.. "CacheGet: pruning %u from cache")? I wondered before if we shouldn't introduce a layer above dlists, that support keeping track of the number of elements, and maybe also have support for LRU behaviour. Not as a part this patch, just generally. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
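The "layer above dlists" idea could look something like the sketch below; the cdlist_* names are invented for illustration, while the dlist_* calls are the existing ilist.h API:

    typedef struct CountedDList
    {
        dlist_head  head;
        int         count;
    } CountedDList;

    /* mark an entry as most recently used */
    static inline void
    cdlist_touch(CountedDList *list, dlist_node *node)
    {
        dlist_move_head(&list->head, node);
    }

    /* push a new entry; if over capacity, unlink and return the LRU victim
     * so the caller can free it */
    static inline dlist_node *
    cdlist_push_lru(CountedDList *list, dlist_node *node, int max_entries)
    {
        dlist_push_head(&list->head, node);
        if (++list->count <= max_entries)
            return NULL;
        list->count--;
        node = dlist_tail_node(&list->head);
        dlist_delete(node);
        return node;
    }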
Andres Freund wrote: > On 2013-12-12 18:24:34 -0300, Alvaro Herrera wrote: > > + else if (TransactionIdIsValid(update_xid) && !has_lockers) > > + { > > + /* > > + * If there's a single member and it's an update, pass it back alone > > + * without creating a new Multi. (XXX we could do this when there's a > > + * single remaining locker, too, but that would complicate the API too > > + * much; moreover, the case with the single updater is more > > + * interesting, because those are longer-lived.) > > + */ > > + Assert(nnewmembers == 1); > > + *flags |= FRM_RETURN_IS_XID; > > + if (update_committed) > > + *flags |= FRM_MARK_COMMITTED; > > + xid = update_xid; > > + } > > Afaics this will cause HEAP_KEYS_UPDATED to be reset, is that > problematic? I don't really remember what it's needed for TBH... So we do reset HEAP_KEYS_UPDATED, and in general that bit seems critical for several things. So it should be kept when a Xmax is carried over from the pre-frozen version of the tuple. But while reading through that, I realize that we should also be keeping HEAP_HOT_UPDATED in that case. And particularly we should never clear HEAP_ONLY_TUPLE. So I think heap_execute_freeze_tuple() is wrong, because it's resetting the whole infomask to zero, and then setting it to only the bits that heap_prepare_freeze_tuple decided that it needed set. That seems bogus to me. heap_execute_freeze_tuple() should only clear a certain number of bits, and then possibly set some of the same bits; but the remaining flags should remain untouched. So HEAP_KEYS_UPDATED, HEAP_UPDATED and HEAP_HOT_UPDATED should be untouched by heap_execute_freeze_tuple; heap_prepare_freeze_tuple needn't worry about querying those bits at all. Only when FreezeMultiXactId returns FRM_INVALIDATE_XMAX, and when the Xmax is not a multi and it gets removed, should those three flags be removed completely. HEAP_ONLY_TUPLE should be untouched by the freezing protocol. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-13 17:08:46 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > Afaics this will cause HEAP_KEYS_UPDATED to be reset, is that > > problematic? I don't really remember what it's needed for TBH... > > So we do reset HEAP_KEYS_UPDATED, and in general that bit seems critical > for several things. So it should be kept when a Xmax is carried over > from the pre-frozen version of the tuple. But while reading through > that, I realize that we should also be keeping HEAP_HOT_UPDATED in that > case. And particularly we should never clear HEAP_ONLY_TUPLE. That's only for the multi->plain xid case tho, right? > So I think heap_execute_freeze_tuple() is wrong, because it's resetting > the whole infomask to zero, and then setting it to only the bits that > heap_prepare_freeze_tuple decided that it needed set. That seems bogus > to me. heap_execute_freeze_tuple() should only clear a certain number > of bits, and then possibly set some of the same bits; but the remaining > flags should remain untouched. Uh, my version and the latest you've sent intiially copy the original infomask to the freeze plan and then manipulate those. That seems fine to me. Am I missing something? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
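Spelled out, the copy-then-manipulate pattern described here is roughly the following (field names follow the quoted patch fragments; the exact struct layout is assumed):

    /* heap_prepare_freeze_tuple(): start the plan as a copy of the tuple */
    frz->t_infomask  = tuple->t_infomask;
    frz->t_infomask2 = tuple->t_infomask2;
    frz->frzflags    = 0;
    /* ... selectively clear and set bits while deciding what to freeze ... */

    /* heap_execute_freeze_tuple(): install the prepared masks wholesale, so
     * bits that freezing never touched (HEAP_HOT_UPDATED, HEAP_ONLY_TUPLE,
     * HEAP_KEYS_UPDATED when the xmax survives) carry over unchanged */
    tuple->t_infomask  = frz->t_infomask;
    tuple->t_infomask2 = frz->t_infomask2;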
Noah Misch wrote: > On Tue, Dec 03, 2013 at 07:26:38PM +0100, Andres Freund wrote: > > On 2013-12-03 13:14:38 -0500, Noah Misch wrote: > > > On Tue, Dec 03, 2013 at 04:37:58PM +0100, Andres Freund wrote: > > > > I currently don't see fixing the errorneous freezing of lockers (not the > > > > updater though) without changing the wal format or synchronously waiting > > > > for all lockers to end. Which both see like a no-go? > > > > > > Not fixing it at all is the real no-go. We'd take both of those undesirables > > > before just tolerating the lost locks in 9.3. > > > > I think it's changing the wal format then. > > I'd rather have an readily-verifiable fix that changes WAL format than a > tricky fix that avoids doing so. So, modulo not having seen the change, +1. I've committed a patch which hopefully fixes the problem using this approach. Thanks, Noah, for noticing the issue, and thanks, Andres, for collaboration in getting the code in the right state. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
BTW, there are a couple of spec files floating around which perhaps we should consider getting into the source repo (in some cleaned-up form). Noah published one, Andres shared a couple more with me, and I think I have two more. They can't be made to work in normal circumstances, because they depend on concurrent server activity. But perhaps we should add them anyway and list them in a separate schedule file, so that any developer interested in messing with this stuff has them readily available for testing. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services