Thread: SetBufferCommitInfoNeedsSave and race conditions
During one of the HOT stress tests, an assertion failed at tqual.c:1178
in HeapTupleSatisfiesVacuum(). The assertion failure looked really
strange because the assertion checks for HEAP_XMAX_COMMITTED
which we set just a couple of lines above. I inspected the core dump
and found that the flag is *set* properly. That was even more strange.
I confirmed that we are holding a SHARE lock on the buffer as we
do at several other places while checking/setting the infomask bits.
We had a theory that somebody clears the flag after the asserting
process sets it and before it checks it. The other process also sets it
back before the core dump is generated, because the core shows the flag
being set properly. The chances of this happening are very slim and
can further be ruled out because I carefully looked at the code and found
that the flag can only be cleared holding an exclusive lock on the buffer.
So we suspected an interaction between multiple processes each holding
a SHARE lock and setting/checking different bits in the infomask and
we could theoretically say that such an interaction can potentially lead to
missing hint bit updates. I can think of the following:
Process P1 is setting bit 0 and process P2 is setting bit 1 of an integer
'x' whose current value is, say, 0:

    P1                            P2
    load x into register A        load x into register B
    A = A | 0x0001                B = B | 0x0002
    store A to x
                                  store B to x
At the end, P1's update is missing! If P1's further processing depends
on that bit check, it will go completely wrong.
This easily explains the assertion failure and the core dump analysis. We can
possibly remove that assertion and any other similar assertions
(unless someone can find a hole in the above analysis). But I am
more worried about other similar race conditions where hint bit updates
go missing, thus causing severe MVCC failures.
Btw, to validate the race condition I quickly wrote a simple C
program which attaches to a shared memory segment. Each instance of the
process sets/clears and checks a separate bit. It clearly demonstrates
the danger. The code is attached. Compile and run with an integer
argument telling it which bit to set/reset.
Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
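The original attachment is not preserved in this archive. What follows is
a minimal sketch of the kind of test program described above, not the
actual code: each instance ORs in, checks, and clears its own bit of a
shared integer with plain non-atomic read-modify-write cycles. The System
V shm key and all names here are illustrative.

	/*
	 * race.c -- illustrative sketch, NOT the original attachment.
	 * Each instance owns one bit of a shared int and repeatedly
	 * sets, checks, and clears it without any atomicity or locking.
	 */
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/ipc.h>
	#include <sys/shm.h>

	int
	main(int argc, char **argv)
	{
		int			bit;
		int			shmid;
		volatile int *x;
		long		i;

		if (argc < 2)
		{
			fprintf(stderr, "usage: %s bit-number\n", argv[0]);
			return 1;
		}
		bit = atoi(argv[1]);

		/* 0x4221 is an arbitrary, illustrative shm key */
		shmid = shmget((key_t) 0x4221, sizeof(int), IPC_CREAT | 0600);
		if (shmid < 0)
		{
			perror("shmget");
			return 1;
		}
		x = (volatile int *) shmat(shmid, NULL, 0);
		if (x == (volatile int *) -1)
		{
			perror("shmat");
			return 1;
		}

		for (i = 0;; i++)
		{
			*x |= (1 << bit);		/* non-atomic load / OR / store */
			if (!(*x & (1 << bit)))	/* fails when a concurrent store races */
			{
				printf("bit %d lost after %ld iterations\n", bit, i);
				return 1;
			}
			*x &= ~(1 << bit);		/* clearing is equally non-atomic */
		}
	}

Running two or more instances concurrently, each with a different bit
number (e.g. ./race 0 and ./race 1), should eventually report a lost bit.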
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes: > So we suspected an interaction between multiple processes each holding > a SHARE lock and setting/checking different bits in the infomask and > we could theoritically say that such interaction can potentially lead to > missing hint bit updates. Yeah. This is in fact something that's been foreseen, but I guess it didn't occur to anyone that those Asserts would fail. I concur with removing them. regards, tom lane
Pavan Deolasee wrote:
> During one of the HOT stress tests, an assertion failed at tqual.c:1178
> in HeapTupleSatisfiesVacuum(). The assertion failure looked really
> strange because the assertion checks for HEAP_XMAX_COMMITTED
> which we set just a couple of lines above. I inspected the core dump
> and found that the flag is *set* properly. That was even more strange.
> I confirmed that we are holding a SHARE lock on the buffer as we
> do at several other places while checking/setting the infomask bits.
>
> We had a theory that somebody clears the flag after the asserting
> process sets it and before it checks it. The other process also sets it
> back before the core dump is generated, because the core shows the flag
> being set properly. The chances of this happening are very slim and
> can further be ruled out because I carefully looked at the code and found
> that the flag can only be cleared holding an exclusive lock on the buffer.
>
> So we suspected an interaction between multiple processes each holding
> a SHARE lock and setting/checking different bits in the infomask and
> we could theoretically say that such an interaction can potentially lead to
> missing hint bit updates. I can think of the following:

FWIW, this can be reproduced by single-stepping with a debugger.

First, you need a tuple that's committed dead but has no hint bits set:

BEGIN;
truncate foo;
INSERT INTO foo values (1,'foo');
DELETE FROM foo;
commit;

In one backend, set a breakpoint at HeapTupleSatisfiesMVCC line 953,
where it sets the HEAP_XMIN_COMMITTED hint bit:

>     else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
>     {
>         tuple->t_infomask |= HEAP_XMIN_COMMITTED;
>         SetBufferCommitInfoNeedsSave(buffer);
>     }

Issue a SELECT * FROM foo, and step the single instruction that fetches
the infomask field from memory into a register.

Open another backend, and set a breakpoint at HeapTupleSatisfiesVacuum
line 1178:

>     else if (TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple)))
>     {
>         tuple->t_infomask |= HEAP_XMAX_COMMITTED;
>         SetBufferCommitInfoNeedsSave(buffer);
>     }
>     else
>     {
>         /*
>          * Not in Progress, Not Committed, so either Aborted or crashed
>          */
>         tuple->t_infomask |= HEAP_XMAX_INVALID;
>         SetBufferCommitInfoNeedsSave(buffer);
>         return HEAPTUPLE_LIVE;
>     }
>
>     /* Should only get here if we set XMAX_COMMITTED */
>     Assert(tuple->t_infomask & HEAP_XMAX_COMMITTED);
> }

And issue "VACUUM foo". It'll stop on that breakpoint. Let the first
backend continue; it will clear the XMAX_COMMITTED flag. Now let the
2nd backend continue, and you get an assertion failure.

AFAICS, we can just simply remove the assertion. But are there any
codepaths that assume that after calling HeapTupleSatisfiesSnapshot, all
appropriate hint bits are set?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> AFAICS, we can just simply remove the assertion. But are there any
> codepaths that assume that after calling HeapTupleSatisfiesSnapshot, all
> appropriate hint bits are set?

There had better not be, since we are going to postpone setting hint
bits for recently-committed transactions as part of the async-commit
patch.

A quick grep suggests that VACUUM FULL might be at risk here.

regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > AFAICS, we can just simply remove the assertion. But are there any
> > codepaths that assume that after calling HeapTupleSatisfiesSnapshot,
> > all appropriate hint bits are set?
>
> There had better not be, since we are going to postpone setting hint
> bits for recently-committed transactions as part of the async-commit
> patch.
>
> A quick grep suggests that VACUUM FULL might be at risk here.

That particular case seems easily fixed since VACUUM FULL must hold an
exclusive lock; and we can forcibly set sync commit for VACUUM FULL.

--
Alvaro Herrera                 http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> A quick grep suggests that VACUUM FULL might be at risk here.

> That particular case seems easily fixed since VACUUM FULL must hold an
> exclusive lock; and we can forcibly set sync commit for VACUUM FULL.

Uh, that wouldn't help. The problem is that if VACUUM FULL is *looking
at* a recently-committed tuple, tqual.c might decide it can't set the
hint bit yet because it's not certain the commit record for that other
transaction is flushed.

We could possibly hack things so that inside a VACUUM FULL (maybe plain
vacuum too?), we prefer flushing xlog to leaving hint bits unset. That's
likely to be messy though. Probably a cleaner and more robust answer is
to make VACUUM FULL call tqual.c again in the places where it currently
assumes it can look directly at the hint bits.

regards, tom lane
On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > AFAICS, we can just simply remove the assertion. But are there any
> > codepaths that assume that after calling HeapTupleSatisfiesSnapshot,
> > all appropriate hint bits are set?
>
> There had better not be, since we are going to postpone setting hint
> bits for recently-committed transactions as part of the async-commit
> patch.
>
> A quick grep suggests that VACUUM FULL might be at risk here.

No, we're clear: I caught that issue specifically for VACUUM FULL fairly
early on. VF assumes all hint bits are set after the first scan, so we
flush prior to the scan to ensure it's safe to set the hint bits. There
are no concurrent hint bit setters, so we are good.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Thu, 2007-06-28 at 15:29 -0400, Alvaro Herrera wrote:
> Tom Lane wrote:
> > Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > > AFAICS, we can just simply remove the assertion. But are there any
> > > codepaths that assume that after calling HeapTupleSatisfiesSnapshot,
> > > all appropriate hint bits are set?
> >
> > There had better not be, since we are going to postpone setting hint
> > bits for recently-committed transactions as part of the async-commit
> > patch.
> >
> > A quick grep suggests that VACUUM FULL might be at risk here.
>
> That particular case seems easily fixed since VACUUM FULL must hold an
> exclusive lock; and we can forcibly set sync commit for VACUUM FULL.

Exactly what it does!

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote: >> A quick grep suggests that VACUUM FULL might be at risk here. > No we're clear: I caught that issue specifically for VACUUM FULL fairly > early on. VF assumes all hint bits are set after the first scan, so we > flush prior to the scan to ensure its safe to set the hint bits. Flush what prior to the scan? The methodology I suggested earlier (involving tracking LSN only at the level of pg_clog pages) isn't going to make that work, unless you somehow force the XID counter forward to the next page boundary. It might be that that level of tracking is too coarse anyway, since it essentially says that you can't hint any transaction until the next 32K-transaction boundary is reached. regards, tom lane
On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
> >> A quick grep suggests that VACUUM FULL might be at risk here.
>
> > No, we're clear: I caught that issue specifically for VACUUM FULL fairly
> > early on. VF assumes all hint bits are set after the first scan, so we
> > flush prior to the scan to ensure it's safe to set the hint bits.
>
> Flush what prior to the scan?
>
> The methodology I suggested earlier (involving tracking LSN only at the
> level of pg_clog pages) isn't going to make that work, unless you
> somehow force the XID counter forward to the next page boundary.
> It might be that that level of tracking is too coarse anyway, since
> it essentially says that you can't hint any transaction until the
> next 32K-transaction boundary is reached.

If you completely flush WAL after the AccessExclusiveLock has been taken
by VF, then you are guaranteed to have flushed all asynchronous commits
that touch the table.

You're right that I just broke VF again by changing the implementation
back to deferred hint setting. I'll fix again. Hmmm, I can add a test
for LockHeldByMe(AccessExclusiveLock) in tqual, or just have a
VacuumHintStrategy() call that alters the behaviour of the tqual
routines. Latter sounds best?

My aim is to get this all wrapped up by end of Monday night.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote: >> The methodology I suggested earlier (involving tracking LSN only at the >> level of pg_clog pages) isn't going to make that work, unless you >> somehow force the XID counter forward to the next page boundary. > If you completely flush WAL after the AccessExclusiveLock has been taken > by VF, then you are guaranteed to have flushed all asynchronous commits > that touch the table. I don't believe this is correct (see system catalogs) and in any case it's far too fragile for my taste. I think it'd be a lot better to just stop referencing the hint bits directly in VACUUM FULL. regards, tom lane
On Fri, 2007-06-29 at 11:13 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
> >> The methodology I suggested earlier (involving tracking LSN only at the
> >> level of pg_clog pages) isn't going to make that work, unless you
> >> somehow force the XID counter forward to the next page boundary.
>
> > If you completely flush WAL after the AccessExclusiveLock has been taken
> > by VF, then you are guaranteed to have flushed all asynchronous commits
> > that touch the table.
>
> I don't believe this is correct (see system catalogs) and in any case
> it's far too fragile for my taste. I think it'd be a lot better to just
> stop referencing the hint bits directly in VACUUM FULL.

Well, according to the comments in repair_frag(), lines 1799-1815,
specifically line 1810, "we wouldn't be in repair_frag() at all" if the
tuple was "in a system catalog, inserted or deleted by a not yet
committed transaction". If we have flushed all committed and/or aborted
transactions, then we're good. Maybe the comment is wrong? Wouldn't be
the first time.

Either way, please explain your concern in more detail. I'd rather do
this the easy way, since VF seems soon to die (payback for all the
torture it's caused us down the years).

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
> >> A quick grep suggests that VACUUM FULL might be at risk here.
>
> > No, we're clear: I caught that issue specifically for VACUUM FULL fairly
> > early on. VF assumes all hint bits are set after the first scan, so we
> > flush prior to the scan to ensure it's safe to set the hint bits.
>
> Flush what prior to the scan?
>
> The methodology I suggested earlier (involving tracking LSN only at the
> level of pg_clog pages) isn't going to make that work, unless you
> somehow force the XID counter forward to the next page boundary.
> It might be that that level of tracking is too coarse anyway, since
> it essentially says that you can't hint any transaction until the
> next 32K-transaction boundary is reached.

Solutions I'm going for are these:

- Force XLogFlush() prior to the initial VF scan. Tqual will set hint
  bits if WAL has been flushed, else setting will be deferred, so no WAL
  flushes will be forced by normal hint bit setting and VF will work
  without needing any crufty special cases or rework of VF logic.

- Use Tom's LSN tracking at the clog page level, but make the LSN
  tracking store an array of LSNs rather than just one. The array size
  is fixed at NUMBER_TRACKED_LSNS_PER_PAGE, so that each LSN covers
  32,000/NUMBER_TRACKED_LSNS_PER_PAGE transactions. I'd guess that
  storing 8 per page would be optimal, so each stored LSN would track
  4,000 transactions - probably around 1 sec worth of transactions when
  the feature is used.

Comments?

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > I'd guess that storing 8 per page would be optimal, so each stored xid would > track 4,000 transactions - probably around 1 sec worth of transactions when > the feature is used. This is something we can experiment with but I suspect that while 8 might be sufficient for many use cases there would be others where more would be better. The cost to having more lsns stored in the clog would be pretty small. On TPCC which has longer transactions on moderate hardware we only see order of 1,000 txn/min. So a setting like 128 which allows a granularity of 256 transactions would be about 15s which is not so much longer than the xmin horizon of the 90th percentile response time of 2*5s. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Where are we on this?

---------------------------------------------------------------------------

Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>
> > I'd guess that storing 8 per page would be optimal, so each stored LSN
> > would track 4,000 transactions - probably around 1 sec worth of
> > transactions when the feature is used.
>
> This is something we can experiment with, but I suspect that while 8
> might be sufficient for many use cases, there would be others where more
> would be better. The cost of having more LSNs stored in the clog would
> be pretty small.
>
> On TPCC, which has longer transactions, on moderate hardware we only see
> on the order of 1,000 txn/min. So a setting like 128, which allows a
> granularity of 256 transactions, would be about 15s - which is not so
> much longer than the xmin horizon of the 90th percentile response time
> of 2*5s.

--
Bruce Momjian <bruce@momjian.us>   http://momjian.us
EnterpriseDB                       http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
"Bruce Momjian" <bruce@momjian.us> writes: > Where are we on this? Well Simon just sent the reworked patch yesterday so the answer is we haven't started tuning this parameter. (Bruce's message is referring to the discussion about what the optimal value of lsns per clog page would be.) I intend to do some benchmarking on it and testing what the best value of this parameter is one of the things I aim to determine with these benchmarks. I'm not 100% sure yet what to measure though. There are two questions here: 1) What are some good workloads to test which will require larger values for this parameter or which will be hurt by excessivelylarge values? I think any short-transaction workload should be basically good enough. I'm thinking of using just pgbench's default workload. Does anyone think there are other workloads that we need to specifically test? 2) What metric do I use to determine if the value is large enough or too large? The gross measurement would be to look at tps. I would prefer to actually count events which we want to minimize such as xlogflushes because the clog lsn is too old and how much extra work the visibility checks do. I'm not sure exactly how much of this we can really measure though. Are there any other events having an insufficiently large or excessively large value of this parameter will cause which we can count? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Wed, 2007-07-18 at 01:01 +0100, Gregory Stark wrote:
> "Bruce Momjian" <bruce@momjian.us> writes:
> > Where are we on this?
>
> Well, Simon just sent the reworked patch yesterday, so the answer is we
> haven't started tuning this parameter. (Bruce's message is referring to
> the discussion about what the optimal value of LSNs per clog page would
> be.)

From discussion, Greg wishes to see a certain parameter higher, though
I'm fairly comfortable with the value now. Thanks to Greg for making me
think this through in more detail. There is some confirmation required,
but no significant tuning effort.

Here's the thinking:

When we commit, we store an LSN that covers a group of transactions.
When we set hint bits we check the appropriate LSN. If it has been
flushed, then we set the hint bit.

If we're using sync commits, then we can always set hint bits. If we're
mixing async commits with sync commits, then the clog LSNs will very
often be flushed already, so we'll be able to set most hint bits.

If we are doing *only* simple async commits, then only the WAL writer
flushes WAL. In that case, when we try to set hint bits we will only be
prevented from setting hint bits for the last "few" transactions. So how
many is that, exactly?

If we are flushing WAL every W seconds and doing T transactions per
second, then we will flush WAL every T*W transactions, e.g. W = 0.2
(default), T ~ 10,000 => T*W = 2,000 transactions.

If we have divided each clog page up into N groups of transactions, each
with their own LSN, then we will have 32,000/N transactions per group.
The best situation is that we set N such that we never defer the setting
of a hint bit for a transaction that has been flushed. There is no point
setting N any lower than 32,000/N = T*W, i.e. N <= W * (32,000/T).

My laptop can do T = 1000 with W = 0.2 with pgbench, so N <= 6. By
definition, anyone using async commit will have a high transaction rate
(else there is no benefit), so N seems able to be reasonably small. If
the transaction rate drops off momentarily, the number of unflushed
committed async transactions doesn't rise; it falls quickly to zero as
the WAL writer flushes WAL up to the last async LSN.

The patch sets a value of 16, which seems likely to be enough to cover
most situations.

Having thought that through, I see two changes I'd like to make:

- I realise that I've arranged the LSNs to be striped rather than
  grouped. I'll change this part of the patch - a couple of lines in
  GetLSNIndex().

- The LSN of sync commits is also used to update the LSN of the clog
  page. That is unnecessary, and we need only set the clog LSN for async
  commits.

I'll wait for other review comments before cutting version 23....

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
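To make the grouped-LSN scheme above concrete, here is a minimal sketch
of the bookkeeping being described, inferred from the discussion; the
constant names, types, and functions are illustrative and do not claim
to match the actual patch, which also has locking and SLRU plumbing not
shown here.

	#include <stdbool.h>
	#include <stdint.h>

	#define CLOG_XACTS_PER_PAGE   32768	/* ~32K statuses per clog page */
	#define LSN_GROUPS_PER_PAGE   16	/* "N" in the discussion above */
	#define XACTS_PER_LSN_GROUP   (CLOG_XACTS_PER_PAGE / LSN_GROUPS_PER_PAGE)

	typedef uint64_t XLogRecPtr;	/* stand-in for the WAL pointer type */
	typedef uint32_t TransactionId;

	/* one LSN per group of consecutive xids (grouped, not striped) */
	static XLogRecPtr group_lsn[LSN_GROUPS_PER_PAGE];

	/* Which LSN slot covers this xid within its clog page? */
	static int
	GetLSNIndex(TransactionId xid)
	{
		return (int) ((xid % CLOG_XACTS_PER_PAGE) / XACTS_PER_LSN_GROUP);
	}

	/* At async commit: remember the newest commit LSN in the xid's group */
	static void
	SetGroupLSN(TransactionId xid, XLogRecPtr commit_lsn)
	{
		int		slot = GetLSNIndex(xid);

		if (group_lsn[slot] < commit_lsn)
			group_lsn[slot] = commit_lsn;
	}

	/*
	 * Before setting a hint bit for xid: safe only if WAL has been
	 * flushed past every commit record in the xid's group.
	 */
	static bool
	HintBitIsSafe(TransactionId xid, XLogRecPtr flushed_up_to)
	{
		return group_lsn[GetLSNIndex(xid)] <= flushed_up_to;
	}

Presumably the point of grouping rather than striping is that the
unflushed async commits are always the most recent xids, so they land in
only the last slot or two of a page; with striping they would scatter
across every slot and hold back hint bit setting page-wide.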
This has been saved for the 8.4 release:

	http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> > > On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
> > >> A quick grep suggests that VACUUM FULL might be at risk here.
> >
> > > No, we're clear: I caught that issue specifically for VACUUM FULL fairly
> > > early on. VF assumes all hint bits are set after the first scan, so we
> > > flush prior to the scan to ensure it's safe to set the hint bits.
> >
> > Flush what prior to the scan?
> >
> > The methodology I suggested earlier (involving tracking LSN only at the
> > level of pg_clog pages) isn't going to make that work, unless you
> > somehow force the XID counter forward to the next page boundary.
> > It might be that that level of tracking is too coarse anyway, since
> > it essentially says that you can't hint any transaction until the
> > next 32K-transaction boundary is reached.
>
> Solutions I'm going for are these:
>
> - Force XLogFlush() prior to the initial VF scan. Tqual will set hint
>   bits if WAL has been flushed, else setting will be deferred, so no WAL
>   flushes will be forced by normal hint bit setting and VF will work
>   without needing any crufty special cases or rework of VF logic.
>
> - Use Tom's LSN tracking at the clog page level, but make the LSN
>   tracking store an array of LSNs rather than just one. The array size
>   is fixed at NUMBER_TRACKED_LSNS_PER_PAGE, so that each LSN covers
>   32,000/NUMBER_TRACKED_LSNS_PER_PAGE transactions. I'd guess that
>   storing 8 per page would be optimal, so each stored LSN would track
>   4,000 transactions - probably around 1 sec worth of transactions when
>   the feature is used.
>
> Comments?

--
Bruce Momjian <bruce@momjian.us>   http://momjian.us
EnterpriseDB                       http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
On Wed, 2007-09-26 at 04:33 -0400, Bruce Momjian wrote:
> This has been saved for the 8.4 release:
>
> 	http://momjian.postgresql.org/cgi-bin/pgpatches_hold

Already applied.

--
Simon Riggs
2ndQuadrant   http://www.2ndQuadrant.com