Thread: SetBufferCommitInfoNeedsSave and race conditions

SetBufferCommitInfoNeedsSave and race conditions

From
"Pavan Deolasee"
Date:

During one of HOT stress tests, an asserition failed at tqual.c:1178
in HeapTupleSatisfiesVacuum(). The assertion failure looked really
strange because the assertion checks for HEAP_XMAX_COMMITTED
which we set just couple of lines above. I inspected the core dump
and found that the flag is *set* properly. That was even more strange.
I confirmed that we are holding a SHARE lock on the buffer as we
do at several other places while checking/setting the infomask bits.

We had a theory that somebody clears the flag after the asserting
process sets it and before it checks it. The other process also sets it
back before core dump is generated because core shows the flag
being set properly. The chances of this happening are very slim and
can further be ruled out because I carefully looked at the code and found
that the flag can only be cleared holding an exclusive lock on the buffer.

So we suspected an interaction between multiple processes each holding
a SHARE lock and setting/checking different bits in the infomask and
we could theoritically say that such interaction can potentially lead to
missing hint bit updates. I can think of the following:

Process P1 is setting bit 0 and process P2 setting bit 1 of an integer
'x' whose current value is say 0.

             P1                         P2
       load x in register A     load x in register B
       A = A | 0x0001           B = B | 0x0002
       Store A to x
                                       Store B to x

At the end, P1's update is missing! If P1's further processing is based
on the bit-check, it would go completely wrong.

This easily explains the assertion and core dump analysis. We can
possibly remove that assertion and any other similar assertions
(unless someone can find a hole in the above analysis). But I am
more worried about other similar race conditions where hint bit updates
go missing and thus causing severe MVCC failures.

Btw, to validate the race condition I quickly wrote a simple C
program which attaches to a share memory. Each instance of the
process sets/clears and checks a separate bit. It clearly demonstrates
the danger. The code is attached. Compile and run with an integer
argument to tell which bit to set/reset.


Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
Attachment

Re: SetBufferCommitInfoNeedsSave and race conditions

From
Tom Lane
Date:
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:
> So we suspected an interaction between multiple processes each holding
> a SHARE lock and setting/checking different bits in the infomask and
> we could theoritically say that such interaction can potentially lead to
> missing hint bit updates.

Yeah.  This is in fact something that's been foreseen, but I guess it
didn't occur to anyone that those Asserts would fail.  I concur with
removing them.
        regards, tom lane


Re: SetBufferCommitInfoNeedsSave and race conditions

From
Heikki Linnakangas
Date:
Pavan Deolasee wrote:
> During one of HOT stress tests, an asserition failed at tqual.c:1178
> in HeapTupleSatisfiesVacuum(). The assertion failure looked really
> strange because the assertion checks for HEAP_XMAX_COMMITTED
> which we set just couple of lines above. I inspected the core dump
> and found that the flag is *set* properly. That was even more strange.
> I confirmed that we are holding a SHARE lock on the buffer as we
> do at several other places while checking/setting the infomask bits.
> 
> We had a theory that somebody clears the flag after the asserting
> process sets it and before it checks it. The other process also sets it
> back before core dump is generated because core shows the flag
> being set properly. The chances of this happening are very slim and
> can further be ruled out because I carefully looked at the code and found
> that the flag can only be cleared holding an exclusive lock on the buffer.
> 
> So we suspected an interaction between multiple processes each holding
> a SHARE lock and setting/checking different bits in the infomask and
> we could theoritically say that such interaction can potentially lead to
> missing hint bit updates. I can think of the following:

FWIW, this can be reproduced by single-stepping with a debugger:

First, you need a tuple that's committed dead but no hint bits have been 
set:

BEGIN; truncate foo; INSERT INTO foo values (1,'foo'); DELETE FROM Foo; 
commit;

In one backend, set a breakpoint to HeapTupleSatisfiesMVCC lin 953 where 
it sets the XMIN_COMMITED hint bit:
>         else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))>         {>>>>             tuple->t_infomask
|=HEAP_XMIN_COMMITTED;>             SetBufferCommitInfoNeedsSave(buffer);>         }
 

Issue a SELECT * FROM foo, and step a single instruction that fetches 
the infomask field from memory to a register.

Open another backend, set a breakpoint to HeapTupleSatisfiesVacuum line 
1178:
>         else if (TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple)))>         {>             tuple->t_infomask |=
HEAP_XMAX_COMMITTED;>            SetBufferCommitInfoNeedsSave(buffer);>         }>         else>         {>
/*>             * Not in Progress, Not Committed, so either Aborted or 
 
crashed>              */>             tuple->t_infomask |= HEAP_XMAX_INVALID;>
SetBufferCommitInfoNeedsSave(buffer);>            return HEAPTUPLE_LIVE;>         }>         /* Should only get here if
weset XMAX_COMMITTED */>>>>>         Assert(tuple->t_infomask & HEAP_XMAX_COMMITTED);>     }
 

And issue "VACUUM foo". It'll stop on that breakpoint.

Let the first backend continue. It will clear the XMAX_COMMITTED field.

Now let the 2nd backend to continue and you get an assertion failure.


AFAICS, we can just simply remove the assertion. But is there any 
codepaths that assume that after calling HeapTupleSatisfiesSnapshot, all 
appropriate hint bits are set?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: SetBufferCommitInfoNeedsSave and race conditions

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> AFAICS, we can just simply remove the assertion. But is there any 
> codepaths that assume that after calling HeapTupleSatisfiesSnapshot, all 
> appropriate hint bits are set?

There had better not be, since we are going to postpone setting hint
bits for recently-committed transactions as part of the async-commit
patch.

A quick grep suggests that VACUUM FULL might be at risk here.
        regards, tom lane


Re: SetBufferCommitInfoNeedsSave and race conditions

From
Alvaro Herrera
Date:
Tom Lane escribió:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > AFAICS, we can just simply remove the assertion. But is there any 
> > codepaths that assume that after calling HeapTupleSatisfiesSnapshot, all 
> > appropriate hint bits are set?
> 
> There had better not be, since we are going to postpone setting hint
> bits for recently-committed transactions as part of the async-commit
> patch.
> 
> A quick grep suggests that VACUUM FULL might be at risk here.

That particular case seems easily fixed since VACUUM FULL must hold an
exclusive lock; and we can forcibly set sync commit for VACUUM FULL.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: SetBufferCommitInfoNeedsSave and race conditions

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane escribi�:
>> A quick grep suggests that VACUUM FULL might be at risk here.

> That particular case seems easily fixed since VACUUM FULL must hold an
> exclusive lock; and we can forcibly set sync commit for VACUUM FULL.

Uh, that wouldn't help.  The problem is that if VACUUM FULL is *looking
at* a recently-committed tuple, tqual.c might decide it can't set the
hint bit yet because it's not certain the commit record for that other
transaction is flushed.

We could possibly hack things so that inside a VACUUM FULL (maybe plain
vacuum too?), we prefer flushing xlog to leaving hint bits unset.
That's likely to be messy though.

Probably a cleaner and more robust answer is to make VACUUM FULL call
tqual.c again in the places where it currently assumes it can look
directly at the hint bits.
        regards, tom lane


Re: SetBufferCommitInfoNeedsSave and race conditions

From
"Simon Riggs"
Date:
On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > AFAICS, we can just simply remove the assertion. But is there any 
> > codepaths that assume that after calling HeapTupleSatisfiesSnapshot, all 
> > appropriate hint bits are set?
> 
> There had better not be, since we are going to postpone setting hint
> bits for recently-committed transactions as part of the async-commit
> patch.
> 
> A quick grep suggests that VACUUM FULL might be at risk here.

No we're clear: I caught that issue specifically for VACUUM FULL fairly
early on. VF assumes all hint bits are set after the first scan, so we
flush prior to the scan to ensure its safe to set the hint bits. There
are no concurrent hint bit setters, so we are good.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: SetBufferCommitInfoNeedsSave and race conditions

From
"Simon Riggs"
Date:
On Thu, 2007-06-28 at 15:29 -0400, Alvaro Herrera wrote:
> Tom Lane escribió:
> > Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > > AFAICS, we can just simply remove the assertion. But is there any
> > > codepaths that assume that after calling HeapTupleSatisfiesSnapshot, all
> > > appropriate hint bits are set?
> >
> > There had better not be, since we are going to postpone setting hint
> > bits for recently-committed transactions as part of the async-commit
> > patch.
> >
> > A quick grep suggests that VACUUM FULL might be at risk here.
>
> That particular case seems easily fixed since VACUUM FULL must hold an
> exclusive lock; and we can forcibly set sync commit for VACUUM FULL.

Exactly what it does!

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: SetBufferCommitInfoNeedsSave and race conditions

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
>> A quick grep suggests that VACUUM FULL might be at risk here.

> No we're clear: I caught that issue specifically for VACUUM FULL fairly
> early on. VF assumes all hint bits are set after the first scan, so we
> flush prior to the scan to ensure its safe to set the hint bits.

Flush what prior to the scan?

The methodology I suggested earlier (involving tracking LSN only at the
level of pg_clog pages) isn't going to make that work, unless you
somehow force the XID counter forward to the next page boundary.
It might be that that level of tracking is too coarse anyway, since
it essentially says that you can't hint any transaction until the
next 32K-transaction boundary is reached.
        regards, tom lane


Re: SetBufferCommitInfoNeedsSave and race conditions

From
"Simon Riggs"
Date:
On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
> >> A quick grep suggests that VACUUM FULL might be at risk here.
> 
> > No we're clear: I caught that issue specifically for VACUUM FULL fairly
> > early on. VF assumes all hint bits are set after the first scan, so we
> > flush prior to the scan to ensure its safe to set the hint bits.
> 
> Flush what prior to the scan?
> 
> The methodology I suggested earlier (involving tracking LSN only at the
> level of pg_clog pages) isn't going to make that work, unless you
> somehow force the XID counter forward to the next page boundary.
> It might be that that level of tracking is too coarse anyway, since
> it essentially says that you can't hint any transaction until the
> next 32K-transaction boundary is reached.

If you completely flush WAL after the AccessExclusiveLock has been taken
by VF, then you are guaranteed to have flushed all asynchronous commits
that touch the table.

You're right that I just broke VF again by changing the implementation
back to deferred hint setting. I'll fix again. Hmmm, I can add a test
for LockHeldByMe(AccessExclusiveLock) in tqual, or just have a
VacuumHintStrategy() call that alters the behaviour of the tqual
routines. Latter sounds best?

My aim is to get this all wrapped up by end of Monday night.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: SetBufferCommitInfoNeedsSave and race conditions

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
>> The methodology I suggested earlier (involving tracking LSN only at the
>> level of pg_clog pages) isn't going to make that work, unless you
>> somehow force the XID counter forward to the next page boundary.

> If you completely flush WAL after the AccessExclusiveLock has been taken
> by VF, then you are guaranteed to have flushed all asynchronous commits
> that touch the table.

I don't believe this is correct (see system catalogs) and in any case
it's far too fragile for my taste.  I think it'd be a lot better to just
stop referencing the hint bits directly in VACUUM FULL.
        regards, tom lane


Re: SetBufferCommitInfoNeedsSave and race conditions

From
"Simon Riggs"
Date:
On Fri, 2007-06-29 at 11:13 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
> >> The methodology I suggested earlier (involving tracking LSN only at the
> >> level of pg_clog pages) isn't going to make that work, unless you
> >> somehow force the XID counter forward to the next page boundary.
> 
> > If you completely flush WAL after the AccessExclusiveLock has been taken
> > by VF, then you are guaranteed to have flushed all asynchronous commits
> > that touch the table.
> 
> I don't believe this is correct (see system catalogs) and in any case
> it's far too fragile for my taste.  I think it'd be a lot better to just
> stop referencing the hint bits directly in VACUUM FULL.

Well, according to the comments in repair_frag(), line 1799-1815,
specifically line 1810 says "we wouldnt be in repair_frag() at all" if
the tuple was "in a system catalog, inserted or deleted by a not yet
committed transaction". If we have flushed all committed and/or aborted
transactions then we're good.

Maybe the comment is wrong? Wouldn't be the first time. Either way,
please explain your concern in more detail.

I'd rather do this the easy way since VF seems soon to die (payback for
all the torture its caused us down the years).

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: SetBufferCommitInfoNeedsSave and race conditions

From
"Simon Riggs"
Date:
On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
> >> A quick grep suggests that VACUUM FULL might be at risk here.
> 
> > No we're clear: I caught that issue specifically for VACUUM FULL fairly
> > early on. VF assumes all hint bits are set after the first scan, so we
> > flush prior to the scan to ensure its safe to set the hint bits.
> 
> Flush what prior to the scan?
> 
> The methodology I suggested earlier (involving tracking LSN only at the
> level of pg_clog pages) isn't going to make that work, unless you
> somehow force the XID counter forward to the next page boundary.
> It might be that that level of tracking is too coarse anyway, since
> it essentially says that you can't hint any transaction until the
> next 32K-transaction boundary is reached.

Solutions I'm going for are these:

- Force XLogFlush() prior to initial VF scan. Tqual will set hint bits
if WAL has been flushed, else it will be deferred, so no WAL flushes
will be forced by normal hint bit setting and VF will work without
needing any crufty special cases or rework of VF logic.

- Use Tom's LSN tracking at clog page level. Make the LSN tracking store
an array of LSNs rather than just one. Array size is fixed at
NUMBER_TRACKED_LSNS_PER_PAGE, so that each LSN covers
32,000/NUMBER_TRACKED_LSNS_PER_PAGE transactions. I'd guess that storing
8 per page would be optimal, so each stored xid would track 4,000
transactions - probably around 1 sec worth of transactions when the
feature is used.

Comments?

--  Simon Riggs EnterpriseDB  http://www.enterprisedb.com



Re: SetBufferCommitInfoNeedsSave and race conditions

From
Gregory Stark
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:

> I'd guess that storing 8 per page would be optimal, so each stored xid would
> track 4,000 transactions - probably around 1 sec worth of transactions when
> the feature is used.

This is something we can experiment with but I suspect that while 8 might be
sufficient for many use cases there would be others where more would be
better. The cost to having more lsns stored in the clog would be pretty small.

On TPCC which has longer transactions on moderate hardware we only see order
of 1,000 txn/min. So a setting like 128 which allows a granularity of 256
transactions would be about 15s which is not so much longer than the xmin
horizon of the 90th percentile response time of 2*5s.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: SetBufferCommitInfoNeedsSave and race conditions

From
Bruce Momjian
Date:
Where are we on this?

---------------------------------------------------------------------------

Gregory Stark wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> 
> > I'd guess that storing 8 per page would be optimal, so each stored xid would
> > track 4,000 transactions - probably around 1 sec worth of transactions when
> > the feature is used.
> 
> This is something we can experiment with but I suspect that while 8 might be
> sufficient for many use cases there would be others where more would be
> better. The cost to having more lsns stored in the clog would be pretty small.
> 
> On TPCC which has longer transactions on moderate hardware we only see order
> of 1,000 txn/min. So a setting like 128 which allows a granularity of 256
> transactions would be about 15s which is not so much longer than the xmin
> horizon of the 90th percentile response time of 2*5s.
> 
> -- 
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
> 
>                http://archives.postgresql.org

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: SetBufferCommitInfoNeedsSave and race conditions

From
Gregory Stark
Date:
"Bruce Momjian" <bruce@momjian.us> writes:

> Where are we on this?

Well Simon just sent the reworked patch yesterday so the answer is we haven't
started tuning this parameter. (Bruce's message is referring to the discussion
about what the optimal value of lsns per clog page would be.)

I intend to do some benchmarking on it and testing what the best value of this
parameter is one of the things I aim to determine with these benchmarks.

I'm not 100% sure yet what to measure though. There are two questions here:

1) What are some good workloads to test which will require larger values for  this parameter or which will be hurt by
excessivelylarge values?
 

I think any short-transaction workload should be basically good enough. I'm
thinking of using just pgbench's default workload. Does anyone think there are
other workloads that we need to specifically test?

2) What metric do I use to determine if the value is large enough or too  large?

The gross measurement would be to look at tps. I would prefer to actually
count events which we want to minimize such as xlogflushes because the clog
lsn is too old and how much extra work the visibility checks do. I'm not sure
exactly how much of this we can really measure though. Are there any other
events having an insufficiently large or excessively large value of this
parameter will cause which we can count?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: SetBufferCommitInfoNeedsSave and race conditions

From
"Simon Riggs"
Date:
On Wed, 2007-07-18 at 01:01 +0100, Gregory Stark wrote:
> "Bruce Momjian" <bruce@momjian.us> writes:
> 
> > Where are we on this?
> 
> Well Simon just sent the reworked patch yesterday so the answer is we haven't
> started tuning this parameter. (Bruce's message is referring to the discussion
> about what the optimal value of lsns per clog page would be.)

>From discussion, Greg wishes to see a certain parameter higher, though
I'm fairly comfortable with the value now. Thanks to Greg for making me
think this through in more detail. There is some confirmation required,
but no significant tuning effort required.

Here's the thinking:

When we commit, we store an LSN that covers a group of transactions.
When we set hint bits we check the appropriate LSN. If it has been
flushed, then we set the hint bit.

If we're using synch commits then we can always set hint bits.
If we're mixing async commits with sync commits then the clog LSNs will
very often be flushed already, so we'll be able to set most hint bits.

If we are doing *only* simple async commits then only the WAL writer
flushes WAL. In that case when we try to set hint bits we will only be
prevented from setting hint bits for the last "few" transactions. So how
many is that exactly?

We are flushing WAL every W seconds and doing T transactions per second,
then we will flush WAL every T*W transactions. e.g. W = 0.2 (default), T
~ 10,000 => T*W = 2,000 transactions.

We have divided up each clog page up into N groups of transactions, each
with their own LSN, then we will have 32,000/N transactions per group.

The best situation is that we set N such that we never defer the setting
of a hint bit for a transaction that has been flushed.
There is no point setting N any lower than32,000/N = T*W
i.e. N <= W * (32,000/T)

My laptop can do T=1000 with W=0.2 with pgbench, so N <= 6. 

By definition, anyone using async commit will have a high transaction
rate (else there is no benefit), so N seems able to be reasonably small.
If the transaction rate drops off momentarily the number of unflushed
committed async transactions doesn't rise, it falls quickly to zero as
the WAL writer flushes WAL up to the last async LSN.

The patch sets a value of 16, which seems likely to be enough to cover
most situations. 

Having thought that through, I see two changes I'd like to make:

- I realise that I've arranged the LSNs to be striped rather than
grouped. I'll change this part of the patch - couple of lines in
GetLSNIndex().

- The LSN of sync commits is also used to update the LSN of the clog
page. That is unnecessary and we only need set the clog LSN for async
commits.

I'll wait for other review comments before cutting version 23....

--  Simon Riggs EnterpriseDB  http://www.enterprisedb.com



Re: SetBufferCommitInfoNeedsSave and race conditions

From
Bruce Momjian
Date:
This has been saved for the 8.4 release:
http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Thu, 2007-06-28 at 20:23 -0400, Tom Lane wrote:
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> > > On Thu, 2007-06-28 at 15:16 -0400, Tom Lane wrote:
> > >> A quick grep suggests that VACUUM FULL might be at risk here.
> > 
> > > No we're clear: I caught that issue specifically for VACUUM FULL fairly
> > > early on. VF assumes all hint bits are set after the first scan, so we
> > > flush prior to the scan to ensure its safe to set the hint bits.
> > 
> > Flush what prior to the scan?
> > 
> > The methodology I suggested earlier (involving tracking LSN only at the
> > level of pg_clog pages) isn't going to make that work, unless you
> > somehow force the XID counter forward to the next page boundary.
> > It might be that that level of tracking is too coarse anyway, since
> > it essentially says that you can't hint any transaction until the
> > next 32K-transaction boundary is reached.
> 
> Solutions I'm going for are these:
> 
> - Force XLogFlush() prior to initial VF scan. Tqual will set hint bits
> if WAL has been flushed, else it will be deferred, so no WAL flushes
> will be forced by normal hint bit setting and VF will work without
> needing any crufty special cases or rework of VF logic.
> 
> - Use Tom's LSN tracking at clog page level. Make the LSN tracking store
> an array of LSNs rather than just one. Array size is fixed at
> NUMBER_TRACKED_LSNS_PER_PAGE, so that each LSN covers
> 32,000/NUMBER_TRACKED_LSNS_PER_PAGE transactions. I'd guess that storing
> 8 per page would be optimal, so each stored xid would track 4,000
> transactions - probably around 1 sec worth of transactions when the
> feature is used.
> 
> Comments?
> 
> -- 
>   Simon Riggs
>   EnterpriseDB  http://www.enterprisedb.com
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

--  Bruce Momjian  <bruce@momjian.us>          http://momjian.us EnterpriseDB
http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: SetBufferCommitInfoNeedsSave and race conditions

From
Simon Riggs
Date:
On Wed, 2007-09-26 at 04:33 -0400, Bruce Momjian wrote:
> This has been saved for the 8.4 release:
> 
>     http://momjian.postgresql.org/cgi-bin/pgpatches_hold

Already applied.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com