Thread: HeapTupleSatisfiesDirty fails to test HEAP_XMAX_IS_LOCKED_ONLY for TransactionIdIsInProgress(...)

Hi all

Andres and I were going over a patch yesterday and found an unexpected
bug in tqual.c while attempting to trigger a hypothesized bug in that patch.

A SELECT ... FOR SHARE will incorrectly block on another open
transaction that ran SELECT ... FOR SHARE and still holds the locks if
it has to follow a ctid chain from the current snapshot through a
committed update to a share-locked tuple.

This also affects uniqueness checks in btrees, where it can cause
unnecessary waiting. It's also an issue with FOR KEY UPDATE, in that it
can cause an update to block when it doesn't have to.

The attached test case runs under isolationtester to exersise the
problem. I've tested it against 9.2, 9.3, and HEAD, but Andres looked
over the code and says the underlying bug goes back to the commit of
MVCC, it's just become easier to trigger. To reliably test this with
isolationtester I had to horribly abuse pg_advisory_lock(...) so that I
could force the first SELECT ... FOR UPDATE to acquire its snapshot
before the UPDATE runs.

A backtrace of the point where it's incorrectly blocked is attached.
What's happening is that the test for TransactionIdIsInProgress
unconditionally sets snapshot->xmax, even if xmax was only set for
locking purposes. This will cause the caller to wait for the xid in xmax
when it doesn't need to.

It should be testing HEAP_XMAX_IS_LOCKED_ONLY and only setting
snapshot->xmax if the tuple is really being deleted by an open tx.

Note that HeapTupleSatisfiesDirty tests the infomask for
HEAP_XMAX_IS_MULTI and handles multixacts separately, and in that case
it _does_ already test HEAP_XMAX_IS_LOCKED_ONLY.

When you run the attached test case it should block indefinitely. After
applying the attached patch it'll return as expected.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment
Craig Ringer wrote:

> A SELECT ... FOR SHARE will incorrectly block on another open
> transaction that ran SELECT ... FOR SHARE and still holds the locks if
> it has to follow a ctid chain from the current snapshot through a
> committed update to a share-locked tuple.
> 
> This also affects uniqueness checks in btrees, where it can cause
> unnecessary waiting. It's also an issue with FOR KEY UPDATE, in that it
> can cause an update to block when it doesn't have to.

Interesting bug.  Thanks for the patch.  I have applied it all the way
back to 8.4 (with adjustments for 9.2 and beyond).

> The attached test case runs under isolationtester to exersise the
> problem. I've tested it against 9.2, 9.3, and HEAD, but Andres looked
> over the code and says the underlying bug goes back to the commit of
> MVCC, it's just become easier to trigger. To reliably test this with
> isolationtester I had to horribly abuse pg_advisory_lock(...) so that I
> could force the first SELECT ... FOR UPDATE to acquire its snapshot
> before the UPDATE runs.

I didn't apply the test case.  I think if we want to test tqual.c
behavior we will need to introduce a large battery of tests.  I would
like to see more opinions on whether that's something we want.

> A backtrace of the point where it's incorrectly blocked is attached.
> What's happening is that the test for TransactionIdIsInProgress
> unconditionally sets snapshot->xmax, even if xmax was only set for
> locking purposes. This will cause the caller to wait for the xid in xmax
> when it doesn't need to.

Yeah, after actually going over the code I think the bug is clear.  (I
was initially unsure about SatisfiesDirty returning true not false for
this case; but the return value was correct, only snapshot->xmax was
being set incorrectly.  If you examine the callers they would misbehave
if the return value were changed; for example EvalPlanQualFetch would
completely skip the tuple if SatisfiesDirty returned false, which is not
what we want; we want the tuple to be processed.)  I think the comments
on what exactly SatisfiesDirty does about in-progress transactions ought
to be more specific.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On Fri, Aug 2, 2013 at 5:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> The attached test case runs under isolationtester to exersise the
>> problem. I've tested it against 9.2, 9.3, and HEAD, but Andres looked
>> over the code and says the underlying bug goes back to the commit of
>> MVCC, it's just become easier to trigger. To reliably test this with
>> isolationtester I had to horribly abuse pg_advisory_lock(...) so that I
>> could force the first SELECT ... FOR UPDATE to acquire its snapshot
>> before the UPDATE runs.
>
> I didn't apply the test case.  I think if we want to test tqual.c
> behavior we will need to introduce a large battery of tests.  I would
> like to see more opinions on whether that's something we want.

I haven't read this particular test, but I do think we could get a lot
of mileage out of applying the isolation tester stuff to more things,
and am generally in favor of that.  It has the advantages of (1)
allowing tests that require more than one session and (2) being run
regularly the buildfarm; but it's not something developers typically
run before every commit, so the run time of the test suite shouldn't
be a big issue for anyone.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On 2013-08-05 14:37:34 -0400, Robert Haas wrote:
> On Fri, Aug 2, 2013 at 5:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >> The attached test case runs under isolationtester to exersise the
> >> problem. I've tested it against 9.2, 9.3, and HEAD, but Andres looked
> >> over the code and says the underlying bug goes back to the commit of
> >> MVCC, it's just become easier to trigger. To reliably test this with
> >> isolationtester I had to horribly abuse pg_advisory_lock(...) so that I
> >> could force the first SELECT ... FOR UPDATE to acquire its snapshot
> >> before the UPDATE runs.
> >
> > I didn't apply the test case.  I think if we want to test tqual.c
> > behavior we will need to introduce a large battery of tests.  I would
> > like to see more opinions on whether that's something we want.
> 
> I haven't read this particular test, but I do think we could get a lot
> of mileage out of applying the isolation tester stuff to more things,
> and am generally in favor of that.

The "problem" with the specific test is that it uses row level locks to
get to the situation where EvalPlanQualFetch has to chase down a ctid
chain by making a READ COMITTED transaction acquire a snapshot and only
after that wait.
Not sure if that's not too involved.

> It has the advantages of (1)
> allowing tests that require more than one session and (2) being run
> regularly the buildfarm; but it's not something developers typically
> run before every commit, so the run time of the test suite shouldn't
> be a big issue for anyone.

Agreed.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



On 2013-08-06 04:48:09 +0200, Andres Freund wrote:
> On 2013-08-05 14:37:34 -0400, Robert Haas wrote:
> > On Fri, Aug 2, 2013 at 5:25 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > >> The attached test case runs under isolationtester to exersise the
> > >> problem. I've tested it against 9.2, 9.3, and HEAD, but Andres looked
> > >> over the code and says the underlying bug goes back to the commit of
> > >> MVCC, it's just become easier to trigger. To reliably test this with
> > >> isolationtester I had to horribly abuse pg_advisory_lock(...) so that I
> > >> could force the first SELECT ... FOR UPDATE to acquire its snapshot
> > >> before the UPDATE runs.
> > >
> > > I didn't apply the test case.  I think if we want to test tqual.c
> > > behavior we will need to introduce a large battery of tests.  I would
> > > like to see more opinions on whether that's something we want.
> > 
> > I haven't read this particular test, but I do think we could get a lot
> > of mileage out of applying the isolation tester stuff to more things,
> > and am generally in favor of that.
> 
> The "problem" with the specific test is that it uses row level locks to
> get to the situation where EvalPlanQualFetch has to chase down a ctid
> chain by making a READ COMITTED transaction acquire a snapshot and only
> after that wait.
> Not sure if that's not too involved.

Errr, s/row level locks/advisory locks/

Thanks Craig,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



On 08/06/2013 02:37 AM, Robert Haas wrote:
> I haven't read this particular test, but I do think we could get a lot
> of mileage out of applying the isolation tester stuff to more things,
> and am generally in favor of that.  It has the advantages of (1)
> allowing tests that require more than one session and (2) being run
> regularly the buildfarm; but it's not something developers typically
> run before every commit, so the run time of the test suite shouldn't
> be a big issue for anyone.

The main issue with the test is that it's a dirty hack. What I really
want is a way to block a statement at a certain point - to say "block
after a snapshot is acquired" for example - and release that at a time
of my choosing.

That would require hooks (either callable from SQL in debug builds or
only accessible via C extensions) that can block until released by
action from another backend or after a timeout. The hook call sites
would be macros that're defined away unless the hooks were enabled at
compile time, in which case a shared_preload_library could register a
callback on a hook to do whatever it needs to do. The library would
probably expose some SQL functions as an extension to allow SQL-level
manipulation of the blocking/timeout behaviour.

Such hooks would be really useful for testing complex concurrency cases
that can't be expressed at the statement level. I usually just use a
"LOCK TABLE blah IN ACCESS EXCLUSIVE MODE" to set up a race, but that's
only useful when your issue is at the statement level. It's also not
possible to do this with isolationchecker, because it seems it can only
cope with one blocked session at a time and you need (at least) three
for this approach.

Opinions? Is this something that's even worth thinking about further or
a non-starter?

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



On 8/5/13 10:18 PM, Craig Ringer wrote:
> The main issue with the test is that it's a dirty hack. What I really
> want is a way to block a statement at a certain point - to say "block
> after a snapshot is acquired" for example - and release that at a time
> of my choosing.

...

> Opinions? Is this something that's even worth thinking about further or
> a non-starter?

+1

Something that sets Enova apart from many other places is we have extensive database unit tests. One of the reasons for
thatis you can do way more testing at the database level than you can trying to go through the application.
 

This seems like just another case of that. We could either try to come up with some C level unit tests (which I suspect
wouldn'twork in this case), or expose some hooks.
 
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net