Thread: Bug in wait time when waiting on nested subtransaction

Bug in wait time when waiting on nested subtransaction

From: Simon Riggs
Issue in the current XactLockTableWait(), introduced by commit
3c27944fb2141, discovered while reviewing
https://commitfest.postgresql.org/40/3806/

A test demonstrating the problem is attached as
001-isotest-tuplelock-subxact.v1.patch

A narrative description of the issue follows:
session1 - requests multiple nested subtransactions like this:
BEGIN; ...
SAVEPOINT subxid1; ...
SAVEPOINT subxid2; ...

If another session (session2) sees an xid from subxid2, it waits. If
subxid2 aborts, then session2 sees the abort and can continue
processing normally.
However, if subxid2 subcommits (e.g. at RELEASE), the lock wait moves
from subxid2 to the topxid. If subxid1 subsequently aborts, it will
also abort subxid2, but session2, which is waiting for subxid2 to
complete, doesn't see this and waits for the topxid instead. That
means it waits longer than it should, and later-arriving lock waiters
may actually get served first.
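
For reference, an abridged sketch (not the verbatim source) of the
wait loop in XactLockTableWait() as it stands after 3c27944fb2141:

    LOCKTAG  tag;
    bool     first = true;

    for (;;)
    {
        SET_LOCKTAG_TRANSACTION(tag, xid);
        (void) LockAcquire(&tag, ShareLock, false, false);
        LockRelease(&tag, ShareLock, false);

        if (!TransactionIdIsInProgress(xid))
            break;          /* xid aborted, or its whole xact ended */

        /*
         * The subxid's lock table entry was deleted at subcommit, so
         * the LockAcquire() above returned immediately even though the
         * xid is still in progress.  We then jump straight to the
         * topmost xid -- skipping any intermediate parent such as
         * subxid1.
         */
        if (!first)
            pg_usleep(1000L);
        first = false;
        xid = SubTransGetTopmostTransaction(xid);
    }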

So it's a bug, but not too awful, since in many cases people don't use
nested subtransactions, and even when they do use SAVEPOINTs, they
don't often close them using RELEASE. And in most cases the extra wait
time is not very long, which is presumably why nobody has ever
reported this issue.

Thanks to Robert Haas and Julien Tachoires for asking the question
"are you sure the existing coding is correct?". You were both right;
it is not.

How to fix? Correct lock waiting can be achieved while a subxid is
running if we have either
* a lock table entry for the subxid OR
* a subtrans entry that points to its immediate parent

So we have these options

1. Remove the XactLockTableDelete() call in CommitSubTransaction().
That call is what releases lock waiters earlier than expected, which
requires pushups in XactLockTableWait() to cope with it (pushups that
are currently broken). Why do we do it? To save shmem space in the
lock table, should anyone want to run a transaction that contains
thousands of subtransactions, or more. So option (1) alone would
eventually cause us to run out of space in the lock table, and a
transaction would receive ERRORs rather than be allowed to run for a
long time.
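
For context, the deletion happens at subcommit, roughly like this
(paraphrased from CommitSubTransaction() in xact.c):

    if (TransactionIdIsValid(s->transactionId))
        XactLockTableDelete(s->transactionId);  /* frees the subxid's
                                                 * shmem lock entry */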

2. In XactLockTableWait(), replace the call to
SubTransGetTopmostTransaction() with SubTransGetParent(), so that we
go up the levels one by one, as we did before. However, (2) causes
huge subtrans contention, and if we implemented and backpatched it the
performance issues could be significant. So my feeling is that if we
do (2) then we should not backpatch it.
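
A sketch of (2), assuming pg_subtrans still records each subxid's
immediate parent:

    LOCKTAG  tag;

    for (;;)
    {
        SET_LOCKTAG_TRANSACTION(tag, xid);
        (void) LockAcquire(&tag, ShareLock, false, false);
        LockRelease(&tag, ShareLock, false);

        if (!TransactionIdIsInProgress(xid))
            break;          /* aborted, or the whole xact has ended */

        /*
         * Lock entry gone but xid still in progress => it subcommitted.
         * Step up one level only, so a waiter on subxid2 waits next on
         * subxid1 and sees subxid1's abort promptly.  Each step is an
         * extra pg_subtrans (SLRU) lookup -- hence the contention.
         */
        xid = SubTransGetParent(xid);
        if (!TransactionIdIsValid(xid))
            break;
    }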

So both (1) and (2) have issues.

The main result from patch https://commitfest.postgresql.org/40/3806/
is that having subtrans point directly to the topxid is very good for
performance in XidInMVCCSnapshot(), and presumably in other places
too. This bug prevents the simple application of that patch to improve
performance. So now we need a stronger mix of ideas, to both resolve
the bug and fix the subtrans contention issue in HEAD.

My preferred solution would be a mix of the above, call it option (3)

3.
a) Keep the lock table entries for the first 64 subtransactions only,
so that we limit shmem use. For those first 64 entries, have subtrans
point directly to the topxid, since that makes a subtrans lookup an
O(1) operation, which is important for the performance of later
actions.

b) For any subtransactions after the first 64, delete the subxid lock
at subcommit, to save shmem, but make subtrans point to the immediate
parent (only), not the topxid. That fixes the bug, but causes
performance problems in XidInMVCCSnapshot() and others, so we also do
c) and d).

c) At top-level commit, go back and alter subtrans again for the
subxids so that each entry points to the topxid, so that we avoid O(N)
behavior in XidInMVCCSnapshot() and other callers. This is an
additional cost for longer transactions, but it saves considerable
cost in later callers that need to call
SubTransGetTopmostTransaction().
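
A sketch of (c); nsubxids, subxids and topxid stand in for whatever
the commit path has at hand:

    /*
     * At top-level commit, repoint each subcommitted subxid's
     * pg_subtrans entry at the topmost xid, so that later lookups from
     * XidInMVCCSnapshot() etc. resolve in one step.
     */
    for (int i = 0; i < nsubxids; i++)
        SubTransSetParent(subxids[i], topxid);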

d) Optimize SubTransGetTopmostTransaction() so that it retrieves
entries page-at-a-time. This will reduce the contention of repeatedly
revisiting the same page(s), and make it less likely that a page is
paged out while we are still using it.
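
And a rough sketch of (d), where SubTransFollowParentsOnPage() is a
made-up helper that takes the SLRU lock once and follows parent links
for as long as they land on the same pg_subtrans page:

    TransactionId
    SubTransGetTopmostTransaction(TransactionId xid)
    {
        TransactionId parent = xid;
        TransactionId prev = xid;

        while (TransactionIdIsValid(parent) &&
               !TransactionIdPrecedes(parent, TransactionXmin))
        {
            prev = parent;
            /* one lock/read per page, instead of one per level */
            parent = SubTransFollowParentsOnPage(parent);
        }
        return prev;
    }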

Thoughts?

--
Simon Riggs                http://www.EnterpriseDB.com/


Re: Bug in wait time when waiting on nested subtransaction

From: Robert Haas
On Mon, Nov 28, 2022 at 10:28 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> So we have these options
>
> 1. Remove the XactLockTableDelete() call in CommitSubTransaction().
> That call is what releases lock waiters earlier than expected, which
> requires pushups in XactLockTableWait() to cope with it (pushups that
> are currently broken). Why do we do it? To save shmem space in the
> lock table, should anyone want to run a transaction that contains
> thousands of subtransactions, or more. So option (1) alone would
> eventually cause us to run out of space in the lock table, and a
> transaction would receive ERRORs rather than be allowed to run for a
> long time.

This seems unprincipled to me. The point of having a lock on the
subtransaction in the lock table is so that people can wait for the
subtransaction to end. If we don't remove the lock table entry at
subtransaction end, then that lock table entry doesn't serve that
purpose any more.

> 2. In XactLockTableWait(), replace the call to
> SubTransGetTopmostTransaction() with SubTransGetParent(), so that we
> go up the levels one by one, as we did before. However, (2) causes
> huge subtrans contention, and if we implemented and backpatched it the
> performance issues could be significant. So my feeling is that if we
> do (2) then we should not backpatch it.

What I find suspicious about the coding of this function is that it
doesn't distinguish between commits and aborts at all. Like, if a
subtransaction commits, the changes don't become globally visible
until the parent transaction also commits. If a subtransaction aborts,
though, what happens to the top-level XID doesn't seem to matter at
all. The comment seems to agree:

 * Note that this does the right thing for subtransactions: if we wait on a
 * subtransaction, we will exit as soon as it aborts or its top parent commits.

I feel like what I'd expect to see, given this comment, is code which
(1) waits until the supplied XID is no longer running, (2) checks
whether the XID aborted, and if so returns at once, and (3) otherwise
recurses to the parent XID. But the code doesn't do that. Perhaps
that's not actually the right thing to do, since it seems like a big
behavior change, but then I don't understand the comment.

Incidentally, one possible optimization here to reduce locking
traffic would be to try waiting for the top-parent first using a
conditional lock acquisition. If that works, cool. If not, go back
around and work up the tree level by level. Since that path would only
be taken in the unhappy case where we're probably going to have to
wait anyway, the cost probably wouldn't be too bad.
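
A sketch of that idea, reusing the existing
ConditionalXactLockTableWait() and a hypothetical WaitForXidToEnd()
for the per-level fallback:

    TransactionId top = SubTransGetTopmostTransaction(xid);

    /* Happy case: top-level xact has already ended, so no blocking. */
    if (ConditionalXactLockTableWait(top))
        return;

    /* Unhappy case: we'll probably wait anyway; walk up the tree. */
    while (TransactionIdIsValid(xid))
    {
        WaitForXidToEnd(xid);       /* hypothetical per-level wait */
        if (TransactionIdDidAbort(xid))
            return;                 /* an abort releases us at once */
        xid = SubTransGetParent(xid);
    }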

> My preferred solution would be a mix of the above, call it option (3)
>
> 3.
> a) Keep the lock table entries for the first 64 subtransactions only,
> so that we limit shmem use. For those first 64 entries, have subtrans
> point directly to the topxid, since that makes a subtrans lookup an
> O(1) operation, which is important for the performance of later
> actions.
>
> b) For any subtransactions after the first 64, delete the subxid lock
> at subcommit, to save shmem, but make subtrans point to the immediate
> parent (only), not the topxid. That fixes the bug, but causes
> performance problems in XidInMVCCSnapshot() and others, so we also do
> c) and d).
>
> c) At top-level commit, go back and alter subtrans again for the
> subxids so that each entry points to the topxid, so that we avoid O(N)
> behavior in XidInMVCCSnapshot() and other callers. This is an
> additional cost for longer transactions, but it saves considerable
> cost in later callers that need to call
> SubTransGetTopmostTransaction().
>
> d) Optimize SubTransGetTopmostTransaction() so that it retrieves
> entries page-at-a-time. This will reduce the contention of repeatedly
> revisiting the same page(s), and make it less likely that a page is
> paged out while we are still using it.

I'm honestly not very sanguine about changing pg_subtrans to point
straight to the topmost XID. It's only OK to do that if there's
absolutely nothing that needs to know the full tree structure, and the
present case looks like an instance where we would like to have the
full tree structure. I would not be surprised if there are others.
That said, it seems a lot less likely that this would be an issue once
the top-level transaction is no longer running. At that point, all the
subtransactions are no longer running either: they either committed or
they rolled back, and I can't see a reason why any code should care
about anything other than which of those two things happened. So I
think your idea in (c) might have some possibilities.

You could also flip that idea around and have readers replace
immediate parent pointers with top-parent pointers opportunistically,
but I'm not sure that's actually any better. As you present it in (c)
above, there's a risk of going back and updating CLOG state that no
one will ever look at. But if you flipped it around and did it on the
read side, then you'd have the risk of a bunch of backends trying to
do it at the same time. I'm not sure whether either of those things is
a big problem in practice, or whether both are, or neither.

I agree that it looks possible to optimize
SubTransGetTopmostTransaction better for the case where we want to
traverse multiple or all levels, so that fewer pages are read. I don't
know to what degree that would affect user-observable performance, but
I can believe that it could be a win.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Bug in wait time when waiting on nested subtransaction

From: Alvaro Herrera
On 2022-Nov-28, Simon Riggs wrote:

> A narrative description of the issue follows:
> session1 - requests multiple nested subtransactions like this:
> BEGIN; ...
> SAVEPOINT subxid1; ...
> SAVEPOINT subxid2; ...

> However, if subxid2 subcommits, then the lock wait moves from subxid2
> to the topxid.

Hmm, do we really do that?  Seems very strange .. it sounds to me like
the lock should have been transferred to subxid1 (which is subxid2's
parent), not to the top-level Xid.  Maybe what the user wanted was to
release subxid1 before establishing subxid2?  Or do they want to
continue to be able to rollback to subxid1 after establishing subxid2?
(but why?)

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: Bug in wait time when waiting on nested subtransaction

From: Simon Riggs
On Mon, 28 Nov 2022 at 18:53, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2022-Nov-28, Simon Riggs wrote:
>
> > A narrative description of the issue follows:
> > session1 - requests multiple nested subtransactions like this:
> > BEGIN; ...
> > SAVEPOINT subxid1; ...
> > SAVEPOINT subxid2; ...
>
> > However, if subxid2 subcommits, then the lock wait moves from subxid2
> > to the topxid.
>
> Hmm, do we really do that?  Seems very strange .. it sounds to me like
> the lock should have been transferred to subxid1 (which is subxid2's
> parent), not to the top-level Xid.

Correct; that is exactly what I'm saying and why we have a bug since
3c27944fb2141.

> Maybe what the user wanted was to
> release subxid1 before establishing subxid2?  Or do they want to
> continue to be able to rollback to subxid1 after establishing subxid2?
> (but why?)

This isn't a description of a user's actions; it is a script that
illustrates the bug in XactLockTableWait().

Perhaps a better example would be nested code blocks with EXCEPTION
clauses where the outer block fails...
e.g.

DO $$
BEGIN  -- outer block: gets a subxact because it has an EXCEPTION clause
  PERFORM 1;

  BEGIN  -- inner block: a nested subxact
    PERFORM 1;
  EXCEPTION WHEN OTHERS THEN
    RAISE NOTICE 's2';
  END;  -- inner subxact subcommits here

  RAISE division_by_zero;  -- now back in outer subxact, which now fails

EXCEPTION WHEN OTHERS THEN
  RAISE NOTICE 's1';
END;$$;

Of course, debugging this is harder since there is no way to return
the current subxid in SQL.
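
(A tiny C extension could expose it for debugging. A minimal sketch,
with pg_current_subxact_xid as a made-up name:

    #include "postgres.h"
    #include "access/xact.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(pg_current_subxact_xid);

    /*
     * Returns the XID of the innermost active (sub)transaction,
     * assigning one if necessary -- unlike pg_current_xact_id(),
     * which reports the top-level transaction.
     *
     * SQL wrapper: CREATE FUNCTION pg_current_subxact_xid()
     *   RETURNS xid AS 'MODULE_PATHNAME' LANGUAGE C;
     */
    Datum
    pg_current_subxact_xid(PG_FUNCTION_ARGS)
    {
        return TransactionIdGetDatum(GetCurrentTransactionId());
    }
)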

-- 
Simon Riggs                http://www.EnterpriseDB.com/



Re: Bug in wait time when waiting on nested subtransaction

From: Simon Riggs
On Mon, 28 Nov 2022 at 17:38, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Nov 28, 2022 at 10:28 AM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> > So we have these options
> >
> > 1. Remove the XactLockTableDelete() call in CommitSubTransaction().
> > That call is what releases lock waiters earlier than expected, which
> > requires pushups in XactLockTableWait() to cope with it (pushups that
> > are currently broken). Why do we do it? To save shmem space in the
> > lock table, should anyone want to run a transaction that contains
> > thousands of subtransactions, or more. So option (1) alone would
> > eventually cause us to run out of space in the lock table, and a
> > transaction would receive ERRORs rather than be allowed to run for a
> > long time.
>
> This seems unprincipled to me. The point of having a lock on the
> subtransaction in the lock table is so that people can wait for the
> subtransaction to end. If we don't remove the lock table entry at
> subtransaction end, then that lock table entry doesn't serve that
> purpose any more.

An easy point to confuse:
"subtransaction to end": the subtransaction is "still running" as far
as other backends are concerned even AFTER it has subcommitted; its
state has merely transferred to the parent.

So the subtransaction doesn't cease running until it aborts, one of
its parents aborts, or the top level commits. The subxid lock should,
on principle, exist until one of those events occurs. It doesn't, but
that is an optimization, for the stated reason.

(All of the above is current behavior).

> > 2. In XactLockTableWait(), replace the call to
> > SubTransGetTopmostTransaction() with SubTransGetParent(), so that we
> > go up the levels one by one, as we did before. However, (2) causes
> > huge subtrans contention, and if we implemented and backpatched it the
> > performance issues could be significant. So my feeling is that if we
> > do (2) then we should not backpatch it.
>
> What I find suspicious about the coding of this function is that it
> doesn't distinguish between commits and aborts at all. Like, if a
> subtransaction commits, the changes don't become globally visible
> until the parent transaction also commits. If a subtransaction aborts,
> though, what happens to the top-level XID doesn't seem to matter at
> all. The comment seems to agree:
>
>  * Note that this does the right thing for subtransactions: if we wait on a
>  * subtransaction, we will exit as soon as it aborts or its top parent commits.
>
> I feel like what I'd expect to see, given this comment, is code which
> (1) waits until the supplied XID is no longer running, (2) checks
> whether the XID aborted, and if so returns at once, and (3) otherwise
> recurses to the parent XID. But the code doesn't do that. Perhaps
> that's not actually the right thing to do, since it seems like a big
> behavior change, but then I don't understand the comment.

As I mention above, the correct behavior is that the subxact doesn't
cease running until it aborts, one of its parents aborts, or the top
level commits.

Which is slightly different from the comment, which may explain why
the bug exists.

> Incidentally, one possible optimization here to reduce locking
> traffic would be to try waiting for the top-parent first using a
> conditional lock acquisition. If that works, cool. If not, go back
> around and work up the tree level by level. Since that path would only
> be taken in the unhappy case where we're probably going to have to
> wait anyway, the cost probably wouldn't be too bad.

That sounds like a potential bug fix (not just an optimization).

> > My preferred solution would be a mix of the above, call it option (3)
> >
> > 3.
> > a) Keep the lock table entries for the first 64 subtransactions only,
> > so that we limit shmem use. For those first 64 entries, have subtrans
> > point directly to the topxid, since that makes a subtrans lookup an
> > O(1) operation, which is important for the performance of later
> > actions.
> >
> > b) For any subtransactions after the first 64, delete the subxid lock
> > at subcommit, to save shmem, but make subtrans point to the immediate
> > parent (only), not the topxid. That fixes the bug, but causes
> > performance problems in XidInMVCCSnapshot() and others, so we also do
> > c) and d).
> >
> > c) At top-level commit, go back and alter subtrans again for the
> > subxids so that each entry points to the topxid, so that we avoid O(N)
> > behavior in XidInMVCCSnapshot() and other callers. This is an
> > additional cost for longer transactions, but it saves considerable
> > cost in later callers that need to call
> > SubTransGetTopmostTransaction().
> >
> > d) Optimize SubTransGetTopmostTransaction() so that it retrieves
> > entries page-at-a-time. This will reduce the contention of repeatedly
> > revisiting the same page(s), and make it less likely that a page is
> > paged out while we are still using it.
>
> I'm honestly not very sanguine about changing pg_subtrans to point
> straight to the topmost XID. It's only OK to do that if there's
> absolutely nothing that needs to know the full tree structure, and the
> present case looks like an instance where we would like to have the
> full tree structure. I would not be surprised if there are others.

There are no others. It's a short list and I've checked. I ask others
to do the same.

> That said, it seems a lot less likely that this would be an issue once
> the top-level transaction is no longer running. At that point, all the
> subtransactions are no longer running either: they either committed or
> they rolled back, and I can't see a reason why any code should care
> about anything other than which of those two things happened. So I
> think your idea in (c) might have some possibilities.
>
> You could also flip that idea around and have readers replace
> immediate parent pointers with top-parent pointers opportunistically,
> but I'm not sure that's actually any better. As you present it in (c)
> above, there's a risk of going back and updating CLOG state that no
> one will ever look at. But if you flipped it around and did it on the
> read side, then you'd have the risk of a bunch of backends trying to
> do it at the same time. I'm not sure whether either of those things is
> a big problem in practice, or whether both are, or neither.

(Subtrans state, not clog state)

But that puts the work on the reader rather than the writer.

> I agree that it looks possible to optimize
> SubTransGetTopmostTransaction better for the case where we want to
> traverse multiple or all levels, so that fewer pages are read. I don't
> know to what degree that would affect user-observable performance, but
> I can believe that it could be a win.

Good.

Thanks for commenting.

--
Simon Riggs                http://www.EnterpriseDB.com/



Re: Bug in wait time when waiting on nested subtransaction

From: Robert Haas
On Mon, Nov 28, 2022 at 2:45 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> An easy point to confuse:
> "subtransaction to end": the subtransaction is "still running" as far
> as other backends are concerned even AFTER it has subcommitted; its
> state has merely transferred to the parent.
>
> So the subtransaction doesn't cease running until it aborts, one of
> its parents aborts, or the top level commits. The subxid lock should,
> on principle, exist until one of those events occurs. It doesn't, but
> that is an optimization, for the stated reason.

That's not what "running" means to me. Running means it's started and
hasn't yet committed or rolled back.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Bug in wait time when waiting on nested subtransaction

From: Tom Lane
Robert Haas <robertmhaas@gmail.com> writes:
> That's not what "running" means to me. Running means it's started and
> hasn't yet committed or rolled back.

A subxact definitely can't be considered committed until its topmost
parent commits.  However, it could be known to be rolled back before
its parent.  IIUC, the current situation is that we don't take
advantage of the latter case but just wait for the topmost parent.

One thing we need to be pretty careful of here is to not break the
promise of atomic commit.  At topmost commit, all subxacts must
appear committed simultaneously.  It's not quite clear to me whether
we need a similar guarantee in the rollback case.  It seems like
we shouldn't, but maybe I'm missing something, in which case maybe
the current behavior is correct?

            regards, tom lane



Re: Bug in wait time when waiting on nested subtransaction

From: Robert Haas
On Mon, Nov 28, 2022 at 3:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> One thing we need to be pretty careful of here is to not break the
> promise of atomic commit.  At topmost commit, all subxacts must
> appear committed simultaneously.  It's not quite clear to me whether
> we need a similar guarantee in the rollback case.  It seems like
> we shouldn't, but maybe I'm missing something, in which case maybe
> the current behavior is correct?

AFAICS, the behavior that Simon is complaining about here is an
exception to the way things work in general, and therefore his
complaint is well-founded. AbortSubTransaction() arranges to release
resources, whereas CommitSubTransaction() arranges to have them
transferred to the parent transaction. This isn't necessarily obvious
from looking at those functions, but if you drill down into the
AtEO(Sub)Xact functions that they call, that's what happens in nearly
all cases. Given that subtransaction abort releases resources
immediately, it seems pretty fair to wonder what the value is in
waiting for its parent or the topmost transaction. I don't see how
that can be necessary for correctness.

The commit to which Simon points (3c27944fb2141) seems to have
inadvertently changed the behavior while trying to fix a bug and
improve performance. I remember being a bit skeptical about that fix
at the time. Back in the day, you couldn't XactLockTableWait() unless
you knew that the transaction had already started. That commit tried
to make it so that you could XactLockTableWait() earlier, because
that's apparently something that logical replication needs to do. But
that is a major redefinition of the charter of that function, and I am
wondering whether it was a mistake to fold together the thing that we
need in normal cases (which is to wait for a transaction we know has
started and may not have finished) with the thing we need in the
logical decoding case (which apparently has different requirements).
Maybe we should have solved that problem by finding a way to wait for
the transaction to start, and then afterwards wait for it to end. Or
maybe we should have altogether different entrypoints for the two
requirements. Or maybe using one function is fine but we just need it
to be more correct. I'm not really sure.

In short, I think Simon is right that there's a problem and right
about which commit caused it, but I'm not sure what I think we ought
to do about it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Bug in wait time when waiting on nested subtransaction

From: Simon Riggs
On Mon, 28 Nov 2022 at 21:10, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Nov 28, 2022 at 3:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > One thing we need to be pretty careful of here is to not break the
> > promise of atomic commit.  At topmost commit, all subxacts must
> > appear committed simultaneously.  It's not quite clear to me whether
> > we need a similar guarantee in the rollback case.  It seems like
> > we shouldn't, but maybe I'm missing something, in which case maybe
> > the current behavior is correct?
>
> In short, I think Simon is right that there's a problem and right
> about which commit caused it, but I'm not sure what I think we ought
> to do about it.

I'm comfortable with ignoring it, on the basis that it *is* a
performance optimization; but if we do ignore it, I suggest we keep
the test (with modified output) and document the behavior.

The really big issue is the loss of performance we get from having
subtrans point only to its immediate parent, which makes
XidInMVCCSnapshot() go really slow in the presence of lots of
subtransactions. So ignoring the issue on this thread will open the
door for the optimization posted in this patch:
https://commitfest.postgresql.org/40/3806/

--
Simon Riggs                http://www.EnterpriseDB.com/