Thread: Bug in wait time when waiting on nested subtransaction
Issue in current XactLockTableWait(), starting with 3c27944fb2141,
discovered while reviewing https://commitfest.postgresql.org/40/3806/

Test demonstrating the problem is 001-isotest-tuplelock-subxact.v1.patch

A narrative description of the issue follows:

session1 - requests multiple nested subtransactions like this:
BEGIN; ...
SAVEPOINT subxid1; ...
SAVEPOINT subxid2; ...

If another session2 sees an xid from subxid2, it waits. If subxid2 aborts,
then session2 sees the abort and can continue processing normally.

However, if subxid2 subcommits, then the lock wait moves from subxid2 to
the topxid. If subxid1 subsequently aborts, it will also abort subxid2,
but session2, waiting for subxid2 to complete, doesn't see this and waits
for the topxid instead. That means it waits longer than it should, and
later-arriving lock waiters may actually get served first.

So it's a bug, but not too awful: in many cases people don't use nested
subtransactions, and if they do use SAVEPOINTs, they don't often close
them using RELEASE. In most cases the extra wait time is not very long,
which is presumably why nobody has ever reported this issue.

Thanks to Robert Haas and Julien Tachoires for asking the question "are
you sure the existing coding is correct?". You were both right; it is not.

How to fix? A correct lock wait can be achieved while a subxid is running
if we have either
* a lock table entry for the subxid, OR
* a subtrans entry that points to its immediate parent

So we have these options

1. Removing the XactLockTableDelete() call in CommitSubTransaction().
That releases lock waiters earlier than expected, which requires pushups
in XactLockTableWait() to cope with that (which are currently broken).
Why do we do it? To save shmem space in the lock table should anyone want
to run a transaction that contains thousands of subtransactions, or more.
So option (1) alone would eventually cause us to run out of space in the
lock table and a transaction would receive ERRORs rather than be allowed
to run for a long time.

2. In XactLockTableWait(), replace the call to SubTransGetParent(), so we
go up the levels one by one as we did before. However, (2) causes huge
subtrans contention, and if we implemented that and backpatched it the
performance issues could be significant. So my feeling is that if we do
(2) then we should not backpatch it.

So both (1) and (2) have issues.

The main result from patch https://commitfest.postgresql.org/40/3806/ is
that having subtrans point direct to the topxid is very good for
performance in XidInMVCCSnapshot(), and presumably other places also.
This bug prevents the simple application of a patch to improve
performance. So now we need a stronger mix of ideas to both resolve the
bug and fix the subtrans contention issue in HEAD.

My preferred solution would be a mix of the above, call it option (3)

3.
a) Include the lock table entry for the first 64 subtransactions only,
so that we limit shmem. For those first 64 entries, have the subtrans
point direct to top, since this makes a subtrans lookup into an O(1)
operation, which is important for performance of later actions.

b) For any subtransactions after the first 64, delete the subxid lock on
subcommit, to save shmem, but make subtrans point to the immediate parent
(only), not the topxid. That fixes the bug, but causes performance
problems in XidInMVCCSnapshot() and others, so we also do c) and d)

c) At top-level commit, go back and alter subtrans again for subxids so
that it now points to the topxid, so that we avoid O(N) behavior in
XidInMVCCSnapshot() and other callers. Additional cost for longer
transactions, but it saves considerable cost in later callers that need
to call GetTopmostTransaction.

d) Optimize SubTransGetTopmostTransaction() so it retrieves entries
page-at-a-time. This will reduce the contention of repeatedly
re-visiting the same page(s) and ensure that a page is less often paged
out when we are still using it.

Thoughts?

--
Simon Riggs
http://www.EnterpriseDB.com/
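To make the failure mode above concrete, here is a toy Python model of the behavior described (all names, classes, and xids are invented for illustration; this is obviously not the real C code): once subxid2's lock table entry is deleted at subcommit, a waiter falls through to the topxid and never notices that subxid1's abort has also aborted subxid2.

```python
# Toy model of nested subtransactions and the post-3c27944fb2141 lock wait.
# Invented names throughout; a sketch of the narrative, not PostgreSQL code.
RUNNING, SUBCOMMITTED, ABORTED = "running", "subcommitted", "aborted"

class Xact:
    def __init__(self, xid, parent=None):
        self.xid = xid
        self.parent = parent      # immediate parent, as in pg_subtrans
        self.status = RUNNING
        self.lock_held = True     # models the lock table entry

    def topmost(self):
        x = self
        while x.parent is not None:
            x = x.parent
        return x

    def subcommit(self):
        # CommitSubTransaction: state transfers to the parent, and the
        # subxid's lock table entry is deleted (XactLockTableDelete).
        self.status = SUBCOMMITTED
        self.lock_held = False

    def abort(self):
        self.status = ABORTED
        self.lock_held = False

def wait_target(xid_map, xid):
    """Where the waiter ends up: if the subxid's lock table entry is
    already gone, it waits on the topmost xid instead."""
    x = xid_map[xid]
    return x.xid if x.lock_held else x.topmost().xid

top = Xact(100)
sub1 = Xact(101, parent=top)
sub2 = Xact(102, parent=sub1)
xid_map = {x.xid: x for x in (top, sub1, sub2)}

# session2 blocks on subxid2; subxid2 then subcommits (RELEASE SAVEPOINT):
sub2.subcommit()
assert wait_target(xid_map, 102) == 100   # waiter has moved to the topxid

# subxid1 now aborts, which also aborts the subcommitted subxid2...
sub1.abort()
sub2.abort()
# ...but the waiter is parked on the topxid, which is still running:
assert xid_map[wait_target(xid_map, 102)].status == RUNNING
```

In this model the "correct" wait would end at subxid1's abort; instead the waiter stays blocked until top-level commit, which is exactly the extra wait time described above.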
On Mon, Nov 28, 2022 at 10:28 AM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> So we have these options
>
> 1. Removing the XactLockTableDelete() call in CommitSubTransaction().
> That releases lock waiters earlier than expected, which requires
> pushups in XactLockTableWait() to cope with that (which are currently
> broken). Why do we do it? To save shmem space in the lock table should
> anyone want to run a transaction that contains thousands of
> subtransactions, or more. So option (1) alone would eventually cause
> us to run out of space in the lock table and a transaction would
> receive ERRORs rather than be allowed to run for a long time.

This seems unprincipled to me. The point of having a lock on the
subtransaction in the lock table is so that people can wait for the
subtransaction to end. If we don't remove the lock table entry at
subtransaction end, then that lock table entry doesn't serve that
purpose any more.

> 2. In XactLockTableWait(), replace the call to SubTransGetParent(), so
> we go up the levels one by one as we did before. However, (2) causes
> huge subtrans contention and if we implemented that and backpatched it
> the performance issues could be significant. So my feeling is that if
> we do (2) then we should not backpatch it.

What I find suspicious about the coding of this function is that it
doesn't distinguish between commits and aborts at all. Like, if a
subtransaction commits, the changes don't become globally visible until
the parent transaction also commits. If a subtransaction aborts, though,
what happens to the top-level XID doesn't seem to matter at all. The
comment seems to agree:

 * Note that this does the right thing for subtransactions: if we wait on a
 * subtransaction, we will exit as soon as it aborts or its top parent commits.

I feel like what I'd expect to see given this comment is code which (1)
waits until the supplied XID is no longer running, (2) checks whether
the XID aborted, and if so returns at once, and (3) otherwise recurses
to the parent XID. But the code doesn't do that. Perhaps that's not
actually the right thing to do, since it seems like a big behavior
change, but then I don't understand the comment.

Incidentally, one possible optimization here to reduce locking traffic
would be to try waiting for the top parent first using a conditional
lock acquisition. If that works, cool. If not, go back around and work
up the tree level by level. Since that path would only be taken in the
unhappy case where we're probably going to have to wait anyway, the cost
probably wouldn't be too bad.

> My preferred solution would be a mix of the above, call it option (3)
>
> 3.
> a) Include the lock table entry for the first 64 subtransactions only,
> so that we limit shmem. For those first 64 entries, have the subtrans
> point direct to top, since this makes a subtrans lookup into an O(1)
> operation, which is important for performance of later actions.
>
> b) For any subtransactions after first 64, delete the subxid lock on
> subcommit, to save shmem, but make subtrans point to the immediate
> parent (only), not the topxid. That fixes the bug, but causes
> performance problems in XidInMVCCSnapshot() and others, so we also do
> c) and d)
>
> c) At top level commit, go back and alter subtrans again for subxids
> so now it points to the topxid, so that we avoid O(N) behavior in
> XidInMVCCSnapshot() and other callers. Additional cost for longer
> transactions, but it saves considerable cost in later callers that
> need to call GetTopmostTransaction.
>
> d) Optimize SubTransGetTopmostTransaction() so it retrieves entries
> page-at-a-time. This will reduce the contention of repeatedly
> re-visiting the same page(s) and ensure that a page is less often
> paged out when we are still using it.

I'm honestly not very sanguine about changing pg_subtrans to point
straight to the topmost XID. It's only OK to do that if there's
absolutely nothing that needs to know the full tree structure, and the
present case looks like an instance where we would like to have the full
tree structure. I would not be surprised if there are others.

That said, it seems a lot less likely that this would be an issue once
the top-level transaction is no longer running. At that point, all the
subtransactions are no longer running either: they either committed or
they rolled back, and I can't see a reason why any code should care
about anything other than which of those two things happened. So I think
your idea in (c) might have some possibilities.

You could also flip that idea around and have readers replace immediate
parent pointers with top-parent pointers opportunistically, but I'm not
sure that's actually any better. As you present it in (c) above, there's
a risk of going back and updating CLOG state that no one will ever look
at. But if you flipped it around and did it on the read side, then you'd
have the risk of a bunch of backends trying to do it at the same time.
I'm not sure whether either of those things is a big problem in
practice, or whether both are, or neither.

I agree that it looks possible to optimize
SubTransGetTopmostTransaction() better for the case where we want to
traverse multiple or all levels, so that fewer pages are read. I don't
know to what degree that would affect user-observable performance, but I
can believe that it could be a win.

--
Robert Haas
EDB: http://www.enterprisedb.com
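For discussion, the level-by-level semantics Robert describes above, "wait, check for abort, otherwise climb to the immediate parent", might be sketched like this (a toy Python model with invented names; `parent_of` and `status_of` stand in for pg_subtrans and clog lookups, and `wait_for` stands in for the lock table wait):

```python
# Sketch of the loop Robert describes: (1) wait until the supplied XID is
# no longer running, (2) return at once if it aborted, (3) otherwise
# climb one level to the immediate parent. Not PostgreSQL code.
def xact_lock_table_wait(xid, parent_of, status_of, wait_for):
    """Returns the chain of xids actually waited on, for illustration."""
    chain = []
    while xid is not None:
        wait_for(xid)                    # (1) block until xid's lock entry is gone
        chain.append(xid)
        if status_of(xid) == "aborted":  # (2) abort: changes can never commit, done
            break
        xid = parent_of(xid)             # (3) subcommit: climb one level only
    return chain

parents = {102: 101, 101: 100, 100: None}

# Scenario from earlier in the thread: subxid1 rolled back, taking the
# subcommitted subxid2 with it, while the topxid is still running:
status = {102: "aborted", 101: "aborted", 100: "running"}
chain = xact_lock_table_wait(102, parents.get, status.get, lambda x: None)
assert chain == [102]   # released immediately, never waits on the topxid

# If subxid2 had merely subcommitted, we climb to subxid1 and stop there:
status2 = {102: "subcommitted", 101: "aborted", 100: "running"}
chain2 = xact_lock_table_wait(102, parents.get, status2.get, lambda x: None)
assert chain2 == [102, 101]
```

This only works if pg_subtrans records immediate parents, which is exactly the tension with making it point straight to the topmost XID.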
On 2022-Nov-28, Simon Riggs wrote:
> A narrative description of the issue follows:
> session1 - requests multiple nested subtransactions like this:
> BEGIN; ...
> SAVEPOINT subxid1; ...
> SAVEPOINT subxid2; ...

> However, if subxid2 subcommits, then the lock wait moves from subxid2
> to the topxid.

Hmm, do we really do that? Seems very strange .. it sounds to me like
the lock should have been transferred to subxid1 (which is subxid2's
parent), not to the top-level Xid.

Maybe what the user wanted was to release subxid1 before establishing
subxid2? Or do they want to continue to be able to rollback to subxid1
after establishing subxid2? (but why?)

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
On Mon, 28 Nov 2022 at 18:53, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2022-Nov-28, Simon Riggs wrote:
>
> > A narrative description of the issue follows:
> > session1 - requests multiple nested subtransactions like this:
> > BEGIN; ...
> > SAVEPOINT subxid1; ...
> > SAVEPOINT subxid2; ...
>
> > However, if subxid2 subcommits, then the lock wait moves from subxid2
> > to the topxid.
>
> Hmm, do we really do that? Seems very strange .. it sounds to me like
> the lock should have been transferred to subxid1 (which is subxid2's
> parent), not to the top-level Xid.

Correct; that is exactly what I'm saying, and why we have a bug since
3c27944fb2141.

> Maybe what the user wanted was to
> release subxid1 before establishing subxid2? Or do they want to
> continue to be able to rollback to subxid1 after establishing subxid2?
> (but why?)

This isn't a description of a user's actions; it is a script that
illustrates the bug in XactLockTableWait(). Perhaps a better example
would be nested code blocks with EXCEPTION clauses where the outer block
fails, e.g.

DO $$
BEGIN
  PERFORM 1;
  BEGIN
    PERFORM 1;
  EXCEPTION WHEN OTHERS THEN
    RAISE NOTICE 's2';
  END;
  -- now back in the outer subxact, which now fails
  RAISE division_by_zero;
EXCEPTION WHEN OTHERS THEN
  RAISE NOTICE 's1';
END;
$$;

Of course, debugging this is harder since there is no way to return the
current subxid in SQL.

--
Simon Riggs
http://www.EnterpriseDB.com/
On Mon, 28 Nov 2022 at 17:38, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Nov 28, 2022 at 10:28 AM Simon Riggs
> <simon.riggs@enterprisedb.com> wrote:
> > So we have these options
> >
> > 1. Removing the XactLockTableDelete() call in CommitSubTransaction().
> > That releases lock waiters earlier than expected, which requires
> > pushups in XactLockTableWait() to cope with that (which are currently
> > broken). Why do we do it? To save shmem space in the lock table should
> > anyone want to run a transaction that contains thousands of
> > subtransactions, or more. So option (1) alone would eventually cause
> > us to run out of space in the lock table and a transaction would
> > receive ERRORs rather than be allowed to run for a long time.
>
> This seems unprincipled to me. The point of having a lock on the
> subtransaction in the lock table is so that people can wait for the
> subtransaction to end. If we don't remove the lock table entry at
> subtransaction end, then that lock table entry doesn't serve that
> purpose any more.

An easy point to confuse: "subtransaction to end". The subtransaction is
"still running" to other backends even AFTER it has been subcommitted,
but its state now transfers to the parent.

So the subtransaction doesn't cease running until it aborts, one of its
parents aborts, or top-level commit occurs. The subxid lock should, on
principle, exist until one of those events occurs. It doesn't, but that
is an optimization, for the stated reason. (All of the above is current
behavior.)

> > 2. In XactLockTableWait(), replace the call to SubTransGetParent(), so
> > we go up the levels one by one as we did before. However, (2) causes
> > huge subtrans contention and if we implemented that and backpatched it
> > the performance issues could be significant. So my feeling is that if
> > we do (2) then we should not backpatch it.
>
> What I find suspicious about the coding of this function is that it
> doesn't distinguish between commits and aborts at all. Like, if a
> subtransaction commits, the changes don't become globally visible
> until the parent transaction also commits. If a subtransaction aborts,
> though, what happens to the top-level XID doesn't seem to matter at
> all. The comment seems to agree:
>
> * Note that this does the right thing for subtransactions: if we wait on a
> * subtransaction, we will exit as soon as it aborts or its top parent commits.
>
> I feel like what I'd expect to see given this comment is code which
> (1) waits until the supplied XID is no longer running, (2) checks
> whether the XID aborted, and if so return at once, and (3) otherwise
> recurse to the parent XID. But the code doesn't do that. Perhaps
> that's not actually the right thing to do, since it seems like a big
> behavior change, but then I don't understand the comment.

As I mention above, the correct behavior is that the subxact doesn't
cease running until it aborts, one of its parents aborts, or top-level
commit occurs. Which is slightly different from the comment, which may
explain why the bug exists.

> Incidentally, one possible optimization here to reduce locking
> traffic would be to try waiting for the top parent first using a
> conditional lock acquisition. If that works, cool. If not, go back
> around and work up the tree level by level. Since that path would only
> be taken in the unhappy case where we're probably going to have to
> wait anyway, the cost probably wouldn't be too bad.

That sounds like a potential bug fix (not just an optimization).

> > My preferred solution would be a mix of the above, call it option (3)
> >
> > 3.
> > a) Include the lock table entry for the first 64 subtransactions only,
> > so that we limit shmem. For those first 64 entries, have the subtrans
> > point direct to top, since this makes a subtrans lookup into an O(1)
> > operation, which is important for performance of later actions.
> >
> > b) For any subtransactions after first 64, delete the subxid lock on
> > subcommit, to save shmem, but make subtrans point to the immediate
> > parent (only), not the topxid. That fixes the bug, but causes
> > performance problems in XidInMVCCSnapshot() and others, so we also do
> > c) and d)
> >
> > c) At top level commit, go back and alter subtrans again for subxids
> > so now it points to the topxid, so that we avoid O(N) behavior in
> > XidInMVCCSnapshot() and other callers. Additional cost for longer
> > transactions, but it saves considerable cost in later callers that
> > need to call GetTopmostTransaction.
> >
> > d) Optimize SubTransGetTopmostTransaction() so it retrieves entries
> > page-at-a-time. This will reduce the contention of repeatedly
> > re-visiting the same page(s) and ensure that a page is less often
> > paged out when we are still using it.
>
> I'm honestly not very sanguine about changing pg_subtrans to point
> straight to the topmost XID. It's only OK to do that if there's
> absolutely nothing that needs to know the full tree structure, and the
> present case looks like an instance where we would like to have the
> full tree structure. I would not be surprised if there are others.

There are no others. It's a short list and I've checked. I ask others to
do the same.

> That said, it seems a lot less likely that this would be an issue once
> the top-level transaction is no longer running. At that point, all the
> subtransactions are no longer running either: they either committed or
> they rolled back, and I can't see a reason why any code should care
> about anything other than which of those two things happened. So I
> think your idea in (c) might have some possibilities.
>
> You could also flip that idea around and have readers replace
> immediate parent pointers with top-parent pointers opportunistically,
> but I'm not sure that's actually any better. As you present it in (c)
> above, there's a risk of going back and updating CLOG state that no
> one will ever look at. But if you flipped it around and did it on the
> read side, then you'd have the risk of a bunch of backends trying to
> do it at the same time. I'm not sure whether either of those things is
> a big problem in practice, or whether both are, or neither.

(Subtrans state, not clog state.) But that puts the work on the reader
rather than the writer.

> I agree that it looks possible to optimize
> SubTransGetTopmostTransaction better for the case where we want to
> traverse multiple or all levels, so that fewer pages are read. I don't
> know to what degree that would affect user-observable performance, but
> I can believe that it could be a win.

Good. Thanks for commenting.

--
Simon Riggs
http://www.EnterpriseDB.com/
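As a footnote on (d), the page-at-a-time idea can be modeled like this (a toy Python sketch; the per-page entry count and all names are invented, and the real pg_subtrans layout depends on BLCKSZ and SLRU internals). The point is simply that when successive parent pointers land on the same page, caching the current page turns N page lookups into one:

```python
# Toy model of SubTransGetTopmostTransaction with and without page caching.
# ENTRIES_PER_PAGE is an invented constant; not PostgreSQL code.
ENTRIES_PER_PAGE = 2048

def topmost_naive(parent, xid, fetches):
    # One simulated page lookup per hop up the parent chain.
    while parent[xid] is not None:
        fetches.append(xid // ENTRIES_PER_PAGE)
        xid = parent[xid]
    return xid

def topmost_paged(parent, xid, fetches):
    # Keep the current page while successive parents stay on it.
    cached_page = None
    while parent[xid] is not None:
        page = xid // ENTRIES_PER_PAGE
        if page != cached_page:          # fetch only on page change
            fetches.append(page)
            cached_page = page
        xid = parent[xid]
    return xid

# A chain of 999 subxids whose parent pointers all sit on the same page:
parent = {x: x - 1 for x in range(1001, 2000)}
parent[1000] = None
f1, f2 = [], []
assert topmost_naive(parent, 1999, f1) == 1000
assert topmost_paged(parent, 1999, f2) == 1000
assert len(f1) == 999 and len(f2) == 1   # one page read instead of 999
```

The same caching also helps the partial-traversal case, since long parent chains tend to be clustered on few pages.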
On Mon, Nov 28, 2022 at 2:45 PM Simon Riggs
<simon.riggs@enterprisedb.com> wrote:
> An easy point to confuse:
> "subtransaction to end": The subtransaction is "still running" to
> other backends even AFTER it has been subcommitted, but its state now
> transfers to the parent.
>
> So the subtransaction doesn't cease running until it aborts, one of
> its parents aborts or top level commit. The subxid lock should, on
> principle, exist until one of those events occurs. It doesn't, but
> that is an optimization, for the stated reason.

That's not what "running" means to me. Running means it's started and
hasn't yet committed or rolled back.

--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> That's not what "running" means to me. Running means it's started and
> hasn't yet committed or rolled back.

A subxact definitely can't be considered committed until its topmost
parent commits. However, it could be known to be rolled back before its
parent. IIUC, the current situation is that we don't take advantage of
the latter case but just wait for the topmost parent.

One thing we need to be pretty careful of here is to not break the
promise of atomic commit. At topmost commit, all subxacts must appear
committed simultaneously. It's not quite clear to me whether we need a
similar guarantee in the rollback case. It seems like we shouldn't, but
maybe I'm missing something, in which case maybe the current behavior is
correct?

			regards, tom lane
On Mon, Nov 28, 2022 at 3:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> One thing we need to be pretty careful of here is to not break the
> promise of atomic commit. At topmost commit, all subxacts must
> appear committed simultaneously. It's not quite clear to me whether
> we need a similar guarantee in the rollback case. It seems like
> we shouldn't, but maybe I'm missing something, in which case maybe
> the current behavior is correct?

AFAICS, the behavior that Simon is complaining about here is an
exception to the way things work in general, and therefore his complaint
is well-founded. AbortSubTransaction() arranges to release resources,
whereas CommitSubTransaction() arranges to have them transferred to the
parent transaction. This isn't necessarily obvious from looking at those
functions, but if you drill down into the AtEO(Sub)Xact functions that
they call, that's what happens in nearly all cases. Given that
subtransaction abort releases resources immediately, it seems pretty
fair to wonder what the value is in waiting for its parent or the
topmost transaction. I don't see how that can be necessary for
correctness.

The commit to which Simon points (3c27944fb2141) seems to have
inadvertently changed the behavior while trying to fix a bug and improve
performance. I remember being a bit skeptical about that fix at the
time. Back in the day, you couldn't XactLockTableWait() unless you knew
that the transaction had already started. That commit tried to make it
so that you could XactLockTableWait() earlier, because that's apparently
something that logical replication needs to do. But that is a major
redefinition of the charter of that function, and I am wondering whether
it was a mistake to fold together the thing that we need in normal cases
(which is to wait for a transaction we know has started and may not have
finished) with the thing we need in the logical decoding case (which
apparently has different requirements).

Maybe we should have solved that problem by finding a way to wait for
the transaction to start, and then afterwards wait for it to end. Or
maybe we should have altogether different entrypoints for the two
requirements. Or maybe using one function is fine but we just need it to
be more correct. I'm not really sure.

In short, I think Simon is right that there's a problem and right about
which commit caused it, but I'm not sure what I think we ought to do
about it.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, 28 Nov 2022 at 21:10, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Nov 28, 2022 at 3:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > One thing we need to be pretty careful of here is to not break the
> > promise of atomic commit. At topmost commit, all subxacts must
> > appear committed simultaneously. It's not quite clear to me whether
> > we need a similar guarantee in the rollback case. It seems like
> > we shouldn't, but maybe I'm missing something, in which case maybe
> > the current behavior is correct?
>
> In short, I think Simon is right that there's a problem and right
> about which commit caused it, but I'm not sure what I think we ought
> to do about it.

I'm comfortable with ignoring it, on the basis that it *is* a
performance optimization, but I suggest we keep the test (with modified
output) and document the behavior, if we do.

The really big issue is the loss of performance we get from having
subtrans point only to its immediate parent, which makes
XidInMVCCSnapshot() go really slow in the presence of lots of
subtransactions. So ignoring the issue on this thread will open the door
for the optimization posted for this patch:
https://commitfest.postgresql.org/40/3806/

--
Simon Riggs
http://www.EnterpriseDB.com/