Thread: Logical Decoding and HeapTupleSatisfiesVacuum assumptions
Hi,

As of today, in the existing code in HeapTupleSatisfiesVacuum, if a row is inserted via an aborted transaction, then it's deemed to be dead immediately and can be cleaned up for re-use by HOT or vacuum, etc.

With logical decoding, there might arise a case that such a row, if it belongs to a system catalog table or even a user catalog table (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=66abc2608c7c00fcd449e00a9e23f13f02e65d04), might be being used for decoding at the very same moment that the abort came in. If the row is re-claimed immediately, then it's possible that the decoding happening alongside might face issues.

The main issue here is that HeapTupleSatisfiesVacuum *assumes* that rows belonging to an aborted transaction are not visible to anyone else. As mentioned above, that assumption needs to change when logical decoding is active and the row belongs to catalog tables. This was not a problem before, but now there are at least two patches out there which challenge this assumption. One patch is for logical decoding of 2PC, in which case a PREPARED transaction might be getting decoded while a concurrent ROLLBACK PREPARED happens alongside. Another patch is about streaming logical changes before commit/abort.

The provided patch changes that assumption. We now pass in an additional argument to HeapTupleSatisfiesVacuum() to indicate whether the current row belongs to a catalog table in the logical decoding environment. We call the existing RelationIsAccessibleInLogicalDecoding() function to ascertain if it's a catalog table worth considering for this case. If so, we return HEAPTUPLE_RECENTLY_DEAD if the xmin of the row is newer than OldestXmin. So, for the following case, we will now return HEAPTUPLE_RECENTLY_DEAD where earlier we would have returned HEAPTUPLE_DEAD:

BEGIN;
ALTER TABLE subscribed_table ADD COLUMN ...;
....
ABORT;

VACUUM pg_attribute;

This is essentially the same logic that is used today for rows that got deleted (via update or delete). We return HEAPTUPLE_RECENTLY_DEAD for such deleted rows if the xmax of the row is newer than OldestXmin, because some open transactions can still see the tuple.

With the changes to HeapTupleSatisfiesVacuum, we also had to tweak the logic in heap_prune_chain(). That function assumes that only updated tuples can return HEAPTUPLE_RECENTLY_DEAD. That assumption is no longer true, so we now check explicitly that the HeapTupleHeaderGetUpdateXid value is valid before further processing.

The patch had to make changes in all locations where HeapTupleSatisfiesVacuum gets called, and I have done multiple "make check-world" runs with cassert enabled and things run fine. Also, since it only sets the flag for system/user catalog tables, and that too ONLY in the logical decoding environment, it does not cause any performance issues in normal circumstances.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
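For readers following along, here is a small self-contained sketch of the decision rule described above, covering only the aborted-xmin branch. The names (aborted_insert_status, is_catalog_rel, in_logical_decoding) are illustrative and not taken from the patch; the real change lives inside HeapTupleSatisfiesVacuum() itself and uses the wraparound-aware XID comparison functions.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    typedef enum
    {
        TUPLE_DEAD,             /* models HEAPTUPLE_DEAD */
        TUPLE_RECENTLY_DEAD     /* models HEAPTUPLE_RECENTLY_DEAD */
    } TupleVacuumStatus;

    /*
     * A row inserted by an aborted transaction is normally dead immediately,
     * but for catalog rows in a logical decoding environment the patch keeps
     * it "recently dead" while its xmin is still newer than OldestXmin, so a
     * concurrent decoder can still look it up.
     */
    static TupleVacuumStatus
    aborted_insert_status(TransactionId xmin,
                          TransactionId oldest_xmin,
                          bool is_catalog_rel,       /* per RelationIsAccessibleInLogicalDecoding() */
                          bool in_logical_decoding)
    {
        /* plain >= here; the backend would use TransactionIdFollowsOrEquals()
         * to handle XID wraparound */
        if (is_catalog_rel && in_logical_decoding && xmin >= oldest_xmin)
            return TUPLE_RECENTLY_DEAD;
        return TUPLE_DEAD;      /* safe to prune and reclaim immediately */
    }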
On 26 December 2017 at 14:21, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
> With logical decoding, there might arise a case that such a row, if it
> belongs to a system catalog table or even a user catalog table
> (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=66abc2608c7c00fcd449e00a9e23f13f02e65d04),
> then it might be being used for decoding at the very same moment that
> the abort came in. If the row is re-claimed immediately, then it's
> possible that the decoding happening alongside might face issues.

If we want to make this change, I'd like to see something that explains exactly how the problem in logical decoding occurs, or preferably a test case. I can't understand the problem by reading the archives. I'm not suggesting this patch doesn't work, I'm thinking about whether there are other ways.

Surely if you are decoding a transaction and a concurrent abort is requested then decoding should be interrupted at the next sensible point. Allowing the two actions to occur without interlock is an issue, so I suggest we just don't do it, rather than allow it and fix subsequent breakage, which is what I understand this patch to do.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 01/19/2018 03:34 PM, Simon Riggs wrote:
> On 26 December 2017 at 14:21, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
>
>> With logical decoding, there might arise a case that such a row, if it
>> belongs to a system catalog table or even a user catalog table
>> (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=66abc2608c7c00fcd449e00a9e23f13f02e65d04),
>> then it might be being used for decoding at the very same moment that
>> the abort came in. If the row is re-claimed immediately, then it's
>> possible that the decoding happening alongside might face issues.
>
> If we want to make this change, I'd like to see something that
> explains exactly how the problem in logical decoding occurs, or
> preferably a test case. I can't understand the problem by reading the
> archives. I'm not suggesting this patch doesn't work, I'm thinking
> about whether there are other ways.
>
> Surely if you are decoding a transaction and a concurrent abort is
> requested then decoding should be interrupted at the next sensible
> point. Allowing the two actions to occur without interlock is an
> issue, so I suggest we just don't do it, rather than allow it and fix
> subsequent breakage, which is what I understand this patch to do.
>

I don't quite understand how the interlock would work? Can you elaborate how that would be implemented? Consider that logical decoding

(a) is a read-only process, so it can't modify database state (e.g. locking a row in a catalog)

(b) is an asynchronous process, i.e. it may be happening quite a bit of time after the changes actually happened

An example of a problematic transaction was already outlined in Nikhil's post, i.e. it's a transaction that

(1) does DDL, e.g. to add a column to the decoded table

(2) inserts some data into the table (with the new column)

(3) aborts (possibly hours / gigabytes of WAL in the future)

Currently, vacuum can just go and clean up the catalog rows added in (1) because the transaction is aborted (and postgres already knows that), and we do ignore OldestXmin for catalog rows. But if you try to decode the changes (and hand them over to the plugin for apply) without waiting for the final commit/abort, that will fail, because it will read changes from WAL with a column missing in any existing catalog row. Kabooooom!

Until now this was not an issue, because the decoding happens at commit time (or more precisely, we do decode the changes incrementally, but the apply only happens on commit). But it's an issue for both Nikhil's patch and mine.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 26, 2017 at 9:21 AM, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
> The main issue here is that HeapTupleSatisfiesVacuum *assumes* that
> rows belonging to an aborted transaction are not visible to anyone
> else.

One problem here is that if a transaction aborts, it might have done so after inserting or updating a tuple in the heap and before inserting new index entries for the tuple, or after inserting only some of the necessary new index entries. Therefore, even if you prevent pruning, a snapshot from the point of view of the aborted transaction may be inconsistent. Similarly, if it aborts during a DDL operation, it may have made some but not all of the catalog changes involved, so that for example pg_class and pg_attribute could be inconsistent with each other, or various pg_attribute rows could even be inconsistent among themselves. If you have a view of the catalog where these problems exist, you can't rely on, for example, being able to build a relcache entry without error. It is possible that you can avoid these problems if your snapshot is always using a command ID value that was reached prior to the error, although I'm not 100% sure that idea has no holes.

Another problem is that CTID chains may be broken. Suppose that a transaction T1, using CID 1, does a HOT update of tuple A1 producing a new version A2. Then, later on, when the CID counter is at least 2, it aborts. A snapshot taken from the point of view of T1 at CID 1 should see A2. That will work fine most of the time. However, if transaction T2 comes along after T1 aborts and before logical decoding gets there, and does its own HOT update of tuple A1 producing a new version A3, then tuple A2 is inaccessible through the indexes even if it still exists in the heap page. I think this problem is basically unsolvable and likely means that this whole approach needs to be abandoned.

One other issue to consider is that the tuple freezing code assumes that any tuple that does not get removed when a page is pruned is OK to freeze. Commit 9c2f0a6c3cc8bb85b78191579760dbe9fb7814ec was necessary to repair a case where that assumption was violated. You might want to consider carefully whether there's any chance that this patch could introduce a similar problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 01/19/2018 06:54 PM, Robert Haas wrote:
> On Tue, Dec 26, 2017 at 9:21 AM, Nikhil Sontakke
> <nikhils@2ndquadrant.com> wrote:
>> The main issue here is that HeapTupleSatisfiesVacuum *assumes* that
>> rows belonging to an aborted transaction are not visible to anyone
>> else.
>
> One problem here is that if a transaction aborts, it might have done
> so after inserting or updating a tuple in the heap and before inserting
> new index entries for the tuple, or after inserting only some of the
> necessary new index entries.
> ... [snip] ...

I think an important piece of this puzzle is that we only really care about catalog changes made in a transaction that aborts after doing some additional changes, with that catalog tuple in place. Because only then we actually need that catalog tuple in order to interpret the changes.

AFAICS that guarantees the catalog changes were not interrupted half-way through, leaving some of the catalogs in inconsistent state.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 19, 2018 at 2:16 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I think an important piece of this puzzle is that we only really care
> about catalog changes made in a transaction that aborts after doing some
> additional changes, with that catalog tuple in place. Because only then
> we actually need that catalog tuple in order to interpret the changes.
>
> AFAICS that guarantees the catalog changes were not interrupted half-way
> through, leaving some of the catalogs in inconsistent state.

Yeah, that may be true, and I alluded to it in the part you didn't quote. However, it doesn't help with the second problem I mentioned, which looks to me to be a fatal problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 01/19/2018 08:33 PM, Robert Haas wrote:
> On Fri, Jan 19, 2018 at 2:16 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I think an important piece of this puzzle is that we only really care
>> about catalog changes made in a transaction that aborts after doing some
>> additional changes, with that catalog tuple in place. Because only then
>> we actually need that catalog tuple in order to interpret the changes.
>>
>> AFAICS that guarantees the catalog changes were not interrupted half-way
>> through, leaving some of the catalogs in inconsistent state.
>
> Yeah, that may be true, and I alluded to it in the part you didn't
> quote. However, it doesn't help with the second problem I mentioned,
> which looks to me to be a fatal problem.
>

Ah, OK - I didn't realize "using a command ID value that was reached prior to the error" refers to this scenario.

Regarding the HOT issue - I have to admit I don't quite see why A2 wouldn't be reachable through the index, but that's likely due to my limited knowledge of the HOT internals.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 19, 2018 at 5:19 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Regarding the HOT issue - I have to admit I don't quite see why A2
> wouldn't be reachable through the index, but that's likely due to my
> limited knowledge of the HOT internals.

The index entries only point to the root tuple in the HOT chain. Any subsequent entries can only be reached by following the CTID pointers (that's why they are called "Heap Only Tuples"). After T1 aborts, we're still OK because the CTID link isn't immediately cleared. But after T2 updates the tuple, it makes A1's CTID link point to A3, leaving no remaining link to A2.

Although in most respects PostgreSQL treats commits and aborts surprisingly symmetrically, CTID links are an exception. When T2 comes to A1, it sees that A1's xmax is T1 and checks the status of T1. If T1 is still in progress, it waits. If T1 has committed, it must either abort with a serialization error or update A2 instead under EvalPlanQual semantics, depending on the isolation level. If T1 has aborted, it assumes that the CTID field of T1 is garbage nobody cares about, adds A3 to the page, and makes A1 point to A3 instead of A2. No record of the A1->A2 link is kept anywhere *precisely because* A2 can no longer be visible to anyone.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
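To make the A1/A2/A3 scenario above easier to follow, here is a tiny self-contained model of just the version chain. It is purely illustrative - it tracks only the "next version" links, not real heap pages, xmin/xmax or index entries: after T2's update, walking from the root A1 reaches A3 and nothing points at A2 any more.

    #include <stdio.h>

    /* A tuple version: a name and its forward CTID-style link. */
    typedef struct Tup
    {
        const char *name;
        struct Tup *next_version;
    } Tup;

    int main(void)
    {
        Tup A3 = {"A3", NULL};
        Tup A2 = {"A2", NULL};
        Tup A1 = {"A1", &A2};   /* T1's HOT update: index -> A1 -> A2 */

        /* T1 aborts; the A1 -> A2 link is left in place for now, so a
         * snapshot taken from T1's point of view could still reach A2. */

        /* T2 sees A1's xmax (T1) aborted, treats A1's forward link as
         * garbage, and repoints the chain at its own new version A3. */
        A1.next_version = &A3;

        /* Walking from the index's root entry now visits A1, A3 - never A2. */
        for (Tup *t = &A1; t != NULL; t = t->next_version)
            printf("%s\n", t->name);

        return 0;
    }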
Hi,

On 20/01/18 00:52, Robert Haas wrote:
> On Fri, Jan 19, 2018 at 5:19 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Regarding the HOT issue - I have to admit I don't quite see why A2
>> wouldn't be reachable through the index, but that's likely due to my
>> limited knowledge of the HOT internals.
>
> The index entries only point to the root tuple in the HOT chain. Any
> subsequent entries can only be reached by following the CTID pointers
> (that's why they are called "Heap Only Tuples"). After T1 aborts,
> we're still OK because the CTID link isn't immediately cleared. But
> after T2 updates the tuple, it makes A1's CTID link point to A3,
> leaving no remaining link to A2.
>
> [snip]
>
> No record of the A1->A2 link is kept anywhere *precisely because* A2
> can no longer be visible to anyone.
>

I think this is the only real problem from your list for logical decoding catalog snapshots. But it's indeed quite a bad one. Is there something preventing us from removing the assumption that the CTID of T1 is garbage nobody cares about? I guess turning off HOT for catalogs is not an option :)

General problem is that we have a couple of assumptions (HeapTupleSatisfiesVacuum being one, what you wrote is another) about tuples from aborted transactions not being read by anybody. But if we want to add decoding of 2PC or transaction streaming that's no longer true, so I think we should try to remove that assumption (even if we do it only for catalogs, since that's what we care about).

The other option would be to make sure 2PC decoding/tx streaming does not read aborted transactions, but that would mean locking the transaction every time we give control to the output plugin. Given that the output plugin may do network writes, this would really mean locking the transaction for an unbounded period of time. That does not strike me as something we want to do; decoding should not interact with frontend transaction management, definitely not this badly.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 22, 2018 at 10:40 AM, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> I think this is the only real problem from your list for logical
> decoding catalog snapshots. But it's indeed quite a bad one. Is there
> something preventing us from removing the assumption that the CTID of T1
> is garbage nobody cares about? I guess turning off HOT for catalogs is
> not an option :)

It doesn't seem like a great idea. For a lot of workloads it wouldn't hurt, but for some it might cause a lot of catalog bloat. Also, Tom would probably hate it with a fiery passion, since as I understand it he argued passionately for HOT to work for both catalog and non-catalog tables back when it was first implemented.

CTID chain integrity is important for non-HOT updates too, at least if any SELECT .. FOR UPDATE/SHARE or UPDATE operations are taking place. Now, I suppose we're not going to do any of that during logical decoding, but I guess I can't say I'm absolutely positive that there are no other problem cases. If there aren't, you can imagine disallowing HOT updates only on catalogs and only when xmax is newer than the newest XID we'll never need to decode, or more narrowly still, only when it's in a list of XIDs currently being decoded.

Independently of CTID, I also wonder if there's a risk of us trying to read from a multixact that's been truncated away. I haven't checked the multixact code in detail, but I bet there's no provision to keep around multixacts that are only interesting to aborted transactions. Conversely, what keeps us from freezing the xmin of a tuple that is invisible only to some aborted transaction, but visible to all committed transactions? Or just marking the page all-visible?

> General problem is that we have a couple of assumptions
> (HeapTupleSatisfiesVacuum being one, what you wrote is another) about
> tuples from aborted transactions not being read by anybody. But if we
> want to add decoding of 2PC or transaction streaming that's no longer
> true, so I think we should try to remove that assumption (even if we do
> it only for catalogs, since that's what we care about).

I'm extremely skeptical about this idea. I think that assumption is fairly likely to be embedded in subtle ways in several more places that we haven't thought about yet. Unless we're very, very careful about doing something like this, we could end up with a system that mostly seems to work but is actually unreliable in ways that can't be fixed without non-back-patchable changes.

> The other option would be to make sure 2PC decoding/tx streaming does
> not read aborted transactions, but that would mean locking the
> transaction every time we give control to the output plugin. Given that
> the output plugin may do network writes, this would really mean locking
> the transaction for an unbounded period of time. That does not strike me
> as something we want to do; decoding should not interact with frontend
> transaction management, definitely not this badly.

Yeah, I don't think it's acceptable to postpone abort indefinitely.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 22/01/18 20:21, Robert Haas wrote:
> On Mon, Jan 22, 2018 at 10:40 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> I think this is the only real problem from your list for logical
>> decoding catalog snapshots. But it's indeed quite a bad one. Is there
>> something preventing us from removing the assumption that the CTID of T1
>> is garbage nobody cares about? I guess turning off HOT for catalogs is
>> not an option :)
>
> It doesn't seem like a great idea. For a lot of workloads it wouldn't
> hurt, but for some it might cause a lot of catalog bloat. Also, Tom
> would probably hate it with a fiery passion, since as I understand it
> he argued passionately for HOT to work for both catalog and
> non-catalog tables back when it was first implemented.
>

I wasn't really serious about that proposal.

>> General problem is that we have a couple of assumptions
>> (HeapTupleSatisfiesVacuum being one, what you wrote is another) about
>> tuples from aborted transactions not being read by anybody. But if we
>> want to add decoding of 2PC or transaction streaming that's no longer
>> true, so I think we should try to remove that assumption (even if we do
>> it only for catalogs, since that's what we care about).
>
> I'm extremely skeptical about this idea. I think that assumption is
> fairly likely to be embedded in subtle ways in several more places
> that we haven't thought about yet. Unless we're very, very careful
> about doing something like this, we could end up with a system that
> mostly seems to work but is actually unreliable in ways that can't be
> fixed without non-back-patchable changes.

I am worried about it too. I don't feel like I can judge if we just need to be very careful or if it's too big a risk, so I'll defer to you here.

>> The other option would be to make sure 2PC decoding/tx streaming does
>> not read aborted transactions, but that would mean locking the
>> transaction every time we give control to the output plugin. Given that
>> the output plugin may do network writes, this would really mean locking
>> the transaction for an unbounded period of time. That does not strike me
>> as something we want to do; decoding should not interact with frontend
>> transaction management, definitely not this badly.
>
> Yeah, I don't think it's acceptable to postpone abort indefinitely.
>

Hmm, maybe a less bad and potentially acceptable version of this would be to make TransactionIdIsInProgress() treat a transaction as running if it's being decoded; that might solve most of the issues. There is still a potential interlocking issue with UPDATEs, but they have to change the same tuples as the aborted tx did.

What I mean concretely is that decoding would check if the tx has aborted; if not, then it would put info somewhere in shmem about it working on the tx, and once the plugin is done with the current call it would reset the shmem info to an invalid tx. Then TransactionIdIsInProgress() would check this shmem info before checking about the abort. If the first check showed that the transaction is aborted, then decoding would not bother decoding it further.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
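Something roughly like the following is what the above idea seems to amount to. Everything here (the decoding_xids array, MAX_DECODERS, the function names) is a hypothetical, simplified model rather than backend code - in reality the array would live in shared memory, be protected by a lock, and the abort check would go through the real transaction-status machinery.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;
    #define InvalidTransactionId ((TransactionId) 0)
    #define MAX_DECODERS 8

    /* XID each decoding process is currently handing to its output plugin
     * (InvalidTransactionId when idle). */
    static TransactionId decoding_xids[MAX_DECODERS];

    /*
     * Decoder side: before giving control to the plugin for a change of
     * transaction "xid", bail out if it already aborted; otherwise
     * advertise that we are working on it.
     */
    static bool
    decoder_begin_work(int slot, TransactionId xid, bool xid_did_abort)
    {
        if (xid_did_abort)
            return false;              /* don't bother decoding it further */
        decoding_xids[slot] = xid;     /* advertise */
        return true;
    }

    static void
    decoder_end_work(int slot)
    {
        decoding_xids[slot] = InvalidTransactionId;   /* reset to invalid tx */
    }

    /*
     * The TransactionIdIsInProgress()-style tweak: treat a transaction as
     * still running while some decoder is advertising it.
     */
    static bool
    xid_is_being_decoded(TransactionId xid)
    {
        for (int i = 0; i < MAX_DECODERS; i++)
            if (decoding_xids[i] == xid)
                return true;
        return false;
    }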
On 22 January 2018 at 20:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> On 22/01/18 20:21, Robert Haas wrote:
>>> The other option would be to make sure 2PC decoding/tx streaming does
>>> not read aborted transactions, but that would mean locking the
>>> transaction every time we give control to the output plugin. Given that
>>> the output plugin may do network writes, this would really mean locking
>>> the transaction for an unbounded period of time. That does not strike me
>>> as something we want to do; decoding should not interact with frontend
>>> transaction management, definitely not this badly.
>>
>> Yeah, I don't think it's acceptable to postpone abort indefinitely.
>>
>
> Hmm, maybe a less bad and potentially acceptable version of this would be
> to make TransactionIdIsInProgress() treat a transaction as running if it's
> being decoded; that might solve most of the issues. There is still a
> potential interlocking issue with UPDATEs, but they have to change the
> same tuples as the aborted tx did.
>
> What I mean concretely is that decoding would check if the tx has aborted;
> if not, then it would put info somewhere in shmem about it working on the
> tx, and once the plugin is done with the current call it would reset the
> shmem info to an invalid tx. Then TransactionIdIsInProgress() would check
> this shmem info before checking about the abort. If the first check showed
> that the transaction is aborted, then decoding would not bother decoding
> it further.

I must be missing something in this discussion, cos I don't see any problems with this "other option".

Surely we prepare the 2PCxact and then it is persistent. Yes, potentially for an unbounded period of time. And it is (usually) up to the XA manager to resolve that. 2PC interacts with transaction management and yes, it can be slow. But the choice is slow and consistent, or not. This would only be used with the full choice of the user, just like synchronous_commit.

In this case, we call the decoding plugin's precommit hook which would then prepare the 2PCxact and set a non-persistent flag saying it is being decoded. If decoding completes normally we release the lock and commit. If decoding fails or the DBA has another reason to do so, we provide a function that allows the flag to be unlocked. While it is locked the 2PCxact cannot be aborted or committed.

There is no danger of accidental abort because the prepare has persistent state.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> I must be missing something in this discussion, cos I don't see any
> problems with this "other option".
>
> Surely we prepare the 2PCxact and then it is persistent. Yes,
> potentially for an unbounded period of time. And it is (usually) up to
> the XA manager to resolve that. 2PC interacts with transaction
> management and yes, it can be slow. But the choice is slow and
> consistent, or not. This would only be used with the full choice of
> the user, just like synchronous_commit.
>
> In this case, we call the decoding plugin's precommit hook which would
> then prepare the 2PCxact and set a non-persistent flag saying it is
> being decoded. If decoding completes normally we release the lock and
> commit. If decoding fails or the DBA has another reason to do so, we
> provide a function that allows the flag to be unlocked. While it is
> locked the 2PCxact cannot be aborted or committed.
>
> There is no danger of accidental abort because the prepare has persistent state.

This concurrent abort handling while decoding is ongoing is turning out to be a complex affair.

Thinking more about this, just to provide an example, we have a decoding plugin hook to determine if a GID for a 2PC was decoded at PREPARE time or COMMIT time as part of the 2PC logical decoding patch. We need that to determine the *same* static answer every time we see a specific GID while decoding across restarts; the plugin should know what it had done the last time around and should tell us the same later as well. It just occurred to me that, as Simon also mentioned, it could/should also be the decoding plugin's responsibility to indicate if it's ok to go ahead with the abort of the transaction.

So, we could consider adding a preAbort hook. That preAbort hook gets the GID, XID and other parameters as needed and tells us whether we can go ahead with the abort or if we need to wait out (maybe we pass in an ok_to_wait param as well). As an example, a plugin could look up some shmem structure which points to the current transaction being decoded and do related processing to ensure that it stops decoding at a clean juncture, thus keeping the response time bounded to a maximum of one change record apply cycle. That passes the onus onto the plugin writers and keeps the core code around this concurrent abort handling clean.

I will cook up something along the above lines unless there are any objections to this approach.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
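One possible shape for the preAbort hook sketched above, just to make the parameters concrete. This is purely hypothetical and not a signature from any posted patch; TransactionId is the usual backend XID type.

    /*
     * Hypothetical output-plugin callback, invoked before a (prepared)
     * transaction is rolled back.  "xid" and "gid" identify the transaction.
     * The plugin returns true if it is OK to abort right away, or sets
     * *ok_to_wait and returns false to ask the backend to wait until the
     * plugin has reached a clean point in its decoding.
     */
    typedef bool (*LogicalDecodePreAbortCB) (TransactionId xid,
                                             const char *gid,
                                             bool *ok_to_wait);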
On 23/01/18 16:01, Nikhil Sontakke wrote:
>> I must be missing something in this discussion, cos I don't see any
>> problems with this "other option".
>>
>> Surely we prepare the 2PCxact and then it is persistent. Yes,
>> potentially for an unbounded period of time. And it is (usually) up to
>> the XA manager to resolve that. 2PC interacts with transaction
>> management and yes, it can be slow. But the choice is slow and
>> consistent, or not. This would only be used with the full choice of
>> the user, just like synchronous_commit.

It's not about the transaction being persistent but about the abort command being blocked.

>> In this case, we call the decoding plugin's precommit hook which would
>> then prepare the 2PCxact and set a non-persistent flag saying it is
>> being decoded. If decoding completes normally we release the lock and
>> commit. If decoding fails or the DBA has another reason to do so, we
>> provide a function that allows the flag to be unlocked. While it is
>> locked the 2PCxact cannot be aborted or committed.

The output plugin can't deal with precommit, that has to be handled elsewhere, but in principle this is true.

>> There is no danger of accidental abort because the prepare has persistent state.
>
> This concurrent abort handling while decoding is ongoing is turning
> out to be a complex affair.
>
> Thinking more about this, just to provide an example, we have a
> decoding plugin hook to determine if a GID for a 2PC was decoded at
> PREPARE time or COMMIT time as part of the 2PC logical decoding patch.
> We need that to determine the *same* static answer every time we see a
> specific GID while decoding across restarts; the plugin should know
> what it had done the last time around and should tell us the same
> later as well. It just occurred to me that, as Simon also mentioned, it
> could/should also be the decoding plugin's responsibility to indicate
> if it's ok to go ahead with the abort of the transaction.
>
> So, we could consider adding a preAbort hook. That preAbort hook gets
> the GID, XID and other parameters as needed and tells us whether we
> can go ahead with the abort or if we need to wait out (maybe we pass
> in an ok_to_wait param as well). As an example, a plugin could look up
> some shmem structure which points to the current transaction being
> decoded and do related processing to ensure that it stops decoding at
> a clean juncture, thus keeping the response time bounded to a maximum
> of one change record apply cycle. That passes the onus onto the plugin
> writers and keeps the core code around this concurrent abort handling
> clean.
>

Having this as a responsibility of the plugin sounds interesting. It certainly narrows the scope for which we need to solve the abort issue. For 2PC that may be okay, as we need to somehow interact with the transaction manager, as Simon noted. I am not sure if this helps the streaming use-case though, as there is not going to be any external transaction management involved there.

In any case all this interlocking could potentially be made less impactful by only doing it when we know the transaction did catalog changes prior to the currently decoded change (which we do during decoding), since that's the only time we are interested in whether it aborted or not.

This all leads me to another idea. What if logical decoding provided an API for "locking/unlocking" the currently decoded transaction against abort? This function would then be called by both decoding and the output plugin before any catalog read. The function can be smart enough to be a NOOP if the transaction is not running (ie we are not doing 2PC decoding or streaming) or when the transaction didn't do any catalog modifications (we already have that info easily accessible as a bool).

That would mean we'd never do any kind of heavy locking for prolonged periods of time (ie network calls) but only during catalog access and only when needed. It would also solve this for both 2PC and streaming, and it would be easy to use by plugin authors. Just document that the call should be done before catalog access when in an output plugin; it can probably even be Asserted that the call was done.

Thoughts?

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

> Having this as a responsibility of the plugin sounds interesting. It
> certainly narrows the scope for which we need to solve the abort issue.
> For 2PC that may be okay, as we need to somehow interact with the
> transaction manager, as Simon noted. I am not sure if this helps the
> streaming use-case though, as there is not going to be any external
> transaction management involved there.
>
> In any case all this interlocking could potentially be made less
> impactful by only doing it when we know the transaction did catalog
> changes prior to the currently decoded change (which we do during
> decoding), since that's the only time we are interested in whether it
> aborted or not.
>
> This all leads me to another idea. What if logical decoding provided an
> API for "locking/unlocking" the currently decoded transaction against
> abort? This function would then be called by both decoding and the
> output plugin before any catalog read. The function can be smart enough
> to be a NOOP if the transaction is not running (ie we are not doing 2PC
> decoding or streaming) or when the transaction didn't do any catalog
> modifications (we already have that info easily accessible as a bool).
>
> That would mean we'd never do any kind of heavy locking for prolonged
> periods of time (ie network calls) but only during catalog access and
> only when needed. It would also solve this for both 2PC and streaming,
> and it would be easy to use by plugin authors. Just document that the
> call should be done before catalog access when in an output plugin; it
> can probably even be Asserted that the call was done.
>
> Thoughts?
>

Yeah, this might work. We already have SET_LOCKTAG_TRANSACTION(), via which every transaction takes an exclusive lock on its own transactionid when it starts, for example. We ideally want a single solution to handle 2PC and ongoing (streaming) transactions. We could introduce a new SET_LOCKTAG_LOGICALTRANSACTION(). The logical decoding process could take a SHARED lock on this, check if the XID is still ok to decode, read the catalog and unlock. Abort/Commit transaction processing could take this in EXCLUSIVE mode.

As mentioned above, the plugin API which takes this lock will be smart enough to be a NOOP if the transaction is not running (i.e. we are not doing 2PC decoding or streaming) or when the transaction didn't do any catalog modifications.

Thoughts?

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
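In backend terms, the shape of that interlock might be roughly as follows. This is only a sketch: SET_LOCKTAG_LOGICALTRANSACTION is the *proposed* new locktag (it does not exist today), while LockAcquire/LockRelease and TransactionIdDidAbort are the existing calls; error handling and the NOOP fast path are omitted.

    LOCKTAG     tag;

    /* Decoding side, before reading any system/user catalog for xid: */
    SET_LOCKTAG_LOGICALTRANSACTION(tag, xid);        /* proposed macro */
    (void) LockAcquire(&tag, ShareLock, false, false);

    if (TransactionIdDidAbort(xid))
    {
        /* A concurrent ROLLBACK (PREPARED) won the race: stop decoding xid. */
        LockRelease(&tag, ShareLock, false);
    }
    else
    {
        /* ... safe to read system/user catalogs here ... */
        LockRelease(&tag, ShareLock, false);
    }

    /* Abort/commit processing would take the same locktag in ExclusiveLock
     * mode, so it waits for any decoder currently inside a catalog read. */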
On 29 January 2018 at 07:15, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
>> Having this as a responsibility of the plugin sounds interesting. It
>> certainly narrows the scope for which we need to solve the abort issue.
>> For 2PC that may be okay, as we need to somehow interact with the
>> transaction manager, as Simon noted. I am not sure if this helps the
>> streaming use-case though, as there is not going to be any external
>> transaction management involved there.

I think we should recognize that the two cases are different.

The 2PC decoding patch is looking at what happens AFTER a PREPARE has occurred on a transaction.

Streaming looks to have streaming occur right in the middle of a transaction, so it has other concerns and likely other solutions as a result.

>> In any case all this interlocking could potentially be made less
>> impactful by only doing it when we know the transaction did catalog
>> changes prior to the currently decoded change (which we do during
>> decoding), since that's the only time we are interested in whether it
>> aborted or not.
>>
>> This all leads me to another idea. What if logical decoding provided an
>> API for "locking/unlocking" the currently decoded transaction against
>> abort? This function would then be called by both decoding and the
>> output plugin before any catalog read. The function can be smart enough
>> to be a NOOP if the transaction is not running (ie we are not doing 2PC
>> decoding or streaming) or when the transaction didn't do any catalog
>> modifications (we already have that info easily accessible as a bool).
>>
>> That would mean we'd never do any kind of heavy locking for prolonged
>> periods of time (ie network calls) but only during catalog access and
>> only when needed. It would also solve this for both 2PC and streaming,
>> and it would be easy to use by plugin authors. Just document that the
>> call should be done before catalog access when in an output plugin; it
>> can probably even be Asserted that the call was done.
>>
>> Thoughts?
>>
>
> Yeah, this might work. We already have SET_LOCKTAG_TRANSACTION(), via
> which every transaction takes an exclusive lock on its own
> transactionid when it starts, for example. We ideally want a single
> solution to handle 2PC and ongoing (streaming) transactions. We could
> introduce a new SET_LOCKTAG_LOGICALTRANSACTION(). The logical decoding
> process could take a SHARED lock on this, check if the XID is still ok
> to decode, read the catalog and unlock. Abort/Commit transaction
> processing could take this in EXCLUSIVE mode.
>
> As mentioned above, the plugin API which takes this lock will be smart
> enough to be a NOOP if the transaction is not running (i.e. we are not
> doing 2PC decoding or streaming) or when the transaction didn't do any
> catalog modifications.

I think it is enough to say that ROLLBACK PREPARED cannot abort a transaction while decoding is taking place, if the plugin decides it wishes to prevent that.

So the plugin API must have a pre-rollback-prepared call to allow the plugin to decide how to handle that (is that what you meant by the pre-abort hook?). That call will just return when and if it is OK to abort. Any waiting or throwing of ERRORs can occur inside the plugin.

The plugin can provide a function to allow that type of abort if it wishes, but no new core functionality is required for that. If needed, we would wait on a latch the plugin provides for that.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 23 January 2018 at 19:17, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> I am not sure if this helps the streaming use-case though, as there is
> not going to be any external transaction management involved there.

So, I think we need some specific discussion of what to do in that case.

Streaming happens only with big transactions and only for short periods. The problem only occurs when we are decoding and we hit a catalog table change. Processing of that is short, then we continue. So it seems perfectly fine to block aborts in those circumstances.

We can just mark that state in an in-memory array of StreamingDecodedTransactions that has size SizeOf(TransactionId) * MaxNumWalSenders.

We can add a check into RecordTransactionAbort() just before the critical section to see if we are currently processing a StreamingDecodedTransaction and if so, poll until we're OK to abort. The check will be quick and the abort path is not considered one we need to optimize.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
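A self-contained sketch of what that could look like (names and sizes illustrative only; the real array would sit in shared memory, be bounded by max_wal_senders, and use a latch or sleep rather than a busy poll):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;
    #define InvalidTransactionId ((TransactionId) 0)
    #define MAX_WALSENDERS 10          /* stand-in for max_wal_senders */

    /* One slot per walsender: the XID whose changes it is streaming now. */
    static TransactionId StreamingDecodedTransactions[MAX_WALSENDERS];

    static bool
    xid_is_being_streamed(TransactionId xid)
    {
        for (int i = 0; i < MAX_WALSENDERS; i++)
            if (StreamingDecodedTransactions[i] == xid)
                return true;
        return false;
    }

    /*
     * Would be called just before RecordTransactionAbort() enters its
     * critical section: wait until no walsender is mid-way through
     * streaming this transaction's changes.
     */
    static void
    wait_for_streaming_decoders(TransactionId xid)
    {
        while (xid_is_being_streamed(xid))
            ;   /* real code would sleep or wait on a latch, not spin */
    }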
Hi,

On 01/29/2018 11:23 AM, Simon Riggs wrote:
> On 29 January 2018 at 07:15, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
>
>>> Having this as a responsibility of the plugin sounds interesting. It
>>> certainly narrows the scope for which we need to solve the abort
>>> issue. For 2PC that may be okay, as we need to somehow interact
>>> with the transaction manager, as Simon noted. I am not sure if this
>>> helps the streaming use-case though, as there is not going to be any
>>> external transaction management involved there.
>
> I think we should recognize that the two cases are different.
>
> The 2PC decoding patch is looking at what happens AFTER a PREPARE has
> occurred on a transaction.
>
> Streaming looks to have streaming occur right in the middle of a
> transaction, so it has other concerns and likely other solutions as a
> result.
>

I don't quite see how these cases are so different, or why they would require different handling. Can you explain?

We need to deal with decoding a transaction that aborts while we're decoding it - we must not continue decoding it (in the sense of passing the changes to the plugin). I don't see any major difference between ROLLBACK and ROLLBACK PREPARED. So I think the pre-abort hook solution should work for the streaming case too. At least - I don't see why it wouldn't.

While discussing this with Peter Eisentraut off-list, I think we came up with an alternative idea for the streaming case, one that does not require the pre-abort hook. The important detail is that we only really care about aborts in transactions that modified catalogs in some way (e.g. by doing DDL). But we can safely decode (and stream) changes up to the point when the catalogs get modified, so we can do two different things at that point:

(1) Stop streaming changes from that transaction, and instead start spilling it to disk (and then continue with the replay only when it actually commits).

(2) Note the transaction ID somewhere and restart the decoding (that is, notify the downstream to throw away the data and go back in WAL to read all the data from scratch), but spilling that one transaction to disk instead of streaming it.

Neither of these solutions is currently implemented and would require changes to ReorderBuffer (which currently does not support mixing of spill-to-disk and streaming) and possibly the logical decoding infrastructure (e.g. to stash the XIDs somewhere and allow restarting from a previous LSN).

The good thing is it does not need to know about aborted transactions and so does not require communication with the transaction manager using pre-abort hooks etc. I think this would be a massive advantage. The main question (at least for me) is if it's actually cheaper, compared to the pre-abort hook.

In my experience, aborts tend to be fairly rare in practice - maybe 1:1000 to commits. On the other hand, temporary tables are a fairly common thing, and they count as catalog changes, of course.

Maybe this is not that bad though, as it only really matters for large transactions, which is when we start to spill to disk or stream. And that should not be very often - if it happens very often, you probably need a higher limit (similarly to work_mem). The cases where this would matter are large ETL jobs, upgrade scripts and so on - these tend to be large and mix DDL (temporary tables, ALTER TABLE, ...). That's unfortunate, as it's one of the cases the streaming was supposed to help.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 29 January 2018 at 13:34, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> The important detail is that we only really care
> about aborts in transactions that modified catalogs in some way (e.g. by
> doing DDL). But we can safely decode (and stream) changes up to the
> point when the catalogs get modified, so we can do two different things
> at that point:

Safely? Hmm.

I think what we are missing here is a test case, or a detailed description of what we think happens. I can't see why a concurrent abort would be unsafe, as in, it would make decoding explode.

If all it does is give the wrong answer when we process DDL, then all we have to do is check for abort before and after we process any DDL during decoding. Handling DDL is rare, so that test won't cost much. So ISTM that we can continue decoding after we hit DDL, we just need some care.

(My earlier comments were based on the idea that the 2PC patch would simply block aborts, which of course will not work for streaming, hence the difference)

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 01/29/2018 02:49 PM, Simon Riggs wrote:
> On 29 January 2018 at 13:34, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>> The important detail is that we only really care
>> about aborts in transactions that modified catalogs in some way (e.g. by
>> doing DDL). But we can safely decode (and stream) changes up to the
>> point when the catalogs get modified, so we can do two different things
>> at that point:
>
> Safely? Hmm.
>
> I think what we are missing here is a test case, or a detailed
> description of what we think happens. I can't see why a concurrent
> abort would be unsafe, as in, it would make decoding explode.
>
> If all it does is give the wrong answer when we process DDL, then
> all we have to do is check for abort before and after we process any
> DDL during decoding. Handling DDL is rare, so that test won't cost
> much. So ISTM that we can continue decoding after we hit DDL, we just
> need some care.
>

We do not decode DDL, but maybe I misunderstood what you mean. The issue is that we consult catalogs during the decoding for various reasons (to interpret table structure, see if the table is part of a replication set, and the plugin is free to use arbitrary user catalogs).

Hitting the issue is tricky, because of timing issues. But imagine a transaction like this:

BEGIN;
INSERT INTO t(a,b) VALUES (1,2);
SAVEPOINT s;
ALTER TABLE t ADD COLUMN c INT;
INSERT INTO t(a,b,c) VALUES (1,2,3);
ROLLBACK TO s;
COMMIT;

Now assume that we start streaming the transaction, and it happens like this (T refers to the steps of the transaction):

1) start decoding the subtransaction (not yet aborted)

2) T: ROLLBACK TO s;

3) VACUUM the catalogs, which may remove pg_attribute rows for "c" or make them inaccessible through indexes, etc.

4) inspect the new row (which we still have in reorderbuffer)

5) Kabooom! The row has column "c" which we don't see in the catalog.

> (My earlier comments were based on the idea that the 2PC patch would
> simply block aborts, which of course will not work for streaming,
> hence the difference)
>

I'm sorry, but I don't quite see why that wouldn't work. There are far too many ideas flying around on this thread, so I'm probably missing something. Can you elaborate?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 29 January 2018 at 14:13, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> 4) inspect the new row (which we still have in reorderbuffer)
>
> 5) Kabooom! The row has column "c" which we don't see in the catalog.

We don't use caches? Why does a cache miss cause it to explode?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 01/29/2018 03:17 PM, Simon Riggs wrote:
> On 29 January 2018 at 14:13, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>> 4) inspect the new row (which we still have in reorderbuffer)
>>
>> 5) Kabooom! The row has column "c" which we don't see in the catalog.
>
> We don't use caches? Why does a cache miss cause it to explode?
>

We do use caches (and we invalidate them), of course. But the problem is that by the time we get to look up the row, it may be either removed by VACUUM (because the catalog cleanup is more aggressive) or not reachable using an index (which is the HOT issue pointed out by Robert earlier in this thread).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi all,

On 29 January 2018 at 12:45, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
> Hi,
>
>> Having this as a responsibility of the plugin sounds interesting. It
>> certainly narrows the scope for which we need to solve the abort issue.
>> For 2PC that may be okay, as we need to somehow interact with the
>> transaction manager, as Simon noted. I am not sure if this helps the
>> streaming use-case though, as there is not going to be any external
>> transaction management involved there.
>>
>> In any case all this interlocking could potentially be made less
>> impactful by only doing it when we know the transaction did catalog
>> changes prior to the currently decoded change (which we do during
>> decoding), since that's the only time we are interested in whether it
>> aborted or not.
>>
>> This all leads me to another idea. What if logical decoding provided an
>> API for "locking/unlocking" the currently decoded transaction against
>> abort? This function would then be called by both decoding and the
>> output plugin before any catalog read. The function can be smart enough
>> to be a NOOP if the transaction is not running (ie we are not doing 2PC
>> decoding or streaming) or when the transaction didn't do any catalog
>> modifications (we already have that info easily accessible as a bool).
>>
>> That would mean we'd never do any kind of heavy locking for prolonged
>> periods of time (ie network calls) but only during catalog access and
>> only when needed. It would also solve this for both 2PC and streaming,
>> and it would be easy to use by plugin authors. Just document that the
>> call should be done before catalog access when in an output plugin; it
>> can probably even be Asserted that the call was done.
>>
>> Thoughts?
>>
>
> Yeah, this might work. We already have SET_LOCKTAG_TRANSACTION(), via
> which every transaction takes an exclusive lock on its own
> transactionid when it starts, for example. We ideally want a single
> solution to handle 2PC and ongoing (streaming) transactions. We could
> introduce a new SET_LOCKTAG_LOGICALTRANSACTION(). The logical decoding
> process could take a SHARED lock on this, check if the XID is still ok
> to decode, read the catalog and unlock. Abort/Commit transaction
> processing could take this in EXCLUSIVE mode.
>
> As mentioned above, the plugin API which takes this lock will be smart
> enough to be a NOOP if the transaction is not running (i.e. we are not
> doing 2PC decoding or streaming) or when the transaction didn't do any
> catalog modifications.
>
> Thoughts?
>

This latest version takes care of the abort-while-decoding issue along with additional test cases and documentation changes.

We now maintain a list of processes that are decoding a specific transactionID and make each one a decode groupmember of a decode groupleader process. The decode groupleader process is basically the PGPROC entry which points to the prepared 2PC transaction or an ongoing regular transaction. If the 2PC is rolled back, then FinishPreparedTransactions uses the decode groupleader process to let all the decode groupmember processes know that it's aborting. A similar logic can be used for the decoding of uncommitted transactions. The decode groupmember processes are able to abort sanely in such a case.

We also have two new APIs "LogicalLockTransaction" and "LogicalUnlockTransaction" that the decoding backends need to use while doing system or user catalog table access. The abort code interlocks with decoding backends that might be in the process of accessing catalog tables and waits for those few moments before aborting the transaction.

The implementation uses LockHashPartitionLockByProc on the decode groupleader process to control access to these additional fields in the PGPROC structure amongst the decode groupleader and the other decode groupmember processes, and does not need to use the ProcArrayLock at all. The implementation is inspired by the *existing* lockGroupLeader solution, which uses a similar technique to track processes waiting on a leader holding that lock. I believe it's an optimal solution for this problem of ours.

Have added TAP tests to test multiple decoding backends working on the same transaction. Used delays in the test_decoding plugin to introduce waits after making the LogicalLockTransaction call and calling ROLLBACK to ensure that it interlocks with such decoding backends which are doing catalog access. Tests working as desired. Also "make check-world" passes with asserts enabled.

I have separated out the decode groupleader/groupmember changes from the main logical decoding of 2PC patch into a separate patch for REVIEW only. Both the main patch and this separate review-only patch are attached with this email.

I have posted this same explanation about abort handling on the other thread for the main 2PC logical decoding patch (http://www.postgresql-archive.org/logical-decoding-of-two-phase-transactions-td5936887.html).

Comments appreciated.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
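For plugin authors, usage of those two new calls would presumably end up looking something like the snippet below. This is an assumption about the API's shape based on the description above - the exact signatures (here a ReorderBufferTXN-style txn argument and a boolean return indicating whether the transaction is still alive) are whatever the posted patch actually defines.

    /* Inside decoding or an output plugin callback, before touching any
     * system or user catalog table on behalf of this transaction: */
    if (!LogicalLockTransaction(txn))
    {
        /* Concurrent ROLLBACK / ROLLBACK PREPARED: the transaction is gone,
         * so stop decoding it and clean up. */
        return;
    }

    /* ... catalog access for this transaction is safe here; a concurrent
     * abort will wait until we unlock ... */

    LogicalUnlockTransaction(txn);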