Thread: [HACKERS] logical decoding of two-phase transactions
Here is a resubmission of the patch to implement logical decoding of two-phase transactions (instead of treating them as ordinary transactions at commit time) [1]. I’ve slightly polished things and used the test_decoding output plugin as a client.

The general idea is quite simple:

* Write the GID along with commit/prepare records in the case of 2PC.
* Add several routines to decode prepare records in the same way as already happens in logical decoding.

I’ve also added an explicit LOCK statement to the test_decoding regression suite to check that it doesn’t break things. If somebody can create a scenario that blocks decoding because of an existing dummy backend lock, that would be a great help. Right now all my tests pass (including the TAP tests from the adjacent mail thread that check recovery of two-phase transactions in case of failures).

If we agree on the current approach, then I’m ready to add this stuff to the proposed in-core logical replication.

[1] https://www.postgresql.org/message-id/EE7452CA-3C39-4A0E-97EC-17A414972884%40postgrespro.ru

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
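For a concrete picture, exercising such a patch through test_decoding might look roughly like this (a sketch only: the table, the GID, and especially the output lines for PREPARE/COMMIT PREPARED are my guesses at the format, not taken from the patch):

    CREATE TABLE t (i int);
    SELECT pg_create_logical_replication_slot('slot', 'test_decoding');

    BEGIN;
    INSERT INTO t VALUES (1);
    PREPARE TRANSACTION 'tx1';

    SELECT data FROM pg_logical_slot_get_changes('slot', NULL, NULL);
    -- hypothetical output, emitted already at prepare time:
    --   BEGIN
    --   table public.t: INSERT: i[integer]:1
    --   PREPARE TRANSACTION 'tx1'

    COMMIT PREPARED 'tx1';

    SELECT data FROM pg_logical_slot_get_changes('slot', NULL, NULL);
    -- hypothetical output:
    --   COMMIT PREPARED 'tx1'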
On 31 December 2016 at 08:36, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> Here is resubmission of patch to implement logical decoding of two-phase transactions (instead of treating them
> as usual transaction when commit) [1] I’ve slightly polished things and used test_decoding output plugin as client.

Sounds good.

> General idea quite simple here:
>
> * Write gid along with commit/prepare records in case of 2pc

GID is now variable sized. You seem to have added this to every
commit, not just 2PC.

> * Add several routines to decode prepare records in the same way as it already happens in logical decoding.
>
> I’ve also added explicit LOCK statement in test_decoding regression suit to check that it doesn’t break thing.

Please explain that in comments in the patch.

> If somebody can create scenario that will block decoding because of existing dummy backend lock that will be great
> help. Right now all my tests passing (including TAP tests to check recovery of twophase tx in case of failures from
> adjacent mail thread).
>
> If we will agree about current approach than I’m ready to add this stuff to proposed in-core logical replication.
>
> [1] https://www.postgresql.org/message-id/EE7452CA-3C39-4A0E-97EC-17A414972884%40postgrespro.ru

We'll need some measurements of the additional WAL space or memory usage from these approaches.

Thanks.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 4 January 2017 at 21:20, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 31 December 2016 at 08:36, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>> General idea quite simple here:
>>
>> * Write gid along with commit/prepare records in case of 2pc
>
> GID is now variable sized. You seem to have added this to every
> commit, not just 2PC

I've just realised that you're adding GID because it allows you to
uniquely identify the prepared xact. But then the prepared xact will
also have a regular TransactionId, which is also unique. GID exists
for users to specify things, but it is not needed internally and we
don't need to add it here.

What we do need is for the commit prepared message to remember what
the xid of the prepare was and then re-find it using the commit WAL
record's twophase_xid field. So we don't need to add GID to any WAL
records, nor to any in-memory structures. Please re-work the patch to
include twophase_xid, which should make the patch smaller and much
faster too.

Please add comments to explain how and why patches work. Design
comments allow us to check the design makes sense, and if it does,
whether all the lines in the patch are needed to follow the design.
Without that, patches are much harder to commit, and we all want
patches to be easier to commit.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
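For reference, commit records already carry the prepared transaction's xid: ParseCommitRecord() fills in twophase_xid when XACT_XINFO_HAS_TWOPHASE is set, so a decoder can recognise a COMMIT PREPARED roughly like this (a sketch against the existing xact.h/decode.c machinery, not code from the patch):

    xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(buf->record);
    xl_xact_parsed_commit parsed;
    TransactionId xid;

    ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);

    if (TransactionIdIsValid(parsed.twophase_xid))
        xid = parsed.twophase_xid;          /* finishing a prepared transaction */
    else
        xid = XLogRecGetXid(buf->record);   /* ordinary single-phase commit */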
Thank you for looking into this.

> On 5 Jan 2017, at 09:43, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> GID is now variable sized. You seem to have added this to every
> commit, not just 2PC

Hm, didn’t realise that, I’ll fix.

> I've just realised that you're adding GID because it allows you to
> uniquely identify the prepared xact. But then the prepared xact will
> also have a regular TransactionId, which is also unique. GID exists
> for users to specify things, but it is not needed internally and we
> don't need to add it here.

I think we can’t avoid pushing down the GID to the client side anyway.

If we push down only the local TransactionId to the remote server, then we lose the mapping
of GID to TransactionId, and there will be no way for the user to identify his transaction
on the second server. Also Open XA and lots of libraries (e.g. J2EE) assume that the same
GID is used everywhere, and that it’s the same GID that was issued by the client.

Requirements for two-phase decoding can differ depending on what one wants to build around
it, and I believe in some situations pushing down the xid is enough. But IMO dealing with
reconnects, failures, and client libraries will force the programmer to use the same GID
everywhere.

> What we do need is for the commit prepared
> message to remember what the xid of the prepare was and then re-find
> it using the commit WAL record's twophase_xid field. So we don't need
> to add GID to any WAL records, nor to any in-memory structures.

The other part of the story is how to find the GID during decoding of the commit prepared
record. I did that by adding a GID field to the commit WAL record, because by the time of
decoding all the memory structures that were holding the xid<->gid correspondence are
already cleaned up.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 5 January 2017 at 10:21, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> I think we anyway can’t avoid pushing down GID to the client side.
>
> If we will push down only local TransactionId to remote server then we will lose mapping
> of GID to TransactionId, and there will be no way for user to identify his transaction on
> second server. Also Open XA and lots of libraries (e.g. J2EE) assumes that there is
> the same GID everywhere and it’s the same GID that was issued by the client.
>
> Requirements for two-phase decoding can be different depending on what one want
> to build around it and I believe in some situations pushing down xid is enough. But IMO
> dealing with reconnects, failures and client libraries will force programmer to use
> the same GID everywhere.

Surely in this case the master server is acting as the Transaction
Manager, and it knows the mapping, so we are good?

I guess if you are using >2 nodes then you need to use full 2PC on each node.

But even then, if you adopt the naming convention that all in-progress
xacts will be called RepOriginId-EPOCH-XID, so they have a fully
unique GID on all of the child nodes, then we don't need to add the
GID.

Please explain precisely how you expect to use this, to check that GID
is required.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
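Under that naming convention a node could synthesize a unique GID with no new WAL fields at all, e.g. (illustrative only; origin_id, epoch and xid are assumed to be in scope, and the format string is just one possible encoding):

    char        gid[GIDSIZE];       /* GIDSIZE is 200, from access/xact.h */

    /* origin id + xid epoch + xid is unique across the whole cluster */
    snprintf(gid, sizeof(gid), "%u-%u-%u",
             (uint32) origin_id, epoch, (uint32) xid);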
> On 5 Jan 2017, at 13:49, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> Surely in this case the master server is acting as the Transaction
> Manager, and it knows the mapping, so we are good?
>
> I guess if you are using >2 nodes then you need to use full 2PC on each node.
>
> Please explain precisely how you expect to use this, to check that GID
> is required.

For example, if we are using logical replication just for failover/HA and allowing the user
to be the transaction manager itself. Then suppose that the user prepared a tx on server A
and server A crashed. After that the client may want to reconnect to server B and
commit/abort that tx. But the user only has the GID that was used during prepare.

> But even then, if you adopt the naming convention that all in-progress
> xacts will be called RepOriginId-EPOCH-XID, so they have a fully
> unique GID on all of the child nodes then we don't need to add the
> GID.

Yes, that’s also possible, but it seems less flexible, restricting us to some specific GID
format.

Anyway, I can measure the WAL space overhead introduced by the GIDs inside commit records
to know exactly what the cost of such an approach would be.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 5 January 2017 at 20:43, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> Anyway, I can measure WAL space overhead introduced by the GID’s inside commit records
> to know exactly what will be the cost of such approach.

Sounds like a good idea, especially if you remove any attempt to work
with GIDs for !2PC commits at the same time.

I don't think I care about having access to the GID for the use case I
have in mind, since we'd actually be wanting to hijack a normal COMMIT
and internally transform it to PREPARE TRANSACTION, <do stuff>, COMMIT
PREPARED. But for the more general case of logical decoding of 2PC I
can see the utility of having the xact identifier.

If we presume we're only interested in logically decoding 2PC xacts
that are not yet COMMIT PREPAREd, can we not avoid the WAL overhead of
writing the GID by looking it up in our shmem state at decoding time
for PREPARE TRANSACTION? If we can't find the prepared transaction in
TwoPhaseState we know to expect a following ROLLBACK PREPARED or
COMMIT PREPARED, so we shouldn't decode it at the PREPARE TRANSACTION
stage.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
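That lookup might look something like the sketch below. TwoPhaseGetGidByXid is a hypothetical helper (no such exported function exists today), though TwoPhaseState and TwoPhaseStateLock are the real shared state in twophase.c; the exact way the xid is reached from a GlobalTransaction here is an assumption:

    /* Hypothetical: fetch the GID of a still-prepared xact. Returns false
     * if the xact is no longer prepared (already committed or rolled back),
     * in which case decoding should wait for the COMMIT/ROLLBACK PREPARED. */
    static bool
    TwoPhaseGetGidByXid(TransactionId xid, char *gid_out)
    {
        bool        found = false;
        int         i;

        LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
        for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
        {
            GlobalTransaction gxact = TwoPhaseState->prepXacts[i];

            if (gxact->xid == xid)      /* field access is an assumption */
            {
                strlcpy(gid_out, gxact->gid, GIDSIZE);
                found = true;
                break;
            }
        }
        LWLockRelease(TwoPhaseStateLock);
        return found;
    }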
On 5 January 2017 at 12:43, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> For example if we are using logical replication just for failover/HA and allowing user
> to be transaction manager itself. Then suppose that user prepared tx on server A and server A
> crashed. After that client may want to reconnect to server B and commit/abort that tx.
> But user only have GID that was used during prepare.

I don't think that's the case you're trying to support, and I don't
think that's a common case that we want to pay the price to put into
core in a non-optional way.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 5 January 2017 at 20:43, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>> But even then, if you adopt the naming convention that all in-progress
>> xacts will be called RepOriginId-EPOCH-XID, so they have a fully
>> unique GID on all of the child nodes then we don't need to add the
>> GID.
>
> Yes, that’s also possible but seems to be less flexible restricting us to some
> specific GID format.
>
> Anyway, I can measure WAL space overhead introduced by the GID’s inside commit records
> to know exactly what will be the cost of such approach.

Stas,

Have you had a chance to look at this further?

I think the approach of storing just the xid and fetching the GID
during logical decoding of the PREPARE TRANSACTION is probably the
best way forward, per my prior mail. That should eliminate Simon's
objection re the cost of tracking GIDs and still let us have access to
them when we want them, which is the best of both worlds really.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> Stas,
>
> Have you had a chance to look at this further?

Generally I’m okay with Simon’s approach and will send an updated patch. I want to perform
some tests anyway to estimate how much disk space is actually wasted by the extra WAL
records.

> I think the approach of storing just the xid and fetching the GID
> during logical decoding of the PREPARE TRANSACTION is probably the
> best way forward, per my prior mail.

I don’t think that’s possible this way. If we don't put the GID in the commit record, then
by the time logical decoding happens the transaction will already be committed/aborted and
there will be no easy way to get that GID. I thought about several possibilities:

* Tracking an xid/gid map in memory doesn’t help much — if the server reboots between
prepare and commit, we lose that mapping.

* We can provide some hooks on prepared tx recovery during startup, but that approach also
fails if the reboot happened between commit and decoding of that commit.

* Logical messages are WAL-logged, but they don’t have any redo function, so they don't
help much either.

So to support user-accessible 2PC over replication based on 2PC decoding, we would have to
invent something nastier, like writing them into a table.

> That should eliminate Simon's
> objection re the cost of tracking GIDs and still let us have access to
> them when we want them, which is the best of both worlds really.

Having 2PC decoding in core is a good thing anyway, even without GID tracking =)

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 26 Jan. 2017 18:43, "Stas Kelvich" <s.kelvich@postgrespro.ru> wrote:

>>> Yes, that’s also possible but seems to be less flexible restricting us to some
>>> specific GID format.
>>>
>>> Anyway, I can measure WAL space overhead introduced by the GID’s inside commit records
>>> to know exactly what will be the cost of such approach.
>
>> I think the approach of storing just the xid and fetching the GID
>> during logical decoding of the PREPARE TRANSACTION is probably the
>> best way forward, per my prior mail.
>
> I don’t think that’s possible in this way. If we will not put GID in commit record,
> than by the time when logical decoding will happened transaction will be already
> committed/aborted and there will be no easy way to get that GID.

My thinking is that if the 2PC xact is by that point COMMIT PREPARED or ROLLBACK PREPARED
we don't care that it was ever 2PC and should just decode it as a normal xact. Its GID has
ceased to be significant and no longer holds meaning since the xact is resolved.

The point of logical decoding of 2PC is to allow peers to participate in a decision on
whether to commit or not, rather than only being able to decode the xact once committed,
as is currently the case.

If it's already committed there's no point treating it as anything special.

So when we get to the PREPARE TRANSACTION in xlog we look to see if it's already committed
/ rolled back. If so we proceed normally, like current decoding does. Only if it's still
prepared do we decode it as 2PC and supply the GID to a new output plugin callback for
prepared xacts.

> I thought about several possibilities:
>
> * Tracking xid/gid map in memory also doesn’t help much — if server reboots between
> prepare and commit we’ll lose that mapping.

Er, what? That's why I suggested using the prepared xacts shmem state. It's persistent, as
you know from your work on prepared transaction files. It has all the required info.
> On 26 Jan 2017, at 12:51, Craig Ringer <craig@2ndquadrant.com> wrote:
>
>> * Tracking xid/gid map in memory also doesn’t help much — if server reboots between
>> prepare and commit we’ll lose that mapping.
>
> Er, what? That's why I suggested using the prepared xacts shmem state. It's persistent,
> as you know from your work on prepared transaction files. It has all the required info.

Imagine the following scenario:

1. PREPARE happened
2. PREPARE decoded and sent where it should be sent
3. We got all responses from participating nodes and are issuing COMMIT/ABORT
4. COMMIT/ABORT decoded and sent

After step 3 there is no more memory state associated with that prepared tx, so if we fail
between 3 and 4 then we can’t know the GID unless we wrote it in the commit record (or a
table).

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 26 January 2017 at 19:34, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> Imagine the following scenario:
>
> 1. PREPARE happened
> 2. PREPARE decoded and sent where it should be sent
> 3. We got all responses from participating nodes and are issuing COMMIT/ABORT
> 4. COMMIT/ABORT decoded and sent
>
> After step 3 there is no more memory state associated with that prepared tx, so if we
> fail between 3 and 4 then we can’t know the GID unless we wrote it in the commit record
> (or a table).

If the decoding session crashes/disconnects and restarts between 3 and 4, we know the xact
is now committed or rolled back and we don't care about its gid anymore; we can decode it
as a normal committed xact, or skip over it if aborted.

If Pg crashes between 3 and 4 the same applies, since all decoding sessions must restart.
No decoding session can ever start up between 3 and 4 without passing through 1 and 2,
since we always restart decoding at restart_lsn, and restart_lsn cannot be advanced past
the assignment (BEGIN) of a given xid until we pass its commit record and the downstream
confirms it has flushed the results.

The reorder buffer doesn't even really need to keep track of the gid between 3 and 4,
though it should do so to save the output plugin and downstream the hassle of keeping an
xid-to-gid mapping. All it needs is to know whether we sent a given xact's data to the
output plugin at PREPARE time, so we can suppress sending it again at COMMIT time, and we
can store that info on the ReorderBufferTXN. We can store the gid there too.

We'll need two new output plugin callbacks, prepare_cb and rollback_cb, since an xact can
roll back after we decode PREPARE TRANSACTION (or during it, even) and we have to be able
to tell the downstream to throw the data away.

I don't think the rollback callback should be called abort_prepared_cb, because we'll
later want to add the ability to decode interleaved xacts' changes as they are made,
before commit, and in that case we will also need to know if they abort. We won't care if
they were prepared xacts or not, but we'll know based on the ReorderBufferTXN anyway.

We don't need a separate commit_prepared_cb; the existing commit_cb is sufficient. The gid
will be accessible on the ReorderBufferTXN.

Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when wal_level >=
logical, I don't think that's the end of the world. But since we already have almost
everything we need in memory, why not just stash the gid on ReorderBufferTXN?

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
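The new callbacks might slot into output_plugin.h roughly like this (a sketch of the proposal, not the patch itself: the existing *_cb typedefs are real, but the prepare/rollback signatures are my guess at what they would look like):

    typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
                                            ReorderBufferTXN *txn,
                                            XLogRecPtr prepare_lsn);

    typedef void (*LogicalDecodeRollbackCB) (struct LogicalDecodingContext *ctx,
                                             ReorderBufferTXN *txn,
                                             XLogRecPtr rollback_lsn);

    typedef struct OutputPluginCallbacks
    {
        LogicalDecodeStartupCB        startup_cb;
        LogicalDecodeBeginCB          begin_cb;
        LogicalDecodeChangeCB         change_cb;
        LogicalDecodePrepareCB        prepare_cb;     /* new: data sent at PREPARE */
        LogicalDecodeRollbackCB       rollback_cb;    /* new: discard a decoded PREPARE */
        LogicalDecodeCommitCB         commit_cb;      /* also fires for COMMIT PREPARED */
        LogicalDecodeMessageCB        message_cb;
        LogicalDecodeFilterByOriginCB filter_by_origin_cb;
        LogicalDecodeShutdownCB       shutdown_cb;
    } OutputPluginCallbacks;

With the gid stashed on ReorderBufferTXN as Craig suggests, both new callbacks (and commit_cb for a previously prepared xact) could simply read it from txn.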
On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
> wal_level >= logical I don't think that's the end of the world. But
> since we already have almost everything we need in memory, why not
> just stash the gid on ReorderBufferTXN?

I have been through this thread... And to be honest, I have a hard
time understanding for which purpose the information of a 2PC
transaction is useful in the case of logical decoding. The prepare and
commit prepared have been received by a node which is at the root of
the cluster tree, a node of the cluster at an upper level, or a
client, being in charge of issuing all the prepare queries, and then
issuing the commit prepared to finish the transaction across a cluster.
In short, even if you do logical decoding from the root node, or the
one at a higher level, you would care just about the fact that it has
been committed.
--
Michael
On Tue, Jan 31, 2017 at 3:29 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> I have been through this thread... And to be honest, I have a hard
> time understanding for which purpose the information of a 2PC
> transaction is useful in the case of logical decoding.

By the way, I have moved this patch to the next CF; you guys seem to be
keeping the discussion moving.
--
Michael
On 31 Jan. 2017 19:29, "Michael Paquier" <michael.paquier@gmail.com> wrote:

> On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
>> wal_level >= logical I don't think that's the end of the world. But
>> since we already have almost everything we need in memory, why not
>> just stash the gid on ReorderBufferTXN?
>
> I have been through this thread... And to be honest, I have a hard
> time understanding for which purpose the information of a 2PC
> transaction is useful in the case of logical decoding.

TL;DR: this lets us decode the xact after prepare but before commit, so decoding/replay
outcomes can affect the commit-or-abort decision.

> The prepare and
> commit prepared have been received by a node which is at the root of
> the cluster tree, a node of the cluster at an upper level, or a
> client, being in charge of issuing all the prepare queries, and then
> issue the commit prepared to finish the transaction across a cluster.
> In short, even if you do logical decoding from the root node, or the
> one at a higher level, you would care just about the fact that it has
> been committed.

That's where you've misunderstood - it isn't committed yet. The point of this change is to
allow us to do logical decoding at the PREPARE TRANSACTION point. The xact is not yet
committed or rolled back.

This allows the results of logical decoding - or, more interestingly, the results of
replay on another node / to another app / whatever - to influence the commit or rollback
decision.

Stas wants this for a conflict-free logical semi-synchronous multi-master replication
solution. At PREPARE TRANSACTION time we replay the xact to other nodes, each of which
applies it and does PREPARE TRANSACTION, then replies to confirm it has successfully
prepared the xact. When all nodes confirm the xact is prepared, it is safe for the origin
node to COMMIT PREPARED. The other nodes then see that the first node has committed and
they commit too.

Alternately, if any node replies "could not replay xact" or "could not prepare xact", the
origin node knows to ROLLBACK PREPARED. All the other nodes see that and roll back too.

This makes it possible to be much more confident that what's replicated is exactly the
same on all nodes, with no after-the-fact MM conflict resolution that apps must be aware
of to function correctly.

To really make it rock solid you also have to send the old and new values of a row, or
have row versions, or send old row hashes. Something I also want to have, but we can
mostly get that already with REPLICA IDENTITY FULL.

It is of interest to me because schema changes in MM logical replication are more
challenging, awkward and restrictive without it. Optimistic conflict resolution doesn't
work well for schema changes, and once the conflicting schema changes are committed on
different nodes there is no going back. So you need your async system to have a global
locking model for schema changes to stop conflicts arising. Or expect the user not to do
anything silly / misunderstand anything and know all the relevant system limitations and
requirements... which we all know works just great in practice.

You also need a way to ensure that schema changes don't render committed-but-not-yet-replayed
row changes from other peers nonsensical. The safest way is a barrier where all row changes
committed on any node before committing the schema change on the origin node must be fully
replayed on every other node, making an async MM system temporarily sync single-master
(and requiring all nodes to be up and reachable). Otherwise you need a way to figure out
how to conflict-resolve incoming rows with missing columns / added columns / changed types
/ renamed tables etc., which is no fun and nearly impossible in the general case.

2PC decoding lets us avoid all this mess by sending all nodes the proposed schema change
and waiting until they all confirm successful prepare before committing it. It can also be
used to solve the row compatibility problems with some more lazy inter-node chat in
logical WAL messages.

I think the purpose of having the GID available to the decoding output plugin at PREPARE
TRANSACTION time is that it can co-operate with a global transaction manager that way.
Each node can tell the GTM "I'm ready to commit [X]". It is IMO not crucial, since you can
otherwise use a (node-id, xid) tuple, but it'd be nice for coordinating with external
systems, simplifying inter-node chatter, integrating logical decoding into bigger systems
with external transaction coordinators/arbitrators, etc. It seems pretty silly _not_ to
have it, really.

Personally I don't think lack of access to the GID justifies blocking 2PC logical
decoding. It can be added separately. But it'd be nice to have, especially if it's cheap.
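The origin-side flow Craig describes could be summarized as follows (pure pseudocode; every function name here is hypothetical):

    /* Origin node wrapping a replicated commit in 2PC (hypothetical names) */
    prepare_transaction(gid);               /* local PREPARE TRANSACTION 'gid' */
    wait_for_peers_to_decode_prepare(gid);  /* peers receive it via prepare_cb */
    if (all_peers_prepared_ok(gid))
        commit_prepared(gid);               /* peers decode and apply COMMIT PREPARED */
    else
        rollback_prepared(gid);             /* peers decode ROLLBACK PREPARED, discard */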
On 31.01.2017 09:29, Michael Paquier wrote:
> On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
>> wal_level >= logical I don't think that's the end of the world. But
>> since we already have almost everything we need in memory, why not
>> just stash the gid on ReorderBufferTXN?
>
> I have been through this thread... And to be honest, I have a hard
> time understanding for which purpose the information of a 2PC
> transaction is useful in the case of logical decoding. In short, even
> if you do logical decoding from the root node, or the one at a higher
> level, you would care just about the fact that it has been committed.

Sorry, maybe I do not completely understand your arguments. Actually our multimaster is
now completely based on logical replication and 2PC (more precisely, we are using 3PC
now :). The state of a transaction (prepared, precommitted, committed) should be persisted
in WAL to make it possible to perform recovery. Recovery can involve transactions in any
state. So there are three records in the WAL: PREPARE, PRECOMMIT, COMMIT_PREPARED, and
recovery can involve either all of them, or PRECOMMIT+COMMIT_PREPARED, or just
COMMIT_PREPARED.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 31 Jan. 2017 22:43, "Konstantin Knizhnik" <k.knizhnik@postgrespro.ru> wrote:

> On 31.01.2017 09:29, Michael Paquier wrote:
>> I have been through this thread... And to be honest, I have a hard
>> time understanding for which purpose the information of a 2PC
>> transaction is useful in the case of logical decoding.
>
> State of transaction (prepared, precommitted, committed) should be persisted in WAL to
> make it possible to perform recovery. Recovery can involve transactions in any state. So
> there are three records in the WAL: PREPARE, PRECOMMIT, COMMIT_PREPARED, and recovery
> can involve either all of them, or PRECOMMIT+COMMIT_PREPARED, or just COMMIT_PREPARED.

That's your modified Pg, though.

This 2PC logical decoding patch proposal is for core, and I think it just confuses things
to introduce discussion of unrelated changes made by your product to the codebase.
On Tue, Jan 31, 2017 at 6:22 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> That's where you've misunderstood - it isn't committed yet. The point of
> this change is to allow us to do logical decoding at the PREPARE TRANSACTION
> point. The xact is not yet committed or rolled back.

Yes, I got that. I was looking for a why, or an actual use case.

> Stas wants this for a conflict-free logical semi-synchronous multi-master
> replication solution.

This sentence is hard to decrypt; it would be less so without "multi master", as the
concept applies basically to only one master node.

> At PREPARE TRANSACTION time we replay the xact to
> other nodes, each of which applies it and does PREPARE TRANSACTION, then replies
> to confirm it has successfully prepared the xact. When all nodes confirm the
> xact is prepared, it is safe for the origin node to COMMIT PREPARED. The
> other nodes then see that the first node has committed and they commit too.

OK, this is the argument I was looking for. So in your schema the origin node, the one
generating the changes, is itself in charge of deciding if the 2PC should work or not.
There are two channels between the origin node and the replicas replaying the logical
changes: one is for the logical decoder with a receiver, the second one is used to
communicate the WAL apply status. I thought about something like postgres_fdw doing this
job with a transaction that does writes across several nodes; that's why I got confused
about this feature. There everything goes through one channel, so the failure handling is
simplified.

> Alternately, if any node replies "could not replay xact" or "could not
> prepare xact", the origin node knows to ROLLBACK PREPARED. All the other
> nodes see that and roll back too.

The origin node could just issue the ROLLBACK or COMMIT and the logical replicas would
just apply this change.

> To really make it rock solid you also have to send the old and new values of
> a row, or have row versions, or send old row hashes. Something I also want
> to have, but we can mostly get that already with REPLICA IDENTITY FULL.

On a primary key (or a unique index), the default replica identity is enough, I think.

> It is of interest to me because schema changes in MM logical replication are
> more challenging, awkward and restrictive without it. [...]

That's one vision of things; FDW-like approaches would be a second. Those are not able to
pass down utility statements natively, though this stuff can be done with the utility
hook.

> I think the purpose of having the GID available to the decoding output
> plugin at PREPARE TRANSACTION time is that it can co-operate with a global
> transaction manager that way. Each node can tell the GTM "I'm ready to
> commit [X]". [...]

Well, Postgres-XC/XL save the 2PC GID for this purpose in the GTM; this way the
COMMIT/ABORT PREPARED can be issued from any node, and there is a centralized conflict
resolution, the latter being done at a huge cost, causing much of the bottleneck in
scaling performance.

> Personally I don't think lack of access to the GID justifies blocking 2PC
> logical decoding. It can be added separately. But it'd be nice to have,
> especially if it's cheap.

I think it should be added, reading this thread.
--
Michael
On Tue, Jan 31, 2017 at 9:05 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> Personally I don't think lack of access to the GID justifies blocking 2PC
>> logical decoding. It can be added separately. But it'd be nice to have,
>> especially if it's cheap.
>
> I think it should be added, reading this thread.

+1.

If on the logical replication master the user executes PREPARE TRANSACTION 'mumble', isn't
it sensible to want the logical replica to prepare the same set of changes with the same
GID? To me, that not only seems like *a* sensible thing to want to do but probably the
*most* sensible thing to want to do. And then, when the eventual COMMIT PREPARED 'mumble'
comes along, you want to have the replica run the same command.

If you don't do that, then the alternative is that the replica has to make up new names
based on the master's XID. But that kinda sucks, because now if replication stops due to a
conflict or whatever and you have to disentangle things by hand, all the names on the
replica are basically meaningless.

Also, including the GID in the WAL for each COMMIT/ABORT PREPARED doesn't seem
inordinately expensive to me. For that to really add up to a significant cost, wouldn't
you need to be doing LOTS of 2PC transactions, each with very little work, so that the
commit/abort prepared records weren't swamped by everything else? That seems like an
unlikely scenario, but if it does happen, that's exactly when you'll be most grateful for
the GID tracking. I think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
> doesn't seem inordinately expensive to me.

I'm confused ... isn't it there already? If not, how do we handle
reconstructing 2PC state from WAL at all?

			regards, tom lane
On 02/01/2017 10:32 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
>> doesn't seem inordinately expensive to me.
>
> I'm confused ... isn't it there already? If not, how do we handle
> reconstructing 2PC state from WAL at all?

Right now logical decoding ignores prepare and takes into account only "commit prepared":

    /*
     * Currently decoding ignores PREPARE TRANSACTION and will just
     * decode the transaction when the COMMIT PREPARED is sent or
     * throw away the transaction's contents when a ROLLBACK PREPARED
     * is received. In the future we could add code to expose prepared
     * transactions in the changestream allowing for a kind of
     * distributed 2PC.
     */

For some scenarios it works well, but if we really need the prepared state at the replica
(as in the case of multimaster), then it is not enough.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
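That comment sits in DecodeXactOp() in src/backend/replication/logical/decode.c, where the PREPARE record is currently a no-op (paraphrased from the 9.6-era sources; only the shape of the branch is shown):

    switch (info)
    {
        /* ... XLOG_XACT_COMMIT / XLOG_XACT_ABORT cases ... */

        case XLOG_XACT_PREPARE:
            /* Nothing decoded here: the xact's changes simply stay queued
             * in the reorder buffer until its COMMIT PREPARED (replay) or
             * ROLLBACK PREPARED (discard) record arrives. */
            break;
    }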
On 2 Feb. 2017 08:32, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Robert Haas <robertmhaas@gmail.com> writes:
>> Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
>> doesn't seem inordinately expensive to me.
>
> I'm confused ... isn't it there already? If not, how do we handle
> reconstructing 2PC state from WAL at all?

Right. Per my comments upthread I don't see why we need to add anything more to WAL here.

Stas was concerned about what happens in logical decoding if we crash between PREPARE
TRANSACTION and COMMIT PREPARED. But we'll always go back and decode the whole txn again
anyway, so it doesn't matter.

We can just track it on ReorderBufferTXN when we see it at PREPARE TRANSACTION time.
On Wed, Feb 1, 2017 at 2:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
>> doesn't seem inordinately expensive to me.
>
> I'm confused ... isn't it there already? If not, how do we handle
> reconstructing 2PC state from WAL at all?

By XID. See xl_xact_twophase, which gets included in xl_xact_commit
or xl_xact_abort. The GID has got to be there in the XL_XACT_PREPARE
record, but not when actually committing/rolling back.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
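For reference, the struct Robert mentions is tiny (abridged from a 9.6-era src/include/access/xact.h; double-check against the tree you are on):

    /*
     * Present in commit/abort records only when XACT_XINFO_HAS_TWOPHASE
     * is set, i.e. for COMMIT PREPARED / ROLLBACK PREPARED.
     */
    typedef struct xl_xact_twophase
    {
        TransactionId xid;      /* xid assigned at PREPARE TRANSACTION time */
    } xl_xact_twophase;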
On Wed, Feb 1, 2017 at 4:35 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> Right. Per my comments upthread I don't see why we need to add anything more
> to WAL here.
>
> Stas was concerned about what happens in logical decoding if we crash
> between PREPARE TRANSACTION and COMMIT PREPARED. But we'll always go back
> and decode the whole txn again anyway, so it doesn't matter.
>
> We can just track it on ReorderBufferTXN when we see it at PREPARE
> TRANSACTION time.

Oh, hmm. I guess if that's how it works then we don't need it in WAL
after all. I'm not sure that re-decoding the already-prepared
transaction is a very good plan, but if that's what we're doing anyway,
this patch probably shouldn't change it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 3 February 2017 at 03:34, Robert Haas <robertmhaas@gmail.com> wrote:
> Oh, hmm. I guess if that's how it works then we don't need it in WAL
> after all. I'm not sure that re-decoding the already-prepared
> transaction is a very good plan, but if that's what we're doing anyway,
> this patch probably shouldn't change it.

We don't have much choice at the moment.

Logical decoding must restart from the xl_running_xacts most recently
prior to the xid allocation for the oldest xact the client hasn't
confirmed receipt of decoded data + commit for. That's because reorder
buffers are not persistent; if a decoding session crashes we throw
away accumulated reorder buffers, both those in memory and those
spilled to disk. We have to re-create them by restarting decoding from
the beginning of the oldest xact of interest.

We could make reorder buffers persistent and shared between decoding
sessions, but it'd totally change the logical decoding model and
create some other problems. It's certainly not a topic for this patch.
So we can take it as given that we'll always restart decoding from
BEGIN again at a crash.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 2, 2017 at 7:14 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> We could make reorder buffers persistent and shared between decoding
> sessions, but it'd totally change the logical decoding model and
> create some other problems. It's certainly not a topic for this patch.
> So we can take it as given that we'll always restart decoding from
> BEGIN again at a crash.

OK, thanks for the explanation. I have never liked this design very
much, and told Andres so: big transactions are bound to cause
noticeable replication lag. But you're certainly right that it's not
a topic for this patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-02-03 17:47:50 -0500, Robert Haas wrote:
> On Thu, Feb 2, 2017 at 7:14 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> We could make reorder buffers persistent and shared between decoding
>> sessions, but it'd totally change the logical decoding model and
>> create some other problems. It's certainly not a topic for this patch.
>> So we can take it as given that we'll always restart decoding from
>> BEGIN again at a crash.

Sharing them seems unlikely (filtering and such would become a lot more
complicated) and separate from persistency. I'm not sure however how
it'd "totally change the logical decoding model"? Even if we'd not
always restart decoding, we'd still have the option to add the
necessary information to the spill files, so I'm unclear how
persistency plays a role here?

> OK, thanks for the explanation. I have never liked this design very
> much, and told Andres so: big transactions are bound to cause
> noticeable replication lag. But you're certainly right that it's not
> a topic for this patch.

Streaming and persistency of spill files are different topics, no?
Either would have initially complicated things beyond the point of
getting things into core - I'm all for adding them at some point.
Persistent spill files (which would also mean spilling small
transactions at regular intervals) also have the issue that the spill
format becomes something that can't be adapted in bugfixes etc., and
that we need to fsync it.

I still haven't seen a credible model for being able to apply a stream
of interleaved transactions that can roll back individually; I think we
really need the ability to have multiple transactions alive in one
backend for that.

Andres
On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund <andres@anarazel.de> wrote:
> I still haven't seen a credible model for being able to apply a stream
> of interleaved transactions that can roll back individually; I think we
> really need the ability to have multiple transactions alive in one
> backend for that.

Hmm, yeah, that's a problem. That smells like autonomous transactions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-02-03 18:47:23 -0500, Robert Haas wrote:
> On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund <andres@anarazel.de> wrote:
>> I still haven't seen a credible model for being able to apply a stream
>> of interleaved transactions that can roll back individually; I think we
>> really need the ability to have multiple transactions alive in one
>> backend for that.
>
> Hmm, yeah, that's a problem. That smells like autonomous transactions.

Unfortunately the last few proposals to deal with autonomous xacts, like
spawning backends, aren't really suitable for replication unless you
only have very large ones. And it really needs to be an implementation
where ATs can freely be switched in between. On the other hand, a good
deal of problems (like locking) shouldn't be an issue, since there's
obviously a possible execution schedule.

I suspect this'd need some low-level implementation close to xact.c
that'd allow switching between transactions.

- Andres
On Fri, Feb 3, 2017 at 7:08 PM, Andres Freund <andres@anarazel.de> wrote:
> Unfortunately the last few proposals to deal with autonomous xacts, like
> spawning backends, aren't really suitable for replication unless you
> only have very large ones. And it really needs to be an implementation
> where ATs can freely be switched in between. On the other hand, a good
> deal of problems (like locking) shouldn't be an issue, since there's
> obviously a possible execution schedule.
>
> I suspect this'd need some low-level implementation close to xact.c
> that'd allow switching between transactions.

Yeah. Well, I still feel like that's also how autonomous transactions
oughta work, but I realize that's not a unanimous viewpoint. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-02-03 19:09:43 -0500, Robert Haas wrote:
> Yeah. Well, I still feel like that's also how autonomous transactions
> oughta work, but I realize that's not a unanimous viewpoint. :-)

Same here ;)
On 02/04/2017 03:08 AM, Andres Freund wrote:
> Unfortunately the last few proposals to deal with autonomous xacts, like
> spawning backends, aren't really suitable for replication unless you
> only have very large ones. And it really needs to be an implementation
> where ATs can freely be switched in between. On the other hand, a good
> deal of problems (like locking) shouldn't be an issue, since there's
> obviously a possible execution schedule.
>
> I suspect this'd need some low-level implementation close to xact.c
> that'd allow switching between transactions.

Let me add my two coins here:

1. We are using logical decoding in our multimaster and applying transactions concurrently
with a pool of workers. Unlike asynchronous replication, in multimaster we need to perform
voting for each transaction commit, so if transactions are applied by a single worker,
then performance will be awful and, moreover, there is a big chance of getting a
"deadlock" where none of the workers can complete voting because different nodes are
voting on different transactions.

I could not say that there are no problems with this approach. There are definitely a lot
of challenges. First of all, we need to use a special DTM (distributed transaction
manager) to provide consistent application of transactions at different nodes. The second
problem is once again related to the kind of "deadlock" explained above: even if we apply
transactions concurrently, it is still possible to get such a deadlock if we do not have
enough workers. This is why we allow launching extra workers dynamically (though this is
ultimately limited by the maximal number of configured bgworkers). But in any case, I
think that a "parallel apply" mode is a must-have for logical replication.

2. We have implemented autonomous transactions in PgPro EE. Unlike the proposal currently
present at the commitfest, we execute an autonomous transaction within the same backend,
so we are just storing and restoring the transaction context. Unfortunately that is also
not such a cheap operation. An autonomous transaction should not see any changes done by
the parent transaction (because the parent can be rolled back after the autonomous
transaction commits). But there are catalog and relation caches inside the backend, so we
have to clean these caches before switching to the ATX. That is quite an expensive
operation, and so the speed of execution of a PL/pgSQL function with an autonomous
transaction is several orders of magnitude slower than without one. So autonomous
transactions can be used for audit (the primary goal of using ATX in Oracle PL/SQL
applications), but this mechanism is not efficient for concurrent execution of multiple
transactions in one backend.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> On 31 Jan 2017, at 12:22, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> Personally I don't think lack of access to the GID justifies blocking 2PC logical
> decoding. It can be added separately. But it'd be nice to have, especially if it's cheap.

Agreed.

> On 2 Feb 2017, at 00:35, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> Stas was concerned about what happens in logical decoding if we crash between PREPARE
> TRANSACTION and COMMIT PREPARED. But we'll always go back and decode the whole txn again
> anyway, so it doesn't matter.

Not exactly. It seems that in previous discussions we were not on the same page, probably
due to unclear arguments by me.

From my point of view there are no problems (or at least no new problems compared to
ordinary 2PC) with preparing transactions on slave servers with something like
“#{xid}#{node_id}” instead of the GID, if the issuing node is the coordinator of that
transaction. In case of failure, restart, or crash we have the same options for deciding
what to do with uncommitted transactions.

My concern is about the situation with an external coordinator. That scenario is quite
important for users of Postgres native 2PC, notably J2EE users. Suppose the user (or his
framework) issues “PREPARE TRANSACTION 'mytxname';” to servers with ordinary synchronous
physical replication. If the master crashes and a replica is promoted, the user can
reconnect to it and commit/abort that transaction using his GID. It is unclear to me how
to achieve the same behaviour with logical replication of 2PC without the GID in the
commit record. If we prepare with “#{xid}#{node_id}” on the acceptor nodes, then if the
donor node crashes we lose the mapping between the user's GID and our internal GID;
conversely, we can prepare with the user's GID on the acceptors, but then we will not know
that GID on the donor while decoding the commit (by the time decoding happens, all the
memory state is already gone and we can’t exchange our xid for the GID).

I performed some tests to understand the real impact on WAL size. I've compared postgres
master with wal_level = logical, after 3M 2PC transactions, against patched postgres where
GIDs are stored inside commit records too, testing with both 194-byte and 6-byte GIDs
(the GID maximum size is 200 bytes).

-master,  6-byte GIDs, after 3M transactions:    pg_current_xlog_location = 0/9572CB28
-patched, 6-byte GIDs, after 3M transactions:    pg_current_xlog_location = 0/96C442E0

So with 6-byte GIDs the difference in WAL size is less than 1%.

-master,  194-byte GIDs, after 3M transactions:  pg_current_xlog_location = 0/B7501578
-patched, 194-byte GIDs, after 3M transactions:  pg_current_xlog_location = 0/D8B43E28

And with 194-byte GIDs the difference in WAL size is about 18%.

So using big GIDs (as J2EE does) can cause notable WAL bloat, while small GIDs are almost
unnoticeable.

Maybe we can introduce a configuration option track_commit_gid, by analogy with
track_commit_timestamp, and make that behaviour optional? Any objections to that?

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
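Those LSNs can be diffed directly to verify the percentages (pg_xlog_location_diff is the 9.6-era function name, renamed pg_wal_lsn_diff in PostgreSQL 10; the byte counts in the comments are arithmetic on the posted numbers):

    -- 6-byte GIDs: ~22 MB of extra WAL on top of ~2.5 GB, i.e. under 1%,
    -- about 7 extra bytes per transaction
    SELECT pg_xlog_location_diff('0/96C442E0', '0/9572CB28');  -- 22116280

    -- 194-byte GIDs: ~560 MB extra on top of ~3.1 GB, i.e. about 18%,
    -- about 187 extra bytes per transaction, on the order of the GID itself
    SELECT pg_xlog_location_diff('0/D8B43E28', '0/B7501578');  -- 560212144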
On 9 February 2017 at 21:23, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
> From my point of view there are no problems (or at least no new problems compared to
> ordinary 2PC) with preparing transactions on slave servers with something like
> “#{xid}#{node_id}” instead of the GID, if the issuing node is the coordinator of that
> transaction. In case of failure, restart, or crash we have the same options for deciding
> what to do with uncommitted transactions.

But we don't *need* to do that. We have access to the GID of the 2PC
xact from PREPARE TRANSACTION until COMMIT PREPARED, after which we
have no need for it. So we can always use the user-supplied GID.

> I performed some tests to understand the real impact on WAL size. I've compared postgres
> master with wal_level = logical, after 3M 2PC transactions, against patched postgres
> where GIDs are stored inside commit records too.

Why do you do this? You don't need to. You can look the GID up from
the 2pc status table in memory unless the master already did COMMIT
PREPARED, in which case you can just decode it as a normal xact as if
it were never 2pc in the first place.

I don't think I've managed to make this point by description, so I'll
try to modify your patch to demonstrate.

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 01/03/17 10:24, Craig Ringer wrote:
> But we don't *need* to do that. We have access to the GID of the 2PC
> xact from PREPARE TRANSACTION until COMMIT PREPARED, after which we
> have no need for it. So we can always use the user-supplied GID.
>
> Why do you do this? You don't need to. You can look the GID up from
> the 2pc status table in memory unless the master already did COMMIT
> PREPARED, in which case you can just decode it as a normal xact as if
> it were never 2pc in the first place.
>
> I don't think I've managed to make this point by description, so I'll
> try to modify your patch to demonstrate.

If I understand you correctly, you are saying that when PREPARE is being
decoded, we can load the GID from the in-memory 2PC info for that
specific transaction. The info gets removed on COMMIT PREPARED, but at
that point there is no real difference between replicating it as 2PC or
1PC, since the 2PC behavior is for all intents and purposes lost at that
point.

Works for me. I guess the hard part is knowing whether COMMIT PREPARED
has already happened at the time PREPARE is decoded, but I guess the
existence of the needed info could probably be used for that.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2 March 2017 at 06:20, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > If I understand you correctly, you are saying that if PREPARE is being > decoded, we can load the GID from the in-memory 2pc info for that > specific 2pc. The info gets removed on COMMIT PREPARED but at that point > there is no real difference between replicating it as 2pc or 1pc since > the 2pc behavior is for all intents and purposes lost at that point. > Works for me. I guess the hard part is knowing whether COMMIT PREPARED > already happened at the time PREPARE is decoded, but the existence of the > needed info could probably be used for that. Right. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> On 2 Mar 2017, at 01:20, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > > The info gets removed on COMMIT PREPARED but at that point > there is no real difference between replicating it as 2pc or 1pc since > the 2pc behavior is for all intents and purposes lost at that point. > If we are doing 2pc and COMMIT PREPARED happens then we should replicate that without the transaction body to the receiving servers, since the tx is already prepared on them with some GID. So we need a way to construct that GID. It seems that for the last ~10 messages I’ve been failing to explain some points about this topic. Or, maybe, I’m failing to understand some points. Can we maybe set up a Skype call to discuss this and post a summary here? Craig? Peter? -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 2 March 2017 at 15:27, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: > >> On 2 Mar 2017, at 01:20, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: >> >> The info gets removed on COMMIT PREPARED but at that point >> there is no real difference between replicating it as 2pc or 1pc since >> the 2pc behavior is for all intents and purposes lost at that point. >> > > If we are doing 2pc and COMMIT PREPARED happens then we should > replicate that without the transaction body to the receiving servers, since the tx > is already prepared on them with some GID. So we need a way to construct > that GID. We already have it, because we just decoded the PREPARE TRANSACTION. I'm preparing a patch revision to demonstrate this. BTW, I've been reviewing the patch in more detail. Other than a bunch of copy-and-paste that I'm cleaning up, the main issue I've found is that in DecodePrepare, you call: SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid, parsed->nsubxacts, parsed->subxacts); but I am not convinced it is correct to call it at PREPARE TRANSACTION time, only at COMMIT PREPARED time. We want to see the 2pc prepared xact's state when decoding it, but there might be later commits that cannot yet see that state and shouldn't have it visible in their snapshots. Imagine, say:

    BEGIN;
    ALTER TABLE t ADD COLUMN ...;
    INSERT INTO t ...;
    PREPARE TRANSACTION 'x';

    BEGIN;
    INSERT INTO t ...;
    COMMIT;

    COMMIT PREPARED 'x';

We want to see the new column when decoding the prepared xact, but _not_ when decoding the subsequent xact between the prepare and commit. This particular case cannot occur because the lock held by ALTER TABLE blocks the INSERT in the other xact, but how sure are you that there are no other snapshot issues that could arise if we promote a snapshot to visible early? What about if we ROLLBACK PREPARED after we made the snapshot visible? The tests don't appear to cover logical decoding of 2PC sessions that do DDL at all. I emphasised that that would be one of the main problem areas when we originally discussed this. I'll look at adding some, since I think this is one of the areas that's most likely to find issues. > It seems that for the last ~10 messages I’ve been failing to explain some points about this > topic. Or, maybe, I’m failing to understand some points. Can we maybe set up > a Skype call to discuss this and post a summary here? Craig? Peter? Let me prep an updated patch. Time zones make it rather hard to do voice; I'm in +0800 Western Australia, Petr is in +0200... -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2 March 2017 at 16:00, Craig Ringer <craig@2ndquadrant.com> wrote: > What about if we ROLLBACK PREPARED after > we made the snapshot visible? Yeah, I'm pretty sure that's going to be a problem actually. You're telling the snapshot builder that an xact committed at PREPARE TRANSACTION time. If we then ROLLBACK PREPARED, we're in a mess. It looks like it'll cause issues with catalogs, user-catalog tables, etc. I suspect we need to construct a temporary snapshot to decode PREPARE TRANSACTION then discard it. If we later COMMIT PREPARED we should perform the current steps to merge the snapshot state in. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
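A minimal sketch of that temporary-snapshot idea, purely for illustration: every function name below is a hypothetical stand-in, not existing PostgreSQL API.

    /* Give the prepared xact a private snapshot for decoding, then throw it
     * away instead of merging it into the snapshot builder's committed-xact
     * state, so a later ROLLBACK PREPARED leaves nothing to undo.
     * All names here are hypothetical. */
    Snapshot snap;

    snap = BuildHistoricSnapshotForDecoding(builder);
    SnapshotMarkXactVisible(snap, prepared_xid);    /* xact sees its own changes */
    ReplayPreparedXact(reorder_buffer, prepared_xid, snap);
    DiscardDecodingSnapshot(snap);                  /* nothing merged into builder state */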
> On 2 Mar 2017, at 11:00, Craig Ringer <craig@2ndquadrant.com> wrote: > > We already have it, because we just decoded the PREPARE TRANSACTION. > I'm preparing a patch revision to demonstrate this. Yes, we already have it, but if the server reboots between commit prepared (all prepared state is gone) and decoding of this commit prepared then we lose that mapping, don’t we? > BTW, I've been reviewing the patch in more detail. Other than a bunch > of copy-and-paste that I'm cleaning up, the main issue I've found is > that in DecodePrepare, you call: > > SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid, > parsed->nsubxacts, parsed->subxacts); > > but I am not convinced it is correct to call it at PREPARE TRANSACTION > time, only at COMMIT PREPARED time. We want to see the 2pc prepared > xact's state when decoding it, but there might be later commits that > cannot yet see that state and shouldn't have it visible in their > snapshots. Agreed, that is a problem. That allows decoding this PREPARE, but after that it is better to mark this transaction as running in the snapshot, or to perform prepare decoding with some kind of copied-and-edited snapshot. I’ll have a look at this. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 2 March 2017 at 16:20, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: > >> On 2 Mar 2017, at 11:00, Craig Ringer <craig@2ndquadrant.com> wrote: >> >> We already have it, because we just decoded the PREPARE TRANSACTION. >> I'm preparing a patch revision to demonstrate this. > > Yes, we already have it, but if the server reboots between commit prepared (all > prepared state is gone) and decoding of this commit prepared then we lose > that mapping, don’t we? I was about to explain how restart_lsn works again, and how that would mean we'd always re-decode the PREPARE TRANSACTION before any COMMIT PREPARED or ROLLBACK PREPARED on crash. But... Actually, the way you've implemented it, that won't be the case. You treat PREPARE TRANSACTION as a special case of COMMIT, and the client will presumably send replay confirmation after it has applied the PREPARE TRANSACTION. In fact, it has to if we want 2PC to work with synchronous replication. This will allow restart_lsn to advance to after the PREPARE TRANSACTION record if there's no other older xact and we see a suitable xl_running_xacts record. So we wouldn't decode the PREPARE TRANSACTION again after restart. Hm. That's actually a pretty good reason to xlog the gid for 2pc rollback and commit if we're at wal_level >= logical. Being able to advance restart_lsn and avoid the re-decoding work is a big win. Come to think of it, we have to advance the client replication identifier as part of PREPARE TRANSACTION anyway, otherwise we'd try to repeat and re-prepare the same xact on crash recovery. Given that, I withdraw my objection to adding the gid to commit and rollback xlog records, though it should only be done if they're 2pc commit/abort, and only if XLogLogicalInfoActive(). >> BTW, I've been reviewing the patch in more detail. Other than a bunch >> of copy-and-paste that I'm cleaning up, the main issue I've found is >> that in DecodePrepare, you call: >> >> SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid, >> parsed->nsubxacts, parsed->subxacts); >> >> but I am not convinced it is correct to call it at PREPARE TRANSACTION >> time, only at COMMIT PREPARED time. We want to see the 2pc prepared >> xact's state when decoding it, but there might be later commits that >> cannot yet see that state and shouldn't have it visible in their >> snapshots. > > Agreed, that is a problem. That allows decoding this PREPARE, but after that > it is better to mark this transaction as running in the snapshot, or to perform prepare > decoding with some kind of copied-and-edited snapshot. I’ll have a look at this. Thanks. It's also worth noting that with your current approach, 2PC xacts will produce two calls to the output plugin's commit() callback, once for the PREPARE TRANSACTION and another for the COMMIT PREPARED or ROLLBACK PREPARED, the latter two with a faked-up state. I'm not a huge fan of that. It's not entirely backward compatible, since it violates the previously safe assumption that there's a 1:1 relationship between begin and commit callbacks with no interleaving, for one thing, and I think it's also a bit misleading to send a PREPARE TRANSACTION to a callback that could previously only receive a true commit. I particularly dislike calling a commit callback for an abort. So I'd like to look further into the interface side of things. I'm inclined to suggest adding new callbacks for 2pc prepare, commit and rollback, and if the output plugin doesn't set them, fall back to the existing behaviour. Plugins that aren't interested in 2PC (think ETL) should probably not have to deal with it; we might as well just send them only the actually committed xacts, when they commit. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
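For concreteness, a rough sketch of what such an extended interface could look like, modelled on the existing OutputPluginCallbacks struct; the three new callback names follow this discussion and are not committed API:

    /* Sketch only: the new callbacks follow the proposal above.
     * Signatures are modelled on the existing LogicalDecodeCommitCB. */
    typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
                                            ReorderBufferTXN *txn,
                                            XLogRecPtr prepare_lsn);

    typedef struct OutputPluginCallbacks
    {
        /* ... existing startup_cb/begin_cb/change_cb/commit_cb/shutdown_cb ... */
        LogicalDecodePrepareCB prepare_cb;          /* PREPARE TRANSACTION */
        LogicalDecodePrepareCB commit_prepared_cb;  /* COMMIT PREPARED */
        LogicalDecodePrepareCB abort_prepared_cb;   /* ROLLBACK PREPARED */
    } OutputPluginCallbacks;

    /* Decoder-side fallback: plugins that leave the new callbacks NULL keep
     * the old behaviour of seeing the whole xact via commit_cb at COMMIT
     * PREPARED time.  defer_until_commit_prepared() is a hypothetical helper. */
    if (ctx->callbacks.prepare_cb == NULL)
        defer_until_commit_prepared(txn);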
On 02/03/17 13:23, Craig Ringer wrote: > On 2 March 2017 at 16:20, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: >> >>> On 2 Mar 2017, at 11:00, Craig Ringer <craig@2ndquadrant.com> wrote: >>> >>> We already have it, because we just decoded the PREPARE TRANSACTION. >>> I'm preparing a patch revision to demonstrate this. >> >> Yes, we already have it, but if the server reboots between commit prepared (all >> prepared state is gone) and decoding of this commit prepared then we lose >> that mapping, don’t we? > > I was about to explain how restart_lsn works again, and how that would > mean we'd always re-decode the PREPARE TRANSACTION before any COMMIT > PREPARED or ROLLBACK PREPARED on crash. But... > > Actually, the way you've implemented it, that won't be the case. You > treat PREPARE TRANSACTION as a special case of COMMIT, and the client > will presumably send replay confirmation after it has applied the > PREPARE TRANSACTION. In fact, it has to if we want 2PC to work with > synchronous replication. This will allow restart_lsn to advance to > after the PREPARE TRANSACTION record if there's no other older xact > and we see a suitable xl_running_xacts record. So we wouldn't decode > the PREPARE TRANSACTION again after restart. > Unless we just don't let restart_lsn go forward if there is a 2pc that wasn't decoded yet (the twophase state stores the prepare LSN), but that's probably too much of a kludge. > > It's also worth noting that with your current approach, 2PC xacts will > produce two calls to the output plugin's commit() callback, once for > the PREPARE TRANSACTION and another for the COMMIT PREPARED or > ROLLBACK PREPARED, the latter two with a faked-up state. I'm not a > huge fan of that. It's not entirely backward compatible, since it > violates the previously safe assumption that there's a 1:1 > relationship between begin and commit callbacks with no interleaving, > for one thing, and I think it's also a bit misleading to send a > PREPARE TRANSACTION to a callback that could previously only receive a > true commit. > > I particularly dislike calling a commit callback for an abort. So I'd > like to look further into the interface side of things. I'm inclined > to suggest adding new callbacks for 2pc prepare, commit and rollback, > and if the output plugin doesn't set them, fall back to the existing > behaviour. Plugins that aren't interested in 2PC (think ETL) should > probably not have to deal with it; we might as well just send them > only the actually committed xacts, when they commit. > I think this is a good approach to handle it. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 3/2/17 11:34 AM, Petr Jelinek wrote: > On 02/03/17 13:23, Craig Ringer wrote: >> >> I particularly dislike calling a commit callback for an abort. So I'd >> like to look further into the interface side of things. I'm inclined >> to suggest adding new callbacks for 2pc prepare, commit and rollback, >> and if the output plugin doesn't set them, fall back to the existing >> behaviour. Plugins that aren't interested in 2PC (think ETL) should >> probably not have to deal with it; we might as well just send them >> only the actually committed xacts, when they commit. >> > > I think this is a good approach to handle it. It's been a while since there was any activity on this thread and a very long time since the last patch. As far as I can see there are far more questions than answers in this thread. If you need more time to produce a patch, please post an explanation for the delay and a schedule for the new patch. If no patch or explanation is posted by 2017-03-17 AoE I will mark this submission "Returned with Feedback". -- -David david@pgmasters.net
On 02/03/17 17:34, Petr Jelinek wrote: > On 02/03/17 13:23, Craig Ringer wrote: >> On 2 March 2017 at 16:20, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: >>> >>>> On 2 Mar 2017, at 11:00, Craig Ringer <craig@2ndquadrant.com> wrote: >>>> >>>> We already have it, because we just decoded the PREPARE TRANSACTION. >>>> I'm preparing a patch revision to demonstrate this. >>> >>> Yes, we already have it, but if the server reboots between commit prepared (all >>> prepared state is gone) and decoding of this commit prepared then we lose >>> that mapping, don’t we? >> >> I was about to explain how restart_lsn works again, and how that would >> mean we'd always re-decode the PREPARE TRANSACTION before any COMMIT >> PREPARED or ROLLBACK PREPARED on crash. But... >> >> Actually, the way you've implemented it, that won't be the case. You >> treat PREPARE TRANSACTION as a special case of COMMIT, and the client >> will presumably send replay confirmation after it has applied the >> PREPARE TRANSACTION. In fact, it has to if we want 2PC to work with >> synchronous replication. This will allow restart_lsn to advance to >> after the PREPARE TRANSACTION record if there's no other older xact >> and we see a suitable xl_running_xacts record. So we wouldn't decode >> the PREPARE TRANSACTION again after restart. >> Thinking about this some more. Why can't we use the same mechanism the standby uses, i.e. use the xid to identify the 2PC? If the output plugin cares about doing 2PC in two phases, it can send the xid as part of its protocol (as PG10 logical replication and pglogical already do) and the downstream can simply remember the remote node + remote xid of the 2PC in progress. That way there is no need for gids in COMMIT PREPARED and this patch would be much simpler (as the tracking would be left to the actual replication implementation as opposed to decoding). Or am I missing something? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
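A minimal sketch of the downstream side of that scheme. The assumption here (not anything in the patch) is that the apply worker learns the remote node id and remote xid from the replication protocol; the GID format is illustrative:

    #include <stdio.h>
    #include <stdint.h>

    /* Derive a locally-unique GID for PREPARE TRANSACTION on the downstream
     * from the remote identity of the 2PC xact.  At COMMIT PREPARED time the
     * same (node, xid) pair arrives again, so the same GID can be rebuilt
     * without the upstream ever logging its user-supplied GID. */
    static void
    build_local_gid(char *gid, size_t len, uint32_t remote_node_id,
                    uint32_t remote_xid)
    {
        snprintf(gid, len, "pgl_%u_%u",
                 (unsigned) remote_node_id, (unsigned) remote_xid);
    }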
On 15 March 2017 at 15:42, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > Thinking about this some more. Why can't we use the same mechanism > the standby uses, i.e. use the xid to identify the 2PC? It pushes work onto the downstream, which has to keep an <xid,gid> mapping in a crash-safe, persistent form. We'll be doing a flush of some kind anyway to report successful prepare to the upstream, so an additional flush of an SLRU might not be so bad for a postgres downstream. And I guess any other clients will have some kind of downstream persistent mapping to use. So I think I have a mild preference for recording the gid on 2pc commit and abort records in the master's WAL, where it's very cheap and simple. But I agree that just sending the xid is a viable option if that falls through. I'm going to try to pick this patch up and amend its interface per our discussion earlier, see if I can get it committable. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> On 16 Mar 2017, at 14:44, Craig Ringer <craig@2ndquadrant.com> wrote: > > I'm going to try to pick this patch up and amend its interface per our > discussion earlier, see if I can get it committable. I’m working right now on an issue with building snapshots for decoding a prepared tx. I hope I'll send an updated patch later today. > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services
>> On 2 Mar 2017, at 11:00, Craig Ringer <craig@2ndquadrant.com> wrote: >> >> BTW, I've been reviewing the patch in more detail. Other than a bunch >> of copy-and-paste that I'm cleaning up, the main issue I've found is >> that in DecodePrepare, you call: >> >> SnapBuildCommitTxn(ctx->snapshot_builder, buf->origptr, xid, >> parsed->nsubxacts, parsed->subxacts); >> >> but I am not convinced it is correct to call it at PREPARE TRANSACTION >> time, only at COMMIT PREPARED time. We want to see the 2pc prepared >> xact's state when decoding it, but there might be later commits that >> cannot yet see that state and shouldn't have it visible in their >> snapshots. > > Agreed, that is a problem. That allows decoding this PREPARE, but after that > it is better to mark this transaction as running in the snapshot, or to perform prepare > decoding with some kind of copied-and-edited snapshot. I’ll have a look at this. > While working on this I’ve spotted quite a nasty corner case with aborted prepared transactions. I have some not-that-great ideas how to fix it, but maybe I blurred my view and missed something, so I want to ask here first. Suppose we created a table, then in a 2pc tx we alter it, and after that we abort the tx. So pg_class will have something like this:

    xmin | xmax | relname
    100  | 200  | mytable
    200  | 0    | mytable

After the abort, tuple (100,200,mytable) becomes visible again, and if we alter the table once more then the xmax of the first tuple will be set to the current xid, resulting in the following table:

    xmin | xmax | relname
    100  | 300  | mytable
    200  | 0    | mytable
    300  | 0    | mytable

At that moment we’ve lost the information that the first tuple was deleted by our prepared tx. From the POV of the historic snapshot that will be constructed to decode the prepare, the first tuple is visible, but actually the second tuple should be used. Moreover, such a snapshot could see both tuples, violating oid uniqueness, but the heapscan stops after finding the first one. I see here two possible workarounds: * Try at first to scan the catalog filtering out tuples with xmax bigger than snapshot->xmax, as they were possibly deleted by our tx. Then, if nothing is found, scan in the usual way. * Do not decode such transactions at all. If by the time of decoding the prepare record we already know that it is aborted, then such decoding doesn’t make much sense. IMO the intended usage of logical 2pc decoding is to decide about commit/abort based on answers from logical subscribers/replicas. So there will be a barrier between prepare and commit/abort, and such situations shouldn’t happen. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 17 March 2017 at 08:10, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: > While working on this I’ve spotted quite a nasty corner case with aborted prepared > transactions. I have some not-that-great ideas how to fix it, but maybe I blurred my > view and missed something, so I want to ask here first. > > Suppose we created a table, then in a 2pc tx we alter it, and after that we abort the tx. > So pg_class will have something like this: >
>    xmin | xmax | relname
>    100  | 200  | mytable
>    200  | 0    | mytable
>
> After the abort, tuple (100,200,mytable) becomes visible again, and if we alter the table > once more then the xmax of the first tuple will be set to the current xid, resulting in the following table: >
>    xmin | xmax | relname
>    100  | 300  | mytable
>    200  | 0    | mytable
>    300  | 0    | mytable
>
> At that moment we’ve lost the information that the first tuple was deleted by our prepared tx. Right. And while the prepared xact has aborted, we don't control when it aborts and when those overwrites can start happening. We can and should check if a 2pc xact is aborted before we start decoding it so we can skip decoding it if it's already aborted, but it could be aborted *while* we're decoding it, then have data needed for its snapshot clobbered. This hasn't mattered in the past because prepared xacts (and especially aborted 2pc xacts) have never needed snapshots; we've never needed to do something from the perspective of a prepared xact. I think we'll probably need to lock the 2PC xact so it cannot be aborted or committed while we're decoding it, until we finish decoding it. So we lock it, then check if it's already aborted/already committed/in progress. If it's aborted, treat it like any normal aborted xact. If it's committed, treat it like any normal committed xact. If it's in progress, keep the lock and decode it. People using logical decoding for 2PC will presumably want to control 2PC via logical decoding, so they're not so likely to mind such a lock. > * Try at first to scan the catalog filtering out tuples with xmax bigger than snapshot->xmax, > as they were possibly deleted by our tx. Then, if nothing is found, scan in the usual way. I don't think that'll be at all viable with the syscache/relcache machinery. Way too intrusive. > * Do not decode such transactions at all. Yes, that's what I'd like to do, per above. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
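A sketch of the lock-then-check flow described above. TransactionIdDidCommit() is existing API; the locking helpers are hypothetical stand-ins for whatever twophase.c would need to expose:

    /* Hypothetical flow; LockPreparedXact()/UnlockPreparedXact() do not
     * exist today and stand in for pinning the gxact in twophase.c. */
    if (!LockPreparedXact(xid))
    {
        /* already resolved: fall back to the normal decoding rules */
        if (TransactionIdDidCommit(xid))
            decode_as_committed(txn);    /* hypothetical helper */
        else
            discard_changes(txn);        /* hypothetical helper */
    }
    else
    {
        /* still prepared, and now it cannot be aborted under us */
        decode_at_prepare_time(txn);     /* hypothetical helper */
        UnlockPreparedXact(xid);
    }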
On 16 March 2017 at 19:52, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: > > I’m working right now on an issue with building snapshots for decoding a prepared tx. > I hope I'll send an updated patch later today. Great. What approach are you taking? It looks like the snapshot builder actually does most of the work we need for this already, maintaining a stack of snapshots we can use. It might be as simple as invalidating the relcache/syscache when we exit (and enter?) decoding of a prepared 2pc xact, since it violates the usual assumption of logical decoding that we decode things strictly in commit-time order. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 16, 2017 at 10:34 PM, Craig Ringer <craig@2ndquadrant.com> wrote: > On 17 March 2017 at 08:10, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: >> While working on this I’ve spotted quite a nasty corner case with aborted prepared >> transactions. I have some not-that-great ideas how to fix it, but maybe I blurred my >> view and missed something, so I want to ask here first. >> >> Suppose we created a table, then in a 2pc tx we alter it, and after that we abort the tx. >> So pg_class will have something like this: >>
>>    xmin | xmax | relname
>>    100  | 200  | mytable
>>    200  | 0    | mytable
>>
>> After the abort, tuple (100,200,mytable) becomes visible again, and if we alter the table >> once more then the xmax of the first tuple will be set to the current xid, resulting in the following table: >>
>>    xmin | xmax | relname
>>    100  | 300  | mytable
>>    200  | 0    | mytable
>>    300  | 0    | mytable
>>
>> At that moment we’ve lost the information that the first tuple was deleted by our prepared tx. > > Right. And while the prepared xact has aborted, we don't control when > it aborts and when those overwrites can start happening. We can and > should check if a 2pc xact is aborted before we start decoding it so > we can skip decoding it if it's already aborted, but it could be > aborted *while* we're decoding it, then have data needed for its > snapshot clobbered. > > This hasn't mattered in the past because prepared xacts (and > especially aborted 2pc xacts) have never needed snapshots; we've never > needed to do something from the perspective of a prepared xact. > > I think we'll probably need to lock the 2PC xact so it cannot be > aborted or committed while we're decoding it, until we finish decoding > it. So we lock it, then check if it's already aborted/already > committed/in progress. If it's aborted, treat it like any normal > aborted xact. If it's committed, treat it like any normal committed > xact. If it's in progress, keep the lock and decode it. But that lock could need to be held for an unbounded period of time - as long as decoding takes to complete - which seems pretty undesirable. Worse still, the same problem will arise if you eventually want to start decoding ordinary, non-2PC transactions that haven't committed yet, which I think is something we definitely want to do eventually; the current handling of bulk loads or bulk updates leads to significant latency. You're not going to be able to tell an active transaction that it isn't allowed to abort until you get done with it, and I don't really think you should be allowed to lock out 2PC aborts for long periods of time either. That's going to stink for users. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 17/03/17 03:34, Craig Ringer wrote: > On 17 March 2017 at 08:10, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: > >> While working on this I’ve spotted quite a nasty corner case with aborted prepared >> transactions. I have some not-that-great ideas how to fix it, but maybe I blurred my >> view and missed something, so I want to ask here first. >> >> Suppose we created a table, then in a 2pc tx we alter it, and after that we abort the tx. >> So pg_class will have something like this: >>
>>    xmin | xmax | relname
>>    100  | 200  | mytable
>>    200  | 0    | mytable
>>
>> After the abort, tuple (100,200,mytable) becomes visible again, and if we alter the table >> once more then the xmax of the first tuple will be set to the current xid, resulting in the following table: >>
>>    xmin | xmax | relname
>>    100  | 300  | mytable
>>    200  | 0    | mytable
>>    300  | 0    | mytable
>>
>> At that moment we’ve lost the information that the first tuple was deleted by our prepared tx. > > Right. And while the prepared xact has aborted, we don't control when > it aborts and when those overwrites can start happening. We can and > should check if a 2pc xact is aborted before we start decoding it so > we can skip decoding it if it's already aborted, but it could be > aborted *while* we're decoding it, then have data needed for its > snapshot clobbered. > > This hasn't mattered in the past because prepared xacts (and > especially aborted 2pc xacts) have never needed snapshots; we've never > needed to do something from the perspective of a prepared xact. > > I think we'll probably need to lock the 2PC xact so it cannot be > aborted or committed while we're decoding it, until we finish decoding > it. So we lock it, then check if it's already aborted/already > committed/in progress. If it's aborted, treat it like any normal > aborted xact. If it's committed, treat it like any normal committed > xact. If it's in progress, keep the lock and decode it. > > People using logical decoding for 2PC will presumably want to control > 2PC via logical decoding, so they're not so likely to mind such a > lock. > >> * Try at first to scan the catalog filtering out tuples with xmax bigger than snapshot->xmax, >> as they were possibly deleted by our tx. Then, if nothing is found, scan in the usual way. > > I don't think that'll be at all viable with the syscache/relcache > machinery. Way too intrusive. > I think only genam would need changes to do a two-phase scan for this, as the catalog scans should ultimately go there. It's going to slow things down, but we could limit the impact by doing the two-phase scan only when a historical snapshot is in use and the tx being decoded changed catalogs (we already have global knowledge of the first one, and it would be trivial to add the second one as we have local knowledge of that as well). What I think is a better strategy than filtering out by xmax would be filtering "in" by xmin, though. Meaning that the first scan would return only tuples modified by the current tx which are visible in the snapshot, and the second scan would return the other visible tuples. That way whatever the decoded tx has seen should always win. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
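A sketch of the pass-one predicate Petr describes, under the assumption that the scan code knows the xid of the transaction being decoded. The helper itself and the idea of parameterising genam this way are illustrative; TransactionIdEquals, HeapTupleHeaderGetXmin and HeapTupleSatisfiesVisibility are existing macros:

    /* Pass 1: accept only catalog tuple versions written by the xact being
     * decoded and visible to the historic snapshot; pass 2 would accept the
     * remaining visible tuples.  Illustrative sketch only. */
    static bool
    pass_one_accepts(HeapTuple tup, Buffer buf, Snapshot snap,
                     TransactionId decoded_xid)
    {
        return TransactionIdEquals(HeapTupleHeaderGetXmin(tup->t_data),
                                   decoded_xid) &&
               HeapTupleSatisfiesVisibility(tup, snap, buf);
    }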
On 17 March 2017 at 23:59, Robert Haas <robertmhaas@gmail.com> wrote: > But that lock could need to be held for an unbounded period of time - > as long as decoding takes to complete - which seems pretty > undesirable. Yeah. We could use a recovery-conflict-like mechanism to signal the decoding session that someone wants to abort the xact, but it gets messy. > Worse still, the same problem will arise if you > eventually want to start decoding ordinary, non-2PC transactions that > haven't committed yet, which I think is something we definitely want > to do eventually; the current handling of bulk loads or bulk updates > leads to significant latency. Yeah. If it weren't for that, I'd probably still just pursue locking. But you're right that we'll have to solve this sooner or later. I'll admit I hoped for later. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 19 March 2017 at 21:26, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > I think only genam would need changes to do a two-phase scan for this, as > the catalog scans should ultimately go there. It's going to slow things > down, but we could limit the impact by doing the two-phase scan only > when a historical snapshot is in use and the tx being decoded changed > catalogs (we already have global knowledge of the first one, and it > would be trivial to add the second one as we have local knowledge of > that as well). We'll also have to clobber caches after we finish decoding a 2pc xact, since we don't know those changes are visible to other xacts and can't guarantee they'll ever be (if it aborts). That's going to be "interesting" when trying to decode interleaved transaction streams since we can't afford to clobber caches whenever we see an xlog record from a different xact. We'll probably have to switch to linear decoding with reordering when someone makes catalog changes. TBH, I have no idea how to approach the genam changes for the proposed double-scan method. It sounds like Stas has some idea how to proceed though (right?) -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 20/03/17 09:32, Craig Ringer wrote: > On 19 March 2017 at 21:26, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > >> I think only genam would need changes to do a two-phase scan for this, as >> the catalog scans should ultimately go there. It's going to slow things >> down, but we could limit the impact by doing the two-phase scan only >> when a historical snapshot is in use and the tx being decoded changed >> catalogs (we already have global knowledge of the first one, and it >> would be trivial to add the second one as we have local knowledge of >> that as well). > > We'll also have to clobber caches after we finish decoding a 2pc xact, > since we don't know those changes are visible to other xacts and can't > guarantee they'll ever be (if it aborts). > AFAIK the reorder buffer already does that. > That's going to be "interesting" when trying to decode interleaved > transaction streams since we can't afford to clobber caches whenever > we see an xlog record from a different xact. We'll probably have to > switch to linear decoding with reordering when someone makes catalog > changes. We may need something that allows for representing multiple parallel transactions in a single process and a cheap way of switching between them (i.e., similar to the things we need for autonomous transactions). But that's not something the current patch has to deal with. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> On 20 Mar 2017, at 11:32, Craig Ringer <craig@2ndquadrant.com> wrote: > > On 19 March 2017 at 21:26, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > >> I think only genam would need changes to do a two-phase scan for this, as >> the catalog scans should ultimately go there. It's going to slow things >> down, but we could limit the impact by doing the two-phase scan only >> when a historical snapshot is in use and the tx being decoded changed >> catalogs (we already have global knowledge of the first one, and it >> would be trivial to add the second one as we have local knowledge of >> that as well). > > > TBH, I have no idea how to approach the genam changes for the proposed > double-scan method. It sounds like Stas has some idea how to proceed > though (right?) > I thought about having a special field (or reusing one of the existing fields) in the snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin, as Petr suggested. Then this logic can reside in ReorderBufferCommit(). However, this does not solve the problem with the catcache, so I'm looking into that right now. > On 17 Mar 2017, at 05:38, Craig Ringer <craig@2ndquadrant.com> wrote: > > On 16 March 2017 at 19:52, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: > >> >> I’m working right now on an issue with building snapshots for decoding a prepared tx. >> I hope I'll send an updated patch later today. > > > Great. > > What approach are you taking? Just as before, I mark this transaction committed in the snapbuilder, but after decoding I delete this transaction from xip (which holds committed transactions in the case of a historic snapshot). > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services
> I thought about having a special field (or reusing one of the existing fields) > in the snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin, > as Petr suggested. Then this logic can reside in ReorderBufferCommit(). > However, this does not solve the problem with the catcache, so I'm looking into that right now. OK, so this is only an issue if we have xacts that change the schema of tables and also insert/update/delete to their heaps. Right? So, given that this is CF3 for Pg10, should we take a step back and impose the limitation that we can decode 2PC with schema changes or data row changes, but not both? Applications can record DDL in transactional logical WAL messages for decoding during 2pc processing. Or apps can do 2pc for DML. They just can't do both at the same time, in the same xact. Imperfect, but a lot less invasive. And we can even permit apps to use the locking-based approach I outlined earlier instead: all we have to do IMO is add an output plugin callback to filter whether we want to decode a given 2pc xact at PREPARE TRANSACTION time or defer until COMMIT PREPARED. It could:

* mark the xact for deferred decoding at commit time (the default if the callback doesn't exist); or
* acquire a lock on the 2pc xact and request immediate decoding only if it gets the lock, so concurrent ROLLBACK PREPARED is blocked; or
* inspect the reorder buffer contents for row changes and decide whether to decode now or later based on that.

It has a few downsides - for example, temp tables will be considered "catalog changes" for now. But .. eh. We already accept a bunch of practical limitations for catalog changes and DDL in logical decoding, most notably regarding practical handling of full table rewrites. > Just as before, I mark this transaction committed in the snapbuilder, but after > decoding I delete this transaction from xip (which holds committed transactions > in the case of a historic snapshot). That seems kind of hacky TBH. I didn't much like marking it as committed then un-committing it. I think it's mostly an interface issue though. I'd rather say SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or something, to make it clear what we're doing. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
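A sketch of what such a filter callback might look like on the plugin side. The callback name and signature are illustrative and follow the discussion above; has_catalog_changes is an existing field of ReorderBufferTXN:

    /* Illustrative sketch of a plugin-side filter, not committed API.
     * Returning true defers decoding of the xact until COMMIT PREPARED. */
    static bool
    my_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                      const char *gid)
    {
        /* Example policy: decode at prepare time only if no catalog changes,
         * so an abort during decoding cannot clobber needed catalog rows. */
        if (txn->has_catalog_changes)
            return true;        /* defer until COMMIT PREPARED */
        return false;           /* decode at PREPARE TRANSACTION time */
    }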
On 17 March 2017 at 23:59, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Mar 16, 2017 at 10:34 PM, Craig Ringer <craig@2ndquadrant.com> wrote: >> On 17 March 2017 at 08:10, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: >>> While working on this I’ve spotted quite a nasty corner case with aborted prepared >>> transactions. I have some not-that-great ideas how to fix it, but maybe I blurred my >>> view and missed something, so I want to ask here first. >>> >>> Suppose we created a table, then in a 2pc tx we alter it, and after that we abort the tx. >>> So pg_class will have something like this: >>>
>>>    xmin | xmax | relname
>>>    100  | 200  | mytable
>>>    200  | 0    | mytable
>>>
>>> After the abort, tuple (100,200,mytable) becomes visible again, and if we alter the table >>> once more then the xmax of the first tuple will be set to the current xid, resulting in the following table: >>>
>>>    xmin | xmax | relname
>>>    100  | 300  | mytable
>>>    200  | 0    | mytable
>>>    300  | 0    | mytable
>>>
>>> At that moment we’ve lost the information that the first tuple was deleted by our prepared tx. >> >> Right. And while the prepared xact has aborted, we don't control when >> it aborts and when those overwrites can start happening. We can and >> should check if a 2pc xact is aborted before we start decoding it so >> we can skip decoding it if it's already aborted, but it could be >> aborted *while* we're decoding it, then have data needed for its >> snapshot clobbered. >> >> This hasn't mattered in the past because prepared xacts (and >> especially aborted 2pc xacts) have never needed snapshots; we've never >> needed to do something from the perspective of a prepared xact. >> >> I think we'll probably need to lock the 2PC xact so it cannot be >> aborted or committed while we're decoding it, until we finish decoding >> it. So we lock it, then check if it's already aborted/already >> committed/in progress. If it's aborted, treat it like any normal >> aborted xact. If it's committed, treat it like any normal committed >> xact. If it's in progress, keep the lock and decode it. > > But that lock could need to be held for an unbounded period of time - > as long as decoding takes to complete - which seems pretty > undesirable. This didn't seem to be too much of a problem when I read it. Sure, the issue noted by Stas exists, but it requires Alter-Abort-Alter for it to be a problem. Meaning that normal non-DDL transactions do not have problems. Neither would a real-time system that uses the decoded data to decide whether to commit or abort the transaction; in that case there would never be an abort until after decoding. So I suggest we have a pre-prepare callback to ensure that the plugin can decide whether to decode or not. We can pass information to the plugin such as whether we have issued DDL in that xact or not. The plugin can then decide how it wishes to handle it, so if somebody doesn't like the idea of a lock then don't use one. The plugin is already responsible for many things, so this is nothing new. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On 20 Mar 2017, at 15:17, Craig Ringer <craig@2ndquadrant.com> wrote: > >> I thought about having a special field (or reusing one of the existing fields) >> in the snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin, >> as Petr suggested. Then this logic can reside in ReorderBufferCommit(). >> However, this does not solve the problem with the catcache, so I'm looking into that right now. > > OK, so this is only an issue if we have xacts that change the schema > of tables and also insert/update/delete to their heaps. Right? > > So, given that this is CF3 for Pg10, should we take a step back and > impose the limitation that we can decode 2PC with schema changes or > data row changes, but not both? Yep, time is tight. I’ll try today/tomorrow to proceed with this two-scan approach. If I fail to do that in this time then I’ll just update this patch to decode only non-DDL 2pc transactions as you suggested. >> Just as before, I mark this transaction committed in the snapbuilder, but after >> decoding I delete this transaction from xip (which holds committed transactions >> in the case of a historic snapshot). > > That seems kind of hacky TBH. I didn't much like marking it as > committed then un-committing it. > > I think it's mostly an interface issue though. I'd rather say > SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or > something, to make it clear what we're doing. Yes, that will be less confusing. However there is no queue of any kind, so SnapBuildStartPrepare / SnapBuildFinishPrepare should work too. > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
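For concreteness, hypothetical signatures for the agreed names, modelled on the existing SnapBuildCommitTxn() quoted earlier in the thread; nothing here is committed API:

    /* Make the prepared xact's changes visible for decoding it. */
    extern void SnapBuildStartPrepare(SnapBuild *builder, XLogRecPtr lsn,
                                      TransactionId xid,
                                      int nsubxacts, TransactionId *subxacts);

    /* Undo only the decoding-visibility side effects of the prepare, called
     * once the prepared xact has been replayed to the output plugin. */
    extern void SnapBuildFinishPrepare(SnapBuild *builder, TransactionId xid);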
On 20 March 2017 at 20:57, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: > >> On 20 Mar 2017, at 15:17, Craig Ringer <craig@2ndquadrant.com> wrote: >> >>> I thought about having a special field (or reusing one of the existing fields) >>> in the snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin, >>> as Petr suggested. Then this logic can reside in ReorderBufferCommit(). >>> However, this does not solve the problem with the catcache, so I'm looking into that right now. >> >> OK, so this is only an issue if we have xacts that change the schema >> of tables and also insert/update/delete to their heaps. Right? >> >> So, given that this is CF3 for Pg10, should we take a step back and >> impose the limitation that we can decode 2PC with schema changes or >> data row changes, but not both? > > Yep, time is tight. I’ll try today/tomorrow to proceed with this two-scan approach. > If I fail to do that in this time then I’ll just update this patch to decode > only non-DDL 2pc transactions as you suggested. I wasn't suggesting not decoding them, but giving the plugin the option of whether to proceed with decoding or not. As Simon said, have a pre-decode-prepared callback that lets the plugin get a lock on the 2pc xact if it wants, or say it doesn't want to decode it until it commits. That'd be useful anyway, so we can filter and only do decoding at prepare transaction time for xacts the downstream wants to know about before they commit. >>> Just as before, I mark this transaction committed in the snapbuilder, but after >>> decoding I delete this transaction from xip (which holds committed transactions >>> in the case of a historic snapshot). >> >> That seems kind of hacky TBH. I didn't much like marking it as >> committed then un-committing it. >> >> I think it's mostly an interface issue though. I'd rather say >> SnapBuildPushPrepareTransaction and SnapBuildPopPreparedTransaction or >> something, to make it clear what we're doing. > > Yes, that will be less confusing. However there is no queue of any kind, so > SnapBuildStartPrepare / SnapBuildFinishPrepare should work too. Yeah, that's better. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> On 20 Mar 2017, at 16:39, Craig Ringer <craig@2ndquadrant.com> wrote: > > On 20 March 2017 at 20:57, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: >> >>> On 20 Mar 2017, at 15:17, Craig Ringer <craig@2ndquadrant.com> wrote: >>> >>>> I thought about having a special field (or reusing one of the existing fields) >>>> in the snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin, >>>> as Petr suggested. Then this logic can reside in ReorderBufferCommit(). >>>> However, this does not solve the problem with the catcache, so I'm looking into that right now. >>> >>> OK, so this is only an issue if we have xacts that change the schema >>> of tables and also insert/update/delete to their heaps. Right? >>> >>> So, given that this is CF3 for Pg10, should we take a step back and >>> impose the limitation that we can decode 2PC with schema changes or >>> data row changes, but not both? >> >> Yep, time is tight. I’ll try today/tomorrow to proceed with this two-scan approach. >> If I fail to do that in this time then I’ll just update this patch to decode >> only non-DDL 2pc transactions as you suggested. > > I wasn't suggesting not decoding them, but giving the plugin the > option of whether to proceed with decoding or not. > > As Simon said, have a pre-decode-prepared callback that lets the > plugin get a lock on the 2pc xact if it wants, or say it doesn't want > to decode it until it commits. > > That'd be useful anyway, so we can filter and only do decoding at > prepare transaction time for xacts the downstream wants to know about > before they commit. Ah, got that. Okay. > -- > Craig Ringer http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services
On 20 March 2017 at 21:47, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: > >> On 20 Mar 2017, at 16:39, Craig Ringer <craig@2ndquadrant.com> wrote: >> >> On 20 March 2017 at 20:57, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: >>> >>>> On 20 Mar 2017, at 15:17, Craig Ringer <craig@2ndquadrant.com> wrote: >>>> >>>>> I thought about having a special field (or reusing one of the existing fields) >>>>> in the snapshot struct to force filtering xmax > snap->xmax or xmin = snap->xmin, >>>>> as Petr suggested. Then this logic can reside in ReorderBufferCommit(). >>>>> However, this does not solve the problem with the catcache, so I'm looking into that right now. >>>> >>>> OK, so this is only an issue if we have xacts that change the schema >>>> of tables and also insert/update/delete to their heaps. Right? >>>> >>>> So, given that this is CF3 for Pg10, should we take a step back and >>>> impose the limitation that we can decode 2PC with schema changes or >>>> data row changes, but not both? >>> >>> Yep, time is tight. I’ll try today/tomorrow to proceed with this two-scan approach. >>> If I fail to do that in this time then I’ll just update this patch to decode >>> only non-DDL 2pc transactions as you suggested. >> >> I wasn't suggesting not decoding them, but giving the plugin the >> option of whether to proceed with decoding or not. >> >> As Simon said, have a pre-decode-prepared callback that lets the >> plugin get a lock on the 2pc xact if it wants, or say it doesn't want >> to decode it until it commits. >> >> That'd be useful anyway, so we can filter and only do decoding at >> prepare transaction time for xacts the downstream wants to know about >> before they commit. > > Ah, got that. Okay. Any news here? We're in the last week of the CF. If you have a patch that's nearly ready or getting there, now would be a good time to post it for help and input from others. I would really like to get this in, but we're running out of time. Even if you just post your snapshot management work, with the cosmetic changes discussed above, that would be a valuable start. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 27 March 2017 at 09:31, Craig Ringer <craig@2ndquadrant.com> wrote: > We're in the last week of the CF. If you have a patch that's nearly > ready or getting there, now would be a good time to post it for help > and input from others. > > I would really like to get this in, but we're running out of time. > > Even if you just post your snapshot management work, with the cosmetic > changes discussed above, that would be a valuable start. I'm going to pick up the last patch and:

* Ensure we only add the GID to xact records for 2pc commits and aborts;
* Add separate callbacks for prepare, abort prepared, and commit prepared (of xacts already processed during prepare), so we aren't overloading the "commit" callback and don't have to create fake empty transactions to pass to the commit callback;
* Add another callback to determine whether an xact should be processed at PREPARE TRANSACTION or COMMIT PREPARED time;
* Rename the snapshot builder faux-commit stuff in the current patch so it's clearer what's going on;
* Write tests covering DDL, abort-during-decode, etc.

Some special care is needed for the callback that decides whether to process a given xact as 2PC or not. It's called before PREPARE TRANSACTION to decide whether to decode any given xact at prepare time or wait until it commits. It's called again at COMMIT PREPARED time if we crashed after we processed PREPARE TRANSACTION and advanced our confirmed_flush_lsn such that we won't re-process the PREPARE TRANSACTION again. Our restart_lsn might've advanced past it so we never even decode it, so we can't rely on seeing it at all. It has access to the xid, gid and invalidations, all of which we have at both prepare and commit time, to make its decision from. It must have the same result at prepare and commit time for any given xact. We can probably use a cache in the reorder buffer to avoid the 2nd call on commit prepared if we haven't crashed/reconnected between the two. This proposal does not provide a way to safely decode a 2pc xact that made catalog changes which may be aborted while being decoded. The plugin must lock such an xact so that it can't be aborted while being processed, or defer decoding until commit prepared. It can use the invalidations for the commit to decide. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
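One way to satisfy the "same result at prepare and commit time" requirement is to make the decision a pure function of information present at both points, e.g. the GID. A sketch; the marker string and function shape are illustrative only:

    #include <string.h>
    #include <stdbool.h>

    /* The decision may depend only on data available at both PREPARE and
     * COMMIT PREPARED time (xid, gid, invalidations).  A pure function of
     * the GID is the simplest safe choice: decode at prepare time only
     * xacts explicitly marked by the application. */
    static bool
    decode_at_prepare_time(const char *gid)
    {
        return strstr(gid, "_twophase") != NULL;   /* illustrative marker */
    }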
> On 27 Mar 2017, at 12:26, Craig Ringer <craig@2ndquadrant.com> wrote: > > On 27 March 2017 at 09:31, Craig Ringer <craig@2ndquadrant.com> wrote: > >> We're in the last week of the CF. If you have a patch that's nearly >> ready or getting there, now would be a good time to post it for help >> and input from others. >> >> I would really like to get this in, but we're running out of time. >> >> Even if you just post your snapshot management work, with the cosmetic >> changes discussed above, that would be a valuable start. > > I'm going to pick up the last patch and: I heavily underestimated the amount of changes there, but I've almost finished and will send an updated patch in several hours. > * Ensure we only add the GID to xact records for 2pc commits and aborts And only when wal_level >= logical. Done. Also the patch adds origin info to prepares and aborts. > * Add separate callbacks for prepare, abort prepared, and commit > prepared (of xacts already processed during prepare), so we aren't > overloading the "commit" callback and don't have to create fake empty > transactions to pass to the commit callback; Done. > * Add another callback to determine whether an xact should be > processed at PREPARE TRANSACTION or COMMIT PREPARED time. Also done. > * Rename the snapshot builder faux-commit stuff in the current patch > so it's clearer what's going on. Hm. Okay, I’ll leave that part to you. > * Write tests covering DDL, abort-during-decode, etc I’ve extended the tests, but it would be good to have some more. > Some special care is needed for the callback that decides whether to > process a given xact as 2PC or not. It's called before PREPARE > TRANSACTION to decide whether to decode any given xact at prepare time > or wait until it commits. It's called again at COMMIT PREPARED time if > we crashed after we processed PREPARE TRANSACTION and advanced our > confirmed_flush_lsn such that we won't re-process the PREPARE > TRANSACTION again. Our restart_lsn might've advanced past it so we > never even decode it, so we can't rely on seeing it at all. It has > access to the xid, gid and invalidations, all of which we have at both > prepare and commit time, to make its decision from. It must have the > same result at prepare and commit time for any given xact. We can > probably use a cache in the reorder buffer to avoid the 2nd call on > commit prepared if we haven't crashed/reconnected between the two. Good point. I didn’t think about restart_lsn in the case when we are skipping this particular prepare (filter_prepared() -> true, in my terms). I think that should work properly as it uses the same code path as before, but I’ll look at it. > This proposal does not provide a way to safely decode a 2pc xact that > made catalog changes which may be aborted while being decoded. The > plugin must lock such an xact so that it can't be aborted while being > processed, or defer decoding until commit prepared. It can use the > invalidations for the commit to decide. I had played with that two-pass catalog scan and it seemed to be working, but after some time I realised that it is not useful for the main case, when commit/abort is generated after the receiver side answers the prepare. Also that two-pass scan is a massive change in relcache.c and genam.c (FWIW there were no problems with the cache, but some problems with index scans and handling one-to-many queries to the catalog, e.g. a table with its fields). Finally I decided to throw it away and switched to the filter_prepare callback, passing the txn structure there to allow access to the has_catalog_changes field. Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 27 March 2017 at 17:53, Stas Kelvich <s.kelvich@postgrespro.ru> wrote: > I heavily underestimated the amount of changes there, but I've almost finished > and will send an updated patch in several hours. Oh, brilliant! Please post whatever you have before you knock off for the day anyway, even if it's just a WIP, so I can pick it up tomorrow my time and poke at its tests etc. I'm in Western Australia +0800 time, significantly ahead of you. > Done. [snip] > Also done. Great, time is short so that's fantastic. > I’ve extended the tests, but it would be good to have some more. I don't mind writing tests and I've done quite a bit with TAP now, so happy to help there. >> Some special care is needed for the callback that decides whether to >> process a given xact as 2PC or not. It's called before PREPARE >> TRANSACTION to decide whether to decode any given xact at prepare time >> or wait until it commits. It's called again at COMMIT PREPARED time if >> we crashed after we processed PREPARE TRANSACTION and advanced our >> confirmed_flush_lsn such that we won't re-process the PREPARE >> TRANSACTION again. Our restart_lsn might've advanced past it so we >> never even decode it, so we can't rely on seeing it at all. It has >> access to the xid, gid and invalidations, all of which we have at both >> prepare and commit time, to make its decision from. It must have the >> same result at prepare and commit time for any given xact. We can >> probably use a cache in the reorder buffer to avoid the 2nd call on >> commit prepared if we haven't crashed/reconnected between the two. > > Good point. I didn’t think about restart_lsn in the case when we are skipping this > particular prepare (filter_prepared() -> true, in my terms). I think that should > work properly as it uses the same code path as before, but I’ll look at it. I suspect that's going to be fragile in the face of interleaving of xacts if we crash between prepare and commit prepared. (Apologies if the below is long or disjointed, it's been a long day but I'm trying to sort my thoughts out.) Consider ("SSU" = "standby status update"):

    0/050 xid 1 BEGIN
    0/060 xid 1 INSERT ...
    0/070 xid 2 BEGIN
    0/080 xid 2 INSERT ...
    0/090 xid 3 BEGIN
    0/095 xid 3 INSERT ...
    0/100 xid 3 PREPARE TRANSACTION 'x' => sent to client [y/n]?
          SSU: confirmed_flush_lsn = 0/100, restart_lsn 0/050 (if we sent to client)
    0/200 xid 2 COMMIT => sent to client
          SSU: confirmed_flush_lsn = 0/200, restart_lsn 0/050
    0/250 xl_running_xacts logged, xids = [1,3]
    [CRASH or disconnect/reconnect]
    Restart decoding at 0/050.
          skip output of xid 3 PREPARE TRANSACTION @ 0/100: is <= confirmed_flush_lsn
          skip output of xid 2 COMMIT @ 0/200: is <= confirmed_flush_lsn
    0/300 xid 3 COMMIT PREPARED 'x' => sent to client: its LSN is > confirmed_flush_lsn

In the above, our problem is that restart_lsn is held down by some other xact, so we can't rely on it to tell us if we replayed xid 3 to the output plugin or not. We can't use confirmed_flush_lsn either, since it'll advance at xid 2's commit whether or not we replayed xid 3's prepare to the client. Since xid 3 will still be in xl_running_xacts when prepared, when we recover SnapBuildProcessChange will return true for its changes and we'll (re)buffer them, whether or not we landed up sending them to the client at prepare time. Nothing much to be done about that, we'll just discard them when we process the prepare or the commit prepared, depending on where we consult our filter callback again.
We MUST ask our filter callback again though, before we test SnapBuildXactNeedsSkip when processing the PREPARE TRANSACTION again. Otherwise we'll discard the buffered changes, and if we *didn't* send them to the client already ... splat. We can call the filter callback again on xid 3's prepare to find out "would you have replayed it when we passed it last time". Or we can call it when we get to the commit instead, to ask "when called last time at prepare, did you replay or not?" But we have to consult the callback. By default we'd just skip ReorderBufferCommit processing for xid 3 entirely, which we'll do via the SnapBuildXactNeedsSkip call in DecodeCommit when we process the COMMIT PREPARED. If there was no other running xact when we decoded the PREPARE TRANSACTION the first time around (i.e. xid 1 and 2 didn't exist in the above), and if we do send it to the client at prepare time, I think we can safely advance restart_lsn to the most recent xl_running_xacts once we get replay confirmation. So we can pretend we already committed at PREPARE TRANSACTION time for restart purposes if we output at PREPARE TRANSACTION time; it just doesn't help us with deciding whether to send the buffer contents at COMMIT PREPARED time or not. TL;DR: we can't rely on restart_lsn or confirmed_flush_lsn or xl_running_xacts; we must ask the filter callback when we (re)decode the PREPARE TRANSACTION record and/or at COMMIT PREPARED time. This isn't a big deal. We just have to make sure we consult the filter callback again when we decode an already-confirmed prepare transaction, or at commit prepared time if we don't know what its result was already. >> This proposal does not provide a way to safely decode a 2pc xact that >> made catalog changes which may be aborted while being decoded. The >> plugin must lock such an xact so that it can't be aborted while being >> processed, or defer decoding until commit prepared. It can use the >> invalidations for the commit to decide. > > I had played with that two-pass catalog scan and it seemed to be > working, but after some time I realised that it is not useful for the main > case, when commit/abort is generated after the receiver side answers the > prepare. Also that two-pass scan is a massive change in relcache.c and > genam.c (FWIW there were no problems with the cache, but some problems > with index scans and handling one-to-many queries to the catalog, e.g. a table > with its fields) Yeah, it was the intrusiveness I was concerned about. I don't think we can even remotely hope to do that for Pg 10. > Finally I decided to throw it away and switched to the filter_prepare callback, > passing the txn structure there to allow access to the has_catalog_changes field. I think that's how we'll need to go. Plugins can either defer processing on all 2pc xacts with catalog changes, or lock the xact. It's not perfect, but it's far from unreasonable when you consider that plugins would only be locking 2pc xacts where they expect the result of logical decoding to influence the commit/abort decision, so we won't be doing a commit/abort until we finish decoding the prepare anyway. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
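A sketch of what DecodeCommit could do for a COMMIT PREPARED under that rule. ReorderBufferCommit() is the existing replay entry point; the two helper names are illustrative:

    /* On COMMIT PREPARED, re-ask the filter before the usual skip test, so
     * an xact already replayed at PREPARE time is not replayed twice and a
     * deferred one is not silently dropped. */
    if (filter_says_decode_at_prepare(ctx, xid, gid))   /* illustrative */
    {
        /* body already sent at PREPARE TRANSACTION; emit only the
         * resolution so the downstream can COMMIT PREPARED its copy */
        send_commit_prepared(ctx, xid, gid);            /* illustrative */
    }
    else
    {
        /* deferred: replay the whole xact now, as for a plain commit */
        ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
                            commit_time, origin_id, origin_lsn);
    }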
> On 27 Mar 2017, at 16:29, Craig Ringer <craig@2ndquadrant.com> wrote:
>
> On 27 March 2017 at 17:53, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>
>> I heavily underestimated the amount of changes there, but I'm almost finished
>> and will send the updated patch in several hours.
>
> Oh, brilliant! Please post whatever you have before you knock off for
> the day anyway, even if it's just a WIP, so I can pick it up tomorrow
> my time and poke at its tests etc.

Ok, here it is. Major differences compared to the previous version:

* GID is stored in commit/abort records only when wal_level >= logical.

* More consistency about storing and parsing origin info. Now it is stored in prepare and abort records when a replication origin session is active.

* Some cleanup, function renames to get rid of the xact_event/gid fields in ReorderBuffer, which I used only to copy them to ReorderBufferTXN.

* Changed the output plugin interface to the one that was suggested upthread. Now prepare/commit-prepared/abort-prepared are separate callbacks, and if none of them is set then a 2PC tx will be decoded as 1PC to provide backwards compatibility.

* New callback filter_prepare() that can be used to switch between the 1PC and 2PC styles of decoding a 2PC tx.

* test_decoding uses the new API and filters out aborted and still-running prepared tx. It is actually easy to move the unlock of 2PCState to the prepare callback to allow decoding of a running tx, but since that extension is an example, ISTM it is better not to hold that lock during the whole prepare decoding. However, I left enough comments there about this, and about the case when those locks are not needed at all (when we are coordinating this tx).

Talking about locking of a running prepared tx during decode, I think a better solution would be to use our own custom lock here and register a XACT_EVENT_PRE_ABORT callback in the extension to conflict with this lock. Decode should hold it in shared mode, while commit holds it in exclusive mode. That would allow locking things granularly and blocking only the tx that is being decoded. However, we don't have XACT_EVENT_PRE_ABORT yet; it is only several LOCs to add it. Should I?

* It actually doesn't pass one of my regression tests. I've added the expected output as it should be. I'll try to send a follow-up message with a fix, but right now I'm sending it as is, as you asked.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
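For readers following along, the reshaped output-plugin interface Stas describes would look roughly like this in C. The struct below is abridged (the real one also has message and origin-filter callbacks), and the typedef names for the new members are guesses at the patch's intent, not its verbatim definitions:

/* Abridged sketch of OutputPluginCallbacks with the proposed 2PC hooks. */
typedef struct OutputPluginCallbacks
{
    LogicalDecodeStartupCB  startup_cb;
    LogicalDecodeBeginCB    begin_cb;
    LogicalDecodeChangeCB   change_cb;
    LogicalDecodeCommitCB   commit_cb;

    /*
     * New, all optional.  If the three 2PC callbacks are unset, a prepared
     * transaction is decoded as a single-phase commit, preserving the old
     * behaviour for existing plugins.
     */
    LogicalDecodeFilterPrepareCB  filter_prepare_cb;  /* 1PC or 2PC style? */
    LogicalDecodePrepareCB        prepare_cb;         /* PREPARE TRANSACTION */
    LogicalDecodeCommitPreparedCB commit_prepared_cb; /* COMMIT PREPARED */
    LogicalDecodeAbortPreparedCB  abort_prepared_cb;  /* ROLLBACK PREPARED */

    LogicalDecodeShutdownCB shutdown_cb;
} OutputPluginCallbacks;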
Hi, On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote: > Ok, here it is. On a very quick skim, this doesn't seem to solve the issues around deadlocks of prepared transactions vs. catalog tables. What if the prepared transaction contains something like LOCK pg_class; (there's a lot more realistic examples)? Then decoding won't be able to continue, until that transaction is committed / aborted? - Andres
> On 28 Mar 2017, at 00:19, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>
> * It actually doesn't pass one of my regression tests. I've added the expected output
> as it should be. I'll try to send a follow-up message with a fix, but right now I'm sending it
> as is, as you asked.

Fixed. I forgot to postpone ReorderBufferTXN cleanup in the case of prepare.

So it passes the provided regression tests right now.

I'll give it more testing tomorrow and am going to write a TAP test to check behaviour when we lose info about whether a prepare was sent to the subscriber or not.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On 28 March 2017 at 05:25, Andres Freund <andres@anarazel.de> wrote: > On a very quick skim, this doesn't seem to solve the issues around > deadlocks of prepared transactions vs. catalog tables. What if the > prepared transaction contains something like LOCK pg_class; (there's a > lot more realistic examples)? Then decoding won't be able to continue, > until that transaction is committed / aborted? Yeah, that's a problem and one we discussed in the past, though I lost track of it in amongst the recent work. I'm currently writing a few TAP tests intended to check this sort of thing, mixed DDL/DML, overlapping xacts, interleaved prepared xacts, etc. If they highlight problems they'll be useful for the next iteration of this patch anyway. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 28 March 2017 at 08:50, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>
>> On 28 Mar 2017, at 00:19, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>>
>> * It actually doesn't pass one of my regression tests. I've added the expected output
>> as it should be. I'll try to send a follow-up message with a fix, but right now I'm sending it
>> as is, as you asked.
>
> Fixed. I forgot to postpone ReorderBufferTXN cleanup in the case of prepare.
>
> So it passes the provided regression tests right now.
>
> I'll give it more testing tomorrow and am going to write a TAP test to check behaviour
> when we lose info about whether a prepare was sent to the subscriber or not.

Great, thanks. I'll try to have some TAP tests ready.

--
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
> On 28 Mar 2017, at 00:25, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
>> Ok, here it is.
>
> On a very quick skim, this doesn't seem to solve the issues around
> deadlocks of prepared transactions vs. catalog tables. What if the
> prepared transaction contains something like LOCK pg_class; (there's a
> lot more realistic examples)? Then decoding won't be able to continue,
> until that transaction is committed / aborted?

But why is that a deadlock? It seems like just a lock. In the case of a prepared lock on pg_class, decoding will wait until it is committed and then continue to decode. As will anything in postgres that accesses pg_class, including the inability to connect to the database, and bricking the database if you accidentally disconnect before committing that tx (as you showed me a while ago :-).

IMO it is an issue of being able to prepare such a lock at all, rather than of decoding.

Are there any other scenarios where catalog readers are blocked except an explicit lock on a catalog table? Alters on catalogs seem to be prohibited.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2017-03-28 04:12:41 +0300, Stas Kelvich wrote:
>
>> On 28 Mar 2017, at 00:25, Andres Freund <andres@anarazel.de> wrote:
>>
>> Hi,
>>
>> On 2017-03-28 00:19:29 +0300, Stas Kelvich wrote:
>>> Ok, here it is.
>>
>> On a very quick skim, this doesn't seem to solve the issues around
>> deadlocks of prepared transactions vs. catalog tables. What if the
>> prepared transaction contains something like LOCK pg_class; (there's a
>> lot more realistic examples)? Then decoding won't be able to continue,
>> until that transaction is committed / aborted?
>
> But why is that a deadlock? It seems like just a lock.

If you actually need separate decoding of 2PC, then you want to wait for the PREPARE to be replicated. If that replication has to wait for the to-be-replicated prepared transaction to commit prepared, and commit prepared will only happen once replication happened...

> Are there any other scenarios where catalog readers are blocked except an explicit lock
> on a catalog table? Alters on catalogs seem to be prohibited.

VACUUM FULL on catalog tables (but that can't happen in a xact => no 2PC)
CLUSTER on catalog tables (can happen in a xact)
ALTER on tables modified in the same transaction (even of non-catalog tables!), because a lot of routines will do a heap_open() to get the tupledesc etc.

Greetings,

Andres Freund
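In SQL terms, the last case Andres lists is a transaction of roughly this shape (table and GID names are made up for illustration):

BEGIN;
CREATE TABLE not_a_catalog (id int);
ALTER TABLE not_a_catalog ADD COLUMN payload text; -- AccessExclusiveLock
INSERT INTO not_a_catalog VALUES (1, 'x');
PREPARE TRANSACTION 'decode_me_if_you_can';
-- Decoding the INSERT at PREPARE time needs the post-ALTER tuple
-- descriptor of not_a_catalog while the prepared xact still holds its
-- locks; whether that actually blocks in practice is examined downthread.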
On 28 March 2017 at 02:25, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-28 04:12:41 +0300, Stas Kelvich wrote:
>>
>>> On 28 Mar 2017, at 00:25, Andres Freund <andres@anarazel.de> wrote:
>>>
>>> On a very quick skim, this doesn't seem to solve the issues around
>>> deadlocks of prepared transactions vs. catalog tables. What if the
>>> prepared transaction contains something like LOCK pg_class; (there's a
>>> lot more realistic examples)? Then decoding won't be able to continue,
>>> until that transaction is committed / aborted?
>>
>> But why is that a deadlock? It seems like just a lock.
>
> If you actually need separate decoding of 2PC, then you want to wait for
> the PREPARE to be replicated. If that replication has to wait for the
> to-be-replicated prepared transaction to commit prepared, and commit
> prepared will only happen once replication happened...

Surely that's up to the decoding plugin? If the plugin takes locks it had better make sure it can get the locks or time out. But that's true of any resource the plugin needs access to and can't obtain when needed.

This issue could occur now if the transaction took a session lock on a catalog table.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-28 03:30:28 +0100, Simon Riggs wrote:
> On 28 March 2017 at 02:25, Andres Freund <andres@anarazel.de> wrote:
>> On 2017-03-28 04:12:41 +0300, Stas Kelvich wrote:
>>>
>>>> On 28 Mar 2017, at 00:25, Andres Freund <andres@anarazel.de> wrote:
>>>>
>>>> On a very quick skim, this doesn't seem to solve the issues around
>>>> deadlocks of prepared transactions vs. catalog tables. What if the
>>>> prepared transaction contains something like LOCK pg_class; (there's a
>>>> lot more realistic examples)? Then decoding won't be able to continue,
>>>> until that transaction is committed / aborted?
>>>
>>> But why is that a deadlock? It seems like just a lock.
>>
>> If you actually need separate decoding of 2PC, then you want to wait for
>> the PREPARE to be replicated. If that replication has to wait for the
>> to-be-replicated prepared transaction to commit prepared, and commit
>> prepared will only happen once replication happened...
>
> Surely that's up to the decoding plugin?

It can't do much about it, so not really. A lot of the functions dealing with datatypes (temporarily) lock relations. Both the actual user tables, and system catalog tables (cache lookups...).

> If the plugin takes locks it had better make sure it can get the locks
> or time out. But that's true of any resource the plugin needs access to
> and can't obtain when needed.

> This issue could occur now if the transaction took a session lock on a
> catalog table.

That's not a self-deadlock, and we don't do session locks outside of operations like CIC?

Greetings,

Andres Freund
On 28 March 2017 at 09:25, Andres Freund <andres@anarazel.de> wrote:

> If you actually need separate decoding of 2PC, then you want to wait for
> the PREPARE to be replicated. If that replication has to wait for the
> to-be-replicated prepared transaction to commit prepared, and commit
> prepared will only happen once replication happened...

In other words, the output plugin cannot decode a transaction at PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on a catalog relation we must be able to read in order to decode the xact.

>> Are there any other scenarios where catalog readers are blocked except an explicit lock
>> on a catalog table? Alters on catalogs seem to be prohibited.
>
> VACUUM FULL on catalog tables (but that can't happen in a xact => no 2PC)
> CLUSTER on catalog tables (can happen in a xact)
> ALTER on tables modified in the same transaction (even of non-catalog
> tables!), because a lot of routines will do a heap_open() to get the
> tupledesc etc.

Right, and the latter one is the main issue, since it's by far the most likely and is hard to just work around.

The tests Stas has in place aren't sufficient to cover this, as they decode only after everything has committed. I'm expanding the pg_regress coverage to do decoding between prepare and commit (when we actually care) first, and will add some tests involving strong locks.

I've found one bug where it doesn't decode a 2pc xact at prepare or commit time, even without restart or strong lock issues. Pretty sure it's due to assumptions made about the filter callback. The current code as used by test_decoding won't work correctly. If txn->has_catalog_changes and it's still in progress, the filter skips decoding at PREPARE time. But it isn't then decoded at COMMIT PREPARED time either, if we processed past the PREPARE TRANSACTION. Bug.

Also, by skipping decoding of 2pc xacts with catalog changes in this test we also hide the locking issues. However, even once I add an option to force decoding of 2pc xacts with catalog changes to test_decoding, I cannot reproduce the expected locking issues so far. See tests in the attached updated version, in contrib/test_decoding/sql/prepare.sql.

Haven't done any TAP tests yet, since the pg_regress tests are so far sufficient to turn up issues.

--
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
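The coverage gap Craig mentions is easy to see in a pg_regress-style fragment like the following, which decodes while the transaction is still only prepared and then again after commit. The slot name and the twophase option string here are illustrative, not the patch's exact spelling:

BEGIN;
INSERT INTO test_prepared1 VALUES (42);
PREPARE TRANSACTION 'test_prepared_inflight';

-- decode now, while the xact is merely prepared ...
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
    NULL, NULL, 'twophase-decoding', '1');

COMMIT PREPARED 'test_prepared_inflight';

-- ... and again afterwards: the changes must be output exactly once
SELECT data FROM pg_logical_slot_get_changes('regression_slot',
    NULL, NULL, 'twophase-decoding', '1');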
On 28 March 2017 at 10:53, Craig Ringer <craig@2ndquadrant.com> wrote:

> However, even once I add an option to force decoding of 2pc xacts with
> catalog changes to test_decoding, I cannot reproduce the expected
> locking issues so far. See tests in the attached updated version, in
> contrib/test_decoding/sql/prepare.sql.

I haven't been able to create issues with CLUSTER, any ALTER TABLEs I've tried, or anything similar.

An explicit AEL on pg_attribute causes the decoding stall, but you can't do anything much else either, and I don't see how that'd arise under normal circumstances.

If it's a sufficiently obscure issue I'm willing to document it as "don't do that" or "use a command filter to prohibit that". But it's more likely that I'm just not spotting the cases where the issue arises.

Attempting to CLUSTER a system catalog like pg_class or pg_attribute causes PREPARE TRANSACTION to fail with

ERROR: cannot PREPARE a transaction that modified relation mapping

and I didn't find any catalogs I could CLUSTER that'd also block decoding.

--
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 28 March 2017 at 03:53, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 28 March 2017 at 09:25, Andres Freund <andres@anarazel.de> wrote:
>
>> If you actually need separate decoding of 2PC, then you want to wait for
>> the PREPARE to be replicated. If that replication has to wait for the
>> to-be-replicated prepared transaction to commit prepared, and commit
>> prepared will only happen once replication happened...
>
> In other words, the output plugin cannot decode a transaction at
> PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
> a catalog relation we must be able to read in order to decode the
> xact.

Yes, I understand.

The decoding plugin can choose to enable lock_timeout, or it can choose to wait for manual resolution, or it could automatically abort such a transaction to avoid needing to decode it.

I don't think it's for us to say what the plugin is allowed to do. We decided on a plugin architecture, so we have to trust that the plugin author resolves the issues. We can document them so those choices are clear.

This doesn't differ in any respect from any other resource it might need yet cannot obtain.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-28 15:32:49 +0100, Simon Riggs wrote:
> On 28 March 2017 at 03:53, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 28 March 2017 at 09:25, Andres Freund <andres@anarazel.de> wrote:
>>
>>> If you actually need separate decoding of 2PC, then you want to wait for
>>> the PREPARE to be replicated. If that replication has to wait for the
>>> to-be-replicated prepared transaction to commit prepared, and commit
>>> prepared will only happen once replication happened...
>>
>> In other words, the output plugin cannot decode a transaction at
>> PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
>> a catalog relation we must be able to read in order to decode the
>> xact.
>
> Yes, I understand.
>
> The decoding plugin can choose to enable lock_timeout, or it can
> choose to wait for manual resolution, or it could automatically abort
> such a transaction to avoid needing to decode it.

That doesn't solve the problem. You're still left with replication that can't progress. I think that's completely unacceptable. We need a proper solution to this, not throw our hands up in the air and hope that it's not going to hurt a whole lot of people.

> I don't think it's for us to say what the plugin is allowed to do. We
> decided on a plugin architecture, so we have to trust that the plugin
> author resolves the issues. We can document them so those choices are
> clear.

I don't think this is "plugin architecture" related. The output plugin can't do right here, this has to be solved at a higher level.

- Andres
On 28 March 2017 at 15:38, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-28 15:32:49 +0100, Simon Riggs wrote:
>> On 28 March 2017 at 03:53, Craig Ringer <craig@2ndquadrant.com> wrote:
>>> On 28 March 2017 at 09:25, Andres Freund <andres@anarazel.de> wrote:
>>>
>>>> If you actually need separate decoding of 2PC, then you want to wait for
>>>> the PREPARE to be replicated. If that replication has to wait for the
>>>> to-be-replicated prepared transaction to commit prepared, and commit
>>>> prepared will only happen once replication happened...
>>>
>>> In other words, the output plugin cannot decode a transaction at
>>> PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
>>> a catalog relation we must be able to read in order to decode the
>>> xact.
>>
>> Yes, I understand.
>>
>> The decoding plugin can choose to enable lock_timeout, or it can
>> choose to wait for manual resolution, or it could automatically abort
>> such a transaction to avoid needing to decode it.
>
> That doesn't solve the problem. You're still left with replication that
> can't progress. I think that's completely unacceptable. We need a
> proper solution to this, not throw our hands up in the air and hope that
> it's not going to hurt a whole lot of people.

Nobody is throwing their hands in the air, nobody is just hoping. The concern raised is real and needs to be handled somewhere; the only point of discussion is where and how.

>> I don't think it's for us to say what the plugin is allowed to do. We
>> decided on a plugin architecture, so we have to trust that the plugin
>> author resolves the issues. We can document them so those choices are
>> clear.
>
> I don't think this is "plugin architecture" related. The output plugin
> can't do right here, this has to be solved at a higher level.

That assertion is obviously false... the plugin can resolve this in various ways, if we allow it.

You can say that in your opinion you prefer to see this handled in some higher level way, though it would be good to hear why and how.

Bottom line here is we shouldn't reject this patch on this point, especially since any resource issue found during decoding could similarly prevent progress with decoding.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-28 15:55:15 +0100, Simon Riggs wrote:
> On 28 March 2017 at 15:38, Andres Freund <andres@anarazel.de> wrote:
>> On 2017-03-28 15:32:49 +0100, Simon Riggs wrote:
>>> On 28 March 2017 at 03:53, Craig Ringer <craig@2ndquadrant.com> wrote:
>>>> On 28 March 2017 at 09:25, Andres Freund <andres@anarazel.de> wrote:
>>>>
>>>>> If you actually need separate decoding of 2PC, then you want to wait for
>>>>> the PREPARE to be replicated. If that replication has to wait for the
>>>>> to-be-replicated prepared transaction to commit prepared, and commit
>>>>> prepared will only happen once replication happened...
>>>>
>>>> In other words, the output plugin cannot decode a transaction at
>>>> PREPARE TRANSACTION time if that xact holds an AccessExclusiveLock on
>>>> a catalog relation we must be able to read in order to decode the
>>>> xact.
>>>
>>> Yes, I understand.
>>>
>>> The decoding plugin can choose to enable lock_timeout, or it can
>>> choose to wait for manual resolution, or it could automatically abort
>>> such a transaction to avoid needing to decode it.
>>
>> That doesn't solve the problem. You're still left with replication that
>> can't progress. I think that's completely unacceptable. We need a
>> proper solution to this, not throw our hands up in the air and hope that
>> it's not going to hurt a whole lot of people.
>
> Nobody is throwing their hands in the air, nobody is just hoping. The
> concern raised is real and needs to be handled somewhere; the only
> point of discussion is where and how.
>
>>> I don't think it's for us to say what the plugin is allowed to do. We
>>> decided on a plugin architecture, so we have to trust that the plugin
>>> author resolves the issues. We can document them so those choices are
>>> clear.
>>
>> I don't think this is "plugin architecture" related. The output plugin
>> can't do right here, this has to be solved at a higher level.
>
> That assertion is obviously false... the plugin can resolve this in
> various ways, if we allow it.

Handling it by breaking replication isn't handling it (e.g. timeouts in decoding etc). Handling it by rolling back *prepared* transactions (which are supposed to be guaranteed to succeed!), isn't either.

> You can say that in your opinion you prefer to see this handled in
> some higher level way, though it would be good to hear why and how.

It's pretty obvious why: A bit of DDL by the user shouldn't lead to the issues mentioned above.

> Bottom line here is we shouldn't reject this patch on this point,

I think it definitely has to be rejected because of that. And I didn't bring this up at the last minute, I repeatedly brought it up before. Both to Craig and Stas.

One way to fix this would be to allow decoding to acquire such locks (i.e. locks held by the prepared xact we're decoding) - there unfortunately are some practical issues with that (e.g. the locking code doesn't necessarily expect a second non-exclusive locker, when there's an exclusive one), or we could add an exception to the locking code to simply not acquire such locks.

> especially since any resource issue found during decoding could
> similarly prevent progress with decoding.

For example?

- Andres
On 28 Mar. 2017 23:08, "Andres Freund" <andres@anarazel.de> wrote:
> >> I don't think it's for us to say what the plugin is allowed to do. We
> >> decided on a plugin architecture, so we have to trust that the plugin
> >> author resolves the issues. We can document them so those choices are
> >> clear.
> >
> > I don't think this is "plugin architecture" related. The output plugin
> > can't do right here, this has to be solved at a higher level.
>
> That assertion is obviously false... the plugin can resolve this in
> various ways, if we allow it.
> Handling it by breaking replication isn't handling it (e.g. timeouts in
> decoding etc).
IMO, if it's a rare condition and we can abort decoding then recover cleanly and succeed on retry, that's OK. Not dissimilar to the deadlock detector. But right now that's not the case, it's possible (however artificially) to create prepared xacts for which decoding will stall and not succeed.
> Handling it by rolling back *prepared* transactions
> (which are supposed to be guaranteed to succeed!), isn't either.
I agree, we can't rely on anything for which the only way to continue is to rollback a prepared xact.
> You can say that in your opinion you prefer to see this handled in
> some higher level way, though it would be good to hear why and how.
> It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
> issues mentioned above.
I agree that it shouldn't, and in fact DDL is the main part of why I want 2PC decoding.
What's surprised me is that I haven't actually been able to create any situations where, with test_decoding, we have such a failure. Not unless I manually LOCK TABLE pg_attribute, anyway.
Notably, we already disallow prepared xacts that make changes to the relfilenodemap, which covers a lot of the problem cases like CLUSTERing system tables.
> Bottom line here is we shouldn't reject this patch on this point,
> I think it definitely has to be rejected because of that. And I didn't
> bring this up at the last minute, I repeatedly brought it up before.
> Both to Craig and Stas.
Yes, and I lost track of it while focusing on the catalog tuple visibility issues. I warned Stas of this issue when he first mentioned an interest in decoding of 2PC actually, but haven't kept a proper eye on it since.
Andres and I even discussed this back in the early BDR days, it's not new and is part of why I poked Stas to try some DDL tests etc. The tests in the initial patch didn't have enough coverage to trigger any issues - they didn't actually test decoding of a 2pc xact while it was still in-progress at all. But even once I added more tests I've actually been unable to reproduce this in a realistic real world example.
Frankly I'm confused by that, since I would expect an AEL on some_table to cause decoding of some_table to get stuck. It does not.
That doesn't mean we should accept failure cases and commit something with holes in it. But it might inform our choices about how we solve those issues.
> One way to fix this would be to allow decoding to acquire such locks
> (i.e. locks held by the prepared xact we're decoding) - there
> unfortunately are some practical issues with that (e.g. the locking code
> doesn't necessarily expect a second non-exclusive locker, when there's
> an exclusive one), or we could add an exception to the locking code to
> simply not acquire such locks.
I've been meaning to see if we can use the parallel infrastructure's session leader infrastructure for this, by making the 2pc fake-proc a leader and making our decoding session inherit its locks. I haven't dug into it to see if it's even remotely practical yet, and won't be able to until early pg11.
We could proceed with the caveat that decoding plugins that use 2pc support must defer decoding of 2pc xacts containing ddl until commit prepared, or must take responsibility for ensuring (via a command filter, etc) that xacts are safe to decode and 2pc lock xacts during decoding. But we're likely to change the interface for all that when we iterate for pg11 and I'd rather not carry more BC than we have to. Also, the patch has unsolved issues with how it keeps track of whether an xact was output at prepare time or not and suppresses output at commit time.
I'm inclined to shelve the patch for Pg 10. We've only got a couple of days left, the tests are still pretty minimal. We have open issues around locking, less than totally satisfactory abort handling, and potential to skip replay of transactions for both prepare and commit prepared. It's not ready to go. However, it's definitely to the point where with a little more work it'll be practical to patch into variants of Pg until we can mainstream it in Pg 11, which is nice.
--
Craig Ringer
> On 28 Mar 2017, at 18:08, Andres Freund <andres@anarazel.de> wrote:
>
> On 2017-03-28 15:55:15 +0100, Simon Riggs wrote:
>>
>> That assertion is obviously false... the plugin can resolve this in
>> various ways, if we allow it.
>
> Handling it by breaking replication isn't handling it (e.g. timeouts in
> decoding etc). Handling it by rolling back *prepared* transactions
> (which are supposed to be guaranteed to succeed!), isn't either.
>
>> You can say that in your opinion you prefer to see this handled in
>> some higher level way, though it would be good to hear why and how.
>
> It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
> issues mentioned above.
>
>> Bottom line here is we shouldn't reject this patch on this point,
>
> I think it definitely has to be rejected because of that. And I didn't
> bring this up at the last minute, I repeatedly brought it up before.
> Both to Craig and Stas.

Okay. In order to find more realistic cases that block replication I've created the following setup:

* In the backend: the test_decoding plugin hooks xact events and utility statements and transforms each commit into a prepare, then sleeps on a latch. If the transaction contains DDL, the whole statement is pushed into WAL as a transactional message. If the DDL can not be prepared or disallows execution in a transaction block then it goes as a nontransactional logical message without the prepare/decode injection. If the transaction didn't issue any DDL and didn't write anything to WAL, then it skips 2PC too.

* After the prepare is decoded, the output plugin in the walsender unlocks the backend, allowing it to proceed with commit prepared. So in the case when decoding tries to access a blocked catalog, everything should stop.

* A small python script consumes the decoded WAL from the walsender (thanks Craig and Petr).

After some acrobatics with those hooks I've managed to run the whole regression suite in parallel mode through such a setup of test_decoding without any deadlocks. I've added two xact_events to postgres and allowed prepare of transactions that touched temp tables, since they are heavily used in the tests and create a lot of noise in the diffs.

So it boils down to 3 failed regression tests out of 177, namely:

* transactions.sql — here the commit of a tx gets stuck obtaining a SafeSnapshot(). I didn't look at what is happening there specifically, but just checked that the walsender isn't blocked. I'm going to look more closely at this.

* prepared_xacts.sql — here select prepared_xacts() sees our prepared tx. It is possible to filter them out, but obviously it works as expected.

* guc.sql — here pendingActions arrives on 'DISCARD ALL', preventing the tx from being prepared. I didn't find a way to check the presence of pendingActions outside of async.c, so I decided to leave it as is.

It seems that at least in the regression tests nothing can block two-phase logical decoding. Is that a strong enough argument for the hypothesis that the current approach doesn't create deadlocks, except via locks on the catalog, which should be disallowed anyway?

Patches attached. logical_twophase_v5 is a slightly modified version of the previous patch merged with Craig's changes. The second file is a set of patches over the previous one that implements the logic I've just described. There is a runtest.sh script that sets up postgres, runs the python logical consumer in the background, and starts the regression test.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
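The DDL-as-logical-message part of the harness Stas describes can be sketched in C using the PG 10 ProcessUtility_hook signature (the one commit 18ce3a4a introduced) and the existing LogLogicalMessage() API. Everything else about the harness, latch waits and commit-to-prepare rewriting included, is omitted here, so this is only an illustration of the hook shape, not the actual test code:

#include "postgres.h"
#include "fmgr.h"
#include "replication/message.h"
#include "tcop/utility.h"

PG_MODULE_MAGIC;

void _PG_init(void);

static ProcessUtility_hook_type prev_ProcessUtility = NULL;

static void
twophase_test_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
                             ProcessUtilityContext context,
                             ParamListInfo params, QueryEnvironment *queryEnv,
                             DestReceiver *dest, char *completionTag)
{
    /* Push the raw DDL text into WAL as a transactional logical message. */
    LogLogicalMessage("ddl", queryString, strlen(queryString) + 1, true);

    if (prev_ProcessUtility)
        prev_ProcessUtility(pstmt, queryString, context, params,
                            queryEnv, dest, completionTag);
    else
        standard_ProcessUtility(pstmt, queryString, context, params,
                                queryEnv, dest, completionTag);
}

void
_PG_init(void)
{
    prev_ProcessUtility = ProcessUtility_hook;
    ProcessUtility_hook = twophase_test_ProcessUtility;
}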
On Thu, Mar 30, 2017 at 12:55 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>
>> On 28 Mar 2017, at 18:08, Andres Freund <andres@anarazel.de> wrote:
>>
>> On 2017-03-28 15:55:15 +0100, Simon Riggs wrote:
>>>
>>> That assertion is obviously false... the plugin can resolve this in
>>> various ways, if we allow it.
>>
>> Handling it by breaking replication isn't handling it (e.g. timeouts in
>> decoding etc). Handling it by rolling back *prepared* transactions
>> (which are supposed to be guaranteed to succeed!), isn't either.
>>
>>> You can say that in your opinion you prefer to see this handled in
>>> some higher level way, though it would be good to hear why and how.
>>
>> It's pretty obvious why: A bit of DDL by the user shouldn't lead to the
>> issues mentioned above.
>>
>>> Bottom line here is we shouldn't reject this patch on this point,
>>
>> I think it definitely has to be rejected because of that. And I didn't
>> bring this up at the last minute, I repeatedly brought it up before.
>> Both to Craig and Stas.
>
> Okay. In order to find more realistic cases that block replication I've created the following setup:
>
> * In the backend: the test_decoding plugin hooks xact events and utility statements and transforms each commit into a prepare, then sleeps on a latch. If the transaction contains DDL, the whole statement is pushed into WAL as a transactional message. If the DDL can not be prepared or disallows execution in a transaction block then it goes as a nontransactional logical message without the prepare/decode injection. If the transaction didn't issue any DDL and didn't write anything to WAL, then it skips 2PC too.
>
> * After the prepare is decoded, the output plugin in the walsender unlocks the backend, allowing it to proceed with commit prepared. So in the case when decoding tries to access a blocked catalog, everything should stop.
>
> * A small python script consumes the decoded WAL from the walsender (thanks Craig and Petr).
>
> After some acrobatics with those hooks I've managed to run the whole regression suite in parallel mode through such a setup of test_decoding without any deadlocks. I've added two xact_events to postgres and allowed prepare of transactions that touched temp tables, since they are heavily used in the tests and create a lot of noise in the diffs.
>
> So it boils down to 3 failed regression tests out of 177, namely:
>
> * transactions.sql — here the commit of a tx gets stuck obtaining a SafeSnapshot(). I didn't look at what is happening there specifically, but just checked that the walsender isn't blocked. I'm going to look more closely at this.
>
> * prepared_xacts.sql — here select prepared_xacts() sees our prepared tx. It is possible to filter them out, but obviously it works as expected.
>
> * guc.sql — here pendingActions arrives on 'DISCARD ALL', preventing the tx from being prepared. I didn't find a way to check the presence of pendingActions outside of async.c, so I decided to leave it as is.
>
> It seems that at least in the regression tests nothing can block two-phase logical decoding. Is that a strong enough argument for the hypothesis that the current approach doesn't create deadlocks, except via locks on the catalog, which should be disallowed anyway?
>
> Patches attached. logical_twophase_v5 is a slightly modified version of the previous patch merged with Craig's changes. The second file is a set of patches over the previous one that implements the logic I've just described. There is a runtest.sh script that sets up postgres, runs the python logical consumer in the background, and starts the regression test.
I reviewed this patch but when I tried to build contrib/test_decoding I got the following error.

$ make
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -g -fpic -I. -I. -I../../src/include -D_GNU_SOURCE -c -o test_decoding.o test_decoding.c -MMD -MP -MF .deps/test_decoding.Po
test_decoding.c: In function '_PG_init':
test_decoding.c:126: warning: assignment from incompatible pointer type
test_decoding.c: In function 'test_decoding_process_utility':
test_decoding.c:271: warning: passing argument 5 of 'PreviousProcessUtilityHook' from incompatible pointer type
test_decoding.c:271: note: expected 'struct QueryEnvironment *' but argument is of type 'struct DestReceiver *'
test_decoding.c:271: warning: passing argument 6 of 'PreviousProcessUtilityHook' from incompatible pointer type
test_decoding.c:271: note: expected 'struct DestReceiver *' but argument is of type 'char *'
test_decoding.c:271: error: too few arguments to function 'PreviousProcessUtilityHook'
test_decoding.c:276: warning: passing argument 5 of 'standard_ProcessUtility' from incompatible pointer type
../../src/include/tcop/utility.h:38: note: expected 'struct QueryEnvironment *' but argument is of type 'struct DestReceiver *'
test_decoding.c:276: warning: passing argument 6 of 'standard_ProcessUtility' from incompatible pointer type
../../src/include/tcop/utility.h:38: note: expected 'struct DestReceiver *' but argument is of type 'char *'
test_decoding.c:276: error: too few arguments to function 'standard_ProcessUtility'
test_decoding.c: At top level:
test_decoding.c:285: warning: 'test_decoding_twophase_commit' was used with no prototype before its definition
make: *** [test_decoding.o] Error 1

---

After applying both patches the regression test 'make check' failed. I think you should update the expected/transactions.out file as well.

$ cat src/test/regress/regression.diffs
*** /home/masahiko/pgsql/source/postgresql/src/test/regress/expected/transactions.out  Mon May 2 09:16:02 2016
--- /home/masahiko/pgsql/source/postgresql/src/test/regress/results/transactions.out   Tue Apr 4 09:52:44 2017
***************
*** 43,58 ****
  -- Read-only tests
  CREATE TABLE writetest (a int);
  CREATE TEMPORARY TABLE temptest (a int);
! BEGIN;
! SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE; -- ok
! SELECT * FROM writetest; -- ok
!  a
! ---
! (0 rows)
!
! SET TRANSACTION READ WRITE; --fail
! ERROR: transaction read-write mode must be set before any query
! COMMIT;
  BEGIN;
  SET TRANSACTION READ ONLY; -- ok
  SET TRANSACTION READ WRITE; -- ok
--- 43,53 ----
  -- Read-only tests
  CREATE TABLE writetest (a int);
  CREATE TEMPORARY TABLE temptest (a int);
! -- BEGIN;
! -- SET TRANSACTION ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE; -- ok
! -- SELECT * FROM writetest; -- ok
! -- SET TRANSACTION READ WRITE; --fail
! -- COMMIT;
  BEGIN;
  SET TRANSACTION READ ONLY; -- ok
  SET TRANSACTION READ WRITE; -- ok
======================================================================

There is still some unnecessary code in the v5 patch.

---
+/* PREPARE callback */
+static void
+pg_decode_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
+                      XLogRecPtr prepare_lsn)
+{
+    TestDecodingData *data = ctx->output_plugin_private;
+    int backend_procno;
+
+    // if (data->skip_empty_xacts && !data->xact_wrote_changes)
+    //     return;
+
+    OutputPluginPrepareWrite(ctx, true);
+

Could you please update these patches?
Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
> On 4 Apr 2017, at 04:23, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I reviewed this patch but when I tried to build contrib/test_decoding I got the following error.

Thanks!

Yes, it seems that 18ce3a4a changed the ProcessUtility_hook signature. Updated.

> There is still some unnecessary code in the v5 patch.

Actually the second diff isn't intended to be part of the patch. I've just shared the way I ran the regression test suite through 2PC decoding, changing all commits to prepare/commit pairs where the commit happens only after decoding of the prepare is finished (more details in my previous message in this thread).

That is just an argument against Andres' concern that a prepared transaction is able to deadlock with the decoding process — at least there are no such cases in the regression tests.

And that concern is the main thing blocking this patch. Except for explicit catalog locks in a prepared tx nobody has yet found such cases, and it is hard to address or argue about.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Tue, Apr 4, 2017 at 7:06 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>
>> On 4 Apr 2017, at 04:23, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> I reviewed this patch but when I tried to build contrib/test_decoding I got the following error.
>
> Thanks!
>
> Yes, it seems that 18ce3a4a changed the ProcessUtility_hook signature. Updated.
>
>> There is still some unnecessary code in the v5 patch.

Thank you for updating the patch!

> Actually the second diff isn't intended to be part of the patch. I've just shared
> the way I ran the regression test suite through 2PC decoding, changing all
> commits to prepare/commit pairs where the commit happens only after decoding
> of the prepare is finished (more details in my previous message in this thread).

Understood. Sorry for the noise.

> That is just an argument against Andres' concern that a prepared transaction
> is able to deadlock with the decoding process — at least there are no such
> cases in the regression tests.
>
> And that concern is the main thing blocking this patch. Except for explicit
> catalog locks in a prepared tx nobody has yet found such cases, and it is
> hard to address or argue about.

Hmm, I also have not found such a deadlock case yet. Other than that issue, the current patch still does not pass the 'make check' test of contrib/test_decoding.

*** 154,167 ****
  (4 rows)

  :get_with2pc
!                                   data
! -------------------------------------------------------------------------
!  BEGIN
!  table public.test_prepared1: INSERT: id[integer]:5
!  table public.test_prepared1: INSERT: id[integer]:6 data[text]:'frakbar'
!  PREPARE TRANSACTION 'test_prepared#3';
!  COMMIT PREPARED 'test_prepared#3';
! (5 rows)

  -- make sure stuff still works
  INSERT INTO test_prepared1 VALUES (8);
--- 154,162 ----
  (4 rows)

  :get_with2pc
! data
! ------
! (0 rows)

  -- make sure stuff still works
  INSERT INTO test_prepared1 VALUES (8);

I guess this part is an unexpected result and should be fixed. Right?

-----

*** 215,222 ****
  -- If we try to decode it now we'll deadlock
  SET statement_timeout = '10s';
  :get_with2pc_nofilter
! -- FIXME we expect a timeout here, but it actually works...
! ERROR: statement timed out
  RESET statement_timeout;
  -- we can decode past it by skipping xacts with catalog changes
--- 210,222 ----
  -- If we try to decode it now we'll deadlock
  SET statement_timeout = '10s';
  :get_with2pc_nofilter
!                                      data
! ----------------------------------------------------------------------------
!  BEGIN
!  table public.test_prepared1: INSERT: id[integer]:10 data[text]:'othercol'
!  table public.test_prepared1: INSERT: id[integer]:11 data[text]:'othercol2'
!  PREPARE TRANSACTION 'test_prepared_lock'
! (4 rows)
  RESET statement_timeout;
  -- we can decode past it by skipping xacts with catalog changes

Probably we can ignore this part for now.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 2017-04-04 13:06:13 +0300, Stas Kelvich wrote:
> That is just an argument against Andres' concern that a prepared transaction
> is able to deadlock with the decoding process — at least there are no such
> cases in the regression tests.

There are few longer / adverse xacts in there, so that doesn't say much.

> And that concern is the main thing blocking this patch. Except for explicit
> catalog locks in a prepared tx nobody has yet found such cases, and it is
> hard to address or argue about.

I doubt that's the case. But even if it were so, it's absolutely not acceptable that a plain user can cause such deadlocks. So I don't think this argument buys you anything.

- Andres
Hi
> On 4 April 2017 at 19:13, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Other than that issue, the current patch still does not pass the 'make check'
> test of contrib/test_decoding.
Just a note about this patch. Of course time flies by and it needs rebase,
but also there are a few failing tests right now:
* one that was already mentioned by Masahiko
* one from `ddl`, where expected is:
```
SELECT slot_name, plugin, slot_type, active,
NOT catalog_xmin IS NULL AS catalog_xmin_set,
xmin IS NULL AS data_xmin_not_set,
pg_wal_lsn_diff(restart_lsn, '0/01000000') > 0 AS some_wal
FROM pg_replication_slots;
slot_name | plugin | slot_type | active | catalog_xmin_set | data_xmin_not_set | some_wal
-----------------+---------------+-----------+--------+------------------+-------------------+----------
regression_slot | test_decoding | logical | f | t | t | t
(1 row)
```
but the result is:
```
SELECT slot_name, plugin, slot_type, active,
NOT catalog_xmin IS NULL AS catalog_xmin_set,
xmin IS NULL AS data_xmin_not_set,
pg_wal_lsn_diff(restart_lsn, '0/01000000') > 0 AS some_wal
FROM pg_replication_slots;
ERROR: function pg_wal_lsn_diff(pg_lsn, unknown) does not exist
LINE 5: pg_wal_lsn_diff(restart_lsn, '0/01000000') > 0 AS some_w...
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
```
Dmitry Dolgov <9erthalion6@gmail.com> writes: > Just a note about this patch. Of course time flies by and it needs rebase, > but also there are few failing tests right now: > ERROR: function pg_wal_lsn_diff(pg_lsn, unknown) does not exist Apparently you are not testing against current HEAD. That's been there since d10c626de (a whole two days now ;-)). regards, tom lane
On 13 May 2017 at 22:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Apparently you are not testing against current HEAD. That's been there
> since d10c626de (a whole two days now ;-))
Indeed, I was working on a more than two-day old antiquity. Unfortunately, it's even more complicated
to apply this patch against the current HEAD, so I'll wait for a rebased version.
Hi,

FYI all, wanted to mention that I am working on an updated version of the latest patch that I plan to submit to a later CF.

Regards,
Nikhils

On 14 May 2017 at 04:02, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On 13 May 2017 at 22:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Apparently you are not testing against current HEAD. That's been there
>> since d10c626de (a whole two days now ;-))
>
> Indeed, I was working on a more than two-day old antiquity. Unfortunately,
> it's even more complicated to apply this patch against the current HEAD,
> so I'll wait for a rebased version.

--
Nikhil Sontakke
http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
> On 7 Sep 2017, at 18:58, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
>
> Hi,
>
> FYI all, wanted to mention that I am working on an updated version of
> the latest patch that I plan to submit to a later CF.

Cool!

So what kind of architecture do you have in mind? The same way as it was implemented before? As far as I remember there were two main issues:

* Decoding of an aborted prepared transaction.

If such a transaction modified the catalog then we can't read reliable info with our historic snapshot: since clog already has the aborted bit for our tx it will break the visibility logic. There are some ways to deal with that — by doing the catalog seq scan two times and counting the number of tuples (details upthread) or by hijacking the clog values in the historic visibility function. But ISTM it is better not to solve this issue at all =) In most cases the intended usage of decoding a 2PC transaction is to do some form of distributed commit, so naturally decoding will happen only with in-progress transactions, and the commit/abort will happen only after the tx is decoded, sent, and a response is received. So we can just have an atomic flag that prevents commit/abort of the tx currently being decoded. And we can filter interesting prepared transactions based on GID, to avoid holding this lock for ordinary 2PC.

* Possible deadlocks that Andres was talking about.

I spent some time trying to find one, but didn't find any. If locking pg_class in a prepared tx is the only example then (imho) it is better to just forbid preparing such transactions. Otherwise, if some realistic examples that can block decoding actually exist, then we probably need to reconsider the way a tx is being decoded. Anyway this part probably needs Andres' blessing.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
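A minimal C sketch of that atomic-flag idea follows. The field and function names are invented for illustration; a real patch would have to decide exactly where the flag lives and would likely use a latch rather than polling:

/* Hypothetical new field in the shared GlobalTransactionData entry,
 * protected by TwoPhaseStateLock: */
/*   bool being_decoded; */

/* Decoder side: claim the prepared xact before replaying its PREPARE. */
static void
TwoPhaseBeginDecode(GlobalTransaction gxact)
{
    LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
    gxact->being_decoded = true;
    LWLockRelease(TwoPhaseStateLock);
}

/* COMMIT/ROLLBACK PREPARED side: wait until decoding lets go. */
static void
TwoPhaseWaitForDecode(GlobalTransaction gxact)
{
    for (;;)
    {
        bool    busy;

        LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
        busy = gxact->being_decoded;
        LWLockRelease(TwoPhaseStateLock);
        if (!busy)
            break;
        pg_usleep(1000L);   /* a latch wait would be nicer in real code */
    }
}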
On 2017-09-27 14:46, Stas Kelvich wrote:
>> On 7 Sep 2017, at 18:58, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
>>
>> Hi,
>>
>> FYI all, wanted to mention that I am working on an updated version of
>> the latest patch that I plan to submit to a later CF.
>
> Cool!
>
> So what kind of architecture do you have in mind? The same way as it was implemented before? As far as I remember there were two main issues:
>
> * Decoding of an aborted prepared transaction.
>
> If such a transaction modified the catalog then we can't read reliable info with our historic snapshot: since clog already has the aborted bit for our tx it will break the visibility logic. There are some ways to deal with that — by doing the catalog seq scan two times and counting the number of tuples (details upthread) or by hijacking the clog values in the historic visibility function. But ISTM it is better not to solve this issue at all =) In most cases the intended usage of decoding a 2PC transaction is to do some form of distributed commit, so naturally decoding will happen only with in-progress transactions, and the commit/abort will happen only after the tx is decoded, sent, and a response is received. So we can just have an atomic flag that prevents commit/abort of the tx currently being decoded. And we can filter interesting prepared transactions based on GID, to avoid holding this lock for ordinary 2PC.
>
> * Possible deadlocks that Andres was talking about.
>
> I spent some time trying to find one, but didn't find any. If locking pg_class in a prepared tx is the only example then (imho) it is better to just forbid preparing such transactions. Otherwise, if some realistic examples that can block decoding actually exist, then we probably need to reconsider the way a tx is being decoded. Anyway this part probably needs Andres' blessing.

Just rebased the patch logical_twophase_v6 onto master.

Fixed small issues:
- XactLogAbortRecord wrote DBINFO twice, but it was decoded in ParseAbortRecord only once, so the second DBINFO was parsed as ORIGIN. Fixed by removing the second write of DBINFO.
- SnapBuildPrepareTxnFinish tried to remove the xid from `running` instead of `committed`, and it removed only the xid, without its subxids.
- test_decoding skipped returning "COMMIT PREPARED" and "ABORT PREPARED".

The big issue was with decoding DDL-including two-phase transactions:
- prepared.out was misleading. We could not reproduce decoding of the body of "test_prepared#3" with logical_twophase_v6.diff: it was skipped if `pg_logical_slot_get_changes` was called without `twophase-decode-with-catalog-changes` set, and only "COMMIT PREPARED test_prepared#3" was decoded. The reason is that "PREPARE TRANSACTION" is passed to `pg_filter_prepare` twice:
  - first, on the "PREPARE" itself,
  - second, on the "COMMIT PREPARED".
In v6, `pg_filter_prepare` without `with-catalog-changes` answered "true" the first time (ie it should not be decoded), and the second time (when the transaction became committed) it answered "false" (ie it should be decoded). But the second time, in DecodePrepare, `ctx->snapshot_builder->start_decoding_at` is already in the future compared to `buf->origptr` (because it is at the "COMMIT PREPARED" lsn). Therefore DecodePrepare just called ReorderBufferForget.
If `pg_filter_prepare` is called with `with-catalog-changes`, then it returns "false" both times, thus DecodePrepare decodes the transaction the first time and calls `ReorderBufferForget` the second time.

I didn't find a way to fix this gracefully. I just changed `pg_filter_prepare` to return the same answer both times: "false" if called with `with-catalog-changes` (ie we need to call DecodePrepare), and "true" otherwise. With this change, a catalog-changing two-phase transaction is decoded as a simple one-phase transaction if `pg_logical_slot_get_changes` is called without `with-catalog-changes`.

--
With regards,
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
Attachment
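In code terms, Yura's change amounts to making the filter's verdict a pure function of its inputs, so the PREPARE-time and COMMIT PREPARED-time calls cannot disagree. The signature and option-flag name below follow the spirit of the patch rather than its exact text:

/* test_decoding's filter: must answer identically on both calls. */
static bool
pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                  TransactionId xid, const char *gid)
{
    TestDecodingData *data = ctx->output_plugin_private;

    /*
     * Called once when PREPARE TRANSACTION is decoded and possibly again
     * when the matching COMMIT PREPARED is decoded.  Returning different
     * answers on the two calls leaves the xact either decoded twice or
     * not at all, so the answer depends only on a stable option, never
     * on the xact's current state.
     */
    if (data->twophase_decode_with_catalog_changes)
        return false;           /* decode it at PREPARE time (two-phase) */

    return true;                /* skip; decode as a plain one-phase commit */
}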
On 2017-10-26 22:01, Sokolov Yura wrote:
> On 2017-09-27 14:46, Stas Kelvich wrote:
>>> On 7 Sep 2017, at 18:58, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
>>>
>>> Hi,
>>>
>>> FYI all, wanted to mention that I am working on an updated version of
>>> the latest patch that I plan to submit to a later CF.
>>
>> Cool!
>>
>> So what kind of architecture do you have in mind? The same way as it was implemented before? As far as I remember there were two main issues:
>>
>> * Decoding of an aborted prepared transaction.
>>
>> If such a transaction modified the catalog then we can't read reliable info with our historic snapshot: since clog already has the aborted bit for our tx it will break the visibility logic. There are some ways to deal with that — by doing the catalog seq scan two times and counting the number of tuples (details upthread) or by hijacking the clog values in the historic visibility function. But ISTM it is better not to solve this issue at all =) In most cases the intended usage of decoding a 2PC transaction is to do some form of distributed commit, so naturally decoding will happen only with in-progress transactions, and the commit/abort will happen only after the tx is decoded, sent, and a response is received. So we can just have an atomic flag that prevents commit/abort of the tx currently being decoded. And we can filter interesting prepared transactions based on GID, to avoid holding this lock for ordinary 2PC.
>>
>> * Possible deadlocks that Andres was talking about.
>>
>> I spent some time trying to find one, but didn't find any. If locking pg_class in a prepared tx is the only example then (imho) it is better to just forbid preparing such transactions. Otherwise, if some realistic examples that can block decoding actually exist, then we probably need to reconsider the way a tx is being decoded. Anyway this part probably needs Andres' blessing.
>
> Just rebased the patch logical_twophase_v6 onto master.
>
> Fixed small issues:
> - XactLogAbortRecord wrote DBINFO twice, but it was decoded in
>   ParseAbortRecord only once, so the second DBINFO was parsed as ORIGIN.
>   Fixed by removing the second write of DBINFO.
> - SnapBuildPrepareTxnFinish tried to remove the xid from `running` instead
>   of `committed`, and it removed only the xid, without its subxids.
> - test_decoding skipped returning "COMMIT PREPARED" and "ABORT PREPARED".
>
> The big issue was with decoding DDL-including two-phase transactions:
> - prepared.out was misleading. We could not reproduce decoding of the body
>   of "test_prepared#3" with logical_twophase_v6.diff: it was skipped if
>   `pg_logical_slot_get_changes` was called without
>   `twophase-decode-with-catalog-changes` set, and only "COMMIT PREPARED
>   test_prepared#3" was decoded. The reason is that "PREPARE TRANSACTION"
>   is passed to `pg_filter_prepare` twice:
>   - first, on the "PREPARE" itself,
>   - second, on the "COMMIT PREPARED".
>   In v6, `pg_filter_prepare` without `with-catalog-changes` answered
>   "true" the first time (ie it should not be decoded), and the second
>   time (when the transaction became committed) it answered "false" (ie
>   it should be decoded). But the second time, in DecodePrepare,
>   `ctx->snapshot_builder->start_decoding_at` is already in the future
>   compared to `buf->origptr` (because it is at the "COMMIT PREPARED"
>   lsn). Therefore DecodePrepare just called ReorderBufferForget.
> If `pg_filter_prepare` is called with `with-catalog-changes`, then it
> returns "false" both times, thus DecodePrepare decodes the transaction the
> first time and calls `ReorderBufferForget` the second time.
>
> I didn't find a way to fix this gracefully. I just changed
> `pg_filter_prepare` to return the same answer both times: "false" if
> called with `with-catalog-changes` (ie we need to call DecodePrepare),
> and "true" otherwise. With this change, a catalog-changing two-phase
> transaction is decoded as a simple one-phase transaction if
> `pg_logical_slot_get_changes` is called without `with-catalog-changes`.

Small improvement compared to v7:
- twophase_gid is written with alignment padding in the XactLogCommitRecord
  and XactLogAbortRecord.

--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
Attachment
On 28 October 2017 at 03:53, Sokolov Yura <y.sokolov@postgrespro.ru> wrote:
> On 2017-10-26 22:01, Sokolov Yura wrote:
> Small improvement compared to v7:
> - twophase_gid is written with alignment padding in the XactLogCommitRecord
>   and XactLogAbortRecord.

I think Nikhils has done some significant work on this patch. Hopefully he'll be able to share it.

--
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi all,

> I think Nikhils has done some significant work on this patch.
> Hopefully he'll be able to share it.

PFA, the latest patch. This builds on top of the last patch submitted by Sokolov Yura and adds the actual logical replication interfaces to allow PREPARE or COMMIT/ROLLBACK PREPARED on a logical subscriber.

I tested with the latest PG head by setting up PUBLICATION/SUBSCRIPTION for some tables. I tried DML on these tables via 2PC and it seems to work, with subscribers honoring COMMIT|ROLLBACK PREPARED commands.

Now getting back to the two main issues that we have been discussing:

Logical decoding deadlocking/hanging due to locks on catalog tables
====================================================

When we are decoding, we do not hold long-term locks on the table. We do RelationIdGetRelation() and RelationClose(), which increment/decrement ref counts. Also, this ref count is held/released per ReorderBuffer change record. The call to RelationIdGetRelation() holds an AccessShareLock on pg_class, pg_attribute etc. while building the relation descriptor. The plugin itself can access the rel/syscache, but none of it holds a lock stronger than AccessShareLock on the catalog tables.

Even activities like ALTER user_table or CLUSTER user_table do not hold locks that will cause decoding to stall. The only issue could be with locks on the catalog objects themselves in the prepared transaction.

Now if the 2PC transaction is taking an AccessExclusiveLock on catalog objects, via "LOCK pg_class" for example, then pretty much nothing else will progress ahead in other sessions in the database till this active session COMMIT PREPAREs or aborts this 2PC transaction.

Also, in some cases like CLUSTER on catalog objects, the code explicitly denies preparation of a 2PC transaction:

postgres=# BEGIN;
postgres=# CLUSTER pg_class USING pg_class_oid_index;
postgres=# PREPARE TRANSACTION 'test_prepared_lock';
ERROR: cannot PREPARE a transaction that modified relation mapping

This makes sense because we do not want to get into a state where the DB is unable to progress meaningfully at all.

Is there any other locking scenario that we need to consider? Otherwise, are we all ok on this point being a non-issue for 2PC logical decoding?

Now on to the second issue:

2PC Logical decoding with concurrent "ABORT PREPARED" of the same
=========================================================

Before 2PC, we always decoded regular committed transaction records. Now with prepared transactions, we run the risk of running decoding when some other backend could come in and COMMIT PREPARE or ROLLBACK PREPARE simultaneously. If the other backend commits, that's not an issue at all. The issue is with a concurrent rollback of the prepared transaction. We need a way to ensure that the 2PC does not abort when we are in the midst of a change record apply activity.

One way to handle this is to ensure that we interlock the abort prepared with an ongoing logical decoding operation for a bounded period of maximum one change record apply cycle. I am outlining one solution, but am all ears for better, more elegant solutions.

* We introduce two new booleans in the TwoPhaseState GlobalTransactionData structure:

  bool beingdecoded;
  bool abortpending;

1) Before we start iterating through the change records, if it happens to be a prepared transaction, we check "abortpending" in the corresponding TwoPhaseState entry. If it's not set, then we set "beingdecoded".
If abortpending is set, we know that this transaction is going to go away and we treat it like a regular abort and do not do any decoding at all. 2) With "beingdecoded" set, we start with the first change record from the iteration, decode it and apply it. 3) Before starting decode of the next change record, we re-check if "abortpending" is set. If "abortpending" is set, we do not decode the next change record. Thus the abort is delay-bounded to a maximum of one change record decoding/apply cycle after we signal our intent to abort it. Then, we need to send ABORT (regular, not rollback prepared, since we have not sent "PREPARE" yet. We cannot send PREPARE midways because the transaction block on the whole might not be consistent) to the subscriber. We will have to add an ABORT callback in pgoutput for this. There's only a COMMIT callback as of now. The subscribers will ABORT this transaction midways due to this. We can then follow this up with a DUMMY prepared txn. E.g. "BEGIN; PREPARE TRANSACTION 'gid'"; The reasoning for the DUMMY 2PC is mentioned below in (6). 4) Keep decoding change records as long as "abortpending" is not set. 5) At end of the change set, send "PREPARE" to the subscribers and then remove the "beingdecoded" flag from the TwoPhaseState entry. We are now free to commit/rollback the prepared transaction anytime. 6) We will still decode the "ROLLBACK PREPARED" wal entry when it comes to us on the provider. This will call the abort_prepared callback on the subscriber. I have already added this in my patch. This abort_prepared callback will abort the dummy PREPARED query from step (3) above. Instead of doing this, we could actually check if the 'GID' entry exists and then call ROLLBACK PREPARED on the subscriber. But in that case we can't be sure if the GID does not exist because of a rollback-during-decode-issue on the provider or due to something else. If we are ok with not finding GIDs on the subscriber side, then am fine with removing the DUMMY prepare from step (3). 7) When the above activity is happening if another backend wants to abort the prepared transaction then it will set "abortpending". If "beingdecoded" is true, the abort prepared function will wait till it clears out by releasing the lock and re-checking in a few moments. When beingdecoded clears out (which will happen before the next change record apply in walsender when it sees "abortpending" set) , the abort prepare can go ahead as usual. Note that we will have to be careful to clear this "beingdecoded" flag even if the decoding fails or subscription is dropped or any other issues. Then this can work fine, IMO. Thoughts? Holes in the theory? Other issues? I am attaching my latest and greatest WIP patch with does not contain any of the above abort handling yet. Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment
On 23 November 2017 at 20:27, Nikhil Sontakke wrote:
>
> Is there any other locking scenario that we need to consider?
> Otherwise, are we all ok on this point being a non-issue for 2PC
> logical decoding?
>
Yeah.
I didn't find any sort of sensible situation where locking would pose
issues. Unless you're taking explicit LOCKs on catalog tables, you should
be fine.
There may be issues with CLUSTER or VACUUM FULL of non-relmapped catalog
relations I guess. Personally I put that in the "don't do that" box, but if
we really have to guard against it we could slightly expand the limits on
which txns you can PREPARE to any txn that has a strong lock on a catalog
relation.
> The issue is with a concurrent rollback of the prepared transaction.
> We need a way to ensure that
> the 2PC does not abort when we are in the midst of a change record
> apply activity.
>
The *reason* we care about this is that tuples created by aborted txns are
not considered "recently dead" by vacuum. They can be marked invalid and
removed immediately due to hint-bit setting and HOT pruning, vacuum runs,
etc.
This could create an inconsistent view of the catalogs if our prepared txn
did any DDL. For example, we might've updated a pg_class row, so we created
a new row and set xmax on the old one. Vacuum may merrily remove our new
row so there's no way we can find the correct data anymore, we'd have to
find the outdated row or no row. By my reading of HeapTupleSatisfiesMVCC
we'll see the old pg_class tuple.
Similar issues apply for pg_attribute etc etc. We might try to decode a
record according to the wrong table structure because relcache lookups
performed by the plugin will report outdated information.
The sanest option here seems to be to stop the txn from aborting while
we're decoding it, hence Nikhil's suggestions.
> * We introduce two new booleans in the TwoPhaseState
> GlobalTransactionData structure.
> bool beingdecoded;
> bool abortpending;
>
I think it's premature to rule out the simple option of doing a LockGXact
when we start decoding. Improve the error "prepared transaction with
identifier \"%s\" is busy" to report the locking pid too. It means you
cannot rollback or commit a prepared xact while it's being decoded, but for
the intended use of this patch, I think that's absolutely fine anyway.
But I like your suggestion much more than the idea of taking a LWLock on
TwoPhaseStateLock while decoding a record. Lots of LWLock churn, and
LWLocks held over arbitrary user plugin code. Not viable.
With your way we just have to take a LWLock once on TwoPhaseState when we
test abortpending and set beingdecoded. After that, during decoding, we can
do unlocked tests of abortpending, since a stale read will do nothing worse
than delay our response a little. The existing 2PC ops already take the
LWLock and can examine beingdecoded then. I expect they'd need to WaitLatch
in a loop until beingdecoded was cleared, re-acquiring the LWLock and
re-checking each time it's woken. We should probably add a field there for
a waiter proc that wants its latch set, so 2pc ops don't usually have to
poll for decoding to finish. (Unless condition variables will help us here?)
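To be explicit about the shape of that wait loop, something like this is what I have in mind (a sketch only — "beingdecoded" and "decode_waiter" are proposed fields, nothing here is committed code):

    LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
    while (gxact->beingdecoded)
    {
        /* ask the decoder to SetLatch(&proc->procLatch) when it's done */
        gxact->decode_waiter = MyProc;
        LWLockRelease(TwoPhaseStateLock);

        WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, 0,
                  PG_WAIT_EXTENSION);
        ResetLatch(MyLatch);

        LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
    }
    /* TwoPhaseStateLock is still held here; safe to finish the 2PC xact */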
However, let me make an entirely alternative suggestion. Should we add a
heavyweight lock class for 2PC xacts instead, and leverage the existing
infrastructure? We already use transaction locks widely after all. That
way, we just take some kind of share lock on the 2PC xact by xid when we
start logical decoding of the 2pc xact. ROLLBACK PREPARED and COMMIT
PREPARED would acquire the same heavyweight lock in an exclusive mode
before grabbing TwoPhaseStateLock and doing their work.
That way we get automatic cleanup when decoding procs exit, we get wakeups
for waiters, etc, all for "free".
How practical is adding a lock class?
(Frankly I've often wished I could add new heavyweight lock classes when
working on complex extensions like BDR, too, and in an ideal world we'd be
able to register lock classes for use by extensions...)
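To sketch the lock-class idea concretely, assuming we invent a new locktag type (SET_LOCKTAG_TWOPHASE_XACT is hypothetical; LockAcquire/LockRelease are the existing APIs):

    LOCKTAG     tag;

    /* decoding side: shared lock for the duration of streaming the xact */
    SET_LOCKTAG_TWOPHASE_XACT(tag, xid);        /* hypothetical macro */
    LockAcquire(&tag, ShareLock, false, false);
    /* ... decode and send the prepared xact ... */
    LockRelease(&tag, ShareLock, false);

    /* COMMIT/ROLLBACK PREPARED side: blocks until decoders are done, and
     * the lock is released automatically if either side errors out */
    LockAcquire(&tag, ExclusiveLock, false, false);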
> 3) Before starting decode of the next change record, we re-check if
> "abortpending" is set. If "abortpending"
> is set, we do not decode the next change record. Thus the abort is
> delay-bounded to a maximum of one change record decoding/apply cycle
> after we signal our intent to abort it. Then, we need to send ABORT
> (regular, not rollback prepared, since we have not sent "PREPARE" yet.
>
Just to be explicit, this means "tell the downstream the xact has aborted".
Currently logical decoding does not ever start decoding an xact until it's
committed, so it has never needed an abort callback on the output plugin
interface.
But we'll need one when we start doing speculative logical decoding of big
txns before they commit, and we'll need it for this. It's relatively
trivial.
> This abort_prepared callback will abort the dummy PREPARED query from
> step (3) above. Instead of doing this, we could actually check if the
> 'GID' entry exists and then call ROLLBACK PREPARED on the subscriber.
> But in that case we can't be sure if the GID does not exist because of
> a rollback-during-decode-issue on the provider or due to something
> else. If we are ok with not finding GIDs on the subscriber side, then
> am fine with removing the DUMMY prepare from step (3).
>
I prefer the latter approach personally, not doing the dummy 2pc xact.
Instead we can just ignore a ROLLBACK PREPARED for a txn whose gid does not
exist on the downstream. I can easily see situations where we might
manually abort a txn and wouldn't want logical decoding to get perpetually
stuck waiting to abort a gid that doesn't exist, for example.
Ignoring commit prepared for a missing xact would not be great, but I think
it's sensible enough to ignore missing GIDs for rollback prepared.
We'd need a race-free way to do that though, so I think we'd have to
extend FinishPreparedTransaction and LockGXact with some kind of missing_ok
flag. I doubt that'd be controversial.
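Something like this is all I have in mind (sketch only; today's LockGXact has no such parameter):

    /* in FinishPreparedTransaction(gid, isCommit, missing_ok) -- sketch */
    gxact = LockGXact(gid, GetUserId(), missing_ok);
    if (gxact == NULL)
    {
        Assert(missing_ok);
        /* gid already gone, e.g. the xact was aborted mid-decode; no-op */
        return;
    }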
A couple of other considerations not covered in what you wrote:
- It's really important that the hook that decides whether to decode an
xact at prepare or commit prepared time reports the same answer each and
every time, including if it's called after a prepared txn has committed. It
probably can't look at anything more than the xact's origin replica
identity, xid, and gid. This also means we need to know the gid of prepared
txns when processing their commit record, so we can tell logical decoding
whether we already sent the data to the client at prepare-transaction time,
or if we should send it at commit-prepared time instead.
- You need to flush the syscache when you finish decoding a PREPARE
TRANSACTION of an xact that made catalog changes, unless it's immediately
followed by COMMIT PREPARED of the same xid. Because xacts between the two
cannot see changes the prepared xact made to the catalogs.
- For the same reason we need to ensure that the historic snapshot used to
decode a 2PC xact that made catalog changes isn't then used for subsequent
xacts between the prepare and commit. They'd see the uncommitted catalogs
of the prepared xact.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Craig,

> I didn't find any sort of sensible situation where locking would pose issues. Unless you're taking explicit LOCKs on catalog tables, you should be fine.
>
> There may be issues with CLUSTER or VACUUM FULL of non-relmapped catalog relations I guess. Personally I put that in the "don't do that" box, but if we really have to guard against it we could slightly expand the limits on which txns you can PREPARE to any txn that has a strong lock on a catalog relation.

Well, we don't allow VACUUM FULL of regular tables in transaction blocks. I tried "CLUSTER user_table USING pkey"; it works, and it does not take strong locks on catalog tables which would halt the decoding process. ALTER TABLE already works without stalling decoding, as mentioned earlier.

>> The issue is with a concurrent rollback of the prepared transaction. We need a way to ensure that the 2PC does not abort when we are in the midst of a change record apply activity.
>
> The *reason* we care about this is that tuples created by aborted txns are not considered "recently dead" by vacuum. They can be marked invalid and removed immediately due to hint-bit setting and HOT pruning, vacuum runs, etc.
>
> This could create an inconsistent view of the catalogs if our prepared txn did any DDL. For example, we might've updated a pg_class row, so we created a new row and set xmax on the old one. Vacuum may merrily remove our new row so there's no way we can find the correct data anymore, we'd have to find the outdated row or no row. By my reading of HeapTupleSatisfiesMVCC we'll see the old pg_class tuple.
>
> Similar issues apply for pg_attribute etc etc. We might try to decode a record according to the wrong table structure because relcache lookups performed by the plugin will report outdated information.

We actually do the decoding in a PG_TRY/CATCH block, so if there are any errors we can clean those up in the CATCH block. If it's a prepared transaction then we can send an ABORT to the remote side to clean itself up.

> The sanest option here seems to be to stop the txn from aborting while we're decoding it, hence Nikhil's suggestions.

If we do the cleanup above in the CATCH block, then do we really care? I guess the issue would be in determining why we reached the CATCH block: whether it was due to a decoding error, or due to network issues, or something else.

>> * We introduce two new booleans in the TwoPhaseState GlobalTransactionData structure.
>> bool beingdecoded;
>> bool abortpending;
>
> I think it's premature to rule out the simple option of doing a LockGXact when we start decoding. Improve the error "prepared transaction with identifier \"%s\" is busy" to report the locking pid too. It means you cannot rollback or commit a prepared xact while it's being decoded, but for the intended use of this patch, I think that's absolutely fine anyway.
>
> But I like your suggestion much more than the idea of taking a LWLock on TwoPhaseStateLock while decoding a record. Lots of LWLock churn, and LWLocks held over arbitrary user plugin code. Not viable.
>
> With your way we just have to take a LWLock once on TwoPhaseState when we test abortpending and set beingdecoded. After that, during decoding, we can do unlocked tests of abortpending, since a stale read will do nothing worse than delay our response a little. The existing 2PC ops already take the LWLock and can examine beingdecoded then. I expect they'd need to WaitLatch in a loop until beingdecoded was cleared, re-acquiring the LWLock and re-checking each time it's woken. We should probably add a field there for a waiter proc that wants its latch set, so 2pc ops don't usually have to poll for decoding to finish. (Unless condition variables will help us here?)

Yes, WaitLatch could do the job here.

> However, let me make an entirely alternative suggestion. Should we add a heavyweight lock class for 2PC xacts instead, and leverage the existing infrastructure? We already use transaction locks widely after all. That way, we just take some kind of share lock on the 2PC xact by xid when we start logical decoding of the 2pc xact. ROLLBACK PREPARED and COMMIT PREPARED would acquire the same heavyweight lock in an exclusive mode before grabbing TwoPhaseStateLock and doing their work.
>
> That way we get automatic cleanup when decoding procs exit, we get wakeups for waiters, etc, all for "free".
>
> How practical is adding a lock class?

I am open to suggestions. This looks like it could work decently.

> Just to be explicit, this means "tell the downstream the xact has aborted". Currently logical decoding does not ever start decoding an xact until it's committed, so it has never needed an abort callback on the output plugin interface.
>
> But we'll need one when we start doing speculative logical decoding of big txns before they commit, and we'll need it for this. It's relatively trivial.

Yes, it will be a standard wrapper call to implement on both the send and apply sides.

>> This abort_prepared callback will abort the dummy PREPARED query from step (3) above. Instead of doing this, we could actually check if the 'GID' entry exists and then call ROLLBACK PREPARED on the subscriber. But in that case we can't be sure if the GID does not exist because of a rollback-during-decode-issue on the provider or due to something else. If we are ok with not finding GIDs on the subscriber side, then am fine with removing the DUMMY prepare from step (3).
>
> I prefer the latter approach personally, not doing the dummy 2pc xact. Instead we can just ignore a ROLLBACK PREPARED for a txn whose gid does not exist on the downstream. I can easily see situations where we might manually abort a txn and wouldn't want logical decoding to get perpetually stuck waiting to abort a gid that doesn't exist, for example.
>
> Ignoring commit prepared for a missing xact would not be great, but I think it's sensible enough to ignore missing GIDs for rollback prepared.

Yes, that makes sense in the case of ROLLBACK. If we find a missing GID for a COMMIT PREPARED we are in for some trouble.

> We'd need a race-free way to do that though, so I think we'd have to extend FinishPreparedTransaction and LockGXact with some kind of missing_ok flag. I doubt that'd be controversial.

Sure.

> A couple of other considerations not covered in what you wrote:
>
> - It's really important that the hook that decides whether to decode an xact at prepare or commit prepared time reports the same answer each and every time, including if it's called after a prepared txn has committed. It probably can't look at anything more than the xact's origin replica identity, xid, and gid. This also means we need to know the gid of prepared txns when processing their commit record, so we can tell logical decoding whether we already sent the data to the client at prepare-transaction time, or if we should send it at commit-prepared time instead.

We already have a filter_prepare_cb hook in place for this. TBH, I don't think this patch needs to worry about the internals of that hook. Whatever it returns, if it returns the same value every time, then we should be good from the logical decoding perspective. I think that if we encode the logic in the GID itself, then it will be consistent every time. For example, if the hook sees a GID with the prefix '_$Logical_', then it knows it has to PREPARE it. Others can be decoded at commit time.

> - You need to flush the syscache when you finish decoding a PREPARE TRANSACTION of an xact that made catalog changes, unless it's immediately followed by COMMIT PREPARED of the same xid. Because xacts between the two cannot see changes the prepared xact made to the catalogs.
>
> - For the same reason we need to ensure that the historic snapshot used to decode a 2PC xact that made catalog changes isn't then used for subsequent xacts between the prepare and commit. They'd see the uncommitted catalogs of the prepared xact.

Yes, we will do TeardownHistoricSnapshot and a syscache flush as part of the cleanup for 2PC transactions.

Regards,
Nikhils

> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
On 24 November 2017 at 13:44, Nikhil Sontakke wrote:
> > This could create an inconsistent view of the catalogs if our prepared txn did any DDL. For example, we might've updated a pg_class row, so we created a new row and set xmax on the old one. Vacuum may merrily remove our new row so there's no way we can find the correct data anymore, we'd have to find the outdated row or no row. By my reading of HeapTupleSatisfiesMVCC we'll see the old pg_class tuple.
> >
> > Similar issues apply for pg_attribute etc etc. We might try to decode a record according to the wrong table structure because relcache lookups performed by the plugin will report outdated information.
>
> We actually do the decoding in a PG_TRY/CATCH block, so if there are any errors we can clean those up in the CATCH block. If it's a prepared transaction then we can send an ABORT to the remote side to clean itself up.
Yeah. I suspect it might not always ERROR gracefully though.
> > How practical is adding a lock class?
>
> Am open to suggestions. This looks like it could work decently.
It looks amazingly simple from here. Which probably means there's more to
it that I haven't seen yet. I could use advice from someone who knows the
locking subsystem better.
> Yes, that makes sense in case of ROLLBACK. If we find a missing GID
> for a COMMIT PREPARE we are in for some trouble.
>
I agree. But it's really down to the apply worker / plugin to set policy
there, I think. It's not the 2PC decoding support's problem.
I'd argue that a plugin that wishes to strictly track and match 2PC aborts
with the subsequent ROLLBACK PREPARED could do so by recording the abort
locally. It need not rely on faked-up 2pc xacts from the output plugin.
Though it might choose to create them on the downstream as its method of
tracking aborts.
In other words, we don't need the logical decoding infrastructure's help
here. It doesn't have to fake up 2PC xacts for us. Output plugins/apply
workers that want to can do it themselves, and those that don't can ignore
rollback prepared for non-matched GIDs instead.
> > We'd need a race-free way to do that though, so I think we'd have to extend FinishPreparedTransaction and LockGXact with some kind of missing_ok flag. I doubt that'd be controversial.
>
> Sure.
I reckon that should be in-scope for this patch, and pretty clearly useful.
Also simple.
> > - It's really important that the hook that decides whether to decode an xact at prepare or commit prepared time reports the same answer each and every time, including if it's called after a prepared txn has committed. It probably can't look at anything more than the xact's origin replica identity, xid, and gid. This also means we need to know the gid of prepared txns when processing their commit record, so we can tell logical decoding whether we already sent the data to the client at prepare-transaction time, or if we should send it at commit-prepared time instead.
>
> We already have a filter_prepare_cb hook in place for this. TBH, I don't think this patch needs to worry about the internals of that hook. Whatever it returns, if it returns the same value every time then we should be good from the logical decoding perspective.
I agree. I meant that it should try to pass only info that's accessible at
both PREPARE TRANSACTION and COMMIT PREPARED time, and we should document
the importance of returning a consistent result. In particular, it's always
wrong to examine the current twophase state when deciding what to return.
> I think, if we encode the logic in the GID itself, then it will be
> good and consistent everytime. For example, if the hook sees a GID
> with the prefix '_$Logical_', then it knows it has to PREPARE it.
> Others can be decoded at commit time.
>
Yep. We can also safely tell the hook:
* the xid
* whether the xact has made catalog changes (since we know that at prepare
and commit time)
but probably not much else.
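For illustration, a hook that qualifies could be as simple as this (plugin-side sketch; both the exact callback signature and the GID prefix convention are assumptions here, not something the patch mandates):

    /* deterministic: the answer depends only on the gid, so PREPARE and
     * COMMIT PREPARED always get the same result */
    static bool
    my_filter_prepare(LogicalDecodingContext *ctx, TransactionId xid,
                      const char *gid)
    {
        /* decode at prepare time only GIDs marked for logical 2PC;
         * returning true means "filter out", i.e. decode at commit */
        return strncmp(gid, "_$Logical_", 10) != 0;
    }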
> > - You need to flush the syscache when you finish decoding a PREPARE TRANSACTION of an xact that made catalog changes, unless it's immediately followed by COMMIT PREPARED of the same xid. Because xacts between the two cannot see changes the prepared xact made to the catalogs.
> >
> > - For the same reason we need to ensure that the historic snapshot used to decode a 2PC xact that made catalog changes isn't then used for subsequent xacts between the prepare and commit. They'd see the uncommitted catalogs of the prepared xact.
> >
>
> Yes, we will do TeardownHistoricSnapshot and syscache flush as part of
> the cleanup for 2PC transactions.
>
Great.
Thanks.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Nov 24, 2017 at 3:41 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> It looks amazingly simple from here. Which probably means there's more to it that I haven't seen yet. I could use advice from someone who knows the locking subsystem better.

I think the status of this patch is not correct. It is marked as waiting on author, but Nikhil has shown up and has written an updated patch. So I am moving it to the next CF with "needs review".
--
Michael
Hi,

On 24/11/17 07:41, Craig Ringer wrote:
> On 24 November 2017 at 13:44, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
>>> How practical is adding a lock class?
>>
>> Am open to suggestions. This looks like it could work decently.
>
> It looks amazingly simple from here. Which probably means there's more to it that I haven't seen yet. I could use advice from someone who knows the locking subsystem better.

Hmm, I don't like the interaction that would have with ROLLBACK, meaning that ROLLBACK has to wait for decoding to finish, which may take longer than the transaction took itself (given potential network calls, it's practically unbounded time).

I also think that if we'll want to add streaming of transactions in the future, we'll face a similar problem, and the locking approach will not work there as the transaction may still be locked by the owning backend while we are decoding it.

From my perspective this patch changes the assumption in HeapTupleSatisfiesVacuum() that changes done by an aborted transaction can't be seen by anybody else. That's clearly not true here, as the decoding can see them. So perhaps a better approach would be to not return HEAPTUPLE_DEAD if the transaction id is newer than the OldestXmin (the same logic we use for deleted tuples of committed transactions) in HeapTupleSatisfiesVacuum(), even for aborted transactions. I also briefly checked HOT pruning, and AFAICS the normal HOT pruning (the one not called by vacuum) also uses the xmin as authoritative even for aborted txes, so nothing needs to be done there probably.

In case we are worried that this affects cleanups of, for example, large aborted COPY transactions, and we think it's worth worrying about, then we could limit the new OldestXmin-based logic to catalog tuples only, as those are the only ones we need available in decoding.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 30 November 2017 at 07:40, Petr Jelinek wrote:
> Hi,
>
> On 24/11/17 07:41, Craig Ringer wrote:
> > On 24 November 2017 at 13:44, Nikhil Sontakke wrote:
> >
> > > How practical is adding a lock class?
> >
> > Am open to suggestions. This looks like it could work decently.
> >
> >
> > It looks amazingly simple from here. Which probably means there's more
> > to it that I haven't seen yet. I could use advice from someone who knows
> > the locking subsystem better.
> >
>
> Hmm, I don't like the interaction that would have with ROLLBACK, meaning
> that ROLLBACK has to wait for decoding to finish which may take longer
> than the transaction took itself (given potential network calls, it's
> practically unbounded time).
>
Yeah. We could check for waiters before we do the network I/O and release +
bail out. But once we enter the network call we're committed and it could
take a long time.
I don't find that particularly troubling for 2PC, but it's an obvious
nonstarter if we want to use the same mechanism for streaming normal xacts
out before commit.
Even for 2PC, if we have >1 downstream then once one reports an ERROR on
PREPARE TRANSACTION, there's probably no point continuing to stream the 2PC
xact out to other peers. So being able to abort the txn while it's being
decoded, causing decoding to bail out, is desirable there too.
> I also think that if we'll want to add streaming of transactions in the
> future, we'll face similar problem and the locking approach will not
> work there as the transaction may still be locked by the owning backend
> while we are decoding it.
>
Agreed. For that reason I agree that we need to look further afield than
locking-based solutions.
> From my perspective this patch changes the assumption in
> HeapTupleSatisfiesVacuum() that changes done by aborted transaction
> can't be seen by anybody else. That's clearly not true here as the
> decoding can see it.
Yes, *if* we don't use some locking-like approach to stop abort from
occurring while decoding is processing an xact.
> So perhaps better approach would be to not return
> HEAPTUPLE_DEAD if the transaction id is newer than the OldestXmin (same
> logic we use for deleted tuples of committed transactions) in the
> HeapTupleSatisfiesVacuum() even for aborted transactions. I also briefly
> checked HOT pruning and AFAICS the normal HOT pruning (the one not
> called by vacuum) also uses the xmin as authoritative even for aborted
> txes so nothing needs to be done there probably.
>
> In case we are worried that this affects cleanups of for example large
> aborted COPY transactions and we think it's worth worrying about then we
> could limit the new OldestXmin based logic only to catalog tuples as
> those are the only ones we need available in decoding.
Yeah, if it's limited to catalog tuples only then that sounds good. I was
quite concerned about how it'd impact vacuuming otherwise, but if limited
to catalogs about the only impact should be on workloads that create lots
of TEMPORARY tables then ROLLBACK - and not much on those.
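To spell out the catalog-only variant (a sketch of the idea only, with a crude OID range check standing in for a proper catalog test; the real patch may well structure this differently):

    /* in HeapTupleSatisfiesVacuum, for a tuple whose inserting xact
     * aborted; xmin and OldestXmin assumed already in scope */
    if (TransactionIdDidAbort(xmin))
    {
        if (htup->t_tableOid < (Oid) FirstNormalObjectId &&
            !TransactionIdPrecedes(xmin, OldestXmin))
            return HEAPTUPLE_RECENTLY_DEAD;   /* a decoder may still need it */
        return HEAPTUPLE_DEAD;                /* reclaimable immediately */
    }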
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

>> So perhaps a better approach would be to not return HEAPTUPLE_DEAD if the transaction id is newer than the OldestXmin (the same logic we use for deleted tuples of committed transactions) in HeapTupleSatisfiesVacuum(), even for aborted transactions. I also briefly checked HOT pruning, and AFAICS the normal HOT pruning (the one not called by vacuum) also uses the xmin as authoritative even for aborted txes, so nothing needs to be done there probably.
>>
>> In case we are worried that this affects cleanups of, for example, large aborted COPY transactions, and we think it's worth worrying about, then we could limit the new OldestXmin-based logic to catalog tuples only, as those are the only ones we need available in decoding.
>
> Yeah, if it's limited to catalog tuples only then that sounds good. I was quite concerned about how it'd impact vacuuming otherwise, but if limited to catalogs about the only impact should be on workloads that create lots of TEMPORARY tables then ROLLBACK - and not much on those.

Based on these discussions, I think there are two separate issues here:

1) Make HeapTupleSatisfiesVacuum() behave differently for recently aborted catalog tuples.

2) Invent a mechanism to stop a specific logical decoding activity in the middle. The reason to stop it could be a concurrent abort, a global transaction manager deciding to roll back, or any other reason.

ISTM that for (2), if (1) is able to leave the recently aborted tuples around for a little while (we only really need them while the decode of the current change record is ongoing), then we could accomplish it via a callback. This callback should be called before commencing decode and network send of each change record. In the case of in-core logical decoding, the callback for pgoutput could check for the transaction having aborted (a call to TransactionIdDidAbort() or similar such functions); additional logic can be added as needed for various scenarios. If it's aborted, we will abandon decoding and send an ABORT to the subscribers before returning.

Regards,
Nikhils
PFA, latest patch for this functionality. This patch contains the following changes as compared to the earlier patch:

- Fixed a bunch of typos and comments

- Modified HeapTupleSatisfiesVacuum to return HEAPTUPLE_RECENTLY_DEAD if the transaction id is newer than OldestXmin. Doing this only for CATALOG tables (htup->t_tableOid < (Oid) FirstNormalObjectId).

- Added a filter callback filter_decode_txn_cb_wrapper() to decide if it's ok to decode the NEXT change record. This filter as of now checks if the XID that is involved got aborted. Additional checks can be added here as needed.

- Added an ABORT callback in the decoding process. This was not needed before because we always used to decode committed transactions. With 2PC transactions, it is possible that while we are decoding one, another backend issues a concurrent ROLLBACK PREPARED. So when filter_decode_txn_cb_wrapper() gets called, it will tell us not to decode the next change record. In that case we need to send an ABORT to the subscriber (and not ROLLBACK PREPARED, because we are yet to issue PREPARE to the subscriber).

- Added all functionality to read the abort command and apply it on the remote subscriber as needed.

- Added functionality in ReorderBufferCommit() to abort midway based on the feedback from filter_decode_txn_cb_wrapper().

- Modified LockGXact() and FinishPreparedTransaction() to allow a missing GID in the case of "ROLLBACK PREPARED". Currently, this will only happen in the logical apply code path. We still send it to the subscriber because it's difficult to identify on the provider whether this transaction was aborted midway during decoding or whether it's in PREPARED state on the subscriber. It will error out as before in all other cases.

- Totally removed snapshot addition/deletion code while doing the decoding. That's not needed at all while decoding an ongoing transaction. The entries in the snapshot are needed for future transactions to be able to decode older transactions. For 2PC transactions, we don't need those till COMMIT PREPARED gets called. This has simplified all that unwanted snapshot push/pop code, which is nice.

Regards,
Nikhils

On 30 November 2017 at 16:08, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
> Hi,
>
>>> So perhaps a better approach would be to not return HEAPTUPLE_DEAD if the transaction id is newer than the OldestXmin (the same logic we use for deleted tuples of committed transactions) in HeapTupleSatisfiesVacuum(), even for aborted transactions. I also briefly checked HOT pruning, and AFAICS the normal HOT pruning (the one not called by vacuum) also uses the xmin as authoritative even for aborted txes, so nothing needs to be done there probably.
>>>
>>> In case we are worried that this affects cleanups of, for example, large aborted COPY transactions, and we think it's worth worrying about, then we could limit the new OldestXmin-based logic to catalog tuples only, as those are the only ones we need available in decoding.
>>
>> Yeah, if it's limited to catalog tuples only then that sounds good. I was quite concerned about how it'd impact vacuuming otherwise, but if limited to catalogs about the only impact should be on workloads that create lots of TEMPORARY tables then ROLLBACK - and not much on those.
>
> Based on these discussions, I think there are two separate issues here:
>
> 1) Make HeapTupleSatisfiesVacuum() behave differently for recently aborted catalog tuples.
>
> 2) Invent a mechanism to stop a specific logical decoding activity in the middle. The reason to stop it could be a concurrent abort, a global transaction manager deciding to roll back, or any other reason.
>
> ISTM that for (2), if (1) is able to leave the recently aborted tuples around for a little while (we only really need them while the decode of the current change record is ongoing), then we could accomplish it via a callback. This callback should be called before commencing decode and network send of each change record. In the case of in-core logical decoding, the callback for pgoutput could check for the transaction having aborted (a call to TransactionIdDidAbort() or similar such functions); additional logic can be added as needed for various scenarios. If it's aborted, we will abandon decoding and send an ABORT to the subscribers before returning.
>
> Regards,
> Nikhils

--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment
On 4 December 2017 at 23:15, Nikhil Sontakke wrote:
> PFA, latest patch for this functionality.
> This patch contains the following changes as compared to the earlier patch:
>
> - Fixed a bunch of typos and comments
>
> - Modified HeapTupleSatisfiesVacuum to return HEAPTUPLE_RECENTLY_DEAD
> if the transaction id is newer than OldestXmin. Doing this only for
> CATALOG tables (htup->t_tableOid < (Oid) FirstNormalObjectId).
>
Because logical decoding supports user-catalog relations, we need to use
the same sort of logic that GetOldestXmin uses instead of a simple
oid-range check. See RelationIsAccessibleInLogicalDecoding() and the
user_catalog_table reloption.
Otherwise pseudo-catalogs used by logical decoding output plugins could
still suffer issues with needed tuples getting vacuumed, though only if the
txn being decoded made changes to those tables then ROLLBACKed. It's a
pretty tiny corner case for decoding of 2pc but a bigger one when we're
addressing streaming decoding.
Otherwise I'm really, really happy with how this is progressing and want to
find time to play with it.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
>> - Modified HeapTupleSatisfiesVacuum to return HEAPTUPLE_RECENTLY_DEAD
>> if the transaction id is newer than OldestXmin. Doing this only for
>> CATALOG tables (htup->t_tableOid < (Oid) FirstNormalObjectId).
>
> Because logical decoding supports user-catalog relations, we need to use the same sort of logic that GetOldestXmin uses instead of a simple oid-range check. See RelationIsAccessibleInLogicalDecoding() and the user_catalog_table reloption.

Unfortunately, HeapTupleSatisfiesVacuum does not have the Relation structure handily available to allow for these checks.

> Otherwise pseudo-catalogs used by logical decoding output plugins could still suffer issues with needed tuples getting vacuumed, though only if the txn being decoded made changes to those tables then ROLLBACKed. It's a pretty tiny corner case for decoding of 2pc but a bigger one when we're addressing streaming decoding.

We disallow rewrites on user_catalog_tables, so they cannot change underneath. Yes, DML can be carried out on them inside a 2PC transaction which then gets ROLLBACK'ed. But if it's getting aborted, then we are not interested in that data anyway. Also, now that we have the filter_decode_txn_cb_wrapper() function, we will stop decoding by the next change record cycle because of the abort.

So I am not sure if we need to track user_catalog_tables in HeapTupleSatisfiesVacuum explicitly.

> Otherwise I'm really, really happy with how this is progressing and want to find time to play with it.

Yeah, I will do some more testing and add a few more test cases to the test_decoding plugin. It might be handy to have a DELAY of a few seconds after every change record processing, for example. That way, we can have a TAP test which can do a few WAL activities, and then we introduce a concurrent rollback midway from another session in the middle of that delayed processing. I have done debugger-based testing of this concurrent rollback functionality as of now.

Another test (actually, functionality) that might come in handy is a way for DDL to be actually carried out on the subscriber. We will need something like pglogical.replicate_ddl_command to be added to core for this to work. We can add this functionality as a follow-on separate patch after discussing how we want to implement it in core.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
On 5 December 2017 at 16:00, Nikhil Sontakke wrote:
>
> We disallow rewrites on user_catalog_tables, so they cannot change
> underneath. Yes, DML can be carried out on them inside a 2PC
> transaction which then gets ROLLBACK'ed. But if it's getting aborted,
> then we are not interested in that data anyways. Also, now that we
> have the "filter_decode_txn_cb_wrapper()" function, we will stop
> decoding by the next change record cycle because of the abort.
>
> So, I am not sure if we need to track user_catalog_tables in
> HeapTupleSatisfiesVacuum explicitly.
>
I guess it's down to whether, when we're decoding a txn that just got
concurrently aborted, the output plugin might do anything with its user
catalogs that could cause a crash.
Output plugins are most likely to be using the genam (or even SPI, I
guess?) to read user-catalogs during logical decoding. Logical decoding
itself does not rely on the correctness of user catalogs in any way, it's
only a concern for output plugin callbacks.
It may make sense to kick this one down the road at this point, I can't
conclusively see where it'd cause an actual problem.
>
> > Otherwise I'm really, really happy with how this is progressing and want
> to
> > find time to play with it.
>
> Yeah, I will do some more testing and add a few more test cases in the
> test_decoding plugin. It might be handy to have a DELAY of a few
> seconds after every change record processing, for example. That way,
> we can have a TAP test which can do a few WAL activities and then we
> introduce a concurrent rollback midway from another session in the
> middle of that delayed processing. I have done debugger based testing
> of this concurrent rollback functionality as of now.
>
>
Sounds good.
> Another test (actually, functionality) that might come in handy, is to
> have a way for DDL to be actually carried out on the subscriber. We
> will need something like pglogical.replicate_ddl_command to be added
> to the core for this to work. We can add this functionality as a
> follow-on separate patch after discussing how we want to implement
> that in core.
Yeah, definitely a different patch, but assuredly valuable.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 12/4/17 10:15, Nikhil Sontakke wrote:
> PFA, latest patch for this functionality.

This probably needs documentation updates for the logical decoding chapter.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 12/7/17 08:31, Peter Eisentraut wrote:
> On 12/4/17 10:15, Nikhil Sontakke wrote:
>> PFA, latest patch for this functionality.
>
> This probably needs documentation updates for the logical decoding chapter.

You need the attached patch to be able to compile without warnings.

Also, the regression tests crash randomly for me at

frame #4: 0x000000010a6febdb postgres`heap_prune_record_prunable(prstate=0x00007ffee5578990, xid=0) at pruneheap.c:625
   622      * This should exactly match the PageSetPrunable macro.  We can't store
   623      * directly into the page header yet, so we update working state.
   624      */
-> 625      Assert(TransactionIdIsNormal(xid));
   626      if (!TransactionIdIsValid(prstate->new_prune_xid) ||
   627          TransactionIdPrecedes(xid, prstate->new_prune_xid))
   628          prstate->new_prune_xid = xid;

Did you build with --enable-cassert?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Hi,

Thanks for the warning fix; I will also look at the cassert case soon.

I have been adding more test cases to this patch. I added a TAP test which now allows us to do a concurrent ROLLBACK PREPARED while the walsender is in the midst of decoding this very prepared transaction. I have added a "decode-delay" parameter to test_decoding via which each apply call sleeps for a configurable number of seconds, allowing us to have a deterministic rollback in parallel. This logic seems to work ok.

However, I am battling an issue with invalidations now. Consider the below test case:

CREATE TABLE test_prepared1(id integer primary key);

-- test prepared xact containing ddl
BEGIN;
INSERT INTO test_prepared1 VALUES (5);
ALTER TABLE test_prepared1 ADD COLUMN data text;
INSERT INTO test_prepared1 VALUES (6, 'frakbar');
PREPARE TRANSACTION 'test_prepared#3';
COMMIT PREPARED 'test_prepared#3';
SELECT data FROM pg_logical_slot_get_changes(..) <-- this shows the 2PC being decoded appropriately

-- make sure stuff still works
INSERT INTO test_prepared1 VALUES (8);
SELECT data FROM pg_logical_slot_get_changes(..)

The last pg_logical_slot_get_changes call shows:

table public.test_prepared1: INSERT: id[integer]:8

whereas, since the 2PC committed, it should have shown:

table public.test_prepared1: INSERT: id[integer]:8 data[text]:null

This is an issue because of the way we are handling invalidations. We don't allow ReorderBufferAddInvalidations() at COMMIT PREPARED time since we assume that handling them at PREPARE time is enough. Apparently, it's not enough. I am trying to allow invalidations at COMMIT PREPARED time as well, but maybe calling ReorderBufferAddInvalidations() blindly again is not a good idea. Also, if I do that, then I am getting some restart_lsn inconsistencies which cause subsequent pg_logical_slot_get_changes() calls to re-decode older records. I continue to investigate.

I am attaching the latest WIP patch. This contains the additional TAP test changes.

Regards,
Nikhils

On 8 December 2017 at 01:15, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 12/7/17 08:31, Peter Eisentraut wrote:
>> On 12/4/17 10:15, Nikhil Sontakke wrote:
>>> PFA, latest patch for this functionality.
>>
>> This probably needs documentation updates for the logical decoding chapter.
>
> You need the attached patch to be able to compile without warnings.
>
> Also, the regression tests crash randomly for me at
>
> frame #4: 0x000000010a6febdb postgres`heap_prune_record_prunable(prstate=0x00007ffee5578990, xid=0) at pruneheap.c:625
>    622      * This should exactly match the PageSetPrunable macro.  We can't store
>    623      * directly into the page header yet, so we update working state.
>    624      */
> -> 625      Assert(TransactionIdIsNormal(xid));
>    626      if (!TransactionIdIsValid(prstate->new_prune_xid) ||
>    627          TransactionIdPrecedes(xid, prstate->new_prune_xid))
>    628          prstate->new_prune_xid = xid;
>
> Did you build with --enable-cassert?
>
> --
> Peter Eisentraut http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment
On 12 December 2017 at 12:04, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
> This is an issue because of the way we are handling invalidations. We
> don't allow ReorderBufferAddInvalidations() at COMMIT PREPARED time
> since we assume that handling them at PREPARE time is enough.
> Apparently, it's not enough.

Not sure what that means. I think we would need to fire invalidations at COMMIT PREPARED, yet logically decode them at PREPARE.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> I think we would need to fire invalidations at COMMIT PREPARED, yet
> logically decode them at PREPARE.

Yes, we need invalidations to logically decode at PREPARE, and then we need invalidations to be executed at COMMIT PREPARED time as well.

DecodeCommit() needs to know, when it's processing a COMMIT PREPARED, whether this transaction was decoded at PREPARE time. The main issue is that we cannot expect the ReorderBufferTXN structure which was created at PREPARE time to be around till COMMIT PREPARED gets called. The earlier patch was not cleaning this structure up at PREPARE and was adding an is_prepared flag to it so that COMMIT PREPARED knew that it was decoded at PREPARE time. This structure can very well be gone, for example, if we restart between PREPARE and COMMIT PREPARED.

So now it's the onus of the prepare filter callback to always give us the answer as to whether a given transaction was decoded at PREPARE time or not. We now hand over the ReorderBufferTXN structure (it can be NULL), xid and gid, and the prepare filter tells us what to do. Always. The is_prepared flag can be cached in the txn structure to aid in re-lookups, but if it's not set, the filter could do an xid lookup, gid inspection and other shenanigans to give us the same answer on every invocation. (A rough sketch of the resulting DecodeCommit() logic follows at the end of this mail.)

Because of the above, we can very well clean up the ReorderBufferTXN at PREPARE time, and it need not hang around till COMMIT PREPARED gets called, which is a good win in terms of resource management.

My test cases pass (including the scenario described earlier) with the above code changes in place. I have also added crash-testing-related TAP test cases; they uncovered a bug in the prepare redo restart code path, which I fixed. I believe this patch is in a very stable state now. Multiple runs of the crash TAP test pass without issues. Multiple runs of "make check-world" with cassert enabled also pass without issues.

Note that this patch does not contain the HeapTupleSatisfiesVacuum changes. I believe we need changes to HeapTupleSatisfiesVacuum given that logical decoding changes the assumption that catalog tuples belonging to a transaction which never committed can be reclaimed immediately. With 2PC logical decoding or streaming logical decoding, we can always have a split time window in which the ongoing decode cycle needs those tuples. The solution is that even for aborted transactions, we do not return HEAPTUPLE_DEAD if the transaction id is newer than the OldestXmin (the same logic we use for deleted tuples of committed transactions). We can do this only for catalog table rows (both system and user defined) to limit the scope of impact. In any case, this needs to be a separate patch along with a separate discussion thread.

Peter, I will submit a follow-on patch with documentation changes soon. But this patch is complete IMO, with all the required 2PC logical decoding functionality.

Comments and feedback are most welcome.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
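To make the DecodeCommit() decision concrete, it reduces to roughly this (an illustrative sketch only: the wrapper and ReorderBuffer function names here are assumptions, not necessarily what the patch uses):

    /* processing a COMMIT PREPARED record (sketch) */
    bool    decoded_at_prepare =
        !filter_prepare_cb_wrapper(ctx, txn /* may be NULL */, xid,
                                   parsed->twophase_gid);

    if (decoded_at_prepare)
    {
        /* changes were already streamed at PREPARE time; emit only the
         * commit_prepared callback downstream (hypothetical function) */
        ReorderBufferFinishPrepared(ctx->reorder, xid, parsed->twophase_gid);
    }
    else
    {
        /* filtered at PREPARE; replay the whole xact now, like a plain
         * commit */
        ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
                            commit_time, origin_id, origin_lsn);
    }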
Attachment
On 12/19/17 03:37, Nikhil Sontakke wrote:
> Note that this patch does not contain the HeapTupleSatisfiesVacuum changes. I believe we need changes to HeapTupleSatisfiesVacuum given that logical decoding changes the assumption that catalog tuples belonging to a transaction which never committed can be reclaimed immediately. With 2PC logical decoding or streaming logical decoding, we can always have a split time window in which the ongoing decode cycle needs those tuples. The solution is that even for aborted transactions, we do not return HEAPTUPLE_DEAD if the transaction id is newer than the OldestXmin (the same logic we use for deleted tuples of committed transactions). We can do this only for catalog table rows (both system and user defined) to limit the scope of impact. In any case, this needs to be a separate patch along with a separate discussion thread.

Are you working on that as well?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>> Note that this patch does not contain the HeapTupleSatisfiesVacuum changes. I believe we need changes to HeapTupleSatisfiesVacuum given that logical decoding changes the assumption that catalog tuples belonging to a transaction which never committed can be reclaimed immediately. With 2PC logical decoding or streaming logical decoding, we can always have a split time window in which the ongoing decode cycle needs those tuples. The solution is that even for aborted transactions, we do not return HEAPTUPLE_DEAD if the transaction id is newer than the OldestXmin (the same logic we use for deleted tuples of committed transactions). We can do this only for catalog table rows (both system and user defined) to limit the scope of impact. In any case, this needs to be a separate patch along with a separate discussion thread.
>
> Are you working on that as well?

Sure, I was planning to work on that after getting the documentation for this patch out of the way.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Hi,

>> Are you working on that as well?
>
> Sure, I was planning to work on that after getting the documentation for this patch out of the way.

PFA, patch with documentation. I have added the requisite entries in the logical decoding output plugins section. No changes are needed elsewhere, AFAICS.

I will submit the HeapTupleSatisfiesVacuum patch in a separate discussion soon.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment
Hi,

> PFA, patch with documentation. I have added the requisite entries in the logical decoding output plugins section. No changes are needed elsewhere, AFAICS.

PFA, a patch which applies cleanly against latest git head. I also removed unwanted newlines and took care of the cleanup TODO about making the ReorderBufferTXN structure use a txn_flags field instead of separate booleans for various statuses like has_catalog_changes, is_subxact, is_serialized etc. (a rough sketch of the resulting representation is below my sig). The patch uses this txn_flags field for the newer prepare-related info as well.

"make check-world" passes ok, including the additional regular and TAP tests that we have added as part of this patch.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
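To illustrate the txn_flags idea, something like this (the bit values and macro name are illustrative, not necessarily what the patch uses):

    /* sketch: one flags word replaces the individual booleans */
    #define RBTXN_HAS_CATALOG_CHANGES   0x0001
    #define RBTXN_IS_SUBXACT            0x0002
    #define RBTXN_IS_SERIALIZED         0x0004
    #define RBTXN_PREPARE               0x0008

    /* e.g. test whether a txn was decoded at PREPARE time */
    #define rbtxn_prepared(txn) \
        (((txn)->txn_flags & RBTXN_PREPARE) != 0)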
Attachment
Hi all,

> PFA, a patch which applies cleanly against latest git head. I also removed unwanted newlines and took care of the cleanup TODO about making the ReorderBufferTXN structure use a txn_flags field instead of separate booleans for various statuses like has_catalog_changes, is_subxact, is_serialized etc. The patch uses this txn_flags field for the newer prepare-related info as well.
>
> "make check-world" passes ok, including the additional regular and TAP tests that we have added as part of this patch.

PFA, the latest version of this patch. It takes care of the abort-while-decoding issue, along with additional test cases and documentation changes.

We now maintain a list of processes that are decoding a specific transaction ID and make each one a decode groupmember of a decode groupleader process. The decode groupleader process is basically the PGPROC entry which points to the prepared 2PC transaction or an ongoing regular transaction. If the 2PC is rolled back, then FinishPreparedTransactions uses the decode groupleader process to let all the decode groupmember processes know that it's aborting. Similar logic can be used for the decoding of uncommitted transactions. The decode groupmember processes are able to abort sanely in such a case.

We also have two new APIs, "LogicalLockTransaction" and "LogicalUnlockTransaction", that the decoding backends need to use while accessing system or user catalog tables (a usage sketch follows at the end of this mail). The abort code interlocks with decoding backends that might be in the process of accessing catalog tables and waits for those few moments before aborting the transaction.

The implementation uses LockHashPartitionLockByProc on the decode groupleader process to control access to these additional fields in the PGPROC structure amongst the decode groupleader and the other decode groupmember processes, and does not need to use the ProcArrayLock at all. The implementation is inspired by the *existing* lockGroupLeader solution, which uses a similar technique to track processes waiting on a leader holding that lock. I believe it's an optimal solution for this problem of ours.

I have added TAP tests to test multiple decoding backends working on the same transaction. I used delays in the test_decoding plugin to introduce waits after making the LogicalLockTransaction call and then called ROLLBACK, to ensure that it interlocks with decoding backends which are doing catalog access. The tests work as desired. Also, "make check-world" passes with asserts enabled.

I will post this same explanation about abort handling on the other thread (http://www.postgresql-archive.org/Logical-Decoding-and-HeapTupleSatisfiesVacuum-assumptions-td5998294.html).

Comments appreciated.

Regards,
Nikhils
--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
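Usage from a decoding backend is intended to look roughly like this (the two API names are from the patch; the body is an illustrative sketch, not actual patch code):

    /* bracket any catalog access while decoding a change (sketch) */
    if (!LogicalLockTransaction(txn))
        return;             /* xact aborted concurrently; stop decoding */

    relation = RelationIdGetRelation(relid);
    /* ... relcache/syscache lookups needed to decode this change ... */
    RelationClose(relation);

    LogicalUnlockTransaction(txn);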
Attachment
Hi!

Thanks for working on this patch.

Reading through the patch, I've noticed that you deleted the call to SnapBuildCommitTxn() in DecodePrepare(). As you correctly spotted upthread, there was unnecessary code that marked the transaction as running after decoding of the prepare. However, the call marking it as committed before decoding of the prepare IMHO is still needed, as SnapBuildCommitTxn does some useful things like setting the base snapshot for parent transactions which were skipped because of SnapBuildXactNeedsSkip().

E.g. the current code will crash in an assert for the following transaction:

BEGIN;
SAVEPOINT one;
CREATE TABLE test_prepared_savepoints (a int);
PREPARE TRANSACTION 'x';
COMMIT PREPARED 'x';
:get_with2pc_nofilter
:get_with2pc_nofilter <- second call will crash the decoder

with the following backtrace:

frame #3: 0x000000010dc47b40 postgres`ExceptionalCondition(conditionName="!(txn->ninvalidations == 0)", errorType="FailedAssertion", fileName="reorderbuffer.c", lineNumber=1944) at assert.c:54
frame #4: 0x000000010d9ff4dc postgres`ReorderBufferForget(rb=0x00007fe1ab832318, xid=816, lsn=35096144) at reorderbuffer.c:1944
frame #5: 0x000000010d9f055c postgres`DecodePrepare(ctx=0x00007fe1ab81b918, buf=0x00007ffee2650408, parsed=0x00007ffee2650088) at decode.c:703
frame #6: 0x000000010d9ef718 postgres`DecodeXactOp(ctx=0x00007fe1ab81b918, buf=0x00007ffee2650408) at decode.c:310

That can be fixed by calling SnapBuildCommitTxn() in DecodePrepare(), which I believe is safe because during normal work a prepared transaction holds relation locks until commit/abort, and in between nobody can access the altered relations (or maybe I just don't know of such situations — that was the reason why I had marked those xids as running in previous versions).

> On 6 Feb 2018, at 15:20, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
>
> Hi all,
>
>> PFA, a patch which applies cleanly against latest git head. I also removed unwanted newlines and took care of the cleanup TODO about making the ReorderBufferTXN structure use a txn_flags field instead of separate booleans for various statuses like has_catalog_changes, is_subxact, is_serialized etc. The patch uses this txn_flags field for the newer prepare-related info as well.
>>
>> "make check-world" passes ok, including the additional regular and TAP tests that we have added as part of this patch.
>
> PFA, the latest version of this patch.
>
> This latest version takes care of the abort-while-decoding issue along with additional test cases and documentation changes.

--
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi Stas,

> Reading through the patch I've noticed that you deleted the call to SnapBuildCommitTxn()
> in DecodePrepare(). As you correctly spotted upthread, there was unnecessary
> code that marked the transaction as running after decoding of the prepare. However, the call
> marking it as committed before decoding of the prepare is IMHO still needed, as
> SnapBuildCommitTxn does some useful things like setting the base snapshot for parent
> transactions which were skipped because of SnapBuildXactNeedsSkip().
>
> E.g. current code will crash in an assert for the following transaction:
>
> BEGIN;
> SAVEPOINT one;
> CREATE TABLE test_prepared_savepoints (a int);
> PREPARE TRANSACTION 'x';
> COMMIT PREPARED 'x';
> :get_with2pc_nofilter
> :get_with2pc_nofilter    <- second call will crash decoder
>

Thanks for taking a look!

The first ":get_with2pc_nofilter" call consumes the data
appropriately. The second ":get_with2pc_nofilter" sees that it has to
skip and hence enters the ReorderBufferForget() function in the skip
code path, causing the assert.

If we have to skip anyway, why do we need to set up
SnapBuildCommitTxn() for such a transaction? I don't see the need for
doing that for skipped transactions.

Will continue to look at this and will add this scenario to the test
cases. Further comments/feedback appreciated.

Regards,
Nikhils
--
 Nikhil Sontakke                   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Hi,

First off: This patch has way too many different types of changes as
part of one huge commit. This needs to be split into several pieces.
First the cleanups (e.g. the fields -> flag changes), then the
individual infrastructure pieces (like the twophase.c changes, best
split into several pieces as well, the locking stuff), then the main
feature, then support for it in the output plugin. Each should have an
individual explanation about why the change is necessary and not a bad
idea.

On 2018-02-06 17:50:40 +0530, Nikhil Sontakke wrote:
> @@ -46,6 +48,9 @@ typedef struct
>     bool        skip_empty_xacts;
>     bool        xact_wrote_changes;
>     bool        only_local;
> +   bool        twophase_decoding;
> +   bool        twophase_decode_with_catalog_changes;
> +   int         decode_delay;   /* seconds to sleep after every change record */

This seems too big a crock to add just for testing. It'll also make the
testing timing dependent...

> } TestDecodingData;

> void
> _PG_init(void)
> @@ -85,9 +106,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
>     cb->begin_cb = pg_decode_begin_txn;
>     cb->change_cb = pg_decode_change;
>     cb->commit_cb = pg_decode_commit_txn;
> +   cb->abort_cb = pg_decode_abort_txn;

>     cb->filter_by_origin_cb = pg_decode_filter;
>     cb->shutdown_cb = pg_decode_shutdown;
>     cb->message_cb = pg_decode_message;
> +   cb->filter_prepare_cb = pg_filter_prepare;
> +   cb->filter_decode_txn_cb = pg_filter_decode_txn;
> +   cb->prepare_cb = pg_decode_prepare_txn;
> +   cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
> +   cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
> }

Why does this introduce both abort_cb and abort_prepared_cb? That seems
to conflate two separate features.

> +/* Filter out unnecessary two-phase transactions */
> +static bool
> +pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> +                 TransactionId xid, const char *gid)
> +{
> +   TestDecodingData *data = ctx->output_plugin_private;
> +
> +   /* treat all transactions as one-phase */
> +   if (!data->twophase_decoding)
> +       return true;
> +
> +   if (txn && txn_has_catalog_changes(txn) &&
> +       !data->twophase_decode_with_catalog_changes)
> +       return true;

What? I'm INCREDIBLY doubtful this is a sane thing to expose to output
plugins. As in, unless I hear a very very convincing reason I'm
strongly opposed.

> +/*
> + * Check if we should continue to decode this transaction.
> + *
> + * If it has aborted in the meanwhile, then there's no sense
> + * in decoding and sending the rest of the changes, we might
> + * as well ask the subscribers to abort immediately.
> + *
> + * This should be called if we are streaming a transaction
> + * before it's committed or if we are decoding a 2PC
> + * transaction. Otherwise we always decode committed
> + * transactions
> + *
> + * Additional checks can be added here, as needed
> + */
> +static bool
> +pg_filter_decode_txn(LogicalDecodingContext *ctx,
> +                    ReorderBufferTXN *txn)
> +{
> +   /*
> +    * Due to caching, repeated TransactionIdDidAbort calls
> +    * shouldn't be that expensive
> +    */
> +   if (txn != NULL &&
> +       TransactionIdIsValid(txn->xid) &&
> +       TransactionIdDidAbort(txn->xid))
> +       return true;
> +
> +   /* if txn is NULL, filter it out */

Why can this be NULL?

> +   return (txn != NULL) ? false : true;
> +}

This definitely shouldn't be a task for each output plugin. Even if we
want to make this configurable, I'm doubtful that it's a good idea to
do so here - handling it centrally makes it much less likely to hit
edge cases.
> static bool
> pg_decode_filter(LogicalDecodingContext *ctx,
>                  RepOriginId origin_id)
> @@ -409,8 +622,18 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>     }
>     data->xact_wrote_changes = true;
>
> +   if (!LogicalLockTransaction(txn))
> +       return;

It really really can't be right that this is exposed to output plugins.

> +   /* if decode_delay is specified, sleep with above lock held */
> +   if (data->decode_delay > 0)
> +   {
> +       elog(LOG, "sleeping for %d seconds", data->decode_delay);
> +       pg_usleep(data->decode_delay * 1000000L);
> +   }

Really not on board.

> @@ -1075,6 +1077,21 @@ EndPrepare(GlobalTransaction gxact)
>     Assert(hdr->magic == TWOPHASE_MAGIC);
>     hdr->total_len = records.total_len + sizeof(pg_crc32c);
>
> +   replorigin = (replorigin_session_origin != InvalidRepOriginId &&
> +                 replorigin_session_origin != DoNotReplicateId);
> +
> +   if (replorigin)
> +   {
> +       Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
> +       hdr->origin_lsn = replorigin_session_origin_lsn;
> +       hdr->origin_timestamp = replorigin_session_origin_timestamp;
> +   }
> +   else
> +   {
> +       hdr->origin_lsn = InvalidXLogRecPtr;
> +       hdr->origin_timestamp = 0;
> +   }
> +
>     /*
>      * If the data size exceeds MaxAllocSize, we won't be able to read it in
>      * ReadTwoPhaseFile. Check for that now, rather than fail in the case
> @@ -1107,7 +1124,16 @@ EndPrepare(GlobalTransaction gxact)
>     XLogBeginInsert();
>     for (record = records.head; record != NULL; record = record->next)
>         XLogRegisterData(record->data, record->len);
> +
> +   XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
> +

Can we perhaps merge a bit of the code with the plain commit path on
this?

>     gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
> +
> +   if (replorigin)
> +       /* Move LSNs forward for this replication origin */
> +       replorigin_session_advance(replorigin_session_origin_lsn,
> +                                  gxact->prepare_end_lsn);
> +

Why is it ok to do this at PREPARE time? I guess the theory is that the
origin LSN is going to be from the source's PREPARE too? If so, this
needs to be commented upon here.

> +/*
> + * ParsePrepareRecord
> + */
> +void
> +ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
> +{
> +   TwoPhaseFileHeader *hdr;
> +   char       *bufptr;
> +
> +   hdr = (TwoPhaseFileHeader *) xlrec;
> +   bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
> +
> +   parsed->origin_lsn = hdr->origin_lsn;
> +   parsed->origin_timestamp = hdr->origin_timestamp;
> +   parsed->twophase_xid = hdr->xid;
> +   parsed->dbId = hdr->database;
> +   parsed->nsubxacts = hdr->nsubxacts;
> +   parsed->ncommitrels = hdr->ncommitrels;
> +   parsed->nabortrels = hdr->nabortrels;
> +   parsed->nmsgs = hdr->ninvalmsgs;
> +
> +   strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
> +   bufptr += MAXALIGN(hdr->gidlen);
> +
> +   parsed->subxacts = (TransactionId *) bufptr;
> +   bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
> +
> +   parsed->commitrels = (RelFileNode *) bufptr;
> +   bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
> +
> +   parsed->abortrels = (RelFileNode *) bufptr;
> +   bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
> +
> +   parsed->msgs = (SharedInvalidationMessage *) bufptr;
> +   bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
> +}

So this is now basically a commit record. I quite dislike duplicating
things this way. Can't we make commit records versatile enough to
represent this without problems?

> /*
>  * Reads 2PC data from xlog.
>  * During checkpoint this data will be moved to
> @@ -1365,7 +1428,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
>  * FinishPreparedTransaction: execute COMMIT PREPARED or ROLLBACK PREPARED
>  */
> void
> -FinishPreparedTransaction(const char *gid, bool isCommit)
> +FinishPreparedTransaction(const char *gid, bool isCommit, bool missing_ok)
> {
>     GlobalTransaction gxact;
>     PGPROC     *proc;
> @@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
>     /*
>      * Validate the GID, and lock the GXACT to ensure that two backends do not
>      * try to commit the same GID at once.
> +    *
> +    * During logical decoding, on the apply side, it's possible that a prepared
> +    * transaction got aborted while decoding. In that case, we stop the
> +    * decoding and abort the transaction immediately. However the ROLLBACK
> +    * prepared processing still reaches the subscriber. In that case it's ok
> +    * to have a missing gid
>      */
> -   gxact = LockGXact(gid, GetUserId());
> +   gxact = LockGXact(gid, GetUserId(), missing_ok);
> +   if (gxact == NULL)
> +   {
> +       Assert(missing_ok && !isCommit);
> +       return;
> +   }

I'm very doubtful it is sane to handle this at such a low level.

> @@ -2358,6 +2443,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
>     Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
>     TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
>
> +   if (origin_id != InvalidRepOriginId)
> +   {
> +       /* recover apply progress */
> +       replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
> +                          false /* backward */ , false /* WAL */ );
> +   }
> +

It's unclear to me why this is necessary / a good idea?

>         case XLOG_XACT_PREPARE:
> +           {
> +               xl_xact_parsed_prepare parsed;
>
> -           /*
> -            * Currently decoding ignores PREPARE TRANSACTION and will just
> -            * decode the transaction when the COMMIT PREPARED is sent or
> -            * throw away the transaction's contents when a ROLLBACK PREPARED
> -            * is received. In the future we could add code to expose prepared
> -            * transactions in the changestream allowing for a kind of
> -            * distributed 2PC.
> -            */
> -           ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
> +               /* check that output plugin is capable of twophase decoding */
> +               if (!ctx->enable_twophase)
> +               {
> +                   ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
> +                   break;
> +               }
> +
> +               /* ok, parse it */
> +               ParsePrepareRecord(XLogRecGetInfo(buf->record),
> +                                  XLogRecGetData(buf->record), &parsed);
> +
> +               /* does output plugin want this particular transaction? */
> +               if (ctx->callbacks.filter_prepare_cb &&
> +                   ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
> +                                                parsed.twophase_gid))
> +               {
> +                   ReorderBufferProcessXid(reorder, parsed.twophase_xid,
> +                                           buf->origptr);

We're calling ReorderBufferProcessXid() on two different xids in
different branches, is that intentional?

> +   if (TransactionIdIsValid(parsed->twophase_xid) &&
> +       ReorderBufferTxnIsPrepared(ctx->reorder,
> +                                  parsed->twophase_xid, parsed->twophase_gid))
> +   {
> +       Assert(xid == parsed->twophase_xid);
> +       /* we are processing COMMIT PREPARED */
> +       ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
> +                                   commit_time, origin_id, origin_lsn,
> +                                   parsed->twophase_gid, true);
> +   }
> +   else
> +   {
> +       /* replay actions of all transaction + subtransactions in order */
> +       ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
> +                           commit_time, origin_id, origin_lsn);
> +   }
> +}

Why do we want this via the same routine?
> +bool
> +LogicalLockTransaction(ReorderBufferTXN *txn)
> +{
> +   bool        ok = false;
> +
> +   /*
> +    * Prepared transactions and uncommitted transactions
> +    * that have modified catalogs need to interlock with
> +    * concurrent rollback to ensure that there are no
> +    * issues while decoding
> +    */
> +
> +   if (!txn_has_catalog_changes(txn))
> +       return true;
> +
> +   /*
> +    * Is it a prepared txn? Similar checks for uncommitted
> +    * transactions when we start supporting them
> +    */
> +   if (!txn_prepared(txn))
> +       return true;
> +
> +   /* check cached status */
> +   if (txn_commit(txn))
> +       return true;
> +   if (txn_rollback(txn))
> +       return false;
> +
> +   /*
> +    * Find the PROC that is handling this XID and add ourself as a
> +    * decodeGroupMember
> +    */
> +   if (MyProc->decodeGroupLeader == NULL)
> +   {
> +       PGPROC     *proc = BecomeDecodeGroupLeader(txn->xid, txn_prepared(txn));
> +
> +       /*
> +        * If decodeGroupLeader is NULL, then the only possibility
> +        * is that the transaction completed and went away
> +        */
> +       if (proc == NULL)
> +       {
> +           Assert(!TransactionIdIsInProgress(txn->xid));
> +           if (TransactionIdDidCommit(txn->xid))
> +           {
> +               txn->txn_flags |= TXN_COMMIT;
> +               return true;
> +           }
> +           else
> +           {
> +               txn->txn_flags |= TXN_ROLLBACK;
> +               return false;
> +           }
> +       }
> +
> +       /* Add ourself as a decodeGroupMember */
> +       if (!BecomeDecodeGroupMember(proc, proc->pid, txn_prepared(txn)))
> +       {
> +           Assert(!TransactionIdIsInProgress(txn->xid));
> +           if (TransactionIdDidCommit(txn->xid))
> +           {
> +               txn->txn_flags |= TXN_COMMIT;
> +               return true;
> +           }
> +           else
> +           {
> +               txn->txn_flags |= TXN_ROLLBACK;
> +               return false;
> +           }
> +       }
> +   }

Are we ok with this low-level lock / pgproc stuff happening outside of
procarray / lock related files? Where is the locking scheme documented?

> +/* ReorderBufferTXN flags */
> +#define TXN_HAS_CATALOG_CHANGES 0x0001
> +#define TXN_IS_SUBXACT          0x0002
> +#define TXN_SERIALIZED          0x0004
> +#define TXN_PREPARE             0x0008
> +#define TXN_COMMIT_PREPARED     0x0010
> +#define TXN_ROLLBACK_PREPARED   0x0020
> +#define TXN_COMMIT              0x0040
> +#define TXN_ROLLBACK            0x0080
> +
> +/* does the txn have catalog changes */
> +#define txn_has_catalog_changes(txn) (txn->txn_flags & TXN_HAS_CATALOG_CHANGES)
> +/* is the txn known as a subxact? */
> +#define txn_is_subxact(txn) (txn->txn_flags & TXN_IS_SUBXACT)
> +/*
> + * Has this transaction been spilled to disk? It's not always possible to
> + * deduce that fact by comparing nentries with nentries_mem, because e.g.
> + * subtransactions of a large transaction might get serialized together
> + * with the parent - if they're restored to memory they'd have
> + * nentries_mem == nentries.
> + */
> +#define txn_is_serialized(txn) (txn->txn_flags & TXN_SERIALIZED)
> +/* is this txn prepared? */
> +#define txn_prepared(txn) (txn->txn_flags & TXN_PREPARE)
> +/* was this prepared txn committed in the meanwhile? */
> +#define txn_commit_prepared(txn) (txn->txn_flags & TXN_COMMIT_PREPARED)
> +/* was this prepared txn aborted in the meanwhile? */
> +#define txn_rollback_prepared(txn) (txn->txn_flags & TXN_ROLLBACK_PREPARED)
> +/* was this txn committed in the meanwhile? */
> +#define txn_commit(txn) (txn->txn_flags & TXN_COMMIT)
> +/* was this txn aborted in the meanwhile? */
> +#define txn_rollback(txn) (txn->txn_flags & TXN_ROLLBACK)
> +

These txn_* names seem too generic, imo - fairly likely to conflict
with other pieces of code.

Greetings,

Andres Freund
Hi Andres,

> First off: This patch has way too many different types of changes as
> part of one huge commit. This needs to be split into several
> pieces. First the cleanups (e.g. the fields -> flag changes), then the
> individual infrastructure pieces (like the twophase.c changes, best
> split into several pieces as well, the locking stuff), then the main
> feature, then support for it in the output plugin. Each should have an
> individual explanation about why the change is necessary and not a bad
> idea.
>

Ok, I will break this patch into multiple logical pieces and re-submit.

> On 2018-02-06 17:50:40 +0530, Nikhil Sontakke wrote:
>> @@ -46,6 +48,9 @@ typedef struct
>>     bool        skip_empty_xacts;
>>     bool        xact_wrote_changes;
>>     bool        only_local;
>> +   bool        twophase_decoding;
>> +   bool        twophase_decode_with_catalog_changes;
>> +   int         decode_delay;   /* seconds to sleep after every change record */
>
> This seems too big a crock to add just for testing. It'll also make the
> testing timing dependent...
>

The idea *was* to make testing timing dependent. We wanted to simulate
the case when a rollback is issued by another backend while the
decoding is still ongoing. This allows that test case to be tested.

>> } TestDecodingData;
>
>> void
>> _PG_init(void)
>> @@ -85,9 +106,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
>>     cb->begin_cb = pg_decode_begin_txn;
>>     cb->change_cb = pg_decode_change;
>>     cb->commit_cb = pg_decode_commit_txn;
>> +   cb->abort_cb = pg_decode_abort_txn;
>
>>     cb->filter_by_origin_cb = pg_decode_filter;
>>     cb->shutdown_cb = pg_decode_shutdown;
>>     cb->message_cb = pg_decode_message;
>> +   cb->filter_prepare_cb = pg_filter_prepare;
>> +   cb->filter_decode_txn_cb = pg_filter_decode_txn;
>> +   cb->prepare_cb = pg_decode_prepare_txn;
>> +   cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
>> +   cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
>> }
>
> Why does this introduce both abort_cb and abort_prepared_cb? That seems
> to conflate two separate features.
>

Consider the case when we have a bunch of change records to apply for
a transaction. We send a "BEGIN" and then start decoding each change
record one by one. Now suppose a rollback is encountered while we are
decoding. In that case it doesn't make sense to keep on decoding and
sending the change records. We immediately send a regular ABORT. We
cannot send "ROLLBACK PREPARED" because the transaction was not
prepared on the subscriber, and have to send a regular ABORT instead.
And we need the "ROLLBACK PREPARED" callback for the case when a
prepared transaction gets rolled back and is encountered during the
usual WAL processing.

Please take a look at "contrib/test_decoding/t/001_twophase.pl" where
this test case is enacted.

>> +/* Filter out unnecessary two-phase transactions */
>> +static bool
>> +pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>> +                 TransactionId xid, const char *gid)
>> +{
>> +   TestDecodingData *data = ctx->output_plugin_private;
>> +
>> +   /* treat all transactions as one-phase */
>> +   if (!data->twophase_decoding)
>> +       return true;
>> +
>> +   if (txn && txn_has_catalog_changes(txn) &&
>> +       !data->twophase_decode_with_catalog_changes)
>> +       return true;
>
> What? I'm INCREDIBLY doubtful this is a sane thing to expose to output
> plugins. As in, unless I hear a very very convincing reason I'm strongly
> opposed.
>

These bools are specific to the test_decoding plugin. Again, these are
useful in testing decoding in various scenarios with twophase decoding
enabled/disabled.
Testing decoding when catalog changes are allowed/disallowed etc.
Please take a look at "contrib/test_decoding/sql/prepared.sql" for the
various scenarios.

>> +/*
>> + * Check if we should continue to decode this transaction.
>> + *
>> + * If it has aborted in the meanwhile, then there's no sense
>> + * in decoding and sending the rest of the changes, we might
>> + * as well ask the subscribers to abort immediately.
>> + *
>> + * This should be called if we are streaming a transaction
>> + * before it's committed or if we are decoding a 2PC
>> + * transaction. Otherwise we always decode committed
>> + * transactions
>> + *
>> + * Additional checks can be added here, as needed
>> + */
>> +static bool
>> +pg_filter_decode_txn(LogicalDecodingContext *ctx,
>> +                    ReorderBufferTXN *txn)
>> +{
>> +   /*
>> +    * Due to caching, repeated TransactionIdDidAbort calls
>> +    * shouldn't be that expensive
>> +    */
>> +   if (txn != NULL &&
>> +       TransactionIdIsValid(txn->xid) &&
>> +       TransactionIdDidAbort(txn->xid))
>> +       return true;
>> +
>> +   /* if txn is NULL, filter it out */
>
> Why can this be NULL?
>

Depending on the parameters passed to the ReorderBufferTXNByXid()
function, the txn might be NULL in some cases, especially during
restarts.

>> +   return (txn != NULL) ? false : true;
>> +}
>
> This definitely shouldn't be a task for each output plugin. Even if we
> want to make this configurable, I'm doubtful that it's a good idea to do
> so here - handling it centrally makes it much less likely to hit edge
> cases.
>

Agreed, I will try to add it to the core logical decoding handling.

>> static bool
>> pg_decode_filter(LogicalDecodingContext *ctx,
>>                  RepOriginId origin_id)
>> @@ -409,8 +622,18 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>>     }
>>     data->xact_wrote_changes = true;
>>
>> +   if (!LogicalLockTransaction(txn))
>> +       return;
>
> It really really can't be right that this is exposed to output plugins.
>

This was discussed in the other thread
(http://www.postgresql-archive.org/Logical-Decoding-and-HeapTupleSatisfiesVacuum-assumptions-td5998294i20.html).
Any catalog access in any plugin needs to interlock with concurrent
aborts. This is only a problem if the transaction is a prepared one or
a yet-uncommitted one. In the majority of cases, this function will do
nothing at all.

>> +   /* if decode_delay is specified, sleep with above lock held */
>> +   if (data->decode_delay > 0)
>> +   {
>> +       elog(LOG, "sleeping for %d seconds", data->decode_delay);
>> +       pg_usleep(data->decode_delay * 1000000L);
>> +   }
>
> Really not on board.
>

Again, specific to the test_decoding plugin. We want to test the
interlocking code for concurrent abort handling, which needs to wait
for plugins in the locked state before allowing the rollback to go
ahead. Please take a look at "contrib/test_decoding/t/001_twophase.pl"
and the "Waiting for backends to abort" string.
>
>
>
>> @@ -1075,6 +1077,21 @@ EndPrepare(GlobalTransaction gxact)
>>     Assert(hdr->magic == TWOPHASE_MAGIC);
>>     hdr->total_len = records.total_len + sizeof(pg_crc32c);
>>
>> +   replorigin = (replorigin_session_origin != InvalidRepOriginId &&
>> +                 replorigin_session_origin != DoNotReplicateId);
>> +
>> +   if (replorigin)
>> +   {
>> +       Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
>> +       hdr->origin_lsn = replorigin_session_origin_lsn;
>> +       hdr->origin_timestamp = replorigin_session_origin_timestamp;
>> +   }
>> +   else
>> +   {
>> +       hdr->origin_lsn = InvalidXLogRecPtr;
>> +       hdr->origin_timestamp = 0;
>> +   }
>> +
>>     /*
>>      * If the data size exceeds MaxAllocSize, we won't be able to read it in
>>      * ReadTwoPhaseFile. Check for that now, rather than fail in the case
>> @@ -1107,7 +1124,16 @@ EndPrepare(GlobalTransaction gxact)
>>     XLogBeginInsert();
>>     for (record = records.head; record != NULL; record = record->next)
>>         XLogRegisterData(record->data, record->len);
>> +
>> +   XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
>> +
>
> Can we perhaps merge a bit of the code with the plain commit path on
> this?
>

Given that PREPARE / ROLLBACK PREPARED handling is totally separate
from the regular commit code paths, wouldn't it be a little difficult?

>>     gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
>> +
>> +   if (replorigin)
>> +       /* Move LSNs forward for this replication origin */
>> +       replorigin_session_advance(replorigin_session_origin_lsn,
>> +                                  gxact->prepare_end_lsn);
>> +
>
> Why is it ok to do this at PREPARE time? I guess the theory is that the
> origin LSN is going to be from the source's PREPARE too? If so, this
> needs to be commented upon here.
>

Ok, will add a comment.

>> +/*
>> + * ParsePrepareRecord
>> + */
>> +void
>> +ParsePrepareRecord(uint8 info, char *xlrec, xl_xact_parsed_prepare *parsed)
>> +{
>> +   TwoPhaseFileHeader *hdr;
>> +   char       *bufptr;
>> +
>> +   hdr = (TwoPhaseFileHeader *) xlrec;
>> +   bufptr = xlrec + MAXALIGN(sizeof(TwoPhaseFileHeader));
>> +
>> +   parsed->origin_lsn = hdr->origin_lsn;
>> +   parsed->origin_timestamp = hdr->origin_timestamp;
>> +   parsed->twophase_xid = hdr->xid;
>> +   parsed->dbId = hdr->database;
>> +   parsed->nsubxacts = hdr->nsubxacts;
>> +   parsed->ncommitrels = hdr->ncommitrels;
>> +   parsed->nabortrels = hdr->nabortrels;
>> +   parsed->nmsgs = hdr->ninvalmsgs;
>> +
>> +   strncpy(parsed->twophase_gid, bufptr, hdr->gidlen);
>> +   bufptr += MAXALIGN(hdr->gidlen);
>> +
>> +   parsed->subxacts = (TransactionId *) bufptr;
>> +   bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId));
>> +
>> +   parsed->commitrels = (RelFileNode *) bufptr;
>> +   bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode));
>> +
>> +   parsed->abortrels = (RelFileNode *) bufptr;
>> +   bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode));
>> +
>> +   parsed->msgs = (SharedInvalidationMessage *) bufptr;
>> +   bufptr += MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
>> +}
>
> So this is now basically a commit record. I quite dislike duplicating
> things this way. Can't we make commit records versatile enough to
> represent this without problems?
>

Maybe we can. We have already re-used existing records for
XLOG_XACT_COMMIT_PREPARED and XLOG_XACT_ABORT_PREPARED. We can add a
flag to existing commit records to indicate that it's a PREPARE and
not a COMMIT.

>> /*
>>  * Reads 2PC data from xlog.
>>  * During checkpoint this data will be moved to
>> @@ -1365,7 +1428,7 @@ StandbyTransactionIdIsPrepared(TransactionId xid)
>>  * FinishPreparedTransaction: execute COMMIT PREPARED or ROLLBACK PREPARED
>>  */
>> void
>> -FinishPreparedTransaction(const char *gid, bool isCommit)
>> +FinishPreparedTransaction(const char *gid, bool isCommit, bool missing_ok)
>> {
>>     GlobalTransaction gxact;
>>     PGPROC     *proc;
>> @@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
>>     /*
>>      * Validate the GID, and lock the GXACT to ensure that two backends do not
>>      * try to commit the same GID at once.
>> +    *
>> +    * During logical decoding, on the apply side, it's possible that a prepared
>> +    * transaction got aborted while decoding. In that case, we stop the
>> +    * decoding and abort the transaction immediately. However the ROLLBACK
>> +    * prepared processing still reaches the subscriber. In that case it's ok
>> +    * to have a missing gid
>>      */
>> -   gxact = LockGXact(gid, GetUserId());
>> +   gxact = LockGXact(gid, GetUserId(), missing_ok);
>> +   if (gxact == NULL)
>> +   {
>> +       Assert(missing_ok && !isCommit);
>> +       return;
>> +   }
>
> I'm very doubtful it is sane to handle this at such a low level.
>

FinishPreparedTransaction() is called directly from ProcessUtility. If
not here, where else could we do this?

>> @@ -2358,6 +2443,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
>>     Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
>>     TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
>>
>> +   if (origin_id != InvalidRepOriginId)
>> +   {
>> +       /* recover apply progress */
>> +       replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
>> +                          false /* backward */ , false /* WAL */ );
>> +   }
>> +
>
> It's unclear to me why this is necessary / a good idea?
>

Keeping PREPARE handling as close to regular COMMIT handling seems
like a good idea, no?

>>         case XLOG_XACT_PREPARE:
>> +           {
>> +               xl_xact_parsed_prepare parsed;
>>
>> -           /*
>> -            * Currently decoding ignores PREPARE TRANSACTION and will just
>> -            * decode the transaction when the COMMIT PREPARED is sent or
>> -            * throw away the transaction's contents when a ROLLBACK PREPARED
>> -            * is received. In the future we could add code to expose prepared
>> -            * transactions in the changestream allowing for a kind of
>> -            * distributed 2PC.
>> -            */
>> -           ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
>> +               /* check that output plugin is capable of twophase decoding */
>> +               if (!ctx->enable_twophase)
>> +               {
>> +                   ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
>> +                   break;
>> +               }
>> +
>> +               /* ok, parse it */
>> +               ParsePrepareRecord(XLogRecGetInfo(buf->record),
>> +                                  XLogRecGetData(buf->record), &parsed);
>> +
>> +               /* does output plugin want this particular transaction? */
>> +               if (ctx->callbacks.filter_prepare_cb &&
>> +                   ReorderBufferPrepareNeedSkip(reorder, parsed.twophase_xid,
>> +                                                parsed.twophase_gid))
>> +               {
>> +                   ReorderBufferProcessXid(reorder, parsed.twophase_xid,
>> +                                           buf->origptr);
>
> We're calling ReorderBufferProcessXid() on two different xids in
> different branches, is that intentional?
>

Don't think that's intentional. Maybe Stas can also provide his views
on this?
>> +   if (TransactionIdIsValid(parsed->twophase_xid) &&
>> +       ReorderBufferTxnIsPrepared(ctx->reorder,
>> +                                  parsed->twophase_xid, parsed->twophase_gid))
>> +   {
>> +       Assert(xid == parsed->twophase_xid);
>> +       /* we are processing COMMIT PREPARED */
>> +       ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
>> +                                   commit_time, origin_id, origin_lsn,
>> +                                   parsed->twophase_gid, true);
>> +   }
>> +   else
>> +   {
>> +       /* replay actions of all transaction + subtransactions in order */
>> +       ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
>> +                           commit_time, origin_id, origin_lsn);
>> +   }
>> +}
>
> Why do we want this via the same routine?
>

As I mentioned above, xl_xact_parsed_commit handles both regular
commits and also "COMMIT PREPARED". That's why there is one routine
for them both.

>> +bool
>> +LogicalLockTransaction(ReorderBufferTXN *txn)
>> +{
>> +   bool        ok = false;
>> +
>> +   /*
>> +    * Prepared transactions and uncommitted transactions
>> +    * that have modified catalogs need to interlock with
>> +    * concurrent rollback to ensure that there are no
>> +    * issues while decoding
>> +    */
>> +
>> +   if (!txn_has_catalog_changes(txn))
>> +       return true;
>> +
>> +   /*
>> +    * Is it a prepared txn? Similar checks for uncommitted
>> +    * transactions when we start supporting them
>> +    */
>> +   if (!txn_prepared(txn))
>> +       return true;
>> +
>> +   /* check cached status */
>> +   if (txn_commit(txn))
>> +       return true;
>> +   if (txn_rollback(txn))
>> +       return false;
>> +
>> +   /*
>> +    * Find the PROC that is handling this XID and add ourself as a
>> +    * decodeGroupMember
>> +    */
>> +   if (MyProc->decodeGroupLeader == NULL)
>> +   {
>> +       PGPROC     *proc = BecomeDecodeGroupLeader(txn->xid, txn_prepared(txn));
>> +
>> +       /*
>> +        * If decodeGroupLeader is NULL, then the only possibility
>> +        * is that the transaction completed and went away
>> +        */
>> +       if (proc == NULL)
>> +       {
>> +           Assert(!TransactionIdIsInProgress(txn->xid));
>> +           if (TransactionIdDidCommit(txn->xid))
>> +           {
>> +               txn->txn_flags |= TXN_COMMIT;
>> +               return true;
>> +           }
>> +           else
>> +           {
>> +               txn->txn_flags |= TXN_ROLLBACK;
>> +               return false;
>> +           }
>> +       }
>> +
>> +       /* Add ourself as a decodeGroupMember */
>> +       if (!BecomeDecodeGroupMember(proc, proc->pid, txn_prepared(txn)))
>> +       {
>> +           Assert(!TransactionIdIsInProgress(txn->xid));
>> +           if (TransactionIdDidCommit(txn->xid))
>> +           {
>> +               txn->txn_flags |= TXN_COMMIT;
>> +               return true;
>> +           }
>> +           else
>> +           {
>> +               txn->txn_flags |= TXN_ROLLBACK;
>> +               return false;
>> +           }
>> +       }
>> +   }
>
> Are we ok with this low-level lock / pgproc stuff happening outside of
> procarray / lock related files? Where is the locking scheme documented?
>

Some details are in src/include/storage/proc.h where these fields have
been added. This implementation is similar to the existing
lockGroupLeader implementation and uses the same locking mechanism,
via LockHashPartitionLockByProc.

>> +/* ReorderBufferTXN flags */
>> +#define TXN_HAS_CATALOG_CHANGES 0x0001
>> +#define TXN_IS_SUBXACT          0x0002
>> +#define TXN_SERIALIZED          0x0004
>> +#define TXN_PREPARE             0x0008
>> +#define TXN_COMMIT_PREPARED     0x0010
>> +#define TXN_ROLLBACK_PREPARED   0x0020
>> +#define TXN_COMMIT              0x0040
>> +#define TXN_ROLLBACK            0x0080
>> +
>> +/* does the txn have catalog changes */
>> +#define txn_has_catalog_changes(txn) (txn->txn_flags & TXN_HAS_CATALOG_CHANGES)
>> +/* is the txn known as a subxact? */
>> +#define txn_is_subxact(txn) (txn->txn_flags & TXN_IS_SUBXACT)
>> +/*
>> + * Has this transaction been spilled to disk?
>> + * It's not always possible to
>> + * deduce that fact by comparing nentries with nentries_mem, because e.g.
>> + * subtransactions of a large transaction might get serialized together
>> + * with the parent - if they're restored to memory they'd have
>> + * nentries_mem == nentries.
>> + */
>> +#define txn_is_serialized(txn) (txn->txn_flags & TXN_SERIALIZED)
>> +/* is this txn prepared? */
>> +#define txn_prepared(txn) (txn->txn_flags & TXN_PREPARE)
>> +/* was this prepared txn committed in the meanwhile? */
>> +#define txn_commit_prepared(txn) (txn->txn_flags & TXN_COMMIT_PREPARED)
>> +/* was this prepared txn aborted in the meanwhile? */
>> +#define txn_rollback_prepared(txn) (txn->txn_flags & TXN_ROLLBACK_PREPARED)
>> +/* was this txn committed in the meanwhile? */
>> +#define txn_commit(txn) (txn->txn_flags & TXN_COMMIT)
>> +/* was this txn aborted in the meanwhile? */
>> +#define txn_rollback(txn) (txn->txn_flags & TXN_ROLLBACK)
>> +
>
> These txn_* names seem too generic, imo - fairly likely to conflict
> with other pieces of code.
>

Happy to add the RB prefix to all of them for clarity. E.g.

/* ReorderBufferTXN flags */
#define RBTXN_HAS_CATALOG_CHANGES 0x0001

I will soon submit multiple patches with the cleanups discussed above.

Regards,
Nikhils
--
 Nikhil Sontakke                   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
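Extrapolating Nikhil's example, the renamed flag set would presumably
look something like the following (illustrative only, not taken from a
posted patch):

/* ReorderBufferTXN flags, with the RB prefix applied throughout */
#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT          0x0002
#define RBTXN_SERIALIZED          0x0004
#define RBTXN_PREPARE             0x0008

/* the accessor macros would get the same prefix, e.g.: */
#define rbtxn_has_catalog_changes(txn) \
    (((txn)->txn_flags & RBTXN_HAS_CATALOG_CHANGES) != 0)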
Hi,

On 2018-02-12 13:36:16 +0530, Nikhil Sontakke wrote:
> Hi Andres,
>
> > First off: This patch has way too many different types of changes as
> > part of one huge commit. This needs to be split into several
> > pieces. First the cleanups (e.g. the fields -> flag changes), then the
> > individual infrastructure pieces (like the twophase.c changes, best
> > split into several pieces as well, the locking stuff), then the main
> > feature, then support for it in the output plugin. Each should have an
> > individual explanation about why the change is necessary and not a bad
> > idea.
> >
>
> Ok, I will break this patch into multiple logical pieces and re-submit.

Thanks.

> > On 2018-02-06 17:50:40 +0530, Nikhil Sontakke wrote:
> >> @@ -46,6 +48,9 @@ typedef struct
> >>     bool        skip_empty_xacts;
> >>     bool        xact_wrote_changes;
> >>     bool        only_local;
> >> +   bool        twophase_decoding;
> >> +   bool        twophase_decode_with_catalog_changes;
> >> +   int         decode_delay;   /* seconds to sleep after every change record */
> >
> > This seems too big a crock to add just for testing. It'll also make the
> > testing timing dependent...
> >
>
> The idea *was* to make testing timing dependent. We wanted to simulate
> the case when a rollback is issued by another backend while the
> decoding is still ongoing. This allows that test case to be tested.

What I mean is that this will be hell on the buildfarm because the
different animals are differently fast.

> >> } TestDecodingData;
> >
> >> void
> >> _PG_init(void)
> >> @@ -85,9 +106,15 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
> >>     cb->begin_cb = pg_decode_begin_txn;
> >>     cb->change_cb = pg_decode_change;
> >>     cb->commit_cb = pg_decode_commit_txn;
> >> +   cb->abort_cb = pg_decode_abort_txn;
> >
> >>     cb->filter_by_origin_cb = pg_decode_filter;
> >>     cb->shutdown_cb = pg_decode_shutdown;
> >>     cb->message_cb = pg_decode_message;
> >> +   cb->filter_prepare_cb = pg_filter_prepare;
> >> +   cb->filter_decode_txn_cb = pg_filter_decode_txn;
> >> +   cb->prepare_cb = pg_decode_prepare_txn;
> >> +   cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
> >> +   cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
> >> }
> >
> > Why does this introduce both abort_cb and abort_prepared_cb? That seems
> > to conflate two separate features.
> >
>
> Consider the case when we have a bunch of change records to apply for
> a transaction. We send a "BEGIN" and then start decoding each change
> record one by one. Now suppose a rollback is encountered while we are
> decoding.

This will be quite the mess once streaming of changes is introduced.

> >> +/* Filter out unnecessary two-phase transactions */
> >> +static bool
> >> +pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> >> +                 TransactionId xid, const char *gid)
> >> +{
> >> +   TestDecodingData *data = ctx->output_plugin_private;
> >> +
> >> +   /* treat all transactions as one-phase */
> >> +   if (!data->twophase_decoding)
> >> +       return true;
> >> +
> >> +   if (txn && txn_has_catalog_changes(txn) &&
> >> +       !data->twophase_decode_with_catalog_changes)
> >> +       return true;
> >
> > What? I'm INCREDIBLY doubtful this is a sane thing to expose to output
> > plugins. As in, unless I hear a very very convincing reason I'm strongly
> > opposed.
> >
>
> These bools are specific to the test_decoding plugin.

txn_has_catalog_changes() definitely isn't just exposed to
test_decoding. I think you're making the output plugin interface
massively more complicated in this patch and I think we need to push
back on that.
> Again, these are useful in testing decoding in various scenarios with
> twophase decoding enabled/disabled. Testing decoding when catalog
> changes are allowed/disallowed etc. Please take a look at
> "contrib/test_decoding/sql/prepared.sql" for the various scenarios.

I don't see how that addresses my concern in any sort of way.

> >> +/*
> >> + * Check if we should continue to decode this transaction.
> >> + *
> >> + * If it has aborted in the meanwhile, then there's no sense
> >> + * in decoding and sending the rest of the changes, we might
> >> + * as well ask the subscribers to abort immediately.
> >> + *
> >> + * This should be called if we are streaming a transaction
> >> + * before it's committed or if we are decoding a 2PC
> >> + * transaction. Otherwise we always decode committed
> >> + * transactions
> >> + *
> >> + * Additional checks can be added here, as needed
> >> + */
> >> +static bool
> >> +pg_filter_decode_txn(LogicalDecodingContext *ctx,
> >> +                    ReorderBufferTXN *txn)
> >> +{
> >> +   /*
> >> +    * Due to caching, repeated TransactionIdDidAbort calls
> >> +    * shouldn't be that expensive
> >> +    */
> >> +   if (txn != NULL &&
> >> +       TransactionIdIsValid(txn->xid) &&
> >> +       TransactionIdDidAbort(txn->xid))
> >> +       return true;
> >> +
> >> +   /* if txn is NULL, filter it out */
> >
> > Why can this be NULL?
> >
>
> Depending on the parameters passed to the ReorderBufferTXNByXid()
> function, the txn might be NULL in some cases, especially during
> restarts.

That a) isn't an explanation of why that's ok, and b) isn't a reason
why this ever needs to be exposed to the output plugin.

> >> static bool
> >> pg_decode_filter(LogicalDecodingContext *ctx,
> >>                  RepOriginId origin_id)
> >> @@ -409,8 +622,18 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
> >>     }
> >>     data->xact_wrote_changes = true;
> >>
> >> +   if (!LogicalLockTransaction(txn))
> >> +       return;
> >
> > It really really can't be right that this is exposed to output plugins.
> >
>
> This was discussed in the other thread
> (http://www.postgresql-archive.org/Logical-Decoding-and-HeapTupleSatisfiesVacuum-assumptions-td5998294i20.html).
> Any catalog access in any plugin needs to interlock with concurrent
> aborts. This is only a problem if the transaction is a prepared one or
> a yet-uncommitted one. In the majority of cases, this function will do
> nothing at all.

That doesn't address at all that it's not ok that the output plugin
needs to handle this. Doing this in output plugins, the majority of
which are external projects, means that a) the work needs to be done
many times, and b) we can't simply adjust the relevant code in a minor
release, because every output plugin needs to be changed.

> >> +   /* if decode_delay is specified, sleep with above lock held */
> >> +   if (data->decode_delay > 0)
> >> +   {
> >> +       elog(LOG, "sleeping for %d seconds", data->decode_delay);
> >> +       pg_usleep(data->decode_delay * 1000000L);
> >> +   }
> >
> > Really not on board.
> >
>
> Again, specific to the test_decoding plugin.

Again, this is not a justification. People look at the code to write
output plugins. Also see my above complaint about this going to be hell
to get right on slow buildfarm members - we're going to crank up the
sleep times to make it robust-ish.

> >> +   XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
> >> +
> >
> > Can we perhaps merge a bit of the code with the plain commit path on
> > this?
> >
>
> Given that PREPARE / ROLLBACK PREPARED handling is totally separate
> from the regular commit code paths, wouldn't it be a little difficult?

Why?
A helper function doing so ought to be doable.

> >> @@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
> >>     /*
> >>      * Validate the GID, and lock the GXACT to ensure that two backends do not
> >>      * try to commit the same GID at once.
> >> +    *
> >> +    * During logical decoding, on the apply side, it's possible that a prepared
> >> +    * transaction got aborted while decoding. In that case, we stop the
> >> +    * decoding and abort the transaction immediately. However the ROLLBACK
> >> +    * prepared processing still reaches the subscriber. In that case it's ok
> >> +    * to have a missing gid
> >>      */
> >> -   gxact = LockGXact(gid, GetUserId());
> >> +   gxact = LockGXact(gid, GetUserId(), missing_ok);
> >> +   if (gxact == NULL)
> >> +   {
> >> +       Assert(missing_ok && !isCommit);
> >> +       return;
> >> +   }
> >
> > I'm very doubtful it is sane to handle this at such a low level.
> >
>
> FinishPreparedTransaction() is called directly from ProcessUtility. If
> not here, where else could we do this?

I don't think this is something that ought to be handled at this layer
at all. You should get an error in that case; the replay logic needs to
handle that, not the low-level 2PC code.

> >> @@ -2358,6 +2443,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
> >>     Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
> >>     TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
> >>
> >> +   if (origin_id != InvalidRepOriginId)
> >> +   {
> >> +       /* recover apply progress */
> >> +       replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
> >> +                          false /* backward */ , false /* WAL */ );
> >> +   }
> >> +
> >
> > It's unclear to me why this is necessary / a good idea?
> >
>
> Keeping PREPARE handling as close to regular COMMIT handling seems
> like a good idea, no?

But this code *means* something? Explain to me why it's a good idea to
advance, or don't do it.

Greetings,

Andres Freund
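For what it's worth, the helper being suggested here could be as simple
as the following sketch, callable from both the plain-commit path and
EndPrepare(); the function name is made up, and the body just factors
out the hunk quoted above:

/*
 * Hypothetical helper to fill in replication-origin info for a commit
 * or prepare record; returns whether a session origin is set.
 */
static bool
XactFillReplOrigin(XLogRecPtr *origin_lsn, TimestampTz *origin_timestamp)
{
    bool        replorigin;

    replorigin = (replorigin_session_origin != InvalidRepOriginId &&
                  replorigin_session_origin != DoNotReplicateId);

    if (replorigin)
    {
        Assert(replorigin_session_origin_lsn != InvalidXLogRecPtr);
        *origin_lsn = replorigin_session_origin_lsn;
        *origin_timestamp = replorigin_session_origin_timestamp;
    }
    else
    {
        *origin_lsn = InvalidXLogRecPtr;
        *origin_timestamp = 0;
    }

    return replorigin;
}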
Hi Andres,

>> > First off: This patch has way too many different types of changes as
>> > part of one huge commit. This needs to be split into several
>> > pieces. First the cleanups (e.g. the fields -> flag changes), then the
>> > individual infrastructure pieces (like the twophase.c changes, best
>> > split into several pieces as well, the locking stuff), then the main
>> > feature, then support for it in the output plugin. Each should have an
>> > individual explanation about why the change is necessary and not a bad
>> > idea.
>> >
>>
>> Ok, I will break this patch into multiple logical pieces and re-submit.
>
> Thanks.
>

Attached are 5 patches split up from the original patch that I had
submitted earlier.

ReorderBufferTXN_flags_cleanup_1.patch:

Cleanup of the ReorderBufferTXN bools and addition of some new flags
that the following patches will need.

Logical_lock_unlock_api_2.patch:

Streaming changes of uncommitted transactions and of prepared
transactions runs the risk of aborts (rollback prepared) happening
while we are decoding. It's not a problem for most transactions, but
some of the transactions which do catalog changes need to get a
consistent view of the metadata so that the decoding does not behave
in uncertain ways when such concurrent aborts occur. We came up with
the concept of a logical locking/unlocking API to safeguard access to
catalog tables. This patch contains the implementation of this
functionality.

2PC_gid_wal_and_2PC_origin_tracking_3.patch:

We now store the 2PC gid in the commit/abort records. This allows us
to send the proper gid to the downstream across restarts. We also want
to avoid receiving the prepared transaction AGAIN from the upstream,
and so use replorigin tracking across prepared transactions.

reorderbuffer_2PC_logic_4.patch:

Adds decoding logic to understand PREPARE related WAL records and the
relevant changes in the reorderbuffer logic to deal with 2PC. This
includes logic to handle concurrent rollbacks while we are going
through the change buffers belonging to a prepared or uncommitted
transaction.

pgoutput_plugin_support_2PC_5.patch:

Logical protocol changes to apply and send changes via the internal
pgoutput output plugin. Includes test cases and relevant documentation
changes.

Besides the above, you had feedback around the test_decoding plugin
and the use of sleep() etc. I will submit a follow-on patch for the
test_decoding plugin stuff soon.

More comments inline below.

>> >>     bool        only_local;
>> >> +   bool        twophase_decoding;
>> >> +   bool        twophase_decode_with_catalog_changes;
>> >> +   int         decode_delay;   /* seconds to sleep after every change record */
>> >
>> > This seems too big a crock to add just for testing. It'll also make the
>> > testing timing dependent...
>> >
>>
>> The idea *was* to make testing timing dependent. We wanted to simulate
>> the case when a rollback is issued by another backend while the
>> decoding is still ongoing. This allows that test case to be tested.
>
> What I mean is that this will be hell on the buildfarm because the
> different animals are differently fast.
>

Will handle this in the test_decoding plugin patch soon.
>
>> >> +/* Filter out unnecessary two-phase transactions */
>> >> +static bool
>> >> +pg_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>> >> +                 TransactionId xid, const char *gid)
>> >> +{
>> >> +   TestDecodingData *data = ctx->output_plugin_private;
>> >> +
>> >> +   /* treat all transactions as one-phase */
>> >> +   if (!data->twophase_decoding)
>> >> +       return true;
>> >> +
>> >> +   if (txn && txn_has_catalog_changes(txn) &&
>> >> +       !data->twophase_decode_with_catalog_changes)
>> >> +       return true;
>> >
>> > What? I'm INCREDIBLY doubtful this is a sane thing to expose to output
>> > plugins. As in, unless I hear a very very convincing reason I'm strongly
>> > opposed.
>> >
>>
>> These bools are specific to the test_decoding plugin.
>

Will handle in the test_decoding plugin patch soon.

> txn_has_catalog_changes() definitely isn't just exposed to
> test_decoding. I think you're making the output plugin interface
> massively more complicated in this patch and I think we need to push
> back on that.
>
>> Again, these are useful in testing decoding in various scenarios with
>> twophase decoding enabled/disabled. Testing decoding when catalog
>> changes are allowed/disallowed etc. Please take a look at
>> "contrib/test_decoding/sql/prepared.sql" for the various scenarios.
>
> I don't see how that addresses my concern in any sort of way.
>

Will handle in the test_decoding plugin patch soon.

>> >> +/*
>> >> + * Check if we should continue to decode this transaction.
>> >> + *
>> >> + * If it has aborted in the meanwhile, then there's no sense
>> >> + * in decoding and sending the rest of the changes, we might
>> >> + * as well ask the subscribers to abort immediately.
>> >> + *
>> >> + * This should be called if we are streaming a transaction
>> >> + * before it's committed or if we are decoding a 2PC
>> >> + * transaction. Otherwise we always decode committed
>> >> + * transactions
>> >> + *
>> >> + * Additional checks can be added here, as needed
>> >> + */
>> >> +static bool
>> >> +pg_filter_decode_txn(LogicalDecodingContext *ctx,
>> >> +                    ReorderBufferTXN *txn)
>> >> +{
>> >> +   /*
>> >> +    * Due to caching, repeated TransactionIdDidAbort calls
>> >> +    * shouldn't be that expensive
>> >> +    */
>> >> +   if (txn != NULL &&
>> >> +       TransactionIdIsValid(txn->xid) &&
>> >> +       TransactionIdDidAbort(txn->xid))
>> >> +       return true;
>> >> +
>> >> +   /* if txn is NULL, filter it out */
>> >
>> > Why can this be NULL?
>> >
>>
>> Depending on the parameters passed to the ReorderBufferTXNByXid()
>> function, the txn might be NULL in some cases, especially during
>> restarts.
>
> That a) isn't an explanation of why that's ok, and b) isn't a reason
> why this ever needs to be exposed to the output plugin.
>

Removing this pg_filter_decode_txn() function. You are right, there's
no need to expose this function to the output plugin and we can make
the decision entirely inside the ReorderBuffer code handling.

>> >> static bool
>> >> pg_decode_filter(LogicalDecodingContext *ctx,
>> >>                  RepOriginId origin_id)
>> >> @@ -409,8 +622,18 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>> >>     }
>> >>     data->xact_wrote_changes = true;
>> >>
>> >> +   if (!LogicalLockTransaction(txn))
>> >> +       return;
>> >
>> > It really really can't be right that this is exposed to output plugins.
>> >
>>
>> This was discussed in the other thread
>> (http://www.postgresql-archive.org/Logical-Decoding-and-HeapTupleSatisfiesVacuum-assumptions-td5998294i20.html).
>> Any catalog access in any plugin needs to interlock with concurrent
>> aborts. This is only a problem if the transaction is a prepared one or
>> a yet-uncommitted one. In the majority of cases, this function will do
>> nothing at all.
>
> That doesn't address at all that it's not ok that the output plugin
> needs to handle this. Doing this in output plugins, the majority of
> which are external projects, means that a) the work needs to be done
> many times, and b) we can't simply adjust the relevant code in a minor
> release, because every output plugin needs to be changed.
>

How do we know if an external project is going to access catalog data?
How do we ensure that the data they access is safe from concurrent
aborts if we are decoding uncommitted or prepared transactions? We are
providing a guideline here and recommending that they use these APIs
if they need to.

>> >
>> >> +   /* if decode_delay is specified, sleep with above lock held */
>> >> +   if (data->decode_delay > 0)
>> >> +   {
>> >> +       elog(LOG, "sleeping for %d seconds", data->decode_delay);
>> >> +       pg_usleep(data->decode_delay * 1000000L);
>> >> +   }
>> >
>> > Really not on board.
>> >
>>
>> Again, specific to the test_decoding plugin.
>
> Again, this is not a justification. People look at the code to write
> output plugins. Also see my above complaint about this going to be hell
> to get right on slow buildfarm members - we're going to crank up the
> sleep times to make it robust-ish.
>

Sure, as mentioned above, I will come up with a different way for the
test_decoding plugin later.

>> >> +   XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
>> >> +
>> >
>> > Can we perhaps merge a bit of the code with the plain commit path on
>> > this?
>> >
>>
>> Given that PREPARE / ROLLBACK PREPARED handling is totally separate
>> from the regular commit code paths, wouldn't it be a little difficult?
>
> Why? A helper function doing so ought to be doable.
>

Can you elaborate on what exactly you mean here?

>> >> @@ -1386,8 +1449,20 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
>> >>     /*
>> >>      * Validate the GID, and lock the GXACT to ensure that two backends do not
>> >>      * try to commit the same GID at once.
>> >> +    *
>> >> +    * During logical decoding, on the apply side, it's possible that a prepared
>> >> +    * transaction got aborted while decoding. In that case, we stop the
>> >> +    * decoding and abort the transaction immediately. However the ROLLBACK
>> >> +    * prepared processing still reaches the subscriber. In that case it's ok
>> >> +    * to have a missing gid
>> >>      */
>> >> -   gxact = LockGXact(gid, GetUserId());
>> >> +   gxact = LockGXact(gid, GetUserId(), missing_ok);
>> >> +   if (gxact == NULL)
>> >> +   {
>> >> +       Assert(missing_ok && !isCommit);
>> >> +       return;
>> >> +   }
>> >
>> > I'm very doubtful it is sane to handle this at such a low level.
>> >
>>
>> FinishPreparedTransaction() is called directly from ProcessUtility. If
>> not here, where else could we do this?
>
> I don't think this is something that ought to be handled at this layer
> at all. You should get an error in that case; the replay logic needs to
> handle that, not the low-level 2PC code.
>

Removed the above changes. The replay logic now checks if the GID
still exists in the ROLLBACK PREPARED codepath. If not, it returns
immediately. In the case of COMMIT PREPARED replay, the GID obviously
has to exist at the downstream.
>
>> >> @@ -2358,6 +2443,13 @@ PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn)
>> >>     Assert(TwoPhaseState->numPrepXacts < max_prepared_xacts);
>> >>     TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
>> >>
>> >> +   if (origin_id != InvalidRepOriginId)
>> >> +   {
>> >> +       /* recover apply progress */
>> >> +       replorigin_advance(origin_id, hdr->origin_lsn, end_lsn,
>> >> +                          false /* backward */ , false /* WAL */ );
>> >> +   }
>> >> +
>> >
>> > It's unclear to me why this is necessary / a good idea?
>> >
>>
>> Keeping PREPARE handling as close to regular COMMIT handling seems
>> like a good idea, no?
>
> But this code *means* something? Explain to me why it's a good idea to
> advance, or don't do it.
>

We want to do this to use it as protection against receiving the
prepared transaction again.

Other than the above:

*) Changed the flags and added the "RB" prefix to all flags and macros.

*) Added a few fields to the existing xl_xact_parsed_commit record and
avoided creating an entirely new xl_xact_parsed_prepare record.

Regards,
Nikhils
--
 Nikhil Sontakke                   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment
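The reworked replay logic described above presumably boils down to
something like this sketch (the LookupGXact-based hunk is quoted
verbatim in a later review on this thread; the function name and the
commit_data argument are taken from the patch and may differ in
detail):

/*
 * Sketch of the apply-side ROLLBACK PREPARED handling: tolerate a
 * missing GID, since the prepared transaction may already have been
 * aborted on the subscriber while it was being decoded.
 */
static void
apply_handle_rollback_prepared_txn(LogicalRepCommitData *commit_data)
{
    if (LookupGXact(commit_data->gid))
    {
        /* The prepared xact still exists locally: roll it back. */
        FinishPreparedTransaction(commit_data->gid, false /* isCommit */ );
    }

    /* Otherwise it was already aborted during decoding; nothing to do. */
}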
Hi,

On 2018-02-28 21:12:42 +0530, Nikhil Sontakke wrote:
> Attached are 5 patches split up from the original patch that I had
> submitted earlier.

In the future you should number them. Right now they appear to be out
of order in your email. I suggest using git format-patch, that does
all the necessary work for you.

Greetings,

Andres Freund
On 2 March 2018 at 08:53, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2018-02-28 21:12:42 +0530, Nikhil Sontakke wrote:
> > Attached are 5 patches split up from the original patch that I had
> > submitted earlier.
>
> In the future you should number them. Right now they appear to be out
> of order in your email. I suggest using git format-patch, that does
> all the necessary work for you.
Yep, especially git format-patch with a -v argument, so the whole patchset is visibly versioned and sorts in the correct order.
Hi Andres and Craig,

>> In the future you should number them. Right now they appear to be out
>> of order in your email. I suggest using git format-patch, that does
>> all the necessary work for you.
>>
> Yep, especially git format-patch with a -v argument, so the whole
> patchset is visibly versioned and sorts in the correct order.
>

I did try to use *_Number.patch to convey the sequence, but admittedly
it's pretty lame. I will re-submit with "git format-patch" soon.

Regards,
Nikhils
--
 Nikhil Sontakke                   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Hi Andres,

>
> I will re-submit with "git format-patch" soon.
>

PFA, patches in "git format-patch" format.

This patch set also includes changes in the test_decoding plugin,
along with an additional savepoint related test case that was pointed
out upthread.

Regards,
Nikhils
--
 Nikhil Sontakke                   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment
- 0006-Teach-test_decoding-plugin-to-work-with-2PC.patch
- 0005-pgoutput-output-plugin-support-for-logical-decoding-.patch
- 0004-Teach-ReorderBuffer-to-deal-with-2PC.patch
- 0003-Add-support-for-logging-GID-in-commit-abort-WAL-reco.patch
- 0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.patch
- 0001-Cleaning-up-and-addition-of-new-flags-in-ReorderBuff.patch
Hi Nikhil,

I've been looking at this patch over the past few days, so here are my
thoughts so far ...


decoding aborted transactions
=============================

First, let's talk about the handling of aborted transactions, which
was originally discussed in thread [1]. I'll try to summarize the
status and explain my understanding of the choices first.

[1] https://www.postgresql.org/message-id/CAMGcDxeHBaXCz12LdfEmyJdghbms_dtC26pRZXKWRV2dazO-UQ%40mail.gmail.com

There were multiple ideas about how to deal with aborted transactions,
but we eventually found various issues in all of them except for two -
interlocking decoding and aborts, and modifying the rules so that
aborted transactions are considered to be running while being decoded.

This patch uses the first approach, i.e. the interlock. It has a
couple of disadvantages:

a) The abort may need to wait for decoding workers for a while. This
is annoying, but aborts are generally rare. And for systems with many
concurrent short transactions (where even tiny delays would matter)
it's unlikely the decoding workers will have already started decoding
the aborted transaction.

b) Output plugins need to call lock/unlock explicitly from the
callbacks.

Technically, we could wrap the whole callback in a lock/unlock, but
that would needlessly increase the amount of time spent holding the
lock, making the previous point much worse. As the callbacks are
expected to do network I/O etc., the amount of time could be quite
significant.

The main advantage is of course that it's likely much less invasive
than tweaking which transactions are seen as running. So I think
taking this approach is a sensible choice at this point.

Now, about the interlock implementation - I see you've reused the
"lock group" concept from parallel query. That may make sense, but
unfortunately there's almost no documentation explaining how it works,
what the "protocol" is, etc. There is fairly extensive documentation
for "lock groups" in src/backend/storage/lmgr/README, but while the
"decoding group" code is inspired by it, the code is actually very
different. Compare for example BecomeLockGroupLeader and
BecomeDecodeGroupLeader, and you'll see what I mean.

So I think the first thing we need to do is add proper documentation
(possibly into the same README), explaining how the decode groups
work, how the decodeAbortPending works, etc.

Also, some function names seem a bit misleading. For example in the
lock group "BecomeLockGroupLeader" means "make the current process a
group leader", but apparently "BecomeDecodeGroupLeader" means "find
the process handling XID and make it a leader". But perhaps I got that
entirely wrong.

Of course, LogicalLockTransaction and LogicalUnlockTransaction should
have proper comments, which is particularly important as they're part
of the public API.

BTW, do we need to do any of this with (wal_level < logical)? I don't
see any quick bail-out in any of the functions in this case, but it
seems like a fairly obvious optimization.

Similarly, can't the logical workers indicate that they need to decode
2PC transactions (or in-progress transactions in general) in some way?
If we knew there are no such workers, that would also allow ignoring
the interlock, no?

Another thing is that I'm yet to see any performance tests. While we
do believe it will work fine, that is based on a number of
assumptions:

a) aborts are rare

b) it has no measurable impact on commit

I think we need to verify this by actually measuring the impact on a
bunch of workloads.
In particular, I think we need to test:

i) impact on commit-only workloads
ii) impact on a worst-case scenario

I'm not sure what (ii) would look like, considering the patch only deals with decoding 2PC transactions, which have significant overhead on their own - so I'm afraid the impact on "regular transactions" might be much worse, once we add support for that.

decoding 2PC transactions
=========================

Now, the main topic of the patch. Overall the changes make sense, I think - it modifies about the same places I touched in the streaming patch, in similar ways.

The following comments are mostly in random order:

1) test_decoding.c
------------------

The "filter" functions do not follow the naming convention, so I suggest renaming them like this:

- pg_filter_decode_txn -> pg_decode_filter_txn
- pg_filter_prepare -> pg_decode_filter_prepare_txn

or something like that. Also, looking at those functions (and those same callbacks in the pgoutput plugin) I wonder if we really need to make them part of the output plugin API.

I mean, AFAICS their only purpose is to filter 2PC transactions, but I don't quite see why implementing those checks should be the responsibility of the plugin? I suppose it was done to make test_decoding customizable (i.e. allow enabling/disabling of decoding 2PC as needed), right?

In that case I suggest making it configurable by plugin-level flags (I see LogicalDecodingContext already has an enable_twophase), and moving the checks to a function that is not part of the plugin API. Of course, in that case the flag needs to be customizable from plugin options, not just "Does the plugin have all the callbacks?".

The "twophase-decoding" and "twophase-decode-with-catalog-changes" seem a bit inconsistently named too (why decode vs. decoding?).

2) regression tests
-------------------

I really dislike the use of \set to run the same query repeatedly. It makes analysis of regression failures even more tedious than it already is. I'd just copy the query to all the places.

3) worker.c
-----------

The comment in apply_handle_rollback_prepared_txn says this:

    /*
     * During logical decoding, on the apply side, it's possible that a
     * prepared transaction got aborted while decoding. In that case, we
     * stop the decoding and abort the transaction immediately. However
     * the ROLLBACK prepared processing still reaches the subscriber. In
     * that case it's ok to have a missing gid
     */
    if (LookupGXact(commit_data->gid)) { ... }

But is it safe to assume it never happens due to an error? In other words, is there a way to decide that the GID really aborted? Or, why should the provider send the rollback at all - surely it could know if the transaction/GID was sent to the subscriber or not, right?

4) twophase.c
-------------

I wonder why the patch modifies the TWOPHASE_MAGIC at all - if it's meant to identify 2PC files, then why not keep the value. And if we really need to modify it, why not use another random number? By only adding 1 to the current one, it makes it look like a random bit flip.

5) decode.c
-----------

The changes in DecodeCommit need proper comments.

In DecodeAbort, the "if" includes this condition:

    ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid)

which essentially means ROLLBACK PREPARED is translated into "is the transaction prepared?". Shouldn't the code look at xl_xact_parsed_abort instead, and make the ReorderBufferTxnIsPrepared an Assert?
6) logical.c
------------

I see StartupDecodingContext does this:

    twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
        (ctx->callbacks.commit_prepared_cb != NULL) +
        (ctx->callbacks.abort_prepared_cb != NULL);

It seems a bit strange to do arithmetic on bools, I guess. In any case, I think this should be an ERROR and not a WARNING:

    if (twophase_callbacks != 3 && twophase_callbacks != 0)
        ereport(WARNING,
                (errmsg("Output plugin registered only %d twophase callbacks. "
                        "Twophase transactions will be decoded at commit time.",
                        twophase_callbacks)));

A plugin that implements only a subset of the callbacks seems outright broken, so let's just fail.

7) proto.c / worker.c
---------------------

Until now, the 'action' (essentially the first byte of each message) clearly identified what the message does. So 'C' -> commit, 'I' -> insert, 'D' -> delete etc. This also means the "handle" methods were inherently simple, because each handled exactly one particular action and nothing else.

You've expanded the protocol in a way that suddenly 'C' means either COMMIT or ROLLBACK, and 'P' means PREPARE, ROLLBACK PREPARED or COMMIT PREPARED. I don't think that's how the protocol should be extended - if anything, it's damn confusing and unlike the existing code. You should define a new action for each, and keep the handlers in worker.c simple.

Also, this probably implies a LOGICALREP_PROTO_VERSION_NUM increase.

8) reorderbuffer.h/c
--------------------

Similarly, I wonder why you replaced the ReorderBuffer boolean flags (is_known_as_subxact, has_catalog_changes) with a bitmask? I find it way more difficult to read (which is subjective, of course) but it also makes IDEs dumber (suddenly they can't offer you field names).

Surely it wasn't done to save space, because by using an "int" you've saved just 4B (there are 8 flags right now, so it'd need 8 bytes with plain bool flags) on a structure that is already ~200B. And you then added gid[GIDSIZE] to it, making it 400B for *all* transactions and subtransactions (not just 2PC). Not to mention that the GID is usually much shorter than the 200B.

So I suggest using just a simple (char *) pointer for the GID, keeping it NULL for most transactions, and switching back to plain bool flags.

regards

-- 
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
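For reference, the locking pattern output plugins would have to follow looks roughly like this (a sketch assuming the patch's LogicalLockTransaction/LogicalUnlockTransaction API; the callback body itself is hypothetical):

    /* inside an output plugin's change callback */
    static void
    my_change_cb(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                 Relation relation, ReorderBufferChange *change)
    {
        /* block a concurrent ROLLBACK PREPARED while we may read catalogs */
        if (!LogicalLockTransaction(txn))
            return;             /* xact aborted concurrently, bail out */

        /* ... catalog lookups and writing the change to the output ... */

        LogicalUnlockTransaction(txn);
    }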
On 5 March 2018 at 16:37, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
>>
>> I will re-submit with "git format-patch" soon.
>>
> PFA, patches in "format-patch" format.
>
> This patch set also includes changes in the test_decoding plugin along with an additional savepoint related test case that was pointed out on this thread, upstream.

Reviewing 0001-Cleaning-up-and-addition-of-new-flags-in-ReorderBuff.patch

The change from is_known_as_subxact to rbtxn_is_subxact loses some meaning, since rbtxn entries with this flag set false might still be subxacts, we just don't know yet.

rbtxn_is_serialized refers to RBTXN_SERIALIZED, so the flag name should be RBTXN_IS_SERIALIZED so it matches.

Otherwise looks OK to commit.

Reviewing 0003-Add-support-for-logging-GID-in-commit-abort-WAL-reco

Looks fine, reworked patch attached
* added changes to xact.h from patch 4 so that this is a whole, committable patch
* added comments to make abort and commit structs look same

Attached patch is proposed for a separate, early commit as part of this

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
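To make the naming point concrete, the matching flag/macro pairs would look something like this (a sketch; the txn_flags field is from the patch, the values are illustrative):

    #define RBTXN_IS_SUBXACT     0x0001
    #define RBTXN_IS_SERIALIZED  0x0002

    /* each accessor macro named after the flag it tests */
    #define rbtxn_is_subxact(txn) \
        (((txn)->txn_flags & RBTXN_IS_SUBXACT) != 0)
    #define rbtxn_is_serialized(txn) \
        (((txn)->txn_flags & RBTXN_IS_SERIALIZED) != 0)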
On 23 March 2018 at 15:26, Simon Riggs <simon@2ndquadrant.com> wrote: > Reviewing 0003-Add-support-for-logging-GID-in-commit-abort-WAL-reco > > Looks fine, reworked patch attached > * added changes to xact.h from patch 4 so that this is a whole, > committable patch > * added comments to make abort and commit structs look same > > Attached patch is proposed for a separate, early commit as part of this Looking to commit "logging GID" patch today, if no further objections. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-03-27 10:19:37 +0100, Simon Riggs wrote: > On 23 March 2018 at 15:26, Simon Riggs <simon@2ndquadrant.com> wrote: > > > Reviewing 0003-Add-support-for-logging-GID-in-commit-abort-WAL-reco > > > > Looks fine, reworked patch attached > > * added changes to xact.h from patch 4 so that this is a whole, > > committable patch > > * added comments to make abort and commit structs look same > > > > Attached patch is proposed for a separate, early commit as part of this > > Looking to commit "logging GID" patch today, if no further objections. None here. Greetings, Andres Freund
Hi Tomas,

> Now, about the interlock implementation - I see you've reused the "lock group" concept from parallel query. That may make sense, but unfortunately there's about no documentation explaining how it works, what the "protocol" is, etc. There is fairly extensive documentation for "lock groups" in src/backend/storage/lmgr/README, but while the "decoding group" code is inspired by it, the code is actually very different. Compare for example BecomeLockGroupLeader and BecomeDecodeGroupLeader, and you'll see what I mean.
>
> So I think the first thing we need to do is add proper documentation (possibly into the same README), explaining how the decode groups work, how decodeAbortPending works, etc.

I have added details about this in src/backend/storage/lmgr/README as suggested by you.

> BTW, do we need to do any of this with (wal_level < logical)? I don't see any quick bail-out in any of the functions in this case, but it seems like a fairly obvious optimization.

The calls to the LogicalLockTransaction/LogicalUnLockTransaction APIs will be from inside plugins or the reorderbuffer code paths. Those will get invoked only in the wal_level logical case, hence I did not add further checks.

> Similarly, can't the logical workers indicate that they need to decode 2PC transactions (or in-progress transactions in general) in some way? If we knew there are no such workers, that would also allow ignoring the interlock, no?

These APIs check if the transaction is already committed and cache that information for further calls, so for regular transactions this becomes a no-op.

> decoding 2PC transactions
> =========================
>
> Now, the main topic of the patch. Overall the changes make sense, I think - it modifies about the same places I touched in the streaming patch, in similar ways.
>
> The following comments are mostly in random order:
>
> 1) test_decoding.c
> ------------------
>
> The "filter" functions do not follow the naming convention, so I suggest renaming them like this:
>
> - pg_filter_decode_txn -> pg_decode_filter_txn
> - pg_filter_prepare -> pg_decode_filter_prepare_txn
>
> or something like that. Also, looking at those functions (and those same callbacks in the pgoutput plugin) I wonder if we really need to make them part of the output plugin API.
>
> I mean, AFAICS their only purpose is to filter 2PC transactions, but I don't quite see why implementing those checks should be the responsibility of the plugin? I suppose it was done to make test_decoding customizable (i.e. allow enabling/disabling of decoding 2PC as needed), right?
>
> In that case I suggest making it configurable by plugin-level flags (I see LogicalDecodingContext already has an enable_twophase), and moving the checks to a function that is not part of the plugin API. Of course, in that case the flag needs to be customizable from plugin options, not just "Does the plugin have all the callbacks?".

The idea behind exposing the API is to allow the plugins to have selective control over specific 2PC actions. They might want to decode certain 2PC transactions but not others. By providing this callback, they can do that selectively.

> The "twophase-decoding" and "twophase-decode-with-catalog-changes" seem a bit inconsistently named too (why decode vs. decoding?).

This has been removed in the latest patches altogether. Maybe you were referring to an older patch.
> 2) regression tests
> -------------------
>
> I really dislike the use of \set to run the same query repeatedly. It makes analysis of regression failures even more tedious than it already is. I'd just copy the query to all the places.

They are long-winded queries and IMO they made the test file look too cluttered and verbose.

> 3) worker.c
> -----------
>
> The comment in apply_handle_rollback_prepared_txn says this:
>
>     /*
>      * During logical decoding, on the apply side, it's possible that a
>      * prepared transaction got aborted while decoding. In that case, we
>      * stop the decoding and abort the transaction immediately. However
>      * the ROLLBACK prepared processing still reaches the subscriber. In
>      * that case it's ok to have a missing gid
>      */
>     if (LookupGXact(commit_data->gid)) { ... }
>
> But is it safe to assume it never happens due to an error? In other words, is there a way to decide that the GID really aborted? Or, why should the provider send the rollback at all - surely it could know if the transaction/GID was sent to the subscriber or not, right?

Since we decode in commit WAL order, when we reach the ROLLBACK PREPARED WAL record, we cannot be sure that we did in fact abort the decoding midway because of this concurrent rollback. It's possible that this rollback comes much, much later as well, when all decoding backends have successfully prepared it on the subscribers already.

> 4) twophase.c
> -------------
>
> I wonder why the patch modifies the TWOPHASE_MAGIC at all - if it's meant to identify 2PC files, then why not keep the value. And if we really need to modify it, why not use another random number? By only adding 1 to the current one, it makes it look like a random bit flip.

We could retain the existing magic here.

> 5) decode.c
> -----------
>
> The changes in DecodeCommit need proper comments.
>
> In DecodeAbort, the "if" includes this condition:
>
>     ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid)
>
> which essentially means ROLLBACK PREPARED is translated into "is the transaction prepared?". Shouldn't the code look at xl_xact_parsed_abort instead, and make the ReorderBufferTxnIsPrepared an Assert?

This again goes back to the earlier callback, in which we want pg_decode_filter_prepare_txn to selectively decide to filter out or decode some of the 2PC transactions. If we allow that callback, then we need to consult ReorderBufferTxnIsPrepared to get the same response for these 2PC transactions.

> 6) logical.c
> ------------
>
> I see StartupDecodingContext does this:
>
>     twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
>         (ctx->callbacks.commit_prepared_cb != NULL) +
>         (ctx->callbacks.abort_prepared_cb != NULL);
>
> It seems a bit strange to do arithmetic on bools, I guess. In any case, I think this should be an ERROR and not a WARNING:
>
>     if (twophase_callbacks != 3 && twophase_callbacks != 0)
>         ereport(WARNING,
>                 (errmsg("Output plugin registered only %d twophase callbacks. "
>                         "Twophase transactions will be decoded at commit time.",
>                         twophase_callbacks)));
>
> A plugin that implements only a subset of the callbacks seems outright broken, so let's just fail.

Ok, done.

> 7) proto.c / worker.c
> ---------------------
>
> Until now, the 'action' (essentially the first byte of each message) clearly identified what the message does. So 'C' -> commit, 'I' -> insert, 'D' -> delete etc.
> This also means the "handle" methods were inherently simple, because each handled exactly one particular action and nothing else.
>
> You've expanded the protocol in a way that suddenly 'C' means either COMMIT or ROLLBACK, and 'P' means PREPARE, ROLLBACK PREPARED or COMMIT PREPARED. I don't think that's how the protocol should be extended - if anything, it's damn confusing and unlike the existing code. You should define a new action for each, and keep the handlers in worker.c simple.

I thought this grouped regular commit and 2PC transactions properly. I can look at this again if this style is not favored.

> Also, this probably implies a LOGICALREP_PROTO_VERSION_NUM increase.

Ok, increased it to 2.

PFA, latest patch set. The ReorderBufferCommit() handling has been further simplified now, without worrying too much about optimizing for abort handling at various steps.

This also contains an additional/optional 7th patch which has a test case solely to demonstrate the concurrent abort/logical decoding interlocking. It introduces a delay via a sleep while holding the logical transaction lock. This additional patch might not be considered for commit, as the delay-based approach is prone to failures on slower machines.

Simon, 0003-Add-GID-and-replica-origin-to-two-phase-commit-abort.patch is the exact patch that you had posted for an earlier commit.

Regards,
Nikhils
-- 
Nikhil Sontakke
http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment
- 0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch
- 0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.patch
- 0003-Add-GID-and-replica-origin-to-two-phase-commit-abort.patch
- 0004-Support-decoding-of-two-phase-transactions-at-PREPAR.patch
- 0005-pgoutput-output-plugin-support-for-logical-decoding-.patch
- 0006-Teach-test_decoding-plugin-to-work-with-2PC.patch
- 0007-Additional-optional-test-case-to-demonstrate-decoding-rollbac.patch
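As an aside, the separate per-message protocol actions asked for upthread might look like this (a sketch; the letters are hypothetical placeholders, not a committed protocol):

    /* one distinct action byte per message type, one handler each */
    #define LOGICAL_REP_MSG_COMMIT            'C'
    #define LOGICAL_REP_MSG_PREPARE           'P'
    #define LOGICAL_REP_MSG_COMMIT_PREPARED   'K'
    #define LOGICAL_REP_MSG_ROLLBACK_PREPARED 'r'

with each action dispatched to its own apply_handle_*() function in worker.c, keeping the handlers simple.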
On 28 March 2018 at 16:28, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote: > Simon, 0003-Add-GID-and-replica-origin-to-two-phase-commit-abort.patch > is the exact patch that you had posted for an earlier commit. 0003 Pushed -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

I've been reviewing the last patch version, focusing mostly on the decoding group part. Let me respond to several points first, then new review bits.

On 03/28/2018 05:28 PM, Nikhil Sontakke wrote:
> Hi Tomas,
>
>> Now, about the interlock implementation - I see you've reused the "lock group" concept from parallel query. That may make sense, but unfortunately there's about no documentation explaining how it works, what the "protocol" is, etc. There is fairly extensive documentation for "lock groups" in src/backend/storage/lmgr/README, but while the "decoding group" code is inspired by it, the code is actually very different. Compare for example BecomeLockGroupLeader and BecomeDecodeGroupLeader, and you'll see what I mean.
>>
>> So I think the first thing we need to do is add proper documentation (possibly into the same README), explaining how the decode groups work, how decodeAbortPending works, etc.
>
> I have added details about this in src/backend/storage/lmgr/README as suggested by you.

Thanks. I think the README is a good start, but I think we also need to improve the comments, which are usually more detailed than the README. For example, it's not quite acceptable that LogicalLockTransaction and LogicalUnlockTransaction have about no comments, especially when they're meant to be a public API for decoding plugins.

>> BTW, do we need to do any of this with (wal_level < logical)? I don't see any quick bail-out in any of the functions in this case, but it seems like a fairly obvious optimization.
>
> The calls to the LogicalLockTransaction/LogicalUnLockTransaction APIs will be from inside plugins or the reorderbuffer code paths. Those will get invoked only in the wal_level logical case, hence I did not add further checks.

Oh, right.

>> Similarly, can't the logical workers indicate that they need to decode 2PC transactions (or in-progress transactions in general) in some way? If we knew there are no such workers, that would also allow ignoring the interlock, no?
>
> These APIs check if the transaction is already committed and cache that information for further calls, so for regular transactions this becomes a no-op.

I see. So when the output plugin never calls LogicalLockTransaction on an in-progress transaction (e.g. 2PC after PREPARE), it never actually initializes the decoding group. Works for me.

>> 2) regression tests
>> -------------------
>>
>> I really dislike the use of \set to run the same query repeatedly. It makes analysis of regression failures even more tedious than it already is. I'd just copy the query to all the places.
>
> They are long-winded queries and IMO they made the test file look too cluttered and verbose.

Well, I don't think that's a major problem, and it certainly makes it more difficult to investigate regression failures.

>> 3) worker.c
>> -----------
>>
>> The comment in apply_handle_rollback_prepared_txn says this:
>>
>>     /*
>>      * During logical decoding, on the apply side, it's possible that a
>>      * prepared transaction got aborted while decoding. In that case, we
>>      * stop the decoding and abort the transaction immediately. However
>>      * the ROLLBACK prepared processing still reaches the subscriber. In
>>      * that case it's ok to have a missing gid
>>      */
>>     if (LookupGXact(commit_data->gid)) { ... }
>>
>> But is it safe to assume it never happens due to an error? In other words, is there a way to decide that the GID really aborted?
>> Or, why should the provider send the rollback at all - surely it could know if the transaction/GID was sent to the subscriber or not, right?
>
> Since we decode in commit WAL order, when we reach the ROLLBACK PREPARED WAL record, we cannot be sure that we did in fact abort the decoding midway because of this concurrent rollback. It's possible that this rollback comes much, much later as well, when all decoding backends have successfully prepared it on the subscribers already.

Ah, OK. So when the transaction gets aborted (by ROLLBACK PREPARED) concurrently with the decoding, we abort the apply transaction and discard the ReorderBufferTXN. Which means that later, when we decode the abort, we don't know whether the decoding reached abort or prepare, and so we have to send the ROLLBACK PREPARED to the subscriber too.

For a moment I was thinking we might simply remember the TXN outcome in the reorder buffer, but obviously that does not work - the decoding might restart in between, and as you say the distance (in terms of WAL) may be quite significant.

>> 7) proto.c / worker.c
>> ---------------------
>>
>> Until now, the 'action' (essentially the first byte of each message) clearly identified what the message does. So 'C' -> commit, 'I' -> insert, 'D' -> delete etc. This also means the "handle" methods were inherently simple, because each handled exactly one particular action and nothing else.
>>
>> You've expanded the protocol in a way that suddenly 'C' means either COMMIT or ROLLBACK, and 'P' means PREPARE, ROLLBACK PREPARED or COMMIT PREPARED. I don't think that's how the protocol should be extended - if anything, it's damn confusing and unlike the existing code. You should define a new action for each, and keep the handlers in worker.c simple.
>
> I thought this grouped regular commit and 2PC transactions properly. I can look at this again if this style is not favored.

Hmmm, it's not how I'd do it, but perhaps someone who originally designed the protocol should review this bit.

Now, the new bits ... attached is a .diff with a couple of changes and comments on various places.

1) LogicalLockTransaction

- This function is part of a public API, yet it has no comment. That needs fixing - it has to be clear how to use it. The .diff suggests a comment, but it may need improvements.

- As I mentioned in the previous review, BecomeDecodeGroupLeader is a misleading name. It suggests the caller becomes a leader, while in fact it looks up the PROC running the XID and makes it a leader. This is obviously due to copying the code from lock groups, where the caller actually becomes the leader. It's incorrect here. I suggest something like LookupDecodeGroupLeader() or something.

- In the "if (MyProc->decodeGroupLeader == NULL)" block there are two blocks rechecking the transaction status:

    if (proc == NULL)
    { ... recheck ... }

    if (!BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
    { ... recheck ...}

I suggest joining them into a single block.

- This Assert() is either bogus and there can indeed be cases with (MyProc->decodeGroupLeader==NULL), or the "if" is unnecessary:

    Assert(MyProc->decodeGroupLeader);

    if (MyProc->decodeGroupLeader) { ... }

- I'm wondering why we're maintaining decodeAbortPending flags both for the leader and all the members. ISTM it'd be perfectly fine to only check the leader, particularly because RemoveDecodeGroupMemberLocked removes the members from the decoding group.
So that seems unnecessary, and we can remove the

    if (MyProc->decodeAbortPending)
    { ... }

- LogicalUnlockTransaction needs comments too.

2) BecomeDecodeGroupLeader

- Wrong name (already mentioned above).

- It can bail out when (!proc), which will simplify the code a bit.

- Why does it check the PID of the process at all? Seems unnecessary, considering we're already checking the XID.

- Can a proc executing an XID have a different leader? I don't think so, so I'd make that an Assert().

    Assert(!proc || (proc->decodeGroupLeader == proc));

And it'll allow simplification of some of the conditions.

- We're only dealing with prepared transactions now, so I'd just drop the is_prepared flag - it'll make the code a bit simpler, and we can add it later in the patch adding decoding of regular in-progress transactions. We can't test the (!is_prepared) case anyway.

- Why are we making the leader also a member of the group? Seems rather unnecessary, and it complicates the abort handling, because we need to skip the leader when deciding to wait.

3) LogicalDecodeRemoveTransaction

- It's not clear to me what happens when a decoding backend gets killed between LogicalLockTransaction/LogicalUnlockTransaction. Doesn't that mean LogicalDecodeRemoveTransaction will get stuck, because the proc is still in the decoding group?

- The loop now tweaks decodeAbortPending of the members, but I don't think that's necessary either - LogicalUnlockTransaction can check the leader flag just as easily.

4) a bunch of comment / docs improvements, ...

I'm suggesting rewording a couple of comments. I've also added a couple of missing comments - e.g. to LogicalLockTransaction and the lock group methods in general.

Also, a couple more questions and suggestions in XXX comments.

regards

-- 
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
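The single recheck block suggested above might look roughly like this (a sketch inside LogicalLockTransaction; the function names follow the patch, the bail-out logic is illustrative):

    if (MyProc->decodeGroupLeader == NULL)
    {
        PGPROC *proc = BecomeDecodeGroupLeader(txn->xid);

        if (proc == NULL ||
            !BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
        {
            /* leader is gone, recheck the transaction status once */
            if (TransactionIdDidAbort(txn->xid))
                return false;   /* aborted concurrently, caller bails out */
        }
    }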
On 2018-03-29 23:52:18 +0200, Tomas Vondra wrote:
>> I have added details about this in src/backend/storage/lmgr/README as suggested by you.
>
> Thanks. I think the README is a good start, but I think we also need to improve the comments, which are usually more detailed than the README. For example, it's not quite acceptable that LogicalLockTransaction and LogicalUnlockTransaction have about no comments, especially when they're meant to be a public API for decoding plugins.

FWIW, for me that's ground to not accept the feature. Burdening output plugins with this will make their development painful (because they'll have to adapt regularly) and correctness doubtful (there's nothing checking for the lock being skipped). Another way needs to be found.

- Andres
On 03/29/2018 11:58 PM, Andres Freund wrote:
> On 2018-03-29 23:52:18 +0200, Tomas Vondra wrote:
>>> I have added details about this in src/backend/storage/lmgr/README as suggested by you.
>>
>> Thanks. I think the README is a good start, but I think we also need to improve the comments, which are usually more detailed than the README. For example, it's not quite acceptable that LogicalLockTransaction and LogicalUnlockTransaction have about no comments, especially when they're meant to be a public API for decoding plugins.
>
> FWIW, for me that's ground to not accept the feature. Burdening output plugins with this will make their development painful (because they'll have to adapt regularly) and correctness doubtful (there's nothing checking for the lock being skipped). Another way needs to be found.

The lack of docs/comments, or the fact that the decoding plugins would need to do some lock/unlock operation?

I agree with the former, of course - docs are a must. I disagree with the latter, though - there have been about no proposals for how to do it without the locking. If there are, I'd like to hear about it.

FWIW, plugins that don't want to decode in-progress transactions don't need to do anything, obviously.

regards

-- 
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2018-03-30 00:23:00 +0200, Tomas Vondra wrote:
> On 03/29/2018 11:58 PM, Andres Freund wrote:
>> FWIW, for me that's ground to not accept the feature. Burdening output plugins with this will make their development painful (because they'll have to adapt regularly) and correctness doubtful (there's nothing checking for the lock being skipped). Another way needs to be found.
>
> The lack of docs/comments, or the fact that the decoding plugins would need to do some lock/unlock operation?

The latter.

> I agree with the former, of course - docs are a must. I disagree with the latter, though - there have been about no proposals for how to do it without the locking. If there are, I'd like to hear about it.

I don't care. Either another solution needs to be found, or the locking needs to be automatically performed when necessary.

Greetings,

Andres Freund
On 29/03/18 23:58, Andres Freund wrote:
> On 2018-03-29 23:52:18 +0200, Tomas Vondra wrote:
>>> I have added details about this in src/backend/storage/lmgr/README as suggested by you.
>>
>> Thanks. I think the README is a good start, but I think we also need to improve the comments, which are usually more detailed than the README. For example, it's not quite acceptable that LogicalLockTransaction and LogicalUnlockTransaction have about no comments, especially when they're meant to be a public API for decoding plugins.
>
> FWIW, for me that's ground to not accept the feature. Burdening output plugins with this will make their development painful (because they'll have to adapt regularly) and correctness doubtful (there's nothing checking for the lock being skipped). Another way needs to be found.

I have to agree with Andres here. It's also visible in the latter patches. The pgoutput patch forgets to call these new APIs completely. The test_decoding plugin calls them, but it does so even when it's processing changes for committed transactions. I think that should be avoided, as it means potentially doing an SLRU lookup for every change. So doing it right is indeed not easy.

I was wondering how to hide this. The best idea I had so far would be to put it in heap_beginscan (and index_beginscan, given that catalog scans use it as well) behind some condition. That would also improve performance because locking would not need to happen for syscache hits. The problem is however how to inform heap_beginscan about the fact that we are in 2PC decoding. We definitely don't want to change all the scan APIs for this. I wonder if we could add some kind of property to Snapshot which would indicate this fact - logical decoding is using its own snapshots, so it could inject the information about being inside the 2PC decoding.

-- 
Petr Jelinek
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 30/03/18 00:30, Petr Jelinek wrote:
> On 29/03/18 23:58, Andres Freund wrote:
>> On 2018-03-29 23:52:18 +0200, Tomas Vondra wrote:
>>>> I have added details about this in src/backend/storage/lmgr/README as suggested by you.
>>>
>>> Thanks. I think the README is a good start, but I think we also need to improve the comments, which are usually more detailed than the README. For example, it's not quite acceptable that LogicalLockTransaction and LogicalUnlockTransaction have about no comments, especially when they're meant to be a public API for decoding plugins.
>>
>> FWIW, for me that's ground to not accept the feature. Burdening output plugins with this will make their development painful (because they'll have to adapt regularly) and correctness doubtful (there's nothing checking for the lock being skipped). Another way needs to be found.
>
> I have to agree with Andres here. It's also visible in the latter patches. The pgoutput patch forgets to call these new APIs completely. The test_decoding plugin calls them, but it does so even when it's processing changes for committed transactions. I think that should be avoided, as it means potentially doing an SLRU lookup for every change. So doing it right is indeed not easy.

Ah, turns out it actually does not need an SLRU lookup in this case (I missed the reorder buffer call), so I take that part back.

-- 
Petr Jelinek
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Petr, Andres and Tomas,

>>> Thanks. I think the README is a good start, but I think we also need to improve the comments, which are usually more detailed than the README. For example, it's not quite acceptable that LogicalLockTransaction and LogicalUnlockTransaction have about no comments, especially when they're meant to be a public API for decoding plugins.

Tomas, thanks for providing the patch based on your review comments. I will include the documentation that you have provided in that patch for the APIs. Will also look at your decode group locking related comments and submit a fresh patch soon.

>> FWIW, for me that's ground to not accept the feature. Burdening output plugins with this will make their development painful (because they'll have to adapt regularly) and correctness doubtful (there's nothing checking for the lock being skipped). Another way needs to be found.
>
> I have to agree with Andres here.

Ok. Let's have another go at alleviating this issue then.

> I was wondering how to hide this. The best idea I had so far would be to put it in heap_beginscan (and index_beginscan, given that catalog scans use it as well) behind some condition. That would also improve performance because locking would not need to happen for syscache hits. The problem is however how to inform heap_beginscan about the fact that we are in 2PC decoding. We definitely don't want to change all the scan APIs for this. I wonder if we could add some kind of property to Snapshot which would indicate this fact - logical decoding is using its own snapshots, so it could inject the information about being inside the 2PC decoding.

The idea of adding that info in the Snapshot itself is interesting. We could introduce a logicalxid field in SnapshotData to point to the XID that the decoding backend is interested in. This could be added only for the 2PC case. Support in the future for in-progress transactions could use this field as well. If it's a valid XID, we could call LogicalLockTransaction/LogicalUnlockTransaction on that XID from heap_beginscan/heap_endscan respectively. I can also look at what other *_beginscan APIs would need this as well.

Regards,
Nikhils
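The SnapshotData change floated here could be sketched like this (the logicalxid field is the one proposed above and is hypothetical; an XID-based variant of the locking call would also be needed):

    typedef struct SnapshotData
    {
        /* ... existing fields unchanged ... */

        /*
         * XID being decoded when this snapshot is used from 2PC decoding,
         * or InvalidTransactionId otherwise.
         */
        TransactionId logicalxid;
    } SnapshotData;

    /* then, in heap_beginscan (and the other *_beginscan variants): */
    if (TransactionIdIsValid(snapshot->logicalxid))
        LogicalLockTransactionId(snapshot->logicalxid);   /* hypothetical */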
On 30/03/18 09:56, Nikhil Sontakke wrote:
>> I was wondering how to hide this. The best idea I had so far would be to put it in heap_beginscan (and index_beginscan, given that catalog scans use it as well) behind some condition. That would also improve performance because locking would not need to happen for syscache hits. The problem is however how to inform heap_beginscan about the fact that we are in 2PC decoding. We definitely don't want to change all the scan APIs for this. I wonder if we could add some kind of property to Snapshot which would indicate this fact - logical decoding is using its own snapshots, so it could inject the information about being inside the 2PC decoding.
>
> The idea of adding that info in the Snapshot itself is interesting. We could introduce a logicalxid field in SnapshotData to point to the XID that the decoding backend is interested in. This could be added only for the 2PC case. Support in the future for in-progress transactions could use this field as well. If it's a valid XID, we could call LogicalLockTransaction/LogicalUnlockTransaction on that XID from heap_beginscan/heap_endscan respectively. I can also look at what other *_beginscan APIs would need this as well.

So I have spent some significant time today thinking about this (the issue in general, not this specific idea). And I think this proposal does not work either. The problem is that we fundamentally want two things, not one. It's true we want to block ABORT from finishing while we are reading catalogs, but the other important part is that we want to bail gracefully when ABORT has happened for the transaction being decoded.

In other words, if we do the locking transparently somewhere in the scan or catalog read or similar, there is no way to let the plugin know that it should bail. So the locking code that's called from several layers deep would have only one option: to ERROR. I don't think we want to throw ERRORs when a transaction which is being decoded has been aborted, as that disrupts the replication.

I think that we basically only have two options here that can satisfy both blocking ABORT and bailing gracefully in case ABORT has happened. Either the plugin has full control over locking (as in the patch), so that it can bail when the locking function reports that the transaction has aborted. Or we do the locking around the plugin calls, i.e. directly in the logical decoding callback wrappers or similar.

Both of these options have some disadvantages. Locking inside the plugin makes the plugin code much more complex if it wants to support this. For example, if I as a plugin author call any function that somewhere accesses the syscache, I have to do the locking around that function call. Locking around plugin callbacks can hold the lock for longer periods of time, since plugins usually end up writing to the network. I think for most use-cases of 2PC decoding the latter is more useful, as the plugin should be connected to some kind of transaction management solution. Also, the time should be bounded by things like wal_sender_timeout (or statement_timeout for the SQL variant of decoding).

Note that I was initially advocating against locking around whole callbacks when Nikhil originally came up with the idea, but after we went over several other options here and given it a lot of thought, I now think it's probably the least bad way we have available.
At least until somebody figures out how to solve all the issues around reading aborted catalog changes, but that does seem like a rather large project on its own. And if we do locking around plugin callbacks now, then we can easily switch to that solution if it ever happens, without anybody having to rewrite the plugins.

-- 
Petr Jelinek
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
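Concretely, locking around the plugin calls would mean doing it in the logical decoding callback wrappers, along these lines (a sketch using the patch's lock API; error-context bookkeeping omitted):

    static void
    change_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
                      Relation relation, ReorderBufferChange *change)
    {
        LogicalDecodingContext *ctx = cache->private_data;

        /* take the decoding-group lock on behalf of the plugin */
        if (!LogicalLockTransaction(txn))
            return;             /* concurrently aborted, skip the callback */

        ctx->callbacks.change_cb(ctx, txn, relation, change);

        LogicalUnlockTransaction(txn);
    }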
On March 30, 2018 10:27:18 AM PDT, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> Locking around plugin callbacks can hold the lock for longer periods of time, since plugins usually end up writing to the network. I think for most use-cases of 2PC decoding the latter is more useful, as the plugin should be connected to some kind of transaction management solution. Also, the time should be bounded by things like wal_sender_timeout (or statement_timeout for the SQL variant of decoding).

Quick thought: Should be simple to release the lock when interacting with the network. Could also have the abort signal the lockers.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
> Quick thought: Should be simple to release the lock when interacting with the network.

I don't think this will be that simple. The network calls will typically happen from inside the plugins, and we don't want to make plugin authors responsible for that.

> Could also have the abort signal the lockers.

With the decode group locking we do have access to all the decoding backend pids. So we could signal them. But I am not sure signaling will work if the plugin is in the midst of a network call.

I agree with Petr. With this decode group lock implementation we have an inexpensive but workable implementation for locking around the plugin call. Sure, the abort will be penalized, but it's bounded by wal_sender_timeout or at most one change-apply cycle. As he mentioned, if we can optimize this later, we can do so without changing plugin coding semantics.

Regards,
Nikhils

> Andres
> -- 
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 30/03/18 19:36, Andres Freund wrote:
> On March 30, 2018 10:27:18 AM PDT, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
>> Locking around plugin callbacks can hold the lock for longer periods of time, since plugins usually end up writing to the network. I think for most use-cases of 2PC decoding the latter is more useful, as the plugin should be connected to some kind of transaction management solution. Also, the time should be bounded by things like wal_sender_timeout (or statement_timeout for the SQL variant of decoding).
>
> Quick thought: Should be simple to release the lock when interacting with the network. Could also have the abort signal the lockers.

I thought about that as well, but then we need to change the API of the write functions of logical decoding to return info about the transaction being aborted in the meantime, so that the plugin can abort. Seems a bit ugly that those should know about it. Alternatively we would have to disallow multiple writes from a single plugin callback. Otherwise the abort can happen during the network interaction without the plugin noticing.

-- 
Petr Jelinek
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2018-03-30 23:49:43 +0530, Nikhil Sontakke wrote:
>> Quick thought: Should be simple to release the lock when interacting with the network.
>
> I don't think this will be that simple. The network calls will typically happen from inside the plugins, and we don't want to make plugin authors responsible for that.

You can just throw results away... ;). I'm not even kidding. We've all the necessary access in the callback for writing from a context.

>> Could also have the abort signal the lockers.
>
> With the decode group locking we do have access to all the decoding backend pids. So we could signal them. But I am not sure signaling will work if the plugin is in the midst of a network call.

All walsender writes are nonblocking, so that's not an issue.

Greetings,

Andres Freund
On 30/03/18 20:50, Andres Freund wrote:
> Hi,
>
> On 2018-03-30 23:49:43 +0530, Nikhil Sontakke wrote:
>>> Quick thought: Should be simple to release the lock when interacting with the network.
>>
>> I don't think this will be that simple. The network calls will typically happen from inside the plugins, and we don't want to make plugin authors responsible for that.
>
> You can just throw results away... ;). I'm not even kidding. We've all the necessary access in the callback for writing from a context.

You mean, if we detect the abort in the write callback, set something in the context which will make all the future writes no-ops until it's reset again after we yield back to the logical decoding? That's not the most beautiful design I've seen, but I'd be okay with that; it seems like it would solve all the issues we have with this.

-- 
Petr Jelinek
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2018-03-30 21:05:29 +0200, Petr Jelinek wrote: > You mean, if we detect abort in the write callback, set something in the > context which will make all the future writes noop until it's reset > again after we yield back to the logical decoding? Something like that, yea. I *think* doing it via signalling is going to be a more efficient design than constantly checking, but I've not thought it fully through. > That's not the most beautiful design I've seen, but I'd be okay with > that, it seems like it would solve all the issues we have with this. Yea, it's not too pretty, but seems pragmatic. Greetings, Andres Freund
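The "turn writes into no-ops" idea could be sketched like this (the xact_aborted flag on the context is hypothetical):

    /* common write path used by the output plugin write callbacks */
    static void
    output_plugin_write(LogicalDecodingContext *ctx, bool last_write)
    {
        /*
         * If the transaction being decoded was aborted concurrently
         * (detected via signal or on a previous write attempt), silently
         * discard the data instead of sending it.
         */
        if (ctx->xact_aborted)
            return;

        /* ... normal nonblocking network write ... */
    }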
Hi Tomas,

> Thanks. I think the README is a good start, but I think we also need to improve the comments, which are usually more detailed than the README. For example, it's not quite acceptable that LogicalLockTransaction and LogicalUnlockTransaction have about no comments, especially when they're meant to be a public API for decoding plugins.

Additional documentation around the APIs has been incorporated from your review patch.

>>> 2) regression tests
>>> -------------------
>> They are long-winded queries and IMO they made the test file look too cluttered and verbose.
>
> Well, I don't think that's a major problem, and it certainly makes it more difficult to investigate regression failures.

Changed the test files to use the actual queries everywhere now.

> Now, the new bits ... attached is a .diff with a couple of changes and comments on various places.
>
> 1) LogicalLockTransaction
>
> - This function is part of a public API, yet it has no comment. That needs fixing - it has to be clear how to use it. The .diff suggests a comment, but it may need improvements.

Done.

> - As I mentioned in the previous review, BecomeDecodeGroupLeader is a misleading name. It suggests the caller becomes a leader, while in fact it looks up the PROC running the XID and makes it a leader. This is obviously due to copying the code from lock groups, where the caller actually becomes the leader. It's incorrect here. I suggest something like LookupDecodeGroupLeader() or something.

Done. Used AssignDecodeGroupLeader() as the function name now.

> - In the "if (MyProc->decodeGroupLeader == NULL)" block there are two blocks rechecking the transaction status:
>
>     if (proc == NULL)
>     { ... recheck ... }
>
>     if (!BecomeDecodeGroupMember(proc, proc->pid, rbtxn_prepared(txn)))
>     { ... recheck ...}
>
> I suggest joining them into a single block.

Done. Combined into a single block.

> - This Assert() is either bogus and there can indeed be cases with (MyProc->decodeGroupLeader==NULL), or the "if" is unnecessary:
>
>     Assert(MyProc->decodeGroupLeader);
>
>     if (MyProc->decodeGroupLeader) { ... }

Done. Removed the assert now.

> - I'm wondering why we're maintaining decodeAbortPending flags both for the leader and all the members. ISTM it'd be perfectly fine to only check the leader, particularly because RemoveDecodeGroupMemberLocked removes the members from the decoding group. So that seems unnecessary, and we can remove the
>
>     if (MyProc->decodeAbortPending)
>     { ... }

IMO, this looked clearer, in that each proc has been notified that an abort is pending.

> - LogicalUnlockTransaction needs comments too.

Done.

> 2) BecomeDecodeGroupLeader
>
> - It can bail out when (!proc), which will simplify the code a bit.

Done.

> - Why does it check the PID of the process at all? Seems unnecessary, considering we're already checking the XID.

Agreed. Especially for the current case of 2PC, the proc will have 0 as its pid.

> - Can a proc executing an XID have a different leader? I don't think so, so I'd make that an Assert().
>
>     Assert(!proc || (proc->decodeGroupLeader == proc));
>
> And it'll allow simplification of some of the conditions.

Done.

> - We're only dealing with prepared transactions now, so I'd just drop the is_prepared flag - it'll make the code a bit simpler, and we can add it later in the patch adding decoding of regular in-progress transactions. We can't test the (!is_prepared) case anyway.

Done.

> - Why are we making the leader also a member of the group?
> Seems rather unnecessary, and it complicates the abort handling, because we need to skip the leader when deciding to wait.

The leader is part of the decode group. And other than not waiting for ourselves at abort time, there are no other coding complications AFAICS.

> 3) LogicalDecodeRemoveTransaction
>
> - It's not clear to me what happens when a decoding backend gets killed between LogicalLockTransaction/LogicalUnlockTransaction. Doesn't that mean LogicalDecodeRemoveTransaction will get stuck, because the proc is still in the decoding group?

SIGSEGV, SIGABRT and SIGKILL will all cause the PG instance to restart because of possible shmem corruption issues. So I don't think the above scenario will arise. I also did not see any related handling in the parallel lock group case.

> 4) a bunch of comment / docs improvements, ...
>
> I'm suggesting rewording a couple of comments. I've also added a couple of missing comments - e.g. to LogicalLockTransaction and the lock group methods in general.
>
> Also, a couple more questions and suggestions in XXX comments.

Incorporated the relevant changes in the new patch set.

Andres, Petr: As discussed, I have now added lock/unlock API calls around the "apply_change" callback. This callback is now free to consult catalog metadata without worrying about a concurrent rollback operation. I have removed direct logicallock/logicalunlock calls from inside the pgoutput and test_decoding plugins now. Also modified the sgml documentation appropriately.

I am looking at how we can further optimize this by looking at the two approaches of signaling about an abort or adding abort-related info to the context, but this will be an additional patch over this patch set anyway.

Regards,
Nikhils
Attachment
- 0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.0204.patch
- 0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.0204.patch
- 0003-Support-decoding-of-two-phase-transactions-at-PREPAR.0204.patch
- 0004-pgoutput-output-plugin-support-for-logical-decoding-.0204.patch
- 0005-Teach-test_decoding-plugin-to-work-with-2PC.0204.patch
- 0006-Optional-Additional-test-case-to-demonstrate-decoding-rollbac.0204.patch
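For reference, the all-or-nothing callback check now errors out along these lines (the message wording is illustrative):

    int twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
        (ctx->callbacks.commit_prepared_cb != NULL) +
        (ctx->callbacks.abort_prepared_cb != NULL);

    /* a plugin must register all three 2PC callbacks, or none of them */
    if (twophase_callbacks != 3 && twophase_callbacks != 0)
        ereport(ERROR,
                (errmsg("output plugin registered only %d twophase callbacks",
                        twophase_callbacks)));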
On 29 March 2018 at 23:24, Andres Freund <andres@anarazel.de> wrote:
>> I agree with the former, of course - docs are a must. I disagree with the latter, though - there have been about no proposals for how to do it without the locking. If there are, I'd like to hear about it.
>
> I don't care. Either another solution needs to be found, or the locking needs to be automatically performed when necessary.

That seems unreasonable.

It's certainly a nice future goal to have it all happen automatically, but we don't know what the plugin will do. How can we ever make an unknown task happen automatically? We can't.

We have a reasonable approach here. Locking shared resources before using them is not a radical new approach, it's just standard development. If we find a better way in the future, we can use that, but requiring a better solution when there isn't one is unreasonable.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 29 March 2018 at 23:30, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> On 29/03/18 23:58, Andres Freund wrote:
>> On 2018-03-29 23:52:18 +0200, Tomas Vondra wrote:
>>>> I have added details about this in src/backend/storage/lmgr/README as suggested by you.
>>>
>>> Thanks. I think the README is a good start, but I think we also need to improve the comments, which are usually more detailed than the README. For example, it's not quite acceptable that LogicalLockTransaction and LogicalUnlockTransaction have about no comments, especially when they're meant to be a public API for decoding plugins.
>>
>> FWIW, for me that's ground to not accept the feature. Burdening output plugins with this will make their development painful (because they'll have to adapt regularly) and correctness doubtful (there's nothing checking for the lock being skipped). Another way needs to be found.
>
> I have to agree with Andres here. It's also visible in the latter patches. The pgoutput patch forgets to call these new APIs completely. The test_decoding plugin calls them, but it does so even when it's processing changes for committed transactions. I think that should be avoided, as it means potentially doing an SLRU lookup for every change. So doing it right is indeed not easy.

Yet you spotted these problems easily enough. Similar to finding missing LWLocks.

> I was wondering how to hide this. The best idea I had so far would be to put it in heap_beginscan (and index_beginscan, given that catalog scans use it as well) behind some condition. That would also improve performance because locking would not need to happen for syscache hits. The problem is however how to inform heap_beginscan about the fact that we are in 2PC decoding. We definitely don't want to change all the scan APIs for this. I wonder if we could add some kind of property to Snapshot which would indicate this fact - logical decoding is using its own snapshots, so it could inject the information about being inside the 2PC decoding.

Perhaps, but how do we know we've covered all the right places? We don't know what every plugin will require, do we? The plugin needs to take responsibility for its own correctness, whether we make it easier or not.

It seems clear that we would need a generalized API (the proposed locking approach) to cover all requirements.

-- 
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-04-02 09:23:10 +0100, Simon Riggs wrote: > On 29 March 2018 at 23:24, Andres Freund <andres@anarazel.de> wrote: > > >> I agree with the former, of course - docs are a must. I disagree with > >> the latter, though - there have been about no proposals how to do it > >> without the locking. If there are, I'd like to hear about it. > > > > I don't care. Either another solution needs to be found, or the locking > > needs to be automatically performed when necessary. > > That seems unreasonable. > It's certainly a nice future goal to have it all happen automatically, > but we don't know what the plugin will do. No, fighting too complicated APIs is not unreasonable. And we've found an alternative. > How can we ever make an unknown task happen automatically? We can't. The task isn't unknown, so this just seems like a non sequitur. Greetings, Andres Freund
Hi,

>> It's certainly a nice future goal to have it all happen automatically, but we don't know what the plugin will do.
>
> No, fighting too complicated APIs is not unreasonable. And we've found an alternative.

PFA, latest patch set.

The LogicalLockTransaction/LogicalUnlockTransaction API implementation using decode groups now has proper cleanup handling in case there's an ERROR while holding the logical lock. The rest of the patches are the same as yesterday.

Other than this, would we want pgoutput support for 2PC decoding to be made optional? In that case we could add an option to "CREATE SUBSCRIPTION". This will mean adding a new Anum_pg_subscription_subenable_twophase attribute to the Subscription struct and related processing. Should we go down this route?

Other than this, unless I am mistaken, every other issue has been taken care of. Please do let me know if we think anything is pending in these patch sets.

Regards,
Nikhils
-- 
Nikhil Sontakke
http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment
- 0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.0304.patch
- 0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.0304.patch
- 0003-Support-decoding-of-two-phase-transactions-at-PREPAR.0304.patch
- 0004-pgoutput-output-plugin-support-for-logical-decoding-.0304.patch
- 0005-Teach-test_decoding-plugin-to-work-with-2PC.0304.patch
- 0006-Optional-Additional-test-case-to-demonstrate-decoding-rollbac.0304.patch
On 04/03/2018 12:40 PM, Nikhil Sontakke wrote:
> Hi,
>
>>> It's certainly a nice future goal to have it all happen automatically, but we don't know what the plugin will do.
>>
>> No, fighting too complicated APIs is not unreasonable. And we've found an alternative.
>
> PFA, latest patch set.
>
> The LogicalLockTransaction/LogicalUnlockTransaction API implementation using decode groups now has proper cleanup handling in case there's an ERROR while holding the logical lock.
>
> The rest of the patches are the same as yesterday.

Unfortunately, this does segfault for me in `make check` almost immediately. Try

    ./configure --enable-debug --enable-cassert CFLAGS="-O0 -ggdb3 -DRANDOMIZE_ALLOCATED_MEMORY" && make -s clean && make -s -j4 check

and you should get an assert failure right away. Examples of backtraces attached, not sure what exactly is the issue.

Also, I get this compiler warning:

    proc.c: In function ‘AssignDecodeGroupLeader’:
    proc.c:1975:8: warning: variable ‘pid’ set but not used [-Wunused-but-set-variable]
      int pid;
          ^~~
    All of PostgreSQL successfully made. Ready to install.

which suggests we don't really need the pid variable.

> Other than this, would we want pgoutput support for 2PC decoding to be made optional? In that case we could add an option to "CREATE SUBSCRIPTION". This will mean adding a new Anum_pg_subscription_subenable_twophase attribute to the Subscription struct and related processing. Should we go down this route?

I'd say yes, we need to make it opt-in (assuming we want pgoutput to support the 2PC decoding at all). The trouble is that while it may improve replication of two-phase transactions, it may also require config changes on the subscriber (to support enough prepared transactions), and furthermore the GID is going to be copied to the subscriber. Which means that if the publisher and subscriber (at the instance level) are already participating in the same 2PC transaction, it can't possibly proceed, because the subscriber won't be able to do PREPARE TRANSACTION.

So I think we need a subscription parameter to enable/disable this, defaulting to 'disabled'.

regards

-- 
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
> On 3 Apr 2018, at 16:56, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > > So I think we need a subscription parameter to enable/disable this, > defaulting to 'disabled'. +1 Also, current value for LOGICALREP_IS_COMMIT is 1, but previous code expected flags to be zero, so this way logical replication between postgres-10 and postgres-with-2pc-decoding will be broken. So ISTM it’s better to set LOGICALREP_IS_COMMIT to zero and change flags checking rules to accommodate that. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
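To make the compatibility hazard concrete: a PG10 subscriber rejects any nonzero flags byte when reading a commit message, so plain commit has to keep sending zero. A sketch, where the pq_getmsgint() check mirrors the existing logicalrep_read_commit() and the #define values are purely illustrative:

    uint8   flags;

    /* PG10 subscriber side (proto.c) -- nonzero flags are fatal: */
    flags = pq_getmsgint(in, 1);
    if (flags != 0)
        elog(ERROR, "unrecognized flags %u in commit message", flags);

    /* hence only the new 2PC variants may use nonzero bits (sketch): */
    #define LOGICALREP_IS_COMMIT            0x00    /* plain commit, as in PG10 */
    #define LOGICALREP_IS_PREPARE           0x01
    #define LOGICALREP_IS_COMMIT_PREPARED   0x02
    #define LOGICALREP_IS_ROLLBACK_PREPARED 0x04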
On 04/03/2018 04:07 PM, Stas Kelvich wrote: > > >> On 3 Apr 2018, at 16:56, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> >> >> So I think we need a subscription parameter to enable/disable this, >> defaulting to 'disabled’. > > +1 > > Also, current value for LOGICALREP_IS_COMMIT is 1, but previous code expected > flags to be zero, so this way logical replication between postgres-10 and > postgres-with-2pc-decoding will be broken. So ISTM it’s better to set > LOGICALREP_IS_COMMIT to zero and change flags checking rules to accommodate that. > Yes, that is a good point actually - we need to test that replication between PG10 and PG11 works correctly, i.e. that the protocol version is correctly negotiated, and features are disabled/enabled accordingly etc. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra wrote: > Yes, that is a good point actually - we need to test that replication > between PG10 and PG11 works correctly, i.e. that the protocol version is > correctly negotiated, and features are disabled/enabled accordingly etc. Maybe it'd be good to have a buildfarm animal to specifically test for that? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/03/2018 04:37 PM, Alvaro Herrera wrote: > Tomas Vondra wrote: > >> Yes, that is a good point actually - we need to test that replication >> between PG10 and PG11 works correctly, i.e. that the protocol version is >> correctly negotiated, and features are disabled/enabled accordingly etc. > > Maybe it'd be good to have a buildfarm animal to specifically test for > that? > Not sure a buildfarm supports running two clusters with different versions easily? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > On 04/03/2018 04:37 PM, Alvaro Herrera wrote: >> Tomas Vondra wrote: >>> Yes, that is a good point actually - we need to test that replication >>> between PG10 and PG11 works correctly, i.e. that the protocol version is >>> correctly negotiated, and features are disabled/enabled accordingly etc. >> Maybe it'd be good to have a buildfarm animal to specifically test for >> that? > Not sure a buildfarm supports running two clusters with different > versions easily? You'd need some specialized buildfarm infrastructure like --- maybe the same as --- the infrastructure for testing cross-version pg_upgrade. Andrew could speak to the details better than I. regards, tom lane
FWIW, a couple of additional comments based on eyeballing the diffs: 1) twophase.c --------- I think this comment is slightly inaccurate: /* * Coordinate with logical decoding backends that may be already * decoding this prepared transaction. When aborting a transaction, * we need to wait for all of them to leave the decoding group. If * committing, we simply remove all members from the group. */ Strictly speaking, we're not waiting for the workers to leave the decoding group, but to set decodeLocked=false. That is, we may proceed when there still are members, but they must be in unlocked state. 2) reorderbuffer.c ------------------ I've already said it before, I find the "flags" bitmask and rbtxn_* macros way less readable than individual boolean flags. It was claimed this was done on Andres' request, but I don't see that in the thread. I admit it's rather subjective, though. I see ReorderBuffer only does the lock/unlock around apply_change and RelationIdGetRelation. That seems insufficient - RelidByRelfilenode can do heap_open on pg_class, for example. And I guess we need to protect rb->message too, because who knows what the plugin does in the callback? Also, we should not allocate gid[GIDSIZE] for every transaction. For example subxacts never need it, and it seems rather wasteful to allocate 200B when the rest of the struct has only ~100B. This is particularly problematic considering ReorderBufferTXN is not spilled to disk when reaching the memory limit. It needs to be allocated ad-hoc only when actually needed. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
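To illustrate the last point, the ad-hoc GID handling could look roughly like this (a sketch, assuming ReorderBufferTXN's gid becomes a plain char pointer; the prepare-record field name is illustrative):

    /* Only PREPARE-related records carry a GID, so copy it only then,
     * into the reorder buffer's own memory context: */
    txn->gid = MemoryContextStrdup(rb->context, parsed->twophase_gid);

    /* ... and release it in ReorderBufferCleanupTXN(): */
    if (txn->gid != NULL)
    {
        pfree(txn->gid);
        txn->gid = NULL;
    }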
Hi Tomas, >> Unfortunately, this does segfault for me in `make check` almost immediately. Try This is due to the new ERROR handling code that I added today for the lock/unlock APIs. Will fix. >> Also, current value for LOGICALREP_IS_COMMIT is 1, but previous code expected flags to be zero, so this way logical replication between postgres-10 and postgres-with-2pc-decoding will be broken. Good point. Will also test pg-10 to pg-11 logical replication to ensure that there are no issues. >> So I think we need a subscription parameter to enable/disable this, defaulting to 'disabled'. Ok, will add it to the "CREATE SUBSCRIPTION", btw, we should have allowed storing options in an array form for a subscription. We might add more options in the future and adding fields one by one doesn't seem that extensible. > 1) twophase.c > --------- > > I think this comment is slightly inaccurate: > > /* > * Coordinate with logical decoding backends that may be already > * decoding this prepared transaction. When aborting a transaction, > * we need to wait for all of them to leave the decoding group. If > * committing, we simply remove all members from the group. > */ > > Strictly speaking, we're not waiting for the workers to leave the > decoding group, but to set decodeLocked=false. That is, we may proceed > when there still are members, but they must be in unlocked state. > Agreed. Will modify it to mention that it will wait only if some of the backends are in locked state. > > 2) reorderbuffer.c > ------------------ > > I've already said it before, I find the "flags" bitmask and rbtxn_* > macros way less readable than individual boolean flags. It was claimed > this was done on Andres' request, but I don't see that in the thread. I > admit it's rather subjective, though. > Yeah, this is a little subjective. > I see ReorederBuffer only does the lock/unlock around apply_change and > RelationIdGetRelation. That seems insufficient - RelidByRelfilenode can > do heap_open on pg_class, for example. And I guess we need to protect > rb->message too, because who knows what the plugin does in the callback? > > Also, we should not allocate gid[GIDSIZE] for every transaction. For > example subxacts never need it, and it seems rather wasteful to allocate > 200B when the rest of the struct has only ~100B. This is particularly > problematic considering ReorderBufferTXN is not spilled to disk when > reaching the memory limit. It needs to be allocated ad-hoc only when > actually needed. > OK, will look at allocating GID only when needed. Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
On Wed, Apr 4, 2018 at 12:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: >> On 04/03/2018 04:37 PM, Alvaro Herrera wrote: >>> Tomas Vondra wrote: >>>> Yes, that is a good point actually - we need to test that replication >>>> between PG10 and PG11 works correctly, i.e. that the protocol version is >>>> correctly negotiated, and features are disabled/enabled accordingly etc. > >>> Maybe it'd be good to have a buildfarm animal to specifically test for >>> that? > >> Not sure a buildfarm supports running two clusters with different >> versions easily? > > You'd need some specialized buildfarm infrastructure like --- maybe the > same as --- the infrastructure for testing cross-version pg_upgrade. > Andrew could speak to the details better than I. > It's quite possible. The cross-version upgrade module saves out each built version. See <https://github.com/PGBuildFarm/client-code/blob/master/PGBuild/Modules/TestUpgradeXversion.pm> Since this occupies a significant amount of disk space we'd probably want to leverage it rather than have another module do the same thing. Perhaps the "save" part of it needs to be factored out. In any case, it's quite doable. I can work on that after this gets committed. Currently we seem to have only two machines doing the cross-version upgrade checks, which might make it easier to rearrange anything if necessary. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> This is due to the new ERROR handling code that I added today for the > lock/unlock APIs. Will fix. > Fixed. I continue to test this area for other issues. >>> Also, current value for LOGICALREP_IS_COMMIT is 1, but previous code expected > flags to be zero, so this way logical replication between postgres-10 and > postgres-with-2pc-decoding will be broken. > > Good point. Will also test pg-10 to pg-11 logical replication to > ensure that there are no issues. > I started making changes for supporting replication between postgres-10 and postgres-11 but then very quickly realized that pgoutput support is too far from being done. It needs to be optional and per subscription. It definitely needs a proto version bump and we don't even have a framework for negotiating proto version yet (since the proto was never bumped) so there is a chunk of completely new code missing. For demo and functionality purposes we have test_decoding support for 2pc decoding in this patch set. External plugins like bdr and pglogical will be able to leverage this infrastructure as well. Importantly, since we don't do negotiation, PG10 -> PG11 replication is not possible, which rules out one of the most important current use cases. To add support in pgoutput, we'd first have to get multi-protocol publisher/subscriber communication working as a pre-requisite. The good thing is that once we get the proto stuff in, we can easily add the patch from the earlier patchset which provides full 2PC decoding support in pgoutput. Thoughts? So, we should consider not adding pgoutput support right away and I have removed that patch from this patchset now. Another aspect of not working on pgoutput is we need not worry about adding an enable_twophase option to CREATE SUBSCRIPTION immediately as well. The test_decoding plugin is easy to extend with options and the patch set already does that for enabling/disabling 2PC decoding in it. >>> So I think we need a subscription parameter to enable/disable this, > defaulting to 'disabled'. > > Ok, will add it to the "CREATE SUBSCRIPTION", btw, we should have > allowed storing options in an array form for a subscription. We might > add more options in the future and adding fields one by one doesn't > seem that extensible. > This is not needed since we should not look at pgoutput 2PC decode support now. > >> 1) twophase.c >> --------- >> >> I think this comment is slightly inaccurate: >> >> /* >> * Coordinate with logical decoding backends that may be already >> * decoding this prepared transaction. When aborting a transaction, >> * we need to wait for all of them to leave the decoding group. If >> * committing, we simply remove all members from the group. >> */ >> >> Strictly speaking, we're not waiting for the workers to leave the >> decoding group, but to set decodeLocked=false. That is, we may proceed >> when there still are members, but they must be in unlocked state. >> > > Agreed. Will modify it to mention that it will wait only if some of > the backends are in locked state. > Modified the comment. >> >> 2) reorderbuffer.c >> ------------------ >> >> I've already said it before, I find the "flags" bitmask and rbtxn_* >> macros way less readable than individual boolean flags. It was claimed >> this was done on Andres' request, but I don't see that in the thread. I >> admit it's rather subjective, though. >> > > Yeah, this is a little subjective. > If the committer has strong opinions on this, then I can whip up patches along desired lines. 
>> I see ReorderBuffer only does the lock/unlock around apply_change and >> RelationIdGetRelation. That seems insufficient - RelidByRelfilenode can >> do heap_open on pg_class, for example. And I guess we need to protect >> rb->message too, because who knows what the plugin does in the callback? >> Added lock/unlock APIs around rb->message and other places where Relations are fetched. >> Also, we should not allocate gid[GIDSIZE] for every transaction. For >> example subxacts never need it, and it seems rather wasteful to allocate >> 200B when the rest of the struct has only ~100B. This is particularly >> problematic considering ReorderBufferTXN is not spilled to disk when >> reaching the memory limit. It needs to be allocated ad-hoc only when >> actually needed. >> > > OK, will look at allocating GID only when needed. > Done. Now GID is a char pointer and gets palloc'ed and pfree'd. PFA, latest patchset. Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment
- 0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.0404.patch
- 0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.0404.patch
- 0003-Support-decoding-of-two-phase-transactions-at-PREPAR.0404.patch
- 0004-Teach-test_decoding-plugin-to-work-with-2PC.0404.patch
- 0005-Optional-Additional-test-case-to-demonstrate-decoding-rollbac.0404.patch
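For reference, the missing negotiation discussed above would presumably grow out of the existing proto_version check in pgoutput's startup callback, along these lines; the ereport mirrors the existing code, while LOGICALREP_PROTO_TWOPHASE_VERSION_NUM and the subscription opt-in flag are hypothetical:

    /* pgoutput startup: reject protocol versions we do not speak */
    if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("client sent proto_version=%d but we only support protocol %d or lower",
                        data->protocol_version, LOGICALREP_PROTO_VERSION_NUM)));

    /* 2PC messages would be emitted only when both ends agree (sketch) */
    data->twophase_decoding =
        (data->protocol_version >= LOGICALREP_PROTO_TWOPHASE_VERSION_NUM) &&
        subscription_opted_in;  /* hypothetical CREATE SUBSCRIPTION option */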
>> This is due to the new ERROR handling code that I added today for the >> lock/unlock APIs. Will fix. >> > > Fixed. I continue to test this area for other issues. > Revised the patch after more testing and added more documentation in the ERROR handling code path. I tested ERROR handling by ensuring that LogicalLock is held by multiple backends and induced an ERROR while holding it. The handling in ProcKill rightly removes entries from these backends as part of ERROR cleanup. A future ROLLBACK removes the single remaining entry belonging to the leader from the decodeGroup appropriately later. Seems to be holding up OK. I had also missed a new test file for the optional 0005 patch earlier. That's also included now. Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment
- 0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.0404.v2.0.patch
- 0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.0404.v2.0.patch
- 0003-Support-decoding-of-two-phase-transactions-at-PREPAR.0404.v2.0.patch
- 0004-Teach-test_decoding-plugin-to-work-with-2PC.0404.v2.0.patch
- 0005-Optional-Additional-test-case-to-demonstrate-decoding-rollbac.0404.v2.0.patch
Hi, I think the patch looks mostly fine. I'm about to do a bit more testing on it, but a few comments. Attached diff shows the discussed places / comments more closely. 1) There's a race condition in LogicalLockTransaction. The code does roughly this: if (!BecomeDecodeGroupMember(...)) ... bail out ... Assert(MyProc->decodeGroupLeader); lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader); ... but AFAICS there is no guarantee that the transaction does not commit (or even abort) right after we become a decode group member. In which case LogicalDecodeRemoveTransaction might have already reset our pointer to a leader to NULL. In which case the Assert() and lock will fail. I initially thought this could be fixed by setting decodeLocked=true in BecomeDecodeGroupMember, but that's not really true - that would fix the race for aborts, but not commits. LogicalDecodeRemoveTransaction skips the wait for commits entirely, and just resets the flags anyway. So this needs a different fix, I think. BecomeDecodeGroupMember also needs the leader PGPROC pointer, but it does not have the issue because it gets it as a parameter. I think the same thing would work here too - that is, use the AssignDecodeGroupLeader() result instead. 2) BecomeDecodeGroupMember sets the decodeGroupLeader=NULL when the leader does not match the parameters, despite enforcing it by Assert() at the beginning. Let's remove that assignment. 3) I don't quite understand why BecomeDecodeGroupMember does the cross-check using PID. In which case would it help? 4) AssignDecodeGroupLeader still sets pid, which is never read. Remove. 5) ReorderBufferCommitInternal does elog(LOG) about interrupting the decoding of aborted transaction only in one place. There are about three other places where we check LogicalLockTransaction. Seems inconsistent. 6) The comment before LogicalLockTransaction is somewhat inaccurate, because it talks about adding/removing the backend to the group, but that's not what's happening. We join the group on the first call and then we only tweak the decodeLocked flag. 7) I propose minor changes to a couple of comments. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Hi Tomas, > 1) There's a race condition in LogicalLockTransaction. The code does > roughly this: > > if (!BecomeDecodeGroupMember(...)) > ... bail out ... > > Assert(MyProc->decodeGroupLeader); > lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader); > ... > > but AFAICS there is no guarantee that the transaction does not commit > (or even abort) right after we become a decode group member. In which > case LogicalDecodeRemoveTransaction might have already reset our pointer > to a leader to NULL. In which case the Assert() and lock will fail. > > I initially thought this could be fixed by setting decodeLocked=true in > BecomeDecodeGroupMember, but that's not really true - that would fix the > race for aborts, but not commits. LogicalDecodeRemoveTransaction skips > the wait for commits entirely, and just resets the flags anyway. > > So this needs a different fix, I think. BecomeDecodeGroupMember also > needs the leader PGPROC pointer, but it does not have the issue because > it gets it as a parameter. I think the same thing would work here > too - that is, use the AssignDecodeGroupLeader() result instead. > That's a good catch. One of the earlier patches had a check for this (it also had an ill-placed assert above though) which we removed as part of the ongoing review. Instead of doing the above, we can just re-check if the decodeGroupLeader pointer has become NULL and if so, re-assert that the leader has indeed gone away before returning false. I propose a diff like below.

    /*
     * If we were able to add ourself, then Abort processing will
-    * interlock with us.
+    * interlock with us. If the leader was done in the meanwhile
+    * it could have removed us and gone away as well.
     */
-   Assert(MyProc->decodeGroupLeader);
+   if (MyProc->decodeGroupLeader == NULL)
+   {
+       Assert(BackendXidGetProc(txn->xid) == NULL);
+       return false;
+   }

> 2) BecomeDecodeGroupMember sets the decodeGroupLeader=NULL when the > leader does not match the parameters, despite enforcing it by Assert() > at the beginning. Let's remove that assignment. > Ok, done. > 3) I don't quite understand why BecomeDecodeGroupMember does the > cross-check using PID. In which case would it help? > When I wrote this support, I had written it with the intention of supporting both 2PC (in which case pid is 0) and in-progress regular transactions. That's why the presence of PID in these functions. The current use case is just for 2PC, so we could remove it. > 4) AssignDecodeGroupLeader still sets pid, which is never read. Remove. > Ok, will do. > 5) ReorderBufferCommitInternal does elog(LOG) about interrupting the > decoding of aborted transaction only in one place. There are about three > other places where we check LogicalLockTransaction. Seems inconsistent. > Note that I have added it for the OPTIONAL test_decoding test cases (which AFAIK we don't plan to commit in that state) which demonstrate concurrent rollback interlocking with the lock/unlock APIs. The first ELOG was enough to catch the interaction. If we think these elogs should be present in the code, then yes, I can add it elsewhere as well as part of an earlier patch. > 6) The comment before LogicalLockTransaction is somewhat inaccurate, > because it talks about adding/removing the backend to the group, but > that's not what's happening. We join the group on the first call and > then we only tweak the decodeLocked flag. > True. > 7) I propose minor changes to a couple of comments. 
> Ok, I am looking at your provided patch and incorporating relevant changes from it. Will submit a patch set soon. Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
On 4/5/18 8:50 AM, Nikhil Sontakke wrote: > Hi Tomas, > >> 1) There's a race condition in LogicalLockTransaction. The code does >> roughly this: >> >> if (!BecomeDecodeGroupMember(...)) >> ... bail out ... >> >> Assert(MyProc->decodeGroupLeader); >> lwlock = LockHashPartitionLockByProc(MyProc->decodeGroupLeader); >> ... >> >> but AFAICS there is no guarantee that the transaction does not commit >> (or even abort) right after we become a decode group member. In which >> case LogicalDecodeRemoveTransaction might have already reset our pointer >> to a leader to NULL. In which case the Assert() and lock will fail. >> >> I initially thought this could be fixed by setting decodeLocked=true in >> BecomeDecodeGroupMember, but that's not really true - that would fix the >> race for aborts, but not commits. LogicalDecodeRemoveTransaction skips >> the wait for commits entirely, and just resets the flags anyway. >> >> So this needs a different fix, I think. BecomeDecodeGroupMember also >> needs the leader PGPROC pointer, but it does not have the issue because >> it gets it as a parameter. I think the same thing would work here >> too - that is, use the AssignDecodeGroupLeader() result instead. >> > > That's a good catch. One of the earlier patches had a check for this > (it also had an ill-placed assert above though) which we removed as > part of the ongoing review. > > Instead of doing the above, we can just re-check if the > decodeGroupLeader pointer has become NULL and if so, re-assert that > the leader has indeed gone away before returning false. I propose a > diff like below. > > /* > > * If we were able to add ourself, then Abort processing will > > - * interlock with us. > > + * interlock with us. If the leader was done in the meanwhile > > + * it could have removed us and gone away as well. > > */ > > - Assert(MyProc->decodeGroupLeader); > > + if (MyProc->decodeGroupLeader == NULL) > > + { > > + Assert(BackendXidGetProc(txn->xid) == NULL); > > + return false; > > + } > > Uh? Simply rechecking if MyProc->decodeGroupLeader is NULL obviously does not fix the race condition - it might get NULL right after the check. So we need to either lookup the PROC again (and then get the associated lwlock), or hold some other type of lock. >> >>> >>> 3) I don't quite understand why BecomeDecodeGroupMember does the >>> cross-check using PID. In which case would it help? >>> >> >> When I wrote this support, I had written it with the intention of >> supporting both 2PC (in which case pid is 0) and in-progress regular >> transactions. That's why the presence of PID in these functions. The >> current use case is just for 2PC, so we could remove it. >> > > Sure, but why do we need to cross-check the PID at all? I may be missing > something here, but I don't see what this protects against. > >> >>> >>> 5) ReorderBufferCommitInternal does elog(LOG) about interrupting the >>> decoding of aborted transaction only in one place. There are about three >>> other places where we check LogicalLockTransaction. Seems inconsistent. >>> >> >> Note that I have added it for the OPTIONAL test_decoding test cases >> (which AFAIK we don't plan to commit in that state) which demonstrate >> concurrent rollback interlocking with the lock/unlock APIs. The first >> ELOG was enough to catch the interaction. If we think these elogs >> should be present in the code, then yes, I can add it elsewhere as >> well as part of an earlier patch. >> > > Ah, I see. Makes sense. I've been looking at the patch as a whole and hadn't realized it's part of this piece. 
> > Ok, I am looking at your provided patch and incorporating relevant > changes from it. WIll submit a patch set soon. > OK. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tomas, >> > > Uh? Simply rechecking if MyProc->decodeGroupLeader is NULL obviously > does not fix the race condition - it might get NULL right after the > check. So we need to either lookup the PROC again (and then get the > associated lwlock), or hold some other type of lock. > I realized my approach was short-sighted while coding it up. So now we look up the leader pgproc, recheck that the XID is the same one we are interested in, and go ahead. > >>> >>> 3) I don't quite understand why BecomeDecodeGroupMember does the >>> cross-check using PID. In which case would it help? >>> >> >> When I wrote this support, I had written it with the intention of >> supporting both 2PC (in which case pid is 0) and in-progress regular >> transactions. That's why the presence of PID in these functions. The >> current use case is just for 2PC, so we could remove it. >> > > Sure, but why do we need to cross-check the PID at all? I may be missing > something here, but I don't see what this protects against. > The fact that PID is 0 in case of prepared transactions was making me nervous. So, I had added the assert that pid should only be 0 when it's a prepared transaction and not otherwise. Anyways, since we are dealing with only 2PC, I have removed the PID argument now. Also removed the is_prepared argument for the same reason. >> >> Ok, I am looking at your provided patch and incorporating relevant >> changes from it. Will submit a patch set soon. >> > > OK. > PFA, latest patch set. Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment
- 0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.0504.patch
- 0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.0504.patch
- 0003-Support-decoding-of-two-phase-transactions-at-PREPAR.0504.patch
- 0004-Teach-test_decoding-plugin-to-work-with-2PC.0504.patch
- 0005-OPTIONAL-Additional-test-case-to-demonstrate-decoding-rollbac.0504.patch
Hi, >> Uh? Simply rechecking if MyProc->decodeGroupLeader is NULL obviously >> does not fix the race condition - it might get NULL right after the >> check. So we need to either lookup the PROC again (and then get the >> associated lwlock), or hold some other type of lock. >> > > I realized my approach was short-sighted while coding it up. So now we > lookup the leader pgproc, recheck if the XID is the same that we are > interested in and go ahead. > I did some more gdb single-stepping and debugging on this. Introduced a few more fetch pgproc using XID calls for more robustness. I am satisfied now from my point of view with the decodegroup lock changes. Also a few other changes related to cleanups and setting of the txn flags at all places. PFA, v2.0 of the patchset for today. "make check-world" passes ok on these patches. Regards, Nikhils >> >>>> >>>> 3) I don't quite understand why BecomeDecodeGroupMember does the >>>> cross-check using PID. In which case would it help? >>>> >>> >>> When I wrote this support, I had written it with the intention of >>> supporting both 2PC (in which case pid is 0) and in-progress regular >>> transactions. That's why the presence of PID in these functions. The >>> current use case is just for 2PC, so we could remove it. >>> >> >> Sure, but why do we need to cross-check the PID at all? I may be missing >> something here, but I don't see what does this protect against? >> > > The fact that PID is 0 in case of prepared transactions was making me > nervous. So, I had added the assert that pid should only be 0 when > it's a prepared transaction and not otherwise. Anyways, since we are > dealing with only 2PC, I have removed the PID argument now. Also > removed is_prepared argument for the same reason. > >>> >>> Ok, I am looking at your provided patch and incorporating relevant >>> changes from it. WIll submit a patch set soon. >>> >> >> OK. >> > PFA, latest patch set. > > Regards, > Nikhils > -- > Nikhil Sontakke http://www.2ndQuadrant.com/ > PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Attachment
- 0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.0504.v2.0.patch
- 0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.0504.v2.0.patch
- 0003-Support-decoding-of-two-phase-transactions-at-PREPAR.0504.v2.0.patch
- 0004-Teach-test_decoding-plugin-to-work-with-2PC.0504.v2.0.patch
- 0005-OPTIONAL-Additional-test-case-to-demonstrate-decoding-rollbac.0504.v2.0.patch
On Fri, Apr 6, 2018 at 12:23 AM, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote: > Hi, > > > >>> Uh? Simply rechecking if MyProc->decodeGroupLeader is NULL obviously >>> does not fix the race condition - it might get NULL right after the >>> check. So we need to either lookup the PROC again (and then get the >>> associated lwlock), or hold some other type of lock. >>> >> >> I realized my approach was short-sighted while coding it up. So now we >> look up the leader pgproc, recheck that the XID is the same one we are >> interested in, and go ahead. >> > > I did some more gdb single-stepping and debugging on this. Introduced a few > more fetch pgproc using XID calls for more robustness. I am satisfied now from > my point of view with the decodegroup lock changes. > > Also a few other changes related to cleanups and setting of the txn flags at > all places. > > PFA, v2.0 of the patchset for today. > > "make check-world" passes ok on these patches. > OK, I think this is now committable. The changes are small, fairly isolated in effect, and I think every objection has been met, partly by reducing the scope of the changes. By committing this we will allow plugin authors to start developing 2PC support, which is important in some use cases. I therefore intend to commit these patches some time before the deadline, either in 12 hours or so, or about 24 hours after that (which would be right up against the deadline by my calculation), depending on some other important obligations I have. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 4/3/18 18:05, Andrew Dunstan wrote: > Currently we seem to have only two machines doing the cross-version > upgrade checks, which might make it easier to rearrange anything if > necessary. I think we should think about making this even more general. We could use some cross-version testing for pg_dump, psql, pg_basebackup, pg_upgrade, logical replication, and so on. Ideally, we would be able to run the whole test set against an older version somehow. Lots of details omitted here, of course. ;-) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2018-04-06 21:30:36 +0930, Andrew Dunstan wrote: > OK, I think this is now committable. > The changes are small, fairly isolated in effect, and I think every > objection has been met, partly by reducing the scope of the > changes. By committing this we will allow plugin authors to start > developing 2PC support, which is important in some use cases. > > I therefore intend to commit these patches some time before the > deadline, either in 12 hours or so, or about 24 hours after that > (which would be right up against the deadline by my calculation), > depending on some other important obligations I have. I object. And I'm negatively surprised that this is even considered. This is a complicated patch that has been heavily reworked in the last few days to, among other things, address objections that were first made months ago ([1]). There were nontrivial bugs less than a day ago. It has not received a lot of reviews since these changes. This isn't an area you've previously been involved in to a significant degree. Greetings, Andres Freund [1] http://archives.postgresql.org/message-id/20180209211025.d7jxh43fhqnevhji%40alap3.anarazel.de
On Sat, Apr 7, 2018 at 1:50 AM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2018-04-06 21:30:36 +0930, Andrew Dunstan wrote: >> OK, I think this is now committable. > >> The changes are small, fairly isolated in effect, and I think every >> objection has been met, partly by reducing the scope of the >> changes. By committing this we will allow plugin authors to start >> developing 2PC support, which is important in some use cases. >> >> I therefore intent to commit these patches some time before the >> deadline, either in 12 hours or so, or about 24 hours after that >> (which would be right up against the deadline by my calculation) , >> depending on some other important obligations I have. > > I object. And I'm negatively surprised that this is even considered. > > This is a complicated patch that has been heavily reworked in the last > few days to, among other things, address objections that have first been > made months ago ([1]). There we nontrivial bugs less than a day ago. It > has not received a lot of reviews since these changes. This isn't an > area you've previously been involved in to a significant degree. > No I haven't although I have been spending some time familiarizing myself with it. Nevertheless, since you object I won't persist. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Apr 6, 2018 at 10:00 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > On 4/3/18 18:05, Andrew Dunstan wrote: >> Currently we seem to have only two machines doing the cross-version >> upgrade checks, which might make it easier to rearrange anything if >> necessary. > > I think we should think about making this even more general. We could > use some cross-version testing for pg_dump, psql, pg_basebackup, > pg_upgrade, logical replication, and so on. Ideally, we would be able > to run the whole test set against an older version somehow. Lots of > details omitted here, of course. ;-) > Yeah, that's more or less the plan. One way to generalize it might be to see if ${branch}_SAVED exists and points to a directory with bin share and lib directories. If so, use it as required to test against that branch. The buildfarm will make sure that that setting exists. There are some tricks you have to play with the environment, but it's basically doable. Anyway, this is really matter for another thread. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> I object. And I'm negatively surprised that this is even considered. > I am also a bit surprised.. > This is a complicated patch that has been heavily reworked in the last > few days to, among other things, address objections that have first been > made months ago ([1]). There we nontrivial bugs less than a day ago. It > has not received a lot of reviews since these changes. This isn't an > area you've previously been involved in to a significant degree. > I thought all the points that you had raised in [1] had been met with satisfactorily. Let me know if that's not the case. The last few days, the focus was on making the decodegroup locking implementation a bit more robust. Anyways, will now wait for the next commitfest/opportunity to try to get this in. > > [1] http://archives.postgresql.org/message-id/20180209211025.d7jxh43fhqnevhji%40alap3.anarazel.de Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
On 4/9/18 2:01 AM, Nikhil Sontakke wrote: > > Anyways, will now wait for the next commitfest/opportunity to try to > get this in. It looks like this patch should be in the Needs Review state so I have done that and moved it to the next CF. Regards, -- -David david@pgmasters.net
>> Anyways, will now wait for the next commitfest/opportunity to try to >> get this in. > > It looks like this patch should be in the Needs Review state so I have > done that and moved it to the next CF. > Thanks David, Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
Hi all, >>> Anyways, will now wait for the next commitfest/opportunity to try to >>> get this in. >> >> It looks like this patch should be in the Needs Review state so I have >> done that and moved it to the next CF. >> PFA, patchset updated to take care of bitrot. Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch
- 0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.patch
- 0003-Support-decoding-of-two-phase-transactions-at-PREPAR.patch
- 0004-Teach-test_decoding-plugin-to-work-with-2PC.patch
- 0005-OPTIONAL-Additional-test-case-to-demonstrate-decoding-rollbac.patch
Hi, >>>> Anyways, will now wait for the next commitfest/opportunity to try to >>>> get this in. >>> >>> It looks like this patch should be in the Needs Review state so I have >>> done that and moved it to the next CF. >>> > PFA, patchset updated to take care of bitrot. > For some reason, the 3rd patch was missing a few lines. Revised patch set attached. Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch
- 0002-Introduce-LogicalLockTransaction-LogicalUnlockTransa.patch
- 0003-Support-decoding-of-two-phase-transactions-at-PREPAR.patch
- 0004-Teach-test_decoding-plugin-to-work-with-2PC.patch
- 0005-OPTIONAL-Additional-test-case-to-demonstrate-decoding-rollbac.patch
Hi Nikhil, I've been looking at this patch series, and I do have a bunch of comments and questions, as usual ;-) Overall, I think it's clear the main risk associated with this patch is the decode group code - it touches PROC entries, so a bug may cause trouble pretty easily. So I've focused on this part, for now. 1) LogicalLockTransaction does roughly this ... if (MyProc->decodeGroupLeader == NULL) { leader = AssignDecodeGroupLeader(txn->xid); if (leader == NULL || !BecomeDecodeGroupMember((PGPROC *)leader, txn->xid)) goto lock_cleanup; } leader = BackendXidGetProc(txn->xid); if (!leader) goto lock_cleanup; leader_lwlock = LockHashPartitionLockByProc(leader); LWLockAcquire(leader_lwlock, LW_EXCLUSIVE); pgxact = &ProcGlobal->allPgXact[leader->pgprocno]; if (pgxact->xid != txn->xid) { LWLockRelease(leader_lwlock); goto lock_cleanup; } ... I wonder why we need the BackendXidGetProc call after the first if block. Can we simply grab MyProc->decodeGroupLeader at that point? 2) InitProcess now resets decodeAbortPending/decodeLocked flags, while checking decodeGroupLeader/decodeGroupMembers using asserts. Isn't that a bit strange? Shouldn't it do the same thing with both? 3) A comment in ProcKill says this: * Detach from any decode group of which we are a member. If the leader * exits before all other group members, its PGPROC will remain allocated * until the last group process exits; that process must return the * leader's PGPROC to the appropriate list. So I'm wondering what happens if the leader dies before other group members, but the PROC entry gets reused for a new connection. It clearly should not be a leader for that old decode group, but it may need to be a leader for another group. 4) strange hunk in ProcKill There seems to be some sort of merge/rebase issue, because this block of code (line ~880) related to lock groups /* Return PGPROC structure (and semaphore) to appropriate freelist */ proc->links.next = (SHM_QUEUE *) *procgloballist; *procgloballist = proc; got replaced by code related to decode groups. That seems strange. 5) ReorderBufferCommitInternal I see the LogicalLockTransaction() calls in ReorderBufferCommitInternal have vastly variable comments. Some calls have no comment, some calls have "obvious" comment like "Lock transaction before catalog access" and one call has this very long comment /* * Output plugins can access catalog metadata and we * do not have any control over that. We could ask * them to call * LogicalLockTransaction/LogicalUnlockTransaction * APIs themselves, but that leads to unnecessary * complications and expectations from plugin * writers. We avoid this by calling these APIs * here, thereby ensuring that the in-progress * transaction will be around for the duration of * the apply_change call below */ I find that rather inconsistent, and I'd say those comments are useless. I suggest to remove all the per-call comments and instead add a comment about the locking into the initial file-level comment, which already explains handling of large transactions, etc. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
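Regarding (1), the simplification would be roughly the following; a sketch only, and it presumes MyProc->decodeGroupLeader cannot change under us once we have joined the group -- which is exactly the property that needs verifying:

    if (MyProc->decodeGroupLeader == NULL)
    {
        leader = AssignDecodeGroupLeader(txn->xid);

        if (leader == NULL ||
            !BecomeDecodeGroupMember((PGPROC *) leader, txn->xid))
            goto lock_cleanup;
    }

    /* reuse our own pointer instead of a second by-XID hash lookup */
    leader = MyProc->decodeGroupLeader;
    leader_lwlock = LockHashPartitionLockByProc(leader);
    LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);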
On Mon, Jul 16, 2018 at 11:21 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Overall, I think it's clear the main risk associated with this patch is the > decode group code - it touches PROC entries, so a bug may cause trouble > pretty easily. So I've focused on this part, for now. I agree. As a general statement, I think the idea of trying to prevent transactions from aborting is really scary. It's almost an axiom of the system that we're always allowed to abort, and I think there could be a lot of unintended and difficult-to-fix consequences of undermining that guarantee. I think it will be very difficult to create a sound system for delaying transactions, and I doubt very much that the proposed system is sound. In particular: - The do_wait loop contains a CHECK_FOR_INTERRUPTS(). If that makes it interruptible, then it's possible for the abort to complete before the decoding processes have aborted. If that can happen, then this whole mechanism is completely pointless, because it fails to actually achieve the guarantee which is its central goal. On the other hand, if you don't make this abort interruptible, then you significantly increase the risk that a backend could get stuck in the abort path for an unbounded period of time. If the aborting backend holds any significant resources at this point, such as heavyweight locks, then you risk creating a deadlock that cannot be broken until the decoding process manages to abort, and if that process is involved in the deadlock, then you risk creating an unbreakable deadlock. - BackendXidGetProc() seems to be called in multiple places without any lock held. I don't see how that can be safe, because AFAICS it must inevitably introduce a race condition: the answer can change after that value is returned but before it is used. There's a bunch of recheck logic that looks like it is trying to cope with this problem, but I'm not sure it's very solid. For example, AssignDecodeGroupLeader reads proc->decodeGroupLeader without holding any lock; we have historically avoided assuming that pointer-width reads cannot be torn. (We have assumed this only for 4-byte reads or narrower.) There are no comments about the locking hazards here, and no real explanation of how the recheck algorithm tries to patch things up: + leader = BackendXidGetProc(xid); + if (!leader || leader != proc) + { + LWLockRelease(leader_lwlock); + return NULL; + } Can leader be non-NULL yet unequal to proc? I don't understand how that can happen: surely once the PGPROC that has that XID aborts, the same XID can't possibly be assigned to a different PGPROC. - The code for releasing PGPROCs in ProcKill looks completely unsafe to me. With locking groups for parallel query, a process always enters a lock group of its own volition. It can safely use (MyProc->lockGroupLeader != NULL) as a race-free test because no other process can modify that value. But in this implementation of decoding groups, one process can put another process into a decoding group, which means this test has a race condition. If there's some reason this is safe, the comments sure don't explain it. I don't want to overplay my hand, but I think this code is a very long way from being committable, and I am concerned that the fundamental approach of blocking transaction aborts may be unsalvageably broken or at least exceedingly dangerous. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
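To spell out the first point, the problematic wait is a loop of roughly this shape (a sketch of the pattern, not the patch's exact code; DecodeGroupAnyLocked is a hypothetical helper):

    for (;;)
    {
        /* done once no decoding backend still has us locked */
        if (!DecodeGroupAnyLocked(MyProc))
            break;

        /*
         * If this can throw (statement timeout, pg_terminate_backend, ...),
         * the abort proceeds while decoding is still in flight, defeating
         * the interlock; if it cannot, the abort may block indefinitely
         * behind the decoding process.
         */
        CHECK_FOR_INTERRUPTS();

        (void) WaitLatch(MyLatch, WL_LATCH_SET, 0, PG_WAIT_LOCK);
        ResetLatch(MyLatch);
    }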
Robert Haas <robertmhaas@gmail.com> writes: > I agree. As a general statement, I think the idea of trying to > prevent transactions from aborting is really scary. It's almost an > axiom of the system that we're always allowed to abort, and I think > there could be a lot of unintended and difficult-to-fix consequences > of undermining that guarantee. I think it will be very difficult to > create a sound system for delaying transactions, and I doubt very much > that the proposed system is sound. Ugh, is this patch really dependent on such a thing? TBH, I think the odds of making that work are indistinguishable from zero; and even if you managed to commit something that did work at the instant you committed it, the odds that it would stay working in the face of later system changes are exactly zero. I would reject this idea out of hand. regards, tom lane
On 07/16/2018 06:15 PM, Robert Haas wrote: > On Mon, Jul 16, 2018 at 11:21 AM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> Overall, I think it's clear the main risk associated with this patch is the >> decode group code - it touches PROC entries, so a bug may cause trouble >> pretty easily. So I've focused on this part, for now. > > I agree. As a general statement, I think the idea of trying to > prevent transactions from aborting is really scary. It's almost an > axiom of the system that we're always allowed to abort, and I think > there could be a lot of unintended and difficult-to-fix consequences > of undermining that guarantee. I think it will be very difficult to > create a sound system for delaying transactions, and I doubt very much > that the proposed system is sound. > > In particular: > > - The do_wait loop contains a CHECK_FOR_INTERRUPTS(). If that makes > it interruptible, then it's possible for the abort to complete before > the decoding processes have aborted. If that can happen, then this > whole mechanism is completely pointless, because it fails to actually > achieve the guarantee which is its central goal. On the other hand, > if you don't make this abort interruptible, then you significantly > increase the risk that a backend could get stuck in the abort path for > an unbounded period of time. If the aborting backend holds any > significant resources at this point, such as heavyweight locks, then > you risk creating a deadlock that cannot be broken until the decoding > process manages to abort, and if that process is involved in the > deadlock, then you risk creating an unbreakable deadlock. > I'm not sure I understand. Are you suggesting the process might get killed or something, thanks to the CHECK_FOR_INTERRUPTS() call? > - BackendXidGetProc() seems to be called in multiple places without > any lock held. I don't see how that can be safe, because AFAICS it > must inevitably introduce a race condition: the answer can change > after that value is returned but before it is used. There's a bunch > of recheck logic that looks like it is trying to cope with this > problem, but I'm not sure it's very solid. But BackendXidGetProc() internally acquires ProcArrayLock, of course. It's true there are a few places where we do != NULL checks on the result without holding any lock, but I don't see why that would be a problem? And before actually inspecting the contents, the code always does LockHashPartitionLockByProc. But I certainly agree this would deserve comments explaining why this (lack of) locking is safe. (The reason it's done this way is clearly an attempt to acquire the lock as infrequently as possible, in an effort to minimize the overhead.) > For example, > AssignDecodeGroupLeader reads proc->decodeGroupLeader without holding > any lock; we have historically avoided assuming that pointer-width > reads cannot be torn. (We have assumed this only for 4-byte reads or > narrower.) There are no comments about the locking hazards here, and > no real explanation of how the recheck algorithm tries to patch things > up: > > + leader = BackendXidGetProc(xid); > + if (!leader || leader != proc) > + { > + LWLockRelease(leader_lwlock); > + return NULL; > + } > > Can leader be non-NULL yet unequal to proc? I don't understand how that can > happen: surely once the PGPROC that has that XID aborts, the same XID > can't possibly be assigned to a different PGPROC. > Yeah. I have the same question. > - The code for releasing PGPROCs in ProcKill looks completely unsafe > to me. 
With locking groups for parallel query, a process always > enters a lock group of its own volition. It can safely use > (MyProc->lockGroupLeader != NULL) as a race-free test because no other > process can modify that value. But in this implementation of decoding > groups, one process can put another process into a decoding group, > which means this test has a race condition. If there's some reason > this is safe, the comments sure don't explain it. > I don't follow. How could one process put another process into a decoding group? I don't think that's possible. > I don't want to overplay my hand, but I think this code is a very long > way from being committable, and I am concerned that the fundamental > approach of blocking transaction aborts may be unsalvageably broken or > at least exceedingly dangerous. > I'm not sure about the 'unsalvageable' part, but it needs more work, that's for sure. Unfortunately, all previous attempts to make this work in various other ways failed (see past discussions in this thread), so this is the only approach left :-( So let's see if we can make it work. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 07/16/2018 07:21 PM, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I agree. As a general statement, I think the idea of trying to >> prevent transactions from aborting is really scary. It's almost an >> axiom of the system that we're always allowed to abort, and I think >> there could be a lot of unintended and difficult-to-fix consequences >> of undermining that guarantee. I think it will be very difficult to >> create a sound system for delaying transactions, and I doubt very much >> that the proposed system is sound. > > Ugh, is this patch really dependent on such a thing? > Unfortunately it does :-( Without it the decoding (or output plugins) may see catalogs broken in various ways - the catalog records may get vacuumed, HOT chains are broken, ... There were attempts to change that part, but that seems an order of magnitude more invasive than this. > TBH, I think the odds of making that work are indistinguishable from zero; > and even if you managed to commit something that did work at the instant > you committed it, the odds that it would stay working in the face of later > system changes are exactly zero. I would reject this idea out of hand. > Why? How is this significantly different from other patches touching ProcArray and related bits? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 16, 2018 at 1:28 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > I'm not sure I understand. Are you suggesting the process might get killed > or something, thanks to the CHECK_FOR_INTERRUPTS() call? Yes. CHECK_FOR_INTERRUPTS() can certainly lead to a non-local transfer of control. > But BackendXidGetProc() internally acquires ProcArrayLock, of course. It's > true there are a few places where we do != NULL checks on the result without > holding any lock, but I don't see why that would be a problem? And before > actually inspecting the contents, the code always does > LockHashPartitionLockByProc. I think at least some of those cases are a problem. See below... > I don't follow. How could one process put another process into a decoding > group? I don't think that's possible. Isn't that exactly what AssignDecodeGroupLeader() is doing? It looks up the process that currently has that XID, then turns that process into a decode group leader. Then after that function returns, the caller adds itself to the decode group as well. So it seems entirely possible for somebody to swing the decodeGroupLeader pointer for a PGPROC from NULL to some other value at an arbitrary point in time. > I'm not sure about the 'unsalvageable' part, but it needs more work, that's > for sure. Unfortunately, all previous attempts to make this work in various > other ways failed (see past discussions in this thread), so this is the only > approach left :-( So let's see if we can make it work. I think that's probably not going to work out, but of course it's up to you how you want to spend your time! After thinking about it a bit more, if you want to try to stick with this design, I don't think that this decode group leader/members thing has much to recommend it. In the case of parallel query, the point of the lock group stuff is to treat all of those processes as one for purposes of heavyweight lock acquisition. There's no similar need here, so the design that makes sure the "leader" is in the list of processes that are members of the "group" is, AFAICS, just wasted code. All you really need is a list of processes hung off of the PGPROC that must abort before the leader is allowed to abort; the leader itself doesn't need to be in the list, and there's no need to consider it as a "group". It's just a list of waiters. That having been said, I still don't see how that's really going to work. Just to take one example, suppose that the leader is trying to ERROR out, and the decoding workers are blocked waiting for a lock held by the leader. The system has no way of detecting this deadlock and resolving it automatically, which certainly seems unacceptable. The only way that's going to work is if the leader waits for the worker by trying to acquire a lock held by the worker. Then the deadlock detector would know to abort some transaction. But that doesn't really work either - the deadlock was created by the foreground process trying to abort, and if the deadlock detector chooses that process as its victim, what then? We're already trying to abort, and the abort code isn't supposed to throw further errors, or fail in any way, lest we break all kinds of other things. Not to mention the fact that running the deadlock detector in the abort path isn't really safe to begin with, again because we can't throw errors when we're already in an abort path. 
If we're only ever talking about decoding prepared transactions, we could probably work around all of these problems: have the decoding process take a heavyweight lock before it begins decoding. Have a process that wants to execute ROLLBACK PREPARED take a conflicting heavyweight lock on the same object. The net effect would be that ROLLBACK PREPARED would simply wait for decoding to finish. That might be rather lousy from a latency point of view since the transaction could take an arbitrarily long time to decode, but it seems safe enough. Possibly you could also design a mechanism for the ROLLBACK PREPARED command to SIGTERM the processes that are blocking its lock acquisition, if they are decoding processes. The difference between this and what the current patch is doing is that nothing complex or fragile is happening in the abort pathway itself. The complicated stuff in both the worker and in the main backend happens while the transaction is still good and can still be rolled back at need. This kind of approach won't work if you want to decode transactions that aren't yet prepared, so if that is the long term goal then we need to think harder. I'm honestly not sure that problem has any reasonable solution. The assumption that a running process can abort at any time is deeply baked into many parts of the system and for good reasons. Trying to undo that is going to be like trying to push water up a hill. I think we need to install interlocks in such a way that any waiting happens before we enter the abort path, not while we're actually trying to perform the abort. But I don't know how to do that for a foreground task that's still actively doing stuff. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
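A sketch of that interlock using the existing lmgr primitives. Note that a prepared transaction keeps holding the lock on its own XID, so a dedicated lock tag would be needed in practice; the advisory tag below merely makes the sketch concrete:

    LOCKTAG     tag;

    /* decoding process, before starting to decode the prepared xact */
    SET_LOCKTAG_ADVISORY(tag, MyDatabaseId, (uint32) prepared_xid, 0, 1);
    (void) LockAcquire(&tag, ShareLock, false, false);
    /* ... decode the prepared transaction ... */
    LockRelease(&tag, ShareLock, false);

    /* ROLLBACK PREPARED path: ExclusiveLock conflicts with ShareLock,
     * so this simply waits for any in-progress decoding to finish */
    SET_LOCKTAG_ADVISORY(tag, MyDatabaseId, (uint32) prepared_xid, 0, 1);
    (void) LockAcquire(&tag, ExclusiveLock, false, false);
    /* ... perform the rollback; the lock falls away at transaction end ... */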
On 07/16/2018 08:09 PM, Robert Haas wrote: > On Mon, Jul 16, 2018 at 1:28 PM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> I'm not sure I understand. Are you suggesting the process might get killed >> or something, thanks to the CHECK_FOR_INTERRUPTS() call? > > Yes. CHECK_FOR_INTERRUPTS() can certainly lead to a non-local > transfer of control. > >> But BackendXidGetProc() internally acquires ProcArrayLock, of course. It's >> true there are a few places where we do != NULL checks on the result without >> holding any lock, but I don't see why that would be a problem? And before >> actually inspecting the contents, the code always does >> LockHashPartitionLockByProc. > > I think at least some of those cases are a problem. See below... > >> I don't follow. How could one process put another process into a decoding >> group? I don't think that's possible. > > Isn't that exactly what AssignDecodeGroupLeader() is doing? It looks > up the process that currently has that XID, then turns that process > into a decode group leader. Then after that function returns, the > caller adds itself to the decode group as well. So it seems entirely > possible for somebody to swing the decodeGroupLeader pointer for a > PGPROC from NULL to some other value at an arbitrary point in time. > Oh, right, I forgot the patch also adds the leader into the group, for some reason (I agree it's unclear why that would be necessary, as you pointed out later). But all this is happening while holding the partition lock (in exclusive mode). And the decoding backends do synchronize on the lock correctly (although, man, the rechecks are confusing ...). But now I see ProcKill accesses decodeGroupLeader in multiple places, and only the first one is protected by the lock, for some reason (interestingly enough the one in lockGroupLeader block). Is that what you mean? FWIW I suspect the ProcKill part is borked due to incorrectly resolved merge conflict or something, per my initial response from today. >> I'm not sure about the 'unsalvageable' part, but it needs more work, that's >> for sure. Unfortunately, all previous attempts to make this work in various >> other ways failed (see past discussions in this thread), so this is the only >> approach left :-( So let's see if we can make it work. > > I think that's probably not going to work out, but of course it's up > to you how you want to spend your time! > Well, yeah. I'm sure I could think of more fun things to do, but OTOH I also have patches that require the capability to decode in-progress transactions. > After thinking about it a bit more, if you want to try to stick with > this design, I don't think that this decode group leader/members thing > has much to recommend it. In the case of parallel query, the point of > the lock group stuff is to treat all of those processes as one for > purposes of heavyweight lock acquisition. There's no similar need > here, so the design that makes sure the "leader" is in the list of > processes that are members of the "group" is, AFAICS, just wasted > code. All you really need is a list of processes hung off of the > PGPROC that must abort before the leader is allowed to abort; the > leader itself doesn't need to be in the list, and there's no need to > consider it as a "group". It's just a list of waiters. > But the way I understand it, it pretty much *is* a list of waiters, along with a couple of flags to allow the processes to notify the other side about lock/unlock/abort. 
It does resemble the lock groups, but I don't think it has the same goals. The thing is that the lock/unlock happens for each decoded change independently, and it'd be silly to modify the list all the time, so instead it just sets the decodeLocked flag to true/false. Similarly, when the leader decides to abort, it marks decodeAbortPending and waits for the decoding backends to complete. Of course, that's my understanding/interpretation, and perhaps Nikhil as the patch author has a better explanation. > That having been said, I still don't see how that's really going to > work. Just to take one example, suppose that the leader is trying to > ERROR out, and the decoding workers are blocked waiting for a lock > held by the leader. The system has no way of detecting this deadlock > and resolving it automatically, which certainly seems unacceptable. > The only way that's going to work is if the leader waits for the > worker by trying to acquire a lock held by the worker. Then the > deadlock detector would know to abort some transaction. But that > doesn't really work either - the deadlock was created by the > foreground process trying to abort, and if the deadlock detector > chooses that process as its victim, what then? We're already trying > to abort, and the abort code isn't supposed to throw further errors, > or fail in any way, lest we break all kinds of other things. Not to > mention the fact that running the deadlock detector in the abort path > isn't really safe to begin with, again because we can't throw errors > when we're already in an abort path. > Fair point, not sure. I'll leave this up to Nikhil. > If we're only ever talking about decoding prepared transactions, we > could probably work around all of these problems: have the decoding > process take a heavyweight lock before it begins decoding. Have a > process that wants to execute ROLLBACK PREPARED take a conflicting > heavyweight lock on the same object. The net effect would be that > ROLLBACK PREPARED would simply wait for decoding to finish. That > might be rather lousy from a latency point of view since the > transaction could take an arbitrarily long time to decode, but it > seems safe enough. Possibly you could also design a mechanism for the > ROLLBACK PREPARED command to SIGTERM the processes that are blocking > its lock acquisition, if they are decoding processes. The difference > between this and what the current patch is doing is that nothing > complex or fragile is happening in the abort pathway itself. The > complicated stuff in both the worker and in the main backend happens > while the transaction is still good and can still be rolled back at > need. This kind of approach won't work if you want to decode > transactions that aren't yet prepared, so if that is the long term > goal then we need to think harder. I'm honestly not sure that problem > has any reasonable solution. The assumption that a running process > can abort at any time is deeply baked into many parts of the system > and for good reasons. Trying to undo that is going to be like trying > to push water up a hill. I think we need to install interlocks in > such a way that any waiting happens before we enter the abort path, > not while we're actually trying to perform the abort. But I don't > know how to do that for a foreground task that's still actively doing > stuff.
Unfortunately it's not just for prepared transactions :-( The reason why I'm interested in this capability (decoding in-progress xacts) is that I'd like to use it to stream large transactions before commit, to reduce replication lag due to limited network bandwidth etc. It's also needed for things like speculative apply (starting apply before commit) etc. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 16, 2018 at 3:25 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Oh, right, I forgot the patch also adds the leader into the group, for > some reason (I agree it's unclear why that would be necessary, as you > pointed out later). > > But all this is happening while holding the partition lock (in exclusive > mode). And the decoding backends do synchronize on the lock correctly > (although, man, the rechecks are confusing ...). > > But now I see ProcKill accesses decodeGroupLeader in multiple places, > and only the first one is protected by the lock, for some reason > (interestingly enough the one in lockGroupLeader block). Is that what > you mean? I haven't traced out the control flow completely, but it sure looks to me like there are places where decodeGroupLeader is checked without holding any LWLock at all. Also, it looks to me like some places (like where we're trying to find a PGPROC by XID) we use ProcArrayLock and in others -- I guess where we're checking the decodeGroupBlah stuff -- we are using the lock manager locks. I don't know how safe that is, and there are not a lot of comments justifying it. I also wonder why we're using the lock manager locks to protect the decodeGroup stuff rather than backendLock. > FWIW I suspect the ProcKill part is borked due to incorrectly resolved > merge conflict or something, per my initial response from today. Yeah I wasn't seeing the code the way I thought you were describing it in that response, but I'm dumb this week so maybe I just misunderstood. >> I think that's probably not going to work out, but of course it's up >> to you how you want to spend your time! > > Well, yeah. I'm sure I could think of more fun things to do, but OTOH I > also have patches that require the capability to decode in-progress > transactions. It's not a matter of fun; it's a matter of whether it can be made to work. Don't get me wrong -- I want the ability to decode in-progress transactions. I complained about that aspect of the design to Andres when I was reviewing and committing logical slots & logical decoding, and I complained about it probably more than I complained about any other aspect of it, largely because it instantaneously generates a large lag when a bulk load commits. But not liking something about the way things are is not the same as knowing how to make them better. I believe there is a way to make it work because I believe there's a way to make anything work. But I suspect that it's at least one order of magnitude more complex than this patch currently is, and likely an altogether different design. > But the way I understand it, it pretty much *is* a list of waiters, > along with a couple of flags to allow the processes to notify the other > side about lock/unlock/abort. It does resemble the lock groups, but I > don't think it has the same goals. So the parts that aren't relevant shouldn't be copied over. >> That having been said, I still don't see how that's really going to >> work. Just to take one example, suppose that the leader is trying to >> ERROR out, and the decoding workers are blocked waiting for a lock >> held by the leader. The system has no way of detecting this deadlock >> and resolving it automatically, which certainly seems unacceptable. >> The only way that's going to work is if the leader waits for the >> worker by trying to acquire a lock held by the worker. Then the >> deadlock detector would know to abort some transaction. 
But that >> doesn't really work either - the deadlock was created by the >> foreground process trying to abort, and if the deadlock detector >> chooses that process as its victim, what then? We're already trying >> to abort, and the abort code isn't supposed to throw further errors, >> or fail in any way, lest we break all kinds of other things. Not to >> mention the fact that running the deadlock detector in the abort path >> isn't really safe to begin with, again because we can't throw errors >> when we're already in an abort path. > > Fair point, not sure. I'll leave this up to Nikhil. That's fine, but please understand that I think there's a basic design flaw here that just can't be overcome with any amount of hacking on the details here. I think we need a much higher-level consideration of the problem here and probably a lot of new infrastructure to support it. One idea might be to initially support decoding of in-progress transactions only if they don't modify any catalog state. That would leave out a bunch of cases we'd probably like to support, such as CREATE TABLE + COPY in the same transaction, but it would likely dodge a lot of really hard problems, too, and we could improve things later. One approach to the problem of catalog changes would be to prevent catalog tuples from being removed even after transaction abort until such time as there's no decoding in progress that might care about them. That is not by itself sufficient because a transaction can abort after inserting a heap tuple but before inserting an index tuple and we can't look at the catalog when it's in an inconsistent state like that and expect reasonable results. But it helps: for example, if you are decoding a transaction that has inserted a WAL record with a cmin or cmax value of 4, and you know that none of the catalog records created by that transaction can have been pruned, then it should be safe to use a snapshot with CID 3 or smaller to decode the catalogs. So consider a case like:

BEGIN;
CREATE TABLE blah ...           -- command ID 0
COPY blah FROM '/tmp/blah' ...  -- command ID 1

Once we see the COPY show up in the WAL, it should be safe to decode the CREATE TABLE command and figure out what a snapshot with command ID 0 can see (again, assuming we've suppressed pruning in the catalogs in a sufficiently-well-considered way). Then, as long as the COPY command doesn't do any DML via a trigger or a datatype input function (!) or whatever, we should be able to use that snapshot to decode the data inserted by COPY. I'm not quite sure what happens if the COPY does do some DML or something like that -- we might have to stop decoding until the following command begins in the live transaction, or something like that. Or maybe we don't have to do that. I'm not totally sure how the command counter is managed for catalog snapshots. However it works in detail, we will get into trouble if we ever use a catalog snapshot that can see a change that the live transaction is still in the midst of making. Even with pruning prevented, we can only count on the catalogs to be in a consistent state once the live transaction has finished the command -- otherwise, for example, it might have increased pg_class.relnatts but not yet added the pg_attribute entry at the time it aborts, or something like that.
I'm blathering a little bit but hopefully you get the point: I think the way forward is for somebody to think carefully through how and under what circumstances using a catalog snapshot can be made safe even if an abort has occurred afterwards -- not trying to postpone the abort, which I think is never going to be right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
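To make the command-ID arithmetic above concrete, here is a small sketch; CopyHistoricSnapshot() is a made-up helper, and the rest leans only on the usual rule that a snapshot with curcid == n sees its own transaction's commands 0 .. n-1.

/*
 * Hypothetical sketch: clamp a historic snapshot to the CID found on the
 * WAL record being decoded, so that only commands that had completed by
 * the time that record was written are visible.
 */
#include "postgres.h"
#include "utils/snapshot.h"

static Snapshot
snapshot_for_decoded_record(Snapshot base, CommandId record_cid)
{
    Snapshot    snap = CopyHistoricSnapshot(base);    /* hypothetical */

    /*
     * Commands within one backend run sequentially, so a record stamped
     * with CID 4 proves commands 0 .. 3 had finished; curcid == 4 makes
     * exactly those commands visible and nothing later.
     */
    snap->curcid = record_cid;
    return snap;
}

Whether using such a snapshot is actually safe still hinges on the pruning suppression discussed above; the sketch only illustrates which snapshot one would want, not how to make reading with it safe.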
On 07/17/2018 08:10 PM, Robert Haas wrote: > On Mon, Jul 16, 2018 at 3:25 PM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> Oh, right, I forgot the patch also adds the leader into the group, for >> some reason (I agree it's unclear why that would be necessary, as you >> pointed out later). >> >> But all this is happening while holding the partition lock (in exclusive >> mode). And the decoding backends do synchronize on the lock correctly >> (although, man, the rechecks are confusing ...). >> >> But now I see ProcKill accesses decodeGroupLeader in multiple places, >> and only the first one is protected by the lock, for some reason >> (interestingly enough the one in lockGroupLeader block). Is that what >> you mean? > > I haven't traced out the control flow completely, but it sure looks to > me like there are places where decodeGroupLeader is checked without > holding any LWLock at all. Also, it looks to me like some places > (like where we're trying to find a PGPROC by XID) we use ProcArrayLock > and in others -- I guess where we're checking the decodeGroupBlah > stuff -- we are using the lock manager locks. I don't know how safe > that is, and there are not a lot of comments justifying it. I also > wonder why we're using the lock manager locks to protect the > decodeGroup stuff rather than backendLock. > >> FWIW I suspect the ProcKill part is borked due to incorrectly resolved >> merge conflict or something, per my initial response from today. > > Yeah I wasn't seeing the code the way I thought you were describing it > in that response, but I'm dumb this week so maybe I just > misunderstood. > >>> I think that's probably not going to work out, but of course it's up >>> to you how you want to spend your time! >> >> Well, yeah. I'm sure I could think of more fun things to do, but OTOH I >> also have patches that require the capability to decode in-progress >> transactions. > > It's not a matter of fun; it's a matter of whether it can be made to > work. Don't get me wrong -- I want the ability to decode in-progress > transactions. I complained about that aspect of the design to Andres > when I was reviewing and committing logical slots & logical decoding, > and I complained about it probably more than I complained about any > other aspect of it, largely because it instantaneously generates a > large lag when a bulk load commits. But not liking something about > the way things are is not the same as knowing how to make them better. > I believe there is a way to make it work because I believe there's a > way to make anything work. But I suspect that it's at least one order > of magnitude more complex than this patch currently is, and likely an > altogether different design. > Sure, it may turn out not to work - but how do you know until you try? We have a well known theater play here, where of the actors is blowing tobacco smoke into the sink, to try if gold can be created that way. Which is foolish, but his reasoning is "Someone had to try, to be sure!" So we're in the phase of blowing tobacco smoke, kinda ;-) Also, you often discover solutions while investigating approaches that seem to be unworkable initially. Or entirely new approaches. It sure happened to me, many times. There's a great book/movie "Touching the Void" [1] about a climber falling into a deep crevasse. Unable to climb up, he decides to crawl down - which is obviously foolish, but he happens to find a way out. 
I suppose we're kinda doing the same thing here - crawling down a crevasse (while still smoking and blowing the tobacco smoke into a sink, which we happened to find in the crevasse or something). Anyway, I have no clear idea what changes would be necessary to the original design of logical decoding to make implementing this easier now. The decoding in general is quite constrained by how our transam and WAL stuff works. I suppose Andres thought about this aspect, and I guess he concluded that (a) it's not needed for v1, and (b) adding it later will require about the same effort. So in the "better" case we'd end up waiting for logical decoding much longer, in the worse case we would not have it at all. >> But the way I understand it, it pretty much *is* a list of waiters, >> along with a couple of flags to allow the processes to notify the other >> side about lock/unlock/abort. It does resemble the lock groups, but I >> don't think it has the same goals. > > So the parts that aren't relevant shouldn't be copied over. > I'm not sure which parts aren't relevant, but in general I agree that stuff that is not necessary should not be copied over. >>> That having been said, I still don't see how that's really going to >>> work. Just to take one example, suppose that the leader is trying to >>> ERROR out, and the decoding workers are blocked waiting for a lock >>> held by the leader. The system has no way of detecting this deadlock >>> and resolving it automatically, which certainly seems unacceptable. >>> The only way that's going to work is if the leader waits for the >>> worker by trying to acquire a lock held by the worker. Then the >>> deadlock detector would know to abort some transaction. But that >>> doesn't really work either - the deadlock was created by the >>> foreground process trying to abort, and if the deadlock detector >>> chooses that process as its victim, what then? We're already trying >>> to abort, and the abort code isn't supposed to throw further errors, >>> or fail in any way, lest we break all kinds of other things. Not to >>> mention the fact that running the deadlock detector in the abort path >>> isn't really safe to begin with, again because we can't throw errors >>> when we're already in an abort path. >> >> Fair point, not sure. I'll leave this up to Nikhil. > > That's fine, but please understand that I think there's a basic design > flaw here that just can't be overcome with any amount of hacking on > the details here. I think we need a much higher-level consideration > of the problem here and probably a lot of new infrastructure to > support it. One idea might be to initially support decoding of > in-progress transactions only if they don't modify any catalog state. The problem is you don't know if a transaction does DDL sometime later, in the part that you might not have decoded yet (or perhaps concurrently with the decoding). So I don't see how you could easily exclude such transactions from the decoding ... > That would leave out a bunch of cases we'd probably like to support, > such as CREATE TABLE + COPY in the same transaction, but it would > likely dodge a lot of really hard problems, too, and we could improve > things later. One approach to the problem of catalog changes would be > to prevent catalog tuples from being removed even after transaction > abort until such time as there's no decoding in progress that might > care about them. 
That is not by itself sufficient because a > transaction can abort after inserting a heap tuple but before > inserting an index tuple and we can't look at the catalog when it's in an > inconsistent state like that and expect reasonable results. But it > helps: for example, if you are decoding a transaction that has > inserted a WAL record with a cmin or cmax value of 4, and you know > that none of the catalog records created by that transaction can have > been pruned, then it should be safe to use a snapshot with CID 3 or > smaller to decode the catalogs. So consider a case like:
>
> BEGIN;
> CREATE TABLE blah ...           -- command ID 0
> COPY blah FROM '/tmp/blah' ...  -- command ID 1
>
> Once we see the COPY show up in the WAL, it should be safe to decode > the CREATE TABLE command and figure out what a snapshot with command > ID 0 can see (again, assuming we've suppressed pruning in the catalogs > in a sufficiently-well-considered way). Then, as long as the COPY > command doesn't do any DML via a trigger or a datatype input function > (!) or whatever, we should be able to use that snapshot to decode the > data inserted by COPY. One obvious issue with this is that it actually does not help with reducing the replication lag, which is about the main goal of this whole effort. Because if the COPY is a big data load, waiting until after the COPY completes gives us pretty much nothing. > I'm not quite sure what happens if the COPY > does do some DML or something like that -- we might have to stop > decoding until the following command begins in the live transaction, > or something like that. Or maybe we don't have to do that. I'm not > totally sure how the command counter is managed for catalog snapshots. > However it works in detail, we will get into trouble if we ever use a > catalog snapshot that can see a change that the live transaction is > still in the midst of making. Even with pruning prevented, we can > only count on the catalogs to be in a consistent state once the live > transaction has finished the command -- otherwise, for example, it > might have increased pg_class.relnatts but not yet added the > pg_attribute entry at the time it aborts, or something like that. I'm > blathering a little bit but hopefully you get the point: I think the > way forward is for somebody to think carefully through how and under > what circumstances using a catalog snapshot can be made safe even if > an abort has occurred afterwards -- not trying to postpone the abort, > which I think is never going to be right. > But isn't this (delaying the catalog cleanup etc.) pretty much the original approach, implemented by the original patch? Which you also claimed to be unworkable, IIRC? Or how is this addressing the problems with broken HOT chains, for example? Those issues were pretty much the reason why we started looking at alternative approaches, like delaying the abort ... I wonder if disabling HOT on catalogs with wal_level=logical would be an option here. I'm not sure how important HOT on catalogs is, in practice (it surely does not help with the typical catalog bloat issue, which is temporary tables, because that's mostly insert+delete). I suppose we could disable it only when there's a replication slot indicating support for decoding of in-progress transactions, so that you still get HOT with plain logical decoding. I'm sure there will be other obstacles, not just the HOT chain stuff, but it would mean one step closer to a solution.
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 18, 2018 at 10:08 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > The problem is you don't know if a transaction does DDL sometime later, in > the part that you might not have decoded yet (or perhaps concurrently with > the decoding). So I don't see how you could easily exclude such transactions > from the decoding ... One idea is that maybe the running transaction could communicate with the decoding process through shared memory. For example, suppose that before you begin decoding an ongoing transaction, you have to send some kind of notification to the process saying "hey, I'm going to start decoding you" and wait for that process to acknowledge receipt of that message (say, at the next CFI). Once it acknowledges receipt, you can begin decoding. Then, we're guaranteed that the foreground process knows when it must be careful about catalog changes. If it's going to make one, it sends a note to the decoding process and says, hey, sorry, I'm about to do catalog changes, please pause decoding. Once it gets an acknowledgement that decoding has paused, it continues its work. Decoding resumes after commit (or maybe earlier if it's provably safe). > But isn't this (delaying the catalog cleanup etc.) pretty much the original > approach, implemented by the original patch? Which you also claimed to be > unworkable, IIRC? Or how is this addressing the problems with broken HOT > chains, for example? Those issues were pretty much the reason why we started > looking at alternative approaches, like delaying the abort ... I don't think so. The original approach, IIRC, was to decode after the abort had already happened, and my objection was that you can't rely on the state of anything at that point. The approach here is to wait until the abort is in progress and then basically pause it while we try to read stuff, but that seems similarly riddled with problems. The newer approach could be considered an improvement in that you've tried to get your hands around the problem at an earlier point, but it's not early enough. To take a very rough analogy, the original approach was like trying to install a sprinkler system after the building had already burned down, while the new approach is like trying to install a sprinkler system when you notice that the building is on fire. But we need to install the sprinkler system in advance. That is, we need to make all of the necessary preparations for a possible abort before the abort occurs. That could perhaps be done by arranging things so that decoding after an abort is actually still safe (e.g. by making it look to certain parts of the system as though the aborted transaction is still in progress until decoding no longer cares about it) or by making sure that we are never decoding at the point where a problematic abort happens (e.g. as proposed above, pause decoding before doing dangerous things). > I wonder if disabling HOT on catalogs with wal_level=logical would be an > option here. I'm not sure how important HOT on catalogs is, in practice (it > surely does not help with the typical catalog bloat issue, which is > temporary tables, because that's mostly insert+delete). I suppose we could > disable it only when there's a replication slot indicating support for > decoding of in-progress transactions, so that you still get HOT with plain > logical decoding. Are you talking about HOT updates, or HOT pruning?
Disabling the former wouldn't help, and disabling the latter would break VACUUM, which assumes that any tuple not removed by HOT pruning is not a dead tuple (cf. 1224383e85eee580a838ff1abf1fdb03ced973dc, which was caused by a case where that wasn't true). > I'm sure there will be other obstacles, not just the HOT chain stuff, but it > would mean one step closer to a solution. Right. Here's a crazy idea. Instead of disabling HOT pruning or anything like that, have the decoding process advertise the XID of the transaction being decoded as its own XID in its PGPROC. Also, using magic, acquire a lock on that XID even though the foreground transaction already holds that lock in exclusive mode. Fix the code (and I'm pretty sure there is some) that relies on an XID appearing in the procarray only once to no longer make that assumption. Then, if the foreground process aborts, it will appear to the rest of the system that it's still running, so HOT pruning won't remove the XID, CLOG won't get truncated, people who are waiting to update a tuple updated by the aborted transaction will keep waiting, etc. We know that we do the right thing for running transactions, so if we make this aborted transaction look like it is running and are sufficiently convincing about the way we do that, then it should also work. That seems more likely to be able to be made robust than addressing specific problems (e.g. a tuple might get removed!) one by one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
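A very rough sketch of the "advertise the XID" idea, with loud caveats: MyPgXact->xid is normally managed only by ProcArray internals, the duplicate-lock helper below does not exist, and nothing here addresses the procarray-uniqueness assumptions mentioned above. It is only meant to make the shape of the idea visible.

/*
 * Hypothetical sketch only: have the decoding process advertise the XID
 * it is decoding, so the rest of the system keeps treating that XID as
 * running even if the foreground transaction aborts.
 */
#include "postgres.h"
#include "storage/proc.h"

static void
advertise_decoded_xid(TransactionId xid)
{
    /* pretend the decoded transaction is ours (a gross simplification) */
    MyPgXact->xid = xid;

    /*
     * "Using magic, acquire a lock on that XID even though the foreground
     * transaction already holds that lock in exclusive mode" -- the magic
     * is represented by a made-up helper.
     */
    XactLockTableInsertDuplicate(xid);    /* hypothetical */
}

On finishing or abandoning the decode, the process would clear the advertisement again, at which point HOT pruning, CLOG truncation and friends become free to act on the aborted XID.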
On 07/18/2018 04:56 PM, Robert Haas wrote: > On Wed, Jul 18, 2018 at 10:08 AM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> The problem is you don't know if a transaction does DDL sometime later, in >> the part that you might not have decoded yet (or perhaps concurrently with >> the decoding). So I don't see how you could easily exclude such transactions >> from the decoding ... > > One idea is that maybe the running transaction could communicate with > the decoding process through shared memory. For example, suppose that > before you begin decoding an ongoing transaction, you have to send > some kind of notification to the process saying "hey, I'm going to > start decoding you" and wait for that process to acknowledge receipt > of that message (say, at the next CFI). Once it acknowledges receipt, > you can begin decoding. Then, we're guaranteed that the foreground > process knows when it must be careful about catalog changes. If > it's going to make one, it sends a note to the decoding process and > says, hey, sorry, I'm about to do catalog changes, please pause > decoding. Once it gets an acknowledgement that decoding has paused, > it continues its work. Decoding resumes after commit (or maybe > earlier if it's provably safe). > Let's assume a running transaction is holding an exclusive lock on something. We start decoding it and do this little dance with sending messages, confirmations etc. The decoding starts, and the plugin asks for the same lock (and starts waiting). Then the transaction decides to do some catalog changes, and sends a "pause" message to the decoding. Who's going to respond, considering the decoding is waiting for the lock (and it's not easy to jump out, because it might be deep inside the output plugin, i.e. deep in some extension). >> But isn't this (delaying the catalog cleanup etc.) pretty much the original >> approach, implemented by the original patch? Which you also claimed to be >> unworkable, IIRC? Or how is this addressing the problems with broken HOT >> chains, for example? Those issues were pretty much the reason why we started >> looking at alternative approaches, like delaying the abort ... > > I don't think so. The original approach, IIRC, was to decode after > the abort had already happened, and my objection was that you can't > rely on the state of anything at that point. Pretty much, yes. Clearly there needs to be some sort of coordination between the transaction and decoding process ... > The approach here is to > wait until the abort is in progress and then basically pause it while > we try to read stuff, but that seems similarly riddled with problems. Yeah :-( > The newer approach could be considered an improvement in that you've > tried to get your hands around the problem at an earlier point, but > it's not early enough. To take a very rough analogy, the original > approach was like trying to install a sprinkler system after the > building had already burned down, while the new approach is like > trying to install a sprinkler system when you notice that the building > is on fire. When an oil well is burning, they detonate a small bomb next to it to extinguish it. What would be the analogy to that, here? pg_resetwal? ;-) > But we need to install the sprinkler system in advance. Damn causality! > That is, we need to make all of the necessary preparations for a > possible abort before the abort occurs. That could perhaps be done by > arranging things so that decoding after an abort is actually still > safe (e.g.
by making it look to certain parts of the system as though > the aborted transaction is still in progress until decoding no longer > cares about it) or by making sure that we are never decoding at the > point where a problematic abort happens (e.g. as proposed above, pause > decoding before doing dangerous things). > >> I wonder if disabling HOT on catalogs with wal_level=logical would be an >> option here. I'm not sure how important HOT on catalogs is, in practice (it >> surely does not help with the typical catalog bloat issue, which is >> temporary tables, because that's mostly insert+delete). I suppose we could >> disable it only when there's a replication slot indicating support for >> decoding of in-progress transactions, so that you still get HOT with plain >> logical decoding. > > Are you talking about HOT updates, or HOT pruning? Disabling the > former wouldn't help, and disabling the latter would break VACUUM, > which assumes that any tuple not removed by HOT pruning is not a dead > tuple (cf. 1224383e85eee580a838ff1abf1fdb03ced973dc, which was caused > by a case where that wasn't true). > I'm talking about the issue you described here: https://www.postgresql.org/message-id/CA+TgmoZP0SxEfKW1Pn=ackUj+KdWCxs7PumMAhSYJeZ+_61_GQ@mail.gmail.com >> I'm sure there will be other obstacles, not just the HOT chain stuff, but it >> would mean one step closer to a solution. > > Right. > > Here's a crazy idea. Instead of disabling HOT pruning or anything > like that, have the decoding process advertise the XID of the > transaction being decoded as its own XID in its PGPROC. Also, using > magic, acquire a lock on that XID even though the foreground > transaction already holds that lock in exclusive mode. Fix the code > (and I'm pretty sure there is some) that relies on an XID appearing in > the procarray only once to no longer make that assumption. Then, if > the foreground process aborts, it will appear to the rest of the > system that it's still running, so HOT pruning won't remove the > XID, CLOG won't get truncated, people who are waiting to update a > tuple updated by the aborted transaction will keep waiting, etc. We > know that we do the right thing for running transactions, so if we > make this aborted transaction look like it is running and are > sufficiently convincing about the way we do that, then it should also > work. That seems more likely to be able to be made robust than > addressing specific problems (e.g. a tuple might get removed!) one by > one. > A dumb question - would this work with subtransaction-level aborts? I mean, a transaction that does some catalog changes in a subxact which then however aborts, but then still continues. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 18, 2018 at 11:27 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> One idea is that maybe the running transaction could communicate with >> the decoding process through shared memory. For example, suppose that >> before you begin decoding an ongoing transaction, you have to send >> some kind of notification to the process saying "hey, I'm going to >> start decoding you" and wait for that process to acknowledge receipt >> of that message (say, at the next CFI). Once it acknowledges receipt, >> you can begin decoding. Then, we're guaranteed that the foreground >> process knows when it must be careful about catalog changes. If >> it's going to make one, it sends a note to the decoding process and >> says, hey, sorry, I'm about to do catalog changes, please pause >> decoding. Once it gets an acknowledgement that decoding has paused, >> it continues its work. Decoding resumes after commit (or maybe >> earlier if it's provably safe). > Let's assume a running transaction is holding an exclusive lock on something. > We start decoding it and do this little dance with sending messages, > confirmations etc. The decoding starts, and the plugin asks for the same > lock (and starts waiting). Then the transaction decides to do some catalog > changes, and sends a "pause" message to the decoding. Who's going to > respond, considering the decoding is waiting for the lock (and it's not easy > to jump out, because it might be deep inside the output plugin, i.e. deep in > some extension). I think it's inevitable that any solution that is based on pausing decoding might have to wait for a theoretically unbounded time for decoding to get back to a point where it can safely pause. That is one of several reasons why I don't believe that any solution based on holding off aborts has any chance of being acceptable -- mid-abort is a terrible time to pause. Now, if the time is not only theoretically unbounded but also in practice likely to be very long (e.g. the foreground transaction could easily have to wait minutes for the decoding process to be able to process the pause request), then this whole approach is probably not going to work. If, on the other hand, the time is theoretically unbounded but in practice likely to be no more than a few seconds in almost every case, then we might have something. I don't know which is the case. It probably depends on where you put the code to handle pause requests, and I'm not sure what options are viable. For example, if there's a loop that eats WAL records one at a time, and we can safely pause after any given iteration of that loop, that sounds pretty good, unless a single iteration of that loop might hang inside of a network I/O, in which case it sounds ... less good, probably? But there might be ways around that, too, like ... could we pause at the next CFI? I don't understand the constraints well enough to comment intelligently here. >> The newer approach could be considered an improvement in that you've >> tried to get your hands around the problem at an earlier point, but >> it's not early enough. To take a very rough analogy, the original >> approach was like trying to install a sprinkler system after the >> building had already burned down, while the new approach is like >> trying to install a sprinkler system when you notice that the building >> is on fire. > > When an oil well is burning, they detonate a small bomb next to it to > extinguish it. What would be the analogy to that, here? pg_resetwal? ;-) Yep.
:-) >> But we need to install the sprinkler system in advance. > > Damn causality! I know, right? >> Are you talking about HOT updates, or HOT pruning? Disabling the >> former wouldn't help, and disabling the latter would break VACUUM, >> which assumes that any tuple not removed by HOT pruning is not a dead >> tuple (cf. 1224383e85eee580a838ff1abf1fdb03ced973dc, which was caused >> by a case where that wasn't true). > > I'm talking about the issue you described here: > > https://www.postgresql.org/message-id/CA+TgmoZP0SxEfKW1Pn=ackUj+KdWCxs7PumMAhSYJeZ+_61_GQ@mail.gmail.com There are several issues there. The second and third ones boil down to this: As soon as the system thinks that your transaction is no longer in process, it is going to start making decisions based on whether that transaction committed or aborted. If it thinks your transaction aborted, it is going to feel entirely free to make decisions that permanently lose information -- like removing tuples or overwriting CTIDs or truncating CLOG or killing index entries. I doubt it makes any sense to try to fix each of those problems individually -- if we're going to do something about this, it had better be broad enough to nail all or nearly all of the problems in this area in one fell swoop. The first issue in that email is different. That's really about the possibility that the aborted transaction itself has created chaos, whereas the other ones are about the chaos that the rest of the system might impose based on the belief that the transaction is no longer needed for anything after an abort has occurred. > A dumb question - would this work with subtransaction-level aborts? I mean, > a transaction that does some catalog changes in a subxact which then however > aborts, but then still continues. Well, I would caution you against relying on me to design this for you. The fact that I can identify the pitfalls of trying to install a sprinkler system while the building is on fire does not mean that I know what diameter of pipe should be used to provide for proper fire containment. It's really important that this gets designed by someone who knows -- or learns -- enough to make it really good and safe. Replacing obvious problems (the building has already burned down!) with subtler problems (the water pressure is insufficient to reach the upper stories!) might get the patch committed, but that's not the right goal. That having been said, I cannot immediately see any reason why the idea that I sketched there couldn't be made to work just as well or poorly for subtransactions as it would for toplevel transactions. I don't really know that it will work even for toplevel transactions -- that would require more thought and careful study than I've given it (or, given that this is not my patch, feel that I should need to give it). However, if it does, and if there are no other problems that I've missed in thinking casually about it, then I think it should be possible to make it work for subtransactions, too. Likely, as the decoding process first encountered each new sub-XID, it would need to magically acquire a duplicate lock and advertise the subxid just as it did for the toplevel XID, so that at any given time the set of XIDs advertised by the decoding process would be a subset (not necessarily proper) of the set advertised by the foreground process. 
To try to be a little clearer about my overall position, I am suggesting that you (1) abandon the current approach and (2) make sure that everything is done by making sufficient preparations in advance of any abort rather than trying to cope after it's already started. I am also suggesting that, to get there, it might be helpful to (a) contemplate communication and active cooperation between the running process and the decoding process(es), but it might turn out not to be needed and I don't know exactly what needs to be communicated, (b) consider whether there's a reasonable way to make it look to other parts of the system like the aborted transaction is still running, but this also might turn out not to be the right approach, (c) consider whether logical decoding already does or can be made to use historical catalog snapshots that only see command IDs prior to the current one so that incompletely-made changes by the last CID aren't seen if an abort happens. I think there is a good chance that a full solution involves more than one of these things, and maybe some other things I haven't thought about. These are ideas, not a plan. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
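One way to picture the "pause after any given iteration" option raised earlier in this exchange: a sketch of a check run between WAL records, where no catalog access is in flight. The shared struct, its flags, and where such state would live are all invented for illustration.

/*
 * Hypothetical sketch: pausing decoding at a safe point.  The foreground
 * backend sets pause_requested, waits for paused to become set, performs
 * its catalog changes, then clears pause_requested and broadcasts
 * resume_cv to let decoding continue.
 */
#include "postgres.h"
#include "pgstat.h"
#include "port/atomics.h"
#include "storage/condition_variable.h"

typedef struct DecodingPauseShared      /* invented for this sketch */
{
    pg_atomic_uint32    pause_requested;
    pg_atomic_uint32    paused;
    ConditionVariable   resume_cv;
} DecodingPauseShared;

static void
maybe_pause_decoding(DecodingPauseShared *shared)
{
    /* called between records, so no catalog access is in progress */
    while (pg_atomic_read_u32(&shared->pause_requested) != 0)
    {
        /* acknowledge: we really have stopped at a safe point */
        pg_atomic_write_u32(&shared->paused, 1);
        ConditionVariableSleep(&shared->resume_cv, PG_WAIT_EXTENSION);
    }
    ConditionVariableCancelSleep();
    pg_atomic_write_u32(&shared->paused, 0);
}

If a single loop iteration can block in network I/O or deep inside the output plugin, this check alone is not enough -- which is exactly the concern raised above.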
Hi Robert and Tomas, It seems clear to me that the decodeGroup list of decoding backends waiting on the backend doing the transaction of interest is not a favored approach here. Note that I came down to this approach after trying various other approaches/iterations. I was especially enthused to see the lockGroupLeader implementation in the code and based this decodeGroup implementation on the same premise, although our requirements are simply to have a list of waiters in the main transaction backend process. Sure, there might be some issues related to locking in the code, and am willing to try and work them out. However, if the decodeGroup approach of interlocking abort processing with the decoding backends is itself considered suspect, then it might be another waste of time. > I think it's inevitable that any solution that is based on pausing > decoding might have to wait for a theoretically unbounded time for > decoding to get back to a point where it can safely pause. That is > one of several reasons why I don't believe that any solution based on > holding off aborts has any chance of being acceptable -- mid-abort is > a terrible time to pause. Now, if the time is not only theoretically > unbounded but also in practice likely to be very long (e.g. the > foreground transaction could easily have to wait minutes for the > decoding process to be able to process the pause request), then this > whole approach is probably not going to work. If, on the other hand, > the time is theoretically unbounded but in practice likely to be no > more than a few seconds in almost every case, then we might have > something. I don't know which is the case. We have tried to minimize the pausing requirements by holding the "LogicalLock" only when the decoding activity needs to access catalog tables. The decoding goes ahead only if it gets the logical lock, reads the catalog and unlocks immediately. If the decoding backend does not get the "LogicalLock" then it stops decoding the current transaction. So, the time to pause is pretty short in practical scenarios. > It probably depends on > where you put the code to handle pause requests, and I'm not sure what > options are viable. For example, if there's a loop that eats WAL > records one at a time, and we can safely pause after any given > iteration of that loop, that sounds pretty good, unless a single > iteration of that loop might hang inside of a network I/O, in which > case it sounds ... less good, probably? It's for the above scenarios of not waiting inside network I/O that we lock only before doing catalog access as described above. > There are several issues there. The second and third ones boil down > to this: As soon as the system thinks that your transaction is no > longer in process, it is going to start making decisions based on > whether that transaction committed or aborted. If it thinks your > transaction aborted, it is going to feel entirely free to make > decisions that permanently lose information -- like removing tuples or > overwriting CTIDs or truncating CLOG or killing index entries. I > doubt it makes any sense to try to fix each of those problems > individually -- if we're going to do something about this, it had > better be broad enough to nail all or nearly all of the problems in > this area in one fell swoop. Agreed, this was the crux of the issues. Decisions that cause permanent loss of information regardless of the ongoing decoding happening around that transaction was what led us down this rabbit hole in the first place.
>> A dumb question - would this work with subtransaction-level aborts? I mean, >> a transaction that does some catalog changes in a subxact which then however >> aborts, but then still continues. > > That having been said, I cannot immediately see any reason why the > idea that I sketched there couldn't be made to work just as well or > poorly for subtransactions as it would for toplevel transactions. I > don't really know that it will work even for toplevel transactions -- > that would require more thought and careful study than I've given it > (or, given that this is not my patch, feel that I should need to give > it). However, if it does, and if there are no other problems that > I've missed in thinking casually about it, then I think it should be > possible to make it work for subtransactions, too. Likely, as the > decoding process first encountered each new sub-XID, it would need to > magically acquire a duplicate lock and advertise the subxid just as it > did for the toplevel XID, so that at any given time the set of XIDs > advertised by the decoding process would be a subset (not necessarily > proper) of the set advertised by the foreground process. > Am ready to go back to the drawing board and have another stab at this pesky little large issue :-) > To try to be a little clearer about my overall position, I am > suggesting that you (1) abandon the current approach and (2) make sure > that everything is done by making sufficient preparations in advance > of any abort rather than trying to cope after it's already started. I > am also suggesting that, to get there, it might be helpful to (a) > contemplate communication and active cooperation between the running > process and the decoding process(es), but it might turn out not to be > needed and I don't know exactly what needs to be communicated, (b) > consider whether there's a reasonable way to make it look to other > parts of the system like the aborted transaction is still running, but > this also might turn out not to be the right approach, (c) consider > whether logical decoding already does or can be made to use historical > catalog snapshots that only see command IDs prior to the current one > so that incompletely-made changes by the last CID aren't seen if an > abort happens. I think there is a good chance that a full solution > involves more than one of these things, and maybe some other things I > haven't thought about. These are ideas, not a plan. > I will think more on the above lines and see if we can get something workable. Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2018-07-18 10:56:31 -0400, Robert Haas wrote: > Are you talking about HOT updates, or HOT pruning? Disabling the > former wouldn't help, and disabling the latter would break VACUUM, > which assumes that any tuple not removed by HOT pruning is not a dead > tuple (cf. 1224383e85eee580a838ff1abf1fdb03ced973dc, which was caused > by a case where that wasn't true). I don't think this reasoning actually applies for making HOT pruning weaker as necessary for decoding. The xmin horizon on catalog tables is already pegged, which'd prevent similar problems. There's already plenty of cases where dead tuples, if they only recently became so, are not removed by the time vacuumlazy.c processes the tuple. I actually think that, weighing all the solutions discussed in this thread, neutering pruning *a bit* seems by far the most palatable solution. We don't need to fully prevent removal of such tuple chains, it's sufficient that we can detect that a tuple has been removed. A large-sledgehammer approach would be to just error out when attempting to read such a tuple. The existing error handling logic can relatively easily be made to work with that. Greetings, Andres Freund
Hi, On 2018-07-18 16:08:37 +0200, Tomas Vondra wrote: > Anyway, I have no clear idea what changes would be necessary to the original > design of logical decoding to make implementing this easier now. The > decoding in general is quite constrained by how our transam and WAL stuff > works. I suppose Andres thought about this aspect, and I guess he concluded > that (a) it's not needed for v1, and (b) adding it later will require about > the same effort. So in the "better" case we'd end up waiting for logical > decoding much longer, in the worse case we would not have it at all. I still don't really see an alternative that'd have been (or even *is*) realistically doable. Greetings, Andres Freund
Hi, On 2018-07-19 12:42:08 -0700, Andres Freund wrote: > I actually think that, weighing all the solutions discussed in this > thread, neutering pruning *a bit* seems by far the most palatable > solution. We don't need to fully prevent removal of such tuple chains, > it's sufficient that we can detect that a tuple has been removed. A > large-sledgehammer approach would be to just error out when attempting > to read such a tuple. The existing error handling logic can relatively > easily be made to work with that. So. I'm just back from not working for a few days. I've not followed this discussion in all its detail over the last months. I've an annoying bout of allergies. So I might be entirely off. I think this whole issue only exists if we actually end up doing catalog lookups, not if there's only cached lookups (otherwise our invalidation handling is entirely borked). And we should normally do cached lookups for a large percentage of the cases. Therefore we can make the cache-miss cases a bit slower. So what if we, at the begin / end of cache miss handling, re-check if the to-be-decoded transaction is still in-progress (or has committed). And we throw an error if that happened. That error is then caught in reorderbuffer, the in-progress-xact aborted callback is called, and processing continues (there's a couple nontrivial details here, but it should be doable). The biggest issue is what constitutes a "cache miss". It's fairly trivial to do this for syscache / relcache, but that's not sufficient: there's plenty of cases where catalogs are accessed without going through either. But as far as I can tell if we declared that all historic accesses have to go through systable_beginscan* - which'd imo not be a crazy restriction - we could put the checks at that layer. That'd require that an index lookup can't crash if the corresponding heap entry doesn't exist (etc), but that's something we need to handle anyway. The issue that multiple separate catalog lookups need to be coherent (say, Robert's example where pg_class exists but pg_attribute doesn't) is solved by virtue of the pg_attribute lookups failing if the transaction aborted. Am I missing something here? Greetings, Andres Freund
Hi Andres, > So what if we, at the begin / end of cache miss handling, re-check if > the to-be-decoded transaction is still in-progress (or has > committed). And we throw an error if that happened. That error is then > caught in reorderbuffer, the in-progress-xact aborted callback is > called, and processing continues (there's a couple nontrivial details > here, but it should be doable). > > The biggest issue is what constitutes a "cache miss". It's fairly > trivial to do this for syscache / relcache, but that's not sufficient: > there's plenty of cases where catalogs are accessed without going through > either. But as far as I can tell if we declared that all historic > accesses have to go through systable_beginscan* - which'd imo not be a > crazy restriction - we could put the checks at that layer.

Documenting that historic accesses go through systable_* APIs does seem reasonable. In our earlier discussions, we felt asking plugin writers to do anything along these lines was too onerous and cumbersome to expect.

> That'd require that an index lookup can't crash if the corresponding > heap entry doesn't exist (etc), but that's something we need to handle > anyway. The issue that multiple separate catalog lookups need to be > coherent (say, Robert's example where pg_class exists but pg_attribute > doesn't) is solved by virtue of the pg_attribute lookups failing if > the transaction aborted. > > Am I missing something here?

Are you suggesting we have a:

PG_TRY()
{
    Catalog_Access();
}
PG_CATCH()
{
    Abort_Handling();
}

here? Regards, Nikhils
On 2018-07-20 12:13:19 +0530, Nikhil Sontakke wrote: > Hi Andres, > > > > So what if we, at the begin / end of cache miss handling, re-check if > > the to-be-decoded transaction is still in-progress (or has > > committed). And we throw an error if that happened. That error is then > > caught in reorderbuffer, the in-progress-xact aborted callback is > > called, and processing continues (there's a couple nontrivial details > > here, but it should be doable). > > > > The biggest issue is what constitutes a "cache miss". It's fairly > > trivial to do this for syscache / relcache, but that's not sufficient: > > there's plenty of cases where catalogs are accessed without going through > > either. But as far as I can tell if we declared that all historic > > accesses have to go through systable_beginscan* - which'd imo not be a > > crazy restriction - we could put the checks at that layer. > > Documenting that historic accesses go through systable_* APIs does > seem reasonable. In our earlier discussions, we felt asking plugin > writers to do anything along these lines was too onerous and > cumbersome to expect.

But they don't really need to do that - in just about all cases access "automatically" goes through systable_* or layers above. If you call output functions, do syscache lookups, etc you're good.

> > That'd require that an index lookup can't crash if the corresponding > > heap entry doesn't exist (etc), but that's something we need to handle > > anyway. The issue that multiple separate catalog lookups need to be > > coherent (say, Robert's example where pg_class exists but pg_attribute > > doesn't) is solved by virtue of the pg_attribute lookups failing if > > the transaction aborted. > > > > Am I missing something here? > > Are you suggesting we have a:
>
> PG_TRY()
> {
>     Catalog_Access();
> }
> PG_CATCH()
> {
>     Abort_Handling();
> }
>
> here?

Not quite, no. Basically, in a simplified manner, the logical decoding loop is like:

while (true)
    record = readRecord()
    logical = decodeRecord()

    PG_TRY():
        StartTransactionCommand();

        switch (TypeOf(logical))
            case INSERT:
                insert_callback(logical);
                break;
            ...

        CommitTransactionCommand();

    PG_CATCH():
        AbortCurrentTransaction();
        PG_RE_THROW();

what I'm proposing is that various catalog access functions throw a new class of error, something like "decoding aborted transactions". The PG_CATCH() above would then not unconditionally re-throw, but set a flag and continue iff that class of error was detected.

while (true)
    if (in_progress_xact_abort_pending)
        StartTransactionCommand();
        in_progress_xact_abort_callback(made_up_record);
        in_progress_xact_abort_pending = false;
        CommitTransactionCommand();

    record = readRecord()
    logical = decodeRecord()

    PG_TRY():
        StartTransactionCommand();

        switch (TypeOf(logical))
            case INSERT:
                insert_callback(logical);
                break;
            ...

        CommitTransactionCommand();

    PG_CATCH():
        AbortCurrentTransaction();
        if (errclass == DECODING_ABORTED_XACT)
            in_progress_xact_abort_pending = true;
            continue;
        else
            PG_RE_THROW();

Now obviously that's just pseudo code with lotsa things missing, but I think the basic idea should come through? Greetings, Andres Freund
Hi Andres,

>> > That'd require that an index lookup can't crash if the corresponding
>> > heap entry doesn't exist (etc), but that's something we need to handle
>> > anyway. The issue that multiple separate catalog lookups need to be
>> > coherent (say Robert's example where pg_class exists, but pg_attribute
>> > doesn't) is solved by virtue of the pg_attribute lookups failing if
>> > the transaction aborted.

> Not quite, no. Basically, in a simplified manner, the logical decoding
> loop is like:
>
> while (true)
>     record = readRecord()
>     logical = decodeRecord()
>
>     PG_TRY():
>         StartTransactionCommand();
>
>         switch (TypeOf(logical))
>             case INSERT:
>                 insert_callback(logical);
>                 break;
>             ...
>
>         CommitTransactionCommand();
>
>     PG_CATCH():
>         AbortCurrentTransaction();
>         PG_RE_THROW();
>
> what I'm proposing is that various catalog access functions throw a
> new class of error, something like "decoding aborted transactions".

When will this error be thrown by the catalog functions? How will it
determine that it needs to throw this error?

>     PG_CATCH():
>         AbortCurrentTransaction();
>         if (errclass == DECODING_ABORTED_XACT)
>            in_progress_xact_abort_pending = true;
>            continue;
>         else
>            PG_RE_THROW();
>
> Now obviously that's just pseudo code with lotsa things missing, but I
> think the basic idea should come through?
>

How do we handle the cases where the catalog returns inconsistent data
(without erroring out) which does not help with the ongoing decoding?
Consider for example:

BEGIN;
/* CONSIDER T1 has one column C1 */
ALTER TABLE T1 ADD COLUMN c2 int;   -- column type illustrative
INSERT INTO T1 (c2) VALUES (1);     -- value illustrative
PREPARE TRANSACTION 'test1';        -- GID illustrative

If we abort the above 2PC and the catalog row for the ALTER gets
cleaned up by vacuum, then the catalog read will return us T1 with one
column C1. The catalog scan will NOT error out but will return
metadata which causes the insert-decoding change apply callback to
error out.

The point here is that in some cases the catalog scan might not error
out and might return inconsistent metadata which causes issues further
down the line in apply processing.

Regards,
Nikhils

> Greetings,
>
> Andres Freund

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2018-07-23 16:31:50 +0530, Nikhil Sontakke wrote:
> >> > That'd require that an index lookup can't crash if the corresponding
> >> > heap entry doesn't exist (etc), but that's something we need to handle
> >> > anyway. The issue that multiple separate catalog lookups need to be
> >> > coherent (say Robert's example where pg_class exists, but pg_attribute
> >> > doesn't) is solved by virtue of the pg_attribute lookups failing if
> >> > the transaction aborted.

> > Not quite, no. Basically, in a simplified manner, the logical decoding
> > loop is like:
> >
> > while (true)
> >     record = readRecord()
> >     logical = decodeRecord()
> >
> >     PG_TRY():
> >         StartTransactionCommand();
> >
> >         switch (TypeOf(logical))
> >             case INSERT:
> >                 insert_callback(logical);
> >                 break;
> >             ...
> >
> >         CommitTransactionCommand();
> >
> >     PG_CATCH():
> >         AbortCurrentTransaction();
> >         PG_RE_THROW();
> >
> > what I'm proposing is that various catalog access functions throw a
> > new class of error, something like "decoding aborted transactions".
>
> When will this error be thrown by the catalog functions? How will it
> determine that it needs to throw this error?

The error check would have to happen at the end of most systable_*
functions. They'd simply do something like

if (decoding_in_progress_xact && TransactionIdDidAbort(xid_of_aborted))
    ereport(ERROR, (errcode(DECODING_ABORTED_XACT), errmsg("oops")));

i.e. check whether the transaction to be decoded still is in
progress. As that would happen before any potentially wrong result can
be returned (as the check happens at the tail end of systable_*),
there's no issue with wrong state in the syscache etc.

> >     PG_CATCH():
> >         AbortCurrentTransaction();
> >         if (errclass == DECODING_ABORTED_XACT)
> >            in_progress_xact_abort_pending = true;
> >            continue;
> >         else
> >            PG_RE_THROW();
> >
> > Now obviously that's just pseudo code with lotsa things missing, but I
> > think the basic idea should come through?
> >
>
> How do we handle the cases where the catalog returns inconsistent data
> (without erroring out) which does not help with the ongoing decoding?
> Consider for example:

I don't think that situation exists, given the scheme described
above. That's just the point.

> BEGIN;
> /* CONSIDER T1 has one column C1 */
> ALTER TABLE T1 ADD COLUMN c2 int;
> INSERT INTO T1 (c2) VALUES (1);
> PREPARE TRANSACTION 'test1';
>
> If we abort the above 2PC and the catalog row for the ALTER gets
> cleaned up by vacuum, then the catalog read will return us T1 with one
> column C1.

No, it'd throw an error due to the new is-aborted check.

> The catalog scan will NOT error out but will return metadata which
> causes the insert-decoding change apply callback to error out.

Why would it not throw an error?

Greetings,

Andres Freund
Hi Andres,

>> > what I'm proposing is that various catalog access functions throw a
>> > new class of error, something like "decoding aborted transactions".
>>
>> When will this error be thrown by the catalog functions? How will it
>> determine that it needs to throw this error?
>
> The error check would have to happen at the end of most systable_*
> functions. They'd simply do something like
>
> if (decoding_in_progress_xact && TransactionIdDidAbort(xid_of_aborted))
>     ereport(ERROR, (errcode(DECODING_ABORTED_XACT), errmsg("oops")));
>
> i.e. check whether the transaction to be decoded still is in
> progress. As that would happen before any potentially wrong result can
> be returned (as the check happens at the tail end of systable_*),
> there's no issue with wrong state in the syscache etc.
>

Oh, ok. The systable_* functions use the passed-in snapshot and return
tuples matching it. They do not typically have access to the current
XID being worked upon.

We can find out if the snapshot is a logical decoding one by virtue of
its "satisfies" function pointing to HeapTupleSatisfiesHistoricMVCC.

>
>> The catalog scan will NOT error out but will return metadata which
>> causes the insert-decoding change apply callback to error out.
>
> Why would it not throw an error?
>

In your scheme, it will throw an error, indeed.

We'd need to make the "being-currently-decoded-XID" visible to these
systable_* functions and then this scheme will work.

Regards,
Nikhils

> Greetings,
>
> Andres Freund

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018-07-23 19:37:46 +0530, Nikhil Sontakke wrote:
> Hi Andres,
>
> >> > what I'm proposing is that various catalog access functions throw a
> >> > new class of error, something like "decoding aborted transactions".
> >>
> >> When will this error be thrown by the catalog functions? How will it
> >> determine that it needs to throw this error?
> >
> > The error check would have to happen at the end of most systable_*
> > functions. They'd simply do something like
> >
> > if (decoding_in_progress_xact && TransactionIdDidAbort(xid_of_aborted))
> >     ereport(ERROR, (errcode(DECODING_ABORTED_XACT), errmsg("oops")));
> >
> > i.e. check whether the transaction to be decoded still is in
> > progress. As that would happen before any potentially wrong result can
> > be returned (as the check happens at the tail end of systable_*),
> > there's no issue with wrong state in the syscache etc.
> >
>
> Oh, ok. The systable_* functions use the passed-in snapshot and return
> tuples matching it. They do not typically have access to the current
> XID being worked upon.

That seems like quite a solvable issue, especially compared to the
locking schemes proposed.

> We can find out if the snapshot is a logical decoding one by virtue of
> its "satisfies" function pointing to HeapTupleSatisfiesHistoricMVCC.

I think we even can just do something like a global

TransactionId check_if_transaction_is_alive = InvalidTransactionId;

and just set it up during decoding. And then just check it whenever it's
not set to InvalidTransactionId.

Greetings,

Andres Freund
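To make the proposal concrete, installing and clearing such a global
around replay could look like the sketch below. CheckXidAlive is the name
the later patchset settles on; the function names here are invented:

/* XID of the in-progress/prepared transaction being decoded, if any */
TransactionId CheckXidAlive = InvalidTransactionId;

static void
ReplayPreparedXactGuarded(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
	CheckXidAlive = txn->xid;

	PG_TRY();
	{
		ReplayPreparedXact(rb, txn);	/* invented name */
		CheckXidAlive = InvalidTransactionId;
	}
	PG_CATCH();
	{
		/* reset on error too, so unrelated scans aren't affected */
		CheckXidAlive = InvalidTransactionId;
		PG_RE_THROW();
	}
	PG_END_TRY();
}

Catalog scans then only have to test CheckXidAlive at their tail, as
discussed above.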
Hi Andres,

>> We can find out if the snapshot is a logical decoding one by virtue of
>> its "satisfies" function pointing to HeapTupleSatisfiesHistoricMVCC.
>
> I think we even can just do something like a global
> TransactionId check_if_transaction_is_alive = InvalidTransactionId;
> and just set it up during decoding. And then just check it whenever it's
> not set to InvalidTransactionId.
>
>

Ok. I will work on something along these lines and re-submit the set
of patches.

Regards,
Nikhils
On Thu, Jul 19, 2018 at 3:42 PM, Andres Freund <andres@anarazel.de> wrote: > I don't think this reasoning actually applies for making HOT pruning > weaker as necessary for decoding. The xmin horizon on catalog tables is > already pegged, which'd prevent similar problems. That sounds completely wrong to me. Setting the xmin horizon keeps tuples that are made dead by a committing transaction from being removed, but I don't think it will do anything to keep tuples that are made dead by an aborting transaction from being removed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On July 23, 2018 9:11:13 AM PDT, Robert Haas <robertmhaas@gmail.com> wrote: >On Thu, Jul 19, 2018 at 3:42 PM, Andres Freund <andres@anarazel.de> >wrote: >> I don't think this reasoning actually applies for making HOT pruning >> weaker as necessary for decoding. The xmin horizon on catalog tables >is >> already pegged, which'd prevent similar problems. > >That sounds completely wrong to me. Setting the xmin horizon keeps >tuples that are made dead by a committing transaction from being >removed, but I don't think it will do anything to keep tuples that are >made dead by an aborting transaction from being removed. My point is that we could just make HTSV treat them as recently dead, without incurring the issues of the bug you referenced. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Mon, Jul 23, 2018 at 12:13 PM, Andres Freund <andres@anarazel.de> wrote: > My point is that we could just make HTSV treat them as recently dead, without incurring the issues of the bug you referenced. That doesn't seem sufficient. For example, it won't keep the predecessor tuple's ctid field from being overwritten by a subsequent updater -- and if that happens then the update chain is broken. Maybe your idea of cross-checking at the end of each syscache lookup would be sufficient to prevent that from happening, though. But I wonder if there are subtler problems, too -- e.g. relfrozenxid vs. actual xmins in the table, clog truncation, or whatever. There might be no problem, but the idea that an aborted transaction is of no further interest to anybody is pretty deeply ingrained in the system. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2018-07-23 12:38:25 -0400, Robert Haas wrote:
> On Mon, Jul 23, 2018 at 12:13 PM, Andres Freund <andres@anarazel.de> wrote:
> > My point is that we could just make HTSV treat them as recently dead, without incurring the issues of the bug you referenced.
>
> That doesn't seem sufficient. For example, it won't keep the
> predecessor tuple's ctid field from being overwritten by a subsequent
> updater -- and if that happens then the update chain is broken.

Sure. I wasn't arguing that it'd be sufficient. Just that the specific
concern that it'd reintroduce the bug you mentioned isn't right. I
agree that it's quite terrifying to attempt to get this right.

> Maybe your idea of cross-checking at the end of each syscache lookup
> would be sufficient to prevent that from happening, though.

Hm? If we go for that approach we would not do *anything* about
pruning, which is why I think it has appeal. Because we'd check at the
end of system table scans (not syscache lookups, positive cache hits
are fine because of invalidation handling) whether the to-be-decoded
transaction aborted, we'd not need to do anything about pruning: If the
transaction aborted, we're guaranteed to know - the result might have
been wrong, but since we error out before filling any caches, we're
ok. If it hasn't yet aborted at the end of the scan, we conversely are
guaranteed that the scan results are correct.

Greetings,

Andres Freund
Hi, >> I think we even can just do something like a global >> TransactionId check_if_transaction_is_alive = InvalidTransactionId; >> and just set it up during decoding. And then just check it whenever it's >> not set tot InvalidTransactionId. >> >> > > Ok. I will work on something along these lines and re-submit the set of patches. > PFA, latest patchset, which completely removes the earlier LogicalLock/LogicalUnLock implementation using groupDecode stuff and uses the newly suggested approach of checking the currently decoded XID for abort in systable_* API functions. Much simpler to code and easier to test as well. Out of the patchset, the specific patch which focuses on the above systable_* API based XID checking implementation is part of 0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patch. So, it might help to take a look at this patch first for any additional feedback on this approach. There's an additional test case in 0005-Additional-test-case-to-demonstrate-decoding-rollbac.patch which uses a sleep in the "change" plugin API to allow a concurrent rollback on the 2PC being currently decoded. Andres generally doesn't like this approach :-), but there are no timing/interlocking issues now, and the sleep just helps us do a concurrent rollback, so it might be ok now, all things considered. Anyways, it's an additional patch for now. Comments, feedback appreciated. Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- 0001-Cleaning-up-of-flags-in-ReorderBufferTXN-structure.patch
- 0002-Support-decoding-of-two-phase-transactions-at-PREPAR.patch
- 0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patch
- 0004-Teach-test_decoding-plugin-to-work-with-2PC.patch
- 0005-Additional-test-case-to-demonstrate-decoding-rollbac.patch
On 2018-07-26 20:24:00 +0530, Nikhil Sontakke wrote: > Hi, > > >> I think we even can just do something like a global > >> TransactionId check_if_transaction_is_alive = InvalidTransactionId; > >> and just set it up during decoding. And then just check it whenever it's > >> not set tot InvalidTransactionId. > >> > >> > > > > Ok. I will work on something along these lines and re-submit the set of patches. > PFA, latest patchset, which completely removes the earlier > LogicalLock/LogicalUnLock implementation using groupDecode stuff and > uses the newly suggested approach of checking the currently decoded > XID for abort in systable_* API functions. Much simpler to code and > easier to test as well. So, leaving the fact that it might not actually be correct aside ;), you seem to be ok with the approach? > Out of the patchset, the specific patch which focuses on the above > systable_* API based XID checking implementation is part of > 0003-Gracefully-handle-concurrent-aborts-of-uncommitted-t.patch. So, > it might help to take a look at this patch first for any additional > feedback on this approach. K. > There's an additional test case in > 0005-Additional-test-case-to-demonstrate-decoding-rollbac.patch which > uses a sleep in the "change" plugin API to allow a concurrent rollback > on the 2PC being currently decoded. Andres generally doesn't like this > approach :-), but there are no timing/interlocking issues now, and the > sleep just helps us do a concurrent rollback, so it might be ok now, > all things considered. Anyways, it's an additional patch for now. Yea, I still don't think it's ok. The tests won't be reliable. There's ways to make this reliable, e.g. by forcing a lock to be acquired that's externally held or such. Might even be doable just with a weird custom datatype. > From 75edeb440794fff7de48082dafdecb065940bee5 Mon Sep 17 00:00:00 2001 > From: Nikhil Sontakke <nikhils@2ndQuadrant.com> > Date: Thu, 26 Jul 2018 18:45:26 +0530 > Subject: [PATCH 3/5] Gracefully handle concurrent aborts of uncommitted > transactions that are being decoded alongside. > > When a transaction aborts, it's changes are considered unnecessary for > other transactions. That means the changes may be either cleaned up by > vacuum or removed from HOT chains (thus made inaccessible through > indexes), and there may be other such consequences. > > When decoding committed transactions this is not an issue, and we > never decode transactions that abort before the decoding starts. > > But for in-progress transactions - for example when decoding prepared > transactions on PREPARE (and not COMMIT PREPARED as before), this > may cause failures when the output plugin consults catalogs (both > system and user-defined). > > We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK > sqlerrcode from system table scan APIs to the backend decoding a > specific uncommitted transaction. The decoding logic on the receipt > of such an sqlerrcode aborts the ongoing decoding and returns > gracefully. 
> --- > src/backend/access/index/genam.c | 31 +++++++++++++++++++++++++ > src/backend/replication/logical/reorderbuffer.c | 30 ++++++++++++++++++++---- > src/backend/utils/time/snapmgr.c | 25 ++++++++++++++++++-- > src/include/utils/snapmgr.h | 4 +++- > 4 files changed, 82 insertions(+), 8 deletions(-) > > diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c > index 9d08775687..67c5810bf7 100644 > --- a/src/backend/access/index/genam.c > +++ b/src/backend/access/index/genam.c > @@ -423,6 +423,16 @@ systable_getnext(SysScanDesc sysscan) > else > htup = heap_getnext(sysscan->scan, ForwardScanDirection); > > + /* > + * If CheckXidAlive is valid, then we check if it aborted. If it did, we > + * error out > + */ > + if (TransactionIdIsValid(CheckXidAlive) && > + TransactionIdDidAbort(CheckXidAlive)) > + ereport(ERROR, > + (errcode(ERRCODE_TRANSACTION_ROLLBACK), > + errmsg("transaction aborted during system catalog scan"))); > + > return htup; > } Don't we have to check TransactionIdIsInProgress() first? C.f. header comments in tqual.c. Note this is also not guaranteed to be correct after a crash (where no clog entry will exist for an aborted xact), but we probably shouldn't get here in that case - but better be safe. I suspect it'd be better reformulated as TransactionIdIsValid(CheckXidAlive) && !TransactionIdIsInProgress(CheckXidAlive) && !TransactionIdDidCommit(CheckXidAlive) What do you think? I think it'd also be good to add assertions to codepaths not going through systable_* asserting that !TransactionIdIsValid(CheckXidAlive). Alternatively we could add an if (unlikely(TransactionIdIsValid(CheckXidAlive)) && ...) branch to those too. > From 80fc576bda483798919653991bef6dc198625d90 Mon Sep 17 00:00:00 2001 > From: Nikhil Sontakke <nikhils@2ndQuadrant.com> > Date: Wed, 13 Jun 2018 16:31:15 +0530 > Subject: [PATCH 4/5] Teach test_decoding plugin to work with 2PC > > Includes a new option "enable_twophase". Depending on this options > value, PREPARE TRANSACTION will either be decoded or treated as > a single phase commit later. FWIW, I don't think I'm ok with doing this on a per-plugin-option basis. I think this is something that should be known to the outside of the plugin. More similar to how binary / non-binary support works. Should also be able to inquire the output plugin whether it's supported (cf previous similarity). > From 682b0de2827d1f55c4e471c3129eb687ae0825a5 Mon Sep 17 00:00:00 2001 > From: Nikhil Sontakke <nikhils@2ndQuadrant.com> > Date: Wed, 13 Jun 2018 16:32:16 +0530 > Subject: [PATCH 5/5] Additional test case to demonstrate decoding/rollback > interlocking > > Introduce a decode-delay parameter in the test_decoding plugin. Based > on the value provided in the plugin, sleep for those many seconds while > inside the "decode change" plugin call. A concurrent rollback is fired > off which aborts that transaction in the meanwhile. A subsequent > systable access will error out causing the logical decoding to abort. Yea, I'm *definitely* still not on board with this. This'll just lead to a fragile or extremely slow test. Greetings, Andres Freund
>> PFA, latest patchset, which completely removes the earlier
>> LogicalLock/LogicalUnLock implementation using groupDecode stuff and
>> uses the newly suggested approach of checking the currently decoded
>> XID for abort in systable_* API functions. Much simpler to code and
>> easier to test as well.
>
> So, leaving the fact that it might not actually be correct aside ;), you
> seem to be ok with the approach?
>

;-)

Yes, I do like the approach. Do you think there are other locations
other than systable_* APIs which might need such checks?

>> There's an additional test case in
>> 0005-Additional-test-case-to-demonstrate-decoding-rollbac.patch which
>> uses a sleep in the "change" plugin API to allow a concurrent rollback
>> on the 2PC being currently decoded. Andres generally doesn't like this
>> approach :-), but there are no timing/interlocking issues now, and the
>> sleep just helps us do a concurrent rollback, so it might be ok now,
>> all things considered. Anyways, it's an additional patch for now.
>
> Yea, I still don't think it's ok. The tests won't be reliable. There's
> ways to make this reliable, e.g. by forcing a lock to be acquired that's
> externally held or such. Might even be doable just with a weird custom
> datatype.
>

Ok, I will look at ways to do away with the sleep.

>> diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
>> index 9d08775687..67c5810bf7 100644
>> --- a/src/backend/access/index/genam.c
>> +++ b/src/backend/access/index/genam.c
>> @@ -423,6 +423,16 @@ systable_getnext(SysScanDesc sysscan)
>> 	else
>> 		htup = heap_getnext(sysscan->scan, ForwardScanDirection);
>>
>> +	/*
>> +	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
>> +	 * error out
>> +	 */
>> +	if (TransactionIdIsValid(CheckXidAlive) &&
>> +		TransactionIdDidAbort(CheckXidAlive))
>> +		ereport(ERROR,
>> +			(errcode(ERRCODE_TRANSACTION_ROLLBACK),
>> +			 errmsg("transaction aborted during system catalog scan")));
>> +
>> 	return htup;
>> }
>
> Don't we have to check TransactionIdIsInProgress() first? C.f. header
> comments in tqual.c. Note this is also not guaranteed to be correct
> after a crash (where no clog entry will exist for an aborted xact), but
> we probably shouldn't get here in that case - but better be safe.
>
> I suspect it'd be better reformulated as
> TransactionIdIsValid(CheckXidAlive) &&
> !TransactionIdIsInProgress(CheckXidAlive) &&
> !TransactionIdDidCommit(CheckXidAlive)
>
> What do you think?
>

tqual.c does seem to mention this for a non-MVCC snapshot, so might as
well do it this way. The caching of the fetched XID should not make
these checks too expensive anyways.

>
> I think it'd also be good to add assertions to codepaths not going
> through systable_* asserting that
> !TransactionIdIsValid(CheckXidAlive). Alternatively we could add an
> if (unlikely(TransactionIdIsValid(CheckXidAlive)) && ...)
> branch to those too.
>

I was wondering if anything else would be needed for user-defined
catalog tables..

>
>
>> From 80fc576bda483798919653991bef6dc198625d90 Mon Sep 17 00:00:00 2001
>> From: Nikhil Sontakke <nikhils@2ndQuadrant.com>
>> Date: Wed, 13 Jun 2018 16:31:15 +0530
>> Subject: [PATCH 4/5] Teach test_decoding plugin to work with 2PC
>>
>> Includes a new option "enable_twophase". Depending on this options
>> value, PREPARE TRANSACTION will either be decoded or treated as
>> a single phase commit later.
>
> FWIW, I don't think I'm ok with doing this on a per-plugin-option basis.
> I think this is something that should be known to the outside of the
> plugin. More similar to how binary / non-binary support works. Should
> also be able to inquire the output plugin whether it's supported (cf
> previous similarity).
>

Hmm, lemme see if we can do it outside of the plugin. But note that a
plugin might want to decode some 2PC at prepare time and others at
"commit prepared" time. We also need to take care to not break logical
replication if the other node is running non-2PC-enabled code.

We tried to optimize the COMMIT/ABORT handling by adding sub-flags to
the existing protocol. I will test that as well.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

PFA, latest patchset which incorporates the additional feedback.

>>> There's an additional test case in
>>> 0005-Additional-test-case-to-demonstrate-decoding-rollbac.patch which
>>> uses a sleep in the "change" plugin API to allow a concurrent rollback
>>> on the 2PC being currently decoded. Andres generally doesn't like this
>>> approach :-), but there are no timing/interlocking issues now, and the
>>> sleep just helps us do a concurrent rollback, so it might be ok now,
>>> all things considered. Anyways, it's an additional patch for now.
>>
>> Yea, I still don't think it's ok. The tests won't be reliable. There's
>> ways to make this reliable, e.g. by forcing a lock to be acquired that's
>> externally held or such. Might even be doable just with a weird custom
>> datatype.
>>
>
> Ok, I will look at ways to do away with the sleep.
>

The attached patchset implements a non-sleep-based approach by sending
the 2PC XID to the pg_logical_slot_get_changes() function as an option
for the test_decoding plugin. So, an example invocation will now look
like:

SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL,
NULL, 'skip-empty-xacts', '1', 'check-xid', '$xid2pc');

If the test_decoding pg_decode_change() API sees a valid xid argument,
it will wait for that transaction to be aborted. Another backend can
then come in and merrily abort this ongoing 2PC in the background.
Once it's aborted, the pg_decode_change API will go ahead and hit an
ERROR in the systable scan APIs. That should take care of Andres'
concern about using sleep in the tests. The relevant TAP test has been
added to this patchset.

>>> @@ -423,6 +423,16 @@ systable_getnext(SysScanDesc sysscan)
>>> 	else
>>> 		htup = heap_getnext(sysscan->scan, ForwardScanDirection);
>>>
>>> +	/*
>>> +	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
>>> +	 * error out
>>> +	 */
>>> +	if (TransactionIdIsValid(CheckXidAlive) &&
>>> +		TransactionIdDidAbort(CheckXidAlive))
>>> +		ereport(ERROR,
>>> +			(errcode(ERRCODE_TRANSACTION_ROLLBACK),
>>> +			 errmsg("transaction aborted during system catalog scan")));
>>> +
>>> 	return htup;
>>> }
>>
>> Don't we have to check TransactionIdIsInProgress() first? C.f. header
>> comments in tqual.c. Note this is also not guaranteed to be correct
>> after a crash (where no clog entry will exist for an aborted xact), but
>> we probably shouldn't get here in that case - but better be safe.
>>
>> I suspect it'd be better reformulated as
>> TransactionIdIsValid(CheckXidAlive) &&
>> !TransactionIdIsInProgress(CheckXidAlive) &&
>> !TransactionIdDidCommit(CheckXidAlive)
>>
>> What do you think?
>>

Modified the checks as per the above suggestion.

> I was wondering if anything else would be needed for user-defined
> catalog tables..
>

We don't need to do anything else for user-defined catalog tables
since they will also get accessed via the systable_* scan APIs.

>
> Hmm, lemme see if we can do it outside of the plugin. But note that a
> plugin might want to decode some 2PC at prepare time and others at
> "commit prepared" time.
>

The test_decoding pg_decode_filter_prepare() API implements a simple
filter strategy now. If the GID contains a substring "nodecode", then
it filters out decoding of such a 2PC at prepare time. Have added
steps to test this in the relevant test case in this patch.

I believe this patchset handles all pending issues along with relevant
test cases. Comments, further feedback appreciated.
Regards, Nikhils -- Nikhil Sontakke http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
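For reference, the waiting described in the mail above could be as simple
as the loop below inside test_decoding's change callback; the check_xid
field and the polling interval are assumptions about the patch, not
necessarily its exact shape:

/* Sketch: wait until the watched transaction has actually aborted, so a
 * concurrent ROLLBACK PREPARED can be exercised deterministically. */
if (TransactionIdIsValid(data->check_xid))	/* assumed option field */
{
	while (!TransactionIdDidAbort(data->check_xid))
	{
		CHECK_FOR_INTERRUPTS();
		pg_usleep(100 * 1000L);		/* recheck every 100ms */
	}

	/*
	 * The next catalog scan performed while decoding this transaction
	 * should now fail with ERRCODE_TRANSACTION_ROLLBACK.
	 */
}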
On 01/08/18 16:00, Nikhil Sontakke wrote:
>
>> I was wondering if anything else would be needed for user-defined
>> catalog tables..
>>
>
> We don't need to do anything else for user-defined catalog tables
> since they will also get accessed via the systable_* scan APIs.
>

They can be, but currently they might not be. So this requires at least
a big fat warning in the docs and a description of how to access user
catalogs from plugins correctly (i.e., always use the systable_* API on
them). It would be nice if we could check for it in Assert builds at
least.

--
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
On 2018-08-01 21:55:18 +0200, Petr Jelinek wrote:
> On 01/08/18 16:00, Nikhil Sontakke wrote:
> >
> >> I was wondering if anything else would be needed for user-defined
> >> catalog tables..
> >>
> >
> > We don't need to do anything else for user-defined catalog tables
> > since they will also get accessed via the systable_* scan APIs.
> >
>
> They can be, but currently they might not be. So this requires at least
> a big fat warning in the docs and a description of how to access user
> catalogs from plugins correctly (i.e., always use the systable_* API on
> them). It would be nice if we could check for it in Assert builds at
> least.

Yea, I agree. I think we should just consider putting similar checks in
the general scan APIs. With an unlikely() and the easy predictability of
these checks, I think we should be fine, overhead-wise.

Greetings,

Andres Freund
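Concretely, the guard suggested here might look like the following near
the top of, say, heap_getnext(); this mirrors the systable_* check, and
whether it ends up as an ereport() or a plain Assert() is a separate
choice. A sketch only:

/* Cheap guard for the general (non-systable) scan paths. */
if (unlikely(TransactionIdIsValid(CheckXidAlive)) &&
	!TransactionIdIsInProgress(CheckXidAlive) &&
	!TransactionIdDidCommit(CheckXidAlive))
	ereport(ERROR,
			(errcode(ERRCODE_TRANSACTION_ROLLBACK),
			 errmsg("transaction aborted during catalog scan")));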
>> They can be, but currently they might not be. So this requires at least
>> a big fat warning in the docs and a description of how to access user
>> catalogs from plugins correctly (i.e., always use the systable_* API on
>> them). It would be nice if we could check for it in Assert builds at
>> least.
>

Ok, modified the sgml documentation for the above.

> Yea, I agree. I think we should just consider putting similar checks in
> the general scan APIs. With an unlikely() and the easy predictability of
> these checks, I think we should be fine, overhead-wise.
>

Ok, added unlikely() checks in the heap_* scan APIs.

Revised patchset attached.

Regards,
Nikhils

> Greetings,
>
> Andres Freund

--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hello,

I have looked through the patches. I will first describe relatively
serious issues I see and then proceed with small nitpicking.

- On decoding of aborted xacts. The idea to throw an error once we
  detect the abort is appealing, however I think you will have problems
  with subxacts in the current implementation. What if subxact issues
  DDL and then aborted, but main transaction successfully committed?

- Decoding transactions at PREPARE record changes rules of the "we ship
  all commits after lsn 'x'" game. Namely, it will break initial
  tablesync: what if consistent snapshot was formed *after* PREPARE, but
  before COMMIT PREPARED, and the plugin decides to employ 2pc? Instead
  of getting initial contents + continuous stream of changes the receiver
  will miss the prepared xact contents and raise 'prepared xact doesn't
  exist' error. I think the starting point to address this is to forbid
  two-phase decoding of xacts with lsn of PREPARE less than
  snapbuilder's start_decoding_at.

- Currently we will call abort_prepared cb even if we failed to actually
  prepare xact due to concurrent abort. I think it is confusing for
  users. We should either handle this by remembering not to invoke
  abort_prepared in these cases or at least document this behaviour,
  leaving this problem to the receiver side.

- I find it suspicious that DecodePrepare completely ignores actions of
  SnapBuildCommitTxn. For example, to execute invalidations, the latter
  sets base snapshot if our xact (or subxacts) did DDL and the snapshot
  not set yet. I can't think of a concrete example where this would burn
  at the moment, but it should be considered.

Now, the bikeshedding.

First patch:

- I am one of those people upthread who don't think that converting
  flags to bitmask is beneficial -- especially given that many of them
  are mutually exclusive, e.g. xact can't be committed and aborted at
  the same time. Apparently you have left this to the committer though.

Second patch:

- Applying gives me

Applying: Support decoding of two-phase transactions at PREPARE
.git/rebase-apply/patch:871: trailing whitespace.

+ row. The <function>change_cb</function> callback may access system or
+ user catalog tables to aid in the process of outputting the row
+ modification details. In case of decoding a prepared (but yet
+ uncommitted) transaction or decoding of an uncommitted transaction, this
+ change callback is ensured sane access to catalog tables regardless of
+ simultaneous rollback by another backend of this very same transaction.

I don't think we should explain this, at least in such words. As
mentioned upthread, we should warn about allowed systable_* accesses
instead. Same for message_cb.

+	/*
+	 * Tell the reorderbuffer about the surviving subtransactions. We need to
+	 * do this because the main transaction itself has not committed since we
+	 * are in the prepare phase right now. So we need to be sure the snapshot
+	 * is setup correctly for the main transaction in case all changes
+	 * happened in subtransanctions
+	 */

While we do certainly need to associate subxacts here, the explanation
looks weird to me. I would leave just the 'Tell the reorderbuffer about
the surviving subtransactions' as in DecodeCommit.

 	}
-
 	/*
 	 * There's a speculative insertion remaining, just clean in up, it
 	 * can't have been successful, otherwise we'd gotten a confirmation

Spurious newline deletion.

- I would rename ReorderBufferCommitInternal to ReorderBufferReplay:
  we replay the xact there, not commit.
- If xact is empty, we will not prepare it (and call cb), even if the
  output plugin asked us. However, we will call commit_prepared cb.

- ReorderBufferTxnIsPrepared and ReorderBufferPrepareNeedSkip do the
  same and should be merged with comments explaining that the answer
  must be stable.

- filter_prepare_cb callback existence is checked in both decode.c and
  in filter_prepare_cb_wrapper.

+	/*
+	 * The transaction may or may not exist (during restarts for example).
+	 * Anyways, 2PC transactions do not contain any reorderbuffers. So allow
+	 * it to be created below.
+	 */

Code around looks sane, but I think that ReorderBufferTXN for our xact
must *not* exist at this moment: if we are going to COMMIT/ABORT
PREPARED it, it must have been replayed and RBTXN purged immediately
after. Also, instead of the vague '2PC transactions do not contain any
reorderbuffers' I would say something like 'create dummy
ReorderBufferTXN to pass it to the callback'.

- In DecodeAbort:
+	/*
+	 * If it's ROLLBACK PREPARED then handle it via callbacks.
+	 */
+	if (TransactionIdIsValid(xid) &&
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+

How can xid be invalid here?

- It might be worthwhile to put the check
+		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+		parsed->dbId == ctx->slot->data.database &&
+		!FilterByOrigin(ctx, origin_id) &&

which appears 3 times now into a separate function.

+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when

Kind of controversial (all of them or none, but optional), might be
formulated more accurately.

+	/*
+	 * Capabilities of the output plugin.
+	 */
+	bool		enable_twophase;

I would rename this to 'supports_twophase' since this is not an option
but a description of the plugin capabilities.

+	/* filter_prepare is optional, but requires two-phase decoding */
+	if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
+		ereport(ERROR,
+				(errmsg("Output plugin does not support two-phase decoding, but "
+						"registered filter_prepared callback.")));

Don't think we need to check that...

+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ 	rb->abort(rb, txn, commit_lsn);

This is dead code since we don't have decoding of in-progress xacts
yet.

Third patch:

+/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding. It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */

I would explain here that this xid is checked for abort after each
catalog scan, and refer the reader to SetupHistoricSnapshot for details.

+	/*
+	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
+	 * error out
+	 */
+	if (TransactionIdIsValid(CheckXidAlive) &&
+		!TransactionIdIsInProgress(CheckXidAlive) &&
+		!TransactionIdDidCommit(CheckXidAlive))
+		ereport(ERROR,
+				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+				 errmsg("transaction aborted during system catalog scan")));

Probably centralize checks in one function? As well as 'We don't expect
direct calls to heap_fetch...' ones.

P.S. Looks like you have torn the thread chain: In-Reply-To header of
mail [1] is missing. Please don't do that.

[1] https://www.postgresql.org/message-id/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3%2BQ%40mail.gmail.com

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 2018-08-06 21:06:13 +0300, Arseny Sher wrote:
> Hello,
>
> I have looked through the patches. I will first describe relatively
> serious issues I see and then proceed with small nitpicking.
>
> - On decoding of aborted xacts. The idea to throw an error once we
>   detect the abort is appealing, however I think you will have problems
>   with subxacts in the current implementation. What if subxact issues
>   DDL and then aborted, but main transaction successfully committed?

I don't see a fundamental issue here. I've not reviewed the current
patchset meaningfully, however. Do you see a fundamental issue here?

> - Decoding transactions at PREPARE record changes rules of the "we ship
>   all commits after lsn 'x'" game. Namely, it will break initial
>   tablesync: what if consistent snapshot was formed *after* PREPARE, but
>   before COMMIT PREPARED, and the plugin decides to employ 2pc? Instead
>   of getting initial contents + continuous stream of changes the receiver
>   will miss the prepared xact contents and raise 'prepared xact doesn't
>   exist' error. I think the starting point to address this is to forbid
>   two-phase decoding of xacts with lsn of PREPARE less than
>   snapbuilder's start_decoding_at.

Yea, that sounds like it needs to be addressed.

> - Currently we will call abort_prepared cb even if we failed to actually
>   prepare xact due to concurrent abort. I think it is confusing for
>   users. We should either handle this by remembering not to invoke
>   abort_prepared in these cases or at least document this behaviour,
>   leaving this problem to the receiver side.

What precisely do you mean by "concurrent abort"?

> - I find it suspicious that DecodePrepare completely ignores actions of
>   SnapBuildCommitTxn. For example, to execute invalidations, the latter
>   sets base snapshot if our xact (or subxacts) did DDL and the snapshot
>   not set yet. I can't think of a concrete example where this would burn
>   at the moment, but it should be considered.

Yea, I think this needs to mirror the actions (and thus generalize the
code to avoid duplication)

> Now, the bikeshedding.
>
> First patch:
>
> - I am one of those people upthread who don't think that converting
>   flags to bitmask is beneficial -- especially given that many of them
>   are mutually exclusive, e.g. xact can't be committed and aborted at
>   the same time. Apparently you have left this to the committer though.

Similar.

- Andres
Andres Freund <andres@anarazel.de> writes:

>> - On decoding of aborted xacts. The idea to throw an error once we
>>   detect the abort is appealing, however I think you will have problems
>>   with subxacts in the current implementation. What if subxact issues
>>   DDL and then aborted, but main transaction successfully committed?
>
> I don't see a fundamental issue here. I've not reviewed the current
> patchset meaningfully, however. Do you see a fundamental issue here?

Hmm, yes, this is not an issue for this patch because after reading the
PREPARE record we know all aborted subxacts and won't try to decode
their changes. However, this will be raised once we decide to decode
in-progress transactions. Checking for all subxids is expensive;
moreover, WAL doesn't provide all of them until commit... it might be
easier to prevent vacuuming of aborted stuff while decoding needs it.
Matter for another patch, anyway.

>> - Currently we will call abort_prepared cb even if we failed to actually
>>   prepare xact due to concurrent abort. I think it is confusing for
>>   users. We should either handle this by remembering not to invoke
>>   abort_prepared in these cases or at least document this behaviour,
>>   leaving this problem to the receiver side.
>
> What precisely do you mean by "concurrent abort"?

With the current patch, the following is possible:

 * We start decoding of some prepared xact;
 * Xact aborts (ABORT PREPARED) for any reason;
 * Decoding process notices this on catalog scan and calls abort()
   callback;
 * Later decoding process reads abort record and calls abort_prepared
   callback.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi Arseny,

> - Decoding transactions at PREPARE record changes rules of the "we ship
>   all commits after lsn 'x'" game. Namely, it will break initial
>   tablesync: what if consistent snapshot was formed *after* PREPARE, but
>   before COMMIT PREPARED, and the plugin decides to employ 2pc? Instead
>   of getting initial contents + continuous stream of changes the receiver
>   will miss the prepared xact contents and raise 'prepared xact doesn't
>   exist' error. I think the starting point to address this is to forbid
>   two-phase decoding of xacts with lsn of PREPARE less than
>   snapbuilder's start_decoding_at.
>

It will be the job of the plugin to return a consistent answer for
every GID that is encountered. In this case, the plugin will decode
the transaction at COMMIT PREPARED time and not at PREPARE time.

> - Currently we will call abort_prepared cb even if we failed to actually
>   prepare xact due to concurrent abort. I think it is confusing for
>   users. We should either handle this by remembering not to invoke
>   abort_prepared in these cases or at least document this behaviour,
>   leaving this problem to the receiver side.
>

The point is, when we reach the "ROLLBACK PREPARED", we have no idea
if the "PREPARE" was aborted by this rollback happening concurrently.
So it's possible that the 2PC has been successfully decoded and we
would have to send the rollback to the other side which would need to
check if it needs to roll back locally.

> - I find it suspicious that DecodePrepare completely ignores actions of
>   SnapBuildCommitTxn. For example, to execute invalidations, the latter
>   sets base snapshot if our xact (or subxacts) did DDL and the snapshot
>   not set yet. I can't think of a concrete example where this would burn
>   at the moment, but it should be considered.
>

I had discussed this area with Petr and we didn't see any issues then,
either.

> Now, the bikeshedding.
>
> First patch:
> - I am one of those people upthread who don't think that converting
>   flags to bitmask is beneficial -- especially given that many of them
>   are mutually exclusive, e.g. xact can't be committed and aborted at
>   the same time. Apparently you have left this to the committer though.
>

Hmm, there seems to be divided opinion on this. I am willing to go
back to using the booleans if there's opposition and if the committer
so wishes. Note that this patch will end up adding 4/5 more booleans
in that case (we add new ones for prepare, commit prepare, abort,
rollback prepare etc).

>
> Second patch:
> - Applying gives me
> Applying: Support decoding of two-phase transactions at PREPARE
> .git/rebase-apply/patch:871: trailing whitespace.
>
> + row. The <function>change_cb</function> callback may access system or
> + user catalog tables to aid in the process of outputting the row
> + modification details. In case of decoding a prepared (but yet
> + uncommitted) transaction or decoding of an uncommitted transaction, this
> + change callback is ensured sane access to catalog tables regardless of
> + simultaneous rollback by another backend of this very same transaction.
>
> I don't think we should explain this, at least in such words. As
> mentioned upthread, we should warn about allowed systable_* accesses
> instead. Same for message_cb.
>

Looks like you are looking at an earlier patchset. The latest patchset
has removed the above.

>
> +	/*
> +	 * Tell the reorderbuffer about the surviving subtransactions. We need to
> +	 * do this because the main transaction itself has not committed since we
> +	 * are in the prepare phase right now. So we need to be sure the snapshot
> +	 * is setup correctly for the main transaction in case all changes
> +	 * happened in subtransanctions
> +	 */
>
> While we do certainly need to associate subxacts here, the explanation
> looks weird to me. I would leave just the 'Tell the reorderbuffer about
> the surviving subtransactions' as in DecodeCommit.
>
>
>  	}
> -
>  	/*
>  	 * There's a speculative insertion remaining, just clean in up, it
>  	 * can't have been successful, otherwise we'd gotten a confirmation
>
> Spurious newline deletion.
>
>
> - I would rename ReorderBufferCommitInternal to ReorderBufferReplay:
>   we replay the xact there, not commit.
>
> - If xact is empty, we will not prepare it (and call cb),
>   even if the output plugin asked us. However, we will call
>   commit_prepared cb.
>
> - ReorderBufferTxnIsPrepared and ReorderBufferPrepareNeedSkip do the
>   same and should be merged with comments explaining that the answer
>   must be stable.
>
> - filter_prepare_cb callback existence is checked in both decode.c and
>   in filter_prepare_cb_wrapper.
>
> +	/*
> +	 * The transaction may or may not exist (during restarts for example).
> +	 * Anyways, 2PC transactions do not contain any reorderbuffers. So allow
> +	 * it to be created below.
> +	 */
>
> Code around looks sane, but I think that ReorderBufferTXN for our xact
> must *not* exist at this moment: if we are going to COMMIT/ABORT
> PREPARED it, it must have been replayed and RBTXN purged immediately
> after. Also, instead of the vague '2PC transactions do not contain any
> reorderbuffers' I would say something like 'create dummy
> ReorderBufferTXN to pass it to the callback'.
>
> - In DecodeAbort:
> +	/*
> +	 * If it's ROLLBACK PREPARED then handle it via callbacks.
> +	 */
> +	if (TransactionIdIsValid(xid) &&
> +		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
> +
>
> How can xid be invalid here?
>
>
> - It might be worthwhile to put the check
> +		!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
> +		parsed->dbId == ctx->slot->data.database &&
> +		!FilterByOrigin(ctx, origin_id) &&
>
> which appears 3 times now into a separate function.
>
>
> + * two-phase transactions - we either have to have all of them or none.
> + * The filter_prepare callback is optional, but can only be defined when
>
> Kind of controversial (all of them or none, but optional), might be
> formulated more accurately.
>
>
> +	/*
> +	 * Capabilities of the output plugin.
> +	 */
> +	bool		enable_twophase;
>
> I would rename this to 'supports_twophase' since this is not an option
> but a description of the plugin capabilities.
>
>
> +	/* filter_prepare is optional, but requires two-phase decoding */
> +	if ((ctx->callbacks.filter_prepare_cb != NULL) && (!ctx->enable_twophase))
> +		ereport(ERROR,
> +				(errmsg("Output plugin does not support two-phase decoding, but "
> +						"registered filter_prepared callback.")));
>
> Don't think we need to check that...
>
>
> + * Otherwise call either PREPARE (for twophase transactions) or COMMIT
> + * (for regular ones).
> + */
> + if (rbtxn_rollback(txn))
> + 	rb->abort(rb, txn, commit_lsn);
>
> This is dead code since we don't have decoding of in-progress xacts
> yet.
>

Yes, the above check can be done away with.

>
> Third patch:
> +/*
> + * An xid value pointing to a possibly ongoing or a prepared transaction.
> + * Currently used in logical decoding. It's possible that such transactions
> + * can get aborted while the decoding is ongoing.
> + */
>
> I would explain here that this xid is checked for abort after each
> catalog scan, and refer the reader to SetupHistoricSnapshot for details.
>
>
> +	/*
> +	 * If CheckXidAlive is valid, then we check if it aborted. If it did, we
> +	 * error out
> +	 */
> +	if (TransactionIdIsValid(CheckXidAlive) &&
> +		!TransactionIdIsInProgress(CheckXidAlive) &&
> +		!TransactionIdDidCommit(CheckXidAlive))
> +		ereport(ERROR,
> +				(errcode(ERRCODE_TRANSACTION_ROLLBACK),
> +				 errmsg("transaction aborted during system catalog scan")));
>
> Probably centralize checks in one function? As well as 'We don't expect
> direct calls to heap_fetch...' ones.
>
>
> P.S. Looks like you have torn the thread chain: In-Reply-To header of
> mail [1] is missing. Please don't do that.
>

That wasn't me. I was also annoyed and surprised to see a new email
thread separate from the earlier one containing 100 or so messages.

Regards,
Nikhils
--
Nikhil Sontakke                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Nikhil Sontakke <nikhils@2ndquadrant.com> writes:

>> - Decoding transactions at PREPARE record changes rules of the "we ship
>>   all commits after lsn 'x'" game. Namely, it will break initial
>>   tablesync: what if consistent snapshot was formed *after* PREPARE, but
>>   before COMMIT PREPARED, and the plugin decides to employ 2pc? Instead
>>   of getting initial contents + continuous stream of changes the receiver
>>   will miss the prepared xact contents and raise 'prepared xact doesn't
>>   exist' error. I think the starting point to address this is to forbid
>>   two-phase decoding of xacts with lsn of PREPARE less than
>>   snapbuilder's start_decoding_at.
>>
>
> It will be the job of the plugin to return a consistent answer for
> every GID that is encountered. In this case, the plugin will decode
> the transaction at COMMIT PREPARED time and not at PREPARE time.

I can't imagine a scenario in which a plugin would want to send COMMIT
PREPARED instead of replaying the xact fully on the CP record, given
that it had never seen the PREPARE record. On the other hand, tracking
such situations on the plugin's side would make the plugin's life
unnecessarily complicated: either it has to dig into snapbuilder/slot
internals to learn when the snapshot became consistent (which currently
is impossible as this lsn is not saved anywhere, btw), or it must fsync
each of its decisions to do or not to do 2PC.

Technically, my concern covers not only tablesync, but just plain
decoding start: we don't want to ship COMMIT PREPARED if the downstream
had never had a chance to see the PREPARE.

As for tablesync, looking at the current implementation I contemplate
that we would need to do something along the following lines:

- Tablesync worker performs COPY.
- It then speaks with main apply worker to arrange (origin) lsn of sync
  point, as it does now.
- Tablesync worker applies changes up to arranged lsn; it never uses
  two-phase decoding, all xacts are replayed on COMMIT PREPARED.
  Moreover, instead of going into SYNCDONE state immediately after
  reaching needed lsn, it stops replaying usual commits but continues to
  receive changes to finish all transactions which were prepared before
  sync point (we would need some additional support from reorderbuffer
  to learn when this happens). Only then it goes into SYNCDONE.
- Behaviour of the main apply worker doesn't change: it ignores changes
  of the table in question before sync point and applies them after sync
  point.

I believe this approach would implement tablesync correctly (all
changes are applied, but only once) with minimal fuss.

>> - Currently we will call abort_prepared cb even if we failed to actually
>>   prepare xact due to concurrent abort. I think it is confusing for
>>   users. We should either handle this by remembering not to invoke
>>   abort_prepared in these cases or at least document this behaviour,
>>   leaving this problem to the receiver side.
>
> The point is, when we reach the "ROLLBACK PREPARED", we have no idea
> if the "PREPARE" was aborted by this rollback happening concurrently.
> So it's possible that the 2PC has been successfully decoded and we
> would have to send the rollback to the other side which would need to
> check if it needs to roll back locally.

I understand this.
But I find this confusing for the users, so I propose to

- Either document that "you might get abort_prepared cb called even
  after abort cb was invoked for the same transaction";
- Or consider adding some infrastructure to reorderbuffer to remember
  not to call abort_prepared in these cases. Due to possible reboots, I
  think this means that we should not ReorderBufferCleanupTXN
  immediately after a failed attempt to replay the xact on PREPARE, but
  mark it as 'aborted' and keep it until we see the ABORT PREPARED
  record. If we see that the xact is marked as aborted, we don't call
  abort_prepared_cb. That way even if the decoder restarts in between,
  we will see PREPARE in WAL, inquire xact status (even if we skip it as
  already replayed) and mark it as aborted again.

>> - I find it suspicious that DecodePrepare completely ignores actions of
>>   SnapBuildCommitTxn. For example, to execute invalidations, the latter
>>   sets base snapshot if our xact (or subxacts) did DDL and the snapshot
>>   not set yet. I can't think of a concrete example where this would burn
>>   at the moment, but it should be considered.
>>
>
> I had discussed this area with Petr and we didn't see any issues then,
> either.
>

Ok, simplifying, what SnapBuildCommitTxn practically does is

* Decide whether we are interested in tracking this xact's effects, and
  if we are, mark it as committed.
* Build and distribute snapshot to all RBTXNs, if it is important.
* Set base snap of our xact if it did DDL, to execute invalidations
  during replay.

I see that we don't need to do the first two bullets during
DecodePrepare: xact effects are still invisible for everyone but itself
after PREPARE. As for seeing the xact's own changes, it is implemented
via logging cmin/cmax and we don't need to mark the xact as committed
for that (cf. ReorderBufferCopySnap).

Regarding the third point... I think in 2PC decoding we might need to
execute invalidations twice:

1) After replaying the xact on PREPARE to forget about catalog changes
the xact did -- it is not yet committed and must be invisible to other
xacts until CP. In the latest patchset invalidations are executed only
if there is at least one change in the xact (it has a base snap). It
looks fine: we can't spoil catalogs if there was nothing to decode.
Better to explain that somewhere.

2) After decoding COMMIT PREPARED to make changes visible. In the
current patchset it is always done. Actually, *this* is the reason
RBTXN might already exist when we enter ReorderBufferFinishPrepared,
not "(during restarts for example)" as the comment there says: if there
were inval messages, RBTXN will be created in DecodeCommit during their
addition.

BTW, "that we might need to execute invalidations, add snapshot" in
SnapBuildCommitTxn looks like a kludge to me: I suppose it is better to
do that at ReorderBufferXidSetCatalogChanges.

Now, another issue is registering the xact as committed in
SnapBuildCommitTxn during COMMIT PREPARED processing. Since RBTXN is
always purged after xact replay on PREPARE, the only medium we have for
noticing catalog changes during COMMIT PREPARED is invalidation
messages attached to the CP record. This raises the following question.

* If there is a guarantee that whenever a xact makes catalog changes it
  generates invalidation messages, then this code is fine. However,
  currently ReorderBufferXidSetCatalogChanges is also called on
  XLOG_HEAP_INPLACE processing and in SnapBuildProcessNewCid, which is
  useless if such a guarantee exists.
* If, on the other hand, there is no such guarantee, this code is
  broken.
>> - I am one of those people upthread who don't think that converting >> flags to bitmask is beneficial -- especially given that many of them >> are mutually exclusive, e.g. xact can't be committed and aborted at >> the same time. Apparently you have left this to the committer though. >> > > Hmm, there seems to be divided opinion on this. I am willing to go > back to using the booleans if there's opposition and if the committer > so wishes. Note that this patch will end up adding 4/5 more booleans > in that case (we add new ones for prepare, commit prepare, abort, > rollback prepare etc). Well, you can unite mutually exclusive fields into one enum or char with macros defining possible values. Transaction can't be committed and aborted at the same time, etc. >> + row. The <function>change_cb</function> callback may access system or >> + user catalog tables to aid in the process of outputting the row >> + modification details. In case of decoding a prepared (but yet >> + uncommitted) transaction or decoding of an uncommitted transaction, this >> + change callback is ensured sane access to catalog tables regardless of >> + simultaneous rollback by another backend of this very same transaction. >> >> I don't think we should explain this, at least in such words. As >> mentioned upthread, we should warn about allowed systable_* accesses >> instead. Same for message_cb. >> > > Looks like you are looking at an earlier patchset. The latest patchset > has removed the above. I see, sorry. -- Arseny Sher Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
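As an illustration of the enum alternative Arseny mentions, the mutually
exclusive transaction states could collapse into a single field along
these lines (all names invented for the sketch):

typedef enum ReorderBufferTXNState
{
	RBTXN_STATE_INPROGRESS,
	RBTXN_STATE_PREPARED,
	RBTXN_STATE_COMMITTED,
	RBTXN_STATE_ABORTED
} ReorderBufferTXNState;

typedef struct ReorderBufferTXN
{
	/* ... existing fields ... */
	ReorderBufferTXNState state;		/* exactly one of the above */
	bool		has_catalog_changes;	/* independent flags stay booleans */
} ReorderBufferTXN;

A transaction then cannot be marked both committed and aborted at the
same time by construction.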
Hi Nikhil,

Any progress on the issues discussed in the last couple of messages? That is:

1) removing of the sleep() from tests
2) changes to systable_getnext() wrt. TransactionIdIsInProgress()
3) adding asserts / checks to codepaths not going through systable_*
4) (not) adding this as a per-plugin option
5) handling cases where the downstream does not have 2PC enabled

It'd be good to have an updated patch or further discussion before continuing the review efforts.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tomas,

> Any progress on the issues discussed in the last couple of messages?
> That is:
>
> 1) removing of the sleep() from tests

Done. The test_decoding plugin now takes a new option "check-xid". We pass the XID which is going to be aborted via this option. The test_decoding plugin will wait for this XID to abort and exit when that happens. This removes any arbitrary sleep dependencies.

> 2) changes to systable_getnext() wrt. TransactionIdIsInProgress()

Done.

> 3) adding asserts / checks to codepaths not going through systable_*

Done. All the heap_* get API calls now assert that they are not being invoked with a valid CheckXidAlive value.

> 4) (not) adding this as a per-plugin option
>
> 5) handling cases where the downstream does not have 2PC enabled

struct OutputPluginOptions now has an enable_twophase field which will be set by the plugin at init time, similar to the way output_type is set to binary/text now.

> It'd be good to have an updated patch or further discussion before
> continuing the review efforts.

PFA the latest patchset, which implements the above.

Regards,
Nikhil

--
Nikhil Sontakke  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
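As a sketch of how a plugin would opt in under the scheme described above (the enable_twophase field is from the patchset; the rest is illustrative):

static void
pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
				  bool is_init)
{
	/* plain text output, as test_decoding does today */
	opt->output_type = OUTPUT_PLUGIN_TEXTUAL_OUTPUT;

	/* ask for two-phase transactions to be decoded at PREPARE time */
	opt->enable_twophase = true;
}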
Attachment
Hi,

> PFA the latest patchset, which implements the above.
>

The newly added test_decoding test was failing due to a slight expected-output mismatch. The attached patch set corrects that.

Regards,
Nikhil

--
Nikhil Sontakke  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Hi Nikhil,

Thanks for the updated patch - I've started working on a review, with the hope of getting it committed sometime in 2019-01. But the patch bit-rotted again a bit (probably due to d3c09b9b), which broke the last part. Can you post a fixed version?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

> Hi Nikhil,
>
> Thanks for the updated patch - I've started working on a review, with
> the hope of getting it committed sometime in 2019-01. But the patch
> bit-rotted again a bit (probably due to d3c09b9b), which broke the last
> part. Can you post a fixed version?

Please also note that at some point the thread was split and continued in another place:
https://www.postgresql.org/message-id/flat/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3%2BQ%40mail.gmail.com

And now we have two branches =(

I hadn't checked whether my concerns were addressed in the latest version though.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On 12/18/18 10:28 AM, Arseny Sher wrote:
>
> Please also note that at some point the thread was split and continued in
> another place:
> https://www.postgresql.org/message-id/flat/CAMGcDxeqEpWj3fTXwqhSwBdXd2RS9jzwWscO-XbeCfso6ts3%2BQ%40mail.gmail.com
>
> And now we have two branches =(

Thanks for pointing that out - I've added the other thread to the CF entry, so that we don't lose it.

> I hadn't checked whether my concerns were addressed in the latest
> version though.

OK, I'll read through the other thread and will check. Or perhaps Nikhil can comment on that.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Tomas,

> Thanks for the updated patch - I've started working on a review, with
> the hope of getting it committed sometime in 2019-01. But the patch
> bit-rotted again a bit (probably due to d3c09b9b), which broke the last
> part. Can you post a fixed version?
>

PFA the updated patch set.

Regards,
Nikhil

--
Nikhil Sontakke
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/
Attachment
Hi Arseny,

> I hadn't checked whether my concerns were addressed in the latest
> version though.
>

I'd like to believe that the latest patch set tries to address some (if not all) of your concerns. Can you please take a look and let me know?

Regards,
Nikhil

--
Nikhil Sontakke
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/
Nikhil Sontakke <nikhils@2ndquadrant.com> writes:

> I'd like to believe that the latest patch set tries to address some
> (if not all) of your concerns. Can you please take a look and let me
> know?

Hi, sure.

General things:

- Earlier I said that there is no point in sending COMMIT PREPARED if the decoding snapshot became consistent after PREPARE, i.e. the PREPARE hadn't been sent. I have realized since then that such use cases actually exist: the prepare might be copied to the replica by e.g. basebackup or something else earlier. Still, a plugin must be able to easily distinguish these too-early PREPAREs without doing its own bookkeeping (remembering each PREPARE it has seen). Fortunately, it turns out we can make this easy. If during COMMIT PREPARED / ABORT PREPARED record decoding we see that a ReorderBufferTXN with such an xid exists, it means that either 1) the plugin refused to replay this xact at PREPARE or 2) the PREPARE was too early in the stream. Otherwise the xact would have been replayed at PREPARE processing and the rbtxn purged immediately after. I think we should add this to the documentation of filter_prepare_cb. Also, to this end, we need to add an argument to this callback specifying in which context it was called: during prepare / commit prepared / abort prepared. Also, for this to work, ReorderBufferProcessXid must always be called at PREPARE, not only when 2PC decoding is disabled.

- BTW, ReorderBufferProcessXid at PREPARE should always be called anyway, because otherwise if the xact is empty, we will not prepare it (and call the cb), even if the output plugin asked us not to filter it out. However, we will call the commit_prepared cb, which is inconsistent.

- I find it weird that in DecodePrepare and in DecodeCommit you always ask the plugin whether to filter an xact, given that sometimes you know beforehand that you are not going to replay it: it might have already been replayed, might have the wrong dbid, origin, etc. One consequence of this: imagine that notorious xact with PREPARE before the point where the snapshot became consistent and COMMIT PREPARED after that point. Even if filter_cb says 'I want 2PC on this xact', with the current code it won't be replayed on PREPARE and the rbtxn will be destroyed with ReorderBufferForget. Now this xact is lost.

- Doing a full-blown SnapBuildCommitTxn during PREPARE decoding is wrong, because the xact's effects must not yet be visible to others. I discussed this at length and described adjacent problems in [1].

- I still don't like that if a 2PC xact was aborted and its replay stopped, the prepare callback won't be called but abort_prepared will be. This should either be documented or fixed.

Second patch:

+ /* filter_prepare is optional, but requires two-phase decoding */
+ if ((ctx->callbacks.filter_prepare_cb != NULL) && (!opt->enable_twophase))
+ ereport(ERROR,
+ (errmsg("Output plugin does not support two-phase decoding, but "
+ "registered filter_prepared callback.")));

I actually think that the enable_twophase output plugin option is redundant. If a plugin author wants 2PC, he just provides the filter_prepare_cb callback and potentially others. I also don't see much value in checking that exactly 0 or 3 callbacks were registered.

- You allow the (commit|abort)_prepared_cb and prepare_cb callbacks to be unspecified with 2PC enabled, yet call them without checking that they actually exist.

- executed within that transaction.
+ executed within that transaction.
+ A transaction that is prepared for
+ a two-phase commit using <command>PREPARE TRANSACTION</command> will
+ also be decoded if the output plugin callbacks needed for decoding
+ them are provided. It is possible that the current transaction which
+ is being decoded is aborted concurrently via a <command>ROLLBACK PREPARED</command>
+ command. In that case, the logical decoding of this transaction will
+ be aborted too.

This should say explicitly that such a 2PC xact will be decoded at the PREPARE record. Probably also add that otherwise it is decoded at the CP record. Probably also add "and the abort_cb callback called" to the last sentence.

+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are

This callback is not required in the code, and it would indeed be a bad idea to demand it, breaking compatibility with existing plugins that don't care about 2PC.

+ * Otherwise call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).
+ */
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);

This is dead code since we don't have decoding of in-progress xacts yet.

+ /*
+ * If there is a valid top-level transaction that's different from the
+ * two-phase one we are aborting, clear its reorder buffer as well.
+ */
+ if (TransactionIdIsNormal(xid) && xid != parsed->twophase_xid)
+ ReorderBufferAbort(ctx->reorder, xid, origin_lsn);

What is the aim of this? How can the xl_xid xid of a commit prepared record be normal?

+ /*
+ * The transaction may or may not exist (during restarts for example).
+ * Anyways, 2PC transactions do not contain any reorderbuffers. So allow
+ * it to be created below.
+ */

The code around this looks sane, but I think that restarts are irrelevant to rbtxn existence at this moment: if we are going to COMMIT/ABORT PREPARED it, it must have been replayed and the rbtxn purged immediately after. The only reason the rbtxn can exist here is the invalidation addition (ReorderBufferAddInvalidations) happening a couple of calls earlier. Also, instead of the misty '2PC transactions do not contain any reorderbuffers', I would say something like 'create a dummy ReorderBufferTXN to pass it to the callback'.

- The existence of the filter_prepare_cb callback is checked in both decode.c and in filter_prepare_cb_wrapper.

Third patch:

+/*
+ * An xid value pointing to a possibly ongoing or a prepared transaction.
+ * Currently used in logical decoding. It's possible that such transactions
+ * can get aborted while the decoding is ongoing.
+ */

I would explain here that this xid is checked for abort after each catalog scan, and refer to SetupHistoricSnapshot for the details.

Nitpicking:

First patch: I still don't think that these flags need a bitmask.

Second patch:

- I still think the ReorderBufferCommitInternal name is confusing and should be renamed to something like ReorderBufferReplay.

/* Do we know this is a subxact? Xid of top-level txn if so */
TransactionId toplevel_xid;
+ /* In case of 2PC we need to pass GID to output plugin */
+ char *gid;

Better to add a newline here, as between the other fields.

+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);

pstrdup?

- ReorderBufferTxnIsPrepared and ReorderBufferPrepareNeedSkip do the same thing and should be merged, with comments explaining that the answer must be stable.

+ The optional <function>commit_prepared_cb</function> callback is called whenever
+ a commit prepared transaction has been decoded.
+ The <parameter>gid</parameter> field,

a commit prepared transaction *record* has been decoded?

Fourth patch:

Applying: Teach test_decoding plugin to work with 2PC
.git/rebase-apply/patch:347: trailing whitespace.
-- test savepoints
.git/rebase-apply/patch:424: trailing whitespace.
# get XID of the above two-phase transaction
warning: 2 lines add whitespace errors.

[1] https://www.postgresql.org/message-id/87zhxrwgvh.fsf%40ars-thinkpad

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
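For concreteness, the context argument suggested above could look something like this (a sketch; the enum and parameter names are made up):

typedef enum FilterPrepareContext
{
	FILTER_AT_PREPARE,			/* decoding the PREPARE record */
	FILTER_AT_COMMIT_PREPARED,	/* decoding COMMIT PREPARED */
	FILTER_AT_ABORT_PREPARED	/* decoding ROLLBACK PREPARED */
} FilterPrepareContext;

typedef bool (*LogicalDecodeFilterPrepareCB) (struct LogicalDecodingContext *ctx,
											  TransactionId xid,
											  const char *gid,
											  FilterPrepareContext fctx);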
Hi,

I think the difference between abort and abort prepared should be explained better (I am not quite sure I get it myself).

> + The required <function>abort_cb</function> callback is called whenever

Also, why is this one required when all the 2PC stuff is optional?

> +static void
> +DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
> + xl_xact_parsed_prepare * parsed)
> +{
> + XLogRecPtr origin_lsn = parsed->origin_lsn;
> + TimestampTz commit_time = parsed->origin_timestamp;
> + XLogRecPtr origin_id = XLogRecGetOrigin(buf->record);
> + TransactionId xid = parsed->twophase_xid;
> + bool skip;
> +
> + Assert(parsed->dbId != InvalidOid);
> + Assert(TransactionIdIsValid(parsed->twophase_xid));
> +
> + /* Whether or not this PREPARE needs to be skipped. */
> + skip = DecodeEndOfTxn(ctx, buf, parsed, xid);
> +
> + FinalizeTxnDecoding(ctx, buf, parsed, xid, skip);

Given that DecodeEndOfTxn calls SnapBuildCommitTxn, won't this make the catalog changes done by the prepared transaction visible to other transactions (which is undesirable, as they should only be visible after it's committed)?

> + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> + !(IsCatalogRelation(scan->rs_rd) ||
> + RelationIsUsedAsCatalogTable(scan->rs_rd))))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
> + errmsg("improper heap_getnext call")));
> +

I think we should log the relation OID as well, so that plugin developers have an easier time debugging this (for all variants of this check).

--
Petr Jelinek  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
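For instance, the check quoted above could include the relation along these lines (message wording is illustrative):

ereport(ERROR,
		(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
		 errmsg("improper heap_getnext call for relation %u",
				RelationGetRelid(scan->rs_rd))));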
On 14/01/2019 23:16, Arseny Sher wrote:
>
> - Earlier I said that there is no point in sending COMMIT PREPARED if the
> decoding snapshot became consistent after PREPARE, i.e. the PREPARE hadn't
> been sent. I have realized since then that such use cases actually exist:
> the prepare might be copied to the replica by e.g. basebackup or something
> else earlier.

Basebackup does not copy slots, though, and a slot should not reach consistency until all prepared transactions are committed, no?

> - BTW, ReorderBufferProcessXid at PREPARE should always be called
> anyway, because otherwise if the xact is empty, we will not prepare it
> (and call the cb), even if the output plugin asked us not to filter it
> out. However, we will call the commit_prepared cb, which is inconsistent.
>
> - I find it weird that in DecodePrepare and in DecodeCommit you always
> ask the plugin whether to filter an xact, given that sometimes you
> know beforehand that you are not going to replay it: it might have
> already been replayed, might have the wrong dbid, origin, etc. One
> consequence of this: imagine that notorious xact with PREPARE before
> the point where the snapshot became consistent and COMMIT PREPARED after
> that point. Even if filter_cb says 'I want 2PC on this xact', with the
> current code it won't be replayed on PREPARE and the rbtxn will be
> destroyed with ReorderBufferForget. Now this xact is lost.

Yeah, this is wrong.

> Second patch:
>
> + /* filter_prepare is optional, but requires two-phase decoding */
> + if ((ctx->callbacks.filter_prepare_cb != NULL) && (!opt->enable_twophase))
> + ereport(ERROR,
> + (errmsg("Output plugin does not support two-phase decoding, but "
> + "registered filter_prepared callback.")));
>
> I actually think that the enable_twophase output plugin option is
> redundant. If a plugin author wants 2PC, he just provides the
> filter_prepare_cb callback and potentially others.

+1

> I also don't see much
> value in checking that exactly 0 or 3 callbacks were registered.

I think that check makes sense: if you support 2PC you need to register all the callbacks.

> Nitpicking:
>
> First patch: I still don't think that these flags need a bitmask.

Since we are discussing this, I personally prefer the bitmask here.

--
Petr Jelinek  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Eyeballing 0001, it has a few problems.

1. It's under-parenthesizing the txn argument of the macros.

2. The "has"/"is" macro definitions don't return booleans -- see fce4609d5e5b.

3. The remainder of this no longer makes sense:

/* Do we know this is a subxact? Xid of top-level txn if so */
- bool is_known_as_subxact;
TransactionId toplevel_xid;

I suggest fixing the comment, and also improving the comment next to the macro that tests this flag.

(4. The macro names are ugly.)

--
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
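For points 1 and 2 above, properly parenthesized, boolean-returning macros would look something like this (RBTXN_PREPARE appears in the patch; the subxact flag name here is illustrative):

#define rbtxn_prepared(txn) \
	(((txn)->txn_flags & RBTXN_PREPARE) != 0)

#define rbtxn_is_known_subxact(txn) \
	(((txn)->txn_flags & RBTXN_IS_SUBXACT) != 0)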
On Fri, Jan 25, 2019 at 02:03:27PM -0300, Alvaro Herrera wrote:
> Eyeballing 0001, it has a few problems.
>
> 1. It's under-parenthesizing the txn argument of the macros.
>
> 2. The "has"/"is" macro definitions don't return booleans -- see
> fce4609d5e5b.
>
> 3. The remainder of this no longer makes sense:
>
> /* Do we know this is a subxact? Xid of top-level txn if so */
> - bool is_known_as_subxact;
> TransactionId toplevel_xid;
>
> I suggest fixing the comment, and also improving the comment next to the
> macro that tests this flag.
>
> (4. The macro names are ugly.)

This is an old thread, and the latest review is very recent. So I am moving the patch to the next CF, waiting on author.

--
Michael
Attachment
I don't understand why this patch record has been kept alive for so long, since no new version has been sent in ages. If this patch is really waiting on the author, let's see the author do something. If no voice is heard very soon, I'll close this patch as RwF.

If others want to see this feature in PostgreSQL, they are welcome to contribute.

--
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 9/2/19 6:12 PM, Alvaro Herrera wrote:
> I don't understand why this patch record has been kept alive for so long,
> since no new version has been sent in ages. If this patch is really
> waiting on the author, let's see the author do something. If no voice
> is heard very soon, I'll close this patch as RwF.

+1. I should have marked this RWF in March, but I ignoreded it because it was tagged v13 before the CF started.

--
-David
david@pgmasters.net
Hello,
Trying to revive this patch, which attempts to support logical decoding of two-phase transactions. I've rebased and polished Nikhil's patch on the current HEAD. Some of the logic in the previous patchset has already been committed as part of the work on streaming of large in-progress transactions (for example, the handling of concurrent aborts), so all that logic has been left out. I think some of the earlier comments have already been addressed or are no longer relevant. Do have a look at the patch and let me know what you think. I will try to address any pending issues going forward.
regards,
Ajin Cherian
Fujitsu Australia
Attachment
On Mon, Sep 7, 2020 at 10:54 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Trying to revive this patch, which attempts to support logical decoding of
> two-phase transactions. I've rebased and polished Nikhil's patch on the
> current HEAD. Some of the logic in the previous patchset has already been
> committed as part of the work on streaming of large in-progress
> transactions (for example, the handling of concurrent aborts), so all that
> logic has been left out.
>

I am not sure about your point related to concurrent aborts. I think the patch needs some changes related to this. Have you tried to test this behavior? Basically, we have the below code in ReorderBufferProcessTXN() which will be hit for concurrent aborts, and currently the Asserts shown below will fail.

if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
{
	/*
	 * This error can only occur when we are sending the data in
	 * streaming mode and the streaming is not finished yet.
	 */
	Assert(streaming);
	Assert(stream_started);

Nikhil has a test for the same (0004-Teach-test_decoding-plugin-to-work-with-2PC.Jan4) in his last email [1]. You might want to use it to test this behavior. I think you can also keep the tests as a separate patch, as Nikhil had.

One other comment:
===================
@@ -27,6 +27,7 @@ typedef struct OutputPluginOptions
 {
 OutputPluginOutputType output_type;
 bool receive_rewrites;
+ bool enable_twophase;
 } OutputPluginOptions;
..
..
@@ -684,6 +699,33 @@ startup_cb_wrapper(LogicalDecodingContext *ctx, OutputPluginOptions *opt, bool i
 /* do the actual work: call callback */
 ctx->callbacks.startup_cb(ctx, opt, is_init);

+ /*
+ * If the plugin claims to support two-phase transactions, then
+ * check that the plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ if (opt->enable_twophase)
+ {
+ int twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("Output plugin registered only %d twophase callbacks. ",
+ twophase_callbacks)));
+ }

I don't know why the patch has used this way to implement an option to enable two-phase. Can't we use how we implement the 'stream-changes' option in commit 7259736a6e? Just refer to how we set ctx->streaming and you can use a similar way to set this parameter.

--
With Regards,
Amit Kapila.
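For reference, the stream-changes-style handling being suggested above would look roughly like this in pg_decode_startup (a sketch: the option name "two-phase-commit" and a ctx->twophase field analogous to ctx->streaming are assumptions at this point, not committed API):

static void
pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
				  bool is_init)
{
	ListCell   *lc;
	bool		enable_twophase = false;

	foreach(lc, ctx->output_plugin_options)
	{
		DefElem    *elem = (DefElem *) lfirst(lc);

		if (strcmp(elem->defname, "two-phase-commit") == 0)
		{
			if (elem->arg == NULL)
				continue;
			else if (!parse_bool(strVal(elem->arg), &enable_twophase))
				ereport(ERROR,
						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
						 errmsg("could not parse value \"%s\" for parameter \"%s\"",
								strVal(elem->arg), elem->defname)));
		}
	}

	/* honoured only if the plugin also registered the 2PC callbacks */
	ctx->twophase &= enable_twophase;
}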
On Mon, Sep 7, 2020 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Nikhil has a test for the same
(0004-Teach-test_decoding-plugin-to-work-with-2PC.Jan4) in his last
email [1]. You might want to use it to test this behavior. I think you
can also keep the tests as a separate patch as Nikhil had.
Done. I've added the tests and also tweaked the code to make sure that aborts during two-phase commits are also handled.
I don't know why the patch has used this way to implement an option to
enable two-phase. Can't we use how we implement 'stream-changes'
option in commit 7259736a6e? Just refer how we set ctx->streaming and
you can use a similar way to set this parameter.
Done, I've moved the checks for callbacks to inside the corresponding wrappers.
Regards,
Ajin Cherian
Fujitsu Australia
Attachment
On Wed, Sep 9, 2020 at 3:33 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Mon, Sep 7, 2020 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Nikhil has a test for the same
>> (0004-Teach-test_decoding-plugin-to-work-with-2PC.Jan4) in his last
>> email [1]. You might want to use it to test this behavior. I think you
>> can also keep the tests as a separate patch as Nikhil had.
>>
> Done. I've added the tests and also tweaked the code to make sure that
> aborts during two-phase commits are also handled.
>

Okay, I'll look into your changes, but before that, today I have gone through this entire thread to check if there are any design problems, and found that there were two major issues in the original proposal: (a) one was how to handle concurrent aborts, which I think we should be able to deal with in a way similar to what we have done for decoding of in-progress transactions, and (b) what if someone specifically locks pg_class or pg_attribute in exclusive mode (say by LOCK pg_attribute ...); it seems a deadlock can happen in that case [0]. AFAIU, people seem to think that if there is no realistic scenario where a deadlock can happen apart from a user explicitly locking a system catalog, then we might be able to get away with just not decoding such xacts at prepare time, or we would block them in some other way, since any such lock will block the entire system anyway. I am not sure what the right thing is, but something has to be done to avoid any sort of deadlock here.

Another thing I noticed is that originally we had subscriber-side support as well, see [1] (see the *pgoutput* patch), but it was later dropped for some reasons [2]. I think we should have pgoutput support as well, so see what is required to get that incorporated.

I would also like to summarize my thinking on the usefulness of this feature. One of the authors of this patch, Stas, wants this for conflict-free logical replication; see more details in [3]. Craig seems to suggest [3] that this will allow us to avoid conflicting schema changes at different nodes, though it is not clear to me if that is possible without some external code support, because we don't send schema changes in logical replication; maybe Craig can shed some light on this. Another use case, I am thinking, is scaling out reads as well. Because of 2PC, we can ensure that on subscribers we have all the data committed on the master. Now, we can design a system where different nodes are owners of some set of tables and we can always get the data of those tables reliably from those nodes, and then one can have some external process that routes the reads accordingly. I know that the last idea is a bit of hand-waving, but it seems to become possible after this feature.

[0] - https://www.postgresql.org/message-id/20170328012546.473psm6546bgsi2c%40alap3.anarazel.de
[1] - https://www.postgresql.org/message-id/CAMGcDxchx%3D0PeQBVLzrgYG2AQ49QSRxHj5DCp7yy0QrJR0S0nA%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAMGcDxc-kuO9uq0zRCRwbHWBj_rePY9%3DraR7M9pZGWoj9EOGdg%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAMsr%2BYHQzGxnR-peT4SbX2-xiG2uApJMTgZ4a3TiRBM6COyfqg%40mail.gmail.com

--
With Regards,
Amit Kapila.
On Sat, Sep 12, 2020 at 9:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Another thing, I noticed is that originally we have subscriber-side
support as well, see [1] (see *pgoutput* patch) but later dropped it
due to some reasons [2]. I think we should have pgoutput support as
well, so see what is required to get that incorporated.
I have added the rebased patch set for pgoutput and the subscriber-side changes as well. This also includes a subscriber-side test case.
regards,
Ajin Cherian
Attachment
On Wed, Sep 9, 2020 at 3:33 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Mon, Sep 7, 2020 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Nikhil has a test for the same
>> (0004-Teach-test_decoding-plugin-to-work-with-2PC.Jan4) in his last
>> email [1]. You might want to use it to test this behavior. I think you
>> can also keep the tests as a separate patch as Nikhil had.
>>
> Done. I've added the tests and also tweaked the code to make sure that
> aborts during two-phase commits are also handled.
>

I don't think it is complete yet.

*
* This error can only occur when we are sending the data in
* streaming mode and the streaming is not finished yet.
*/
- Assert(streaming);
- Assert(stream_started);
+ Assert(streaming || rbtxn_prepared(txn));
+ Assert(stream_started || rbtxn_prepared(txn));

Here, you have updated the code but the comment is still not updated.

*
@@ -2370,10 +2391,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 errdata = NULL;
 curtxn->concurrent_abort = true;

- /* Reset the TXN so that it is allowed to stream remaining data. */
- ReorderBufferResetTXN(rb, txn, snapshot_now,
- command_id, prev_lsn,
- specinsert);
+ /* If streaming, reset the TXN so that it is allowed to stream remaining data. */
+ if (streaming && stream_started)
+ {
+ ReorderBufferResetTXN(rb, txn, snapshot_now,
+ command_id, prev_lsn,
+ specinsert);
+ }
+ else
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"", txn->xid);
+ rb->abort(rb, txn, commit_lsn);
+ }

I don't think we need to perform an abort here. Later we will anyway encounter the WAL for ROLLBACK PREPARED, for which we will call abort_prepared_cb. As we have set the 'concurrent_abort' flag, it will allow us to skip all the intermediate records. Here, we need only enough state in ReorderBufferTXN that it can later be used for ReorderBufferFinishPrepared(). Basically, you need functionality similar to ReorderBufferTruncateTXN where, except for invalidations, you can free memory for everything else. You can either write a new function ReorderBufferTruncatePreparedTxn or pass another bool parameter to ReorderBufferTruncateTXN to indicate it is a prepared xact and then clean up the additional things that are not required for a prepared xact.

*
Similarly, I don't understand why we need the below code:
ReorderBufferProcessTXN()
{
..
+ if (rbtxn_rollback(txn))
+ rb->abort(rb, txn, commit_lsn);
..
}

There is nowhere we are setting the RBTXN_ROLLBACK flag, so how will this check be true? If we decide to remove this code then don't forget to update the comments.

*
If my previous two comments are correct then I don't think we need the below interface.
+ <sect3 id="logicaldecoding-output-plugin-abort">
+ <title>Transaction Abort Callback</title>
+
+ <para>
+ The required <function>abort_cb</function> callback is called whenever
+ a transaction abort has to be initiated. This can happen if we are
+ decoding a transaction that has been prepared for two-phase commit and
+ a concurrent rollback happens while we are decoding it.
+<programlisting>
+typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,
+ XLogRecPtr abort_lsn);

>> I don't know why the patch has used this way to implement an option to
>> enable two-phase. Can't we use how we implement the 'stream-changes'
>> option in commit 7259736a6e? Just refer to how we set ctx->streaming and
>> you can use a similar way to set this parameter.
>
> Done, I've moved the checks for callbacks to inside the corresponding wrappers.
>

This is not what I suggested. Please study commit 7259736a6e and see how the streaming option is implemented. I want subscribers to later be able to specify whether they want transactions to be decoded at prepare time, similar to what we have done for streaming. Also, search for ctx->streaming in the code and see how it is set, to get the idea.

Note: Please use a version number while sending patches; you can use something like git format-patch -N -v n to do that. It makes it easier for the reviewer to compare with the previous version.

Few other comments:
===================
1.
ReorderBufferProcessTXN()
{
..
if (streaming)
{
ReorderBufferTruncateTXN(rb, txn);

/* Reset the CheckXidAlive */
CheckXidAlive = InvalidTransactionId;
}
else
ReorderBufferCleanupTXN(rb, txn);
..
}

I don't think we can perform ReorderBufferCleanupTXN for prepared transactions, because if we have removed the ReorderBufferTXN before commit, the later code might not consider such a transaction to be in the system and compute a wrong value of restart_lsn for a slot. Basically, in SnapBuildProcessRunningXacts(), when we call ReorderBufferGetOldestTXN() it should show the ReorderBufferTXN of the prepared transaction which is not yet committed, but because we have removed it after prepare, it won't get that TXN, and that leads to a wrong computation of restart_lsn. Once we start from a wrong point in WAL, the snapshot built is incorrect, which will lead to wrong results. This is the same reason why the patch is not doing ReorderBufferForget in DecodePrepare when we decide to skip the transaction. Also, here we need to set CheckXidAlive = InvalidTransactionId for the prepared-xact case as well.

2. Have you thought about the interaction of streaming with prepared transactions? You can try writing some tests using the pg_logical* APIs and see the behaviour. For example, there is no handling in ReorderBufferStreamCommit for the same. I think you need to introduce a stream_prepare API similar to stream_commit and then use the same.

3.
- if (streaming)
+ if (streaming || rbtxn_prepared(change->txn))
 {
 curtxn = change->txn;
 SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2254,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 break;
 }
 }
-
 /*

Spurious line removal.

--
With Regards,
Amit Kapila.
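Regarding comment 2 above, the new callback would presumably mirror stream_commit, e.g. (a sketch; the name follows the existing stream_* pattern but is not final):

typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx,
											  ReorderBufferTXN *txn,
											  XLogRecPtr prepare_lsn);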
On Tue, Sep 15, 2020 at 5:27 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> I have added the rebased patch set for pgoutput and the subscriber-side
> changes as well. This also includes a subscriber-side test case.
>

As mentioned in my email, there were some reasons due to which that support was left for later. Have you checked those, and if so, can you please explain how you have addressed them, or why they are no longer relevant if that is the case?

--
With Regards,
Amit Kapila.
On Tue, Sep 15, 2020 at 10:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Few other comments:
===================
1.
ReorderBufferProcessTXN()
{
..
if (streaming)
{
ReorderBufferTruncateTXN(rb, txn);
/* Reset the CheckXidAlive */
CheckXidAlive = InvalidTransactionId;
}
else
ReorderBufferCleanupTXN(rb, txn);
..
}
I don't think we can perform ReorderBufferCleanupTXN for the prepared
transactions because if we have removed the ReorderBufferTxn before
commit, the later code might not consider such a transaction in the
system and compute the wrong value of restart_lsn for a slot.
Basically, in SnapBuildProcessRunningXacts() when we call
ReorderBufferGetOldestTXN(), it should show the ReorderBufferTxn of
the prepared transaction which is not yet committed but because we
have removed it after prepare, it won't get that TXN and then that
leads to wrong computation of restart_lsn. Once we start from a wrong
point in WAL, the snapshot built was incorrect which will lead to the
wrong result. This is the same reason why the patch is not doing
ReorderBufferForget in DecodePrepare when we decide to skip the
transaction. Also, here, we need to set CheckXidAlive =
InvalidTransactionId; for prepared xact as well.
Just to confirm what you are expecting here: so after we send out the prepared transaction to the plugin, you are suggesting NOT to do a ReorderBufferCleanupTXN, but what should we do instead? Are you suggesting we do what you suggested
as part of concurrent abort handling? Something equivalent to ReorderBufferTruncateTXN()? Remove all changes of the transaction but keep the invalidations, tuplecids, etc.? Do you think we should have a new flag in the txn to indicate that this transaction has already been decoded (prepare_decoded?)? Is any other special handling required?
regards,
Ajin Cherian
Fujitsu Australia
On Thu, Sep 17, 2020 at 2:02 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Just to confirm what you are expecting here: so after we send out the
> prepared transaction to the plugin, you are suggesting NOT to do a
> ReorderBufferCleanupTXN, but what should we do instead? Are you
> suggesting we do what you suggested as part of concurrent abort
> handling?
>

Yes.

> Something equivalent to ReorderBufferTruncateTXN()? Remove all changes
> of the transaction but keep the invalidations, tuplecids, etc.?
>

I don't think you need the tuplecids. I have checked ReorderBufferFinishPrepared() and that seems to require only the invalidations; check if anything else is required.

> Do you think we should have a new flag in the txn to indicate that this
> transaction has already been decoded (prepare_decoded?)?
>

Yeah, I think that would be better. How about naming the new variable cleanup_prepared?

--
With Regards,
Amit Kapila.
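In other words, at COMMIT PREPARED / ROLLBACK PREPARED time only the invalidations accumulated earlier still matter, roughly along these lines (an illustrative fragment, not text from the patch):

/* execute the invalidations that were collected at PREPARE time */
if (txn->ninvalidations > 0)
	ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
									   txn->invalidations);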
On Tue, Sep 15, 2020 at 10:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> I don't think it is complete yet.
> *
> * This error can only occur when we are sending the data in
> * streaming mode and the streaming is not finished yet.
> */
> - Assert(streaming);
> - Assert(stream_started);
> + Assert(streaming || rbtxn_prepared(txn));
> + Assert(stream_started || rbtxn_prepared(txn));
>
> Here, you have updated the code but the comment is still not updated.

Updated the comments.

> I don't think we need to perform an abort here. Later we will anyway
> encounter the WAL for ROLLBACK PREPARED, for which we will call
> abort_prepared_cb. As we have set the 'concurrent_abort' flag, it will
> allow us to skip all the intermediate records. Here, we need only
> enough state in ReorderBufferTXN that it can later be used for
> ReorderBufferFinishPrepared(). Basically, you need functionality
> similar to ReorderBufferTruncateTXN where, except for invalidations, you
> can free memory for everything else. You can either write a new
> function ReorderBufferTruncatePreparedTxn or pass another bool
> parameter to ReorderBufferTruncateTXN to indicate it is a prepared xact
> and then clean up the additional things that are not required for a
> prepared xact.

Added a new parameter to ReorderBufferTruncateTXN for prepared transactions and did cleanup of tuplecids as well; I have left snapshots and transactions. As a result of this, I also had to create a new function ReorderBufferCleanupPreparedTXN which will clean up the rest as part of FinishPrepared handling, as we can't call ReorderBufferCleanupTXN again after this.

> *
> Similarly, I don't understand why we need the below code:
> ReorderBufferProcessTXN()
> {
> ..
> + if (rbtxn_rollback(txn))
> + rb->abort(rb, txn, commit_lsn);
> ..
> }
>
> There is nowhere we are setting the RBTXN_ROLLBACK flag, so how will
> this check be true? If we decide to remove this code then don't forget
> to update the comments.

Removed.

> *
> If my previous two comments are correct then I don't think we need the
> below interface.
> + <sect3 id="logicaldecoding-output-plugin-abort">
> + <title>Transaction Abort Callback</title>
> [...]
> +typedef void (*LogicalDecodeAbortCB) (struct LogicalDecodingContext *ctx,
> + ReorderBufferTXN *txn,
> + XLogRecPtr abort_lsn);

Removed.

>>> I don't know why the patch has used this way to implement an option to
>>> enable two-phase. Can't we use how we implement the 'stream-changes'
>>> option in commit 7259736a6e? Just refer to how we set ctx->streaming and
>>> you can use a similar way to set this parameter.
>>
>> Done, I've moved the checks for callbacks to inside the corresponding wrappers.
>
> This is not what I suggested. Please study commit 7259736a6e and
> see how the streaming option is implemented. I want subscribers to later
> be able to specify whether they want transactions to be decoded at
> prepare time, similar to what we have done for streaming. Also, search
> for ctx->streaming in the code and see how it is set, to get the idea.

Changed it similar to the ctx->streaming logic.

> Note: Please use a version number while sending patches; you can use
> something like git format-patch -N -v n to do that.
> It makes it easier for the reviewer to compare with the previous version.

Done.

> Few other comments:
> ===================
> 1.
> ReorderBufferProcessTXN()
> {
> ..
> if (streaming)
> {
> ReorderBufferTruncateTXN(rb, txn);
>
> /* Reset the CheckXidAlive */
> CheckXidAlive = InvalidTransactionId;
> }
> else
> ReorderBufferCleanupTXN(rb, txn);
> ..
> }
>
> I don't think we can perform ReorderBufferCleanupTXN for prepared
> transactions, because if we have removed the ReorderBufferTXN before
> commit, the later code might not consider such a transaction to be in
> the system and compute a wrong value of restart_lsn for a slot.
> Basically, in SnapBuildProcessRunningXacts(), when we call
> ReorderBufferGetOldestTXN() it should show the ReorderBufferTXN of
> the prepared transaction which is not yet committed, but because we
> have removed it after prepare, it won't get that TXN, and that
> leads to a wrong computation of restart_lsn. Once we start from a wrong
> point in WAL, the snapshot built is incorrect, which will lead to
> wrong results. This is the same reason why the patch is not doing
> ReorderBufferForget in DecodePrepare when we decide to skip the
> transaction. Also, here we need to set CheckXidAlive =
> InvalidTransactionId for the prepared-xact case as well.

Updated as suggested above.

> 2. Have you thought about the interaction of streaming with prepared
> transactions? You can try writing some tests using the pg_logical* APIs
> and see the behaviour. For example, there is no handling in
> ReorderBufferStreamCommit for the same. I think you need to introduce
> a stream_prepare API similar to stream_commit and then use the same.

This is pending. I will look at it in the next iteration. Also pending is the investigation into why the pgoutput changes were not added initially.

regards,
Ajin Cherian
Fujitsu Australia
Attachment
On Thu, Sep 17, 2020 at 10:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> Yeah, I think that would be better. How about naming the new variable
> cleanup_prepared?

I haven't added a new flag to indicate that the prepare was cleaned up, as that wasn't really necessary. Instead, I used a new function to do the partial cleanup of whatever was not done in the truncate. If you think using a flag and doing special handling in ReorderBufferCleanupTXN is a better idea, let me know.

regards,
Ajin Cherian
Fujitsu Australia
On Fri, Sep 18, 2020 at 6:02 PM Ajin Cherian <itsajin@gmail.com> wrote:
>

I have reviewed the v4-0001 patch and I have a few comments. I haven't yet completely reviewed the patch.

1.
+ /*
+ * Process invalidation messages, even if we're not interested in the
+ * transaction's contents, since the various caches need to always be
+ * consistent.
+ */
+ if (parsed->nmsgs > 0)
+ {
+ if (!ctx->fast_forward)
+ ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
+ parsed->nmsgs, parsed->msgs);
+ ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
+ }
+

I think we don't need to add prepare-time invalidation messages, as we are now already logging the invalidations at the command level and adding them to the reorder buffer.

2.
+ /*
+ * Tell the reorderbuffer about the surviving subtransactions. We need to
+ * do this because the main transaction itself has not committed since we
+ * are in the prepare phase right now. So we need to be sure the snapshot
+ * is setup correctly for the main transaction in case all changes
+ * happened in subtransanctions
+ */
+ for (i = 0; i < parsed->nsubxacts; i++)
+ {
+ ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
+ buf->origptr, buf->endptr);
+ }
+
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+ (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+ ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+ return;

Do we need to call ReorderBufferCommitChild if we are skipping this transaction? I think the below check should come before calling ReorderBufferCommitChild.

3.
+ /*
+ * If it's ROLLBACK PREPARED then handle it via callbacks.
+ */
+ if (TransactionIdIsValid(xid) &&
+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&
+ ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
+ {
+ ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+ commit_time, origin_id, origin_lsn,
+ parsed->twophase_gid, false);
+ return;
+ }

I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId == ctx->slot->data.database and !FilterByOrigin in DecodePrepare, so if those are not true then we wouldn't have prepared this transaction, i.e. ReorderBufferTxnIsPrepared will be false, so why do we need to recheck these conditions?

4.
+ /* If streaming, reset the TXN so that it is allowed to stream remaining data. */
+ if (streaming && stream_started)
+ {
+ ReorderBufferResetTXN(rb, txn, snapshot_now,
+ command_id, prev_lsn,
+ specinsert);
+ }
+ else
+ {
+ elog(LOG, "stopping decoding of %s (%u)",
+ txn->gid[0] != '\0'? txn->gid:"", txn->xid);
+ ReorderBufferTruncateTXN(rb, txn, true);
+ }

Why is checking only if (streaming) not enough? I agree that if we are coming here in streaming mode then stream_started must be true, but we already have an assert for that.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
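For the second point above, the suggested reordering would look roughly like this (rearranged from the quoted hunk; illustrative):

/* first decide whether this PREPARE is to be skipped at all */
if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
	(parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
	ctx->fast_forward || FilterByOrigin(ctx, origin_id))
	return;

/* only now tell the reorderbuffer about the surviving subtransactions */
for (i = 0; i < parsed->nsubxacts; i++)
	ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
							 buf->origptr, buf->endptr);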
On Fri, Sep 18, 2020 at 6:02 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Added a new parameter to ReorderBufferTruncateTXN for prepared
> transactions and did cleanup of tuplecids as well; I have left
> snapshots and transactions. As a result of this, I also had to create
> a new function ReorderBufferCleanupPreparedTXN which will clean up the
> rest as part of FinishPrepared handling, as we can't call
> ReorderBufferCleanupTXN again after this.
>

Why can't we call ReorderBufferCleanupTXN() from ReorderBufferFinishPrepared after your changes?

+ * If streaming, keep the remaining info - transactions, tuplecids, invalidations and
+ * snapshots.If after a PREPARE, keep only the invalidations and snapshots.
 */
static void
-ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
+ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prepared)

Why do we even need the snapshot for prepared transactions? Also, note that in the comment there is no space before you start a new sentence.

>>> I don't know why the patch has used this way to implement an option to
>>> enable two-phase. Can't we use how we implement the 'stream-changes'
>>> option in commit 7259736a6e? Just refer to how we set ctx->streaming and
>>> you can use a similar way to set this parameter.
>>
>> Changed it similar to the ctx->streaming logic.

Hmm, I still don't see the relevant changes in pg_decode_startup().

--
With Regards,
Amit Kapila.
On Sun, Sep 20, 2020 at 11:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
> == ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
> so if those are not true then we wouldn't have prepared this
> transaction, i.e. ReorderBufferTxnIsPrepared will be false, so why do
> we need to recheck these conditions?
>

Yeah, probably we should have Asserts for the below three conditions:

+ !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
+ parsed->dbId == ctx->slot->data.database &&
+ !FilterByOrigin(ctx, origin_id) &&

Your other comments make sense to me.

--
With Regards,
Amit Kapila.
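That is, the block in DecodeAbort would become something like this (a sketch based on the quoted hunk):

if (TransactionIdIsValid(xid) &&
	ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
{
	/* these were already checked when the xact was prepared */
	Assert(!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr));
	Assert(parsed->dbId == ctx->slot->data.database);
	Assert(!FilterByOrigin(ctx, origin_id));

	ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
								commit_time, origin_id, origin_lsn,
								parsed->twophase_gid, false);
	return;
}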
> Why can't we call ReorderBufferCleanupTXN() from
> ReorderBufferFinishPrepared after your changes?
>

Since the truncate has already removed the changes, it would fail on the below Assert in ReorderBufferCleanupTXN():

/* Check we're not mixing changes from different transactions. */
Assert(change->txn == txn);

regards,
Ajin Cherian
Fujitsu Australia
On Mon, Sep 21, 2020 at 12:36 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Since the truncate has already removed the changes, it would fail on the
> below Assert in ReorderBufferCleanupTXN():
>
> /* Check we're not mixing changes from different transactions. */
> Assert(change->txn == txn);
>

The changes list should be empty by that time, because we are removing each change from the list; see the code "dlist_delete(&change->node);" in ReorderBufferTruncateTXN. If you are hitting the Assert as you mentioned, then I think the problem is something else.

--
With Regards,
Amit Kapila.
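The loop in question looks roughly like this (abbreviated from ReorderBufferTruncateTXN; the cleanup of each change itself is omitted here):

dlist_mutable_iter iter;

dlist_foreach_modify(iter, &txn->changes)
{
	ReorderBufferChange *change =
		dlist_container(ReorderBufferChange, node, iter.cur);

	/* Check we're not mixing changes from different transactions. */
	Assert(change->txn == txn);

	/* unlink, so txn->changes is empty once the loop finishes */
	dlist_delete(&change->node);
	/* ... the change is then returned to the reorderbuffer's pool ... */
}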
On Mon, Sep 21, 2020 at 10:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Sep 20, 2020 at 11:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
>> == ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
>> so if those are not true then we wouldn't have prepared this
>> transaction, i.e. ReorderBufferTxnIsPrepared will be false, so why do
>> we need to recheck these conditions?
>
> Yeah, probably we should have Asserts for the below three conditions:
> + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
> + parsed->dbId == ctx->slot->data.database &&
> + !FilterByOrigin(ctx, origin_id) &&

+1

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> + /*
> + * If it's ROLLBACK PREPARED then handle it via callbacks.
> + */
> + if (TransactionIdIsValid(xid) &&
> + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
> + parsed->dbId == ctx->slot->data.database &&
> + !FilterByOrigin(ctx, origin_id) &&
> + ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
> + {
> + ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
> + commit_time, origin_id, origin_lsn,
> + parsed->twophase_gid, false);
> + return;
> + }
>
> I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
> == ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
> so if those are not true then we wouldn't have prepared this
> transaction, i.e. ReorderBufferTxnIsPrepared will be false, so why do
> we need to recheck these conditions?

We could enter DecodeAbort even without a prepare, as the code is common for both XLOG_XACT_ABORT and XLOG_XACT_ABORT_PREPARED. So the conditions !SnapBuildXactNeedsSkip, parsed->dbId == ctx->slot->data.database and !FilterByOrigin could be true while the transaction is not prepared; then we don't need to do a ReorderBufferFinishPrepared (with the commit flag false) but should call ReorderBufferAbort. But I think there is a problem: if those conditions are in fact false, then we should return without trying to abort using ReorderBufferAbort. What do you think?

I agree with all your other comments.

regards,
Ajin
Fujitsu Australia
On Mon, Sep 21, 2020 at 3:45 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> We could enter DecodeAbort even without a prepare, as the code is
> common for both XLOG_XACT_ABORT and XLOG_XACT_ABORT_PREPARED. So the
> conditions !SnapBuildXactNeedsSkip, parsed->dbId ==
> ctx->slot->data.database and !FilterByOrigin could be true while the
> transaction is not prepared; then we don't need to do a
> ReorderBufferFinishPrepared (with the commit flag false) but should
> call ReorderBufferAbort. But I think there is a problem: if those
> conditions are in fact false, then we should return without trying to
> abort using ReorderBufferAbort. What do you think?
>

I think we need to call ReorderBufferAbort at least to clean up the TXN. Also, if what you are saying is correct then that should be true without this patch as well, no? If so, we don't need to worry about it as far as this patch is concerned.

--
With Regards,
Amit Kapila.
On Mon, Sep 21, 2020 at 9:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> I think we need to call ReorderBufferAbort at least to clean up the
> TXN. Also, if what you are saying is correct then that should be true
> without this patch as well, no? If so, we don't need to worry about it
> as far as this patch is concerned.

Yes, that is true. So I will change this check to:

if (TransactionIdIsValid(xid) &&
    ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))

regards,
Ajin Cherian
Fujitsu Australia
On Mon, Sep 21, 2020 at 5:23 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Mon, Sep 21, 2020 at 9:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I think we need to call ReorderBufferAbort at least to clean up the > > TXN. Also, if what you are saying is correct then that should be true > > without this patch as well, no? If so, we don't need to worry about it > > as far as this patch is concerned. > > Yes, that is true. So will change this check to: > > if (TransactionIdIsValid(xid) && > ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid) > Yeah and add the Assert for skip conditions as asked above. -- With Regards, Amit Kapila.
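Putting the two agreed changes together, the hunk in DecodeAbort would become something like the sketch below. ReorderBufferTxnIsPrepared() and ReorderBufferFinishPrepared() are the patch's functions, and note that later messages in this thread revisit whether the asserted conditions in fact always hold:

    if (TransactionIdIsValid(xid) &&
        ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
    {
        /*
         * We could only have prepared this transaction if none of the skip
         * conditions applied at prepare time, so just assert them here.
         */
        Assert(!SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr));
        Assert(parsed->dbId == ctx->slot->data.database);
        Assert(!FilterByOrigin(ctx, origin_id));

        ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
                                    commit_time, origin_id, origin_lsn,
                                    parsed->twophase_gid, false);
        return;
    }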
On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> 1.
> + /*
> + * Process invalidation messages, even if we're not interested in the
> + * transaction's contents, since the various caches need to always be
> + * consistent.
> + */
> + if (parsed->nmsgs > 0)
> + {
> + if (!ctx->fast_forward)
> + ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr,
> + parsed->nmsgs, parsed->msgs);
> + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr);
> + }
> +
>
> I think we don't need to add prepare-time invalidation messages, as we
> are now already logging the invalidations at the command level and adding
> them to the reorder buffer.

Removed.

> 2.
>
> + /*
> + * Tell the reorderbuffer about the surviving subtransactions. We need to
> + * do this because the main transaction itself has not committed since we
> + * are in the prepare phase right now. So we need to be sure the snapshot
> + * is setup correctly for the main transaction in case all changes
> + * happened in subtransanctions
> + */
> + for (i = 0; i < parsed->nsubxacts; i++)
> + {
> + ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
> + buf->origptr, buf->endptr);
> + }
> +
> + if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
> + (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
> + ctx->fast_forward || FilterByOrigin(ctx, origin_id))
> + return;
>
> Do we need to call ReorderBufferCommitChild if we are skipping this transaction?
> I think the below check should be before calling ReorderBufferCommitChild.

Done.

> 3.
>
> + /*
> + * If it's ROLLBACK PREPARED then handle it via callbacks.
> + */
> + if (TransactionIdIsValid(xid) &&
> + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
> + parsed->dbId == ctx->slot->data.database &&
> + !FilterByOrigin(ctx, origin_id) &&
> + ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
> + {
> + ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
> + commit_time, origin_id, origin_lsn,
> + parsed->twophase_gid, false);
> + return;
> + }
>
> I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
> == ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
> so if those are not true then we wouldn't have prepared this
> transaction, i.e. ReorderBufferTxnIsPrepared will be false, so why do we
> need to recheck these conditions?

I didn't change this, as I am seeing cases where the abort is getting called for transactions that need to be skipped. I also see that the same check is there in both DecodePrepare and DecodeCommit. So, while those transactions were not getting prepared or committed, decoding tries to handle ROLLBACK PREPARED for them (as part of the finish-prepared handling). The check in ReorderBufferTxnIsPrepared() is also not proper. I will need to look at this logic again in a future patch.

> 4.
>
> + /* If streaming, reset the TXN so that it is allowed to stream
> remaining data. */
> + if (streaming && stream_started)
> + {
> + ReorderBufferResetTXN(rb, txn, snapshot_now,
> + command_id, prev_lsn,
> + specinsert);
> + }
> + else
> + {
> + elog(LOG, "stopping decoding of %s (%u)",
> + txn->gid[0] != '\0'? txn->gid:"", txn->xid);
> + ReorderBufferTruncateTXN(rb, txn, true);
> + }
>
> Why is only if (streaming) not enough? I agree that if we are coming here
> and it is streaming mode then streaming must have started,
> but we already have an assert for that.

Changed.
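For reference, the reordering done for comment 2 amounts to deciding whether the prepare is skipped before telling the reorder buffer about the surviving subtransactions. A sketch of the resulting order in DecodePrepare, using the patch's own code:

    /* Skip-check first: nothing below matters if we don't decode this xact. */
    if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
        (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
        ctx->fast_forward || FilterByOrigin(ctx, origin_id))
        return;

    /* Only now tell the reorderbuffer about the surviving subtransactions. */
    for (i = 0; i < parsed->nsubxacts; i++)
        ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i],
                                 buf->origptr, buf->endptr);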
Amit,
I have also changed the test_decoding startup to support two-phase commit only if specified, similar to how it was done for streaming. I have also changed the test cases accordingly. However, I have not added it to the pgoutput startup, as that would require CREATE SUBSCRIPTION changes. I will do that in a future patch.

Some other pending changes are:
1. Remove snapshots on prepare truncate.
2. Look at why ReorderBufferCleanupTXN is failing after a ReorderBufferTruncateTXN.
3. Add prepare support to streaming.

regards,
Ajin Cherian
Fujitsu Australia
Attachment
On Tue, Sep 22, 2020 at 5:18 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > 3.
> >
> > + /*
> > + * If it's ROLLBACK PREPARED then handle it via callbacks.
> > + */
> > + if (TransactionIdIsValid(xid) &&
> > + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
> > + parsed->dbId == ctx->slot->data.database &&
> > + !FilterByOrigin(ctx, origin_id) &&
> > + ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
> > + {
> > + ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
> > + commit_time, origin_id, origin_lsn,
> > + parsed->twophase_gid, false);
> > + return;
> > + }
> >
> > I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
> > == ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
> > so if those are not true then we wouldn't have prepared this
> > transaction, i.e. ReorderBufferTxnIsPrepared will be false, so why do we
> > need to recheck these conditions?
>
> I didn't change this, as I am seeing cases where the abort is getting
> called for transactions that need to be skipped. I also see that the
> same check is there in both DecodePrepare and DecodeCommit. So, while
> those transactions were not getting prepared or committed, decoding
> tries to handle ROLLBACK PREPARED for them (as part of the
> finish-prepared handling). The check in ReorderBufferTxnIsPrepared()
> is also not proper.

If the transaction is prepared, which you can ensure via ReorderBufferTxnIsPrepared() (considering you have a proper check in that function), it should not require skipping the transaction in Abort. One way it could happen is if you clean up the ReorderBufferTXN in Prepare, which you were doing in an earlier version of the patch and which I pointed out was wrong. If you have changed that, then I don't know why it could fail; maybe someplace else during prepare the patch is freeing it. Just check that.

> I will need to look at
> this logic again in a future patch.

No problem. I think you can handle the other comments and then we can come back to this; you might want to share the exact details of the test (maybe a narrowed-down version of the original test) and I or someone else might be able to help you with that.

--
With Regards,
Amit Kapila.
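For illustration only, a "proper check" in ReorderBufferTxnIsPrepared() could look something like the sketch below. The function name and the rbtxn_prepared() macro are the patch's; this body is an assumption for discussion, not the patch's actual code:

    bool
    ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
                               const char *gid)
    {
        ReorderBufferTXN *txn;

        /* Look the transaction up without creating it. */
        txn = ReorderBufferTXNByXid(rb, xid, false, NULL, InvalidXLogRecPtr,
                                    false);

        /* Prepared by us, and for the same GID the WAL record names. */
        return txn != NULL && rbtxn_prepared(txn) &&
               strcmp(txn->gid, gid) == 0;
    }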
On Wed, Sep 23, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> No problem. I think you can handle the other comments and then we can
> come back to this; you might want to share the exact details of the
> test (maybe a narrowed-down version of the original test) and I or
> someone else might be able to help you with that.
>
> --
> With Regards,
> Amit Kapila.

I have added a new patch for supporting two-phase commit semantics in the streaming APIs for the logical decoding plugins. I have added 3 APIs:
1. stream_prepare
2. stream_commit_prepared
3. stream_abort_prepared

I have also added the support for the new APIs in the test_decoding plugin. I have not yet added it to pgoutput.

I have also added a fix for the error I saw while calling ReorderBufferCleanupTXN as part of the FinishPrepared handling. As a result I have removed the function I added earlier, ReorderBufferCleanupPreparedTXN.

Please have a look at the new changes and let me know what you think.

I will continue to look at:

1. Remove snapshots on prepare truncate.
2. A bug seen during abort of a prepared transaction: the prepared flag is lost, so we cannot make out that it was a previously prepared transaction.

regards,
Ajin Cherian
Fujitsu Australia
Attachment
On Mon, Sep 28, 2020 at 1:13 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Wed, Sep 23, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have added a new patch for supporting 2 phase commit semantics in > the streaming APIs for the logical decoding plugins. I have added 3 > APIs > 1. stream_prepare > 2. stream_commit_prepared > 3. stream_abort_prepared > > I have also added the support for the new APIs in test_decoding > plugin. I have not yet added it to pgoutpout. > > I have also added a fix for the error I saw while calling > ReorderBufferCleanupTXN as part of FinishPrepared handling. As a > result I have removed the function I added earlier, > ReorderBufferCleanupPreparedTXN. > Can you explain what was the problem and how you fixed it? > Please have a look at the new changes and let me know what you think. > > I will continue to look at: > > 1. Remove snapshots on prepare truncate. > 2. Bug seen while abort of prepared transaction, the prepared flag is > lost, and not able to make out that it was a previously prepared > transaction. > And the support of new APIs in pgoutput, right? -- With Regards, Amit Kapila.
On Mon, Sep 28, 2020 at 6:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 1:13 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Wed, Sep 23, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have added a new patch for supporting two-phase commit semantics in
> > the streaming APIs for the logical decoding plugins. I have added 3 APIs:
> > 1. stream_prepare
> > 2. stream_commit_prepared
> > 3. stream_abort_prepared
> >
> > I have also added the support for the new APIs in the test_decoding
> > plugin. I have not yet added it to pgoutput.
> >
> > I have also added a fix for the error I saw while calling
> > ReorderBufferCleanupTXN as part of the FinishPrepared handling. As a
> > result I have removed the function I added earlier,
> > ReorderBufferCleanupPreparedTXN.
>
> Can you explain what was the problem and how you fixed it?

When I added the changes for cleaning up tuplecids in ReorderBufferTruncateTXN, I was not deleting each change from its list (dlist_delete), only calling ReorderBufferReturnChange to free the memory. That logic was copied from ReorderBufferCleanupTXN, where the lists are all cleaned up at the end, so the delete was not needed in each list's cleanup there.

>
> > Please have a look at the new changes and let me know what you think.
> >
> > I will continue to look at:
> >
> > 1. Remove snapshots on prepare truncate.
> > 2. A bug seen during abort of a prepared transaction: the prepared flag
> > is lost, so we cannot make out that it was a previously prepared
> > transaction.
>
> And the support of new APIs in pgoutput, right?

Yes, that also.

regards,
Ajin Cherian
Fujitsu Australia
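In other words, a sketch of the fix as described (list and helper names from reorderbuffer.c; ReorderBufferReturnChange takes two arguments as of this thread): when truncating a prepared transaction, each change has to be unlinked from its containing list before being returned, because unlike in ReorderBufferCleanupTXN the lists survive the truncation.

    dlist_mutable_iter iter;

    /* Clean up the tuplecids while keeping the list itself usable. */
    dlist_foreach_modify(iter, &txn->tuplecids)
    {
        ReorderBufferChange *change;

        change = dlist_container(ReorderBufferChange, node, iter.cur);

        dlist_delete(&change->node);            /* unlink from txn->tuplecids */
        ReorderBufferReturnChange(rb, change);  /* then free it */
    }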
On Wed, Sep 23, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> If the transaction is prepared, which you can ensure via
> ReorderBufferTxnIsPrepared() (considering you have a proper check in
> that function), it should not require skipping the transaction in
> Abort. One way it could happen is if you clean up the ReorderBufferTXN
> in Prepare, which you were doing in an earlier version of the patch and
> which I pointed out was wrong. If you have changed that, then I don't
> know why it could fail; maybe someplace else during prepare the patch
> is freeing it. Just check that.

I had a look at this problem. The problem happens when decoding is done after a prepare but before the corresponding ROLLBACK PREPARED/COMMIT PREPARED. For example:

Begin;
<change 1>
<change 2>
PREPARE TRANSACTION '<prepare#1>';
SELECT data FROM pg_logical_slot_get_changes(...);
:
:
ROLLBACK PREPARED '<prepare#1>';
SELECT data FROM pg_logical_slot_get_changes(...);

Since the prepare is consumed in the first call to pg_logical_slot_get_changes, when it is encountered again in the second call it is skipped (as already decoded) in DecodePrepare, and the txn->txn_flags are never set to reflect the fact that it was prepared. The same behaviour is seen when it is committed prepared after the original prepare was consumed.

Initially I was thinking about the following approach to fix it in DecodePrepare.

Approach 1:
1. Break the big skip check in DecodePrepare into two parts.
Return if the following conditions are true:

If (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
ctx->fast_forward || FilterByOrigin(ctx, origin_id))

2. Check if this condition is true:
SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr)

If so, this means we are skipping because the transaction has already been decoded; then, instead of returning, call a new function ReorderBufferMarkPrepare() which will only update the flags in the txn to indicate that the transaction is prepared. Later, in DecodeAbort or DecodeCommit, we can confirm that the transaction has been prepared by checking that the flag is set, and call ReorderBufferFinishPrepared appropriately.

But then, thinking about this some more, I thought of a second approach.

Approach 2:
If the only purpose of all this is to differentiate between Abort vs Rollback Prepared and Commit vs Commit Prepared, then we don't need this. We already know the exact operation in DecodeXactOp and can differentiate there. We only overloaded DecodeAbort and DecodeCommit for convenience; we can always call these functions with an extra flag to denote that we are either committing or aborting a previously prepared transaction, and call ReorderBufferFinishPrepared accordingly.

Let me know your thoughts on these two approaches or any other suggestions on this.

regards,
Ajin Cherian
Fujitsu Australia
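To make Approach 2 concrete, a sketch of the dispatch in DecodeXactOp. The per-case record parsing is elided, and the extra boolean parameter is the proposed change, not existing code:

    case XLOG_XACT_COMMIT:
        /* ... parse the commit record into 'parsed', derive 'xid' ... */
        DecodeCommit(ctx, buf, &parsed, xid, false);    /* plain COMMIT */
        break;

    case XLOG_XACT_COMMIT_PREPARED:
        DecodeCommit(ctx, buf, &parsed, xid, true);     /* COMMIT PREPARED */
        break;

    case XLOG_XACT_ABORT:
        /* ... parse the abort record into 'parsed', derive 'xid' ... */
        DecodeAbort(ctx, buf, &parsed, xid, false);     /* plain abort */
        break;

    case XLOG_XACT_ABORT_PREPARED:
        DecodeAbort(ctx, buf, &parsed, xid, true);      /* ROLLBACK PREPARED */
        break;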
On Tue, Sep 29, 2020 at 5:08 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Sep 23, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > If the transaction is prepared, which you can ensure via
> > ReorderBufferTxnIsPrepared() (considering you have a proper check in
> > that function), it should not require skipping the transaction in
> > Abort. One way it could happen is if you clean up the ReorderBufferTXN
> > in Prepare, which you were doing in an earlier version of the patch and
> > which I pointed out was wrong. If you have changed that, then I don't
> > know why it could fail; maybe someplace else during prepare the patch
> > is freeing it. Just check that.
>
> I had a look at this problem. The problem happens when decoding is
> done after a prepare but before the corresponding ROLLBACK
> PREPARED/COMMIT PREPARED. For example:
>
> Begin;
> <change 1>
> <change 2>
> PREPARE TRANSACTION '<prepare#1>';
> SELECT data FROM pg_logical_slot_get_changes(...);
> :
> :
> ROLLBACK PREPARED '<prepare#1>';
> SELECT data FROM pg_logical_slot_get_changes(...);
>
> Since the prepare is consumed in the first call to
> pg_logical_slot_get_changes, when it is encountered again in the
> second call it is skipped (as already decoded) in DecodePrepare, and
> the txn->txn_flags are never set to reflect the fact that it was
> prepared. The same behaviour is seen when it is committed prepared
> after the original prepare was consumed.
>
> Initially I was thinking about the following approach to fix it in
> DecodePrepare.
>
> Approach 1:
> 1. Break the big skip check in DecodePrepare into two parts.
> Return if the following conditions are true:
>
> If (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
> ctx->fast_forward || FilterByOrigin(ctx, origin_id))
>
> 2. Check if this condition is true:
> SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr)
>
> If so, this means we are skipping because the transaction has already
> been decoded; then, instead of returning, call a new function
> ReorderBufferMarkPrepare() which will only update the flags in the txn
> to indicate that the transaction is prepared. Later, in DecodeAbort or
> DecodeCommit, we can confirm that the transaction has been prepared by
> checking that the flag is set, and call ReorderBufferFinishPrepared
> appropriately.
>
> But then, thinking about this some more, I thought of a second approach.
>
> Approach 2:
> If the only purpose of all this is to differentiate between Abort vs
> Rollback Prepared and Commit vs Commit Prepared, then we don't need
> this. We already know the exact operation in DecodeXactOp and can
> differentiate there. We only overloaded DecodeAbort and DecodeCommit
> for convenience; we can always call these functions with an extra flag
> to denote that we are either committing or aborting a previously
> prepared transaction, and call ReorderBufferFinishPrepared accordingly.
>

The second approach sounds better, but see how much you want to reuse from DecodeCommit/DecodeAbort: if there is not much, then you can even write new functions DecodeCommitPrepared/DecodeAbortPrepared. OTOH, if there is common code among them, then passing the flag would be the better way.

--
With Regards,
Amit Kapila.
On Mon, Sep 28, 2020 at 1:13 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Sep 23, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > No problem. I think you can handle the other comments and then we can
> > come back to this; you might want to share the exact details of the
> > test (maybe a narrowed-down version of the original test) and I or
> > someone else might be able to help you with that.
> >
> > --
> > With Regards,
> > Amit Kapila.
>
> I have added a new patch for supporting two-phase commit semantics in
> the streaming APIs for the logical decoding plugins. I have added 3 APIs:
> 1. stream_prepare
> 2. stream_commit_prepared
> 3. stream_abort_prepared
>
> I have also added the support for the new APIs in the test_decoding
> plugin. I have not yet added it to pgoutput.
>
> I have also added a fix for the error I saw while calling
> ReorderBufferCleanupTXN as part of the FinishPrepared handling. As a
> result I have removed the function I added earlier,
> ReorderBufferCleanupPreparedTXN.
> Please have a look at the new changes and let me know what you think.
>
> I will continue to look at:
>
> 1. Remove snapshots on prepare truncate.
> 2. A bug seen during abort of a prepared transaction: the prepared flag
> is lost, so we cannot make out that it was a previously prepared
> transaction.

I have started looking into your latest patches; as of now I have a few comments.

v6-0001

@@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  prev_lsn = change->lsn;

  /* Set the current xid to detect concurrent aborts. */
- if (streaming)
+ if (streaming || rbtxn_prepared(change->txn))
  {
  curtxn = change->txn;
  SetupCheckXidLive(curtxn->xid);
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
  break;
  }
  }
-

For a streaming transaction we need to check the xid every time because there could be a concurrent subtransaction abort, but for two-phase we don't need to call SetupCheckXidLive every time, because we are sure the transaction is going to be the same throughout the processing.

Apart from this, I have also noticed a couple of cosmetic issues.

+ {
+ xl_xact_parsed_prepare parsed;
+ xl_xact_prepare *xlrec;
+ /* check that output plugin is capable of twophase decoding */
+ if (!ctx->enable_twophase)
+ {
+ ReorderBufferProcessXid(reorder, XLogRecGetXid(r), buf->origptr);
+ break;
+ }

Add one blank line after the variable declarations.

- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *
+ * We remove it even for prepared transactions (GID is enough to
+ * commit/abort those later).
+ */
+ ReorderBufferCleanupTXN(rb, txn);

The comment is not aligned properly.

v6-0003

+LookupGXact(const char *gid)
+{
+ int i;
+
+ LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+ for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+ {
+ GlobalTransaction gxact = TwoPhaseState->prepXacts[i];

I think we should take an LW_SHARED lock here, no?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
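On the last point: LookupGXact only reads the shared two-phase state, so a shared lock should be sufficient. A minimal sketch against the patch's function, with the scan body elided:

    LWLockAcquire(TwoPhaseStateLock, LW_SHARED);    /* read-only scan */

    /* ... search TwoPhaseState->prepXacts[] for a valid gxact matching gid ... */

    LWLockRelease(TwoPhaseStateLock);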
On Tue, Sep 29, 2020 at 8:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have started looking into you latest patches, as of now I have a > few comments. > > v6-0001 > > @@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > prev_lsn = change->lsn; > > /* Set the current xid to detect concurrent aborts. */ > - if (streaming) > + if (streaming || rbtxn_prepared(change->txn)) > { > curtxn = change->txn; > SetupCheckXidLive(curtxn->xid); > @@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > break; > } > } > - > > For streaming transaction we need to check the xid everytime because > there could concurrent a subtransaction abort, but > for two-phase we don't need to call SetupCheckXidLive everytime, > because we are sure that transaction is going to be > the same throughout the processing. > While decoding transactions at 'prepare' time there could be multiple sub-transactions like in the case below. Won't that be impacted if we follow your suggestion here? postgres=# Begin; BEGIN postgres=*# insert into t1 values(1,'aaa'); INSERT 0 1 postgres=*# savepoint s1; SAVEPOINT postgres=*# insert into t1 values(2,'aaa'); INSERT 0 1 postgres=*# savepoint s2; SAVEPOINT postgres=*# insert into t1 values(3,'aaa'); INSERT 0 1 postgres=*# Prepare Transaction 'foo'; PREPARE TRANSACTION -- With Regards, Amit Kapila.
On Wed, Sep 30, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 29, 2020 at 8:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have started looking into your latest patches; as of now I have a
> > few comments.
> >
> > v6-0001
> >
> > @@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
> >   prev_lsn = change->lsn;
> >
> >   /* Set the current xid to detect concurrent aborts. */
> > - if (streaming)
> > + if (streaming || rbtxn_prepared(change->txn))
> >   {
> >   curtxn = change->txn;
> >   SetupCheckXidLive(curtxn->xid);
> > @@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
> >   break;
> >   }
> >   }
> > -
> >
> > For a streaming transaction we need to check the xid every time because
> > there could be a concurrent subtransaction abort, but for two-phase we
> > don't need to call SetupCheckXidLive every time, because we are sure
> > the transaction is going to be the same throughout the processing.
>
> While decoding transactions at 'prepare' time there could be multiple
> sub-transactions like in the case below. Won't that be impacted if we
> follow your suggestion here?
>
> postgres=# Begin;
> BEGIN
> postgres=*# insert into t1 values(1,'aaa');
> INSERT 0 1
> postgres=*# savepoint s1;
> SAVEPOINT
> postgres=*# insert into t1 values(2,'aaa');
> INSERT 0 1
> postgres=*# savepoint s2;
> SAVEPOINT
> postgres=*# insert into t1 values(3,'aaa');
> INSERT 0 1
> postgres=*# Prepare Transaction 'foo';
> PREPARE TRANSACTION

But once we prepare the transaction, we cannot roll back an individual subtransaction. We can only roll back the main transaction, so instead of setting each individual subxact as CheckXidLive, we can just set the main XID; there is no need to check on every command. Just set it once before we start processing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 30, 2020 at 2:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Sep 30, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Sep 29, 2020 at 8:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I have started looking into you latest patches, as of now I have a > > > few comments. > > > > > > v6-0001 > > > > > > @@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > > > ReorderBufferTXN *txn, > > > prev_lsn = change->lsn; > > > > > > /* Set the current xid to detect concurrent aborts. */ > > > - if (streaming) > > > + if (streaming || rbtxn_prepared(change->txn)) > > > { > > > curtxn = change->txn; > > > SetupCheckXidLive(curtxn->xid); > > > @@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > > > ReorderBufferTXN *txn, > > > break; > > > } > > > } > > > - > > > > > > For streaming transaction we need to check the xid everytime because > > > there could concurrent a subtransaction abort, but > > > for two-phase we don't need to call SetupCheckXidLive everytime, > > > because we are sure that transaction is going to be > > > the same throughout the processing. > > > > > > > While decoding transactions at 'prepare' time there could be multiple > > sub-transactions like in the case below. Won't that be impacted if we > > follow your suggestion here? > > > > postgres=# Begin; > > BEGIN > > postgres=*# insert into t1 values(1,'aaa'); > > INSERT 0 1 > > postgres=*# savepoint s1; > > SAVEPOINT > > postgres=*# insert into t1 values(2,'aaa'); > > INSERT 0 1 > > postgres=*# savepoint s2; > > SAVEPOINT > > postgres=*# insert into t1 values(3,'aaa'); > > INSERT 0 1 > > postgres=*# Prepare Transaction 'foo'; > > PREPARE TRANSACTION > > But once we prepare the transaction, we can not rollback individual > subtransaction. > Sure but Rollback can come before prepare like in the case below which will appear as concurrent abort (assume there is some DDL which changes the table before the Rollback statement) because it has already been done by the backend and that need to be caught by this mechanism only. Begin; insert into t1 values(1,'aaa'); savepoint s1; insert into t1 values(2,'aaa'); savepoint s2; insert into t1 values(3,'aaa'); Rollback to savepoint s2; insert into t1 values(4,'aaa'); Prepare Transaction 'foo'; -- With Regards, Amit Kapila.
On Wed, Sep 30, 2020 at 3:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Sep 30, 2020 at 2:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Sep 30, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Sep 29, 2020 at 8:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > I have started looking into you latest patches, as of now I have a > > > > few comments. > > > > > > > > v6-0001 > > > > > > > > @@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > > > > ReorderBufferTXN *txn, > > > > prev_lsn = change->lsn; > > > > > > > > /* Set the current xid to detect concurrent aborts. */ > > > > - if (streaming) > > > > + if (streaming || rbtxn_prepared(change->txn)) > > > > { > > > > curtxn = change->txn; > > > > SetupCheckXidLive(curtxn->xid); > > > > @@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > > > > ReorderBufferTXN *txn, > > > > break; > > > > } > > > > } > > > > - > > > > > > > > For streaming transaction we need to check the xid everytime because > > > > there could concurrent a subtransaction abort, but > > > > for two-phase we don't need to call SetupCheckXidLive everytime, > > > > because we are sure that transaction is going to be > > > > the same throughout the processing. > > > > > > > > > > While decoding transactions at 'prepare' time there could be multiple > > > sub-transactions like in the case below. Won't that be impacted if we > > > follow your suggestion here? > > > > > > postgres=# Begin; > > > BEGIN > > > postgres=*# insert into t1 values(1,'aaa'); > > > INSERT 0 1 > > > postgres=*# savepoint s1; > > > SAVEPOINT > > > postgres=*# insert into t1 values(2,'aaa'); > > > INSERT 0 1 > > > postgres=*# savepoint s2; > > > SAVEPOINT > > > postgres=*# insert into t1 values(3,'aaa'); > > > INSERT 0 1 > > > postgres=*# Prepare Transaction 'foo'; > > > PREPARE TRANSACTION > > > > But once we prepare the transaction, we can not rollback individual > > subtransaction. > > > > Sure but Rollback can come before prepare like in the case below which > will appear as concurrent abort (assume there is some DDL which > changes the table before the Rollback statement) because it has > already been done by the backend and that need to be caught by this > mechanism only. > > Begin; > insert into t1 values(1,'aaa'); > savepoint s1; > insert into t1 values(2,'aaa'); > savepoint s2; > insert into t1 values(3,'aaa'); > Rollback to savepoint s2; > insert into t1 values(4,'aaa'); > Prepare Transaction 'foo'; If we are streaming on the prepare that means we must have decoded that rollback WAL which means we should have removed the ReorderBufferTXN for those subxact. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Sep 30, 2020 at 3:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Sep 30, 2020 at 3:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Sep 30, 2020 at 2:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Wed, Sep 30, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Tue, Sep 29, 2020 at 8:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > I have started looking into you latest patches, as of now I have a > > > > > few comments. > > > > > > > > > > v6-0001 > > > > > > > > > > @@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > > > > > ReorderBufferTXN *txn, > > > > > prev_lsn = change->lsn; > > > > > > > > > > /* Set the current xid to detect concurrent aborts. */ > > > > > - if (streaming) > > > > > + if (streaming || rbtxn_prepared(change->txn)) > > > > > { > > > > > curtxn = change->txn; > > > > > SetupCheckXidLive(curtxn->xid); > > > > > @@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > > > > > ReorderBufferTXN *txn, > > > > > break; > > > > > } > > > > > } > > > > > - > > > > > > > > > > For streaming transaction we need to check the xid everytime because > > > > > there could concurrent a subtransaction abort, but > > > > > for two-phase we don't need to call SetupCheckXidLive everytime, > > > > > because we are sure that transaction is going to be > > > > > the same throughout the processing. > > > > > > > > > > > > > While decoding transactions at 'prepare' time there could be multiple > > > > sub-transactions like in the case below. Won't that be impacted if we > > > > follow your suggestion here? > > > > > > > > postgres=# Begin; > > > > BEGIN > > > > postgres=*# insert into t1 values(1,'aaa'); > > > > INSERT 0 1 > > > > postgres=*# savepoint s1; > > > > SAVEPOINT > > > > postgres=*# insert into t1 values(2,'aaa'); > > > > INSERT 0 1 > > > > postgres=*# savepoint s2; > > > > SAVEPOINT > > > > postgres=*# insert into t1 values(3,'aaa'); > > > > INSERT 0 1 > > > > postgres=*# Prepare Transaction 'foo'; > > > > PREPARE TRANSACTION > > > > > > But once we prepare the transaction, we can not rollback individual > > > subtransaction. > > > > > > > Sure but Rollback can come before prepare like in the case below which > > will appear as concurrent abort (assume there is some DDL which > > changes the table before the Rollback statement) because it has > > already been done by the backend and that need to be caught by this > > mechanism only. > > > > Begin; > > insert into t1 values(1,'aaa'); > > savepoint s1; > > insert into t1 values(2,'aaa'); > > savepoint s2; > > insert into t1 values(3,'aaa'); > > Rollback to savepoint s2; > > insert into t1 values(4,'aaa'); > > Prepare Transaction 'foo'; > > > If we are streaming on the prepare that means we must have decoded > that rollback WAL which means we should have removed the > ReorderBufferTXN for those subxact. > Okay, valid point. We can avoid setting it for each sub-transaction in that case but OTOH even if we allow to set it there shouldn't be any bug. -- With Regards, Amit Kapila.
On Wed, Sep 30, 2020 at 3:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Sep 30, 2020 at 3:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Sep 30, 2020 at 3:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Sep 30, 2020 at 2:46 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Wed, Sep 30, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Tue, Sep 29, 2020 at 8:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > > > > > I have started looking into you latest patches, as of now I have a > > > > > > few comments. > > > > > > > > > > > > v6-0001 > > > > > > > > > > > > @@ -1987,7 +2072,7 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > > > > > > ReorderBufferTXN *txn, > > > > > > prev_lsn = change->lsn; > > > > > > > > > > > > /* Set the current xid to detect concurrent aborts. */ > > > > > > - if (streaming) > > > > > > + if (streaming || rbtxn_prepared(change->txn)) > > > > > > { > > > > > > curtxn = change->txn; > > > > > > SetupCheckXidLive(curtxn->xid); > > > > > > @@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > > > > > > ReorderBufferTXN *txn, > > > > > > break; > > > > > > } > > > > > > } > > > > > > - > > > > > > > > > > > > For streaming transaction we need to check the xid everytime because > > > > > > there could concurrent a subtransaction abort, but > > > > > > for two-phase we don't need to call SetupCheckXidLive everytime, > > > > > > because we are sure that transaction is going to be > > > > > > the same throughout the processing. > > > > > > > > > > > > > > > > While decoding transactions at 'prepare' time there could be multiple > > > > > sub-transactions like in the case below. Won't that be impacted if we > > > > > follow your suggestion here? > > > > > > > > > > postgres=# Begin; > > > > > BEGIN > > > > > postgres=*# insert into t1 values(1,'aaa'); > > > > > INSERT 0 1 > > > > > postgres=*# savepoint s1; > > > > > SAVEPOINT > > > > > postgres=*# insert into t1 values(2,'aaa'); > > > > > INSERT 0 1 > > > > > postgres=*# savepoint s2; > > > > > SAVEPOINT > > > > > postgres=*# insert into t1 values(3,'aaa'); > > > > > INSERT 0 1 > > > > > postgres=*# Prepare Transaction 'foo'; > > > > > PREPARE TRANSACTION > > > > > > > > But once we prepare the transaction, we can not rollback individual > > > > subtransaction. > > > > > > > > > > Sure but Rollback can come before prepare like in the case below which > > > will appear as concurrent abort (assume there is some DDL which > > > changes the table before the Rollback statement) because it has > > > already been done by the backend and that need to be caught by this > > > mechanism only. > > > > > > Begin; > > > insert into t1 values(1,'aaa'); > > > savepoint s1; > > > insert into t1 values(2,'aaa'); > > > savepoint s2; > > > insert into t1 values(3,'aaa'); > > > Rollback to savepoint s2; > > > insert into t1 values(4,'aaa'); > > > Prepare Transaction 'foo'; > > > > > > If we are streaming on the prepare that means we must have decoded > > that rollback WAL which means we should have removed the > > ReorderBufferTXN for those subxact. > > > > Okay, valid point. We can avoid setting it for each sub-transaction in > that case but OTOH even if we allow to set it there shouldn't be any > bug. Right, there will not be any bug, just an optimization. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
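For illustration, the optimization settled on in this exchange could look roughly like the sketch below in ReorderBufferProcessTXN. rbtxn_prepared() is the patch's macro and SetupCheckXidLive() is the existing helper; the exact placement is an assumption:

    /*
     * A prepared transaction can only be rolled back as a whole, so for the
     * two-phase case it is enough to arm the concurrent-abort detection once,
     * with the top-level xid, before iterating over the changes.
     */
    if (rbtxn_prepared(txn))
        SetupCheckXidLive(txn->xid);

    ...

    /* Inside the per-change loop, only streaming still needs this: */
    if (streaming)
    {
        curtxn = change->txn;
        SetupCheckXidLive(curtxn->xid);
    }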
Hello Ajin. I have done some review of the v6 patches. I had some difficulty getting my review comments through to the OSS list as a reply, so I am putting them in an attachment here.

Kind Regards,
Peter Smith
Fujitsu Australia
Attachment
Hello Ajin.

I have gone through the v6 patch changes and have a list of review comments below. Apologies for the length of this email - I know that many of the following comments are trivial, but I figured I should either just ignore everything cosmetic, or list everything regardless. I chose the latter.

There may be some duplication where the same review comment is written for multiple files and/or where the same file appears in multiple of your patches.

Kind Regards.
Peter Smith
Fujitsu Australia

[BEGIN]

==========
Patch V6-0001, File: contrib/test_decoding/expected/prepared.out (so prepared.sql also)
==========

COMMENT
Line 30 - The INSERT INTO test_prepared1 VALUES (2); is kind of strange because it is not really part of the prior test nor the following test. Maybe it would be better to have a comment describing the purpose of this isolated INSERT, and to also consume the data from the slot so it does not get jumbled with the data of the following (abort) test.
;

COMMENT
Line 53 - Same comment for this test INSERT INTO test_prepared1 VALUES (4); It has nothing really to do with either the prior (abort) test or the following (ddl) test.
;

COMMENT
Line 60 - Seems to check which locks are held for the test_prepared_1 table while the transaction is in progress. Maybe it would be better to have more comments describing what is expected here and why.
;

COMMENT
Line 88 - There is a comment in the test saying "-- We should see '7' before '5' in our results since it commits first." but I did not see any test code that actually verifies that happens.
;

QUESTION
Line 120 - I did not really understand the SQL checking the pg_class. I expected this would be checking table 'test_prepared1' instead. Can you explain it?

SELECT 'pg_class' AS relation, locktype, mode
FROM pg_locks
WHERE locktype = 'relation'
AND relation = 'pg_class'::regclass;

 relation | locktype | mode
----------+----------+------
(0 rows)
;

QUESTION
Line 139 - SET statement_timeout = '1s'; - Is 1 second short enough here for this test, or might it be that these statements would be completed in less than one second anyhow?
;

QUESTION
Line 163 - How is this testing a SAVEPOINT? Or is it only to check that the SAVEPOINT command is not part of the replicated changes?
;

COMMENT
Line 175 - Missing underscore in the comment. The code also requires the underscore: "nodecode" --> "_nodecode"

==========
Patch V6-0001, File: contrib/test_decoding/test_decoding.c
==========

COMMENT
Line 43
@@ -36,6 +40,7 @@ typedef struct
 bool skip_empty_xacts;
 bool xact_wrote_changes;
 bool only_local;
+ TransactionId check_xid; /* track abort of this txid */
} TestDecodingData;

The "check_xid" seems a meaningless name. Check what?
IIUC maybe it should be something like "check_xid_aborted".
;

COMMENT
Line 105
@@ -88,6 +93,19 @@ static void pg_decode_stream_truncate(LogicalDecodingContext *ctx,
 ReorderBufferTXN *txn,
 int nrelations, Relation relations[],
 ReorderBufferChange *change);
+static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx,
+ ReorderBufferTXN *txn,

Remove the extra blank line after these functions.
;

COMMENT
Line 149
@@ -116,6 +134,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb)
 cb->stream_change_cb = pg_decode_stream_change;
 cb->stream_message_cb = pg_decode_stream_message;
 cb->stream_truncate_cb = pg_decode_stream_truncate;
+ cb->filter_prepare_cb = pg_decode_filter_prepare;
+ cb->prepare_cb = pg_decode_prepare_txn;
+ cb->commit_prepared_cb = pg_decode_commit_prepared_txn;
+ cb->abort_prepared_cb = pg_decode_abort_prepared_txn;
+
}

There is a confusing mix of terminology where sometimes things are referred to as ROLLBACK/rollback and other times apparently the same operation is referred to as ABORT/abort. I do not know the root cause of this mixture. IIUC maybe the internal functions and protocol generally use the term "abort", whereas the SQL syntax is "ROLLBACK"... but where those two terms collide in the middle it gets quite confusing.

At least I thought the names of the "callbacks" which get exposed to the user (e.g. in the help) might be better if they matched the SQL. "abort_prepared_cb" --> "rollback_prepared_cb"

There are similar review comments like this below where the alternating terms caused me some confusion.
~
Also, remove the extra blank line before the end of the function.
;

COMMENT
Line 267
@ -227,6 +252,42 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 errmsg("could not parse value \"%s\" for parameter \"%s\"",
 strVal(elem->arg), elem->defname)));
 }
+ else if (strcmp(elem->defname, "two-phase-commit") == 0)
+ {
+ if (elem->arg == NULL)
+ continue;

IMO the "check-xid" code might be better rearranged so the NULL check is first instead of if/else. e.g.

if (elem->arg == NULL)
 ereport(FATAL,
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 errmsg("check-xid needs an input value")));
~
Also, is it really supposed to be FATAL instead of ERROR? That is not the same as the other surrounding code.
;

COMMENT
Line 296
if (data->check_xid <= 0)
 ereport(ERROR,
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 errmsg("Specify positive value for parameter \"%s\", you specified \"%s\"",
 elem->defname, strVal(elem->arg))));

The code checking for <= 0 seems over-complicated. Because the conversion was done using strtoul(), I fail to see how this can ever be < 0. Wouldn't it be easier to simply test the result of the strtoul() function?
BEFORE: if (errno == EINVAL || errno == ERANGE)
AFTER: if (data->check_xid == 0)
~
Also, should this be FATAL? Everything else similar is ERROR.
;

COMMENT
(general) I don't recall seeing any of these decoding options (e.g. "two-phase-commit", "check-xid") documented anywhere. So how can a user even know these options exist so they can use them? Perhaps the options should be described on this page? https://www.postgresql.org/docs/13/functions-admin.html#FUNCTIONS-REPLICATION
;

COMMENT
(general) "check-xid" is a meaningless option name. Maybe something like "checked-xid-aborted" is more useful? Suggest changing the member, the option, and the error messages to match some better name.
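A sketch of the simpler validation suggested above. The member and option names follow the review's suggested renaming and are assumptions, not the patch's current ones:

    errno = 0;
    data->check_xid_aborted =
        (TransactionId) strtoul(strVal(elem->arg), NULL, 0);

    /* xid 0 is InvalidTransactionId, so one test covers bad input too. */
    if (errno != 0 || !TransactionIdIsValid(data->check_xid_aborted))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("could not parse value \"%s\" for parameter \"%s\"",
                        strVal(elem->arg), elem->defname)));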
;

COMMENT
Line 314
@@ -238,6 +299,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 }

 ctx->streaming &= enable_streaming;
+ ctx->enable_twophase &= enable_2pc;
}

The "ctx->enable_twophase" naming is inconsistent with the "ctx->streaming" member. "enable_twophase" --> "twophase"
;

COMMENT
Line 374
@@ -297,6 +359,94 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 OutputPluginWrite(ctx, true);
}
+
+/*
+ * Filter out two-phase transactions.
+ *
+ * Each plugin can implement its own filtering logic. Here
+ * we demonstrate a simple logic by checking the GID. If the
+ * GID contains the "_nodecode" substring, then we filter
+ * it out.
+ */
+static bool
+pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,

Remove the extra preceding blank line.
~
I did not find anything in the help about "_nodecode". Should it be there, or is this a deliberately undocumented feature?
;

QUESTION
Line 440
+pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,

Is this a wrong comment: "ABORT PREPARED" --> "ROLLBACK PREPARED"?
;

COMMENT
Line 620
@@ -455,6 +605,22 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 data->xact_wrote_changes = true;

+ /* if check_xid is specified */
+ if (TransactionIdIsValid(data->check_xid))
+ {
+ elog(LOG, "waiting for %u to abort", data->check_xid);
+ while (TransactionIdIsInProgress(dat

The check_xid seems a meaningless name, and the comment "/* if check_xid is specified */" was not helpful either. IIUC the purpose of this is to check that the nominated xid is always rolled back. So the appropriate name may be more like "check-xid-aborted".
;

==========
Patch V6-0001, File: doc/src/sgml/logicaldecoding.sgml
==========

COMMENT/QUESTION
Section 48.6.1
@ -387,6 +387,10 @@ typedef struct OutputPluginCallbacks
 LogicalDecodeTruncateCB truncate_cb;
 LogicalDecodeCommitCB commit_cb;
 LogicalDecodeMessageCB message_cb;
+ LogicalDecodeFilterPrepareCB filter_prepare_cb;

Confused by the mixing of the terminologies "abort" and "rollback". Why is it LogicalDecodeAbortPreparedCB instead of LogicalDecodeRollbackPreparedCB? Why is it abort_prepared_cb instead of rollback_prepared_cb? I thought everything the user sees should be ROLLBACK/rollback (like the SQL) regardless of what the internal functions might be called.
;

COMMENT
Section 48.6.1
The begin_cb, change_cb and commit_cb callbacks are required, while startup_cb, filter_by_origin_cb, truncate_cb, and shutdown_cb are optional. If truncate_cb is not set but a TRUNCATE is to be decoded, the action will be ignored.

The 1st paragraph beneath the typedef does not mention the newly added callbacks to say whether they are required or optional.
;

COMMENT
Section 48.6.4.5
Section 48.6.4.6
Section 48.6.4.7
@@ -578,6 +588,55 @@ typedef void (*LogicalDecodeCommitCB) (struct LogicalDecodingContext *ctx,
 </para>
 </sect3>
+ <sect3 id="logicaldecoding-output-plugin-prepare">
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+ <sect3 id="logicaldecoding-output-plugin-abort-prepared">
+<programlisting>

The wording and titles are a bit backwards compared to the others. e.g. previously it was "Transaction Begin" (not "Begin Transaction") and "Transaction End" (not "End Transaction").
So, to follow the existing convention consistently, IMO these new titles (and wording) should change to:
- "Commit Prepared Transaction Callback" --> "Transaction Commit Prepared Callback"
- "Rollback Prepared Transaction Callback" --> "Transaction Rollback Prepared Callback"
- "whenever a commit prepared transaction has been decoded" --> "whenever a transaction commit prepared has been decoded"
- "whenever a rollback prepared transaction has been decoded." --> "whenever a transaction rollback prepared has been decoded."
;

==========
Patch V6-0001, File: src/backend/replication/logical/decode.c
==========

COMMENT
Line 74
@@ -70,6 +70,9 @@ static void DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 xl_xact_parsed_commit *parsed, TransactionId xid);
 static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 xl_xact_parsed_abort *parsed, TransactionId xid);
+static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+ xl_xact_parsed_prepare * parsed);

The 2nd line of DecodePrepare is misaligned by one space.
;

COMMENT
Line 321
@@ -312,17 +315,34 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 }
 break;
 case XLOG_XACT_PREPARE:
+ {
+ xl_xact_parsed_prepare parsed;
+ xl_xact_prepare *xlrec;
+ /* check that output plugin is capable of twophase decoding */

"twophase" --> "two-phase"
~
Also, add a blank line after the declarations.
;

==========
Patch V6-0001, File: src/backend/replication/logical/logical.c
==========

COMMENT
Line 249
@@ -225,6 +237,19 @@ StartupDecodingContext(List *output_plugin_options,
 (ctx->callbacks.stream_message_cb != NULL) ||
 (ctx->callbacks.stream_truncate_cb != NULL);
+ /*
+ * To support two phase logical decoding, we require prepare/commit-prepare/abort-prepare
+ * callbacks. The filter-prepare callback is optional. We however enable two phase logical
+ * decoding when at least one of the methods is enabled so that we can easily identify
+ * missing methods.

The terminology is generally well known as "two-phase" (with the hyphen), see https://en.wikipedia.org/wiki/Two-phase_commit_protocol, so let's be consistent for all the patch code comments. Please search the code and correct this in all places, even where I might have missed identifying it. "two phase" --> "two-phase"
;

COMMENT
Line 822
@@ -782,6 +807,111 @@ commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 }

static void
+prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn)

"support 2 phase" --> "supports two-phase" in the comment.
;

COMMENT
Line 844
The code condition seems strange and/or broken:
if (ctx->enable_twophase && ctx->callbacks.prepare_cb == NULL)
Because if the flag is false then this condition is skipped. But then, if the callback was also NULL, attempting to call it to "do the actual work" will give an NPE.
~
Also, I wonder whether this check should be the first thing in this function? Because if it fails, does it even make sense that all the errcallback code was set up? E.g. errcallback.arg is potentially left pointing to a stack variable on a stack that no longer exists.
;

COMMENT
Line 857
+commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

"support 2 phase" --> "supports two-phase" in the comment.
~
Also, same potential trouble with the condition:
if (ctx->enable_twophase && ctx->callbacks.commit_prepared_cb == NULL)
Same as previously asked. Should this check be the first thing in this function?
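One possible resolution of the wrapper question, sketched here as an assumption rather than the patch's actual code: test the callback pointer itself, before the errcallback setup, so that a missing callback can never be invoked.

    static void
    prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
                       XLogRecPtr prepare_lsn)
    {
        LogicalDecodingContext *ctx = cache->private_data;

        /* Fail fast, before any error-context state points at this frame. */
        if (ctx->callbacks.prepare_cb == NULL)
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("output plugin did not register a prepare_cb callback")));

        /* ... set up errcallback and invoke ctx->callbacks.prepare_cb ... */
    }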
;

COMMENT
Line 892
+abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,

"support 2 phase" --> "supports two-phase" in the comment.
~
Same potential trouble with the condition:
if (ctx->enable_twophase && ctx->callbacks.abort_prepared_cb == NULL)
Same as previously asked. Should this check be the first thing in this function?
;

COMMENT
Line 1013
@@ -858,6 +988,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
 error_context_stack = errcallback.previous;
}

+static bool
+filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ TransactionId xid, const char *gid)

Fix the wording in the comment:
"twophase" --> "two-phase transactions"
"twophase transactions" --> "two-phase transactions"

==========
Patch V6-0001, File: src/backend/replication/logical/reorderbuffer.c
==========

COMMENT
Line 255
@@ -251,7 +251,8 @@ static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
-static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
+static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
+ bool txn_prepared);

The alignment is inconsistent. One more space is needed before "bool txn_prepared".
;

COMMENT
Line 417
@@ -413,6 +414,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }

 /* free data that's contained */
+ if (txn->gid != NULL)
+ {
+ pfree(txn->gid);
+ txn->gid = NULL;
+ }

Should add the blank line before this new code, as it was before.
;

COMMENT
Line 1564
@ -1502,12 +1561,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 }

 /*
- * Discard changes from a transaction (and subtransactions), after streaming
- * them. Keep the remaining info - transactions, tuplecids, invalidations and
- * snapshots.
+ * Discard changes from a transaction (and subtransactions), either after streaming or
+ * after a PREPARE.

typo "snapshots.If" --> "snapshots. If"
;

COMMENT/QUESTION
Line 1590
@@ -1526,7 +1587,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 Assert(rbtxn_is_known_subxact(subtxn));
 Assert(subtxn->nsubtxns == 0);
- ReorderBufferTruncateTXN(rb, subtxn);
+ ReorderBufferTruncateTXN(rb, subtxn, txn_prepared);
 }

There are some code paths here where I did not understand how they match the comments. Because this function is recursive, it seems that it may be called where the 2nd parameter txn is a sub-transaction. But then this seems at odds with some of the other code comments of this function, which process the txn without ever testing whether it really is top-level or not:
e.g. Line 1593 "/* cleanup changes in the toplevel txn */"
e.g. Line 1632 "They are always stored in the toplevel transaction."
;

COMMENT
Line 1644
@@ -1560,9 +1621,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 * about the toplevel xact (we send the XID in all messages), but we never
 * stream XIDs of empty subxacts.
 */
- if ((!txn->toptxn) || (txn->nentries_mem != 0))
+ if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0)))
 txn->txn_flags |= RBTXN_IS_STREAMED;

+ if (txn_prepared) /* remove the change from it's containing list */

typo "it's" --> "its"
;

QUESTION
Line 1977
@@ -1880,7 +1965,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 ReorderBufferChange *specinsert)
 {
 /* Discard the changes that we just streamed */
- ReorderBufferTruncateTXN(rb, txn);
+ ReorderBufferTruncateTXN(rb, txn, false);

How do you know the 3rd parameter - i.e. txn_prepared - should be hardwired false here? e.g. I thought that maybe rbtxn_prepared(txn) can be true here.
;

COMMENT
Line 2345
@@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 break;
 }
 }
-
 /*

Looks like an accidental blank line deletion. This should be put back the way it was.
;

COMMENT/QUESTION
Line 2374
@@ -2278,7 +2362,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 }
 }
 else
- rb->commit(rb, txn, commit_lsn);
+ {
+ /*
+ * Call either PREPARE (for twophase transactions) or COMMIT
+ * (for regular ones).

"twophase" --> "two-phase"
~
Also, I was confused by the apparent assumption of exclusiveness of streaming and 2PC... e.g. if streaming AND 2PC, then it won't do rb->prepare().
;

QUESTION
Line 2424
@@ -2319,11 +2412,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 */
 if (streaming)
 {
- ReorderBufferTruncateTXN(rb, txn);
+ ReorderBufferTruncateTXN(rb, txn, false);

 /* Reset the CheckXidAlive */
 CheckXidAlive = InvalidTransactionId;
 }
+ else if (rbtxn_prepared(txn))

I was confused by the exclusiveness of streaming/2PC. e.g. What if streaming AND 2PC at the same time - how can you pass false as the 3rd param to ReorderBufferTruncateTXN?
;

COMMENT
Line 2463
@@ -2352,17 +2451,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 /*
 * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent
- * abort of the (sub)transaction we are streaming. We need to do the
+ * abort of the (sub)transaction we are streaming or preparing. We need to do the
 * cleanup and return gracefully on this error, see SetupCheckXidLive.
 */

"twoi phase" --> "two-phase"
;

QUESTIONS
Line 2482
@@ -2370,10 +2470,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 errdata = NULL;
 curtxn->concurrent_abort = true;

- /* Reset the TXN so that it is allowed to stream remaining data. */
- ReorderBufferResetTXN(rb, txn, snapshot_now,
- command_id, prev_lsn,
- specinsert);
+ /* If streaming, reset the TXN so that it is allowed to stream remaining data. */
+ if (streaming)

Re: /* If streaming, reset the TXN so that it is allowed to stream remaining data. */
I was confused by the exclusiveness of streaming/2PC. Is it not possible for the streaming flag and rbtxn_prepared(txn) to be true at the same time?
~
elog(LOG, "stopping decoding of %s (%u)",
txn->gid[0] != '\0'? txn->gid:"", txn->xid);
Is this a safe operation, or do you also need to test that txn->gid is not NULL?
;

COMMENT
Line 2606
+ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,

"twophase" --> "two-phase"
;

QUESTION
Line 2655
+ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,

"This is used to handle COMMIT/ABORT PREPARED" - should that say "COMMIT/ROLLBACK PREPARED"?
;

COMMENT
Line 2668
"Anyways, 2PC transactions" --> "Anyway, two-phase transactions"
;

COMMENT
Line 2765
@@ -2495,7 +2731,13 @@ ReorderBufferAbort(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)

 /* cosmetic...
 */
 txn->final_lsn = lsn;

- /* remove potential on-disk data, and deallocate */
+ /*
+ * remove potential on-disk data, and deallocate.
+ *

Remove the blank line between the comment and the code.

==========
Patch V6-0001, File: src/include/replication/logical.h
==========

COMMENT
Line 89
"two phase" --> "two-phase"
;

COMMENT
Line 89
For consistency with the previous member naming, the new member really should just be called "twophase" rather than "enable_twophase".
;

==========
Patch V6-0001, File: src/include/replication/output_plugin.h
==========

QUESTION
Line 106
As previously asked, why is the callback function/typedef referred to as AbortPrepared instead of RollbackPrepared? It does not match the SQL and the function comment, and seems only to add some unnecessary confusion.
;

==========
Patch V6-0001, File: src/include/replication/reorderbuffer.h
==========

QUESTION
Line 116
@@ -162,9 +163,13 @@ typedef struct ReorderBufferChange
 #define RBTXN_HAS_CATALOG_CHANGES 0x0001
 #define RBTXN_IS_SUBXACT 0x0002
 #define RBTXN_IS_SERIALIZED 0x0004
-#define RBTXN_IS_STREAMED 0x0008
-#define RBTXN_HAS_TOAST_INSERT 0x0010
-#define RBTXN_HAS_SPEC_INSERT 0x0020
+#define RBTXN_PREPARE 0x0008
+#define RBTXN_COMMIT_PREPARED 0x0010
+#define RBTXN_ROLLBACK_PREPARED 0x0020
+#define RBTXN_COMMIT 0x0040
+#define RBTXN_IS_STREAMED 0x0080
+#define RBTXN_HAS_TOAST_INSERT 0x0100
+#define RBTXN_HAS_SPEC_INSERT 0x0200

I was wondering why, when adding new flags, some of the existing flag masks were also altered. I am assuming this is OK because they are never persisted but are only used in the protocol (??)
;

COMMENT
Line 226
@@ -218,6 +223,15 @@ typedef struct ReorderBufferChange
 ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )

+/* is this txn prepared? */
+#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE)
+/* was this prepared txn committed in the meanwhile? */
+#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED)
+/* was this prepared txn aborted in the meanwhile? */
+#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED)
+/* was this txn committed in the meanwhile? */
+#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT)
+

Probably all the "txn->txn_flags" here might be more safely written with parentheses in the macro, like "(txn)->txn_flags".
~
Also, start all the comments with a capital letter. And what does "in the meanwhile" mean?
;

COMMENT
Line 410
@@ -390,6 +407,39 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb,
 ReorderBufferTXN *txn,
 XLogRecPtr commit_lsn);

The format is inconsistent with all the other callback signatures here, where the 1st arg was on the same line as the typedef.
;

COMMENT
Line 440-442
Excessive blank lines following this change?
;

COMMENT
Line 638
@@ -571,6 +631,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);
+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);

Not aligned consistently with the other function prototypes.
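For completeness, the parenthesized form suggested above, following the style of the existing rbtxn_* macros (the flag name is the patch's; the other three macros would follow the same pattern):

    /* Is this txn prepared? */
    #define rbtxn_prepared(txn) \
    ( \
        ((txn)->txn_flags & RBTXN_PREPARE) != 0 \
    )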
; ========== Patch V6-0003, File: src/backend/access/transam/twophase.c ========== COMMENT Line 551 @@ -548,6 +548,37 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held) } /* + * LookupGXact + * Check if the prepared transaction with the given GID is around + */ +bool +LookupGXact(const char *gid) There is potential to refactor/simplify this code: e.g. bool LookupGXact(const char *gid) { int i; bool found = false; LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE); for (i = 0; i < TwoPhaseState->numPrepXacts; i++) { GlobalTransaction gxact = TwoPhaseState->prepXacts[i]; /* Ignore not-yet-valid GIDs */ if (gxact->valid && strcmp(gxact->gid, gid) == 0) { found = true; break; } } LWLockRelease(TwoPhaseStateLock); return found; } ; ========== Patch V6-0003, File: src/backend/replication/logical/proto.c ========== COMMENT Line 86 @@ -72,12 +72,17 @@ logicalrep_read_begin(StringInfo in, LogicalRepBeginData *begin_data) */ void logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn, - XLogRecPtr commit_lsn) Now that the flags are used, the code comment "/* send the flags field (unused for now) */" is wrong. ; COMMENT Line 129 @ -106,6 +115,77 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data) } /* + * Write PREPARE to the output stream. + */ +void +logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn, "2PC transactions" --> "two-phase commit transactions" ; COMMENT Line 133 Assert(strlen(txn->gid) > 0); Shouldn't that assertion also check txn->gid is not NULL (to prevent an NPE in case gid was NULL)? ; COMMENT Line 177 +logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data) prepare_data->prepare_type = flags; This code may be OK but it does seem a bit of an abuse of the flags. e.g. Are they flags or are they really enum values? e.g. And if they are effectively enums (it appears they are) then it seemed inconsistent that |= was used when they were previously assigned. ; ========== Patch V6-0003, File: src/backend/replication/logical/worker.c ========== COMMENT Line 757 @@ -749,6 +753,141 @@ apply_handle_commit(StringInfo s) pgstat_report_activity(STATE_IDLE, NULL); } +static void +apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data) +{ + Assert(prepare_data->prepare_lsn == remote_final_lsn); Missing function comment to say this is called from apply_handle_prepare. ; COMMENT Line 798 +apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data) Missing function comment to say this is called from apply_handle_prepare. ; COMMENT Line 824 +apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data) Missing function comment to say this is called from apply_handle_prepare. ========== Patch V6-0003, File: src/backend/replication/pgoutput/pgoutput.c ========== COMMENT Line 50 @@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferChange *change); static bool pgoutput_origin_filter(LogicalDecodingContext *ctx, RepOriginId origin_id); +static void pgoutput_prepare_txn(LogicalDecodingContext *ctx, + ReorderBufferTXN *txn, XLogRecPtr prepare_lsn); The parameter indentation (2nd lines) does not match everything else in this context.
; COMMENT Line 152 @@ -143,6 +149,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb) cb->change_cb = pgoutput_change; cb->truncate_cb = pgoutput_truncate; cb->commit_cb = pgoutput_commit_txn; + + cb->prepare_cb = pgoutput_prepare_txn; + cb->commit_prepared_cb = pgoutput_commit_prepared_txn; + cb->abort_prepared_cb = pgoutput_abort_prepared_txn; Remove the unnecessary blank line. ; QUESTION Line 386 @@ -373,7 +383,49 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, OutputPluginUpdateProgress(ctx); OutputPluginPrepareWrite(ctx, true); - logicalrep_write_commit(ctx->out, txn, commit_lsn); + logicalrep_write_commit(ctx->out, txn, commit_lsn, true); Is the is_commit parameter of logicalrep_write_commit ever passed as false? If yes, where? If no, then what is the point of it? ; COMMENT Line 408 +pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, Since this function is identical to pgoutput_prepare_txn, it might be better to either 1. just leave this as a wrapper to delegate to that function 2. remove this one entirely and assign the callback to the common pgoutput_prepare_txn ; COMMENT Line 419 +pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, Since this function is identical to pgoutput_prepare_txn, it might be better to either 1. just leave this as a wrapper to delegate to that function 2. remove this one entirely and assign the callback to the common pgoutput_prepare_txn ; COMMENT Line 419 +pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, Shouldn't this comment say "ROLLBACK PREPARED"? ; ========== Patch V6-0003, File: src/include/replication/logicalproto.h ========== QUESTION Line 101 @@ -87,20 +87,55 @@ typedef struct LogicalRepBeginData TransactionId xid; } LogicalRepBeginData; +/* Commit (and abort) information */ #define LOGICALREP_IS_ABORT 0x02 Is there a good reason why this is not called: #define LOGICALREP_IS_ROLLBACK 0x02 ; COMMENT Line 105 ((flags == LOGICALREP_IS_COMMIT) || (flags == LOGICALREP_IS_ABORT)) Macros would be safer if flags are in parentheses (((flags) == LOGICALREP_IS_COMMIT) || ((flags) == LOGICALREP_IS_ABORT)) ; COMMENT Line 115 Unexpected whitespace for the typedef "} LogicalRepPrepareData;" ; COMMENT Line 122 /* prepare can be exactly one of PREPARE, [COMMIT|ABORT] PREPARED*/ #define PrepareFlagsAreValid(flags) \ ((flags == LOGICALREP_IS_PREPARE) || \ (flags == LOGICALREP_IS_COMMIT_PREPARED) || \ (flags == LOGICALREP_IS_ROLLBACK_PREPARED)) There is a confusing mixture of ABORT and ROLLBACK terms in the macros and comments "[COMMIT|ABORT] PREPARED" --> "[COMMIT|ROLLBACK] PREPARED" ~ Also, it would be safer if flags are in parentheses (((flags) == LOGICALREP_IS_PREPARE) || \ ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \ ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED)) ; ========== Patch V6-0003, File: src/test/subscription/t/020_twophase.pl ========== COMMENT Line 131 - # check inserts are visible Isn't this supposed to be checking for rows 12 and 13, instead of 11 and 12? ; ========== Patch V6-0004, File: contrib/test_decoding/test_decoding.c ========== COMMENT Line 81 @@ -78,6 +78,15 @@ static void pg_decode_stream_stop(LogicalDecodingContext *ctx, static void pg_decode_stream_abort(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, XLogRecPtr abort_lsn); +static void pg_decode_stream_prepare(LogicalDecodingContext *ctx, + ReorderBufferTXN *txn, + XLogRecPtr commit_lsn); +static All these functions have a 3rd parameter called commit_lsn,
even though the functions are not commit-related; it seems like a cut/paste error. ; COMMENT Line 142 @@ -130,6 +139,9 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb) cb->stream_start_cb = pg_decode_stream_start; cb->stream_stop_cb = pg_decode_stream_stop; cb->stream_abort_cb = pg_decode_stream_abort; + cb->stream_prepare_cb = pg_decode_stream_prepare; + cb->stream_commit_prepared_cb = pg_decode_stream_commit_prepared; + cb->stream_abort_prepared_cb = pg_decode_stream_abort_prepared; cb->stream_commit_cb = pg_decode_stream_commit; Can the "cb->stream_abort_prepared_cb" be changed to "cb->stream_rollback_prepared_cb"? ; COMMENT Line 827 @@ -812,6 +824,78 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx, } static void +pg_decode_stream_prepare(LogicalDecodingContext *ctx, + ReorderBufferTXN *txn, + XLogRecPtr commit_lsn) +{ + TestDecodingData *data = ctx->output_plugin_pr The commit_lsn (3rd parameter) is unused and seems like a cut/paste name error. ; COMMENT Line 875 +pg_decode_stream_abort_prepared(LogicalDecodingContext *ctx, The commit_lsn (3rd parameter) is unused and seems like a cut/paste name error. ; ========== Patch V6-0004, File: doc/src/sgml/logicaldecoding.sgml ========== COMMENT 48.6.1 @@ -396,6 +396,9 @@ typedef struct OutputPluginCallbacks LogicalDecodeStreamStartCB stream_start_cb; LogicalDecodeStreamStopCB stream_stop_cb; LogicalDecodeStreamAbortCB stream_abort_cb; + LogicalDecodeStreamPrepareCB stream_prepare_cb; + LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb; + LogicalDecodeStreamAbortPreparedCB stream_abort_prepared_cb; Same question as in previous review comments - why use the terminology "abort" instead of "rollback"? ; COMMENT 48.6.1 @@ -418,7 +421,9 @@ typedef void (*LogicalOutputPluginInit) (struct OutputPluginCallbacks *cb); in-progress transactions. The <function>stream_start_cb</function>, <function>stream_stop_cb</function>, <function>stream_abort_cb</function>, <function>stream_commit_cb</function> and <function>stream_change_cb</function> - are required, while <function>stream_message_cb</function> and + are required, while <function>stream_message_cb</function>, + <function>stream_prepare_cb</function>, <function>stream_commit_prepared_cb</function>, + <function>stream_abort_prepared_cb</function>, Missing "and". ... "stream_abort_prepared_cb, stream_truncate_cb are optional." --> "stream_abort_prepared_cb, and stream_truncate_cb are optional." ; COMMENT Section 48.6.4.16 Section 48.6.4.17 Section 48.6.4.18 @@ -839,6 +844,45 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx, </para> </sect3> + <sect3 id="logicaldecoding-output-plugin-stream-prepare"> + <title>Stream Prepare Callback</title> + <para> + The <function>stream_prepare_cb</function> callback is called to prepare + a previously streamed transaction as part of a two phase commit. +<programlisting> +typedef void (*LogicalDecodeStreamPrepareCB) (struct LogicalDecodingContext *ctx, + ReorderBufferTXN *txn, + XLogRecPtr abort_lsn); +</programlisting> + </para> + </sect3> + + <sect3 id="logicaldecoding-output-plugin-stream-commit-prepared"> + <title>Stream Commit Prepared Callback</title> + <para> + The <function>stream_commit_prepared_cb</function> callback is called to commit prepared + a previously streamed transaction as part of a two phase commit.
+<programlisting> +typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct LogicalDecodingContext *ctx, + ReorderBufferTXN *txn, + XLogRecPtr abort_lsn); +</programlisting> + </para> + </sect3> + + <sect3 id="logicaldecoding-output-plugin-stream-abort-prepared"> + <title>Stream Abort Prepared Callback</title> + <para> + The <function>stream_abort_prepared_cb</function> callback is called to abort prepared + a previously streamed transaction as part of a two phase commit. +<programlisting> +typedef void (*LogicalDecodeStreamAbortPreparedCB) (struct LogicalDecodingContext *ctx, + ReorderBufferTXN *txn, + XLogRecPtr abort_lsn); +</programlisting> + </para> + </sect3> 1. Everywhere it says "two phase" commit it should be consistently replaced to say "two-phase" commit (with the hyphen) 2. Search for the "abort_lsn" parameter. It seems to be overused (cut/paste error) even when the API is unrelated to abort. 3. 48.6.4.17 and 48.6.4.18 Is this wording OK? Is the word "prepared" even necessary here? - "... called to commit prepared a previously streamed transaction ..." - "... called to abort prepared a previously streamed transaction ..." ; COMMENT Section 48.9 @@ -1017,9 +1061,13 @@ OutputPluginWrite(ctx, true); When streaming an in-progress transaction, the changes (and messages) are streamed in blocks demarcated by <function>stream_start_cb</function> and <function>stream_stop_cb</function> callbacks. Once all the decoded - changes are transmitted, the transaction is committed using the - <function>stream_commit_cb</function> callback (or possibly aborted using - the <function>stream_abort_cb</function> callback). + changes are transmitted, the transaction can be committed using the + the <function>stream_commit_cb</function> callback "two phase" --> "two-phase" ~ Also, missing period at the end of the sentence. "or aborted using the stream_abort_prepared_cb" --> "or aborted using the stream_abort_prepared_cb." ; ========== Patch V6-0004, File: src/backend/replication/logical/logical.c ========== COMMENT Line 84 @@ -81,6 +81,12 @@ static void stream_stop_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, XLogRecPtr last_lsn); static void stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, XLogRecPtr abort_lsn); +static void stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, + XLogRecPtr commit_lsn); +static void stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, + XLogRecPtr commit_lsn); +static void stream_abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, + XLogRecPtr commit_lsn); The 3rd parameter is always "commit_lsn" even for APIs unrelated to commit, so it seems like a cut/paste error. ; COMMENT Line 1246 @@ -1231,6 +1243,105 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, } static void +stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, + XLogRecPtr commit_lsn) +{ + LogicalDecodingContext *ctx = cache->private_data; + LogicalErrorCallbackState state; Misnamed parameter "commit_lsn"? ~ Also, Line 1272 There seems to be some integrity checking missing to make sure the callback is not NULL. A NULL callback will give an NPE when the wrapper attempts to call it. ; COMMENT Line 1305 +static void +stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, There seems to be some integrity checking missing to make sure the callback is not NULL. A NULL callback will give an NPE when the wrapper attempts to call it.
; COMMENT Line 1312 +static void +stream_abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, Misnamed parameter "commit_lsn"? ~ Also, Line 1338 There seems to be some integrity checking missing to make sure the callback is not NULL. A NULL callback will give an NPE when the wrapper attempts to call it. ========== Patch V6-0004, File: src/backend/replication/logical/reorderbuffer.c ========== COMMENT Line 2684 @@ -2672,15 +2681,31 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid, txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */ strcpy(txn->gid, gid); - if (is_commit) + if (rbtxn_is_streamed(txn)) { - txn->txn_flags |= RBTXN_COMMIT_PREPARED; - rb->commit_prepared(rb, txn, commit_lsn); + if (is_commit) + { + txn->txn_flags |= RBTXN_COMMIT_PREPARED; The setting/checking of the flags could be refactored if you wanted to write less code: e.g. if (is_commit) txn->txn_flags |= RBTXN_COMMIT_PREPARED; else txn->txn_flags |= RBTXN_ROLLBACK_PREPARED; if (rbtxn_is_streamed(txn) && rbtxn_commit_prepared(txn)) rb->stream_commit_prepared(rb, txn, commit_lsn); else if (rbtxn_is_streamed(txn) && rbtxn_rollback_prepared(txn)) rb->stream_abort_prepared(rb, txn, commit_lsn); else if (rbtxn_commit_prepared(txn)) rb->commit_prepared(rb, txn, commit_lsn); else if (rbtxn_rollback_prepared(txn)) rb->abort_prepared(rb, txn, commit_lsn); ; ========== Patch V6-0004, File: src/include/replication/output_plugin.h ========== COMMENT Line 171 @@ -157,6 +157,33 @@ typedef void (*LogicalDecodeStreamAbortCB) (struct LogicalDecodingContext *ctx, XLogRecPtr abort_lsn); /* + * Called to prepare changes streamed to remote node from in-progress + * transaction. This is called as part of a two-phase commit and only when + * two-phased commits are supported + */ 1. Missing period in all these comments. 2. Is the part that says "and only where two-phased commits are supported" necessary to say? It seems redundant since the comment already says it is called as part of a two-phase commit. ; ========== Patch V6-0004, File: src/include/replication/reorderbuffer.h ========== COMMENT Line 467 @@ -466,6 +466,24 @@ typedef void (*ReorderBufferStreamAbortCB) ( ReorderBufferTXN *txn, XLogRecPtr abort_lsn); +/* prepare streamed transaction callback signature */ +typedef void (*ReorderBufferStreamPrepareCB) ( + ReorderBuffer *rb, + ReorderBufferTXN *txn, + XLogRecPtr commit_lsn); + +/* prepare streamed transaction callback signature */ +typedef void (*ReorderBufferStreamCommitPreparedCB) ( + ReorderBuffer *rb, + ReorderBufferTXN *txn, + XLogRecPtr commit_lsn); + +/* prepare streamed transaction callback signature */ +typedef void (*ReorderBufferStreamAbortPreparedCB) ( + ReorderBuffer *rb, + ReorderBufferTXN *txn, + XLogRecPtr commit_lsn); Cut/paste error - repeated the same comment 3 times? [END]
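To make the macro-related comments in the review concrete, here is a minimal sketch of how the new reorderbuffer.h test macros could look with the suggested parenthesization applied (the flag and macro names are taken from the quoted diff; this is illustrative only, not the committed form):

/* Is this txn prepared? */
#define rbtxn_prepared(txn) (((txn)->txn_flags & RBTXN_PREPARE) != 0)

/* Was this prepared txn committed in the meanwhile? */
#define rbtxn_commit_prepared(txn) (((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0)

/* Was this prepared txn rolled back in the meanwhile? */
#define rbtxn_rollback_prepared(txn) (((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0)

/* Was this txn committed in the meanwhile? */
#define rbtxn_commit(txn) (((txn)->txn_flags & RBTXN_COMMIT) != 0)

Parenthesizing the macro argument keeps the expansion safe if a caller ever passes something other than a plain identifier, and the "!= 0" form matches the style of the existing rbtxn_is_streamed macro shown in the quoted header.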
On Tue, Oct 6, 2020 at 10:23 AM Peter.B.Smith@fujitsu.com <Peter.B.Smith@fujitsu.com> wrote: > > > [BEGIN] > > ========== > Patch V6-0001, File: contrib/test_decoding/expected/prepared.out (so > prepared.sql also) > ========== > > COMMENT > Line 30 - The INSERT INTO test_prepared1 VALUES (2); is kind of > strange because it is not really part of the prior test nor the > following test. Maybe it would be better to have a comment describing > the purpose of this isolated INSERT and to also consume the data from > the slot so it does not get jumbled with the data of the following > (abort) test. > > ; > > COMMENT > Line 53 - Same comment for this test INSERT INTO test_prepared1 VALUES > (4); It kind of has nothing really to do with either the prior (abort) > test nor the following (ddl) test. > > ; > > COMMENT > Line 60 - Seems to check which locks are held for the test_prepared_1 > table while the transaction is in progress. Maybe it would be better > to have more comments describing what is expected here and why. > > ; > > COMMENT > Line 88 - There is a comment in the test saying "-- We should see '7' > before '5' in our results since it commits first." but I did not see > any test code that actually verifies that happens. > > ; All the above comments are genuine and I think it is mostly because the author has blindly modified the existing tests without completely understanding the intent of the test. I suggest we write a completely new regression file (decode_prepared.sql) for these and just copy whatever is required from prepared.sql. Once we do that we might also want to rename existing prepared.sql to decode_commit_prepared.sql or something like that. I think modifying the existing test appears to be quite ugly and also it is changing the intent of the existing tests. > > QUESTION > Line 120 - I did not really understand the SQL checking the pg_class. > I expected this would be checking table 'test_prepared1' instead. Can > you explain it? > SELECT 'pg_class' AS relation, locktype, mode > FROM pg_locks > WHERE locktype = 'relation' > AND relation = 'pg_class'::regclass; > relation | locktype | mode > ----------+----------+------ > (0 rows) > > ; Yes, I also think your expectation is correct and this should be on 'test_prepared_1'. > > QUESTION > Line 139 - SET statement_timeout = '1s'; is 1 seconds short enough > here for this test, or might it be that these statements would be > completed in less than one seconds anyhow? > > ; Good question. I think we have to mention the reason why logical decoding is not blocked while it needs to acquire a shared lock on the table and the previous commands already held an exclusive lock on the table. I am not sure if I am missing something but like you, it is not clear to me as well what this test intends to do, so surely more commentary is required. > > QUESTION > Line 163 - How is this testing a SAVEPOINT? Or is it only to check > that the SAVEPOINT command is not part of the replicated changes? > > ; It is more of testing that subtransactions will not create a problem while decoding. > > COMMENT > Line 175 - Missing underscore in comment. Code requires also underscore: > "nodecode" --> "_nodecode" > makes sense. 
> ========== > Patch V6-0001, File: contrib/test_decoding/test_decoding.c > ========== > > COMMENT > Line 43 > @@ -36,6 +40,7 @@ typedef struct > bool skip_empty_xacts; > bool xact_wrote_changes; > bool only_local; > + TransactionId check_xid; /* track abort of this txid */ > } TestDecodingData; > > The "check_xid" seems a meaningless name. Check what? > IIUC maybe should be something like "check_xid_aborted" > > ; > > COMMENT > Line 105 > @ -88,6 +93,19 @@ static void > pg_decode_stream_truncate(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > int nrelations, Relation relations[], > ReorderBufferChange *change); > +static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx, > + ReorderBufferTXN *txn, > > Remove extra blank line after these functions > > ; The above two sound like reasonable suggestions. > > COMMENT > Line 149 > @@ -116,6 +134,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb) > cb->stream_change_cb = pg_decode_stream_change; > cb->stream_message_cb = pg_decode_stream_message; > cb->stream_truncate_cb = pg_decode_stream_truncate; > + cb->filter_prepare_cb = pg_decode_filter_prepare; > + cb->prepare_cb = pg_decode_prepare_txn; > + cb->commit_prepared_cb = pg_decode_commit_prepared_txn; > + cb->abort_prepared_cb = pg_decode_abort_prepared_txn; > + > } > > There is a confusing mix of terminology where sometimes things are > referred as ROLLBACK/rollback and other times apparently the same > operation is referred as ABORT/abort. I do not know the root cause of > this mixture. IIUC maybe the internal functions and protocol generally > use the term "abort", whereas the SQL syntax is "ROLLBACK"... but > where those two terms collide in the middle it gets quite confusing. > > At least I thought the names of the "callbacks" which get exposed to > the user (e.g. in the help) might be better if they would match the > SQL. > "abort_prepared_cb" --> "rollback_prepared_db" > This suggestion sounds reasonable. I think it is to entertain the case where, due to an error, we need to roll back the transaction. I think it is better if we use 'rollback' terminology in the exposed functions. We already have a function with the name stream_abort_cb in the code which we also might want to rename but that is a separate thing and we can deal with it in a separate patch. > There are similar review comments like this below where the > alternating terms caused me some confusion. > > ~ > > Also, Remove the extra blank line before the end of the function. > > ; > > COMMENT > Line 267 > @ -227,6 +252,42 @@ pg_decode_startup(LogicalDecodingContext *ctx, > OutputPluginOptions *opt, > errmsg("could not parse value \"%s\" for parameter \"%s\"", > strVal(elem->arg), elem->defname))); > } > + else if (strcmp(elem->defname, "two-phase-commit") == 0) > + { > + if (elem->arg == NULL) > + continue; > > IMO the "check-xid" code might be better rearranged so the NULL check > is first instead of if/else. > e.g. > if (elem->arg == NULL) > ereport(FATAL, > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > errmsg("check-xid needs an input value"))); > ~ > > Also, is it really supposed to be FATAL instead or ERROR. That is not > the same as the other surrounding code. > > ; > > +1. > > COMMENT > Line 296 > if (data->check_xid <= 0) > ereport(ERROR, > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > errmsg("Specify positive value for parameter \"%s\"," > " you specified \"%s\"", > elem->defname, strVal(elem->arg)))); > > The code checking for <= 0 seems over-complicated.
Because conversion > was using strtoul() I fail to see how this can ever be < 0. Wouldn't > it be easier to simply test the result of the strtoul() function? > > BEFORE: if (errno == EINVAL || errno == ERANGE) > AFTER: if (data->check_xid == 0) > > Better to use TransactionIdIsValid(data->check_xid) here. > ~ > > Also, should this be FATAL? Everything else similar is ERROR. > > ; It should be an error. > > COMMENT > (general) > I don't recall seeing any of these decoding options (e.g. > "two-phase-commit", "check-xid") documented anywhere. > So how can a user even know these options exist so they can use them? > Perhaps options should be described on this page? > https://www.postgresql.org/docs/13/functions-admin.html#FUNCTIONS-REPLICATION > > ; I think we should do what we are doing for other options; if they are not documented then why document this one separately. I guess we can make a case to document all the existing options and write a separate patch for that. > > COMMENT > (general) > "check-xid" is a meaningless option name. Maybe something like > "checked-xid-aborted" is more useful? > Suggest changing the member, the option, and the error messages to > match some better name. > > ; > > COMMENT > Line 314 > @@ -238,6 +299,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, > OutputPluginOptions *opt, > } > > ctx->streaming &= enable_streaming; > + ctx->enable_twophase &= enable_2pc; > } > > The "ctx->enable_twophase" is inconsistent naming with the > "ctx->streaming" member. > "enable_twophase" --> "twophase" > > ; > > +1. > > COMMENT > Line 374 > @@ -297,6 +359,94 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > OutputPluginWrite(ctx, true); > } > > + > +/* > + * Filter out two-phase transactions. > + * > + * Each plugin can implement its own filtering logic. Here > + * we demonstrate a simple logic by checking the GID. If the > + * GID contains the "_nodecode" substring, then we filter > + * it out. > + */ > +static bool > +pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > > Remove the extra preceding blank line. > > ~ > > I did not find anything in the help about "_nodecode". Should it be > there or is this deliberately not documented feature? > > ; I guess we can document it along with filter_prepare API, if not already documented. > > QUESTION > Line 440 > +pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > > Is this a wrong comment > "ABORT PREPARED" --> "ROLLBACK PREPARED" ?? > > ; > > COMMENT > Line 620 > @@ -455,6 +605,22 @@ pg_decode_change(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > } > data->xact_wrote_changes = true; > > + /* if check_xid is specified */ > + if (TransactionIdIsValid(data->check_xid)) > + { > + elog(LOG, "waiting for %u to abort", data->check_xid); > + while (TransactionIdIsInProgress(dat > > The check_xid seems a meaningless name, and the comment "/* if > check_xid is specified */" was not helpful either. > IIUC purpose of this is to check that the nominated xid always is rolled back. > So the appropriate name may be more like "check-xid-aborted". > > ; Yeah, this part deserves better comments. -- With Regards, Amit Kapila.
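Pulling those review points together (NULL check first, ERROR rather than FATAL, and TransactionIdIsValid() instead of the <= 0 test), the option-parsing block in pg_decode_startup could end up looking something like the sketch below. The check-xid-aborted option and check_xid_aborted member are the renames suggested in the review, not what the v6 patch actually contains:

else if (strcmp(elem->defname, "check-xid-aborted") == 0)
{
	/* Do the NULL check first, instead of the if/else arrangement. */
	if (elem->arg == NULL)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("check-xid-aborted needs an input value")));

	/* check_xid_aborted is the suggested rename of check_xid */
	errno = 0;
	data->check_xid_aborted =
		(TransactionId) strtoul(strVal(elem->arg), NULL, 0);

	/* One validity test replaces the errno and <= 0 checks. */
	if (errno != 0 || !TransactionIdIsValid(data->check_xid_aborted))
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("could not parse value \"%s\" for parameter \"%s\"",
						strVal(elem->arg), elem->defname)));
}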
On Tue, Oct 6, 2020 at 10:23 AM Peter.B.Smith@fujitsu.com <Peter.B.Smith@fujitsu.com> wrote: > > > ========== > Patch V6-0001, File: doc/src/sgml/logicaldecoding.sgml > ========== > > COMMENT/QUESTION > Section 48.6.1 > @ -387,6 +387,10 @@ typedef struct OutputPluginCallbacks > LogicalDecodeTruncateCB truncate_cb; > LogicalDecodeCommitCB commit_cb; > LogicalDecodeMessageCB message_cb; > + LogicalDecodeFilterPrepareCB filter_prepare_cb; > > Confused by the mixing of terminologies "abort" and "rollback". > Why is it LogicalDecodeAbortPreparedCB instead of > LogicalDecodeRollbackPreparedCB? > Why is it abort_prepared_cb instead of rollback_prepared_cb;? > > I thought everything the user sees should be ROLLBACK/rollback (like > the SQL) regardless of what the internal functions might be called. > > ; > Fair enough. > COMMENT > Section 48.6.1 > The begin_cb, change_cb and commit_cb callbacks are required, while > startup_cb, filter_by_origin_cb, truncate_cb, and shutdown_cb are > optional. If truncate_cb is not set but a TRUNCATE is to be decoded, > the action will be ignored. > > The 1st paragraph beneath the typedef does not mention the newly added > callbacks to say if they are required or optional. > > ; Yeah, in code comments it was mentioned but is missed here, see the comment "To support two phase logical decoding, we require prepare/commit-prepare/abort-prepare callbacks. The filter-prepare callback is optional.". I think instead of directly editing the above paragraph we can write a new one similar to what we have done for streaming of large in-progress transactions (Refer <para> An output plugin may also define functions to support streaming of large, in-progress transactions.). > > COMMENT > Section 48.6.4.5 > Section 48.6.4.6 > Section 48.6.4.7 > @@ -578,6 +588,55 @@ typedef void (*LogicalDecodeCommitCB) (struct > LogicalDecodingContext *ctx, > </para> > </sect3> > > + <sect3 id="logicaldecoding-output-plugin-prepare"> > + <sect3 id="logicaldecoding-output-plugin-commit-prepared"> > + <sect3 id="logicaldecoding-output-plugin-abort-prepared"> > +<programlisting> > > The wording and titles are a bit backwards compared to the others. > e.g. previously was "Transaction Begin" (not "Begin Transaction") and > "Transaction End" (not "End Transaction"). > > So for consistently following the existing IMO should change these new > titles (and wording) to: > - "Commit Prepared Transaction Callback" --> "Transaction Commit > Prepared Callback" > - "Rollback Prepared Transaction Callback" --> "Transaction Rollback > Prepared Callback" > makes sense. > - "whenever a commit prepared transaction has been decoded" --> > "whenever a transaction commit prepared has been decoded" > - "whenever a rollback prepared transaction has been decoded." --> > "whenever a transaction rollback prepared has been decoded." > > ; I don't find above suggestions much better than current wording. How about below instead? "whenever we decode a transaction which is prepared for two-phase commit is committed" "whenever we decode a transaction which is prepared for two-phase commit is rolled back" Also, related to this: + <sect3 id="logicaldecoding-output-plugin-commit-prepared"> + <title>Commit Prepared Transaction Callback</title> + + <para> + The optional <function>commit_prepared_cb</function> callback is called whenever + a commit prepared transaction has been decoded. The <parameter>gid</parameter> field, + which is part of the <parameter>txn</parameter> parameter can be used in this + callback. 
+<programlisting> +typedef void (*LogicalDecodeCommitPreparedCB) (struct LogicalDecodingContext *ctx, + ReorderBufferTXN *txn, + XLogRecPtr commit_lsn); +</programlisting> + </para> + </sect3> + + <sect3 id="logicaldecoding-output-plugin-abort-prepared"> + <title>Rollback Prepared Transaction Callback</title> + + <para> + The optional <function>abort_prepared_cb</function> callback is called whenever + a rollback prepared transaction has been decoded. The <parameter>gid</parameter> field, + which is part of the <parameter>txn</parameter> parameter can be used in this + callback. +<programlisting> Both the above are not optional as per code and I think code is correct. I think the documentation is wrong here. > > ========== > Patch V6-0001, File: src/backend/replication/logical/decode.c > ========== > > COMMENT > Line 74 > @@ -70,6 +70,9 @@ static void DecodeCommit(LogicalDecodingContext > *ctx, XLogRecordBuffer *buf, > xl_xact_parsed_commit *parsed, TransactionId xid); > static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, > xl_xact_parsed_abort *parsed, TransactionId xid); > +static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, > + xl_xact_parsed_prepare * parsed); > > The 2nd line of DecodePrepare is misaligned by one space. > > ; Yeah, probably pgindent is the answer. Ajin, can you please run pgindent on all the patches? > > COMMENT > Line 321 > @@ -312,17 +315,34 @@ DecodeXactOp(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf) > } > break; > case XLOG_XACT_PREPARE: > + { > + xl_xact_parsed_prepare parsed; > + xl_xact_prepare *xlrec; > + /* check that output plugin is capable of twophase decoding */ > > "twophase" --> "two-phase" > > ~ > > Also, add a blank line after the declarations. > > ; > > ========== > Patch V6-0001, File: src/backend/replication/logical/logical.c > ========== > > COMMENT > Line 249 > @@ -225,6 +237,19 @@ StartupDecodingContext(List *output_plugin_options, > (ctx->callbacks.stream_message_cb != NULL) || > (ctx->callbacks.stream_truncate_cb != NULL); > > + /* > + * To support two phase logical decoding, we require > prepare/commit-prepare/abort-prepare > + * callbacks. The filter-prepare callback is optional. We however > enable two phase logical > + * decoding when at least one of the methods is enabled so that we > can easily identify > + * missing methods. > > The terminology is generally well known as "two-phase" (with the > hyphen) https://en.wikipedia.org/wiki/Two-phase_commit_protocol so > let's be consistent for all the patch code comments. Please search the > code and correct this in all places, even where I might have missed to > identify it. > > "two phase" --> "two-phase" > > ; > > COMMENT > Line 822 > @@ -782,6 +807,111 @@ commit_cb_wrapper(ReorderBuffer *cache, > ReorderBufferTXN *txn, > } > > static void > +prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, > + XLogRecPtr prepare_lsn) > > "support 2 phase" --> "supports two-phase" in the comment > > ; > > COMMENT > Line 844 > Code condition seems strange and/or broken. > if (ctx->enable_twophase && ctx->callbacks.prepare_cb == NULL) > Because if the flag is null then this condition is skipped. > But then if the callback was also NULL then attempting to call it to > "do the actual work" will give NPE. > > ~ > > Also, I wonder should this check be the first thing in this function? 
> Because if it fails does it even make sense that all the errcallback > code was set up?> E.g errcallback.arg potentially is left pointing to a stack variable > on a stack that no longer exists. > > ; Right, I think we should have an Assert(ctx->enable_twophase) in the beginning and then have the check (ctx->callbacks.prepare_cb == NULL) at its current place. Refer to any of the streaming APIs (for ex. stream_stop_cb_wrapper). > > COMMENT > Line 857 > +commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, > > "support 2 phase" --> "supports two-phase" in the comment > > ~ > > Also, Same potential trouble with the condition: > if (ctx->enable_twophase && ctx->callbacks.commit_prepared_cb == NULL) > Same as previously asked. Should this check be first thing in this function? > > ; Yeah, so the same solution as mentioned above can be used. > > COMMENT > Line 892 > +abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, > > "support 2 phase" --> "supports two-phase" in the comment > > ~ > > Same potential trouble with the condition: > if (ctx->enable_twophase && ctx->callbacks.abort_prepared_cb == NULL) > Same as previously asked. Should this check be the first thing in this function? > > ; Again the same solution can be used. > > COMMENT > Line 1013 > @@ -858,6 +988,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, > ReorderBufferTXN *txn, > error_context_stack = errcallback.previous; > } > > +static bool > +filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, > + TransactionId xid, const char *gid) > > Fix wording in comment: > "twophase" --> "two-phase transactions" > "twophase transactions" --> "two-phase transactions" > > ========== > Patch V6-0001, File: src/backend/replication/logical/reorderbuffer.c > ========== > > COMMENT > Line 255 > @@ -251,7 +251,8 @@ static Size > ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn > static void ReorderBufferRestoreChange(ReorderBuffer *rb, > ReorderBufferTXN *txn, > char *change); > static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, > ReorderBufferTXN *txn); > -static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn); > +static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, > + bool txn_prepared); > > The alignment is inconsistent. One more space needed before "bool txn_prepared" > > ; > > COMMENT > Line 417 > @@ -413,6 +414,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn) > } > > /* free data that's contained */ > + if (txn->gid != NULL) > + { > + pfree(txn->gid); > + txn->gid = NULL; > + } > > Should add the blank line before this new code, as it was before. > > ; > > COMMENT > Line 1564 > @ -1502,12 +1561,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn) > } > > /* > - * Discard changes from a transaction (and subtransactions), after streaming > - * them. Keep the remaining info - transactions, tuplecids, invalidations and > - * snapshots. > + * Discard changes from a transaction (and subtransactions), either > after streaming or > + * after a PREPARE. > > typo "snapshots.If" -> "snapshots.
If" > > ; > > COMMENT/QUESTION > Line 1590 > @@ -1526,7 +1587,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn) > Assert(rbtxn_is_known_subxact(subtxn)); > Assert(subtxn->nsubtxns == 0); > > - ReorderBufferTruncateTXN(rb, subtxn); > + ReorderBufferTruncateTXN(rb, subtxn, txn_prepared); > } > > There are some code paths here I did not understand how they match the comments. > Because this function is recursive it seems that it may be called > where the 2nd parameter txn is a sub-transaction. > > But then this seems at odds with some of the other code comments of > this function which are processing the txn without ever testing is it > really toplevel or not: > > e.g. Line 1593 "/* cleanup changes in the toplevel txn */" I think this comment is wrong but this is not the fault of this patch. > e.g. Line 1632 "They are always stored in the toplevel transaction." > > ; This seems to be correct and we probably need an Assert that the transaction is a top-level transaction. > > COMMENT > Line 1644 > @@ -1560,9 +1621,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn) > * about the toplevel xact (we send the XID in all messages), but we never > * stream XIDs of empty subxacts. > */ > - if ((!txn->toptxn) || (txn->nentries_mem != 0)) > + if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0))) > txn->txn_flags |= RBTXN_IS_STREAMED; > > + if (txn_prepared) > > /* remove the change from it's containing list */ > typo "it's" --> "its" > > ; > > QUESTION > Line 1977 > @@ -1880,7 +1965,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > ReorderBufferChange *specinsert) > { > /* Discard the changes that we just streamed */ > - ReorderBufferTruncateTXN(rb, txn); > + ReorderBufferTruncateTXN(rb, txn, false); > > How do you know the 3rd parameter - i.e. txn_prepared - should be > hardwired false here? > e.g. I thought that maybe rbtxn_prepared(txn) can be true here. > > ; > > COMMENT > Line 2345 > @@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > break; > } > } > - > /* > > Looks like accidental blank line deletion. This should be put back how it was > > ; > > COMMENT/QUESTION > Line 2374 > @@ -2278,7 +2362,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > } > } > else > - rb->commit(rb, txn, commit_lsn); > + { > + /* > + * Call either PREPARE (for twophase transactions) or COMMIT > + * (for regular ones). > > "twophase" --> "two-phase" > > ~ > > Also, I was confused by the apparent assumption of exclusiveness of > streaming and 2PC... > e.g. what if streaming AND 2PC then it won't do rb->prepare() > > ; > > QUESTION > Line 2424 > @@ -2319,11 +2412,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > */ > if (streaming) > { > - ReorderBufferTruncateTXN(rb, txn); > + ReorderBufferTruncateTXN(rb, txn, false); > > /* Reset the CheckXidAlive */ > CheckXidAlive = InvalidTransactionId; > } > + else if (rbtxn_prepared(txn)) > > I was confused by the exclusiveness of streaming/2PC. > e.g. what if streaming AND 2PC at same time - how can you pass false > as 3rd param to ReorderBufferTruncateTXN? > > ; Yeah, this and another handling wherever it is assumed that both can't be true together is wrong. > > COMMENT > Line 2463 > @@ -2352,17 +2451,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > > /* > * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent > - * abort of the (sub)transaction we are streaming. 
We need to do the > + * abort of the (sub)transaction we are streaming or preparing. We > need to do the > * cleanup and return gracefully on this error, see SetupCheckXidLive. > */ > > "twoi phase" --> "two-phase" > > ; > > QUESTIONS > Line 2482 > @@ -2370,10 +2470,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > errdata = NULL; > curtxn->concurrent_abort = true; > > - /* Reset the TXN so that it is allowed to stream remaining data. */ > - ReorderBufferResetTXN(rb, txn, snapshot_now, > - command_id, prev_lsn, > - specinsert); > + /* If streaming, reset the TXN so that it is allowed to stream > remaining data. */ > + if (streaming) > > Re: /* If streaming, reset the TXN so that it is allowed to stream > remaining data. */ > I was confused by the exclusiveness of streaming/2PC. > Is it not possible for streaming flags and rbtxn_prepared(txn) true at > the same time? > Yeah, I think it is not correct to assume that both can't be true at the same time. But when prepared is true irrespective of whether streaming is true or not we can use ReorderBufferTruncateTXN() API instead of Reset API. > ~ > > elog(LOG, "stopping decoding of %s (%u)", > txn->gid[0] != '\0'? txn->gid:"", txn->xid); > > Is this a safe operation, or do you also need to test txn->gid is not NULL? > > ; I think if 'prepared' is true then we can assume it to be non-NULL, otherwise, not. I am responding to your email in phases so that we can have a discussion on specific points if required and I am slightly afraid that the email might not bounce as it happened in your case when you sent such a long email. -- With Regards, Amit Kapila.
On Wed, Oct 7, 2020 at 1:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > There is a confusing mix of terminology where sometimes things are > > referred as ROLLBACK/rollback and other times apparently the same > > operation is referred as ABORT/abort. I do not know the root cause of > > this mixture. IIUC maybe the internal functions and protocol generally > > use the term "abort", whereas the SQL syntax is "ROLLBACK"... but > > where those two terms collide in the middle it gets quite confusing. > > > > At least I thought the names of the "callbacks" which get exposed to > > the user (e.g. in the help) might be better if they would match the > > SQL. > > "abort_prepared_cb" --> "rollback_prepared_db" > > > > This suggestion sounds reasonable. I think it is to entertain the case > where due to error we need to rollback the transaction. I think it is > better if use 'rollback' terminology in the exposed functions. We > already have a function with the name stream_abort_cb in the code > which we also might want to rename but that is a separate thing and we > can deal it with a separate patch. So, for an ordinary transaction, rollback implies an explicit user action, but an abort could either be an explicit user action (ABORT; or ROLLBACK;) or an error. I agree that calling that case "abort" rather than "rollback" is better. However, the situation is a bit different for a prepared transaction: no error can prevent such a transaction from being committed. That is the whole point of being able to prepare transactions. So it is not unreasonable to think of use "rollback" rather than "abort" for prepared transactions, but I think it would be wrong in other cases. On the other hand, using "abort" for all the cases also doesn't seem bad to me. It's true that there is no ABORT PREPARED command at the SQL level, but I don't think that is very important. I don't feel wrong saying that ROLLBACK PREPARED causes a transaction abort. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Oct 8, 2020 at 6:14 AM Robert Haas <robertmhaas@gmail.com> wrote: > So, for an ordinary transaction, rollback implies an explicit user > action, but an abort could either be an explicit user action (ABORT; > or ROLLBACK;) or an error. I agree that calling that case "abort" > rather than "rollback" is better. However, the situation is a bit > different for a prepared transaction: no error can prevent such a > transaction from being committed. That is the whole point of being > able to prepare transactions. So it is not unreasonable to think of > use "rollback" rather than "abort" for prepared transactions, but I > think it would be wrong in other cases. On the other hand, using > "abort" for all the cases also doesn't seem bad to me. It's true that > there is no ABORT PREPARED command at the SQL level, but I don't think > that is very important. I don't feel wrong saying that ROLLBACK > PREPARED causes a transaction abort. > So, as I understand you don't object to renaming the callback APIs for ROLLBACK PREPARED transactions to "rollback_prepared_cb" but keeping the "stream_abort" as such. This was what I was planning on doing. I was just writing this up, so wanted to confirm. regards, Ajin Cherian Fujitsu Australia
On Tue, Oct 6, 2020 at 10:23 AM Peter.B.Smith@fujitsu.com <Peter.B.Smith@fujitsu.com> wrote: > > ========== > Patch V6-0001, File: src/include/replication/reorderbuffer.h > ========== > > QUESTION > Line 116 > @@ -162,9 +163,13 @@ typedef struct ReorderBufferChange > #define RBTXN_HAS_CATALOG_CHANGES 0x0001 > #define RBTXN_IS_SUBXACT 0x0002 > #define RBTXN_IS_SERIALIZED 0x0004 > -#define RBTXN_IS_STREAMED 0x0008 > -#define RBTXN_HAS_TOAST_INSERT 0x0010 > -#define RBTXN_HAS_SPEC_INSERT 0x0020 > +#define RBTXN_PREPARE 0x0008 > +#define RBTXN_COMMIT_PREPARED 0x0010 > +#define RBTXN_ROLLBACK_PREPARED 0x0020 > +#define RBTXN_COMMIT 0x0040 > +#define RBTXN_IS_STREAMED 0x0080 > +#define RBTXN_HAS_TOAST_INSERT 0x0100 > +#define RBTXN_HAS_SPEC_INSERT 0x0200 > > I was wondering why when adding new flags, some of the existing flag > masks were also altered. > I am assuming this is ok because they are never persisted but are only > used in the protocol (??) > > ; This is bad even though there is no direct problem. I don't think we need to change the existing ones, we can add the new ones at the end with the numbering starting where the last one ends. > > > COMMENT > Line 133 > > Assert(strlen(txn->gid) > 0); > Shouldn't that assertion also check txn->gid is not NULL (to prevent > an NPE in case gid was NULL)? > > ; I think that would be better and a stronger assertion than the current one. > > COMMENT > Line 177 > +logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data) > > prepare_data->prepare_type = flags; > This code may be OK but it does seem a bit of an abuse of the flags. > > e.g. Are they flags or are they really enum values? > e.g. And if they are effectively enums (it appears they are) then > it seemed inconsistent that |= was used when they were previously > assigned. > > ; I don't understand this point. As far as I can see at the time of write (logicalrep_write_prepare()), the patch has used |=, and at the time of reading (logicalrep_read_prepare()) it has used assignment which seems correct from the code perspective. Do you have a better proposal? > > > > COMMENT > Line 408 > +pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > > Since this function is identical to pgoutput_prepare_txn, it might be > better to either > 1. just leave this as a wrapper to delegate to that function > 2. remove this one entirely and assign the callback to the common > pgoutput_prepare_txn > > ; I think this is because as of now the patch uses the same function and protocol message to send both Prepare and Commit/Rollback Prepare which I am not sure is the right thing. I suggest keeping that code as it is for now. Let's first try to figure out if it is a good idea to overload the same protocol message and use flags to distinguish the actual message. Also, I don't know whether prepare_lsn is required at commit time. > > COMMENT > Line 419 > +pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > > Since this function is identical to pgoutput_prepare_txn, it might be > better to either > 1. just leave this as a wrapper to delegate to that function > 2. remove this one entirely and assign the callback to the common > pgoutput_prepare_txn > > ; Due to reasons mentioned for the previous comment, let's keep this also as it is for now. -- With Regards, Amit Kapila.
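For illustration, keeping the existing flag values untouched and appending the new two-phase ones where the last existing value ends would look like the sketch below (existing names and values are from the quoted diff; the appended values are a sketch, not the final patch):

#define RBTXN_HAS_CATALOG_CHANGES 0x0001
#define RBTXN_IS_SUBXACT          0x0002
#define RBTXN_IS_SERIALIZED       0x0004
#define RBTXN_IS_STREAMED         0x0008
#define RBTXN_HAS_TOAST_INSERT    0x0010
#define RBTXN_HAS_SPEC_INSERT     0x0020
/* New two-phase flags continue where the existing ones end. */
#define RBTXN_PREPARE             0x0040
#define RBTXN_COMMIT_PREPARED     0x0080
#define RBTXN_ROLLBACK_PREPARED   0x0100
#define RBTXN_COMMIT              0x0200

Similarly, the stronger assertion discussed for logicalrep_write_prepare() could simply be:

Assert(txn->gid != NULL && strlen(txn->gid) > 0);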
On Thu, Oct 8, 2020 at 5:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > COMMENT > > Line 177 > > +logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data) > > > > prepare_data->prepare_type = flags; > > This code may be OK but it does seem a bit of an abuse of the flags. > > > > e.g. Are they flags or are they really enum values? > > e.g. And if they are effectively enums (it appears they are) then > > it seemed inconsistent that |= was used when they were previously > > assigned. > > > > ; > > I don't understand this point. As far as I can see at the time of > write (logicalrep_write_prepare()), the patch has used |=, and at the > time of reading (logicalrep_read_prepare()) it has used assignment > which seems correct from the code perspective. Do you have a better > proposal? OK. I will explain what I was thinking when I wrote that review comment. I agree all is "correct" from a code perspective. But IMO using bit arithmetic implies that different combinations are also possible, whereas in the current code they are not. So the code is kind of having a bet each way - sometimes treating "flags" as bit flags and sometimes as enums. e.g. If these flags are not really bit flags at all then the logicalrep_write_prepare() code might just as well be written as below: BEFORE if (rbtxn_commit_prepared(txn)) flags |= LOGICALREP_IS_COMMIT_PREPARED; else if (rbtxn_rollback_prepared(txn)) flags |= LOGICALREP_IS_ROLLBACK_PREPARED; else flags |= LOGICALREP_IS_PREPARE; /* Make sure exactly one of the expected flags is set. */ if (!PrepareFlagsAreValid(flags)) elog(ERROR, "unrecognized flags %u in prepare message", flags); AFTER if (rbtxn_commit_prepared(txn)) flags = LOGICALREP_IS_COMMIT_PREPARED; else if (rbtxn_rollback_prepared(txn)) flags = LOGICALREP_IS_ROLLBACK_PREPARED; else flags = LOGICALREP_IS_PREPARE; ~ OTOH, if you really do want to anticipate having future flag bit combinations then maybe the PrepareFlagsAreValid() macro ought to be tweaked accordingly, and the logicalrep_read_prepare() code should maybe look more like below: BEFORE /* set the action (reuse the constants used for the flags) */ prepare_data->prepare_type = flags; AFTER /* set the action (reuse the constants used for the flags) */ prepare_data->prepare_type = flags & LOGICALREP_IS_COMMIT_PREPARED ? LOGICALREP_IS_COMMIT_PREPARED : flags & LOGICALREP_IS_ROLLBACK_PREPARED ? LOGICALREP_IS_ROLLBACK_PREPARED : LOGICALREP_IS_PREPARE; Kind Regards. Peter Smith Fujitsu Australia
On Fri, Oct 9, 2020 at 5:45 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Thu, Oct 8, 2020 at 5:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > COMMENT > > > Line 177 > > > +logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data) > > > > > > prepare_data->prepare_type = flags; > > > This code may be OK but it does seem a bit of an abuse of the flags. > > > > > > e.g. Are they flags or are they really enum values? > > > e.g. And if they are effectively enums (it appears they are) then > > > it seemed inconsistent that |= was used when they were previously > > > assigned. > > > > > > ; > > > > I don't understand this point. As far as I can see at the time of > > write (logicalrep_write_prepare()), the patch has used |=, and at the > > time of reading (logicalrep_read_prepare()) it has used assignment > > which seems correct from the code perspective. Do you have a better > > proposal? > > OK. I will explain what I was thinking when I wrote that review comment. > > I agree all is "correct" from a code perspective. > > But IMO using bit arithmetic implies that different combinations are > also possible, whereas in the current code they are not. > So the code is kind of having a bet each way - sometimes treating "flags" > as bit flags and sometimes as enums. > > e.g. If these flags are not really bit flags at all then the > logicalrep_write_prepare() code might just as well be written as > below: > > BEFORE > if (rbtxn_commit_prepared(txn)) > flags |= LOGICALREP_IS_COMMIT_PREPARED; > else if (rbtxn_rollback_prepared(txn)) > flags |= LOGICALREP_IS_ROLLBACK_PREPARED; > else > flags |= LOGICALREP_IS_PREPARE; > > /* Make sure exactly one of the expected flags is set. */ > if (!PrepareFlagsAreValid(flags)) > elog(ERROR, "unrecognized flags %u in prepare message", flags); > > > AFTER > if (rbtxn_commit_prepared(txn)) > flags = LOGICALREP_IS_COMMIT_PREPARED; > else if (rbtxn_rollback_prepared(txn)) > flags = LOGICALREP_IS_ROLLBACK_PREPARED; > else > flags = LOGICALREP_IS_PREPARE; > > ~ > > OTOH, if you really do want to anticipate having future flag bit > combinations > I don't anticipate more combinations; rather, I am not yet sure whether we want to distinguish these operations with flags or have separate messages for each of these operations. I think for now we can go with your proposal above. -- With Regards, Amit Kapila.
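So, end to end, the agreed direction sketches out as follows. The names are taken from the code quoted above; this is illustrative of the proposal being accepted, not the final patch text:

/* logicalrep_write_prepare(): exactly one action, so plain assignment */
if (rbtxn_commit_prepared(txn))
	flags = LOGICALREP_IS_COMMIT_PREPARED;
else if (rbtxn_rollback_prepared(txn))
	flags = LOGICALREP_IS_ROLLBACK_PREPARED;
else
	flags = LOGICALREP_IS_PREPARE;

/* logicalrep_read_prepare(): validate, then the action is the flag value */
if (!PrepareFlagsAreValid(flags))
	elog(ERROR, "unrecognized flags %u in prepare message", flags);
prepare_data->prepare_type = flags;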
On Wed, Oct 7, 2020 at 4:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > All the above comments are genuine and I think it is mostly because > the author has blindly modified the existing tests without completely > understanding the intent of the test. I suggest we write a completely > new regression file (decode_prepared.sql) for these and just copy > whatever is required from prepared.sql. Once we do that we might also > want to rename existing prepared.sql to decode_commit_prepared.sql or > something like that. I think modifying the existing test appears to be > quite ugly and also it is changing the intent of the existing tests. > Updated this. Kept the original prepared.sql untouched and added a new regression file called two_phase.sql which is specific to test cases with the new flag two-phase-commit. > > > > QUESTION > > Line 120 - I did not really understand the SQL checking the pg_class. > > I expected this would be checking table 'test_prepared1' instead. Can > > you explain it? > > SELECT 'pg_class' AS relation, locktype, mode > > FROM pg_locks > > WHERE locktype = 'relation' > > AND relation = 'pg_class'::regclass; > > relation | locktype | mode > > ----------+----------+------ > > (0 rows) > > > > ; > > Yes, I also think your expectation is correct and this should be on > 'test_prepared_1'. Updated > > > > > QUESTION > > Line 139 - SET statement_timeout = '1s'; is 1 seconds short enough > > here for this test, or might it be that these statements would be > > completed in less than one seconds anyhow? > > > > ; > > Good question. I think we have to mention the reason why logical > decoding is not blocked while it needs to acquire a shared lock on the > table and the previous commands already held an exclusive lock on the > table. I am not sure if I am missing something but like you, it is not > clear to me as well what this test intends to do, so surely more > commentary is required. Updated. > > > > > > QUESTION > > Line 163 - How is this testing a SAVEPOINT? Or is it only to check > > that the SAVEPOINT command is not part of the replicated changes? > > > > ; > > It is more of testing that subtransactions will not create a problem > while decoding. Updated with a testcase that actually does a rollback to a savepoint. > > > > > COMMENT > > Line 175 - Missing underscore in comment. Code requires also underscore: > > "nodecode" --> "_nodecode" > > > > makes sense. Updated. > > > ========== > > Patch V6-0001, File: contrib/test_decoding/test_decoding.c > > ========== > > > > COMMENT > > Line 43 > > @@ -36,6 +40,7 @@ typedef struct > > bool skip_empty_xacts; > > bool xact_wrote_changes; > > bool only_local; > > + TransactionId check_xid; /* track abort of this txid */ > > } TestDecodingData; > > > > The "check_xid" seems a meaningless name. Check what? > > IIUC maybe should be something like "check_xid_aborted" Updated. > > > > ; > > > > COMMENT > > Line 105 > > @ -88,6 +93,19 @@ static void > > pg_decode_stream_truncate(LogicalDecodingContext *ctx, > > ReorderBufferTXN *txn, > > int nrelations, Relation relations[], > > ReorderBufferChange *change); > > +static bool pg_decode_filter_prepare(LogicalDecodingContext *ctx, > > + ReorderBufferTXN *txn, > > > > Remove extra blank line after these functions > > > > ; > > The above two sound like reasonable suggestions. Updated.
> > > > COMMENT > > Line 149 > > @@ -116,6 +134,11 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb) > > cb->stream_change_cb = pg_decode_stream_change; > > cb->stream_message_cb = pg_decode_stream_message; > > cb->stream_truncate_cb = pg_decode_stream_truncate; > > + cb->filter_prepare_cb = pg_decode_filter_prepare; > > + cb->prepare_cb = pg_decode_prepare_txn; > > + cb->commit_prepared_cb = pg_decode_commit_prepared_txn; > > + cb->abort_prepared_cb = pg_decode_abort_prepared_txn; > > + > > } > > > > There is a confusing mix of terminology where sometimes things are > > referred as ROLLBACK/rollback and other times apparently the same > > operation is referred as ABORT/abort. I do not know the root cause of > > this mixture. IIUC maybe the internal functions and protocol generally > > use the term "abort", whereas the SQL syntax is "ROLLBACK"... but > > where those two terms collide in the middle it gets quite confusing. > > > > At least I thought the names of the "callbacks" which get exposed to > > the user (e.g. in the help) might be better if they would match the > > SQL. > > "abort_prepared_cb" --> "rollback_prepared_db" > > > > This suggestion sounds reasonable. I think it is to entertain the case > where, due to an error, we need to roll back the transaction. I think it is > better if we use 'rollback' terminology in the exposed functions. We > already have a function with the name stream_abort_cb in the code > which we also might want to rename but that is a separate thing and we > can deal with it in a separate patch. Changed the callback names from abort_prepared to rollback_prepared and stream_abort_prepared to stream_rollback_prepared. > > > There are similar review comments like this below where the > > alternating terms caused me some confusion. > > > > ~ > > > > Also, Remove the extra blank line before the end of the function. > > > > ; > > > > COMMENT > > Line 267 > > @ -227,6 +252,42 @@ pg_decode_startup(LogicalDecodingContext *ctx, > > OutputPluginOptions *opt, > > errmsg("could not parse value \"%s\" for parameter \"%s\"", > > strVal(elem->arg), elem->defname))); > > } > > + else if (strcmp(elem->defname, "two-phase-commit") == 0) > > + { > > + if (elem->arg == NULL) > > + continue; > > > > IMO the "check-xid" code might be better rearranged so the NULL check > > is first instead of if/else. > > e.g. > > if (elem->arg == NULL) > > ereport(FATAL, > > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > errmsg("check-xid needs an input value"))); > > ~ > > > > Also, is it really supposed to be FATAL instead or ERROR. That is not > > the same as the other surrounding code. > > > > ; > > +1. Updated. > > > > > COMMENT > > Line 296 > > if (data->check_xid <= 0) > > ereport(ERROR, > > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > errmsg("Specify positive value for parameter \"%s\"," > > " you specified \"%s\"", > > elem->defname, strVal(elem->arg)))); > > > > The code checking for <= 0 seems over-complicated. Because conversion > > was using strtoul() I fail to see how this can ever be < 0. Wouldn't > > it be easier to simply test the result of the strtoul() function? > > > > BEFORE: if (errno == EINVAL || errno == ERANGE) > > AFTER: if (data->check_xid == 0) > > > > Better to use TransactionIdIsValid(data->check_xid) here. Updated. > > > ~ > > > > Also, should this be FATAL? Everything else similar is ERROR. > > > > ; > > It should be an error. Updated > > > > > COMMENT > > (general) > > I don't recall seeing any of these decoding options (e.g.
> > "two-phase-commit", "check-xid") documented anywhere. > > So how can a user even know these options exist so they can use them? > > Perhaps options should be described on this page? > > https://www.postgresql.org/docs/13/functions-admin.html#FUNCTIONS-REPLICATION > > > > ; > > I think we should do what we are doing for other options, if they are > not documented then why to document this one separately. I guess we > can make a case to document all the existing options and write a > separate patch for that. I didnt see any of the test_decoding options updated in the documentation as these seem specific for the test_decoder used in testing. https://www.postgresql.org/docs/13/test-decoding.html > > > > > COMMENT > > (general) > > "check-xid" is a meaningless option name. Maybe something like > > "checked-xid-aborted" is more useful? > > Suggest changing the member, the option, and the error messages to > > match some better name. Updated. > > > > ; > > > > COMMENT > > Line 314 > > @@ -238,6 +299,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, > > OutputPluginOptions *opt, > > } > > > > ctx->streaming &= enable_streaming; > > + ctx->enable_twophase &= enable_2pc; > > } > > > > The "ctx->enable_twophase" is inconsistent naming with the > > "ctx->streaming" member. > > "enable_twophase" --> "twophase" > > > > ; > > +1. Updated > > > > > COMMENT > > Line 374 > > @@ -297,6 +359,94 @@ pg_decode_commit_txn(LogicalDecodingContext *ctx, > > ReorderBufferTXN *txn, > > OutputPluginWrite(ctx, true); > > } > > > > + > > +/* > > + * Filter out two-phase transactions. > > + * > > + * Each plugin can implement its own filtering logic. Here > > + * we demonstrate a simple logic by checking the GID. If the > > + * GID contains the "_nodecode" substring, then we filter > > + * it out. > > + */ > > +static bool > > +pg_decode_filter_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > > > > Remove the extra preceding blank line. > > Updated. > > ~ > > > > I did not find anything in the help about "_nodecode". Should it be > > there or is this deliberately not documented feature? > > > > ; > > I guess we can document it along with filter_prepare API, if not > already documented. > Again , this seems to be specific to test_decoder and an example of a way to create a filter_prepare. > > > > QUESTION > > Line 440 > > +pg_decode_abort_prepared_txn(LogicalDecodingContext *ctx, > > ReorderBufferTXN *txn, > > > > Is this a wrong comment > > "ABORT PREPARED" --> "ROLLBACK PREPARED" ?? > > > > ; > > > > COMMENT > > Line 620 > > @@ -455,6 +605,22 @@ pg_decode_change(LogicalDecodingContext *ctx, > > ReorderBufferTXN *txn, > > } > > data->xact_wrote_changes = true; > > > > + /* if check_xid is specified */ > > + if (TransactionIdIsValid(data->check_xid)) > > + { > > + elog(LOG, "waiting for %u to abort", data->check_xid); > > + while (TransactionIdIsInProgress(dat > > > > The check_xid seems a meaningless name, and the comment "/* if > > check_xid is specified */" was not helpful either. > > IIUC purpose of this is to check that the nominated xid always is rolled back. > > So the appropriate name may be more like "check-xid-aborted". > > > > ; > > Yeah, this part deserves better comments. Updated. Other than these first batch of review comments from Peter Smith, I've also updated new functions in decode.c for DecodeCommitPrepared and DecodeAbortPrepared as agreed in a previous review comment by Amit and Dilip. 
I've also incorporated Dilip's comment on acquiring a SHARED lock rather than an EXCLUSIVE lock while looking up the transaction matching the GID. Since Peter's comments are numerous, I'll send the patch updates in parts as I address them. regards, Ajin Cherian Fujitsu Australia
On Wed, Oct 7, 2020 at 9:36 AM Peter Smith <smithpb2250@gmail.com> wrote: > ========== > Patch V6-0001, File: doc/src/sgml/logicaldecoding.sgml > ========== > > COMMENT/QUESTION > Section 48.6.1 > @ -387,6 +387,10 @@ typedef struct OutputPluginCallbacks > LogicalDecodeTruncateCB truncate_cb; > LogicalDecodeCommitCB commit_cb; > LogicalDecodeMessageCB message_cb; > + LogicalDecodeFilterPrepareCB filter_prepare_cb; > > Confused by the mixing of terminologies "abort" and "rollback". > Why is it LogicalDecodeAbortPreparedCB instead of > LogicalDecodeRollbackPreparedCB? > Why is it abort_prepared_cb instead of rollback_prepared_cb;? > > I thought everything the user sees should be ROLLBACK/rollback (like > the SQL) regardless of what the internal functions might be called. > > ; Modified. > > COMMENT > Section 48.6.1 > The begin_cb, change_cb and commit_cb callbacks are required, while > startup_cb, filter_by_origin_cb, truncate_cb, and shutdown_cb are > optional. If truncate_cb is not set but a TRUNCATE is to be decoded, > the action will be ignored. > > The 1st paragraph beneath the typedef does not mention the newly added > callbacks to say if they are required or optional. > Added a new para for this. > ; > > COMMENT > Section 48.6.4.5 > Section 48.6.4.6 > Section 48.6.4.7 > @@ -578,6 +588,55 @@ typedef void (*LogicalDecodeCommitCB) (struct > LogicalDecodingContext *ctx, > </para> > </sect3> > > + <sect3 id="logicaldecoding-output-plugin-prepare"> > + <sect3 id="logicaldecoding-output-plugin-commit-prepared"> > + <sect3 id="logicaldecoding-output-plugin-abort-prepared"> > +<programlisting> > > The wording and titles are a bit backwards compared to the others. > e.g. previously was "Transaction Begin" (not "Begin Transaction") and > "Transaction End" (not "End Transaction"). > > So for consistently following the existing IMO should change these new > titles (and wording) to: > - "Commit Prepared Transaction Callback" --> "Transaction Commit > Prepared Callback" > - "Rollback Prepared Transaction Callback" --> "Transaction Rollback > Prepared Callback" > - "whenever a commit prepared transaction has been decoded" --> > "whenever a transaction commit prepared has been decoded" > - "whenever a rollback prepared transaction has been decoded." --> > "whenever a transaction rollback prepared has been decoded." > > ; Updated to this > > ========== > Patch V6-0001, File: src/backend/replication/logical/decode.c > ========== > > COMMENT > Line 74 > @@ -70,6 +70,9 @@ static void DecodeCommit(LogicalDecodingContext > *ctx, XLogRecordBuffer *buf, > xl_xact_parsed_commit *parsed, TransactionId xid); > static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, > xl_xact_parsed_abort *parsed, TransactionId xid); > +static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, > + xl_xact_parsed_prepare * parsed); > > The 2nd line of DecodePrepare is misaligned by one space. > > ; > > COMMENT > Line 321 > @@ -312,17 +315,34 @@ DecodeXactOp(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf) > } > break; > case XLOG_XACT_PREPARE: > + { > + xl_xact_parsed_prepare parsed; > + xl_xact_prepare *xlrec; > + /* check that output plugin is capable of twophase decoding */ > > "twophase" --> "two-phase" > > ~ > > Also, add a blank line after the declarations. 
> > ; > > ========== > Patch V6-0001, File: src/backend/replication/logical/logical.c > ========== > > COMMENT > Line 249 > @@ -225,6 +237,19 @@ StartupDecodingContext(List *output_plugin_options, > (ctx->callbacks.stream_message_cb != NULL) || > (ctx->callbacks.stream_truncate_cb != NULL); > > + /* > + * To support two phase logical decoding, we require > prepare/commit-prepare/abort-prepare > + * callbacks. The filter-prepare callback is optional. We however > enable two phase logical > + * decoding when at least one of the methods is enabled so that we > can easily identify > + * missing methods. > > The terminology is generally well known as "two-phase" (with the > hyphen) https://en.wikipedia.org/wiki/Two-phase_commit_protocol so > let's be consistent for all the patch code comments. Please search the > code and correct this in all places, even where I might have missed to > identify it. > > "two phase" --> "two-phase" > > ; > > COMMENT > Line 822 > @@ -782,6 +807,111 @@ commit_cb_wrapper(ReorderBuffer *cache, > ReorderBufferTXN *txn, > } > > static void > +prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, > + XLogRecPtr prepare_lsn) > > "support 2 phase" --> "supports two-phase" in the comment > > ; > > COMMENT > Line 844 > Code condition seems strange and/or broken. > if (ctx->enable_twophase && ctx->callbacks.prepare_cb == NULL) > Because if the flag is null then this condition is skipped. > But then if the callback was also NULL then attempting to call it to > "do the actual work" will give NPE. > > ~ > > Also, I wonder should this check be the first thing in this function? > Because if it fails does it even make sense that all the errcallback > code was set up? > E.g errcallback.arg potentially is left pointing to a stack variable > on a stack that no longer exists. > Updated accordingly. > ; > > COMMENT > Line 857 > +commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, > > "support 2 phase" --> "supports two-phase" in the comment > > ~ > > Also, Same potential trouble with the condition: > if (ctx->enable_twophase && ctx->callbacks.commit_prepared_cb == NULL) > Same as previously asked. Should this check be first thing in this function? > > ; > > COMMENT > Line 892 > +abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, > > "support 2 phase" --> "supports two-phase" in the comment > > ~ > > Same potential trouble with the condition: > if (ctx->enable_twophase && ctx->callbacks.abort_prepared_cb == NULL) > Same as previously asked. Should this check be the first thing in this function? > > ; > > COMMENT > Line 1013 > @@ -858,6 +988,51 @@ truncate_cb_wrapper(ReorderBuffer *cache, > ReorderBufferTXN *txn, > error_context_stack = errcallback.previous; > } > > +static bool > +filter_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, > + TransactionId xid, const char *gid) > > Fix wording in comment: > "twophase" --> "two-phase transactions" > "twophase transactions" --> "two-phase transactions" > Updated accordingly. 
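To make the rearrangement concrete, the guard at the top of each of these wrappers now amounts to something like the sketch below (the error code and message wording here are illustrative, not copied from the patch):

static void
prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
                   XLogRecPtr prepare_lsn)
{
    LogicalDecodingContext *ctx = cache->private_data;

    /*
     * Bail out before any error-context state is set up, so a missing
     * callback can never be invoked and errcallback.arg can never be
     * left pointing at a dead stack frame.
     */
    if (ctx->callbacks.prepare_cb == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("logical decoding of two-phase transactions requires a prepare_cb callback")));

    /* ... set up the error context callback and invoke prepare_cb ... */
}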
> ========== > Patch V6-0001, File: src/backend/replication/logical/reorderbuffer.c > ========== > > COMMENT > Line 255 > @@ -251,7 +251,8 @@ static Size > ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn > static void ReorderBufferRestoreChange(ReorderBuffer *rb, > ReorderBufferTXN *txn, > char *change); > static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, > ReorderBufferTXN *txn); > -static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn); > +static void ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, > + bool txn_prepared); > > The alignment is inconsistent. One more space needed before "bool txn_prepared" > > ; > > COMMENT > Line 417 > @@ -413,6 +414,11 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn) > } > > /* free data that's contained */ > + if (txn->gid != NULL) > + { > + pfree(txn->gid); > + txn->gid = NULL; > + } > > Should add the blank link before this new code, as it was before. > > ; > > COMMENT > Line 1564 > @ -1502,12 +1561,14 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn) > } > > /* > - * Discard changes from a transaction (and subtransactions), after streaming > - * them. Keep the remaining info - transactions, tuplecids, invalidations and > - * snapshots. > + * Discard changes from a transaction (and subtransactions), either > after streaming or > + * after a PREPARE. > > typo "snapshots.If" -> "snapshots. If" > > ; Updated Accordingly. > > COMMENT/QUESTION > Line 1590 > @@ -1526,7 +1587,7 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn) > Assert(rbtxn_is_known_subxact(subtxn)); > Assert(subtxn->nsubtxns == 0); > > - ReorderBufferTruncateTXN(rb, subtxn); > + ReorderBufferTruncateTXN(rb, subtxn, txn_prepared); > } > > There are some code paths here I did not understand how they match the comments. > Because this function is recursive it seems that it may be called > where the 2nd parameter txn is a sub-transaction. > > But then this seems at odds with some of the other code comments of > this function which are processing the txn without ever testing is it > really toplevel or not: > > e.g. Line 1593 "/* cleanup changes in the toplevel txn */" > e.g. Line 1632 "They are always stored in the toplevel transaction." > > ; I see that another commit in between has updated this now. > > COMMENT > Line 1644 > @@ -1560,9 +1621,33 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn) > * about the toplevel xact (we send the XID in all messages), but we never > * stream XIDs of empty subxacts. > */ > - if ((!txn->toptxn) || (txn->nentries_mem != 0)) > + if ((!txn_prepared) && ((!txn->toptxn) || (txn->nentries_mem != 0))) > txn->txn_flags |= RBTXN_IS_STREAMED; > > + if (txn_prepared) > > /* remove the change from it's containing list */ > typo "it's" --> "its" Updated. > > ; > > QUESTION > Line 1977 > @@ -1880,7 +1965,7 @@ ReorderBufferResetTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > ReorderBufferChange *specinsert) > { > /* Discard the changes that we just streamed */ > - ReorderBufferTruncateTXN(rb, txn); > + ReorderBufferTruncateTXN(rb, txn, false); > > How do you know the 3rd parameter - i.e. txn_prepared - should be > hardwired false here? > e.g. I thought that maybe rbtxn_prepared(txn) can be true here. > > ; This particular function is only called when streaming and not when handling a prepared transaction. 
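Given that invariant, it may be worth locking it in with a cheap assertion at the top of ReorderBufferResetTXN, along these lines (sketch):

    /* We are only reached while streaming, never for a prepared transaction. */
    Assert(!rbtxn_prepared(txn));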
> > COMMENT > Line 2345 > @@ -2249,7 +2334,6 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > break; > } > } > - > /* > > Looks like an accidental blank line deletion. This should be put back the way it was. > > ; > > COMMENT/QUESTION > Line 2374 > @@ -2278,7 +2362,16 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > } > } > else > - rb->commit(rb, txn, commit_lsn); > + { > + /* > + * Call either PREPARE (for twophase transactions) or COMMIT > + * (for regular ones). > > "twophase" --> "two-phase" > > ~ Updated. > > Also, I was confused by the apparent assumption of exclusiveness of > streaming and 2PC... > e.g. what if streaming AND 2PC then it won't do rb->prepare() > > ; > > QUESTION > Line 2424 > @@ -2319,11 +2412,17 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > */ > if (streaming) > { > - ReorderBufferTruncateTXN(rb, txn); > + ReorderBufferTruncateTXN(rb, txn, false); > > /* Reset the CheckXidAlive */ > CheckXidAlive = InvalidTransactionId; > } > + else if (rbtxn_prepared(txn)) > > I was confused by the exclusiveness of streaming/2PC. > e.g. what if streaming AND 2PC at same time - how can you pass false > as 3rd param to ReorderBufferTruncateTXN? ReorderBufferProcessTXN can only be called when streaming individual commands, not when streaming a prepare or a commit; streaming of a prepare or a commit is handled as part of ReorderBufferStreamCommit. > > ; > > COMMENT > Line 2463 > @@ -2352,17 +2451,18 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > > /* > * The error code ERRCODE_TRANSACTION_ROLLBACK indicates a concurrent > - * abort of the (sub)transaction we are streaming. We need to do the > + * abort of the (sub)transaction we are streaming or preparing. We > need to do the > * cleanup and return gracefully on this error, see SetupCheckXidLive. > */ > > "twoi phase" --> "two-phase" > > ; > > QUESTIONS > Line 2482 > @@ -2370,10 +2470,19 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > errdata = NULL; > curtxn->concurrent_abort = true; > > - /* Reset the TXN so that it is allowed to stream remaining data. */ > - ReorderBufferResetTXN(rb, txn, snapshot_now, > - command_id, prev_lsn, > - specinsert); > + /* If streaming, reset the TXN so that it is allowed to stream > remaining data. */ > + if (streaming) > > Re: /* If streaming, reset the TXN so that it is allowed to stream > remaining data. */ > I was confused by the exclusiveness of streaming/2PC. > Is it not possible for streaming flags and rbtxn_prepared(txn) true at > the same time? Same as above. > > ~ > > elog(LOG, "stopping decoding of %s (%u)", > txn->gid[0] != '\0'? txn->gid:"", txn->xid); > > Is this a safe operation, or do you also need to test txn->gid is not NULL? Since this code is reached only when not streaming, and therefore rbtxn_prepared(txn) holds, the gid has to be non-NULL. > > ; > > COMMENT > Line 2606 > +ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, > > "twophase" --> "two-phase" > > ; > > QUESTION > Line 2655 > +ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid, > > "This is used to handle COMMIT/ABORT PREPARED" > Should that say "COMMIT/ROLLBACK PREPARED"? > > ; > > COMMENT > Line 2668 > > "Anyways, 2PC transactions" --> "Anyway, two-phase transactions" > > ; > > COMMENT > Line 2765 > @@ -2495,7 +2731,13 @@ ReorderBufferAbort(ReorderBuffer *rb, > TransactionId xid, XLogRecPtr lsn) > /* cosmetic...
*/ > txn->final_lsn = lsn; > > - /* remove potential on-disk data, and deallocate */ > + /* > + * remove potential on-disk data, and deallocate. > + * > > Remove the blank between the comment and code. > > ========== > Patch V6-0001, File: src/include/replication/logical.h > ========== > > COMMENT > Line 89 > > "two phase" -> "two-phase" > > ; > > COMMENT > Line 89 > > For consistency with the previous member naming really the new member > should just be called "twophase" rather than "enable_twophase" > > ; Updated accordingly. > > ========== > Patch V6-0001, File: src/include/replication/output_plugin.h > ========== > > QUESTION > Line 106 > > As previously asked, why is the callback function/typedef referred as > AbortPrepared instead of RollbackPrepared? > It does not match the SQL and the function comment, and seems only to > add some unnecessary confusion. > > ; > > ========== > Patch V6-0001, File: src/include/replication/reorderbuffer.h > ========== > > QUESTION > Line 116 > @@ -162,9 +163,13 @@ typedef struct ReorderBufferChange > #define RBTXN_HAS_CATALOG_CHANGES 0x0001 > #define RBTXN_IS_SUBXACT 0x0002 > #define RBTXN_IS_SERIALIZED 0x0004 > -#define RBTXN_IS_STREAMED 0x0008 > -#define RBTXN_HAS_TOAST_INSERT 0x0010 > -#define RBTXN_HAS_SPEC_INSERT 0x0020 > +#define RBTXN_PREPARE 0x0008 > +#define RBTXN_COMMIT_PREPARED 0x0010 > +#define RBTXN_ROLLBACK_PREPARED 0x0020 > +#define RBTXN_COMMIT 0x0040 > +#define RBTXN_IS_STREAMED 0x0080 > +#define RBTXN_HAS_TOAST_INSERT 0x0100 > +#define RBTXN_HAS_SPEC_INSERT 0x0200 > > I was wondering why when adding new flags, some of the existing flag > masks were also altered. > I am assuming this is ok because they are never persisted but are only > used in the protocol (??) > > ; > > COMMENT > Line 226 > @@ -218,6 +223,15 @@ typedef struct ReorderBufferChange > ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \ > ) > > +/* is this txn prepared? */ > +#define rbtxn_prepared(txn) (txn->txn_flags & RBTXN_PREPARE) > +/* was this prepared txn committed in the meanwhile? */ > +#define rbtxn_commit_prepared(txn) (txn->txn_flags & RBTXN_COMMIT_PREPARED) > +/* was this prepared txn aborted in the meanwhile? */ > +#define rbtxn_rollback_prepared(txn) (txn->txn_flags & RBTXN_ROLLBACK_PREPARED) > +/* was this txn committed in the meanwhile? */ > +#define rbtxn_commit(txn) (txn->txn_flags & RBTXN_COMMIT) > + > > Probably all the "txn->txn_flags" here might be more safely written > with parentheses in the macro like "(txn)->txn_flags". > > ~ > > Also, Start all comments with capital. And what is the meaning "in the > meanwhile?" > > ; > > COMMENT > Line 410 > @@ -390,6 +407,39 @@ typedef void (*ReorderBufferCommitCB) (ReorderBuffer *rb, > ReorderBufferTXN *txn, > XLogRecPtr commit_lsn); > > The format is inconsistent with all other callback signatures here, > where the 1st arg was on the same line as the typedef. > > ; > > COMMENT > Line 440-442 > > Excessive blank lines following this change? 
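On the flag-test macros above: the parenthesised (and boolean-normalised) forms would look roughly like this (sketch):

/* Is this transaction prepared? */
#define rbtxn_prepared(txn)          (((txn)->txn_flags & RBTXN_PREPARE) != 0)
/* Was this prepared transaction committed since? */
#define rbtxn_commit_prepared(txn)   (((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0)
/* Was this prepared transaction rolled back since? */
#define rbtxn_rollback_prepared(txn) (((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0)
/* Was this transaction committed since? */
#define rbtxn_commit(txn)            (((txn)->txn_flags & RBTXN_COMMIT) != 0)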
> > ; > > COMMENT > Line 638 > @@ -571,6 +631,15 @@ void > ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, > XLog > bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid); > bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid); > > +bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, > + const char *gid); > +bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid, > + const char *gid); > +void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, > + XLogRecPtr commit_lsn, XLogRecPtr end_lsn, > + TimestampTz commit_time, > + RepOriginId origin_id, XLogRecPtr origin_lsn, > + char *gid); > > Not aligned consistently with other function prototypes. > > ; Updated > > ========== > Patch V6-0003, File: src/backend/access/transam/twophase.c > ========== > > COMMENT > Line 551 > @@ -548,6 +548,37 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held) > } > > /* > + * LookupGXact > + * Check if the prepared transaction with the given GID is around > + */ > +bool > +LookupGXact(const char *gid) > > There is potential to refactor/simplify this code: > e.g. > > bool > LookupGXact(const char *gid) > { > int i; > bool found = false; > > LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE); > for (i = 0; i < TwoPhaseState->numPrepXacts; i++) > { > GlobalTransaction gxact = TwoPhaseState->prepXacts[i]; > /* Ignore not-yet-valid GIDs */ > if (gxact->valid && strcmp(gxact->gid, gid) == 0) > { > found = true; > break; > } > } > LWLockRelease(TwoPhaseStateLock); > return found; > } > > ; > Updated accordingly. > ========== > Patch V6-0003, File: src/backend/replication/logical/proto.c > ========== > > COMMENT > Line 86 > @@ -72,12 +72,17 @@ logicalrep_read_begin(StringInfo in, > LogicalRepBeginData *begin_data) > */ > void > logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn, > - XLogRecPtr commit_lsn) > > Since now the flags are used the code comment is wrong. > "/* send the flags field (unused for now) */" > > ; > > COMMENT > Line 129 > @ -106,6 +115,77 @@ logicalrep_read_commit(StringInfo in, > LogicalRepCommitData *commit_data) > } > > /* > + * Write PREPARE to the output stream. > + */ > +void > +logicalrep_write_prepare(StringInfo out, ReorderBufferTXN *txn, > > "2PC transactions" --> "two-phase commit transactions" > > ; Updated > > COMMENT > Line 133 > > Assert(strlen(txn->gid) > 0); > Shouldn't that assertion also check txn->gid is not NULL (to prevent > NPE in case gid was NULL) In this case txn->gid has to be non NULL. > > ; > > COMMENT > Line 177 > +logicalrep_read_prepare(StringInfo in, LogicalRepPrepareData * prepare_data) > > prepare_data->prepare_type = flags; > This code may be OK but it does seem a bit of an abuse of the flags. > > e.g. Are they flags or are the really enum values? > e.g. And if they are effectively enums (it appears they are) then > seemed inconsistent that |= was used when they were previously > assigned. > > ; I have not updated this as according to Amit this might require refactoring again. > > ========== > Patch V6-0003, File: src/backend/replication/logical/worker.c > ========== > > COMMENT > Line 757 > @@ -749,6 +753,141 @@ apply_handle_commit(StringInfo s) > pgstat_report_activity(STATE_IDLE, NULL); > } > > +static void > +apply_handle_prepare_txn(LogicalRepPrepareData * prepare_data) > +{ > + Assert(prepare_data->prepare_lsn == remote_final_lsn); > > Missing function comment to say this is called from apply_handle_prepare. 
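Even so, making that expectation explicit is cheap; the strengthened assertion would simply be (sketch):

    /* The GID must always be present for two-phase messages. */
    Assert(txn->gid != NULL && strlen(txn->gid) > 0);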
> > ; > > COMMENT > Line 798 > +apply_handle_commit_prepared_txn(LogicalRepPrepareData * prepare_data) > > Missing function comment to say this is called from apply_handle_prepare. > > ; > > COMMENT > Line 824 > +apply_handle_rollback_prepared_txn(LogicalRepPrepareData * prepare_data) > > Missing function comment to say this is called from apply_handle_prepare. > Updated. > ========== > Patch V6-0003, File: src/backend/replication/pgoutput/pgoutput.c > ========== > > COMMENT > Line 50 > @@ -47,6 +47,12 @@ static void pgoutput_truncate(LogicalDecodingContext *ctx, > ReorderBufferChange *change); > static bool pgoutput_origin_filter(LogicalDecodingContext *ctx, > RepOriginId origin_id); > +static void pgoutput_prepare_txn(LogicalDecodingContext *ctx, > + ReorderBufferTXN *txn, XLogRecPtr prepare_lsn); > > The parameter indentation (2nd lines) does not match everything else > in this context. > > ; > > COMMENT > Line 152 > @@ -143,6 +149,10 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb) > cb->change_cb = pgoutput_change; > cb->truncate_cb = pgoutput_truncate; > cb->commit_cb = pgoutput_commit_txn; > + > + cb->prepare_cb = pgoutput_prepare_txn; > + cb->commit_prepared_cb = pgoutput_commit_prepared_txn; > + cb->abort_prepared_cb = pgoutput_abort_prepared_txn; > > Remove the unnecessary blank line. > > ; > > QUESTION > Line 386 > @@ -373,7 +383,49 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > OutputPluginUpdateProgress(ctx); > > OutputPluginPrepareWrite(ctx, true); > - logicalrep_write_commit(ctx->out, txn, commit_lsn); > + logicalrep_write_commit(ctx->out, txn, commit_lsn, true); > > Is the is_commit parameter of logicalrep_write_commit ever passed as false? > If yes, where? > If no, the what is the point of it? It was dead code from an earlier version. I have removed it, updated accordingly. > > ; > > COMMENT > Line 408 > +pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > > Since all this function is identical to pg_output_prepare it might be > better to either > 1. just leave this as a wrapper to delegate to that function > 2. remove this one entirely and assign the callback to the common > pgoutput_prepare_txn > > ; I have not changed this as this might require re-factoring according to Amit. > > COMMENT > Line 419 > +pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > > Since all this function is identical to pg_output_prepare if might be > better to either > 1. just leave this as a wrapper to delegate to that function > 2. remove this one entirely and assign the callback to the common > pgoutput_prepare_tx > > ; Same as above. > > COMMENT > Line 419 > +pgoutput_abort_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > > Shouldn't this comment say be "ROLLBACK PREPARED"? > > ; Updated. > > ========== > Patch V6-0003, File: src/include/replication/logicalproto.h > ========== > > QUESTION > Line 101 > @@ -87,20 +87,55 @@ typedef struct LogicalRepBeginData > TransactionId xid; > } LogicalRepBeginData; > > +/* Commit (and abort) information */ > > #define LOGICALREP_IS_ABORT 0x02 > Is there a good reason why this is not called: > #define LOGICALREP_IS_ROLLBACK 0x02 > > ; Removed. 
> > COMMENT > Line 105 > > ((flags == LOGICALREP_IS_COMMIT) || (flags == LOGICALREP_IS_ABORT)) > > Macros would be safer if flags are in parentheses > (((flags) == LOGICALREP_IS_COMMIT) || ((flags) == LOGICALREP_IS_ABORT)) > > ; > > COMMENT > Line 115 > > Unexpected whitespace for the typedef > "} LogicalRepPrepareData;" > > ; > > COMMENT > Line 122 > /* prepare can be exactly one of PREPARE, [COMMIT|ABORT] PREPARED*/ > #define PrepareFlagsAreValid(flags) \ > ((flags == LOGICALREP_IS_PREPARE) || \ > (flags == LOGICALREP_IS_COMMIT_PREPARED) || \ > (flags == LOGICALREP_IS_ROLLBACK_PREPARED)) > > There is confusing mixture in macros and comments of ABORT and ROLLBACK terms > "[COMMIT|ABORT] PREPARED" --> "[COMMIT|ROLLBACK] PREPARED" > > ~ > > Also, it would be safer if flags are in parentheses > (((flags) == LOGICALREP_IS_PREPARE) || \ > ((flags) == LOGICALREP_IS_COMMIT_PREPARED) || \ > ((flags) == LOGICALREP_IS_ROLLBACK_PREPARED)) > > ; updated. > > ========== > Patch V6-0003, File: src/test/subscription/t/020_twophase.pl > ========== > > COMMENT > Line 131 - # check inserts are visible > > Isn't this supposed to be checking for rows 12 and 13, instead of 11 and 12? > > ; Updated. > > ========== > Patch V6-0004, File: contrib/test_decoding/test_decoding.c > ========== > > COMMENT > Line 81 > @@ -78,6 +78,15 @@ static void > pg_decode_stream_stop(LogicalDecodingContext *ctx, > static void pg_decode_stream_abort(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > XLogRecPtr abort_lsn); > +static void pg_decode_stream_prepare(LogicalDecodingContext *ctx, > + ReorderBufferTXN *txn, > + XLogRecPtr commit_lsn); > +static > > All these functions have a 3rd parameter called commit_lsn. Even > though the functions are not commit related. It seems like a cut/paste > error. > > ; > > COMMENT > Line 142 > @@ -130,6 +139,9 @@ _PG_output_plugin_init(OutputPluginCallbacks *cb) > cb->stream_start_cb = pg_decode_stream_start; > cb->stream_stop_cb = pg_decode_stream_stop; > cb->stream_abort_cb = pg_decode_stream_abort; > + cb->stream_prepare_cb = pg_decode_stream_prepare; > + cb->stream_commit_prepared_cb = pg_decode_stream_commit_prepared; > + cb->stream_abort_prepared_cb = pg_decode_stream_abort_prepared; > cb->stream_commit_cb = pg_decode_stream_commit; > > Can the "cb->stream_abort_prepared_cb" be changed to > "cb->stream_rollback_prepared_cb"? > > ; > > COMMENT > Line 827 > @@ -812,6 +824,78 @@ pg_decode_stream_abort(LogicalDecodingContext *ctx, > } > > static void > +pg_decode_stream_prepare(LogicalDecodingContext *ctx, > + ReorderBufferTXN *txn, > + XLogRecPtr commit_lsn) > +{ > + TestDecodingData *data = ctx->output_plugin_pr > > The commit_lsn (3rd parameter) is unused and seems like a cut/paste name error. > > ; > > COMMENT > Line 875 > +pg_decode_stream_abort_prepared(LogicalDecodingContext *ctx, > > The commit_lsn (3rd parameter) is unused and seems like a cut/paste name error. > > ; > Updated. 
> ========== > Patch V6-0004, File: doc/src/sgml/logicaldecoding.sgml > ========== > > COMMENT > 48.6.1 > @@ -396,6 +396,9 @@ typedef struct OutputPluginCallbacks > LogicalDecodeStreamStartCB stream_start_cb; > LogicalDecodeStreamStopCB stream_stop_cb; > LogicalDecodeStreamAbortCB stream_abort_cb; > + LogicalDecodeStreamPrepareCB stream_prepare_cb; > + LogicalDecodeStreamCommitPreparedCB stream_commit_prepared_cb; > + LogicalDecodeStreamAbortPreparedCB stream_abort_prepared_cb; > > Same question from previous review comments - why using the > terminology "abort" instead of "rollback" > > ; > > COMMENT > 48.6.1 > @@ -418,7 +421,9 @@ typedef void (*LogicalOutputPluginInit) (struct > OutputPluginCallbacks *cb); > in-progress transactions. The <function>stream_start_cb</function>, > <function>stream_stop_cb</function>, <function>stream_abort_cb</function>, > <function>stream_commit_cb</function> and <function>stream_change_cb</function> > - are required, while <function>stream_message_cb</function> and > + are required, while <function>stream_message_cb</function>, > + <function>stream_prepare_cb</function>, > <function>stream_commit_prepared_cb</function>, > + <function>stream_abort_prepared_cb</function>, > > Missing "and". > ... "stream_abort_prepared_cb, stream_truncate_cb are optional." --> > "stream_abort_prepared_cb, and stream_truncate_cb are optional." > > ; > > COMMENT > Section 48.6.4.16 > Section 48.6.4.17 > Section 48.6.4.18 > @@ -839,6 +844,45 @@ typedef void (*LogicalDecodeStreamAbortCB) > (struct LogicalDecodingContext *ctx, > </para> > </sect3> > > + <sect3 id="logicaldecoding-output-plugin-stream-prepare"> > + <title>Stream Prepare Callback</title> > + <para> > + The <function>stream_prepare_cb</function> callback is called to prepare > + a previously streamed transaction as part of a two phase commit. > +<programlisting> > +typedef void (*LogicalDecodeStreamPrepareCB) (struct > LogicalDecodingContext *ctx, > + ReorderBufferTXN *txn, > + XLogRecPtr abort_lsn); > +</programlisting> > + </para> > + </sect3> > + > + <sect3 id="logicaldecoding-output-plugin-stream-commit-prepared"> > + <title>Stream Commit Prepared Callback</title> > + <para> > + The <function>stream_commit_prepared_cb</function> callback is > called to commit prepared > + a previously streamed transaction as part of a two phase commit. > +<programlisting> > +typedef void (*LogicalDecodeStreamCommitPreparedCB) (struct > LogicalDecodingContext *ctx, > + ReorderBufferTXN *txn, > + XLogRecPtr abort_lsn); > +</programlisting> > + </para> > + </sect3> > + > + <sect3 id="logicaldecoding-output-plugin-stream-abort-prepared"> > + <title>Stream Abort Prepared Callback</title> > + <para> > + The <function>stream_abort_prepared_cb</function> callback is called > to abort prepared > + a previously streamed transaction as part of a two phase commit. > +<programlisting> > +typedef void (*LogicalDecodeStreamAbortPreparedCB) (struct > LogicalDecodingContext *ctx, > + ReorderBufferTXN *txn, > + XLogRecPtr abort_lsn); > +</programlisting> > + </para> > + </sect3> > > 1. Everywhere it says "two phase" commit should be consistently > replaced to say "two-phase" commit (with the hyphen) > > 2. Search for "abort_lsn" parameter. It seems to be overused > (cut/paste error) even when the API is unrelated to abort > > 3. 48.6.4.17 and 48.6.4.18 > Is this wording ok? Is the word "prepared" even necessary here? > - "... called to commit prepared a previously streamed transaction ..." > - "... 
called to abort prepared a previously streamed transaction ..." > > ; Updated accordingly. > > COMMENT > Section 48.9 > @@ -1017,9 +1061,13 @@ OutputPluginWrite(ctx, true); > When streaming an in-progress transaction, the changes (and messages) are > streamed in blocks demarcated by <function>stream_start_cb</function> > and <function>stream_stop_cb</function> callbacks. Once all the decoded > - changes are transmitted, the transaction is committed using the > - <function>stream_commit_cb</function> callback (or possibly aborted using > - the <function>stream_abort_cb</function> callback). > + changes are transmitted, the transaction can be committed using the > + the <function>stream_commit_cb</function> callback > > "two phase" --> "two-phase" > > ~ > > Also, Missing period on end of sentence. > "or aborted using the stream_abort_prepared_cb" --> "or aborted using > the stream_abort_prepared_cb." > > ; Updated accordingly. > > ========== > Patch V6-0004, File: src/backend/replication/logical/logical.c > ========== > > COMMENT > Line 84 > @@ -81,6 +81,12 @@ static void stream_stop_cb_wrapper(ReorderBuffer > *cache, ReorderBufferTXN *txn, > XLogRecPtr last_lsn); > static void stream_abort_cb_wrapper(ReorderBuffer *cache, > ReorderBufferTXN *txn, > XLogRecPtr abort_lsn); > +static void stream_prepare_cb_wrapper(ReorderBuffer *cache, > ReorderBufferTXN *txn, > + XLogRecPtr commit_lsn); > +static void stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, > ReorderBufferTXN *txn, > + XLogRecPtr commit_lsn); > +static void stream_abort_prepared_cb_wrapper(ReorderBuffer *cache, > ReorderBufferTXN *txn, > + XLogRecPtr commit_lsn); > > The 3rd parameter is always "commit_lsn" even for API unrelated to > commit, so seems like cut/paste error. > > ; > > COMMENT > Line 1246 > @@ -1231,6 +1243,105 @@ stream_abort_cb_wrapper(ReorderBuffer *cache, > ReorderBufferTXN *txn, > } > > static void > +stream_prepare_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, > + XLogRecPtr commit_lsn) > +{ > + LogicalDecodingContext *ctx = cache->private_data; > + LogicalErrorCallbackState state; > > Misnamed parameter "commit_lsn" ? > > ~ > > Also, Line 1272 > There seem to be some missing integrity checking to make sure the > callback is not NULL. > A null callback will give NPE when wrapper attempts to call it > > ; > > COMMENT > Line 1305 > +static void > +stream_commit_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, > > There seem to be some missing integrity checking to make sure the > callback is not NULL. > A null callback will give NPE when wrapper attempts to call it. > > ; > > COMMENT > Line 1312 > +static void > +stream_abort_prepared_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn, > > Misnamed parameter "commit_lsn" ? > > ~ > > Also, Line 1338 > There seem to be some missing integrity checking to make sure the > callback is not NULL. > A null callback will give NPE when wrapper attempts to call it. > Updated accordingly. 
> > ========== > Patch V6-0004, File: src/backend/replication/logical/reorderbuffer.c > ========== > > COMMENT > Line 2684 > @@ -2672,15 +2681,31 @@ ReorderBufferFinishPrepared(ReorderBuffer *rb, > TransactionId xid, > txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */ > strcpy(txn->gid, gid); > > - if (is_commit) > + if (rbtxn_is_streamed(txn)) > { > - txn->txn_flags |= RBTXN_COMMIT_PREPARED; > - rb->commit_prepared(rb, txn, commit_lsn); > + if (is_commit) > + { > + txn->txn_flags |= RBTXN_COMMIT_PREPARED; > > The setting/checking of the flags could be refactored if you wanted to > write less code: > e.g. > if (is_commit) > txn->txn_flags |= RBTXN_COMMIT_PREPARED; > else > txn->txn_flags |= RBTXN_ROLLBACK_PREPARED; > > if (rbtxn_is_streamed(txn) && rbtxn_commit_prepared(txn)) > rb->stream_commit_prepared(rb, txn, commit_lsn); > else if (rbtxn_is_streamed(txn) && rbtxn_rollback_prepared(txn)) > rb->stream_abort_prepared(rb, txn, commit_lsn); > else if (rbtxn_commit_prepared(txn)) > rb->commit_prepared(rb, txn, commit_lsn); > else if (rbtxn_rollback_prepared(txn)) > rb->abort_prepared(rb, txn, commit_lsn); > > ; Updated accordingly. > > ========== > Patch V6-0004, File: src/include/replication/output_plugin.h > ========== > > COMMENT > Line 171 > @@ -157,6 +157,33 @@ typedef void (*LogicalDecodeStreamAbortCB) > (struct LogicalDecodingContext *ctx, > XLogRecPtr abort_lsn); > > /* > + * Called to prepare changes streamed to remote node from in-progress > + * transaction. This is called as part of a two-phase commit and only when > + * two-phased commits are supported > + */ > > 1. Missing period in all these comments. > > 2. Is the part that says "and only where two-phased commits are > supported" necessary to say? It seems redundant since the comment already > says this is called as part of a two-phase commit. > > ; > > ========== > Patch V6-0004, File: src/include/replication/reorderbuffer.h > ========== > > COMMENT > Line 467 > @@ -466,6 +466,24 @@ typedef void (*ReorderBufferStreamAbortCB) ( > ReorderBufferTXN *txn, > XLogRecPtr abort_lsn); > > +/* prepare streamed transaction callback signature */ > +typedef void (*ReorderBufferStreamPrepareCB) ( > + ReorderBuffer *rb, > + ReorderBufferTXN *txn, > + XLogRecPtr commit_lsn); > + > +/* prepare streamed transaction callback signature */ > +typedef void (*ReorderBufferStreamCommitPreparedCB) ( > + ReorderBuffer *rb, > + ReorderBufferTXN *txn, > + XLogRecPtr commit_lsn); > + > +/* prepare streamed transaction callback signature */ > +typedef void (*ReorderBufferStreamAbortPreparedCB) ( > + ReorderBuffer *rb, > + ReorderBufferTXN *txn, > + XLogRecPtr commit_lsn); > > Cut/paste error - repeated same comment 3 times? > Updated accordingly. > > [END] > > I believe I have addressed all of Peter's comments. Peter, do have a look and let me know if I missed anything or if you find anything else. Thanks for your comments, much appreciated. regards, Ajin Cherian Fujitsu Australia
On Wed, Oct 14, 2020 at 6:15 PM Ajin Cherian <itsajin@gmail.com> wrote: > I think it will be easier to review this work if we can split the patches according to the changes made in different layers. The first patch could be changes made in output plugin and the corresponding changes in test_decoding, see the similar commit of in-progress transactions [1]. So you need to move corresponding changes from v8-0001-Support-decoding-of-two-phase-transactions and v8-0004-Support-two-phase-commits-in-streaming-mode-in-lo for this. The second patch could be changes made in ReorderBuffer to support this feature, see [2]. The third patch could be changes made to support pgoutput and subscriber-side stuff, see [3]. What do you think? [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=45fdc9738b36d1068d3ad8fdb06436d6fd14436b [2] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=7259736a6e5b7c7588fff9578370736a6648acbb [3] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=464824323e57dc4b397e8b05854d779908b55304 -- With Regards, Amit Kapila.
On Thu, Oct 15, 2020 at 2:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Oct 14, 2020 at 6:15 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > I think it will be easier to review this work if we can split the > patches according to the changes made in different layers. The first > patch could be changes made in output plugin and the corresponding > changes in test_decoding, see the similar commit of in-progress > transactions [1]. So you need to move corresponding changes from > v8-0001-Support-decoding-of-two-phase-transactions and > v8-0004-Support-two-phase-commits-in-streaming-mode-in-lo for this. > The second patch could be changes made in ReorderBuffer to support > this feature, see [2]. The third patch could be changes made to > support pgoutput and subscriber-side stuff, see [3]. What do you > think? I agree. I have split the patches accordingly. Do have a look. Pending work is: 1. Add pgoutput support for the new streaming two-phase commit APIs 2. Add test cases for two-phase commits with streaming for pub/sub and test_decoding 3. Add CREATE SUBSCRIPTION command option to specify two-phase commits rather than having it turned on by default. regards, Ajin Cherian Fujitsu Australia
Hello Ajin, The v9 patches provided support for two-phase transactions for the NON-streaming case. Now I have added STREAM support for two-phase transactions, and bumped all patches to version v10. (The 0001 and 0002 patches are unchanged; only 0003 is changed.) -- There are a few TODO/FIXME comments in the code highlighting parts needing some attention. There is a #define DEBUG_STREAM_2PC that is useful for debugging, which I can remove later. All the patches have some whitespace issues when applied. We can resolve them as we go. Please let me know any comments/feedback. Kind Regards Peter Smith. Fujitsu Australia.
Hello Ajin. I have gone through the v10 patches to verify if and how my previous v6 review comments got addressed. Some issues remain, and there are a few newly introduced ones. Mostly it is all very minor stuff. Please find my revised review comments below. Kind Regards. Peter Smith Fujitsu Australia --- V10 REVIEW COMMENTS FOLLOW ========== Patch v10-0001, File: contrib/test_decoding/test_decoding.c ========== COMMENT Line 285 + { + errno = 0; + data->check_xid_aborted = (TransactionId) + strtoul(strVal(elem->arg), NULL, 0); + + if (!TransactionIdIsValid(data->check_xid_aborted)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("check-xid-aborted is not a valid xid: \"%s\"", + strVal(elem->arg)))); + } I think it is risky to assign strtoul directly to the check_xid_aborted member because it makes some internal assumption that the invalid transaction is the same as the error return from strtoul. Maybe better to do in 2 steps like below: BEFORE errno = 0; data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0); AFTER long xid; errno = 0; xid = strtoul(strVal(elem->arg), NULL, 0); if (xid == 0 || errno != 0) data->check_xid_aborted = InvalidTransactionId; else data->check_xid_aborted =(TransactionId)xid; --- COMMENT Line 430 + +/* ABORT PREPARED callback */ +static void +pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, + XLogRecPtr abort_lsn) Fix comment "ABORT PREPARED" --> "ROLLBACK PREPARED" ========== Patch v10-0001, File: doc/src/sgml/logicaldecoding.sgml ========== COMMENT Section 48.6.1 Says: An output plugin may also define functions to support streaming of large, in-progress transactions. The stream_start_cb, stream_stop_cb, stream_abort_cb, stream_commit_cb and stream_change_cb are required, while stream_message_cb, stream_prepare_cb, stream_commit_prepared_cb, stream_rollback_prepared_cb and stream_truncate_cb are optional. An output plugin may also define functions to support two-phase commits, which are decoded on PREPARE TRANSACTION. The prepare_cb, commit_prepared_cb and rollback_prepared_cb callbacks are required, while filter_prepare_cb is optional. - But is that correct? It seems strange/inconsistent to say that the 2PC callbacks are mandatory for the non-streaming, but that they are optional for streaming. --- COMMENT 48.6.4.5 "Transaction Prepare Callback" 48.6.4.6 "Transaction Commit Prepared Callback" 48.6.4.7 "Transaction Rollback Prepared Callback" There seems some confusion about what is optional and what is mandatory. e.g. Why are the non-stream 2PC callbacks mandatory but the stream 2PC callbacks are not? And also there is some inconsistency with what is said in the paragraph at the top of the page versus what each of the callback sections says wrt optional/mandatory. The sub-sections 49.6.4.5, 49.6.4.6, 49.6.4.7 say those callbacks are optional which IIUC Amit said is incorrect. This is similar to the previous review comment --- COMMENT Section 48.6.4.7 "Transaction Rollback Prepared Callback" parameter "abort_lsn" probably should be "rollback_lsn" --- COMMENT Section 49.6.4.18. "Stream Rollback Prepared Callback" Says: The stream_rollback_prepared_cb callback is called to abort a previously streamed transaction as part of a two-phase commit. maybe should say "is called to rollback" ========== Patch v10-0001, File: src/backend/replication/logical/logical.c ========== COMMENT Line 252 Says: We however enable two phase logical... 
"two phase" --> "two-phase" -- COMMENT Line 885 Line 923 Says: If the plugin support 2 phase commits... "support 2 phase" --> "supports two-phase" in the comment. Same issue occurs twice. --- COMMENT Line 830 Line 868 Line 906 Says: /* We're only supposed to call this when two-phase commits are supported */ There is an extra space between the "are" and "supported" in the comment. Same issue occurs 3 times. --- COMMENT Line 1023 + /* + * Skip if decoding of two-phase at PREPARE time is not enabled. In that + * case all two-phase transactions are considered filtered out and will be + * applied as regular transactions at COMMIT PREPARED. + */ Comment still is missing the word "transactions" "Skip if decoding of two-phase at PREPARE time is not enabled." -> "Skip if decoding of two-phase transactions at PREPARE time is not enabled. ========== Patch v10-0001, File: src/include/replication/reorderbuffer.h ========== COMMENT Line 459 /* abort prepared callback signature */ typedef void (*ReorderBufferRollbackPreparedCB) ( ReorderBuffer *rb, ReorderBufferTXN *txn, XLogRecPtr abort_lsn); There is no alignment consistency here for ReorderBufferRollbackPreparedCB. Some function args are directly under the "(" and some are on the same line. This function code is neither. --- COMMENT Line 638 @@ -431,6 +486,24 @@ typedef void (*ReorderBufferStreamAbortCB) ( ReorderBufferTXN *txn, XLogRecPtr abort_lsn); +/* prepare streamed transaction callback signature */ +typedef void (*ReorderBufferStreamPrepareCB) ( + ReorderBuffer *rb, + ReorderBufferTXN *txn, + XLogRecPtr prepare_lsn); + +/* prepare streamed transaction callback signature */ +typedef void (*ReorderBufferStreamCommitPreparedCB) ( + ReorderBuffer *rb, + ReorderBufferTXN *txn, + XLogRecPtr commit_lsn); + +/* prepare streamed transaction callback signature */ +typedef void (*ReorderBufferStreamRollbackPreparedCB) ( + ReorderBuffer *rb, + ReorderBufferTXN *txn, + XLogRecPtr rollback_lsn); There is no inconsistent alignment with the arguments (compare how other functions are aligned) See: - for ReorderBufferStreamCommitPreparedCB - for ReorderBufferStreamRollbackPreparedCB - for ReorderBufferPrepareNeedSkip - for ReorderBufferTxnIsPrepared - for ReorderBufferPrepare --- COMMENT Line 489 Line 495 Line 501 /* prepare streamed transaction callback signature */ Same comment cut/paste 3 times? - for ReorderBufferStreamPrepareCB - for ReorderBufferStreamCommitPreparedCB - for ReorderBufferStreamRollbackPreparedCB --- COMMENT Line 457 /* abort prepared callback signature */ typedef void (*ReorderBufferRollbackPreparedCB) ( ReorderBuffer *rb, ReorderBufferTXN *txn, XLogRecPtr abort_lsn); "abort" --> "rollback" in the function comment. --- COMMENT Line 269 /* In case of 2PC we need to pass GID to output plugin */ "2PC" --> "two-phase commit" ========== Patch v10-0002, File: contrib/test_decoding/expected/two_phase.out (and .sql) ========== COMMENT General It is a bit hard to see what are the main tests here are what are just sub-parts of some test case. e.g. It seems like the main tests are. 1. Test that decoding happens at PREPARE time 2. Test decoding of an aborted tx 3. Test a prepared tx which contains some DDL 4. Test decoding works while an uncommitted prepared tx with DDL exists 5. Test operations holding exclusive locks won't block decoding 6. Test savepoints and sub-transactions 7. 
Test "_nodecode" will defer the decoding until the commit time Can the comments be made more obvious so it is easy to distinguish the main tests from the steps of those tests? --- COMMENT Line 1 -- Test two-phased transactions, when two-phase-commit is enabled, transactions are -- decoded at PREPARE time rather than at COMMIT PREPARED time. Some commas to be removed and this comment to be split into several sentences. --- COMMENT Line 19 -- should show nothing Comment could be more informative. E.g. "Should show nothing because the PREPARE has not happened yet" --- COMMENT Line 77 Looks like there is a missing comment about here that should say something like "Show that the DDL does not appear in the decoding" --- COMMENT Line 160 -- test savepoints and sub-xacts as a result The subsequent test is testing savepoints. But is it testing sub transactions like the comment says? ========== Patch v10-0002, File: contrib/test_decoding/t/001_twophase.pl ========== COMMENT General I think basically there are only 2 tests in this file. 1. to check that the concurrent abort works. 2. to check that the prepared tx can span a server shutdown/restart But the tests comments do not make this clear at all. e.g. All the "#" comments look equally important although most of them are just steps of each test case. Can the comments be better to distinguish the tests versus the steps of each test? ========== Patch v10-0002, File: src/backend/replication/logical/decode.c ========== COMMENT Line 71 static void DecodeCommitPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, xl_xact_parsed_commit *parsed, TransactionId xid); static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, xl_xact_parsed_abort *parsed, TransactionId xid); static void DecodeAbortPrepared(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, xl_xact_parsed_abort *parsed, TransactionId xid); static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, xl_xact_parsed_prepare * parsed); The 2nd line or args are not aligned properly. - for DecodeCommitPrepared - for DecodeAbortPrepared - for DecodePrepare ========== Patch v10-0002, File: src/backend/replication/logical/reorderbuffer.c ========== COMMENT There are some parts of the code where in my v6 review I had a doubt about the mutually exclusive treatment of the "streaming" flag and the "rbtxn_prepared(txn)" state. Basically I did not see how some parts of the code are treating NOT streaming as implying 2PC etc because it defies my understanding that 2PC can also work in streaming mode. Perhaps the "streaming" flag has a different meaning to how I interpret it? Or perhaps some functions are guarding higher up and can only be called under certain conditions? Anyway, this confusion manifests in several parts of the code, none of which was changed after my v6 review. Affected code includes the following: CASE 1 Wherever the ReorderBufferTruncateTXN(...) "prepared" flag (third parameter) is hardwired true/false, I think there must be some preceding Assert to guarantee the prepared state condition holds true. There can't be any room for doubts like "but what will it do for streamed 2PC..." Line 1805 - ReorderBufferTruncateTXN(rb, txn, true); // if rbtxn_prepared(txn) Line 1941 - ReorderBufferTruncateTXN(rb, txn, false); // state ?? 
Line 2389 - ReorderBufferTruncateTXN(rb, txn, false); // if streaming Line 2396 - ReorderBufferTruncateTXN(rb, txn, true); // if not streaming and if rbtxn_prepared(txn) Line 2459 - ReorderBufferTruncateTXN(rb, txn, true); // if not streaming ~ CASE 2 Wherever the "streaming" flag is tested I don't really understand how NOT streaming can automatically imply 2PC. Line 2330 - if (streaming) // what about if it is streaming AND 2PC at the same time? Line 2387 - if (streaming) // what about if it is streaming AND 2PC at the same time? Line 2449 - if (streaming) // what about if it is streaming AND 2PC at the same time? ~ Case 1 and Case 2 above overlap a fair bit. I just listed them so they all get checked again. Even if the code is thought to be currently OK I do still think something should be done, like: a) add some more substantial comments to explain WHY the combination of streaming and 2PC is not valid in the context b) strengthen the Asserts to 100% guarantee that the streaming and prepared states really are exclusive (if indeed they are). For this point I thought the following Assert conditions could be better: Assert(streaming || rbtxn_prepared(txn)); Assert(stream_started || rbtxn_prepared(txn)); because as it is you are still left wondering if both streaming AND rbtxn_prepared(txn) can be possible at the same time... --- COMMENT Line 2634 * Anyways, two-phase transactions do not contain any reorderbuffers. "Anyways" --> "Anyway" ========== Patch v10-0003, File: src/backend/access/transam/twophase.c ========== COMMENT Line 557 @@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held) } /* + * LookupGXact + * Check if the prepared transaction with the given GID is around + */ +bool +LookupGXact(const char *gid) +{ + int i; + bool found = false; The variable declarations (i and found) are not aligned. ========== Patch v10-0003, File: src/backend/replication/logical/proto.c ========== COMMENT Line 125 Line 205 Assert(strlen(txn->gid) > 0); I suggested that the assertion should also check txn->gid is not NULL. You replied "In this case txn->gid has to be non NULL". But that is exactly what I said :-) If it HAS to be non-NULL then why not just Assert that in the code instead of leaving the reader wondering? "Assert(strlen(txn->gid) > 0);" --> "Assert(txn->gid && strlen(txn->gid) > 0);" The same occurs several times. --- COMMENT Line 133 Line 213 if (rbtxn_commit_prepared(txn)) flags |= LOGICALREP_IS_COMMIT_PREPARED; else if (rbtxn_rollback_prepared(txn)) flags |= LOGICALREP_IS_ROLLBACK_PREPARED; else flags |= LOGICALREP_IS_PREPARE; Previously I wrote that the use of bit-flag assignment in logicalrep_write_prepare was inconsistent with the way the flags are treated when they are read. Really it should be using a direct assignment instead of bit flags. You said this is skipped anticipating a possible refactor. But IMO this leaves the code in a half/half state. I think it is better to fix it properly and if refactoring happens then deal with that at the time. The last comment I saw from Amit said to use my 1st proposal of direct assignment instead of bit-flag assignment. (applies to both non-stream and stream functions) - see logicalrep_write_prepare - see logicalrep_write_stream_prepare ========== Patch v10-0003, File: src/backend/replication/pgoutput/pgoutput.c ========== COMMENT Line 429 /* * PREPARE callback */ static void pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, XLogRecPtr prepare_lsn) The function comment looks wrong.
Shouldn't this comment say "ROLLBACK PREPARED callback"? ========== Patch v10-0003, File: src/include/replication/logicalproto.h ========== Line 115 #define PrepareFlagsAreValid(flags) \ ((flags == LOGICALREP_IS_PREPARE) || \ (flags == LOGICALREP_IS_COMMIT_PREPARED) || \ (flags == LOGICALREP_IS_ROLLBACK_PREPARED)) Would be safer if all the references to flags are in parentheses, e.g. "flags" --> "(flags)". [END]
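PS - To illustrate the direct-assignment proposal from the Line 133/213 comment above, the write side would pick exactly one value instead of OR-ing bits, roughly as follows (a sketch of the suggestion, not what the current patch does):

uint8   flags;

/* Exactly one of the three states applies; these are enum-like values, not bit flags. */
if (rbtxn_commit_prepared(txn))
    flags = LOGICALREP_IS_COMMIT_PREPARED;
else if (rbtxn_rollback_prepared(txn))
    flags = LOGICALREP_IS_ROLLBACK_PREPARED;
else
    flags = LOGICALREP_IS_PREPARE;

/* Send the flags field. */
pq_sendbyte(out, flags);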
On Fri, Oct 16, 2020 at 5:21 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hello Ajin, > > The v9 patches provided support for two-phase transactions for the NON-streaming case. > > Now I have added STREAM support for two-phase transactions, and bumped > all patches to version v10. > > (The 0001 and 0002 patches are unchanged; only 0003 is changed.) > > -- > > There are a few TODO/FIXME comments in the code highlighting parts > needing some attention. > > There is a #define DEBUG_STREAM_2PC that is useful for debugging, which I can > remove later. > > All the patches have some whitespace issues when applied. We can > resolve them as we go. > > Please let me know any comments/feedback. Hi Peter, Thanks for your patch. Some comments: src/backend/replication/logical/worker.c @@ -888,6 +888,319 @@ apply_handle_prepare(StringInfo s) + /* + * FIXME - Following condition was in apply_handle_prepare_txn except I found it was ALWAYS IsTransactionState() == false + * The synchronization worker runs in single transaction. * + if (IsTransactionState() && !am_tablesync_worker()) + */ + if (!am_tablesync_worker()) Comment: I don't think a tablesync worker will use streaming; none of the other stream APIs check this, so this might not be relevant for stream_prepare either. + /* + * ================================================================================================== + * The following chunk of code is largely cut/paste from the existing apply_handle_prepare_commit_txn Comment: Here, I think you meant apply_handle_stream_commit. Also, rather than duplicating this chunk of code, you could put it in a new function. + /* open the spool file for the committed transaction */ + changes_filename(path, MyLogicalRepWorker->subid, xid); Comment: Here the comment should read "committed/prepared" rather than "committed". + else + { + /* Process any invalidation messages that might have accumulated. */ + AcceptInvalidationMessages(); + maybe_reread_subscription(); + } Comment: This else block might not be necessary, as a tablesync worker will not initiate the streaming APIs. + BeginTransactionBlock(); + CommitTransactionCommand(); + StartTransactionCommand(); Comment: Rereading the code and the transaction state description in src/backend/access/transam/README, I am not entirely sure if the BeginTransactionBlock followed by CommitTransactionCommand is really needed here. I understand this code was copied over from apply_handle_prepare_txn, but now looking back I'm not so sure if it is correct. The transaction would have already begun as part of applying the changes, so why begin it again? Maybe Amit could confirm this. END regards, Ajin Cherian Fujitsu Australia
The PG docs for PREPARE TRANSACTION [1] don't say anything about an empty (zero length) transaction-id. e.g. PREPARE TRANSACTION ''; [1] https://www.postgresql.org/docs/current/sql-prepare-transaction.html ~ Meanwhile, during testing I found the 2PC prepare hangs when an empty id is used. Now I am not sure whether this represents some bug within the 2PC code, or whether in fact PREPARE should never have allowed an empty transaction-id to be specified in the first place. Thoughts? Kind Regards Peter Smith. Fujitsu Australia.
On Tue, Oct 20, 2020 at 4:32 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Fri, Oct 16, 2020 at 5:21 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Comments: > > src/backend/replication/logical/worker.c > @@ -888,6 +888,319 @@ apply_handle_prepare(StringInfo s) > + /* > + * FIXME - Following condition was in apply_handle_prepare_txn except > I found it was ALWAYS IsTransactionState() == false > + * The synchronization worker runs in single transaction. * > + if (IsTransactionState() && !am_tablesync_worker()) > + */ > + if (!am_tablesync_worker()) > > Comment: I dont think a tablesync worker will use streaming, none of > the other stream APIs check this, this might not be relevant for > stream_prepare either. > Yes, I think this is right. See pgoutput_startup where we are disabling the streaming for init phase. But it is always good to once test this and ensure the same. > > + /* > + * ================================================================================================== > + * The following chunk of code is largely cut/paste from the existing > apply_handle_prepare_commit_txn > > Comment: Here, I think you meant apply_handle_stream_commit. Also > rather than duplicating this chunk of code, you could put it in a new > function. > > + /* open the spool file for the committed transaction */ > + changes_filename(path, MyLogicalRepWorker->subid, xid); > > Comment: Here the comment should read "committed/prepared" rather than > "committed" > > > + else > + { > + /* Process any invalidation messages that might have accumulated. */ > + AcceptInvalidationMessages(); > + maybe_reread_subscription(); > + } > > Comment: This else block might not be necessary as a tablesync worker > will not initiate the streaming APIs. > I think it is better to have an Assert here for streaming-mode? > + BeginTransactionBlock(); > + CommitTransactionCommand(); > + StartTransactionCommand(); > > Comment: Rereading the code and the transaction state description in > src/backend/access/transam/README. I am not entirely sure if the > BeginTransactionBlock followed by CommitTransactionBlock is really > needed here. > Yeah, I also find this strange. I guess the patch is doing so because it needs to call PrepareTransactionBlock later but I am not sure. How can we call CommitTransactionCommand(), won't it commit the on-going transaction and make it visible before even it is visible on the publisher. I think you can verify by having a breakpoint after CommitTransactionCommand() and see if the changes for which we are doing prepare become visible. > I understand this code was copied over from apply_handle_prepare_txn, > but now looking back I'm not so sure if it is correct. The transaction > would have already begin as part of applying the changes, why begin it > again? > Maybe Amit could confirm this. > I hope the above suggestions will help to proceed here. -- With Regards, Amit Kapila.
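To make the visibility question concrete, the sequence being discussed is roughly the following (an annotated sketch; the trailing PrepareTransactionBlock() call is inferred from Amit's note about it, not quoted from the patch):

    BeginTransactionBlock();        /* open an explicit block, presumably so
                                     * that PrepareTransactionBlock() is
                                     * legal later */
    CommitTransactionCommand();     /* the concern: does this make the applied
                                     * changes visible before the publisher's
                                     * PREPARE is replicated? */
    StartTransactionCommand();
    /* ... apply continues; eventually PrepareTransactionBlock(gid) ... */

The breakpoint test Amit suggests would show whether the changes become visible at the CommitTransactionCommand() step.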
On Wed, Oct 21, 2020 at 1:38 PM Peter Smith <smithpb2250@gmail.com> wrote: > > The PG docs for PREPARE TRANSACTION [1] don't say anything about an > empty (zero length) transaction-id. > e.g. PREPARE TRANSACTION ''; > [1] https://www.postgresql.org/docs/current/sql-prepare-transaction.html > > ~ > > Meanwhile, during testing I found the 2PC prepare hangs when an empty > id is used. > Can you please take an example to explain what you are trying to say? I have tried the below and don't face any problem: postgres=# Begin; BEGIN postgres=*# select txid_current(); txid_current -------------- 534 (1 row) postgres=*# Prepare Transaction 'foo'; PREPARE TRANSACTION postgres=# Commit Prepared 'foo'; COMMIT PREPARED postgres=# Begin; BEGIN postgres=*# Prepare Transaction 'foo'; PREPARE TRANSACTION postgres=# Commit Prepared 'foo'; COMMIT PREPARED -- With Regards, Amit Kapila.
On Tue, Oct 20, 2020 at 9:46 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > ========== > Patch v10-0002, File: src/backend/replication/logical/reorderbuffer.c > ========== > > COMMENT > There are some parts of the code where in my v6 review I had a doubt > about the mutually exclusive treatment of the "streaming" flag and the > "rbtxn_prepared(txn)" state. > I am not sure about the exact specifics here but we can always prepare a transaction that is streamed. I have to raise one more point in this regard. Why do we need stream_commit_prepared_cb, stream_rollback_prepared_cb callbacks? Do we need to do something separate in pgoutput or otherwise for these APIs? If not, can't we use a non-stream version of these APIs instead? There appears to be a use-case for stream_prepare_cb which is to apply the existing changes on subscriber and call prepare but I can't see usecase for the other two APIs. One minor comment: v10-0001-Support-2PC-txn-base 1. @@ -574,6 +655,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho void ReorderBufferCommit(ReorderBuffer *, TransactionId, XLogRecPtr commit_lsn, XLogRecPtr end_lsn, TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn); +void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid, + XLogRecPtr commit_lsn, XLogRecPtr end_lsn, + TimestampTz commit_time, + RepOriginId origin_id, XLogRecPtr origin_lsn, + char *gid, bool is_commit); void ReorderBufferAssignChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn); void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId, XLogRecPtr commit_lsn, XLogRecPtr end_lsn); @@ -597,6 +683,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid); bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid); +bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid, + const char *gid); +bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid, + const char *gid); +void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid, + XLogRecPtr commit_lsn, XLogRecPtr end_lsn, + TimestampTz commit_time, + RepOriginId origin_id, XLogRecPtr origin_lsn, + char *gid); ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuf I don't think these changes belong to this patch as the definition of these functions is not part of this patch. -- With Regards, Amit Kapila.
On Wed, Oct 21, 2020 at 7:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Oct 21, 2020 at 1:38 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > The PG docs for PREPARE TRANSACTION [1] don't say anything about an > > empty (zero length) transaction-id. > > e.g. PREPARE TRANSACTION ''; > > [1] https://www.postgresql.org/docs/current/sql-prepare-transaction.html > > > > ~ > > > > Meanwhile, during testing I found the 2PC prepare hangs when an empty > > id is used. > > > > Can you please take an example to explain what you are trying to say? I was referring to an empty (zero length) transaction ID, not an empty transaction. The example was already given as PREPARE TRANSACTION ''; A longer example from my regress test is shown below. Using 2PC pub/sub this will currently hang: # -------------------- # Test using empty GID # -------------------- # check that 2PC gets replicated to subscriber $node_publisher->safe_psql('postgres', "BEGIN;INSERT INTO tab_full VALUES (51);PREPARE TRANSACTION '';"); $node_publisher->poll_query_until('postgres', $caughtup_query) or die "Timed out while waiting for subscriber to catch up"; # check that transaction is in prepared state on subscriber $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = '';"); is($result, qq(1), 'transaction is prepared on subscriber'); # ROLLBACK $node_publisher->safe_psql('postgres', "ROLLBACK PREPARED '';"); # check that 2PC gets aborted on subscriber $node_publisher->poll_query_until('postgres', $caughtup_query) or die "Timed out while waiting for subscriber to catch up"; $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts where gid = '';"); is($result, qq(0), 'transaction is aborted on subscriber'); ~ Is that something that should be made to work for 2PC pub/sub, or was Postgres PREPARE TRANSACTION statement wrong to allow the user to specify an empty transaction ID in the first place? Kind Regards Peter Smith. Fujitsu Australia.
On Thu, Oct 22, 2020 at 4:58 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Wed, Oct 21, 2020 at 7:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Oct 21, 2020 at 1:38 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > The PG docs for PREPARE TRANSACTION [1] don't say anything about an > > > empty (zero length) transaction-id. > > > e.g. PREPARE TRANSACTION ''; > > > [1] https://www.postgresql.org/docs/current/sql-prepare-transaction.html > > > > > > ~ > > > > > > Meanwhile, during testing I found the 2PC prepare hangs when an empty > > > id is used. > > > > > > > Can you please take an example to explain what you are trying to say? > > I was referring to an empty (zero length) transaction ID, not an empty > transaction. > oh, I got it confused with the system generated 32-bit TransactionId. But now, I got what you were referring to. > The example was already given as PREPARE TRANSACTION ''; > > > Is that something that should be made to work for 2PC pub/sub, or was > Postgres PREPARE TRANSACTION statement wrong to allow the user to > specify an empty transaction ID in the first place? > I don't see any problem with the empty transaction identifier used in Prepare Transaction. This is just used as an identifier to uniquely identify the transaction. If you try to use an empty string ('') more than once for Prepare Transaction, it will give an error like below: postgres=*# prepare transaction ''; ERROR: transaction identifier "" is already in use So, I think this should work for pub/sub as well. Did you find out the reason of hang? -- With Regards, Amit Kapila.
On Tue, Oct 20, 2020 at 3:15 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hello Ajin. > > I have gone through the v10 patches to verify if and how my previous > v6 review comments got addressed. > > Some issues remain, and there are a few newly introduced ones. > > Mostly it is all very minor stuff. > > Please find my revised review comments below. > > Kind Regards. > Peter Smith > Fujitsu Australia > > --- > > V10 REVIEW COMMENTS FOLLOW > > ========== > Patch v10-0001, File: contrib/test_decoding/test_decoding.c > ========== > > COMMENT > Line 285 > + { > + errno = 0; > + data->check_xid_aborted = (TransactionId) > + strtoul(strVal(elem->arg), NULL, 0); > + > + if (!TransactionIdIsValid(data->check_xid_aborted)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("check-xid-aborted is not a valid xid: \"%s\"", > + strVal(elem->arg)))); > + } > > > I think it is risky to assign strtoul directly to the > check_xid_aborted member because it makes some internal assumption > that the invalid transaction is the same as the error return from > strtoul. > > Maybe better to do in 2 steps like below: > > BEFORE > errno = 0; > data->check_xid_aborted = (TransactionId)strtoul(strVal(elem->arg), NULL, 0); > > AFTER > long xid; > errno = 0; > xid = strtoul(strVal(elem->arg), NULL, 0); > if (xid == 0 || errno != 0) > data->check_xid_aborted = InvalidTransactionId; > else > data->check_xid_aborted =(TransactionId)xid; > > --- Updated accordingly. > > COMMENT > Line 430 > + > +/* ABORT PREPARED callback */ > +static void > +pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > + XLogRecPtr abort_lsn) > > > Fix comment "ABORT PREPARED" --> "ROLLBACK PREPARED" Updated accordingly. > > ========== > Patch v10-0001, File: doc/src/sgml/logicaldecoding.sgml > ========== > > COMMENT > Section 48.6.1 > Says: > An output plugin may also define functions to support streaming of > large, in-progress transactions. The stream_start_cb, stream_stop_cb, > stream_abort_cb, stream_commit_cb and stream_change_cb are required, > while stream_message_cb, stream_prepare_cb, stream_commit_prepared_cb, > stream_rollback_prepared_cb and stream_truncate_cb are optional. > > An output plugin may also define functions to support two-phase > commits, which are decoded on PREPARE TRANSACTION. The prepare_cb, > commit_prepared_cb and rollback_prepared_cb callbacks are required, > while filter_prepare_cb is optional. > > - > > But is that correct? It seems strange/inconsistent to say that the 2PC > callbacks are mandatory for the non-streaming, but that they are > optional for streaming. > Updated making all the 2PC callbacks mandatory. > --- > > COMMENT > 48.6.4.5 "Transaction Prepare Callback" > 48.6.4.6 "Transaction Commit Prepared Callback" > 48.6.4.7 "Transaction Rollback Prepared Callback" > > There seems some confusion about what is optional and what is > mandatory. e.g. Why are the non-stream 2PC callbacks mandatory but the > stream 2PC callbacks are not? And also there is some inconsistency > with what is said in the paragraph at the top of the page versus what > each of the callback sections says wrt optional/mandatory. > > The sub-sections 49.6.4.5, 49.6.4.6, 49.6.4.7 say those callbacks are > optional which IIUC Amit said is incorrect. This is similar to the > previous review comment > > --- Updated making all the 2PC callbacks mandatory. 
> > COMMENT > Section 48.6.4.7 "Transaction Rollback Prepared Callback" > > parameter "abort_lsn" probably should be "rollback_lsn" > > --- > > COMMENT > Section 49.6.4.18. "Stream Rollback Prepared Callback" > Says: > The stream_rollback_prepared_cb callback is called to abort a > previously streamed transaction as part of a two-phase commit. > > maybe should say "is called to rollback" > > ========== > Patch v10-0001, File: src/backend/replication/logical/logical.c > ========== > > COMMENT > Line 252 > Says: We however enable two phase logical... > > "two phase" --> "two-phase" > > -- > > COMMENT > Line 885 > Line 923 > Says: If the plugin support 2 phase commits... > > "support 2 phase" --> "supports two-phase" in the comment. Same issue > occurs twice. > > --- > > COMMENT > Line 830 > Line 868 > Line 906 > Says: > /* We're only supposed to call this when two-phase commits are supported */ > > There is an extra space between the "are" and "supported" in the comment. > Same issue occurs 3 times. > > --- > > COMMENT > Line 1023 > + /* > + * Skip if decoding of two-phase at PREPARE time is not enabled. In that > + * case all two-phase transactions are considered filtered out and will be > + * applied as regular transactions at COMMIT PREPARED. > + */ > > Comment still is missing the word "transactions" > "Skip if decoding of two-phase at PREPARE time is not enabled." > -> "Skip if decoding of two-phase transactions at PREPARE time is not enabled. > Updated accordingly. > ========== > Patch v10-0001, File: src/include/replication/reorderbuffer.h > ========== > > COMMENT > Line 459 > /* abort prepared callback signature */ > typedef void (*ReorderBufferRollbackPreparedCB) ( > ReorderBuffer *rb, > ReorderBufferTXN *txn, > XLogRecPtr abort_lsn); > > There is no alignment consistency here for > ReorderBufferRollbackPreparedCB. Some function args are directly under > the "(" and some are on the same line. This function's code is neither. > > --- > > COMMENT > Line 638 > @@ -431,6 +486,24 @@ typedef void (*ReorderBufferStreamAbortCB) ( > ReorderBufferTXN *txn, > XLogRecPtr abort_lsn); > > +/* prepare streamed transaction callback signature */ > +typedef void (*ReorderBufferStreamPrepareCB) ( > + ReorderBuffer *rb, > + ReorderBufferTXN *txn, > + XLogRecPtr prepare_lsn); > > +/* prepare streamed transaction callback signature */ > +typedef void (*ReorderBufferStreamCommitPreparedCB) ( > + ReorderBuffer *rb, > + ReorderBufferTXN *txn, > + XLogRecPtr commit_lsn); > > +/* prepare streamed transaction callback signature */ > +typedef void (*ReorderBufferStreamRollbackPreparedCB) ( > + ReorderBuffer *rb, > + ReorderBufferTXN *txn, > + XLogRecPtr rollback_lsn); > > There is inconsistent alignment of the arguments (compare how > other functions are aligned) > > See: > - for ReorderBufferStreamCommitPreparedCB > - for ReorderBufferStreamRollbackPreparedCB > - for ReorderBufferPrepareNeedSkip > - for ReorderBufferTxnIsPrepared > - for ReorderBufferPrepare > > --- > > COMMENT > Line 489 > Line 495 > Line 501 > /* prepare streamed transaction callback signature */ > > Same comment cut/paste 3 times? > - for ReorderBufferStreamPrepareCB > - for ReorderBufferStreamCommitPreparedCB > - for ReorderBufferStreamRollbackPreparedCB > > --- > > COMMENT > Line 457 > /* abort prepared callback signature */ > typedef void (*ReorderBufferRollbackPreparedCB) ( > ReorderBuffer *rb, > ReorderBufferTXN *txn, > XLogRecPtr abort_lsn); > > "abort" --> "rollback" in the function comment.
> > --- > > COMMENT > Line 269 > /* In case of 2PC we need to pass GID to output plugin */ > > "2PC" --> "two-phase commit" > Updated accordingly. > ========== > Patch v10-0002, File: contrib/test_decoding/expected/two_phase.out (and .sql) > ========== > > COMMENT > General > > It is a bit hard to see what are the main tests here and what are just > sub-parts of some test case. > > e.g. It seems like the main tests are: > > 1. Test that decoding happens at PREPARE time > 2. Test decoding of an aborted tx > 3. Test a prepared tx which contains some DDL > 4. Test decoding works while an uncommitted prepared tx with DDL exists > 5. Test operations holding exclusive locks won't block decoding > 6. Test savepoints and sub-transactions > 7. Test "_nodecode" will defer the decoding until the commit time > > Can the comments be made more obvious so it is easy to distinguish the > main tests from the steps of those tests? > > --- > > COMMENT > Line 1 > -- Test two-phased transactions, when two-phase-commit is enabled, > transactions are > -- decoded at PREPARE time rather than at COMMIT PREPARED time. > > Some commas to be removed and this comment to be split into several sentences. > > --- > > COMMENT > Line 19 > -- should show nothing > > Comment could be more informative. E.g. "Should show nothing because > the PREPARE has not happened yet" > > --- > > COMMENT > Line 77 > > Looks like there is a missing comment about here that should say > something like "Show that the DDL does not appear in the decoding" > > --- > > COMMENT > Line 160 > -- test savepoints and sub-xacts as a result > > The subsequent test is testing savepoints. But is it testing sub-transactions like the comment says? > Updated accordingly. > ========== > Patch v10-0002, File: contrib/test_decoding/t/001_twophase.pl > ========== > > COMMENT > General > > I think basically there are only 2 tests in this file. > 1. to check that the concurrent abort works. > 2. to check that the prepared tx can span a server shutdown/restart > > But the test comments do not make this clear at all. > e.g. All the "#" comments look equally important although most of them > are just steps of each test case. > Can the comments better distinguish the tests from the steps > of each test? > Updated accordingly. > ========== > Patch v10-0002, File: src/backend/replication/logical/decode.c > ========== > > COMMENT > Line 71 > static void DecodeCommitPrepared(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf, > xl_xact_parsed_commit *parsed, TransactionId xid); > static void DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, > xl_xact_parsed_abort *parsed, TransactionId xid); > static void DecodeAbortPrepared(LogicalDecodingContext *ctx, > XLogRecordBuffer *buf, > xl_xact_parsed_abort *parsed, TransactionId xid); > static void DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, > xl_xact_parsed_prepare * parsed); > > The 2nd line of args is not aligned properly. > - for DecodeCommitPrepared > - for DecodeAbortPrepared > - for DecodePrepare > Updated accordingly. > ========== > Patch v10-0002, File: src/backend/replication/logical/reorderbuffer.c > ========== > > COMMENT > There are some parts of the code where in my v6 review I had a doubt > about the mutually exclusive treatment of the "streaming" flag and the > "rbtxn_prepared(txn)" state.
> > Basically I did not see how some parts of the code are treating NOT > streaming as implying 2PC etc because it defies my understanding that > 2PC can also work in streaming mode. Perhaps the "streaming" flag has > a different meaning to how I interpret it? Or perhaps some functions > are guarding higher up and can only be called under certain > conditions? > > Anyway, this confusion manifests in several parts of the code, none of > which was changed after my v6 review. > > Affected code includes the following: > > CASE 1 > Wherever the ReorderBufferTruncateTXN(...) "prepared" flag (third > parameter) is hardwired true/false, I think there must be some > preceding Assert to guarantee the prepared state condition holds true. > There can't be any room for doubts like "but what will it do for > streamed 2PC..." > Line 1805 - ReorderBufferTruncateTXN(rb, txn, true); // if rbtxn_prepared(txn) > Line 1941 - ReorderBufferTruncateTXN(rb, txn, false); // state ?? > Line 2389 - ReorderBufferTruncateTXN(rb, txn, false); // if streaming > Line 2396 - ReorderBufferTruncateTXN(rb, txn, true); // if not > streaming and if rbtxm_prepared(txn) > Line 2459 - ReorderBufferTruncateTXN(rb, txn, true); // if not streaming > > ~ > > CASE 2 > Wherever the "streaming" flag is tested I don't really understand how > NOT streaming can automatically imply 2PC. > Line 2330 - if (streaming) // what about if it is streaming AND 2PC at > the same time? > Line 2387 - if (streaming) // what about if it is streaming AND 2PC at > the same time? > Line 2449 - if (streaming) // what about if it is streaming AND 2PC at > the same time? > > ~ > > Case 1 and Case 2 above overlap a fair bit. I just listed them so they > all get checked again. > > Even if the code is thought to be currently OK I do still think > something should be done like: > a) add some more substantial comments to explain WHY the combination > of streaming and 2PC is not valid in the context > b) the Asserts to be strengthened to 100% guarantee that the streaming > and prepared states really are exclusive (if indeed they are). For > this point I thought the following Assert condition could be better: > Assert(streaming || rbtxn_prepared(txn)); > Assert(stream_started || rbtxn_prepared(txn)); > because as it is you still are left wondering if both streaming AND > rbtxn_prepared(txn) can be possible at the same time... > > --- Updated with more comments and a new Assert. > > COMMENT > Line 2634 > * Anyways, two-phase transactions do not contain any reorderbuffers. > > "Anyways" --> "Anyway" Updated. > > ========== > Patch v10-0003, File: src/backend/access/transam/twophase.c > ========== > > COMMENT > Line 557 > @@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held) > } > > /* > + * LookupGXact > + * Check if the prepared transaction with the given GID is around > + */ > +bool > +LookupGXact(const char *gid) > +{ > + int i; > + bool found = false; > > The variable declarations (i and found) are not aligned. > Updated. > ========== > Patch v10-0003, File: src/backend/replication/logical/proto.c > ========== > > COMMENT > Line 125 > Line 205 > Assert(strlen(txn->gid) > 0); > > I suggested that the assertion should also check txn->gid is not NULL. > You replied "In this case txn->gid has to be non NULL". > > But that is exactly what I said :-) > If it HAS to be non-NULL then why not just Assert that in code instead > of leaving the reader wondering? 
> > "Assert(strlen(txn->gid) > 0);" --> "Assert(tdx->gid && strlen(txn->gid) > 0);" > Same occurs several times. > > --- Updated checking that gid is non-NULL as zero strlen is actually a valid case. > > COMMENT > Line 133 > Line 213 > if (rbtxn_commit_prepared(txn)) > flags |= LOGICALREP_IS_COMMIT_PREPARED; > else if (rbtxn_rollback_prepared(txn)) > flags |= LOGICALREP_IS_ROLLBACK_PREPARED; > else > flags |= LOGICALREP_IS_PREPARE; > > Previously I wrote that the use of the bit flags on assignment in the > logicalrep_write_prepare was inconsistent with the way they are > treated when they are read. Really it should be using a direct > assignment instead of bit flags. > > You said this is skipped anticipating a possible refactor. But IMO > this leaves the code in a half/half state. I think it is better to fix > it properly and if refactoring happens then deal with that at the > time. > > The last comment I saw from Amit said to use my 1st proposal of direct > assignment instead of bit flag assignment. > > (applies to both non-stream and stream functions) > - see logicalrep_write_prepare > - see logicalrep_write_stream_prepare > > Updated accordingly. > ========== > Patch v10-0003, File: src/backend/replication/pgoutput/pgoutput.c > ========== > > COMMENT > Line 429 > /* > * PREPARE callback > */ > static void > pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > XLogRecPtr prepare_lsn) > The function comment looks wrong. > Shouldn't this comment say be "ROLLBACK PREPARED callback"? > > ========== > Patch v10-0003, File: src/include/replication/logicalproto.h > ========== > > Line 115 > #define PrepareFlagsAreValid(flags) \ > ((flags == LOGICALREP_IS_PREPARE) || \ > (flags == LOGICALREP_IS_COMMIT_PREPARED) || \ > (flags == LOGICALREP_IS_ROLLBACK_PREPARED)) > > Would be safer if all the references to flags are in parentheses > e.g. "flags" --> "(flags)" > Updated accordingly. Amit, I have also modified the stream callback APIs to not include stream_commit_prpeared and stream_rollback_prepared, instead use the non-stream APIs for the same functionality. I have also updated the test_decoding and pgoutput plugins accordingly. regards, Ajin Cherian Fujitsu Australia
Attachment
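To pin down the streaming/prepared exclusivity questioned in the review above, the strengthened assertion would presumably look something like this (a sketch; the exact form of the new Assert that was added may differ):

    /*
     * Sketch: if streaming and two-phase PREPARE really are mutually
     * exclusive on this path, assert that exactly one of them holds.
     */
    Assert((streaming && !rbtxn_prepared(txn)) ||
           (!streaming && rbtxn_prepared(txn)));

Unlike Assert(streaming || rbtxn_prepared(txn)), this form also fails when both states are set at once, which is exactly the ambiguity the review raised.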
On Fri, Oct 23, 2020 at 3:41 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > Amit, > I have also modified the stream callback APIs to not include > stream_commit_prpeared and stream_rollback_prepared, instead use the > non-stream APIs for the same functionality. > I have also updated the test_decoding and pgoutput plugins accordingly. > Thanks, I think you forgot to address one of my comments in the previous email[1] (See "One minor comment .."). You have not even responded to it. [1] - https://www.postgresql.org/message-id/CAA4eK1JzRvUX2XLEKo2f74Vjecnt6wq-kkk1OiyMJ5XjJN%2BGvQ%40mail.gmail.com -- With Regards, Amit Kapila.
Hi Ajin. I've addressed your review comments (details below) and bumped the patch set to v12 attached. I also added more test cases. On Tue, Oct 20, 2020 at 10:02 PM Ajin Cherian <itsajin@gmail.com> wrote: > Thanks for your patch. Some comments for your patch: > > Comments: > > src/backend/replication/logical/worker.c > @@ -888,6 +888,319 @@ apply_handle_prepare(StringInfo s) > + /* > + * FIXME - Following condition was in apply_handle_prepare_txn except > I found it was ALWAYS IsTransactionState() == false > + * The synchronization worker runs in single transaction. * > + if (IsTransactionState() && !am_tablesync_worker()) > + */ > + if (!am_tablesync_worker()) > > Comment: I dont think a tablesync worker will use streaming, none of > the other stream APIs check this, this might not be relevant for > stream_prepare either. Updated > + /* > + * ================================================================================================== > + * The following chunk of code is largely cut/paste from the existing > apply_handle_prepare_commit_txn > > Comment: Here, I think you meant apply_handle_stream_commit. Updated. > Also > rather than duplicating this chunk of code, you could put it in a new > function. Code is refactored to share a common function for the spool file processing. > + else > + { > + /* Process any invalidation messages that might have accumulated. */ > + AcceptInvalidationMessages(); > + maybe_reread_subscription(); > + } > > Comment: This else block might not be necessary as a tablesync worker > will not initiate the streaming APIs. Updated ~ Kind Regards, Peter Smith Fujitsu Australia
Attachment
Hi Ajin. I checked to see how my previous review comments (of v10) were addressed by the latest patches (currently v12). There are a couple of remaining items. --- ==================== v12-0001. File: doc/src/sgml/logicaldecoding.sgml ==================== COMMENT Section 49.6.1 Says: An output plugin may also define functions to support streaming of large, in-progress transactions. The stream_start_cb, stream_stop_cb, stream_abort_cb, stream_commit_cb, stream_change_cb, and stream_prepare_cb are required, while stream_message_cb and stream_truncate_cb are optional. An output plugin may also define functions to support two-phase commits, which are decoded on PREPARE TRANSACTION. The prepare_cb, commit_prepared_cb and rollback_prepared_cb callbacks are required, while filter_prepare_cb is optional. ~ I was not sure how the paragraphs are organised. e.g. 1st seems to be about streams and 2nd seems to be about two-phase commit. But they are not mutually exclusive, so I guess I thought it was odd that stream_prepare_cb was not mentioned in the 2nd paragraph. Or maybe it is OK as-is? ==================== v12-0002. File: contrib/test_decoding/expected/two_phase.out ==================== COMMENT Line 26 PREPARE TRANSACTION 'test_prepared#1'; -- SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1'); ~ Seems like a missing comment to explain the expectation of that select. --- COMMENT Line 80 -- The insert should show the newly altered column. ~ Do you also need to mention something about the DDL not being present in the decoding? ==================== v12-0002. File: src/backend/replication/logical/reorderbuffer.c ==================== COMMENT Line 1807 /* Here we are streaming and part of the PREPARE of a two-phase commit * The full cleanup will happen as part of the COMMIT PREPAREDs, so now * just truncate txn by removing changes and tuple_cids */ ~ Something seems strange about the first sentence of that comment. --- COMMENT Line 1944 /* Discard the changes that we just streamed. * This can only be called if streaming and not part of a PREPARE in * a two-phase commit, so set prepared flag as false. */ ~ Since this comment is asserting various things, I thought that should also actually be written as a code Assert. --- COMMENT Line 2401 /* * We are here due to one of the 3 scenarios: * 1. As part of streaming in-progress transactions * 2. Prepare of a two-phase commit * 3. Commit of a transaction. * * If we are streaming the in-progress transaction then discard the * changes that we just streamed, and mark the transactions as * streamed (if they contained changes), set prepared flag as false. * If part of a prepare of a two-phase commit set the prepared flag * as true so that we can discard changes and cleanup tuplecids. * Otherwise, remove all the * changes and deallocate the ReorderBufferTXN. */ ~ The above comment is beyond my understanding. Anything you could do to simplify it would be good. For example, when viewing this function in isolation I have never understood why the streaming flag and the rbtxn_prepared(txn) flag cannot be set at the same time. Perhaps the code is relying on just internal knowledge of how this helper function gets called? And if it is just that, then IMO there really should be some Asserts in the code to give more assurance about that.
(Or maybe use completely different flags to represent those 3 scenarios instead of bending the meanings of the existing flags) ==================== v12-0003. File: src/backend/access/transam/twophase.c ==================== COMMENT Line 557 @@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held) } /* + * LookupGXact + * Check if the prepared transaction with the given GID is around + */ +bool +LookupGXact(const char *gid) +{ + int i; + bool found = false; ~ Alignment of the variable declarations in LookupGXact function --- Kind Regards, Peter Smith. Fujitsu Australia
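For reference, with the declarations aligned in the usual style, the lookup would presumably end up looking roughly like this (a sketch only; the locking and matching details may differ in the actual patch):

    bool
    LookupGXact(const char *gid)
    {
        int         i;
        bool        found = false;

        LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
        for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
        {
            GlobalTransaction gxact = TwoPhaseState->prepXacts[i];

            /* Ignore not-yet-valid entries. */
            if (gxact->valid && strcmp(gxact->gid, gid) == 0)
            {
                found = true;
                break;
            }
        }
        LWLockRelease(TwoPhaseStateLock);
        return found;
    }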
On Mon, Oct 26, 2020 at 6:49 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Ajin. > > I checked the to see how my previous review comments (of v10) were > addressed by the latest patches (currently v12) > > There are a couple of remaining items. > > --- > > ==================== > v12-0001. File: doc/src/sgml/logicaldecoding.sgml > ==================== > > COMMENT > Section 49.6.1 > Says: > An output plugin may also define functions to support streaming of > large, in-progress transactions. The stream_start_cb, stream_stop_cb, > stream_abort_cb, stream_commit_cb, stream_change_cb, and > stream_prepare_cb are required, while stream_message_cb and > stream_truncate_cb are optional. > > An output plugin may also define functions to support two-phase > commits, which are decoded on PREPARE TRANSACTION. The prepare_cb, > commit_prepared_cb and rollback_prepared_cb callbacks are required, > while filter_prepare_cb is optional. > ~ > I was not sure how the paragraphs are organised. e.g. 1st seems to be > about streams and 2nd seems to be about two-phase commit. But they are > not mutually exclusive, so I guess I thought it was odd that > stream_prepare_cb was not mentioned in the 2nd paragraph. > > Or maybe it is OK as-is? > I've added stream_prepare_cb to the 2nd paragraph as well. > ==================== > v12-0002. File: contrib/test_decoding/expected/two_phase.out > ==================== > > COMMENT > Line 26 > PREPARE TRANSACTION 'test_prepared#1'; > -- > SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, > NULL, 'two-phase-commit', '1', 'include-xids', '0', > 'skip-empty-xacts', '1'); > ~ > Seems like a missing comment to explain the expectation of that select. > > --- > Updated. > COMMENT > Line 80 > -- The insert should show the newly altered column. > ~ > Do you also need to mention something about the DDL not being present > in the decoding? > Updated. > ==================== > v12-0002. File: src/backend/replication/logical/reorderbuffer.c > ==================== > > COMMENT > Line 1807 > /* Here we are streaming and part of the PREPARE of a two-phase commit > * The full cleanup will happen as part of the COMMIT PREPAREDs, so now > * just truncate txn by removing changes and tuple_cids > */ > ~ > Something seems strange about the first sentence of that comment > > --- > > COMMENT > Line 1944 > /* Discard the changes that we just streamed. > * This can only be called if streaming and not part of a PREPARE in > * a two-phase commit, so set prepared flag as false. > */ > ~ > I thought since this comment that is asserting various things, that > should also actually be written as code Assert. > > --- Added an assert. > > COMMENT > Line 2401 > /* > * We are here due to one of the 3 scenarios: > * 1. As part of streaming in-progress transactions > * 2. Prepare of a two-phase commit > * 3. Commit of a transaction. > * > * If we are streaming the in-progress transaction then discard the > * changes that we just streamed, and mark the transactions as > * streamed (if they contained changes), set prepared flag as false. > * If part of a prepare of a two-phase commit set the prepared flag > * as true so that we can discard changes and cleanup tuplecids. > * Otherwise, remove all the > * changes and deallocate the ReorderBufferTXN. > */ > ~ > The above comment is beyond my understanding. Anything you could do to > simplify it would be good. 
> > For example, when viewing this function in isolation I have never > understood why the streaming flag and rbtxn_prepared(txn) flag are not > possible to be set at the same time? > > Perhaps the code is relying on just internal knowledge of how this > helper function gets called? And if it is just that, then IMO there > really should be some Asserts in the code to give more assurance about > that. (Or maybe use completely different flags to represent those 3 > scenarios instead of bending the meanings of the existing flags) > Left this for now, probably re-look at this at a later review. But just to explain; this function is what does the main decoding of changes of a transaction. At what point this decoding happens is what this feature and the streaming in-progress feature is about. As of PG13, this decoding only happens at commit time. With the streaming of in-progress txn feature, this began to happen (if streaming enabled) at the time when the memory limit for decoding transactions was crossed. This 2PC feature is supporting decoding at the time of a PREPARE transaction. Now, if streaming is enabled and streaming has started as a result of crossing the memory threshold, then there is no need to again begin streaming at a PREPARE transaction as the transaction that is being prepared has already been streamed. Which is why this function will not be called when a streaming transaction is prepared as part of a two-phase commit. > ==================== > v12-0003. File: src/backend/access/transam/twophase.c > ==================== > > COMMENT > Line 557 > @@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held) > } > > /* > + * LookupGXact > + * Check if the prepared transaction with the given GID is around > + */ > +bool > +LookupGXact(const char *gid) > +{ > + int i; > + bool found = false; > ~ > Alignment of the variable declarations in LookupGXact function > > --- Updated. Amit, I have also updated your comment about removing function declaration from commit 1 and I've added it to commit 2. Also removed whitespace errors. regards, Ajin Cherian Fujitsu Australia
Attachment
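Presumably the assertion Ajin added for the streamed-discard path is along these lines (a sketch):

    /*
     * Discarding just-streamed changes is never done as part of a
     * two-phase PREPARE, so the prepared flag must not be set here.
     */
    Assert(!rbtxn_prepared(txn));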
FYI - Please find attached code coverage reports which I generated (based on the v12 patches) after running the following tests: 1. cd contrib/test_decoding; make check 2. cd src/test/subscriber; make check Kind Regards, Peter Smith. Fujitsu Australia On Tue, Oct 27, 2020 at 8:55 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Mon, Oct 26, 2020 at 6:49 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Hi Ajin. > > > > I checked the to see how my previous review comments (of v10) were > > addressed by the latest patches (currently v12) > > > > There are a couple of remaining items. > > > > --- > > > > ==================== > > v12-0001. File: doc/src/sgml/logicaldecoding.sgml > > ==================== > > > > COMMENT > > Section 49.6.1 > > Says: > > An output plugin may also define functions to support streaming of > > large, in-progress transactions. The stream_start_cb, stream_stop_cb, > > stream_abort_cb, stream_commit_cb, stream_change_cb, and > > stream_prepare_cb are required, while stream_message_cb and > > stream_truncate_cb are optional. > > > > An output plugin may also define functions to support two-phase > > commits, which are decoded on PREPARE TRANSACTION. The prepare_cb, > > commit_prepared_cb and rollback_prepared_cb callbacks are required, > > while filter_prepare_cb is optional. > > ~ > > I was not sure how the paragraphs are organised. e.g. 1st seems to be > > about streams and 2nd seems to be about two-phase commit. But they are > > not mutually exclusive, so I guess I thought it was odd that > > stream_prepare_cb was not mentioned in the 2nd paragraph. > > > > Or maybe it is OK as-is? > > > > I've added stream_prepare_cb to the 2nd paragraph as well. > > > > ==================== > > v12-0002. File: contrib/test_decoding/expected/two_phase.out > > ==================== > > > > COMMENT > > Line 26 > > PREPARE TRANSACTION 'test_prepared#1'; > > -- > > SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, > > NULL, 'two-phase-commit', '1', 'include-xids', '0', > > 'skip-empty-xacts', '1'); > > ~ > > Seems like a missing comment to explain the expectation of that select. > > > > --- > > > > Updated. > > > COMMENT > > Line 80 > > -- The insert should show the newly altered column. > > ~ > > Do you also need to mention something about the DDL not being present > > in the decoding? > > > > Updated. > > > ==================== > > v12-0002. File: src/backend/replication/logical/reorderbuffer.c > > ==================== > > > > COMMENT > > Line 1807 > > /* Here we are streaming and part of the PREPARE of a two-phase commit > > * The full cleanup will happen as part of the COMMIT PREPAREDs, so now > > * just truncate txn by removing changes and tuple_cids > > */ > > ~ > > Something seems strange about the first sentence of that comment > > > > --- > > > > COMMENT > > Line 1944 > > /* Discard the changes that we just streamed. > > * This can only be called if streaming and not part of a PREPARE in > > * a two-phase commit, so set prepared flag as false. > > */ > > ~ > > I thought since this comment that is asserting various things, that > > should also actually be written as code Assert. > > > > --- > > Added an assert. > > > > > COMMENT > > Line 2401 > > /* > > * We are here due to one of the 3 scenarios: > > * 1. As part of streaming in-progress transactions > > * 2. Prepare of a two-phase commit > > * 3. Commit of a transaction. 
> > * > > * If we are streaming the in-progress transaction then discard the > > * changes that we just streamed, and mark the transactions as > > * streamed (if they contained changes), set prepared flag as false. > > * If part of a prepare of a two-phase commit set the prepared flag > > * as true so that we can discard changes and cleanup tuplecids. > > * Otherwise, remove all the > > * changes and deallocate the ReorderBufferTXN. > > */ > > ~ > > The above comment is beyond my understanding. Anything you could do to > > simplify it would be good. > > > > For example, when viewing this function in isolation I have never > > understood why the streaming flag and rbtxn_prepared(txn) flag are not > > possible to be set at the same time? > > > > Perhaps the code is relying on just internal knowledge of how this > > helper function gets called? And if it is just that, then IMO there > > really should be some Asserts in the code to give more assurance about > > that. (Or maybe use completely different flags to represent those 3 > > scenarios instead of bending the meanings of the existing flags) > > > > Left this for now, probably re-look at this at a later review. > But just to explain; this function is what does the main decoding of > changes of a transaction. > At what point this decoding happens is what this feature and the > streaming in-progress feature is about. As of PG13, this decoding only > happens at commit time. With the streaming of in-progress txn feature, > this began to happen (if streaming enabled) at the time when the > memory limit for decoding transactions was crossed. This 2PC feature > is supporting decoding at the time of a PREPARE transaction. > Now, if streaming is enabled and streaming has started as a result of > crossing the memory threshold, then there is no need to > again begin streaming at a PREPARE transaction as the transaction that > is being prepared has already been streamed. Which is why this > function will not be called when a streaming transaction is prepared > as part of a two-phase commit. > > > ==================== > > v12-0003. File: src/backend/access/transam/twophase.c > > ==================== > > > > COMMENT > > Line 557 > > @@ -548,6 +548,33 @@ MarkAsPrepared(GlobalTransaction gxact, bool lock_held) > > } > > > > /* > > + * LookupGXact > > + * Check if the prepared transaction with the given GID is around > > + */ > > +bool > > +LookupGXact(const char *gid) > > +{ > > + int i; > > + bool found = false; > > ~ > > Alignment of the variable declarations in LookupGXact function > > > > --- > > Updated. > > Amit, I have also updated your comment about removing function > declaration from commit 1 and I've added it to commit 2. Also removed > whitespace errors. > > regards, > Ajin Cherian > Fujitsu Australia
Attachment
Hi Ajin. I have re-checked the v13 patches for how my remaining review comments have been addressed. On Tue, Oct 27, 2020 at 8:55 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > ==================== > > v12-0002. File: src/backend/replication/logical/reorderbuffer.c > > ==================== > > > > COMMENT > > Line 2401 > > /* > > * We are here due to one of the 3 scenarios: > > * 1. As part of streaming in-progress transactions > > * 2. Prepare of a two-phase commit > > * 3. Commit of a transaction. > > * > > * If we are streaming the in-progress transaction then discard the > > * changes that we just streamed, and mark the transactions as > > * streamed (if they contained changes), set prepared flag as false. > > * If part of a prepare of a two-phase commit set the prepared flag > > * as true so that we can discard changes and cleanup tuplecids. > > * Otherwise, remove all the > > * changes and deallocate the ReorderBufferTXN. > > */ > > ~ > > The above comment is beyond my understanding. Anything you could do to > > simplify it would be good. > > > > For example, when viewing this function in isolation I have never > > understood why the streaming flag and rbtxn_prepared(txn) flag are not > > possible to be set at the same time? > > > > Perhaps the code is relying on just internal knowledge of how this > > helper function gets called? And if it is just that, then IMO there > > really should be some Asserts in the code to give more assurance about > > that. (Or maybe use completely different flags to represent those 3 > > scenarios instead of bending the meanings of the existing flags) > > > > Left this for now, probably re-look at this at a later review. > But just to explain; this function is what does the main decoding of > changes of a transaction. > At what point this decoding happens is what this feature and the > streaming in-progress feature is about. As of PG13, this decoding only > happens at commit time. With the streaming of in-progress txn feature, > this began to happen (if streaming enabled) at the time when the > memory limit for decoding transactions was crossed. This 2PC feature > is supporting decoding at the time of a PREPARE transaction. > Now, if streaming is enabled and streaming has started as a result of > crossing the memory threshold, then there is no need to > again begin streaming at a PREPARE transaction as the transaction that > is being prepared has already been streamed. Which is why this > function will not be called when a streaming transaction is prepared > as part of a two-phase commit. AFAIK the last remaining issue now is only about the complexity of the aforementioned code/comment. If you want to defer changing that until we can come up with something better, then that is OK by me. Apart from that I have no other pending review comments at this time. Kind Regards, Peter Smith. Fujitsu Australia
Hi Ajin. Looking at v13 patches again I found a couple more review comments: === (1) COMMENT File: src/backend/replication/logical/proto.c Function: logicalrep_write_prepare + if (rbtxn_commit_prepared(txn)) + flags = LOGICALREP_IS_COMMIT_PREPARED; + else if (rbtxn_rollback_prepared(txn)) + flags = LOGICALREP_IS_ROLLBACK_PREPARED; + else + flags = LOGICALREP_IS_PREPARE; + + /* Make sure exactly one of the expected flags is set. */ + if (!PrepareFlagsAreValid(flags)) + elog(ERROR, "unrecognized flags %u in prepare message", flags); Since those flags are directly assigned, I think the subsequent if (!PrepareFlagsAreValid(flags)) check is redundant. === (2) COMMENT File: src/backend/replication/logical/proto.c Function: logicalrep_write_stream_prepare +/* + * Write STREAM PREPARE to the output stream. + * (For stream PREPARE, stream COMMIT PREPARED, stream ROLLBACK PREPARED) + */ I think the function comment is outdated because IIUC the stream COMMIT PREPARED and stream ROLLBACK PREPARED are not being handled by the function logicalrep_write_prepare. Since this approach seems counter-intuitive there needs to be an improved function comment to explain what is going on. === (3) COMMENT File: src/backend/replication/logical/proto.c Function: logicalrep_read_stream_prepare +/* + * Read STREAM PREPARE from the output stream. + * (For stream PREPARE, stream COMMIT PREPARED, stream ROLLBACK PREPARED) + */ This is the same as the previous review comment. The function comment needs to explain the new handling for stream COMMIT PREPARED and stream ROLLBACK PREPARED. === (4) COMMENT File: src/backend/replication/logical/proto.c Function: logicalrep_read_stream_prepare +TransactionId +logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData *prepare_data) +{ + TransactionId xid; + uint8 flags; + + xid = pq_getmsgint(in, 4); + + /* read flags */ + flags = pq_getmsgbyte(in); + + if (!PrepareFlagsAreValid(flags)) + elog(ERROR, "unrecognized flags %u in prepare message", flags); I think logicalrep_write_stream_prepare can now only assign flags = LOGICALREP_IS_PREPARE, so the check here for bad flags should be changed to match. BEFORE: if (!PrepareFlagsAreValid(flags)) AFTER: if (flags != LOGICALREP_IS_PREPARE) === (5) COMMENT General Since the COMMENTs (2), (3) and (4) are all caused by the refactoring that was done for removal of the commit/rollback stream callbacks, I do wonder if it might have worked out better just to leave the logicalrep_read/write_stream_prepared as it was instead of mixing up stream/no-stream handling. A check for stream/no-stream could possibly have been made higher up. For example: static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, XLogRecPtr prepare_lsn) { OutputPluginUpdateProgress(ctx); OutputPluginPrepareWrite(ctx, true); if (ctx->streaming) logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn); else logicalrep_write_prepare(ctx->out, txn, prepare_lsn); OutputPluginWrite(ctx, true); } === Kind Regards, Peter Smith. Fujitsu Australia
FYI - I have cross-checked all the v12 patch code changes against the v12 code coverage resulting from running the patch tests. Those v12 code coverage results were posted in this thread previously [1]. The purpose of this study was to identify if / where there are any gaps in the testing of this patch - e.g. is there some code not currently getting executed? In general I found there seems to be quite high coverage of the normal (not error) code path, but there are a couple of current gaps in the test coverage. For details please find attached the study results. (MS Excel file) === [1] https://www.postgresql.org/message-id/CAHut%2BPt6zB-YffCrMo7%2BZOKn7C2yXkNYnuQTdbStEJJJXZZXaw%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Thu, Oct 29, 2020 at 11:48 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Ajin. > > Looking at v13 patches again I found a couple more review comments: > > === > > (1) COMMENT > File: src/backend/replication/logical/proto.c > Function: logicalrep_write_prepare > + if (rbtxn_commit_prepared(txn)) > + flags = LOGICALREP_IS_COMMIT_PREPARED; > + else if (rbtxn_rollback_prepared(txn)) > + flags = LOGICALREP_IS_ROLLBACK_PREPARED; > + else > + flags = LOGICALREP_IS_PREPARE; > + > + /* Make sure exactly one of the expected flags is set. */ > + if (!PrepareFlagsAreValid(flags)) > + elog(ERROR, "unrecognized flags %u in prepare message", flags); > > Since those flags are directly assigned, I think the subsequent if > (!PrepareFlagsAreValid(flags)) check is redundant. > > === > Updated this. > (2) COMMENT > File: src/backend/replication/logical/proto.c > Function: logicalrep_write_stream_prepare > +/* > + * Write STREAM PREPARE to the output stream. > + * (For stream PREPARE, stream COMMIT PREPARED, stream ROLLBACK PREPARED) > + */ > > I think the function comment is outdated because IIUC the stream > COMMIT PREPARED and stream ROLLBACK PREPARED are not being handled by > the function logicalrep_write_prepare. SInce this approach seems > counter-intuitive there needs to be an improved function comment to > explain what is going on. > > === > > (3) COMMENT > File: src/backend/replication/logical/proto.c > Function: logicalrep_read_stream_prepare > +/* > + * Read STREAM PREPARE from the output stream. > + * (For stream PREPARE, stream COMMIT PREPARED, stream ROLLBACK PREPARED) > + */ > > This is the same as the previous review comment. The function comment > needs to explain the new handling for stream COMMIT PREPARED and > stream ROLLBACK PREPARED. > > === I think that these functions only writing/reading STREAM PREPARE as the name suggests is more intuitive. Maybe the usage of flags is more confusing. More below. > > (4) COMMENT > File: src/backend/replication/logical/proto.c > Function: logicalrep_read_stream_prepare > +TransactionId > +logicalrep_read_stream_prepare(StringInfo in, LogicalRepPrepareData > *prepare_data) > +{ > + TransactionId xid; > + uint8 flags; > + > + xid = pq_getmsgint(in, 4); > + > + /* read flags */ > + flags = pq_getmsgbyte(in); > + > + if (!PrepareFlagsAreValid(flags)) > + elog(ERROR, "unrecognized flags %u in prepare message", flags); > > I think the logicalrep_write_stream_prepare now can only assign the > flags = LOGICALREP_IS_PREPARE. So that means the check here for bad > flags should be changed to match. > BEFORE: if (!PrepareFlagsAreValid(flags)) > AFTER: if (flags != LOGICALREP_IS_PREPARE) > > === Updated. > > (5) COMMENT > General > Since the COMMENTs (2), (3) and (4) are all caused by the refactoring > that was done for removal of the commit/rollback stream callbacks. I > do wonder if it might have worked out better just to leave the > logicalrep_read/write_stream_prepared as it was instead of mixing up > stream/no-stream handling. A check for stream/no-stream could possibly > have been made higher up. > > For example: > static void > pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > XLogRecPtr prepare_lsn) > { > OutputPluginUpdateProgress(ctx); > > OutputPluginPrepareWrite(ctx, true); > if (ctx->streaming) > logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn); > else > logicalrep_write_prepare(ctx->out, txn, prepare_lsn); > OutputPluginWrite(ctx, true); > } > > === I think I'll keep this as such for now. 
Amit was talking about considering removing the flags that overload PREPARE with COMMIT PREPARED and ROLLBACK PREPARED, with separate functions for each. I will wait to see if Amit thinks that is the way to go. I've also added a new test case to test_decoding for streaming 2PC. Removed the function ReorderBufferTxnIsPrepared as it was never called, thanks to Peter's coverage report. And added stream_prepare to the list of callbacks that enable two-phase commits. regards, Ajin Cherian Fujitsu Australia
Attachment
On Tue, Oct 27, 2020 at 3:25 PM Ajin Cherian <itsajin@gmail.com> wrote:
> [v13 patch set]

Few comments on v13-0001-Support-2PC-txn-base. I haven't checked the v14
version of the patches, so if you have already fixed anything, ignore it.

1.
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -10,6 +10,7 @@
 #define REORDERBUFFER_H

 #include "access/htup_details.h"
+#include "access/twophase.h"
 #include "lib/ilist.h"
 #include "storage/sinval.h"
 #include "utils/hsearch.h"
@@ -174,6 +175,9 @@ typedef struct ReorderBufferChange
 #define RBTXN_IS_STREAMED       0x0010
 #define RBTXN_HAS_TOAST_INSERT  0x0020
 #define RBTXN_HAS_SPEC_INSERT   0x0040
+#define RBTXN_PREPARE           0x0080
+#define RBTXN_COMMIT_PREPARED   0x0100
+#define RBTXN_ROLLBACK_PREPARED 0x0200

 /* Does the transaction have catalog changes? */
 #define rbtxn_has_catalog_changes(txn) \
@@ -233,6 +237,24 @@ typedef struct ReorderBufferChange
  ((txn)->txn_flags & RBTXN_IS_STREAMED) != 0 \
 )

+/* Has this transaction been prepared? */
+#define rbtxn_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_PREPARE) != 0 \
+)
+
+/* Has this prepared transaction been committed? */
+#define rbtxn_commit_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_COMMIT_PREPARED) != 0 \
+)
+
+/* Has this prepared transaction been rollbacked? */
+#define rbtxn_rollback_prepared(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_ROLLBACK_PREPARED) != 0 \
+)
+

I think the above changes should be moved to the second patch. There
is no use of these macros in this patch, and moreover they appear to
be out of place.

2.
@@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
  ListCell   *option;
  TestDecodingData *data;
  bool enable_streaming = false;
+ bool enable_2pc = false;

I think it would be better to name this variable enable_two_pc or
enable_twopc.

3.
+ xid = strtoul(strVal(elem->arg), NULL, 0);
+ if (xid == 0 || errno != 0)
+  data->check_xid_aborted = InvalidTransactionId;
+ else
+  data->check_xid_aborted = (TransactionId) xid;
+
+ if (!TransactionIdIsValid(data->check_xid_aborted))
+  ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+    errmsg("check-xid-aborted is not a valid xid: \"%s\"",
+     strVal(elem->arg))));

Can't we write this as below and get rid of the xid variable:

data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
if (!TransactionIdIsValid(data->check_xid_aborted) || errno)
    ereport..

4.
+ /* if check_xid_aborted is a valid xid, then it was passed in
+ * as an option to check if the transaction having this xid would be aborted.
+ * This is to test concurrent aborts.
+ */

Multi-line comments should have an empty first line.

5.
+ <para>
+  The required <function>prepare_cb</function> callback is called whenever
+  a transaction which is prepared for two-phase commit has been
+  decoded. The <function>change_cb</function> callbacks for all modified
+  rows will have been called before this, if there have been any modified
+  rows.
+<programlisting>
+typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn,
+                                        XLogRecPtr prepare_lsn);
+</programlisting>
+ </para>
+ </sect3>
+
+ <sect3 id="logicaldecoding-output-plugin-commit-prepared">
+  <title>Transaction Commit Prepared Callback</title>
+
+  <para>
+   The required <function>commit_prepared_cb</function> callback is called whenever
+   a transaction commit prepared has been decoded. The <parameter>gid</parameter> field,
+   which is part of the <parameter>txn</parameter> parameter, can be used in this
+   callback.

I think the last line "The <parameter>gid</parameter> field, which is
part of the <parameter>txn</parameter> parameter can be used in this
callback." in 'Transaction Commit Prepared Callback' should also be
present in 'Transaction Prepare Callback', as we use the same in the
prepare API as well.

6.
+pg_decode_stream_prepare(LogicalDecodingContext *ctx,
+                         ReorderBufferTXN *txn,
+                         XLogRecPtr prepare_lsn)
+{
+ TestDecodingData *data = ctx->output_plugin_private;
+
+ if (data->skip_empty_xacts && !data->xact_wrote_changes)
+  return;
+
+ OutputPluginPrepareWrite(ctx, true);
+
+ if (data->include_xids)
+  appendStringInfo(ctx->out, "preparing streamed transaction TXN %u", txn->xid);
+ else
+  appendStringInfo(ctx->out, "preparing streamed transaction");

I think we should include 'gid' as well in the above messages.

7.
@@ -221,12 +235,26 @@ StartupDecodingContext(List *output_plugin_options,
  ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
   (ctx->callbacks.stream_stop_cb != NULL) ||
   (ctx->callbacks.stream_abort_cb != NULL) ||
+  (ctx->callbacks.stream_prepare_cb != NULL) ||
   (ctx->callbacks.stream_commit_cb != NULL) ||
   (ctx->callbacks.stream_change_cb != NULL) ||
   (ctx->callbacks.stream_message_cb != NULL) ||
   (ctx->callbacks.stream_truncate_cb != NULL);

 /*
+ * To support two-phase logical decoding, we require prepare/commit-prepare/abort-prepare
+ * callbacks. The filter-prepare callback is optional. We however enable two-phase logical
+ * decoding when at least one of the methods is enabled so that we can easily identify
+ * missing methods.
+ *
+ * We decide it here, but only check it later in the wrappers.
+ */
+ ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
+  (ctx->callbacks.commit_prepared_cb != NULL) ||
+  (ctx->callbacks.rollback_prepared_cb != NULL) ||
+  (ctx->callbacks.filter_prepare_cb != NULL);
+

I think stream_prepare_cb should be checked for the 'twophase' flag
because we won't use this unless two-phase is enabled. Am I missing
something?

--
With Regards,
Amit Kapila.
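A side note on comment 3: strtoul only sets errno on failure and never
clears it, so errno must be reset before the call, or a stale value can
raise a spurious error. A minimal sketch of the parsing step under that
rule (the helper name here is made up for illustration; the actual patch
does this inline in pg_decode_startup, and backend headers are assumed
for TransactionId, TransactionIdIsValid, and ereport):

    #include <errno.h>
    #include <stdlib.h>

    /* Hypothetical helper sketch, not the patch itself. */
    static TransactionId
    parse_check_xid_aborted(const char *arg)
    {
        TransactionId xid;

        errno = 0;              /* strtoul only sets errno on failure */
        xid = (TransactionId) strtoul(arg, NULL, 0);

        if (errno != 0 || !TransactionIdIsValid(xid))
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("check-xid-aborted is not a valid xid: \"%s\"",
                            arg)));

        return xid;
    }

This matches the "errno is checked first" adjustment mentioned in the
reply below.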
On Wed, Oct 21, 2020 at 7:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Comment: I dont think a tablesync worker will use streaming, none of
> > the other stream APIs check this, this might not be relevant for
> > stream_prepare either.
> >
>
> Yes, I think this is right. See pgoutput_startup where we are
> disabling the streaming for init phase. But it is always good to once
> test this and ensure the same.

I have tested this scenario and confirmed that even when the
subscriber is capable of streaming, it does NOT do any streaming
during its tablesync phase.

Kind Regards,
Peter Smith.
Fujitsu Australia
On Thu, Oct 29, 2020 at 11:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Oct 27, 2020 at 3:25 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > [v13 patch set]
>
> Few comments on v13-0001-Support-2PC-txn-base. I haven't checked the v14
> version of the patches, so if you have already fixed anything, ignore it.
>
> 1.
> [reorderbuffer.h RBTXN_* flag and macro additions, quoted above]
>
> I think the above changes should be moved to the second patch. There
> is no use of these macros in this patch, and moreover they appear to
> be out of place.

Moved to the second patch in the patchset.

> 2.
> @@ -127,6 +152,7 @@ pg_decode_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
>   ListCell   *option;
>   TestDecodingData *data;
>   bool enable_streaming = false;
> + bool enable_2pc = false;
>
> I think it would be better to name this variable enable_two_pc or
> enable_twopc.

Renamed it to enable_twophase so that it matches the ctx member
ctx->twophase.

> 3.
> [check_xid_aborted parsing code, quoted above]
>
> Can't we write this as below and get rid of the xid variable:
>
> data->check_xid_aborted = (TransactionId) strtoul(strVal(elem->arg), NULL, 0);
> if (!TransactionIdIsValid(data->check_xid_aborted) || errno)
>     ereport..

Updated. Small change so that errno is checked first.

> 4.
> + /* if check_xid_aborted is a valid xid, then it was passed in
> + * as an option to check if the transaction having this xid would be aborted.
> + * This is to test concurrent aborts.
> + */
>
> Multi-line comments should have an empty first line.

Updated.

> 5.
> [prepare_cb and commit_prepared_cb documentation, quoted above]
>
> I think the last line "The <parameter>gid</parameter> field, which is
> part of the <parameter>txn</parameter> parameter can be used in this
> callback." in 'Transaction Commit Prepared Callback' should also be
> present in 'Transaction Prepare Callback', as we use the same in the
> prepare API as well.

Updated.

> 6.
> [pg_decode_stream_prepare, quoted above]
>
> I think we should include 'gid' as well in the above messages.

Updated.

> 7.
> [StartupDecodingContext streaming/twophase checks, quoted above]
>
> I think stream_prepare_cb should be checked for the 'twophase' flag
> because we won't use this unless two-phase is enabled. Am I missing
> something?

Was fixed in v14.

regards,
Ajin Cherian
Fujitsu Australia
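For reference, a sketch of how the two capability flags in
StartupDecodingContext could look once stream_prepare_cb counts toward
two-phase support rather than streaming, as comment 7 asks (a fragment
only; the surrounding function and the ctx->twophase member are as in
the patch, and the final patch may group these differently):

    /* Streaming support no longer considers stream_prepare_cb ... */
    ctx->streaming = (ctx->callbacks.stream_start_cb != NULL) ||
        (ctx->callbacks.stream_stop_cb != NULL) ||
        (ctx->callbacks.stream_abort_cb != NULL) ||
        (ctx->callbacks.stream_commit_cb != NULL) ||
        (ctx->callbacks.stream_change_cb != NULL) ||
        (ctx->callbacks.stream_message_cb != NULL) ||
        (ctx->callbacks.stream_truncate_cb != NULL);

    /* ... while two-phase support does. */
    ctx->twophase = (ctx->callbacks.prepare_cb != NULL) ||
        (ctx->callbacks.commit_prepared_cb != NULL) ||
        (ctx->callbacks.rollback_prepared_cb != NULL) ||
        (ctx->callbacks.filter_prepare_cb != NULL) ||
        (ctx->callbacks.stream_prepare_cb != NULL);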
Attachment
On Fri, Oct 30, 2020 at 2:46 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Thu, Oct 29, 2020 at 11:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 6.
> > [pg_decode_stream_prepare, quoted above]
> >
> > I think we should include 'gid' as well in the above messages.
>
> Updated.

gid needs to be included in the case of 'include_xids' as well.

> > 7.
> > [StartupDecodingContext streaming/twophase checks, quoted above]
> >
> > I think stream_prepare_cb should be checked for the 'twophase' flag
> > because we won't use this unless two-phase is enabled. Am I missing
> > something?
>
> Was fixed in v14.

But you still have it in the streaming check. I don't think we need
that for the streaming case.

Few other comments on v15-0002-Support-2PC-txn-backend-and-tests:
======================================================================

1. The functions DecodeCommitPrepared and DecodeAbortPrepared have a
lot of code similar to DecodeCommit/Abort. Can we merge these
functions?

2.
DecodeCommitPrepared()
{
..
+ * If filter check present and this needs to be skipped, do a regular commit.
+ */
+ if (ctx->callbacks.filter_prepare_cb &&
+     ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed->twophase_gid))
+ {
+  ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
+                      commit_time, origin_id, origin_lsn);
+ }
+ else
+ {
+  ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
+                              commit_time, origin_id, origin_lsn,
+                              parsed->twophase_gid, true);
+ }
+
+}

Can we expand the comment here to say why we need to do ReorderBufferCommit?

3. There are a lot of test cases in this patch, which is a good thing,
but can we split them into a separate patch for the time being, as I
would like to focus on the core logic of the patch first? We can later
see if we need to retain all or part of those tests.

4. Please run pgindent on your patches.

--
With Regards,
Amit Kapila.
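To make comments 1 and 2 concrete, here is one possible shape for the
COMMIT PREPARED handling inside a merged DecodeCommit, with the expanded
comment that comment 2 asks for. This is only a sketch: it assumes the
patch's ReorderBufferPrepareNeedSkip and ReorderBufferFinishPrepared,
the variables already in scope in DecodeCommit, and a local two_phase
flag that is true iff the record is COMMIT PREPARED and ctx->twophase
is set.

    if (two_phase &&
        !(ctx->callbacks.filter_prepare_cb &&
          ReorderBufferPrepareNeedSkip(ctx->reorder, xid,
                                       parsed->twophase_gid)))
    {
        /*
         * The transaction was decoded and sent downstream at PREPARE
         * time, so only the COMMIT PREPARED itself needs replaying now.
         */
        ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr,
                                    buf->endptr, commit_time, origin_id,
                                    origin_lsn, parsed->twophase_gid,
                                    true);
    }
    else
    {
        /*
         * Either this is a plain commit, or filter_prepare_cb told us
         * to skip the transaction at PREPARE time.  In the latter case
         * none of its changes have been sent yet, so decode and replay
         * the whole transaction as a regular commit here.  That is why
         * the skip path must call ReorderBufferCommit rather than
         * discard the data.
         */
        ReorderBufferCommit(ctx->reorder, xid, buf->origptr, buf->endptr,
                            commit_time, origin_id, origin_lsn);
    }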
On Wed, Oct 28, 2020 at 10:50 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Hi Ajin.
>
> I have re-checked the v13 patches for how my remaining review comments
> have been addressed.
>
> On Tue, Oct 27, 2020 at 8:55 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > > ====================
> > > v12-0002. File: src/backend/replication/logical/reorderbuffer.c
> > > ====================
> > >
> > > COMMENT
> > > Line 2401
> > > /*
> > >  * We are here due to one of the 3 scenarios:
> > >  * 1. As part of streaming in-progress transactions
> > >  * 2. Prepare of a two-phase commit
> > >  * 3. Commit of a transaction.
> > >  *
> > >  * If we are streaming the in-progress transaction then discard the
> > >  * changes that we just streamed, and mark the transactions as
> > >  * streamed (if they contained changes), set prepared flag as false.
> > >  * If part of a prepare of a two-phase commit set the prepared flag
> > >  * as true so that we can discard changes and cleanup tuplecids.
> > >  * Otherwise, remove all the
> > >  * changes and deallocate the ReorderBufferTXN.
> > >  */
> > > ~
> > > The above comment is beyond my understanding. Anything you could do to
> > > simplify it would be good.
> > >
> > > For example, when viewing this function in isolation I have never
> > > understood why the streaming flag and rbtxn_prepared(txn) flag are not
> > > possible to be set at the same time?
> > >
> > > Perhaps the code is relying on just internal knowledge of how this
> > > helper function gets called? And if it is just that, then IMO there
> > > really should be some Asserts in the code to give more assurance about
> > > that. (Or maybe use completely different flags to represent those 3
> > > scenarios instead of bending the meanings of the existing flags)
> >
> > Left this for now, probably re-look at this at a later review.
> > But just to explain; this function is what does the main decoding of
> > changes of a transaction.
> > At what point this decoding happens is what this feature and the
> > streaming in-progress feature is about. As of PG13, this decoding only
> > happens at commit time. With the streaming of in-progress txn feature,
> > this began to happen (if streaming enabled) at the time when the
> > memory limit for decoding transactions was crossed. This 2PC feature
> > is supporting decoding at the time of a PREPARE transaction.
> > Now, if streaming is enabled and streaming has started as a result of
> > crossing the memory threshold, then there is no need to
> > again begin streaming at a PREPARE transaction as the transaction that
> > is being prepared has already been streamed.

I don't think this is true; think of a case where we need to send the
last set of changes along with the PREPARE. In that case we need to
stream those changes at the time of PREPARE. If I am correct, then as
pointed out by Peter, you need to change some comments and some of the
assumptions related to this that you have in the patch.

Few more comments on the latest patch
(v15-0002-Support-2PC-txn-backend-and-tests)
=========================================================================

1.
@@ -274,6 +296,23 @@ DecodeXactOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
   DecodeAbort(ctx, buf, &parsed, xid);
   break;
  }
+ case XLOG_XACT_ABORT_PREPARED:
+ {
..
+
+  if (!TransactionIdIsValid(parsed.twophase_xid))
+   xid = XLogRecGetXid(r);
+  else
+   xid = parsed.twophase_xid;

I think we don't need this 'if' check here because you must have a
valid value of parsed.twophase_xid. But, I think this will be moot if
you address the review comment in my previous email such that the
handling of XLOG_XACT_ABORT_PREPARED and XLOG_XACT_ABORT will be
combined, as it is without the patch.

2.
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+              xl_xact_parsed_prepare *parsed)
+{
..
+ if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
+     (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) ||
+     ctx->fast_forward || FilterByOrigin(ctx, origin_id))
+  return;
+

I think this check is the same as the check in DecodeCommit, so you
can write some comments to indicate the same and also why we don't
need to call ReorderBufferForget here. One more thing to note: even
though we don't need to call ReorderBufferForget here, we still need
to execute invalidations (which are present in the top-level txn) for
the reasons mentioned in ReorderBufferForget. Also, if we do this,
don't forget to update the comment atop
ReorderBufferImmediateInvalidation.

3.
+ /* This is a PREPARED transaction, part of a two-phase commit.
+  * The full cleanup will happen as part of the COMMIT PREPAREDs, so now
+  * just truncate txn by removing changes and tuple_cids
+  */
+ ReorderBufferTruncateTXN(rb, txn, true);

The first line in the multi-line comment should be empty.

--
With Regards,
Amit Kapila.
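A sketch of the skip path in DecodePrepare that point 2 argues for,
assuming the parsed prepare record exposes its invalidation messages as
nmsgs/msgs the way commit records do (an assumption; the final patch
may instead carry the invalidations on the reorder buffer's top-level
txn):

    /* Skip this PREPARE, mirroring the equivalent check in DecodeCommit. */
    if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) ||
        (parsed->dbId != InvalidOid &&
         parsed->dbId != ctx->slot->data.database) ||
        ctx->fast_forward || FilterByOrigin(ctx, origin_id))
    {
        /*
         * Even when skipping, execute the transaction's invalidations:
         * they may be for shared catalogs, so they matter even when the
         * transaction ran in another database.
         */
        if (parsed->nmsgs > 0)
            ReorderBufferImmediateInvalidation(ctx->reorder, parsed->nmsgs,
                                               parsed->msgs);
        return;
    }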
On Mon, Nov 2, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>

Few Comments on v15-0003-Support-2PC-txn-pgoutput
===============================================

1. This patch needs to be rebased after commit 644f0d7cc9 and requires
some adjustments accordingly.

2.
 if (flags != 0)
  elog(ERROR, "unrecognized flags %u in commit message", flags);

+
 /* read fields */
 commit_data->commit_lsn = pq_getmsgint64(in);

Spurious line.

3.
@@ -720,6 +722,7 @@ apply_handle_commit(StringInfo s)
  replorigin_session_origin_timestamp = commit_data.committime;

  CommitTransactionCommand();
+
  pgstat_report_stat(false);

Spurious line.

4.
+static void
+apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
+{
+ Assert(prepare_data->prepare_lsn == remote_final_lsn);
+
+ /* The synchronization worker runs in single transaction. */
+ if (IsTransactionState() && !am_tablesync_worker())
+ {
+  /* End the earlier transaction and start a new one */
+  BeginTransactionBlock();
+  CommitTransactionCommand();
+  StartTransactionCommand();

There is no explanation as to why you want to end the previous
transaction and start a new one. Even if we have to do so, we first
need to call BeginTransactionBlock before CommitTransactionCommand.

5.
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
 {
- TransactionId xid;

Can we have a separate patch for this, as it can be committed before
the main patch? This is a refactoring required for the main patch.

6.
@@ -57,7 +63,8 @@ static void pgoutput_stream_abort(struct LogicalDecodingContext *ctx,
 static void pgoutput_stream_commit(struct LogicalDecodingContext *ctx,
                                    ReorderBufferTXN *txn,
                                    XLogRecPtr commit_lsn);
-
+static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx,
+                                        ReorderBufferTXN *txn, XLogRecPtr prepare_lsn);

Spurious line removal.

--
With Regards,
Amit Kapila.
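On comment 4, the ordering matters because PrepareTransactionBlock
internally ends a transaction block, so the apply worker must first turn
its implicit transaction into an explicit one. A minimal sketch of the
corrected sequence, assuming the patch's LogicalRepPrepareData with a
gid field, and omitting origin bookkeeping and the tablesync special
case (the final code may resolve this differently):

    static void
    apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
    {
        Assert(prepare_data->prepare_lsn == remote_final_lsn);

        /*
         * BeginTransactionBlock is needed to balance the
         * EndTransactionBlock that PrepareTransactionBlock performs;
         * CommitTransactionCommand then completes the Begin command
         * itself.
         */
        BeginTransactionBlock();
        CommitTransactionCommand();

        PrepareTransactionBlock(prepare_data->gid);
        CommitTransactionCommand();
    }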
Hi Amit

I have rebased, split, and addressed (most of) the review comments of
the v15-0003 patch.

So the previous v15-0003 patch is now split into three as follows:
- v16-0001-Support-2PC-txn-spoolfile.patch
- v16-0002-Support-2PC-txn-pgoutput.patch
- v16-0003-Support-2PC-txn-subscriber-tests.patch

PSA.

Of course the previous v15-0001 and v15-0002 are still required before
applying these v16 patches. Later (v17?) we will combine these again
with what Ajin is currently working on to give the full suite of
patches, which will have a consistent version number.

On Tue, Nov 3, 2020 at 4:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Few Comments on v15-0003-Support-2PC-txn-pgoutput
> ===============================================
>
> 1. This patch needs to be rebased after commit 644f0d7cc9 and requires
> some adjustments accordingly.

Done.

> 2.
> [spurious line in commit message parsing, quoted above]

Fixed.

> 3.
> [spurious line in apply_handle_commit, quoted above]

Fixed.

> 4.
> [apply_handle_prepare_txn transaction handling, quoted above]
>
> There is no explanation as to why you want to end the previous
> transaction and start a new one. Even if we have to do so, we first
> need to call BeginTransactionBlock before CommitTransactionCommand.

TODO

> 5.
> [apply_spooled_messages refactoring, quoted above]
>
> Can we have a separate patch for this, as it can be committed before
> the main patch? This is a refactoring required for the main patch.

Done.

> 6.
> [spurious line removal in pgoutput, quoted above]

Fixed.

---
Kind Regards,
Peter Smith.
Fujitsu Australia.
Attachment
On Fri, Oct 30, 2020 at 9:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Oct 30, 2020 at 2:46 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Thu, Oct 29, 2020 at 11:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 6.
> > > [pg_decode_stream_prepare, quoted above]
> > >
> > > I think we should include 'gid' as well in the above messages.
> >
> > Updated.
>
> gid needs to be included in the case of 'include_xids' as well.

Updated.

> > > 7.
> > > [StartupDecodingContext streaming/twophase checks, quoted above]
> > >
> > > I think stream_prepare_cb should be checked for the 'twophase' flag
> > > because we won't use this unless two-phase is enabled. Am I missing
> > > something?
> >
> > Was fixed in v14.
>
> But you still have it in the streaming check. I don't think we need
> that for the streaming case.

Updated.

> Few other comments on v15-0002-Support-2PC-txn-backend-and-tests:
> ======================================================================
>
> 1. The functions DecodeCommitPrepared and DecodeAbortPrepared have a
> lot of code similar to DecodeCommit/Abort. Can we merge these
> functions?

Merged the two functions into DecodeCommit and DecodeAbort.

> 2.
> [DecodeCommitPrepared filter/skip logic, quoted above]
>
> Can we expand the comment here to say why we need to do ReorderBufferCommit?

Updated.

> 3. There are a lot of test cases in this patch, which is a good thing,
> but can we split them into a separate patch for the time being, as I
> would like to focus on the core logic of the patch first? We can later
> see if we need to retain all or part of those tests.

Split the patch and created a new patch for the test_decoding tests.

> 4. Please run pgindent on your patches.

Have not done this. Will do this after unifying the patchset.

regards,
Ajin Cherian
Fujitsu Australia
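For reference, the merged dispatch that results from folding
DecodeAbortPrepared into DecodeAbort might look like the sketch below
(based on the existing DecodeXactOp structure quoted in this thread; the
final boolean argument tells DecodeAbort whether this was a prepared
transaction, and the exact shape in the committed patch may differ):

    case XLOG_XACT_ABORT:
    case XLOG_XACT_ABORT_PREPARED:
        {
            xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(r);
            xl_xact_parsed_abort parsed;
            TransactionId xid;

            ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);

            /*
             * Plain aborts carry no twophase_xid, so fall back to the
             * record's own xid in that case.
             */
            if (!TransactionIdIsValid(parsed.twophase_xid))
                xid = XLogRecGetXid(r);
            else
                xid = parsed.twophase_xid;

            DecodeAbort(ctx, buf, &parsed, xid,
                        info == XLOG_XACT_ABORT_PREPARED);
            break;
        }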
On Mon, Nov 2, 2020 at 9:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 28, 2020 at 10:50 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > [Peter's review comment on the reorderbuffer.c scenario comment, and
> > my explanation of when decoding happens, quoted above]
>
> I don't think this is true; think of a case where we need to send the
> last set of changes along with the PREPARE. In that case we need to
> stream those changes at the time of PREPARE. If I am correct, then as
> pointed out by Peter, you need to change some comments and some of the
> assumptions related to this that you have in the patch.

I have changed the asserts and the comments to reflect this.

> Few more comments on the latest patch
> (v15-0002-Support-2PC-txn-backend-and-tests)
> =========================================================================
>
> 1.
> [XLOG_XACT_ABORT_PREPARED xid check, quoted above]
>
> I think we don't need this 'if' check here because you must have a
> valid value of parsed.twophase_xid. But, I think this will be moot if
> you address the review comment in my previous email such that the
> handling of XLOG_XACT_ABORT_PREPARED and XLOG_XACT_ABORT will be
> combined, as it is without the patch.
>
> 2.
> [DecodePrepare skip check, quoted above]
>
> I think this check is the same as the check in DecodeCommit, so you
> can write some comments to indicate the same and also why we don't
> need to call ReorderBufferForget here. One more thing to note: even
> though we don't need to call ReorderBufferForget here, we still need
> to execute invalidations (which are present in the top-level txn) for
> the reasons mentioned in ReorderBufferForget. Also, if we do this,
> don't forget to update the comment atop
> ReorderBufferImmediateInvalidation.

I have updated the comments. I wasn't sure when to execute
invalidations, though. Should I only execute invalidations if this was
for a database other than the one being decoded, or should I execute
invalidations every time we skip? I will also have to create a new
function in reorderbuffer.c similar to ReorderBufferForget, as the txn
is not available in decode.c (see the sketch below).

> 3.
> [multi-line comment above ReorderBufferTruncateTXN, quoted above]
>
> The first line in the multi-line comment should be empty.

Updated.

regards,
Ajin Cherian
Fujitsu Australia
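One possible shape for that new reorderbuffer.c helper, mirroring what
ReorderBufferForget does. The name and body are illustrative only, not
the final patch; in particular, whether to clean the transaction up here
is one of the decisions the real patch has to make.

    /*
     * Execute the invalidations of a transaction that is being skipped
     * at PREPARE time.  Unlike ReorderBufferForget, the caller only has
     * the xid, so look the transaction up here.
     */
    void
    ReorderBufferInvalidate(ReorderBuffer *rb, TransactionId xid)
    {
        ReorderBufferTXN *txn;

        txn = ReorderBufferTXNByXid(rb, xid, false, NULL,
                                    InvalidXLogRecPtr, false);

        /* unknown transaction, nothing to do */
        if (txn == NULL)
            return;

        /*
         * Process only the cache invalidation messages: they can target
         * shared catalogs, so they matter even if the transaction ran
         * in another database.
         */
        if (txn->ninvalidations > 0)
            ReorderBufferImmediateInvalidation(rb, txn->ninvalidations,
                                               txn->invalidations);
    }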
Attachment
On Wed, Nov 4, 2020 at 3:01 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Mon, Nov 2, 2020 at 9:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 2.
> > [DecodePrepare skip check, quoted above]
> >
> > I think this check is the same as the check in DecodeCommit, so you
> > can write some comments to indicate the same and also why we don't
> > need to call ReorderBufferForget here. One more thing to note: even
> > though we don't need to call ReorderBufferForget here, we still need
> > to execute invalidations (which are present in the top-level txn) for
> > the reasons mentioned in ReorderBufferForget. Also, if we do this,
> > don't forget to update the comment atop
> > ReorderBufferImmediateInvalidation.
>
> I have updated the comments. I wasn't sure when to execute
> invalidations, though. Should I only execute invalidations if this was
> for a database other than the one being decoded, or should I execute
> invalidations every time we skip?

I think so. Is there any such special condition in DecodeCommit, or do
you have some other reason in mind for not doing it every time we
skip? We probably don't need to execute them when the database is
different (at least I can't think of a reason to), but I guess this
doesn't make much difference, and it will keep the code consistent
with what we do in DecodeCommit.

--
With Regards,
Amit Kapila.
On Wed, Nov 4, 2020 at 9:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> [discussion of executing invalidations on every skip, quoted above]
>
> I think so. Is there any such special condition in DecodeCommit, or do
> you have some other reason in mind for not doing it every time we
> skip? We probably don't need to execute them when the database is
> different (at least I can't think of a reason to), but I guess this
> doesn't make much difference, and it will keep the code consistent
> with what we do in DecodeCommit.

I was just basing it on the comments in DecodeCommit:

 * We can't just use ReorderBufferAbort() here, because we need to execute
 * the transaction's invalidations.  This currently won't be needed if
 * we're just skipping over the transaction because currently we only do
 * so during startup, to get to the first transaction the client needs. As
 * we have reset the catalog caches before starting to read WAL, and we
 * haven't yet touched any catalogs, there can't be anything to invalidate.
 * But if we're "forgetting" this commit because it's it happened in
 * another database, the invalidations might be important, because they
 * could be for shared catalogs and we might have loaded data into the
 * relevant syscaches.

regards,
Ajin Cherian
Fujitsu Australia
On Wed, Nov 4, 2020 at 3:46 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Nov 4, 2020 at 9:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I think so. Is there any such special condition in DecodeCommit, or do
> > you have some other reason in mind for not doing it every time we
> > skip? We probably don't need to execute them when the database is
> > different (at least I can't think of a reason to), but I guess this
> > doesn't make much difference, and it will keep the code consistent
> > with what we do in DecodeCommit.
>
> I was just basing it on the comments in DecodeCommit:
>
> [ReorderBufferForget comment, quoted above]

Okay, so it is mentioned in the comment why we need to execute
invalidations even when the database is not the same. So we are
probably good here if we execute the invalidations whenever we skip
decoding the prepared xact.

--
With Regards,
Amit Kapila.
On Wed, Nov 4, 2020 at 9:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, so it is mentioned in the comment why we need to execute
> invalidations even when the database is not the same. So we are
> probably good here if we execute the invalidations whenever we skip
> decoding the prepared xact.

Updated to execute invalidations while skipping prepared transactions.
Also ran pgindent on the source files with updated typedefs. Attaching
v17 with patches 1, 2, and 3.

regards,
Ajin Cherian
Fujitsu Australia
Attachment
> > 4.
> > +static void
> > +apply_handle_prepare_txn(LogicalRepPrepareData *prepare_data)
> > +{
> > + Assert(prepare_data->prepare_lsn == remote_final_lsn);
> > +
> > + /* The synchronization worker runs in single transaction. */
> > + if (IsTransactionState() && !am_tablesync_worker())
> > + {
> > +  /* End the earlier transaction and start a new one */
> > +  BeginTransactionBlock();
> > +  CommitTransactionCommand();
> > +  StartTransactionCommand();
> >
> > There is no explanation as to why you want to end the previous
> > transaction and start a new one. Even if we have to do so, we first
> > need to call BeginTransactionBlock before CommitTransactionCommand.

Done

---

Also, pgindent has been run for all patches now.

The latest versions of all six patches are again reunited with a
common v18 version number.

PSA

Kind Regards,
Peter Smith.
Fujitsu Australia.
Attachment
On Mon, Nov 9, 2020 at 3:23 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> [apply_handle_prepare_txn comment 4 and response, quoted above]
>
> Also, pgindent has been run for all patches now.
>
> The latest versions of all six patches are again reunited with a
> common v18 version number.

I've looked at the patches and done some tests. Here are a comment and
a question that came up during testing and review.

+static void
+DecodePrepare(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
+              xl_xact_parsed_prepare *parsed)
+{
+ XLogRecPtr origin_lsn = parsed->origin_lsn;
+ TimestampTz commit_time = parsed->origin_timestamp;

 static void
 DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
-            xl_xact_parsed_abort *parsed, TransactionId xid)
+            xl_xact_parsed_abort *parsed, TransactionId xid, bool prepared)
 {
  int i;
+ XLogRecPtr origin_lsn = InvalidXLogRecPtr;
+ TimestampTz commit_time = 0;
+ XLogRecPtr origin_id = XLogRecGetOrigin(buf->record);

- for (i = 0; i < parsed->nsubxacts; i++)
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
  {
-  ReorderBufferAbort(ctx->reorder, parsed->subxacts[i],
-                     buf->record->EndRecPtr);
+  origin_lsn = parsed->origin_lsn;
+  commit_time = parsed->origin_timestamp;
  }

In the above two changes, parsed->origin_timestamp is used as
commit_time. But in DecodeCommit() we use parsed->xact_time instead.
Therefore, if a transaction doesn't have replorigin_session_origin,
the timestamp in the logical decoding output generated by
test_decoding with the 'include-timestamp' option is invalid. Is that
intentional?

---
+ if (is_commit)
+  txn->txn_flags |= RBTXN_COMMIT_PREPARED;
+ else
+  txn->txn_flags |= RBTXN_ROLLBACK_PREPARED;
+
+ if (rbtxn_commit_prepared(txn))
+  rb->commit_prepared(rb, txn, commit_lsn);
+ else if (rbtxn_rollback_prepared(txn))
+  rb->rollback_prepared(rb, txn, commit_lsn);

RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED are used only here,
and they don't seem strictly necessary to me.

---
+ /*
+  * If this is COMMIT_PREPARED and the output plugin supports
+  * two-phase commits then set the prepared flag to true.
+  */
+ prepared = ((info == XLOG_XACT_COMMIT_PREPARED) && ctx->twophase) ? true : false;

We can write instead:

prepared = ((info == XLOG_XACT_COMMIT_PREPARED) && ctx->twophase);

+ /*
+  * If this is ABORT_PREPARED and the output plugin supports
+  * two-phase commits then set the prepared flag to true.
+  */
+ prepared = ((info == XLOG_XACT_ABORT_PREPARED) && ctx->twophase) ? true : false;

The same is true here.

---
'git show --check' of v18-0002 reports some warnings.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
On Mon, Nov 9, 2020 at 1:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Nov 9, 2020 at 3:23 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
>
> I've looked at the patches and done some tests. Here are a comment and
> a question that came up during testing and review.
>
> [DecodePrepare/DecodeAbort origin_timestamp changes, quoted above]
>
> In the above two changes, parsed->origin_timestamp is used as
> commit_time. But in DecodeCommit() we use parsed->xact_time instead.
> Therefore, if a transaction doesn't have replorigin_session_origin,
> the timestamp in the logical decoding output generated by
> test_decoding with the 'include-timestamp' option is invalid. Is that
> intentional?

I think all three of DecodePrepare/DecodeAbort/DecodeCommit should
have the same handling for this, with the exception that at
DecodePrepare time we can't rely on XACT_XINFO_HAS_ORIGIN; instead we
need to check whether origin_timestamp is non-zero, and if so
overwrite commit_time with it. Does that make sense to you?

> ---
> [RBTXN_COMMIT_PREPARED / RBTXN_ROLLBACK_PREPARED usage, quoted above]
>
> RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED are used only here,
> and they don't seem strictly necessary to me.

+1.

> ---
> [prepared flag ternaries, quoted above]
>
> We can write instead:
>
> prepared = ((info == XLOG_XACT_COMMIT_PREPARED) && ctx->twophase);
>
> The same is true here.

+1.

> ---
> 'git show --check' of v18-0002 reports some warnings.

I have also noticed this. Actually, I have already started making some
changes to these patches apart from what you have reported, so I'll
take care of these things as well.

--
With Regards,
Amit Kapila.
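A sketch of the rule just agreed for DecodePrepare, assuming the parsed
prepare record carries the prepare timestamp in xact_time the way
commit records do (an assumption; the field layout in the final patch
may differ):

    /*
     * Prepare records have no XACT_XINFO_HAS_ORIGIN, so test
     * origin_timestamp directly and fall back to the record's own
     * timestamp otherwise, matching DecodeCommit's behaviour.
     */
    TimestampTz commit_time = parsed->xact_time;

    if (parsed->origin_timestamp != 0)
        commit_time = parsed->origin_timestamp;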
Hi.

I have re-generated new coverage reports using the current (v18) source. PSA

Note: This is the coverage reported after running only the following tests:
1. make check
2. cd contrib/test_decoding; make check
3. cd src/test/subscription; make check

---
Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
On Mon, Nov 9, 2020 at 8:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 9, 2020 at 1:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > In the above two changes, parsed->origin_timestamp is used as
> > commit_time. But in DecodeCommit() we use parsed->xact_time instead.
> > Therefore, if a transaction doesn't have replorigin_session_origin,
> > the timestamp in the logical decoding output generated by
> > test_decoding with the 'include-timestamp' option is invalid. Is that
> > intentional?
>
> I think all three of DecodePrepare/DecodeAbort/DecodeCommit should
> have the same handling for this, with the exception that at
> DecodePrepare time we can't rely on XACT_XINFO_HAS_ORIGIN; instead we
> need to check whether origin_timestamp is non-zero, and if so
> overwrite commit_time with it. Does that make sense to you?

Yeah, that makes sense to me.

> > 'git show --check' of v18-0002 reports some warnings.
>
> I have also noticed this. Actually, I have already started making some
> changes to these patches apart from what you have reported, so I'll
> take care of these things as well.

Ok.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
FYI - I have cross-checked all the v18 patch code against the v18 code
coverage [1] resulting from running the tests.

The purpose of this study was to identify where there may be any gaps
in the testing of this patch - e.g. is there some v18 code not
currently getting executed by the tests?

I found that almost all of the normal (non-error) code paths are
getting executed.

For details please see the attached study results. (MS Excel file)

===

[1] https://www.postgresql.org/message-id/CAHut%2BPu4BpUr0GfCLqJjXc%3DDcaKSvjDarSN89-4W2nxBeae9hQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
While testing, I found two issues. The first seems to be behaviour that
might be acceptable; the second, not so much. I was using test_decoding
and am not sure how this might behave with the pgoutput plugin.

Test 1: A transaction that is rolled back immediately after the prepare.

SET synchronous_commit = on;
SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
CREATE TABLE stream_test(data text);

-- consume DDL
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'include-xids', '0', 'skip-empty-xacts', '1');

BEGIN;
INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 20) g(i);
PREPARE TRANSACTION 'test1';
ROLLBACK PREPARED 'test1';

SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1',
    'stream-changes', '1');

==================

Here, what is seen is that while the transaction was not decoded at all
(it was rolled back before it could get decoded), the ROLLBACK PREPARED
is decoded anyway. The result is that the standby could get a spurious
ROLLBACK PREPARED. The current code in worker.c does handle this
silently, so this might not be an issue.

Test 2: A transaction that is partially streamed and then prepared.

BEGIN;
INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 800) g(i);
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1',
    'stream-changes', '1');
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1',
    'stream-changes', '1');
PREPARE TRANSACTION 'test1';
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1',
    'stream-changes', '1');
ROLLBACK PREPARED 'test1';

==========================

Here, what is seen is that the transaction is streamed twice: first when
it crosses the memory threshold and is streamed (usually only in the 2nd
pg_logical_slot_get_changes call), and then the same transaction is
streamed again after the prepare. This cannot be right, as it would
result in duplication of data on the standby.

I will be debugging the second issue and will try to arrive at a fix.

regards,
Ajin Cherian
Fujitsu Australia.

On Tue, Nov 10, 2020 at 4:47 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> FYI - I have cross-checked all the v18 patch code against the v18 code
> coverage [1] resulting from running the tests.
>
> The purpose of this study was to identify where there may be any gaps
> in the testing of this patch - e.g. is there some v18 code not
> currently getting executed by the tests?
>
> I found that almost all of the normal (non-error) code paths are
> getting executed.
>
> For details please see the attached study results. (MS Excel file)
>
> ===
>
> [1] https://www.postgresql.org/message-id/CAHut%2BPu4BpUr0GfCLqJjXc%3DDcaKSvjDarSN89-4W2nxBeae9hQ%40mail.gmail.com
>
> Kind Regards,
> Peter Smith.
> Fujitsu Australia
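On Test 1: if spurious ROLLBACK PREPAREDs are deemed acceptable on the
wire, the subscriber side has to keep tolerating them. A sketch of the
kind of guard the apply worker could use; note that both the struct
name and the lookup helper here are hypothetical, not names from the
patch:

    static void
    apply_handle_rollback_prepared_txn(LogicalRepRollbackPreparedData *rollback_data)
    {
        /*
         * If the GID was never prepared on this node (its PREPARE was
         * never decoded upstream, as in Test 1), there is nothing to
         * roll back; silently ignore the message.
         */
        if (!local_gid_is_prepared(rollback_data->gid))   /* hypothetical lookup */
            return;

        StartTransactionCommand();
        FinishPreparedTransaction(rollback_data->gid, false);   /* ROLLBACK PREPARED */
        CommitTransactionCommand();
    }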
On Mon, Nov 9, 2020 at 1:38 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> I've looked at the patches and done some tests. Here are a comment and
> a question that came up during testing and review.
>
> [DecodePrepare/DecodeAbort origin_timestamp changes, quoted above]
>
> In the above two changes, parsed->origin_timestamp is used as
> commit_time. But in DecodeCommit() we use parsed->xact_time instead.
> Therefore, if a transaction doesn't have replorigin_session_origin,
> the timestamp in the logical decoding output generated by
> test_decoding with the 'include-timestamp' option is invalid. Is that
> intentional?

Changed as discussed.

> ---
> [RBTXN_COMMIT_PREPARED / RBTXN_ROLLBACK_PREPARED usage, quoted above]
>
> RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED are used only here,
> and they don't seem strictly necessary to me.

These are used in v18-0005-Support-2PC-txn-pgoutput, so I don't think
we can directly remove them.

> ---
> [prepared flag ternaries, quoted above]
>
> We can write instead:
>
> prepared = ((info == XLOG_XACT_COMMIT_PREPARED) && ctx->twophase);
>
> The same is true here.

I have changed this code so that we can determine whether the
transaction is already decoded at prepare time before calling
DecodeCommit/DecodeAbort, so these checks are gone now, and I think
that makes the code look a bit cleaner.

Apart from this, I have changed v19-0001-Support-2PC-txn-base such that
it displays xid and gid consistently in all APIs. In
v19-0002-Support-2PC-txn-backend, apart from fixing the above comments,
I have rearranged the code in DecodeCommit/Abort/Prepare so that it
does only the required things (e.g., DecodeCommit was still processing
subtxns even when it had to just perform FinishPrepared, and the stats
were not updated properly, which I have fixed) and added/edited the
comments. Apart from 0001 and 0002, I have not changed anything in the
remaining patches.

--
With Regards,
Amit Kapila.
Attachment
I did some further tests on the problem I saw, and it does not have
anything to do with this patch. I used code from the top of HEAD. If I
have enough changes in a transaction to initiate streaming, then it
will also stream the same changes after a commit.

BEGIN;
INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM generate_series(1, 800) g(i);
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1',
    'stream-changes', '1');
** see streamed output here **
END;
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1',
    'stream-changes', '1');
** see the same streamed output here **

I think this is because, since the transaction has not been committed,
SnapBuildCommitTxn is not called, which is what moves
builder->start_decoding_at; as a result, later calls to
pg_logical_slot_get_changes will start from the previous LSN.

I did a quick test with pgoutput using pub/sub and I don't see
duplication of data there, but I haven't verified exactly what happens.

regards,
Ajin Cherian
Fujitsu Australia
The subscriber tests are updated to include test cases for "cascading" pub/sub scenarios, i.e. NODE_A publisher => subscriber NODE_B publisher => subscriber NODE_C. PSA only the modified v20-0006 patch (the other 5 patches remain unchanged). Kind Regards, Peter Smith. Fujitsu Australia.
Attachment
On Wed, Nov 11, 2020 at 12:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: I have rearranged the code in DecodeCommit/Abort/Prepare so > that it does only the required things (for example, DecodeCommit was still > processing subtxns even when it only had to perform FinishPrepared, > and the stats were not updated properly, which I have fixed) and > added/edited the comments. Apart from 0001 and 0002, I have not > changed anything in the remaining patches. One small comment on the patch: - DecodeCommit(ctx, buf, &parsed, xid); + /* + * If we have already decoded this transaction data then + * DecodeCommit doesn't need to decode it again. This is + * possible iff output plugin supports two-phase commits and + * doesn't skip the transaction at prepare time. + */ + if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase) + { + already_decoded = !(ctx->callbacks.filter_prepare_cb && + ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid)); + } + Just a small nitpick, but the way already_decoded is assigned here is a bit misleading. It appears that the callbacks determine if the transaction is already decoded, when in reality they only decide whether the transaction should skip two-phase commits. I think it's better to either move it into the if condition or, if that is too long, have one more variable, skip_twophase. if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase && !(ctx->callbacks.filter_prepare_cb && ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid))) already_decoded = true; OR bool skip_twophase = false; skip_twophase = !(ctx->callbacks.filter_prepare_cb && ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid)); if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase && skip_twophase) already_decoded = true; regards, Ajin Cherian Fujitsu Australia
On Thu, Nov 12, 2020 at 2:28 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Wed, Nov 11, 2020 at 12:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I have rearranged the code in DecodeCommit/Abort/Prepare so > > that it does only the required things (for example, DecodeCommit was still > > processing subtxns even when it only had to perform FinishPrepared, > > and the stats were not updated properly, which I have fixed) and > > added/edited the comments. Apart from 0001 and 0002, I have not > > changed anything in the remaining patches. > > One small comment on the patch: > > - DecodeCommit(ctx, buf, &parsed, xid); > + /* > + * If we have already decoded this transaction data then > + * DecodeCommit doesn't need to decode it again. This is > + * possible iff output plugin supports two-phase commits and > + * doesn't skip the transaction at prepare time. > + */ > + if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase) > + { > + already_decoded = !(ctx->callbacks.filter_prepare_cb && > + ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid)); > + } > + > > Just a small nitpick, but the way already_decoded is assigned here is a > bit misleading. It appears that the callbacks determine if the > transaction is already decoded, when in reality they only decide > whether the transaction should skip two-phase commits. I think it's > better to either move it into the if condition or, if that is too long, > have one more variable, skip_twophase. > > if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase && > !(ctx->callbacks.filter_prepare_cb && > ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid))) > already_decoded = true; > > OR > bool skip_twophase = false; > skip_twophase = !(ctx->callbacks.filter_prepare_cb && > ReorderBufferPrepareNeedSkip(ctx->reorder, xid, parsed.twophase_gid)); > if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase && skip_twophase) > already_decoded = true; > Hmm, introducing an additional boolean variable for this doesn't seem like a good idea, nor does the other alternative you suggested. How about changing the comment to make this clear, say: "If the output plugin supports two-phase commits and doesn't skip the transaction at prepare time, then we don't need to decode the transaction data at commit prepared time as it would have already been decoded at prepare time."? -- With Regards, Amit Kapila.
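To make the agreed-upon wording concrete, here is a minimal sketch of how the check and comment could read together (based on the fragment quoted above; the surrounding code in the final patch may differ):

    /*
     * If the output plugin supports two-phase commits and doesn't skip
     * the transaction at prepare time, then we don't need to decode the
     * transaction data at commit prepared time as it would have already
     * been decoded at prepare time.
     */
    if (info == XLOG_XACT_COMMIT_PREPARED && ctx->twophase)
        already_decoded = !(ctx->callbacks.filter_prepare_cb &&
                            ReorderBufferPrepareNeedSkip(ctx->reorder, xid,
                                                         parsed.twophase_gid));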
On Tue, Nov 10, 2020 at 4:19 PM Ajin Cherian <itsajin@gmail.com> wrote: > > I was doing some testing, and I found some issues. Two issues. The > first one seems to be a behaviour that might be acceptable, the > second one not so much. > I was using test_decoding, not sure how this might behave with the > pgoutput plugin. > > Test 1: > A transaction that is immediately rolled back after the prepare. > > SET synchronous_commit = on; > SELECT 'init' FROM > pg_create_logical_replication_slot('regression_slot', > 'test_decoding'); > CREATE TABLE stream_test(data text); > -- consume DDL > SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, > NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); > > BEGIN; > INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM > generate_series(1, 20) g(i); > PREPARE TRANSACTION 'test1'; > ROLLBACK PREPARED 'test1'; > SELECT data FROM pg_logical_slot_get_changes('regression_slot', > NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', > 'skip-empty-xacts', '1', 'stream-changes', '1'); > ================== > > Here, what is seen is that while the transaction was not decoded at > all since it was rolled back before it could get decoded, the ROLLBACK > PREPARED is actually decoded. > The result being that the standby could get a spurious ROLLBACK > PREPARED. The current code in worker.c does handle this silently. So, > this might not be an issue. > Yeah, this seems okay because it is quite possible that such a rollback is encountered after processing a few records, in which case sending the rollback is required. This can happen when the rollback is issued concurrently while we are decoding the prepare. If the output plugin wants, it can detect that the transaction has not written any data and ignore the rollback; we already do something similar in test_decoding. So, I think this should be fine. -- With Regards, Amit Kapila.
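As a rough sketch of how an output plugin can detect and ignore such an empty rollback (modeled on test_decoding's skip-empty-xacts handling; the callback name and the xact_wrote_changes flag are assumptions here, not necessarily the exact names in the patch):

    static void
    pg_decode_rollback_prepared_txn(LogicalDecodingContext *ctx,
                                    ReorderBufferTXN *txn,
                                    XLogRecPtr abort_lsn)
    {
        TestDecodingData *data = ctx->output_plugin_private;

        /* Nothing was emitted for this xact, so silently skip the rollback. */
        if (data->skip_empty_xacts && !data->xact_wrote_changes)
            return;

        OutputPluginPrepareWrite(ctx, true);
        appendStringInfo(ctx->out, "ROLLBACK PREPARED %s", txn->gid);
        OutputPluginWrite(ctx, true);
    }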
On Fri, Nov 13, 2020 at 9:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 12, 2020 at 2:28 PM Ajin Cherian <itsajin@gmail.com> wrote: > Hmm, introducing an additional boolean variable for this doesn't seem > like a good idea neither the other alternative suggested by you. How > about if we change the comment to make it clear. How about: "If output > plugin supports two-phase commits and doesn't skip the transaction at > prepare time then we don't need to decode the transaction data at > commit prepared time as it would have already been decoded at prepare > time."? Yes, that works for me. regards, Ajin Cherian Fujitsu Australia
On Wed, Nov 11, 2020 at 4:30 PM Ajin Cherian <itsajin@gmail.com> wrote: > > Did some further tests on the problem I saw and I see that it does not > have anything to do with this patch. I picked code from the top of HEAD. > If I have enough changes in a transaction to initiate streaming, then > it will also stream the same changes after a commit. > > BEGIN; > INSERT INTO stream_test SELECT repeat('a', 10) || g.i FROM > generate_series(1,800) g(i); > SELECT data FROM pg_logical_slot_get_changes('regression_slot', > NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', > 'skip-empty-xacts', '1', 'stream-changes', '1'); > ** see streamed output here ** > END; > SELECT data FROM pg_logical_slot_get_changes('regression_slot', > NULL,NULL, 'two-phase-commit', '1', 'include-xids', '0', > 'skip-empty-xacts', '1', 'stream-changes', '1'); > ** see the same streamed output here ** > > I think this is because, since the transaction has not been committed, > SnapBuildCommitTxn is not called, which is what moves the > "builder->start_decoding_at", and as a result > later calls to pg_logical_slot_get_changes will start from the > previous lsn. > No, we always move start_decoding_at after streaming changes. It is moved because we advance the confirmed_flush location after streaming all the changes (via LogicalConfirmReceivedLocation()), and that location is used to set 'start_decoding_at' when we create the decoding context (CreateDecodingContext) next time. However, we don't advance 'restart_lsn', due to which decoding starts from the same point and accumulates all the changes for the transaction each time. Now, after the commit we get an extra record which is ahead of 'start_decoding_at', and when we try to decode it, we get all the changes of the transaction. We might want to update the documentation for pg_logical_slot_get_changes() to indicate this, but I don't think it is a problem. > I did do a quick test in pgoutput using pub/sub and I > don't see duplication of data there but I haven't > verified exactly what happens. > Yeah, because there we always move ahead in terms of WAL locations, unless the subscriber/publisher is restarted, in which case it should start from the required location. But still, we can try to see if there is any bug. -- With Regards, Amit Kapila.
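One way to observe this behaviour from SQL is to watch the slot's positions between calls; confirmed_flush_lsn advances after the streamed changes are consumed, while restart_lsn can stay behind until the transaction commits (a sketch using the standard pg_replication_slots view):

    SELECT slot_name, restart_lsn, confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE slot_name = 'regression_slot';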
Updated with a new test case (contrib/test_decoding/t/002_twophase-streaming.pl) that tests concurrent aborts during streaming prepare. Had to make a few changes to the test_decoding stream_start callbacks to handle "check-xid-aborted" the same way it was handled in the non-stream callbacks. Merged Peter's v20-0006 as well. regards, Ajin Cherian Fujitsu Australia
Attachment
On Mon, Nov 16, 2020 at 4:25 PM Ajin Cherian <itsajin@gmail.com> wrote: > > Updated with a new test case > (contrib/test_decoding/t/002_twophase-streaming.pl) that tests > concurrent aborts during streaming prepare. Had to make a few changes > to the test_decoding stream_start callbacks to handle > "check-xid-aborted" > the same way it was handled in the non-stream callbacks. Merged > Peter's v20-0006 as well. > > Thank you for updating the patch. > > I have a question about the timestamp of PREPARE on a subscriber node, although this may have already been discussed. > > With the current patch, the timestamps of PREPARE are different between the publisher and the subscriber, but the timestamps of their commits are the same. For example, > > -- There is 1 prepared transaction on a publisher node. > =# select * from pg_prepared_xacts; > > transaction | gid | prepared | owner | database > -------------+-----+-------------------------------+----------+---------- > 510 | h1 | 2020-11-16 16:57:13.438633+09 | masahiko | postgres > (1 row) > > -- This prepared transaction is replicated to a subscriber. > =# select * from pg_prepared_xacts; > > transaction | gid | prepared | owner | database > -------------+-----+-------------------------------+----------+---------- > 514 | h1 | 2020-11-16 16:57:13.440593+09 | masahiko | postgres > (1 row) > > These timestamps are different. Let's commit the prepared transaction 'h1' on the publisher and check the commit timestamps on both nodes. > > -- On the publisher node. > =# select pg_xact_commit_timestamp('510'::xid); > > pg_xact_commit_timestamp > ------------------------------- > 2020-11-16 16:57:13.474275+09 > (1 row) > > -- Commit prepared is also replicated to the subscriber node. > =# select pg_xact_commit_timestamp('514'::xid); > > pg_xact_commit_timestamp > ------------------------------- > 2020-11-16 16:57:13.474275+09 > (1 row) > > The commit timestamps are the same. At PREPARE we use the local timestamp at which PREPARE is executed as the 'prepared' time, while at COMMIT PREPARED we use the origin's commit timestamp as the commit timestamp if the commit WAL record has one. This behaviour made me consider the possibility that, if the clock of the publisher is behind, then on the subscriber node the timestamp of COMMIT PREPARED (i.e., the return value from pg_xact_commit_timestamp()) could be smaller than the timestamp of PREPARE (i.e., 'prepared' in pg_prepared_xacts). I don't think it is a critical issue, but it might be worth discussing the behaviour. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
On Mon, Nov 16, 2020 at 3:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Nov 16, 2020 at 4:25 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > Updated with a new test case > > (contrib/test_decoding/t/002_twophase-streaming.pl) that tests > > concurrent aborts during streaming prepare. Had to make a few changes > > to the test_decoding stream_start callbacks to handle > > "check-xid-aborted" > > the same way it was handled in the non-stream callbacks. Merged > > Peter's v20-0006 as well. > > > > Thank you for updating the patch. > > I have a question about the timestamp of PREPARE on a subscriber node, > although this may have already been discussed. > > With the current patch, the timestamps of PREPARE are different > between the publisher and the subscriber, but the timestamps of their > commits are the same. For example, > > -- There is 1 prepared transaction on a publisher node. > =# select * from pg_prepared_xacts; > > transaction | gid | prepared | owner | database > -------------+-----+-------------------------------+----------+---------- > 510 | h1 | 2020-11-16 16:57:13.438633+09 | masahiko | postgres > (1 row) > > -- This prepared transaction is replicated to a subscriber. > =# select * from pg_prepared_xacts; > > transaction | gid | prepared | owner | database > -------------+-----+-------------------------------+----------+---------- > 514 | h1 | 2020-11-16 16:57:13.440593+09 | masahiko | postgres > (1 row) > > These timestamps are different. Let's commit the prepared transaction > 'h1' on the publisher and check the commit timestamps on both nodes. > > -- On the publisher node. > =# select pg_xact_commit_timestamp('510'::xid); > > pg_xact_commit_timestamp > ------------------------------- > 2020-11-16 16:57:13.474275+09 > (1 row) > > -- Commit prepared is also replicated to the subscriber node. > =# select pg_xact_commit_timestamp('514'::xid); > > pg_xact_commit_timestamp > ------------------------------- > 2020-11-16 16:57:13.474275+09 > (1 row) > > The commit timestamps are the same. At PREPARE we use the local > timestamp at which PREPARE is executed as the 'prepared' time, while at COMMIT > PREPARED we use the origin's commit timestamp as the commit timestamp > if the commit WAL record has one. > Doesn't this happen only if you set replication origins? Because otherwise both PrepareTransaction() and RecordTransactionCommitPrepared() use the current timestamp. -- With Regards, Amit Kapila.
On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Doesn't this happen only if you set replication origins? Because > otherwise both PrepareTransaction() and > RecordTransactionCommitPrepared() use the current timestamp. > I was also checking this; even if you set replication origins, the preparedTransaction will reflect the local prepare time in pg_prepared_xacts. pg_prepared_xacts fetches this information from the GlobalTransaction data, which does not store the origin_timestamp; it only stores prepared_at, which is the local timestamp. The WAL record does have the origin_timestamp, but that is not updated in the GlobalTransaction data structure typedef struct xl_xact_prepare { uint32 magic; /* format identifier */ uint32 total_len; /* actual file length */ TransactionId xid; /* original transaction XID */ Oid database; /* OID of database it was in */ TimestampTz prepared_at; /* time of preparation */ <=== this is local time and updated in GlobalTransaction Oid owner; /* user running the transaction */ int32 nsubxacts; /* number of following subxact XIDs */ int32 ncommitrels; /* number of delete-on-commit rels */ int32 nabortrels; /* number of delete-on-abort rels */ int32 ninvalmsgs; /* number of cache invalidation messages */ bool initfileinval; /* does relcache init file need invalidation? */ uint16 gidlen; /* length of the GID - GID follows the header */ XLogRecPtr origin_lsn; /* lsn of this record at origin node */ TimestampTz origin_timestamp; /* time of prepare at origin node */ <=== this is the time at origin which is not updated in GlobalTransaction } xl_xact_prepare; regards, Ajin Cherian Fujitsu Australia
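To illustrate why 'prepared' is always local: PrepareTransaction() takes the wall-clock time directly and records it in the GlobalTransaction entry, roughly along these lines (a simplified sketch of the relevant xact.c/twophase.c steps, not the exact code):

    /* In PrepareTransaction(): the prepare time is always taken locally; */
    /* replorigin_session_origin_timestamp is not consulted here. */
    prepared_at = GetCurrentTimestamp();

    /* This is what pg_prepared_xacts later reports as 'prepared'. */
    gxact = MarkAsPreparing(xid, prepareGID, prepared_at,
                            GetUserId(), MyDatabaseId);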
On Tue, Nov 17, 2020 at 5:02 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > Doesn't this happen only if you set replication origins? Because > > otherwise both PrepareTransaction() and > > RecordTransactionCommitPrepared() use the current timestamp. > > > > I was also checking this; even if you set replication origins, the > preparedTransaction will reflect the local prepare time in > pg_prepared_xacts. pg_prepared_xacts fetches this information > from the GlobalTransaction data, which does not store the origin_timestamp; > it only stores prepared_at, which is the local timestamp. > Sure, but my question was: does this difference in behavior happen without replication origins in any way? The reason is that if it occurs only with replication origins, I don't think we need to bother about it, because that feature is not properly implemented and not used as-is. See the discussion [1] [2]. OTOH, if this behavior can happen without replication origins then we might want to consider changing it. [1] - https://www.postgresql.org/message-id/064fab0c-915e-aede-c02e-bd4ec1f59732%402ndquadrant.com [2] - https://www.postgresql.org/message-id/188d15be-8699-c045-486a-f0439c9c2b7d%402ndquadrant.com -- With Regards, Amit Kapila.
On Tue, Nov 17, 2020 at 9:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Nov 17, 2020 at 5:02 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > Doesn't this happen only if you set replication origins? Because > > > otherwise both PrepareTransaction() and > > > RecordTransactionCommitPrepared() use the current timestamp. > > > > > > > I was also checking this; even if you set replication origins, the > > preparedTransaction will reflect the local prepare time in > > pg_prepared_xacts. pg_prepared_xacts fetches this information > > from the GlobalTransaction data, which does not store the origin_timestamp; > > it only stores prepared_at, which is the local timestamp. > > > > Sure, but my question was: does this difference in behavior happen > without replication origins in any way? The reason is that if it > occurs only with replication origins, I don't think we need to bother > about it, because that feature is not properly implemented and > not used as-is. See the discussion [1] [2]. OTOH, if this behavior can > happen without replication origins then we might want to consider > changing it. Logical replication workers always have replication origins, right? Is that what you meant by 'with replication origins'? IIUC logical replication workers always set the origin's commit timestamp as the commit timestamp of the replicated transaction. OTOH, the timestamp of PREPARE, ‘prepare’ of pg_prepared_xacts, always uses the local timestamp even if the caller of PrepareTransaction() sets replorigin_session_origin_timestamp. In terms of user-visible timestamps of transaction operations, I think users might expect these timestamps to match between the origin and its subscribers. But pg_xact_commit_timestamp() is a function of the commit timestamp feature, whereas ‘prepare’ is purely the timestamp at which the transaction was prepared. So I’m not sure these timestamps really need to match, though. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
On Wed, Nov 18, 2020 at 7:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Nov 17, 2020 at 9:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Nov 17, 2020 at 5:02 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > Doesn't this happen only if you set replication origins? Because > > > > otherwise both PrepareTransaction() and > > > > RecordTransactionCommitPrepared() use the current timestamp. > > > > > > > > > > I was also checking this; even if you set replication origins, the > > > preparedTransaction will reflect the local prepare time in > > > pg_prepared_xacts. pg_prepared_xacts fetches this information > > > from the GlobalTransaction data, which does not store the origin_timestamp; > > > it only stores prepared_at, which is the local timestamp. > > > > > > > Sure, but my question was: does this difference in behavior happen > > without replication origins in any way? The reason is that if it > > occurs only with replication origins, I don't think we need to bother > > about it, because that feature is not properly implemented and > > not used as-is. See the discussion [1] [2]. OTOH, if this behavior can > > happen without replication origins then we might want to consider > > changing it. > > Logical replication workers always have replication origins, right? Is > that what you meant by 'with replication origins'? > I was thinking with respect to the publisher side, but you are right that logical apply workers always have replication origins, so the effect will be visible; however, I think the same should be true on the publisher without this patch as well. Say the user has set up a replication origin via pg_replication_origin_xact_setup and provided a timestamp value; then the same behavior will be there too. > IIUC logical replication workers always set the origin's commit > timestamp as the commit timestamp of the replicated transaction. OTOH, > the timestamp of PREPARE, ‘prepare’ of pg_prepared_xacts, always uses > the local timestamp even if the caller of PrepareTransaction() sets > replorigin_session_origin_timestamp. In terms of user-visible > timestamps of transaction operations, I think users might expect these > timestamps to match between the origin and its subscribers. But > pg_xact_commit_timestamp() is a function of the commit timestamp > feature, whereas ‘prepare’ is purely the timestamp at which the transaction was > prepared. So I’m not sure these timestamps really need to match, > though. > Yeah, I am not sure if it is a good idea for users to rely on this, especially if the same behavior is visible on the publisher as well. We might want to think separately about whether there is value in making the prepare time also rely on replorigin_session_origin_timestamp, and if so, that can be done as a separate patch. What do you think? -- With Regards, Amit Kapila.
On Mon, Nov 16, 2020 at 12:55 PM Ajin Cherian <itsajin@gmail.com> wrote: > > Updated with a new test case > (contrib/test_decoding/t/002_twophase-streaming.pl) that tests > concurrent aborts during streaming prepare. Had to make a few changes > to the test_decoding stream_start callbacks to handle > "check-xid-aborted" > the same way it was handled in the non-stream callbacks. > Why did you make a change in the stream_start API? I think it should be the *_change and *_truncate APIs, because the concurrent abort can happen while decoding any intermediate change. If you agree then you can probably take that code into a separate function and call it from the respective APIs. In 0003, contrib/test_decoding/t/002_twophase-streaming.pl | 102 +++++++++ The naming of the file seems to be inconsistent with other files. It should be 002_twophase_streaming.pl. Other than this, please find attached the rebased patch set; it needed a rebase after the latest commit 9653f24ad8307f393de51e0a64d9b10a49efa6e3. -- With Regards, Amit Kapila.
Attachment
Hi. Using a tablesync debugging technique as described in another mail thread [1][2] I have caused the tablesync worker to handle (e.g. apply_dispatch) a 2PC PREPARE This exposes a problem with the current 2PC logic because if/when the PREPARE is processed by the tablesync worker then the txn will end up being COMMITTED, even though the 2PC PREPARE has not yet been COMMIT PREPARED by the publisher. For example, below is some logging (using my patch [2]) which shows this occurring: --- [postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub;" 2020-11-18 17:00:37.394 AEDT [15885] LOG: logical decoding found consistent point at 0/16EF840 2020-11-18 17:00:37.394 AEDT [15885] DETAIL: There are no running transactions. 2020-11-18 17:00:37.394 AEDT [15885] STATEMENT: CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT NOTICE: created replication slot "tap_sub" on publisher CREATE SUBSCRIPTION 2020-11-18 17:00:37.407 AEDT [15886] LOG: logical replication apply worker for subscription "tap_sub" has started 2020-11-18 17:00:37.407 AEDT [15886] LOG: !!>> The apply worker process has PID = 15886 2020-11-18 17:00:37.415 AEDT [15887] LOG: starting logical decoding for slot "tap_sub" 2020-11-18 17:00:37.415 AEDT [15887] DETAIL: Streaming transactions committing after 0/16EF878, reading WAL from 0/16EF840. 2020-11-18 17:00:37.415 AEDT [15887] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') 2020-11-18 17:00:37.415 AEDT [15887] LOG: logical decoding found consistent point at 0/16EF840 2020-11-18 17:00:37.415 AEDT [15887] DETAIL: There are no running transactions. 2020-11-18 17:00:37.415 AEDT [15887] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') 2020-11-18 17:00:37.415 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:37.415 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:00:37.421 AEDT [15889] LOG: logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has started 2020-11-18 17:00:37.421 AEDT [15889] LOG: !!>> The tablesync worker process has PID = 15889 2020-11-18 17:00:37.421 AEDT [15889] LOG: !!>> Sleeping 30 secs. For debugging, attach to process 15889 now! [postgres@CentOS7-x64 ~]$ 2020-11-18 17:00:38.431 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:38.431 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:00:39.433 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:39.433 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:00:40.437 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:40.437 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:00:41.439 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:41.439 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:00:42.441 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:42.441 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:00:43.442 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:00:43.442 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables -- etc. 
2020-11-18 17:01:03.520 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:03.520 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:04.521 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:04.521 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:05.523 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:05.523 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:06.532 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:06.532 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:07.426 AEDT [15889] LOG: !!>> tablesync worker: About to call LogicalRepSyncTableStart to do initial syncing 2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:07.536 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:08.539 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:08.539 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:09.541 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:09.541 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables -- etc. 2020-11-18 17:01:23.583 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:23.583 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:24.584 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:24.584 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:25.586 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:25.586 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:26.586 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:26.586 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:27.454 AEDT [17456] LOG: logical decoding found consistent point at 0/16EF878 2020-11-18 17:01:27.454 AEDT [17456] DETAIL: There are no running transactions. 
2020-11-18 17:01:27.454 AEDT [17456] STATEMENT: CREATE_REPLICATION_SLOT "tap_sub_24582_sync_16385" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT 2020-11-18 17:01:27.456 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:27.457 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:01:27.465 AEDT [15889] LOG: !!>> tablesync worker: wait for CATCHUP state notification 2020-11-18 17:01:27.465 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:01:27.465 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables #### Here, while the tablesync worker is paused in the debugger I execute the PREPARE txn on publisher psql -d test_pub -c "BEGIN;INSERT INTO test_tab VALUES(1, 'foo');PREPARE TRANSACTION 'test_prepared_tab';" PREPARE TRANSACTION 2020-11-18 17:01:54.732 AEDT [15887] LOG: !!>> pgoutput_begin_txn 2020-11-18 17:01:54.732 AEDT [15887] CONTEXT: slot "tap_sub", output plugin "pgoutput", in the begin callback, associated LSN 0/16EF8B0 2020-11-18 17:01:54.732 AEDT [15887] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') #### And then in the debugger I let the tablesync worker continue... 2020-11-18 17:02:02.788 AEDT [15889] LOG: !!>> tablesync worker: received CATCHUP state notification 2020-11-18 17:02:07.729 AEDT [15889] LOG: !!>> tablesync worker: Returned from LogicalRepSyncTableStart 2020-11-18 17:02:16.284 AEDT [17456] LOG: starting logical decoding for slot "tap_sub_24582_sync_16385" 2020-11-18 17:02:16.284 AEDT [17456] DETAIL: Streaming transactions committing after 0/16EF8B0, reading WAL from 0/16EF878. 2020-11-18 17:02:16.284 AEDT [17456] STATEMENT: START_REPLICATION SLOT "tap_sub_24582_sync_16385" LOGICAL 0/16EF8B0 (proto_version '2', publication_names '"tap_pub"') 2020-11-18 17:02:16.284 AEDT [17456] LOG: logical decoding found consistent point at 0/16EF878 2020-11-18 17:02:16.284 AEDT [17456] DETAIL: There are no running transactions. 2020-11-18 17:02:16.284 AEDT [17456] STATEMENT: START_REPLICATION SLOT "tap_sub_24582_sync_16385" LOGICAL 0/16EF8B0 (proto_version '2', publication_names '"tap_pub"') 2020-11-18 17:02:16.284 AEDT [17456] LOG: !!>> pgoutput_begin_txn 2020-11-18 17:02:16.284 AEDT [17456] CONTEXT: slot "tap_sub_24582_sync_16385", output plugin "pgoutput", in the begin callback, associated LSN 0/16EF8B0 2020-11-18 17:02:16.284 AEDT [17456] STATEMENT: START_REPLICATION SLOT "tap_sub_24582_sync_16385" LOGICAL 0/16EF8B0 (proto_version '2', publication_names '"tap_pub"') 2020-11-18 17:02:40.346 AEDT [15889] LOG: !!>> tablesync worker: LogicalRepApplyLoop #### The tablesync worker processes the replication messages.... 
2020-11-18 17:02:47.992 AEDT [15889] LOG: !!>> tablesync worker: apply_dispatch for message kind 'B' 2020-11-18 17:02:54.858 AEDT [15889] LOG: !!>> tablesync worker: apply_dispatch for message kind 'R' 2020-11-18 17:02:56.082 AEDT [15889] LOG: !!>> tablesync worker: apply_dispatch for message kind 'I' 2020-11-18 17:02:56.082 AEDT [15889] LOG: !!>> tablesync worker: should_apply_changes_for_rel: true 2020-11-18 17:02:57.354 AEDT [15889] LOG: !!>> tablesync worker: apply_dispatch for message kind 'P' 2020-11-18 17:02:57.354 AEDT [15889] LOG: !!>> tablesync worker: called process_syncing_tables 2020-11-18 17:02:59.011 AEDT [15889] LOG: logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has finished #### SInce the tablesync was "ahead", the apply worker now needs to skip those same messages #### Notice should_apply_changes_for_rel() is false #### Then apply worker just waits for next messages.... 2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker: apply_dispatch for message kind 'B' 2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker: apply_dispatch for message kind 'R' 2020-11-18 17:02:59.064 AEDT [15886] LOG: !!>> apply worker: apply_dispatch for message kind 'I' 2020-11-18 17:02:59.065 AEDT [15886] LOG: !!>> apply worker: should_apply_changes_for_rel: false 2020-11-18 17:02:59.065 AEDT [15886] LOG: !!>> apply worker: apply_dispatch for message kind 'P' 2020-11-18 17:02:59.067 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:02:59.067 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:03:00.071 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:03:00.071 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:03:01.073 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:03:01.073 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:03:02.075 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:03:02.075 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:03:03.080 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:03:03.080 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:03:04.081 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:03:04.082 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables 2020-11-18 17:03:05.103 AEDT [15886] LOG: !!>> apply worker: LogicalRepApplyLoop 2020-11-18 17:03:05.103 AEDT [15886] LOG: !!>> apply worker: called process_syncing_tables etc ... #### At this point there is a problem because the tablesync worker has COMMITTED that PREPARED INSERT. #### See the subscriber node has ONE record but the publisher node has NONE! [postgres@CentOS7-x64 ~]$ psql -d test_pub -c "SELECT count(*) FROM test_tab;" count ------- 0 (1 row) [postgres@CentOS7-x64 ~]$ [postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "SELECT count(*) FROM test_tab;" count ------- 1 (1 row) [postgres@CentOS7-x64 ~]$ ----- [1] - https://www.postgresql.org/message-id/CAHut%2BPsprtsa4o89wtNnKLxxwXeDKAX9nNsdghT1Pv63siz%2BAA%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAHut%2BPt4PyKQCwqzQ%3DEFF%3DbpKKJD7XKt_S23F6L20ayQNxg77A%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Nov 18, 2020 at 1:18 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi. > > Using a tablesync debugging technique as described in another mail > thread [1][2] I have caused the tablesync worker to handle (e.g. > apply_dispatch) a 2PC PREPARE > > This exposes a problem with the current 2PC logic because if/when the > PREPARE is processed by the tablesync worker then the txn will end up > being COMMITTED, even though the 2PC PREPARE has not yet been COMMIT > PREPARED by the publisher. > IIUC, this is the problem with the patch being discussed here, right? Because before this patch we don't decode at prepare time. -- With Regards, Amit Kapila.
On Wed, Nov 18, 2020 at 7:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Nov 18, 2020 at 1:18 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Hi. > > > > Using a tablesync debugging technique as described in another mail > > thread [1][2] I have caused the tablesync worker to handle (e.g. > > apply_dispatch) a 2PC PREPARE > > > > This exposes a problem with the current 2PC logic because if/when the > > PREPARE is processed by the tablesync worker then the txn will end up > > being COMMITTED, even though the 2PC PREPARE has not yet been COMMIT > > PREPARED by the publisher. > > > > IIUC, this is the problem with the patch being discussed here, right? > Because before this we won't decode at Prepare time. Correct. This is new. Kind Regards, Peter Smith. Fujitsu Australia.
> Why did you make a change in the stream_start API? I think it should be > the *_change and *_truncate APIs, because the concurrent abort can happen > while decoding any intermediate change. If you agree then you can > probably take that code into a separate function and call it from the > respective APIs. > Patch 0001: Updated this from stream_start to stream_change. I haven't updated *_truncate as the test case written for this does not include a truncate. Also created a new function for this: test_concurrent_aborts(). > In 0003, > contrib/test_decoding/t/002_twophase-streaming.pl | 102 +++++++++ > > The naming of the file seems to be inconsistent with other files. It > should be 002_twophase_streaming.pl Patch 0003: Changed accordingly. Patch 0002: I've updated a comment that got muddled up while applying pgindent in reorderbuffer.c. regards, Ajin Cherian Fujitsu Australia
Attachment
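For reference, a minimal sketch of what such a shared helper could look like (the exact implementation in the attached patch may differ; check_xid_aborted follows the option name discussed earlier):

    /*
     * If the caller passed check-xid-aborted, wait until that transaction
     * has actually aborted so that decoding reliably hits the
     * concurrent-abort path.
     */
    static void
    test_concurrent_aborts(TestDecodingData *data)
    {
        if (!TransactionIdIsValid(data->check_xid_aborted))
            return;

        while (!TransactionIdDidAbort(data->check_xid_aborted))
        {
            CHECK_FOR_INTERRUPTS();
            pg_usleep(10000L);  /* 10ms */
        }
    }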
On Thu, Nov 19, 2020 at 11:27 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > Why did you make a change in stream_start API? I think it should be > > *_change and *_truncate APIs because the concurrent abort can happen > > while decoding any intermediate change. If you agree then you can > > probably take that code into a separate function and call it from the > > respective APIs. > > > Patch 0001: > Updated this from stream_start to stream_change. I haven't updated > *_truncate as the test case written for this does not include a > truncate. > I think the same check should be there in truncate as well to make the APIs consistent and also one can use it for writing another test that has a truncate operation. -- With Regards, Amit Kapila.
On Thu, Nov 19, 2020 at 5:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > I think the same check should be there in truncate as well to make the > APIs consistent and also one can use it for writing another test that > has a truncate operation. Updated the checks in both truncate callbacks (stream and non-stream). Also added a test case for testing concurrent aborts while decoding streaming TRUNCATE. regards, Ajin Cherian Fujitsu Australia
Attachment
On Thu, Nov 19, 2020 at 2:52 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Thu, Nov 19, 2020 at 5:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I think the same check should be there in truncate as well to make the > > APIs consistent and also one can use it for writing another test that > > has a truncate operation. > > Updated the checks in both truncate callbacks (stream and non-stream). > Also added a test case for testing concurrent aborts while decoding > streaming TRUNCATE. > While reviewing/editing the code in 0002-Support-2PC-txn-backend, I came across the following code which seems dubious to me. 1. + /* + * If streaming, reset the TXN so that it is allowed to stream + * remaining data. Streaming can also be on a prepared txn, handle + * it the same way. + */ + if (streaming) + { + elog(LOG, "stopping decoding of %u",txn->xid); + ReorderBufferResetTXN(rb, txn, snapshot_now, + command_id, prev_lsn, + specinsert); + } + else + { + elog(LOG, "stopping decoding of %s (%u)", + txn->gid != NULL ? txn->gid : "", txn->xid); + ReorderBufferTruncateTXN(rb, txn, true); + } Why do we need to handle the prepared txn case differently here? I think for both cases we can call ReorderBufferResetTXN, as it frees the memory we should free in exceptions. Sure, there is some code (like stream_stop and saving the snapshot for the next run) in ReorderBufferResetTXN which needs to be called only when we are streaming the txn, but otherwise it seems it can be used here. We can easily identify whether the transaction is streamed to differentiate that code path. Can you think of any other reason for not doing so? 2. +void +ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid, + XLogRecPtr commit_lsn, XLogRecPtr end_lsn, + TimestampTz commit_time, + RepOriginId origin_id, XLogRecPtr origin_lsn, + char *gid, bool is_commit) +{ + ReorderBufferTXN *txn; + + /* + * The transaction may or may not exist (during restarts for example). + * Anyway, two-phase transactions do not contain any reorderbuffers. So + * allow it to be created below. + */ + txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, + true); Why should we allow creating a new transaction here; in other words, in which cases won't the txn be present? I guess this was needed with the earlier version of the patch, where we were cleaning up the ReorderBufferTXN at prepare time. -- With Regards, Amit Kapila.
On Wed, Nov 18, 2020 at 12:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Nov 18, 2020 at 7:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Nov 17, 2020 at 9:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Nov 17, 2020 at 5:02 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > > > On Tue, Nov 17, 2020 at 10:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > Doesn't this happen only if you set replication origins? Because > > > > > otherwise both PrepareTransaction() and > > > > > RecordTransactionCommitPrepared() used the current timestamp. > > > > > > > > > > > > > I was also checking this, even if you set replicating origins, the > > > > preparedTransaction will reflect the local prepare time in > > > > pg_prepared_xacts. pg_prepared_xacts fetches this information > > > > from GlobalTransaction data which does not store the origin_timestamp; > > > > it only stores the prepared_at which is the local timestamp. > > > > > > > > > > Sure, but my question was does this difference in behavior happens > > > without replication origins in any way? The reason is that if it > > > occurs only with replication origins, I don't think we need to bother > > > about the same because that feature is not properly implemented and > > > not used as-is. See the discussion [1] [2]. OTOH, if this behavior can > > > happen without replication origins then we might want to consider > > > changing it. > > > > Logical replication workers always have replication origins, right? Is > > that what you meant 'with replication origins'? > > > > I was thinking with respect to the publisher-side but you are right > that logical apply workers always have replication origins so the > effect will be visible but I think the same should be true on > publisher without this patch as well. Say, the user has set up > replication origin via pg_replication_origin_xact_setup and provided a > value of timestamp then also the same behavior will be there. Right. > > > IIUC logical replication workers always set the origin's commit > > timestamp as the commit timestamp of the replicated transaction. OTOH, > > the timestamp of PREPARE, ‘prepare’ of pg_prepared_xacts, always uses > > the local timestamp even if the caller of PrepareTransaction() sets > > replorigin_session_origin_timestamp. In terms of user-visible > > timestamps of transaction operations, I think users might expect these > > timestamps are matched between the origin and its subscribers. But the > > pg_xact_commit_timestamp() is a function of the commit timestamp > > feature whereas ‘prepare’ is a pure timestamp when the transaction is > > prepared. So I’m not sure these timestamps really need to be matched, > > though. > > > > Yeah, I am not sure if it is a good idea for users to rely on this > especially if the same behavior is visible on the publisher as well. > We might want to think separately if there is a value in making > prepare-time to also rely on replorigin_session_origin_timestamp and > if so, that can be done as a separate patch. What do you think? I agree that we can think about it separately. If it's necessary we can make a patch later. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
On Fri, Nov 20, 2020 at 12:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 19, 2020 at 2:52 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Thu, Nov 19, 2020 at 5:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > I think the same check should be there in truncate as well to make the > > > APIs consistent and also one can use it for writing another test that > > > has a truncate operation. > > > > Updated the checks in both truncate callbacks (stream and non-stream). > > Also added a test case for testing concurrent aborts while decoding > > streaming TRUNCATE. > > > > While reviewing/editing the code in 0002-Support-2PC-txn-backend, I > came across the following code which seems dubious to me. > > 1. > + /* > + * If streaming, reset the TXN so that it is allowed to stream > + * remaining data. Streaming can also be on a prepared txn, handle > + * it the same way. > + */ > + if (streaming) > + { > + elog(LOG, "stopping decoding of %u",txn->xid); > + ReorderBufferResetTXN(rb, txn, snapshot_now, > + command_id, prev_lsn, > + specinsert); > + } > + else > + { > + elog(LOG, "stopping decoding of %s (%u)", > + txn->gid != NULL ? txn->gid : "", txn->xid); > + ReorderBufferTruncateTXN(rb, txn, true); > + } > > Why do we need to handle the prepared txn case differently here? I > think for both cases we can call ReorderBufferResetTXN, as it frees the > memory we should free in exceptions. Sure, there is some code (like > stream_stop and saving the snapshot for the next run) in > ReorderBufferResetTXN which needs to be called only when we are > streaming the txn, but otherwise it seems it can be used here. We can > easily identify whether the transaction is streamed to differentiate that > code path. Can you think of any other reason for not doing so? Yes, I agree that ReorderBufferResetTXN needs to be called to free up memory after an exception. I will change ReorderBufferResetTXN so that it has an extra parameter indicating streaming, so that stream_stop and saving of the snapshot are only done when streaming. > > 2. > +void > +ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid, > + XLogRecPtr commit_lsn, XLogRecPtr end_lsn, > + TimestampTz commit_time, > + RepOriginId origin_id, XLogRecPtr origin_lsn, > + char *gid, bool is_commit) > +{ > + ReorderBufferTXN *txn; > + > + /* > + * The transaction may or may not exist (during restarts for example). > + * Anyway, two-phase transactions do not contain any reorderbuffers. So > + * allow it to be created below. > + */ > + txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, > + true); > > Why should we allow creating a new transaction here; in other words > in which cases won't the txn be present? I guess this was needed > with the earlier version of the patch, where we were cleaning up the > ReorderBufferTXN at prepare time. Just confirmed this; yes, you are right. Even after a restart, the transaction does get created again prior to this, so we need not create it here. I will change this as well. regards, Ajin Cherian Fujitsu Australia
On Fri, Nov 20, 2020 at 7:54 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Nov 18, 2020 at 12:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > IIUC logical replication workers always set the origin's commit > > > timestamp as the commit timestamp of the replicated transaction. OTOH, > > > the timestamp of PREPARE, ‘prepare’ of pg_prepared_xacts, always uses > > > the local timestamp even if the caller of PrepareTransaction() sets > > > replorigin_session_origin_timestamp. In terms of user-visible > > > timestamps of transaction operations, I think users might expect these > > > timestamps are matched between the origin and its subscribers. But the > > > pg_xact_commit_timestamp() is a function of the commit timestamp > > > feature whereas ‘prepare’ is a pure timestamp when the transaction is > > > prepared. So I’m not sure these timestamps really need to be matched, > > > though. > > > > > > > Yeah, I am not sure if it is a good idea for users to rely on this > > especially if the same behavior is visible on the publisher as well. > > We might want to think separately if there is a value in making > > prepare-time to also rely on replorigin_session_origin_timestamp and > > if so, that can be done as a separate patch. What do you think? > > I agree that we can think about it separately. If it's necessary we > can make a patch later. > Thanks for the confirmation. Your review and suggestions are quite helpful. -- With Regards, Amit Kapila.
On Fri, Nov 20, 2020 at 9:12 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Fri, Nov 20, 2020 at 12:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Nov 19, 2020 at 2:52 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > On Thu, Nov 19, 2020 at 5:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > I think the same check should be there in truncate as well to make the > > > > APIs consistent and also one can use it for writing another test that > > > > has a truncate operation. > > > > > > Updated the checks in both truncate callbacks (stream and non-stream). > > > Also added a test case for testing concurrent aborts while decoding > > > streaming TRUNCATE. > > > > > > While reviewing/editing the code in 0002-Support-2PC-txn-backend, I > > came across the following code which seems dubious to me. > > > > 1. > > + /* > > + * If streaming, reset the TXN so that it is allowed to stream > > + * remaining data. Streaming can also be on a prepared txn, handle > > + * it the same way. > > + */ > > + if (streaming) > > + { > > + elog(LOG, "stopping decoding of %u",txn->xid); > > + ReorderBufferResetTXN(rb, txn, snapshot_now, > > + command_id, prev_lsn, > > + specinsert); > > + } > > + else > > + { > > + elog(LOG, "stopping decoding of %s (%u)", > > + txn->gid != NULL ? txn->gid : "", txn->xid); > > + ReorderBufferTruncateTXN(rb, txn, true); > > + } > > > > Why do we need to handle the prepared txn case differently here? I > > think for both cases we can call ReorderBufferResetTXN, as it frees the > > memory we should free in exceptions. Sure, there is some code (like > > stream_stop and saving the snapshot for the next run) in > > ReorderBufferResetTXN which needs to be called only when we are > > streaming the txn, but otherwise it seems it can be used here. We can > > easily identify whether the transaction is streamed to differentiate that > > code path. Can you think of any other reason for not doing so? > > Yes, I agree that ReorderBufferResetTXN needs to be called > to free up memory after an exception. > I will change ReorderBufferResetTXN so that it has an extra > parameter indicating streaming, so that stream_stop and saving > of the snapshot are only done when streaming. > I've already made the changes for this in the patch; you can verify them when I share the new version. We don't need to pass an extra parameter; rbtxn_prepared()/rbtxn_is_streamed() should serve the need. > > > > 2. > > +void > > +ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid, > > + XLogRecPtr commit_lsn, XLogRecPtr end_lsn, > > + TimestampTz commit_time, > > + RepOriginId origin_id, XLogRecPtr origin_lsn, > > + char *gid, bool is_commit) > > +{ > > + ReorderBufferTXN *txn; > > + > > + /* > > + * The transaction may or may not exist (during restarts for example). > > + * Anyway, two-phase transactions do not contain any reorderbuffers. So > > + * allow it to be created below. > > + */ > > + txn = ReorderBufferTXNByXid(rb, xid, true, NULL, commit_lsn, > > + true); > > > > Why should we allow creating a new transaction here; in other words > > in which cases won't the txn be present? I guess this was needed > > with the earlier version of the patch, where we were cleaning up the > > ReorderBufferTXN at prepare time. > > Just confirmed this; yes, you are right. Even after a restart, the > transaction does get created again prior to this, so we need not > create it here. I will change this as well. > I'll take care of it along with other changes.
Thanks for the confirmation. -- With Regards, Amit Kapila.
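For illustration, the unified error-path handling could then look roughly like this (a sketch only; rbtxn_prepared() is a macro introduced by the patch, while rbtxn_is_streamed() already exists in reorderbuffer.h):

    /* In the PG_CATCH() block of transaction replay: */
    if (rbtxn_is_streamed(txn) || rbtxn_prepared(txn))
    {
        /*
         * Discard the remaining changes and free memory. Inside,
         * stream-only work (stream_stop, saving the snapshot for the
         * next run) is gated on rbtxn_is_streamed(txn).
         */
        ReorderBufferResetTXN(rb, txn, snapshot_now,
                              command_id, prev_lsn,
                              specinsert);
    }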
On Fri, Nov 20, 2020 at 2:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I'll take care of it along with other changes. > > Thanks for the confirmation. > Ok, meanwhile I've split the patches to move the check_xid_aborted test cases, as well as the supporting code, out into a separate patch. The new 0007 patch is for this. regards, Ajin
Attachment
- v24-0001-Support-2PC-txn-base.patch
- v24-0004-Support-2PC-txn-spoolfile.patch
- v24-0005-Support-2PC-txn-pgoutput.patch
- v24-0002-Support-2PC-txn-backend.patch
- v24-0003-Support-2PC-test-cases-for-test_decoding.patch
- v24-0007-2pc-test-cases-for-testing-concurrent-aborts.patch
- v24-0006-Support-2PC-txn-subscriber-tests.patch
On Fri, Nov 20, 2020 at 4:54 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Fri, Nov 20, 2020 at 2:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I'll take care of it along with other changes. > > > > Thanks for the confirmation. > > > > Ok, meanwhile I've split the patches to move the > check_xid_aborted test cases, as well as the supporting code, > out into a separate patch. The new 0007 patch is for this. > This makes sense to me, but it should have been 0004 in the series. I have changed the order in the attached. I have updated 0002-Support-2PC-txn-backend and 0004-2pc-test-cases-for-testing-concurrent-aborts. The changes are: 1. As mentioned previously, used ReorderBufferResetTXN to deal with concurrent aborts both in the case of streamed and prepared txns. 2. There was no clear explanation as to why we are not skipping DecodePrepare in the presence of concurrent aborts. I have added the explanation of the same atop DecodePrepare() and at various other places. 3. Added/edited comments at various places in the code and made some other changes, like simplifying the code at a few places. 4. Changed the function name ReorderBufferCommitInternal to ReorderBufferReplay, as that seems more appropriate. 5. In ReorderBufferReplay() (which was previously ReorderBufferCommitInternal), the patch was doing cleanup of the TXN even for prepared transactions, which is not consistent with what we do at other places in the patch, so I changed that. 6. In 2pc-test-cases-for-testing-concurrent-aborts, changed one of the log messages based on the changes in the Support-2PC-txn-backend patch. I am planning to continue reviewing these patches, but I thought it better to check about the above changes before proceeding further. Let me know what you think. -- With Regards, Amit Kapila.
Attachment
- v25-0001-Support-2PC-txn-base.patch
- v25-0002-Support-2PC-txn-backend.patch
- v25-0003-Support-2PC-test-cases-for-test_decoding.patch
- v25-0004-2pc-test-cases-for-testing-concurrent-aborts.patch
- v25-0005-Support-2PC-txn-spoolfile.patch
- v25-0006-Support-2PC-txn-pgoutput.patch
- v25-0007-Support-2PC-txn-subscriber-tests.patch
On Sun, Nov 22, 2020 at 12:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I am planning to continue reviewing these patches, but I thought it > better to check about the above changes before proceeding further. Let > me know what you think. > I've had a look at the changes and done a few tests, and have no comments. However, I did see that the test 002_twophase_streaming.pl failed once. I've run it at least 30 times after that but haven't seen it fail again. Unfortunately my ulimit was not set up to create dumps, so I don't have a dump from when the test case failed. I will continue testing and reviewing the changes. regards, Ajin Cherian Fujitsu Australia
On Mon, Nov 23, 2020 at 3:41 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Sun, Nov 22, 2020 at 12:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I am planning to continue reviewing these patches, but I thought it > > better to check about the above changes before proceeding further. Let > > me know what you think. > > > > I've had a look at the changes and done a few tests, and have no > comments. > Okay, thanks. Additionally, I have analyzed whether we need to call SnapBuildCommitTxn in DecodePrepare, as was raised earlier for this patch [1]. As mentioned in that thread, SnapBuildCommitTxn primarily does three things: (a) Decide whether we are interested in tracking the current txn's effects and, if we are, mark it as committed. (b) Build and distribute a snapshot to all RBTXNs, if it is important. (c) Set the base snap of our xact if it did DDL, to execute invalidations during replay. For the first two: as the xact is still not visible to others, we don't need to make it behave like a committed txn. To make the (DDL) changes visible to the current txn, the message REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID copies the snapshot, which fills the subxip array. This will be sufficient to make the changes visible to the current txn. For the third, I have checked the code: whenever we have any change message, the base snapshot gets set via SnapBuildProcessChange. It is possible that I have missed something, but I don't want to call SnapBuildCommitTxn in DecodePrepare unless we have a clear reason, so I'm leaving it for now. Can you or someone see any reason for the same? > However, I did see that the test 002_twophase_streaming.pl > failed once. I've run it at least 30 times after that but haven't seen > it fail again. > This test is based on waiting to see some message in the log. It is possible it failed due to a timeout, which should happen only rarely. You can check some failure logs in the test_decoding folder (probably in the tmp_check folder). Even if we get some server or test log, it can help us to diagnose the problem. [1] - https://www.postgresql.org/message-id/87zhxrwgvh.fsf%40ars-thinkpad -- With Regards, Amit Kapila.
FYI - I have regenerated a new v26 set of patches. PSA.

v26-0001 - no change
v26-0002 - no change
v26-0003 - only filename changed (for consistency)
v26-0004 - only filename changed (for consistency)
v26-0005 - no change
v26-0006 - minor code change to have more consistently located calls to process_syncing_tables
v26-0007 - no change

--- Kind Regards Peter Smith. Fujitsu Australia.
Attachment
- v26-0001-Support-2PC-txn-base.patch
- v26-0002-Support-2PC-txn-backend.patch
- v26-0003-Support-2PC-txn-tests-for-test_decoding.patch
- v26-0004-Support-2PC-txn-tests-for-concurrent-aborts.patch
- v26-0005-Support-2PC-txn-spoolfile.patch
- v26-0006-Support-2PC-txn-pgoutput.patch
- v26-0007-Support-2PC-txn-subscriber-tests.patch
On Mon, Nov 23, 2020 at 10:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > For the first two, as the xact is still not visible to others, we > don't need to make it behave like a committed txn. [...] For the > third, I have checked the code, and whenever we have any change > message the base snapshot gets set via SnapBuildProcessChange. It is > possible that I have missed something, but I don't want to call > SnapbuildCommittedTxn in DecodePrepare unless we have a clear reason > for it, so I am leaving it for now. Can you or someone else see any > reason for it?

I reviewed and tested this and, as you said, SnapBuildProcessChange sets the base snapshot for every change. I did various tests using DDL updates and haven't seen any issues so far. I agree with your analysis.

regards, Ajin
Hi Amit. IIUC the tablesync worker runs in a single transaction. Last week I discovered and described [1] a problem where, if (by unlucky timing) the tablesync worker gets to handle the 2PC PREPARE TRANSACTION, that whole single transaction gets committed, even though a COMMIT PREPARED has not been executed yet. i.e. it means that if the publisher subsequently does a ROLLBACK PREPARED, the table records on the Pub/Sub nodes will no longer match. AFAIK this is a new problem for the current WIP patch, because prior to this patch the PREPARE was not decoded at all. Please let me know if this issue description is still not clear. Do you have any thoughts on how we might address this issue? --- [1] https://www.postgresql.org/message-id/CAHut%2BPuEMk4SO8oGzxc_ftzPkGA8uC-y5qi-KRqHSy_P0i30DA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Nov 25, 2020 at 12:54 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Amit. > > IIUC the tablesync worker runs in a single transaction. > > Last week I discovered and described [1] a problem where, if (by > unlucky timing) the tablesync worker gets to handle the 2PC PREPARE > TRANSACTION, that whole single transaction gets committed, even though > a COMMIT PREPARED has not been executed yet. i.e. it means that if the > publisher subsequently does a ROLLBACK PREPARED, the table records on > the Pub/Sub nodes will no longer match. > > AFAIK this is a new problem for the current WIP patch, because prior to > this patch the PREPARE was not decoded at all. > > Please let me know if this issue description is still not clear. > > Do you have any thoughts on how we might address this issue? >

I think we need to disable two_phase_commit for tablesync workers. We anyway wanted to expose a parameter via subscription for that, and we can use it to do this. Also, there were some other comments [1] related to the tablesync worker w.r.t. prepared transactions which would possibly be addressed by doing this. Kindly check those comments [1] and let me know if anything additional is required.

[1] - https://www.postgresql.org/message-id/87zhxrwgvh.fsf%40ars-thinkpad -- With Regards, Amit Kapila.
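PS: a rough sketch of the walreceiver-side change I have in mind, assuming the streaming-options struct grows a two-phase field and the subscription catalog grows a matching flag (both names here are illustrative only, not final):

/* Hypothetical sketch; the two_phase field and the subscription flag
 * are illustrative names, not part of the current code. */
options.logical = true;
options.startpoint = origin_startpos;
options.slotname = slotname;
options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
options.proto.logical.publication_names = MySubscription->publications;

/*
 * A tablesync worker copies and applies everything inside one local
 * transaction, so it must never stop at PREPARE; keep two-phase
 * decoding off for it regardless of the subscription setting.
 */
options.proto.logical.two_phase =
	MySubscription->twophase && !am_tablesync_worker();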
On Tue, Nov 24, 2020 at 3:29 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Mon, Nov 23, 2020 at 10:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > [...] > > I reviewed and tested this and, as you said, SnapBuildProcessChange > sets the base snapshot for every change. > I did various tests using DDL updates and haven't seen any issues so > far. I agree with your analysis. >

Thanks, attached is a further revised version of the patch series.

Changes in v27-0001-Extend-the-output-plugin-API-to-allow-decoding-p
a. Removed the includes which are not required by this patch.
b. Moved the 'check_xid_aborted' parameter to 0004.
c. Added Assert(!ctx->fast_forward); in the callback wrappers, because we won't load the output plugin when fast_forward is set, so there is no chance that we call the output plugin APIs. This is why we have this Assert in all the existing APIs.
d. Adjusted the order of the various callback APIs to make the code look consistent.
e. Added/edited comments and doc updates at various places. Changed error messages to make them consistent with other similar messages.
f. Some other cosmetic changes, like the removal of spurious new lines and white-space fixes.
g. Updated the commit message.

Changes in v27-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer
a. Moved the check of whether a particular txn can be skipped into a separate function, as the same code was repeated at three different places.
b. ReorderBufferPrepare had a parameter named commit_lsn whereas it should be prepare_lsn. Similar changes have been made at various places in the patch.
c. The filter_prepare_cb callback's existence was checked both in decode.c and in filter_prepare_cb_wrapper. Fixed by removing it from decode.c.
d. Fixed miscellaneous comments and some cosmetic issues.
e. Moved the special elog in ReorderBufferProcessTxn to test concurrent aborts into the 0004 patch.
f. Moved the changes related to the flags RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED to patch 0006 as those are used only in that patch.
g. Updated the commit message.

One problem with this patch is: what if we have assembled a consistent snapshot after the prepare and before the commit prepared? In that case, it will currently just send the commit prepared record, which would be a bad idea. It should decode the entire transaction for such cases at commit prepared time. This same problem was noticed by Arseny Sher; see problem-2 in email [1]. One idea to fix this could be to check whether the snapshot is consistent to decide whether to skip the prepare, and if we skip for that reason, then during commit we need to decode the entire transaction. We can do that by setting a flag in txn->txn_flags when we skip the prepare because the snapshot is still not consistent, and then using it during commit to see if we need to decode the entire transaction. But here we need to think about what would happen after a restart. Basically, if it is possible that after a restart the snapshot is consistent for the same transaction at prepare time but the prepare gets skipped due to start_decoding_at (which moved ahead after the restart), then such a solution won't work. Any thoughts on this?

v27-0004-Support-2PC-txn-tests-for-concurrent-aborts
a. Moved the changes related to the testing of concurrent aborts into this patch from other patches.

v27-0006-Support-2PC-txn-pgoutput
a. Moved the changes related to the flags RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED from another patch.
b. Included the headers required by this patch; previously it seems to have depended on other patches for them.

The other patches remain unchanged. Let me know what you think about these changes.

[1] - https://www.postgresql.org/message-id/877el38j56.fsf%40ars-thinkpad -- With Regards, Amit Kapila.
Attachment
- v27-0001-Extend-the-output-plugin-API-to-allow-decoding-p.patch
- v27-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch
- v27-0003-Support-2PC-txn-tests-for-test_decoding.patch
- v27-0004-Support-2PC-txn-tests-for-concurrent-aborts.patch
- v27-0005-Support-2PC-txn-spoolfile.patch
- v27-0006-Support-2PC-txn-pgoutput.patch
- v27-0007-Support-2PC-txn-subscriber-tests.patch
On Wed, Nov 25, 2020 at 11:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > The other patches remain unchanged. > > Let me know what you think about these changes.

Thanks, I will look at the patch and let you know my thoughts on it. Before that, I am sharing a new patchset with an additional patch that includes documentation changes for two-phase commit support in logical decoding. I have also updated the example section of Logical Decoding with examples that use two-phase commits. I have just added the documentation patch as the 8th one and renamed the other patches; I have not changed anything in them.

regards, Ajin Cherian Fujitsu Australia
Attachment
- v28-0001-Extend-the-output-plugin-API-to-allow-decoding-p.patch
- v28-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch
- v28-0003-Support-2PC-txn-tests-for-test_decoding.patch
- v28-0004-Support-2PC-txn-tests-for-concurrent-aborts.patch
- v28-0005-Support-2PC-txn-spoolfile.patch
- v28-0006-Support-2PC-txn-pgoutput.patch
- v28-0007-Support-2PC-txn-subscriber-tests.patch
- v28-0008-Support-2PC-documentation.patch
On Wed, Nov 25, 2020 at 11:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > One problem with this patch is: what if we have assembled a consistent > snapshot after the prepare and before the commit prepared? In that > case, it will currently just send the commit prepared record, which > would be a bad idea. It should decode the entire transaction for such > cases at commit prepared time. This same problem was noticed by Arseny > Sher; see problem-2 in email [1].

I'm not sure I understand how you could assemble a consistent snapshot after a prepare but before the commit prepared. Doesn't a consistent snapshot require that all in-progress transactions commit? I've tried starting a new subscription after a prepare on the publisher, and I see that the create subscription just hangs till the transaction on the publisher is either committed or rolled back. Even if I try to create a replication slot using pg_create_logical_replication_slot when a transaction has been prepared but not yet committed, it just hangs till the transaction is committed/aborted.

regards, Ajin Cherian Fujitsu Australia
On Thu, Nov 26, 2020 at 4:24 PM Ajin Cherian <itsajin@gmail.com> wrote: > > I'm not sure I understand how you could assemble a consistent snapshot > after a prepare but before the commit prepared. > Doesn't a consistent snapshot require that all in-progress > transactions commit? >

By the above, I don't mean that the transaction is not committed. I am talking about the timing in the WAL. It is possible that between the WAL of the Prepare and of the Commit Prepared, we reach a consistent state.

> I've tried starting a new subscription after > a prepare on the publisher, and I see that the create subscription just > hangs till the transaction on the publisher is either committed or > rolled back. >

I think what you need to do to reproduce this is to follow the snapshot machinery in SnapBuildFindSnapshot. Basically, first, start a transaction (say its transaction-id is 500) and do some operations but don't commit. Here, if you create a slot (via subscription or otherwise), it will wait for 500 to complete and set the state to SNAPBUILD_BUILDING_SNAPSHOT. Now you can commit 500 and then, holding the debugger in that state, start another transaction (say 501) and do some operations but don't commit. The next time you reach this function, it will change the state to SNAPBUILD_FULL_SNAPSHOT and wait for 501. Now you can start another transaction (say 502), which you prepare but don't commit. Again start one more transaction (503), do some ops, and commit both 501 and 503. At this stage we somehow need to ensure that an XLOG_RUNNING_XACTS record is written. Then commit prepared 502. Now, I think you should notice that the consistent point is reached after 502's prepare and before its commit. This is just a theoretical scenario; you need something along these lines and probably a way to force an XLOG_RUNNING_XACTS WAL record (via the debugger or some other way) at the right times to reproduce it.

Thanks for trying to build a test case for this, it is really helpful. -- With Regards, Amit Kapila.
On Thu, Nov 26, 2020 at 10:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > I think what you need to do to reproduce this is to follow the > snapshot machinery in SnapBuildFindSnapshot. [...] > > Thanks for trying to build a test case for this, it is really helpful.

I tried the above steps. I was able to get the builder state to SNAPBUILD_BUILDING_SNAPSHOT but was not able to get into the SNAPBUILD_FULL_SNAPSHOT state; instead the code moves straight from SNAPBUILD_BUILDING_SNAPSHOT to the SNAPBUILD_CONSISTENT state. In the function SnapBuildFindSnapshot, either the following check fails:

1327: TransactionIdPrecedesOrEquals(SnapBuildNextPhaseAt(builder), running->oldestRunningXid))

because SnapBuildNextPhaseAt (which is the same as running->nextXid) is higher than oldestRunningXid, or, when both are the same, it falls into the following condition, which appears earlier in the code:

1247: if (running->oldestRunningXid == running->nextXid)

and then the builder moves straight into the SNAPBUILD_CONSISTENT state. At no point will nextXid be less than oldestRunningXid. In my sessions, I committed multiple txns hoping to bump up oldestRunningXid, I did checkpoints, and I made sure the XLOG_RUNNING_XACTS records are being inserted; but on each iteration into SnapBuildFindSnapshot with a new XLOG_RUNNING_XACTS record, oldestRunningXid is incremented one xid at a time, which eventually makes it catch up with running->nextXid and reach the SNAPBUILD_CONSISTENT state without ever entering the SNAPBUILD_FULL_SNAPSHOT state.

regards, Ajin Cherian Fujitsu Australia
On Fri, Nov 27, 2020 at 6:35 PM Ajin Cherian <itsajin@gmail.com> wrote: > > I tried the above steps. I was able to get the builder state to > SNAPBUILD_BUILDING_SNAPSHOT but was not able to get into the > SNAPBUILD_FULL_SNAPSHOT state; instead the code moves straight from > SNAPBUILD_BUILDING_SNAPSHOT to the SNAPBUILD_CONSISTENT state. >

I looked at the code coverage report, and it appears that part of the code (getting the snapshot machinery into the SNAPBUILD_FULL_SNAPSHOT state) is covered by the existing tests [1]. So another idea you can try is to put a break (say a while (1)) in that part of the code and run the regression tests (most probably the test_decoding or subscription tests should be sufficient to hit it). Once you have found which existing test covers it, you can try to generate the prepared transaction behavior as mentioned above.

[1] - https://coverage.postgresql.org/src/backend/replication/logical/snapbuild.c.gcov.html -- With Regards, Amit Kapila.
On Sun, Nov 29, 2020 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Once you have found which existing test covers > it, you can try to generate the prepared transaction behavior as > mentioned above.

I was able to find the test case that exercises that code: it is the ondisk_startup spec in test_decoding. Using that, I was able to create the problem with the following setup, using 4 sessions (this could be optimized to 3, but I am just sharing what I've tested):

s1 (session 1):
postgres=# begin;
BEGIN
postgres=*# SELECT pg_current_xact_id();
 pg_current_xact_id
--------------------
 546
(1 row)
-------------------- the above commands leave a transaction running

s2:
CREATE TABLE do_write(id serial primary key);
SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
-------------------- this will hang because txn 546 is still pending

s3:
postgres=# begin;
BEGIN
postgres=*# SELECT pg_current_xact_id();
 pg_current_xact_id
--------------------
 547
(1 row)
-------------------- leave another txn pending

s1:
postgres=*# ALTER TABLE do_write ADD COLUMN addedbys2 int;
ALTER TABLE
postgres=*# commit;
-------------------- commit the first txn; this will cause the state to move to SNAPBUILD_FULL_SNAPSHOT

2020-11-30 03:31:07.354 EST [16312] LOG: logical decoding found initial consistent point at 0/1730A18
2020-11-30 03:31:07.354 EST [16312] DETAIL: Waiting for transactions (approximately 1) older than 553 to end.

s4:
postgres=# begin;
BEGIN
postgres=*# INSERT INTO do_write DEFAULT VALUES;
INSERT 0 1
postgres=*# prepare transaction 'test1';
PREPARE TRANSACTION
-------------------- leave this transaction prepared

s3:
postgres=*# commit;
COMMIT
-------------------- this will cause the s2 call to return; a consistent point has been reached

2020-11-30 03:31:34.200 EST [16312] LOG: logical decoding found consistent point at 0/1730D58

s4:
commit prepared 'test1';

s2:
postgres=# SELECT * FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'two-phase-commit', '1', 'include-xids', '0', 'skip-empty-xacts', '1');
    lsn    | xid |          data
-----------+-----+-------------------------
 0/1730FC8 | 553 | COMMIT PREPARED 'test1'
(1 row)

In pg_logical_slot_get_changes() we see only the COMMIT PREPARED but no insert and no prepare command. I debugged this, and I see that in DecodePrepare the prepare is skipped because the prepare LSN is prior to the start_decoding_at point, and so it is skipped in SnapBuildXactNeedsSkip. So the reason for skipping the PREPARE is the same as the reason it would have been skipped on a restart after a previous decode run.

One possible fix would be similar to what you suggested: in DecodePrepare, add a check of DecodingContextReady(ctx), which, if false, would indicate that the PREPARE was prior to a consistent snapshot; if so, set a flag value in the txn accordingly (say RBTXN_PREPARE_NOT_DECODED?), and if this flag is detected while handling the COMMIT PREPARED, handle it like you would handle a COMMIT. This would ensure that all the changes of the transaction are sent out, and at the same time the subscriber side does not need to try to handle a prepared transaction that does not exist on its side. Let me know what you think of this.

regards, Ajin Cherian Fujitsu Australia
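PS: in code terms, the fix I have in mind is roughly the untested sketch below; RBTXN_PREPARE_NOT_DECODED is a made-up flag name, and the ReorderBufferReplay arguments are only indicative:

/* Untested sketch; RBTXN_PREPARE_NOT_DECODED is a made-up flag. */

/* In DecodePrepare(): */
if (!DecodingContextReady(ctx))
{
	/*
	 * The PREPARE lies before the point where the snapshot became
	 * consistent, so we cannot decode it now.  Remember that, so that
	 * COMMIT PREPARED knows it must replay the whole transaction.
	 */
	txn->txn_flags |= RBTXN_PREPARE_NOT_DECODED;
	return;
}

/* In ReorderBufferFinishPrepared(), for COMMIT PREPARED: */
if (txn->txn_flags & RBTXN_PREPARE_NOT_DECODED)
{
	/* decode the entire transaction now, as if it were a plain COMMIT */
	ReorderBufferReplay(txn, rb, xid, commit_lsn, end_lsn,
						commit_time, origin_id, origin_lsn);
}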
On Mon, Nov 30, 2020 at 2:36 PM Ajin Cherian <itsajin@gmail.com> wrote: > [...] > In pg_logical_slot_get_changes() we see only the COMMIT PREPARED but > no insert and no prepare command. I debugged this, and I see that in > DecodePrepare the prepare is skipped because the prepare LSN is prior > to the start_decoding_at point, and so it is skipped in > SnapBuildXactNeedsSkip.

So what caused it to skip due to start_decoding_at? Because the commit where the snapshot became consistent is after the Prepare. Does it happen due to the below code in SnapBuildFindSnapshot() where we bump start_decoding_at?

{
...
if (running->oldestRunningXid == running->nextXid)
{
if (builder->start_decoding_at == InvalidXLogRecPtr ||
builder->start_decoding_at <= lsn)
/* can decode everything after this */
builder->start_decoding_at = lsn + 1;

> One possible fix would be similar to what you suggested: in > DecodePrepare, add a check of DecodingContextReady(ctx), which, if > false, would indicate that the PREPARE was prior to a consistent > snapshot; if so, set a flag value in the txn accordingly >

Sure, but you can see in your example above that it got skipped due to start_decoding_at, not due to DecodingContextReady. So the problem, as I mentioned previously, is how we distinguish those cases, because it can be skipped due to start_decoding_at during a restart as well, when we would have already sent the prepare to the subscriber.

One idea could be that the subscriber skips the transaction if it sees that the transaction is already prepared. We already skip changes in the apply worker (subscriber) if they were performed via the tablesync worker; see should_apply_changes_for_rel. This would be a different thing, but I am trying to indicate that something similar is already done on the subscriber side. I am not sure if we can detect this on the publisher; if so, that would also be worth considering and might be better. Thoughts? -- With Regards, Amit Kapila.
On Tue, Dec 1, 2020 at 12:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > So what caused it to skip due to start_decoding_at? Because the commit > where the snapshot became consistent is after the Prepare. Does it > happen due to the below code in SnapBuildFindSnapshot() where we bump > start_decoding_at? > [...]

I think the reason is that in the function DecodingContextFindStartpoint() the code loops till it finds a consistent snapshot. Once a consistent snapshot is found, it sets

slot->data.confirmed_flush = ctx->reader->EndRecPtr;

and this will be used as start_decoding_at when the slot is restarted for decoding.

> Sure, but you can see in your example above that it got skipped due to > start_decoding_at, not due to DecodingContextReady. So the problem, as > I mentioned previously, is how we distinguish those cases, because it > can be skipped due to start_decoding_at during a restart as well, when > we would have already sent the prepare to the subscriber.

The distinguishing factor is that at restart the Prepare does satisfy DecodingContextReady (because the snapshot is consistent by then). In both cases the prepare is prior to start_decoding_at, but when the prepare is before a consistent point it does not satisfy DecodingContextReady. Which is why I suggested using the DecodingContextReady check to mark the prepare as "not decoded".

regards, Ajin Cherian Fujitsu Australia
On Tue, Dec 1, 2020 at 7:55 AM Ajin Cherian <itsajin@gmail.com> wrote: > > The distinguishing factor is that at restart the Prepare does satisfy > DecodingContextReady (because the snapshot is consistent by then). > In both cases the prepare is prior to start_decoding_at, but when the > prepare is before a consistent point it does not satisfy > DecodingContextReady. >

I think that won't be true when we reuse an already-serialized snapshot from some other slot. It is possible that we hadn't encountered such a serialized snapshot while creating the slot, but later, during replication, we might use one, because by that time some other slot has serialized a snapshot at that point. -- With Regards, Amit Kapila.
On Mon, Nov 30, 2020 at 7:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > One idea could be that the subscriber skips the transaction if it sees > that the transaction is already prepared. >

To skip it, we need to send the GID in the begin message and then, on the subscriber side, check whether the prepared xact already exists; if so, set a flag. The flag needs to be set in begin/start_stream and reset in stop_stream/commit/abort. Using the flag, we can skip the entire contents of the prepared xact. On the ReorderBuffer side also, we need to get and set the GID in the txn even when we skip it, because we need to send the same at commit time. In this solution, we won't be able to send it during a normal start_stream, because by that time we won't know the GID, and I think that won't be required. Note that this is only required when we skipped sending the prepare; otherwise, we just need to send Commit Prepared at commit time.

Another way to solve this problem, via the publisher side, is to maintain, in some file at the slot level, whether we have sent the prepare for a particular txn. Basically, after sending the prepare, we update the slot information on disk to indicate that the particular GID has been sent (we can probably store the GID and the LSN of the Prepare). Then, the next time we have to skip a prepare for whatever reason, we check for the existence of persistent information on disk for that GID; if it exists, we need to send just Commit Prepared, otherwise the entire transaction. We can remove this information during or after CheckPointSnapBuild; basically, we can remove the information for all GIDs that are after the cutoff LSN computed via ReplicationSlotsComputeLogicalRestartLSN. We could even think of removing this information during or after Commit Prepared, but I am not sure that is correct, because we can't lose this information unless start_decoding_at (or restart_lsn) has moved past the commit LSN.

Now, to persist this information, there are multiple possibilities: (a) maintain a flexible array of GIDs at the end of ReplicationSlotPersistentData, (b) have a separate state file per slot for prepared xacts, or (c) have a separate state file per slot for each prepared xact. With (a), during an upgrade from a previous version there could be a problem because the previous data won't match the new data, though I am not sure if we keep slot info intact across an upgrade. I think (c) would be the simplest, but OTOH having many such files per slot (in case there are many prepared xacts) might not be a good idea.

One more thing that needs to be thought about is whether, when we send the entire xact at commit time, we will send the prepare separately. If we don't send it separately, then later allowing the PREPARE on the master to wait for the prepare on subscribers won't be possible. Thoughts? -- With Regards, Amit Kapila.
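PS: for (c), I am imagining a small fixed-format record per prepared xact, something like the below; everything here (struct name, fields, file location) is purely illustrative:

/* Purely illustrative sketch for approach (c). */
typedef struct ReplSlotPreparedXactOnDisk
{
	uint32		magic;			/* format identifier */
	pg_crc32c	checksum;		/* CRC of the data below */
	TransactionId xid;			/* xid of the prepared transaction */
	XLogRecPtr	prepare_lsn;	/* LSN of its PREPARE record */
	char		gid[GIDSIZE];	/* user-supplied global identifier */
} ReplSlotPreparedXactOnDisk;

/*
 * One such file per sent prepare, e.g. somewhere under
 * pg_replslot/<slotname>/, cleaned up based on the cutoff LSN from
 * ReplicationSlotsComputeLogicalRestartLSN().
 */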
On Tue, Dec 1, 2020 at 6:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > To skip it, we need to send the GID in the begin message and then, on > the subscriber side, check whether the prepared xact already exists; if > so, set a flag. [...] Note that this is only required when we skipped > sending the prepare; otherwise, we just need to send Commit Prepared at > commit time. >

After going through both solutions, I think the above one is the better idea. I also think that, rather than changing the protocol for the regular begin, we could have a special begin_prepare for prepared txns specifically. This way we won't affect non-prepared transactions. We will need to add a begin_prepare callback as well, which has the gid as one of its parameters. Other than this, in ReorderBufferFinishPrepared, if the txn hasn't already been prepared (because it was skipped in DecodePrepare), we set the prepared flag and call ReorderBufferReplay before calling the commit-prepared callback.

On the subscriber side, on receipt of the special begin-prepare, we first check whether the gid is that of an already-prepared txn. If yes, we set a flag such that the rest of the transaction's changes are not applied but skipped; if it is not a gid that has already been prepared, we continue to apply changes as we would otherwise. So this is the approach I'd pick. The drawback is probably that we send extra prepares after a restart, which might be quite common while using test_decoding but not so common when using pgoutput and real-world pub/sub scenarios. The second approach is a bit more involved, requiring file creation and manipulation, as well as the overhead of having to write to a file on every prepare, which might be a performance bottleneck.

Let me know what you think.

regards, Ajin Cherian Fujitsu Australia
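PS: the callback I am proposing would look something like the below; the signature is a proposal only, not final:

/* Proposed output plugin callback; signature not final. */
typedef void (*LogicalDecodeBeginPrepareCB) (struct LogicalDecodingContext *ctx,
											 ReorderBufferTXN *txn);

/*
 * Unlike the plain begin callback, the GID is available here (via the
 * txn), so the plugin (e.g. pgoutput) can transmit it in the new
 * begin-prepare protocol message and the subscriber can check it
 * against its own prepared transactions.
 */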
On Wed, Dec 2, 2020 at 12:47 PM Ajin Cherian <itsajin@gmail.com> wrote: > > After going through both solutions, I think the above one is the > better idea. I also think that, rather than changing the protocol for > the regular begin, we could have a special begin_prepare for prepared > txns specifically. [...] > On the subscriber side, on receipt of the special begin-prepare, we > first check whether the gid is that of an already-prepared txn. If > yes, we set a flag such that the rest of the transaction's changes are > not applied but skipped [...]

The above sketch sounds good to me. Additionally, you might want to add Asserts in the streaming APIs on the subscriber side to ensure that we never reach the already-prepared case there. We should never need to stream changes when we are skipping a prepare, either because the snapshot was not consistent by that time or because we already sent those changes before the restart.

> So this is the approach I'd pick. The drawback is probably that we > send extra prepares after a restart, which might be quite common while > using test_decoding but not so common when using pgoutput and > real-world pub/sub scenarios. >

Restarts should be rare. It depends on how one uses the test_decoding module; it is primarily for testing, and if you write a test in a way that performs WAL decoding again and again over the same WAL (i.e. simulating restarts), then you would probably see it, but otherwise one shouldn't. -- With Regards, Amit Kapila.
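PS: I mean Asserts along these lines, assuming the apply worker tracks the skip decision in a flag (skip_prepared_txn is an illustrative name):

/* Sketch only; skip_prepared_txn is an illustrative flag name. */
static void
apply_handle_stream_start(StringInfo s)
{
	/*
	 * A transaction whose prepare we are skipping was either fully
	 * sent before the restart or never decoded at all, so its changes
	 * must never arrive via the streaming path.
	 */
	Assert(!skip_prepared_txn);

	/* ... existing stream-start handling ... */
}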
I have rebased the v28 patch set (made necessary by recent commit [1]). [1] https://github.com/postgres/postgres/commit/0926e96c493443644ba8e96b5d96d013a9ffaf64

At the same time I have added patch 0009 to this set - it is for the new SUBSCRIPTION option "two_phase" (0009 is still WIP but stable).

PSA the new patch set with the version bumped to v29. --- Kind Regards, Peter Smith. Fujitsu Australia
Attachment
- v29-0001-Extend-the-output-plugin-API-to-allow-decoding-p.patch
- v29-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch
- v29-0003-Support-2PC-txn-tests-for-test_decoding.patch
- v29-0004-Support-2PC-txn-tests-for-concurrent-aborts.patch
- v29-0005-Support-2PC-txn-spoolfile.patch
- v29-0006-Support-2PC-txn-pgoutput.patch
- v29-0007-Support-2PC-txn-subscriber-tests.patch
- v29-0008-Support-2PC-documentation.patch
- v29-0009-Support-2PC-txn-WIP-Subscription-option.patch
On Wed, Dec 2, 2020 at 8:24 PM Peter Smith <smithpb2250@gmail.com> wrote: > > I have rebased the v28 patch set (made necessary by recent commit [1]). > > PSA the new patch set with the version bumped to v29.

Thank you for updating the patch! While looking at the patch set I found that the tests in src/test/subscription don't work with this patch. I got the following error:

2020-12-03 15:18:12.666 JST [44771] tap_sub ERROR: unrecognized pgoutput option: two_phase
2020-12-03 15:18:12.666 JST [44771] tap_sub CONTEXT: slot "tap_sub", output plugin "pgoutput", in the startup callback
2020-12-03 15:18:12.666 JST [44771] tap_sub STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', two_phase 'on', publication_names '"tap_pub","tap_pub_ins_only"')

In the v29-0009 patch the "two_phase" option is added on the subscription side (i.e., libpqwalreceiver) but it seems not on the publisher side (pgoutput).

Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
On Thu, Dec 3, 2020 at 5:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > While looking at the patch set I found that the tests in > src/test/subscription don't work with this patch. [...] > In the v29-0009 patch the "two_phase" option is added on the > subscription side (i.e., libpqwalreceiver) but it seems not on the > publisher side (pgoutput). >

The v29-0009 patch is still a WIP for the new SUBSCRIPTION "two_phase" option, so it is not yet fully implemented. I did run the following prior to upload but somehow did not see those failures yesterday:

cd src/test/subscription
make check

Anyway, as 0009 is the last of the set, please just git apply --reverse that one if it is causing a problem. Sorry for any inconvenience. I will add the missing functionality to 0009 as soon as I can.

Kind Regards, Peter Smith. Fujitsu Australia
On Thu, Dec 3, 2020 at 6:21 PM Peter Smith <smithpb2250@gmail.com> wrote: > Sorry for any inconvenience. I will add the missing functionality to > 0009 as soon as I can. >

PSA a **replacement** patch for the previous v29-0009. This should correct the recently reported trouble [1]. [1] = https://www.postgresql.org/message-id/CAD21AoBnZ6dYffVjOCdSvSohR_1ZNedqmb%3D6P9w_H6W0bK1s6g%40mail.gmail.com

After applying this patch I observed: make check is all OK; cd src/test/subscription, then make check is all OK.

Note that the tablesync worker's (temporary) slot always uses two_phase *off*, regardless of the user setting. e.g.

CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub WITH (streaming = on, two_phase = on);

will show in the logs that only the apply worker's slot enabled two_phase:

STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', streaming 'on', two_phase 'on', publication_names '"tap_pub"')
STATEMENT: START_REPLICATION SLOT "tap_sub_16395_sync_16385" LOGICAL 0/16076D8 (proto_version '2', streaming 'on', publication_names '"tap_pub"')

--- Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Tue, Dec 1, 2020 at 6:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > To skip it, we need to send the GID in the begin message and then, on > the subscriber side, check whether the prepared xact already exists; if > so, set a flag. [...] Note that this is only required when we skipped > sending the prepare; otherwise, we just need to send Commit Prepared at > commit time.

I have implemented these changes and tested the fix using the test setup I had shared above, and it seems to be working fine. I have also tested restarts that simulate duplicate prepares being sent by the publisher and verified that they are handled correctly by the subscriber. Do have a look at the changes and let me know if you have any comments.

regards, Ajin Cherian Fujitsu Australia
Attachment
- v30-0001-Extend-the-output-plugin-API-to-allow-decoding-p.patch
- v30-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch
- v30-0003-Support-2PC-txn-tests-for-test_decoding.patch
- v30-0004-Support-2PC-txn-tests-for-concurrent-aborts.patch
- v30-0005-Support-2PC-txn-spoolfile.patch
- v30-0006-Support-2PC-txn-pgoutput.patch
- v30-0007-Support-2PC-txn-subscriber-tests.patch
- v30-0008-Support-2PC-documentation.patch
- v30-0009-Support-2PC-txn-Subscription-option.patch
On Tue, Dec 8, 2020 at 2:01 PM Ajin Cherian <itsajin@gmail.com> wrote: > > I have implemented these changes and tested the fix using the test > setup I had shared above, and it seems to be working fine. > I have also tested restarts that simulate duplicate prepares being > sent by the publisher and verified that they are handled correctly by > the subscriber. >

This implementation has a flaw in that it used commit_lsn for the prepare when we are sending the prepare just before the commit prepared. We can't send the commit LSN with the prepare, because if the subscriber crashes after the prepare it would update replorigin_session_origin_lsn with that commit_lsn. Then, after the restart, because we will use that LSN to start decoding, the Commit Prepared will get skipped. To fix this, we need to remember the prepare LSN and other information even when we skip the prepare, and then use it while sending the prepare during commit prepared.

After fixing this, I discovered another issue, which is that we allow adding a new snapshot to prepared transactions via SnapBuildDistributeNewCatalogSnapshot. We can only allow it to be added to in-progress transactions. If you comment out the changes added in SnapBuildDistributeNewCatalogSnapshot, you will notice one test failure, which indicates this problem. This problem was not evident before the bug-fix in the previous paragraph, because you were using the commit LSN even for the prepare, and the newly added snapshot change appears to be before the end_lsn.

Some other assorted changes in various patches:

v31-0001-Extend-the-output-plugin-API-to-allow-decoding-o
1. I have changed the filter_prepare API to match its signature with FilterByOrigin. I don't see the need for ReorderBufferTxn or the xid in the API.
2. I have expanded the documentation of the 'Begin Prepare Callback' to explain how a user can use it to detect already-prepared transactions, and in which scenarios that can happen.
3. Added a few comments in the code atop begin_prepare_cb_wrapper to explain why we are adding this new API.
4. Moved the check of whether the filter_prepare callback is defined from filter_prepare_cb_wrapper to the caller. This is similar to how FilterByOrigin works.
5. Fixed various whitespace and cosmetic issues.
6. Updated the commit message to include two of the newly added APIs.

v31-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer
1. Changed the variable names and comments in DecodeXactOp.
2. Added a new API FilterPrepare, similar to FilterByOrigin, and used that instead of ReorderBufferPrepareNeedSkip.
3. In DecodeCommit, we need to update the reorderbuffer about the surviving subtransactions for both ReorderBufferFinishPrepared and ReorderBufferCommit, because now both can process the transaction.
4. Because we now need to remember the prepare info even when we skip it, I have simplified the ReorderBufferPrepare API by removing the extra parameters, as that information will now be available via ReorderBufferTXN.
5. Updated comments at various places.

v31-0006-Support-2PC-txn-pgoutput
1. Added Asserts in the streaming APIs on the subscriber side to ensure that we never reach there for the already-prepared transaction case. We never need to stream changes when we are skipping the prepare, either because the snapshot was not consistent by that time or because we already sent those changes before the restart. Added the same Assert in the Begin and Commit routines, because while skipping a prepared txn we must not receive the changes of any other xact.
2.
+ /*
+ * Flags are determined from the state of the transaction. We know we
+ * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
+ * it's already marked as committed then it has to be COMMIT PREPARED (and
+ * likewise for abort / ROLLBACK PREPARED).
+ */
+ if (rbtxn_commit_prepared(txn))
+ flags = LOGICALREP_IS_COMMIT_PREPARED;
+ else if (rbtxn_rollback_prepared(txn))
+ flags = LOGICALREP_IS_ROLLBACK_PREPARED;
+ else
+ flags = LOGICALREP_IS_PREPARE;

I don't like clubbing three different operations under the one message LOGICAL_REP_MSG_PREPARE. It looks awkward to use the new flags RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED in ReorderBuffer so that we can recognize these operations in the corresponding callbacks; I think setting a flag in ReorderBuffer should not dictate the behavior in callbacks. Also, a few things are not common to those APIs: for example, the patch has an Assert saying that the txn is marked with the prepare flag for all three operations, which I think is not true for Rollback Prepared after a restart, since we don't ensure the Prepare flag is set if the Rollback Prepared happens after a restart. We would also have to introduce separate flags to distinguish prepare/commit prepared/rollback prepared when they are sent as protocol messages. All these operations are mutually exclusive, so it is better to send separate messages for each of them, and I have changed it accordingly in the attached patch.
3. The patch had duplicate code to send replication origins. I have moved the common code to a separate function.

v31-0009-Support-2PC-txn-Subscription-option
1.
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -53,6 +53,6 @@
*/
/* yyyymmddN */
-#define CATALOG_VERSION_NO 202011251
+#define CATALOG_VERSION_NO 202011271

There is no need to change catversion, as this gets changed frequently and that leads to conflicts in the patch. We can change it in the final version, or normally committers take care of it. If you want to remember it, adding a line for it in the commit message should be okay. For now, I have removed this from the patch. -- With Regards, Amit Kapila.
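PS: with this change, the protocol grows separate message types along the lines below; the actual letters are provisional:

/* Provisional sketch; the message letters are not final. */
typedef enum LogicalRepMsgType
{
	/* ... existing message types ... */
	LOGICAL_REP_MSG_BEGIN_PREPARE = 'b',
	LOGICAL_REP_MSG_PREPARE = 'P',
	LOGICAL_REP_MSG_COMMIT_PREPARED = 'K',
	LOGICAL_REP_MSG_ROLLBACK_PREPARED = 'r'
} LogicalRepMsgType;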
Attachment
- v31-0001-Extend-the-output-plugin-API-to-allow-decoding-o.patch
- v31-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch
- v31-0003-Support-2PC-txn-tests-for-test_decoding.patch
- v31-0004-Support-2PC-txn-tests-for-concurrent-aborts.patch
- v31-0005-Support-2PC-txn-spoolfile.patch
- v31-0006-Support-2PC-txn-pgoutput.patch
- v31-0007-Support-2PC-txn-subscriber-tests.patch
- v31-0008-Support-2PC-documentation.patch
- v31-0009-Support-2PC-txn-Subscription-option.patch
On Mon, Dec 14, 2020 at 2:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >

Today, I looked at one of the issues discussed earlier in this thread [1], which is that decoding can block (or a deadlock can happen) when the user explicitly locks a catalog relation (like Lock pg_class), or performs a Cluster on a non-relmapped catalog relation (like Cluster pg_trigger using pg_class_oid_index, when the user_table on which we have performed an operation has a trigger), inside the prepared xact. As discussed previously, we don't have a problem when user tables are exclusively locked, because during decoding we don't acquire any lock on those, and in fact we have a test case for this in the patch.

In the previous discussion, most people seemed to be of the opinion that we should document this in a "don't do that" category, or prohibit preparing transactions that lock system tables in exclusive mode, as that can block the entire system anyway. Another possibility could be that the plugin enables lock_timeout when it wants to allow decoding of two-phase xacts, and if the timeout occurs it retries fetching with the two-phase option provided by the patch disabled.

I think it is better to document this, as there is no realistic scenario where it can happen. I also think that, separately (not as part of this patch), we can investigate whether it is a good idea to prohibit preparing transactions that acquire exclusive locks on catalog relations. Thoughts?

[1] - https://www.postgresql.org/message-id/CAMGcDxf83P5SGnGH52%3D_0wRP9pO6uRWCMRwAA0nxKtZvir2_vQ%40mail.gmail.com -- With Regards, Amit Kapila.
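PS: if we ever want the plugin-side mitigation, it could be as simple as the below in the plugin's startup callback (the function name and the timeout value are arbitrary, and this is untested):

/* Untested sketch: opt in to a bounded lock wait while decoding 2PC. */
#include "utils/guc.h"

static void
my_plugin_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
				  bool is_init)
{
	/*
	 * Give up after 10s instead of blocking forever behind a prepared
	 * xact's exclusive catalog lock; on timeout the caller could retry
	 * with the two-phase option disabled.
	 */
	SetConfigOption("lock_timeout", "10s", PGC_USERSET, PGC_S_SESSION);
}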
On Mon, Dec 14, 2020 at 6:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > This implementation has a flaw in that it used commit_lsn for the > prepare when we are sending the prepare just before the commit > prepared. [...] > > Some other assorted changes in various patches: [...]

Thank you for updating the patch.
I have two questions: ----- @@ -239,6 +239,19 @@ CREATE SUBSCRIPTION <replaceable class="parameter">subscription_name</replaceabl </para> </listitem> </varlistentry> + <varlistentry> + <term><literal>two_phase</literal> (<type>boolean</type>)</term> + <listitem> + <para> + Specifies whether two-phase commit is enabled for this subscription. + The default is <literal>false</literal>. + When two-phase commit is enabled then the decoded transactions are sent + to the subscriber on the PREPARE TRANSACTION. When two-phase commit is not + enabled then PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED are not + decoded on the publisher. + </para> + </listitem> + </varlistentry> The user will need to specify the 'two_phase' option on CREATE SUBSCRIPTION. It would mean the user will need to control what data is streamed both on the publication side for INSERT/UPDATE/DELETE/TRUNCATE and on the subscriber side for PREPARE. Looking at the implementation of the 'two_phase' option of CREATE SUBSCRIPTION, it ultimately just passes the 'two_phase' option to the publisher. Why don't we set it on the publisher side? Also, I guess we can improve the description of the 'two_phase' option of CREATE SUBSCRIPTION in the doc by adding the fact that when this option is not enabled, a transaction prepared on the publisher is decoded as a normal transaction: ------ + if (LookupGXact(begin_data.gid)) + { + /* + * If this gid has already been prepared then we dont want to apply + * this txn again. This can happen after restart where upstream can + * send the prepared transaction again. See + * ReorderBufferFinishPrepared. Don't update remote_final_lsn. + */ + skip_prepared_txn = true; + return; + } When PREPARE arrives at the subscriber node but there is already a prepared transaction with the same transaction identifier, the apply worker skips the whole transaction. So if a user prepared a transaction with the same identifier on the subscriber, the prepared transaction that came from the publisher would be ignored without any message. On the other hand, if applying other operations such as HEAP_INSERT conflicts (such as when violating a unique constraint), the apply worker raises an ERROR and stops logical replication until the conflict is resolved. IIUC, since we can know that the prepared transaction came from the same publisher again by checking origin_lsn in TwoPhaseFileHeader, I guess we can skip the PREPARE message only when the existing prepared transaction has the same LSN and the same identifier. To be exact, it's still possible that the subscriber gets two PREPARE messages having the same LSN and name from two different publishers, but that's unlikely to happen in practice. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
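To make the suggested check concrete: a minimal sketch of what the apply-side skip decision could look like, assuming LookupGXact() is extended to also match the origin LSN and timestamp recorded in the two-phase state data, and assuming a LogicalRepPreparedTxnData struct with prepare_lsn/prepare_time fields. Both names are illustrative, not the patch's actual definitions.

    /*
     * Sketch only: skip an incoming PREPARE when a matching prepared
     * transaction already exists locally.  Matching on GID alone is not
     * enough, because a user (or another publisher) may have prepared an
     * unrelated transaction under the same identifier; comparing the
     * origin LSN and timestamp too makes a false match very unlikely.
     */
    static bool
    should_skip_prepare(LogicalRepPreparedTxnData *begin_data)
    {
        return LookupGXact(begin_data->gid,
                           begin_data->prepare_lsn,
                           begin_data->prepare_time);
    }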
On Wed, Dec 16, 2020 at 1:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Thank you for updating the patch. I have two questions: > > ----- > @@ -239,6 +239,19 @@ CREATE SUBSCRIPTION <replaceable > class="parameter">subscription_name</replaceabl > </para> > </listitem> > </varlistentry> > + <varlistentry> > + <term><literal>two_phase</literal> (<type>boolean</type>)</term> > + <listitem> > + <para> > + Specifies whether two-phase commit is enabled for this subscription. > + The default is <literal>false</literal>. > + When two-phase commit is enabled then the decoded > transactions are sent > + to the subscriber on the PREPARE TRANSACTION. When > two-phase commit is not > + enabled then PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED are not > + decoded on the publisher. > + </para> > + </listitem> > + </varlistentry> > > The user will need to specify the 'two_phase’ option on CREATE > SUBSCRIPTION. It would mean the user will need to control what data is > streamed both on publication side for INSERT/UPDATE/DELETE/TRUNCATE > and on subscriber side for PREPARE. Looking at the implementation of > the ’two_phase’ option of CREATE SUBSCRIPTION, it ultimately just > passes the ‘two_phase' option to the publisher. Why don’t we set it on > the publisher side? > There could be multiple subscriptions for the same publication, some want to decode the transaction at prepare time and others might want to decode at commit time only. Also, one subscription could subscribe to multiple publications, so not sure if it is even feasible to set at publication level (consider one txn has changes belonging to multiple publications). This option controls how the data is streamed from a publication similar to other options like 'streaming'. Why do you think this should be any different? > Also, I guess we can improve the description of > ’two_phase’ option of CREATE SUBSCRIPTION in the doc by adding the > fact that when this option is not enabled the transaction prepared on > the publisher is decoded as a normal transaction: > Sounds reasonable. > ------ > + if (LookupGXact(begin_data.gid)) > + { > + /* > + * If this gid has already been prepared then we dont want to apply > + * this txn again. This can happen after restart where upstream can > + * send the prepared transaction again. See > + * ReorderBufferFinishPrepared. Don't update remote_final_lsn. > + */ > + skip_prepared_txn = true; > + return; > + } > > When PREPARE arrives at the subscriber node but there is the prepared > transaction with the same transaction identifier, the apply worker > skips the whole transaction. So if the users prepared a transaction > with the same identifier on the subscriber, the prepared transaction > that came from the publisher would be ignored without any messages. On > the other hand, if applying other operations such as HEAP_INSERT > conflicts (such as when violating the unique constraint) the apply > worker raises an ERROR and stops logical replication until the > conflict is resolved. IIUC since we can know that the prepared > transaction came from the same publisher again by checking origin_lsn > in TwoPhaseFileHeader I guess we can skip the PREPARE message only > when the existing prepared transaction has the same LSN and the same > identifier. To be exact, it’s still possible that the subscriber gets > two PREPARE messages having the same LSN and name from two different > publishers but it’s unlikely happen in practice. > The idea sounds reasonable. I'll try and see if this works. Thanks. 
-- With Regards, Amit Kapila.
> v31-0009-Support-2PC-txn-Subscription-option > 1. > --- a/src/include/catalog/catversion.h > +++ b/src/include/catalog/catversion.h > @@ -53,6 +53,6 @@ > */ > > /* yyyymmddN */ > -#define CATALOG_VERSION_NO 202011251 > +#define CATALOG_VERSION_NO 202011271 > > No need to change catversion as this gets changed frequently and that > leads to conflict in the patch. We can change it either in the final > version or normally committers take care of this. If you want to > remember it, maybe adding a line for it in the commit message should > be okay. For now, I have removed this from the patch. > > > -- > With Regards, > Amit Kapila. I have reviewed the changes and did not have any new comments. While testing, I found an issue in this patch. During initialisation, pgoutput is not fully initialised and the subscription parameters are not all read. As a result, ctx->twophase could be set to true, even if the subscription does not specify it. For this, we need to make the following change in pgoutput.c: pgoutput_startup(), similar to how streaming is handled.

/*
 * This is replication start and not slot initialization.
 *
 * Parse and validate options passed by the client.
 */
if (!is_init)
{
    :
    :
}
else
{
    /* Disable the streaming during the slot initialization mode. */
    ctx->streaming = false;
+   ctx->twophase = false;
}

regards, Ajin
On Thu, Dec 17, 2020 at 7:02 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > v31-0009-Support-2PC-txn-Subscription-option > > 1. > > --- a/src/include/catalog/catversion.h > > +++ b/src/include/catalog/catversion.h > > @@ -53,6 +53,6 @@ > > */ > > > > /* yyyymmddN */ > > -#define CATALOG_VERSION_NO 202011251 > > +#define CATALOG_VERSION_NO 202011271 > > > > No need to change catversion as this gets changed frequently and that > > leads to conflict in the patch. We can change it either in the final > > version or normally committers take care of this. If you want to > > remember it, maybe adding a line for it in the commit message should > > be okay. For now, I have removed this from the patch. > > > > > > -- > > With Regards, > > Amit Kapila. > > I have reviewed the changes, did not have any new comments. > While testing, I found an issue in this patch. During initialisation, > the pgoutput is not initialised fully and the subscription parameters > are not all read. As a result, ctx->twophase could be > set to true, even if the subscription does not specify it. For this, > we need to make the following change in pgoutput.c: > pgoutput_startup(), similar to how streaming is handled. > > /* > * This is replication start and not slot initialization. > * > * Parse and validate options passed by the client. > */ > if (!is_init) > { > : > : > } > else > { > /* Disable the streaming during the slot initialization mode. */ > ctx->streaming = false; > + ctx->twophase = false; > } > makes sense. I can take care of this in the next version, where I am planning to address Sawada-San's comments and a few other cleanup items. -- With Regards, Amit Kapila.
On Thu, Dec 17, 2020 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Dec 17, 2020 at 7:02 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > I have reviewed the changes, did not have any new comments. > > While testing, I found an issue in this patch. During initialisation, > > the pgoutput is not initialised fully and the subscription parameters > > are not all read. As a result, ctx->twophase could be > > set to true, even if the subscription does not specify it. For this, > > we need to make the following change in pgoutput.c: > > pgoutput_startup(), similar to how streaming is handled. > > > > /* > > * This is replication start and not slot initialization. > > * > > * Parse and validate options passed by the client. > > */ > > if (!is_init) > > { > > : > > : > > } > > else > > { > > /* Disable the streaming during the slot initialization mode. */ > > ctx->streaming = false; > > + ctx->twophase = false; > > } > > > > makes sense. > Thinking about this again: I think it is good to disable it during > slot initialization, but will it create any problem, given that during slot > initialization we don't stream any xact and stop processing WAL as > soon as we reach CONSISTENT_STATE? Did you observe any problem with > this? -- With Regards, Amit Kapila.
On Thu, Dec 17, 2020 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On again thinking about this, I think it is good to disable it during > slot initialization but will it create any problem because during slot > initialization we don't stream any xact and stop processing WAL as > soon as we reach CONSISTENT_STATE? Did you observe any problem with > this? > Yes, it did not stream any xact during initialization but I was surprised that the DecodePrepare code was invoked even though I hadn't created the subscription with twophase enabled. No problem was observed. regards, Ajin Cherian Fujitsu Australia
Adding a test case that covers the scenario where a consistent snapshot is formed after a transaction is prepared but before it is COMMIT PREPARED. This test makes sure that, in this case, the entire transaction is decoded afresh on COMMIT PREPARED. This patch applies on top of v31. regards, Ajin Cherian Fujitsu Australia
Attachment
On Tue, Dec 15, 2020 at 11:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Dec 14, 2020 at 2:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Today, I looked at one of the issues discussed earlier in this thread > [1] which is that decoding can block (or a deadlock can happen) when the > user explicitly locks the catalog relation (like Lock pg_class) or > performs Cluster on non-relmapped catalog relations (like Cluster > pg_trigger using pg_class_oid_index; and the user_table on which we > have performed any operation has a trigger) in the prepared xact. As > discussed previously, we don't have a problem when user tables are > exclusively locked because during decoding we don't acquire any lock > on those and in fact, we have a test case for the same in the patch. > Yes, and as described in that mail, the current code explicitly denies preparation of a 2PC transaction under some circumstances:

postgres=# BEGIN;
postgres=# CLUSTER pg_class USING pg_class_oid_index;
postgres=# PREPARE TRANSACTION 'test_prepared_lock';
ERROR: cannot PREPARE a transaction that modified relation mapping

> In the previous discussion, most people seem to be of the opinion that we > should document it in a "don't do that" category, or prohibit preparing > transactions that lock system tables in exclusive mode, as either > way that can block the entire system. The other possibility could > be that the plugin can allow enabling lock_timeout when it wants to > allow decoding of two-phase xacts, and if the timeout occurs it tries > to fetch again with the two-phase option provided by the patch disabled. > > I think it is better to document this as there is no realistic > scenario where it can happen. I also think separately (not as part of > this patch) we can investigate whether it is a good idea to prohibit > prepare for transactions that acquire exclusive locks on catalog > relations. > > Thoughts? I agree with the documentation option. If we choose to disable two-phase on timeout, we still need to decide what to do with already prepared transactions. regards, Ajin Cherian Fujitsu Australia
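For what it's worth, if the "prohibit" route were ever taken, the check would presumably sit next to the existing PREPARE-time restrictions in PrepareTransaction(). A rough sketch, where CurrentXactHoldsExclusiveCatalogLock() is a purely hypothetical helper, not an existing function:

    /*
     * Hypothetical sketch, not part of the patch set: refuse PREPARE when
     * the transaction holds an exclusive lock on a system catalog,
     * analogous to the existing "modified relation mapping" restriction
     * quoted above.
     */
    if (CurrentXactHoldsExclusiveCatalogLock())
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot PREPARE a transaction that holds an exclusive lock on a system catalog")));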
On Wed, Dec 16, 2020 at 6:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 16, 2020 at 1:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Thank you for updating the patch. I have two questions: > > > > ----- > > @@ -239,6 +239,19 @@ CREATE SUBSCRIPTION <replaceable > > class="parameter">subscription_name</replaceabl > > </para> > > </listitem> > > </varlistentry> > > + <varlistentry> > > + <term><literal>two_phase</literal> (<type>boolean</type>)</term> > > + <listitem> > > + <para> > > + Specifies whether two-phase commit is enabled for this subscription. > > + The default is <literal>false</literal>. > > + When two-phase commit is enabled then the decoded > > transactions are sent > > + to the subscriber on the PREPARE TRANSACTION. When > > two-phase commit is not > > + enabled then PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED are not > > + decoded on the publisher. > > + </para> > > + </listitem> > > + </varlistentry> > > > > The user will need to specify the 'two_phase' option on CREATE > > SUBSCRIPTION. It would mean the user will need to control what data is > > streamed both on publication side for INSERT/UPDATE/DELETE/TRUNCATE > > and on subscriber side for PREPARE. Looking at the implementation of > > the 'two_phase' option of CREATE SUBSCRIPTION, it ultimately just > > passes the 'two_phase' option to the publisher. Why don't we set it on > > the publisher side? > > > > There could be multiple subscriptions for the same publication, some > want to decode the transaction at prepare time and others might want > to decode at commit time only. Also, one subscription could subscribe > to multiple publications, so not sure if it is even feasible to set at > publication level (consider one txn has changes belonging to multiple > publications). This option controls how the data is streamed from a > publication similar to other options like 'streaming'. Why do you > think this should be any different? Oh, I was thinking that the option controls what data is streamed, similar to the 'publish' option. But I agree with you. As you mentioned, it might be a problem if a subscription subscribes to multiple publications that set different 'two_phase' options. Also, in terms of changing this option while streaming changes, it's better to control it on the subscriber side. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
On Wed, Dec 16, 2020 at 2:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 16, 2020 at 1:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > Also, I guess we can improve the description of > > 'two_phase' option of CREATE SUBSCRIPTION in the doc by adding the > > fact that when this option is not enabled the transaction prepared on > > the publisher is decoded as a normal transaction: > > > > Sounds reasonable. > Fixed in the attached. > > ------ > > + if (LookupGXact(begin_data.gid)) > > + { > > + /* > > + * If this gid has already been prepared then we dont want to apply > > + * this txn again. This can happen after restart where upstream can > > + * send the prepared transaction again. See > > + * ReorderBufferFinishPrepared. Don't update remote_final_lsn. > > + */ > > + skip_prepared_txn = true; > > + return; > > + } > > > > When PREPARE arrives at the subscriber node but there is the prepared > > transaction with the same transaction identifier, the apply worker > > skips the whole transaction. So if the users prepared a transaction > > with the same identifier on the subscriber, the prepared transaction > > that came from the publisher would be ignored without any messages. On > > the other hand, if applying other operations such as HEAP_INSERT > > conflicts (such as when violating the unique constraint) the apply > > worker raises an ERROR and stops logical replication until the > > conflict is resolved. IIUC since we can know that the prepared > > transaction came from the same publisher again by checking origin_lsn > > in TwoPhaseFileHeader I guess we can skip the PREPARE message only > > when the existing prepared transaction has the same LSN and the same > > identifier. To be exact, it's still possible that the subscriber gets > > two PREPARE messages having the same LSN and name from two different > > publishers but it's unlikely happen in practice. > > > > The idea sounds reasonable. I'll try and see if this works. > I went ahead and used both origin_lsn and origin_timestamp to avoid the possibility of a match of a prepared xact from two different nodes. We can handle this at begin_prepare and prepare time, but we don't have prepare_lsn and prepare_timestamp at rollback_prepared time, so what should we do about that? As of now, I am using just the GID at rollback_prepare time, and that would have been sufficient if we always received prepare before rollback, because at prepare time we would have checked origin_lsn and origin_timestamp. But it is possible that we get rollback prepared without prepare, in case prepare happened before the consistent snapshot is reached and rollback happens after that. For the commit case, we do send prepare and all the data at commit time in such a case, but doing so for the rollback case doesn't sound like a good idea. Another possibility is that we send prepare_lsn and prepare_time in the rollback_prepared API to deal with this. I am not sure if it is a good idea to just rely on GID in rollback_prepare. What do you think? I have made some additional changes in the patch-series. 1. Removed some declarations from 0001-Extend-the-output-plugin-API-to-allow-decoding-o which were not required. 2. In 0002-Allow-decoding-at-prepare-time-in-ReorderBuffer,

+ txn->txn_flags |= RBTXN_PREPARE;
+ txn->gid = palloc(strlen(gid) + 1); /* trailing '\0' */
+ strcpy(txn->gid, gid);

Changed the above code to use pstrdup. 3. Merged the test-code from 0003 to 0002. I have yet to merge the latest test case posted by Ajin[1]. 4. 
Removed the test for Rollback Prepared from two_phase_streaming.sql because I think a similar test exists for the non-streaming case in two_phase.sql and it doesn't make sense to repeat it. 5. Updated comments and made minor cosmetic changes for the test cases merged from 0003 to 0002. [1] - https://www.postgresql.org/message-id/CAFPTHDYWj99%2BysDjCH_z8BfN8hG2FoxtJg%2BEU8_MpJe5owXg4A%40mail.gmail.com -- With Regards, Amit Kapila.
Attachment
- v32-0001-Extend-the-output-plugin-API-to-allow-decoding-o.patch
- v32-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch
- v32-0003-Refactor-spool-file-logic-in-worker.c.patch
- v32-0004-Add-support-for-apply-at-prepare-time-to-built-i.patch
- v32-0005-Support-2PC-txn-subscriber-tests.patch
- v32-0006-Support-2PC-documentation.patch
- v32-0007-Support-2PC-txn-Subscription-option.patch
- v32-0008-Support-2PC-consistent-snapshot-isolation-tests.patch
- v32-0009-Support-2PC-txn-tests-for-concurrent-aborts.patch
On Thu, Dec 17, 2020 at 6:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 16, 2020 at 2:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Dec 16, 2020 at 1:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > Also, I guess we can improve the description of > > > ’two_phase’ option of CREATE SUBSCRIPTION in the doc by adding the > > > fact that when this option is not enabled the transaction prepared on > > > the publisher is decoded as a normal transaction: > > > > > > > Sounds reasonable. > > > > Fixed in the attached. > > > > ------ > > > + if (LookupGXact(begin_data.gid)) > > > + { > > > + /* > > > + * If this gid has already been prepared then we dont want to apply > > > + * this txn again. This can happen after restart where upstream can > > > + * send the prepared transaction again. See > > > + * ReorderBufferFinishPrepared. Don't update remote_final_lsn. > > > + */ > > > + skip_prepared_txn = true; > > > + return; > > > + } > > > > > > When PREPARE arrives at the subscriber node but there is the prepared > > > transaction with the same transaction identifier, the apply worker > > > skips the whole transaction. So if the users prepared a transaction > > > with the same identifier on the subscriber, the prepared transaction > > > that came from the publisher would be ignored without any messages. On > > > the other hand, if applying other operations such as HEAP_INSERT > > > conflicts (such as when violating the unique constraint) the apply > > > worker raises an ERROR and stops logical replication until the > > > conflict is resolved. IIUC since we can know that the prepared > > > transaction came from the same publisher again by checking origin_lsn > > > in TwoPhaseFileHeader I guess we can skip the PREPARE message only > > > when the existing prepared transaction has the same LSN and the same > > > identifier. To be exact, it’s still possible that the subscriber gets > > > two PREPARE messages having the same LSN and name from two different > > > publishers but it’s unlikely happen in practice. > > > > > > > The idea sounds reasonable. I'll try and see if this works. > > > > I went ahead and used both origin_lsn and origin_timestamp to avoid > the possibility of a match of prepared xact from two different nodes. > We can handle this at begin_prepare and prepare time but we don't have > prepare_lsn and prepare_timestamp at rollback_prepared time, so what > do about that? As of now, I am using just GID at rollback_prepare time > and that would have been sufficient if we always receive prepare > before rollback because at prepare time we would have checked > origin_lsn and origin_timestamp. But it is possible that we get > rollback prepared without prepare in case if prepare happened before > consistent_snapshot is reached and rollback happens after that. > Note that it is not easy to detect this case, otherwise, we would have avoided sending rollback_prepared. See comments in ReorderBufferFinishPrepared in patch v32-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer. -- With Regards, Amit Kapila.
On Thu, Dec 17, 2020 at 9:30 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Thu, Dec 17, 2020 at 2:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > On again thinking about this, I think it is good to disable it during > > slot initialization but will it create any problem because during slot > > initialization we don't stream any xact and stop processing WAL as > > soon as we reach CONSISTENT_STATE? Did you observe any problem with > > this? > > > Yes, it did not stream any xact during initialization but I was > surprised that the DecodePrepare code was invoked even though > I hadn't created the subscription with twophase enabled. No problem > was observed. > Fair enough, I have fixed this in the patch-series posted sometime back. -- With Regards, Amit Kapila.
On Thu, Dec 17, 2020 at 11:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > I went ahead and used both origin_lsn and origin_timestamp to avoid > the possibility of a match of prepared xact from two different nodes. > We can handle this at begin_prepare and prepare time but we don't have > prepare_lsn and prepare_timestamp at rollback_prepared time, so what > do about that? As of now, I am using just GID at rollback_prepare time > and that would have been sufficient if we always receive prepare > before rollback because at prepare time we would have checked > origin_lsn and origin_timestamp. But it is possible that we get > rollback prepared without prepare in case if prepare happened before > consistent_snapshot is reached and rollback happens after that. For > commit-case, we do send prepare and all the data at commit time in > such a case but doing so for rollback case doesn't sound to be a good > idea. Another possibility is that we send prepare_lsn and prepare_time > in rollback_prepared API to deal with this. I am not sure if it is a > good idea to just rely on GID in rollback_prepare. What do you think? Thinking about it for some time, my initial reaction was that the distributed servers should maintain uniqueness of GIDs and re-checking with LSNs is just overkill. But thinking some more, I realise that since we allow reuse of GIDs, there could be a race condition where a previously aborted/committed txn's GID was reused which could lead to this. Yes, I think we could change rollback_prepare to send out prepare_lsn and prepare_time as well, just to be safe. regards, Ajin Cherian Fujitsu Australia.
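A minimal sketch of what the wire format could look like if rollback_prepared carried the prepare LSN and time as discussed; the message constant LOGICAL_REP_MSG_ROLLBACK_PREPARED and the exact field layout are assumptions here, not the final protocol:

    /*
     * Sketch only: write a ROLLBACK PREPARED message that also carries
     * enough information (prepare end LSN and prepare time) for the
     * subscriber to verify that its locally prepared transaction really is
     * the one being rolled back, even if the GID has been reused.
     */
    void
    logicalrep_write_rollback_prepared(StringInfo out, ReorderBufferTXN *txn,
                                       XLogRecPtr prepare_end_lsn,
                                       TimestampTz prepare_time)
    {
        pq_sendbyte(out, LOGICAL_REP_MSG_ROLLBACK_PREPARED);

        pq_sendint64(out, prepare_end_lsn); /* where the PREPARE record ended */
        pq_sendint64(out, prepare_time);    /* when the PREPARE happened */

        /* the user-visible transaction identifier */
        pq_sendstring(out, txn->gid);
    }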
On Fri, Dec 18, 2020 at 11:23 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Thu, Dec 17, 2020 at 11:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I went ahead and used both origin_lsn and origin_timestamp to avoid > > the possibility of a match of prepared xact from two different nodes. > > We can handle this at begin_prepare and prepare time but we don't have > > prepare_lsn and prepare_timestamp at rollback_prepared time, so what > > do about that? As of now, I am using just GID at rollback_prepare time > > and that would have been sufficient if we always receive prepare > > before rollback because at prepare time we would have checked > > origin_lsn and origin_timestamp. But it is possible that we get > > rollback prepared without prepare in case if prepare happened before > > consistent_snapshot is reached and rollback happens after that. For > > commit-case, we do send prepare and all the data at commit time in > > such a case but doing so for rollback case doesn't sound to be a good > > idea. Another possibility is that we send prepare_lsn and prepare_time > > in rollback_prepared API to deal with this. I am not sure if it is a > > good idea to just rely on GID in rollback_prepare. What do you think? > > Thinking about it for some time, my initial reaction was that the > distributed servers should maintain uniqueness of GIDs and re-checking > with LSNs is just overkill. But thinking some more, I realise that > since we allow reuse of GIDs, there could be a race condition where a > previously aborted/committed txn's GID was reused > which could lead to this. Yes, I think we could change > rollback_prepare to send out prepare_lsn and prepare_time as well, > just to be safe. > Okay, I have changed the rollback_prepare API as discussed above and accordingly handled the case where rollback is received without prepare in apply_handle_rollback_prepared. While testing this case, I noticed that the tracking of replication progress for aborts is not complete, due to which, after a restart, we can again ask for the rollback LSN. This shouldn't be a problem with the latest code because we will simply skip it when there is no corresponding prepare, but this is far from ideal because that is the sole purpose of tracking via replication origins. This was due to the incomplete handling of aborts in the original commit 1eb6d6527a. I have now fixed this in a separate patch, v33-0004-Track-replication-origin-progress-for-rollbacks. If you want to see the problem then change the below code and don't apply v33-0004-Track-replication-origin-progress-for-rollbacks; the regression failure is because we are not tracking progress for aborts:

apply_handle_rollback_prepared
{
..
    if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
                    rollback_data.preparetime))
..
}

to

apply_handle_rollback_prepared
{
..
    Assert(LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
                       rollback_data.preparetime));
..
}

-- With Regards, Amit Kapila.
Attachment
- v33-0001-Extend-the-output-plugin-API-to-allow-decoding-o.patch
- v33-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch
- v33-0003-Refactor-spool-file-logic-in-worker.c.patch
- v33-0004-Track-replication-origin-progress-for-rollbacks.patch
- v33-0005-Add-support-for-apply-at-prepare-time-to-built-i.patch
- v33-0006-Support-2PC-documentation.patch
- v33-0007-Support-2PC-txn-subscriber-tests.patch
- v33-0008-Support-2PC-txn-Subscription-option.patch
- v33-0009-Support-2PC-consistent-snapshot-isolation-tests.patch
- v33-0010-Support-2PC-txn-tests-for-concurrent-aborts.patch
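For readers following along, a condensed sketch of the apply-side handling discussed above; the struct and function names follow the snippets quoted in this thread, but the body is abbreviated and should not be read as the actual patch:

    /*
     * Sketch: apply a ROLLBACK PREPARED.  The PREPARE may never have
     * reached this subscriber (e.g. it happened before a consistent
     * snapshot was reached), so only finish the prepared transaction if
     * one matching the GID *and* the prepare LSN/time actually exists;
     * matching on GID alone could hit a reused identifier.
     */
    static void
    apply_handle_rollback_prepared(StringInfo s)
    {
        LogicalRepRollbackPreparedTxnData rollback_data;

        logicalrep_read_rollback_prepared(s, &rollback_data);

        if (LookupGXact(rollback_data.gid, rollback_data.prepare_end_lsn,
                        rollback_data.preparetime))
            FinishPreparedTransaction(rollback_data.gid, false);

        /*
         * Either way, replication origin progress must be advanced so that
         * a restart does not re-request this rollback (the issue v33-0004
         * addresses); that bookkeeping is omitted here.
         */
    }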
On Sat, Dec 19, 2020 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Okay, I have changed the rollback_prepare API as discussed above and > accordingly handle the case where rollback is received without prepare > in apply_handle_rollback_prepared. I have reviewed and tested your new patchset; I agree with all the changes that you have made, and I have tested quite a few scenarios and they seem to be working as expected. No major comments, but some minor observations: Patch 1: logical.c: 984 The comment should be "rollback prepared" rather than "abort prepared". Patch 2: decode.c: 737: The comments in the header of DecodePrepare seem out of place; I think they should describe what the function does rather than what it does not. reorderbuffer.c: 2422: It looks like pgindent has mangled the comments; the numbering is no longer aligned. Patch 5: worker.c: 753: Typo: change "dont" to "don't" Patch 6: logicaldecoding.sgml The logicaldecoding example is no longer correct. This was true prior to the changes done to replay prepared transactions after a restart. Now the whole transaction will get decoded again after the commit prepared.

postgres=# COMMIT PREPARED 'test_prepared1';
COMMIT PREPARED
postgres=# select * from
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'two-phase-commit', '1');
    lsn    | xid |                    data
-----------+-----+--------------------------------------------
 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529
(1 row)

Patch 8: worker.c: 2798 : worker.c: 3445 : disabling two-phase in the tablesync worker. Considering the new design of multiple commits in tablesync, do we need to disable two-phase in tablesync? Other than this, I've noticed a few typos that are not in the patch but in the surrounding code. logical.c: 1383: Comment should mention stream_commit_cb, not stream_abort_cb. decode.c: 686 - Extra "it's" here: "because it's it happened" regards, Ajin Cherian Fujitsu Australia
On Tue, Dec 22, 2020 at 2:51 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Sat, Dec 19, 2020 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Okay, I have changed the rollback_prepare API as discussed above and > > accordingly handle the case where rollback is received without prepare > > in apply_handle_rollback_prepared. > > > I have reviewed and tested your new patchset, I agree with all the > changes that you have made and have tested quite a few scenarios and > they seem to be working as expected. > No major comments but some minor observations: > > Patch 1: > logical.c: 984 > Comment should be "rollback prepared" rather than "abort prepared". > Agreed. > Patch 2: > decode.c: 737: The comments in the header of DecodePrepare seem out of > place, I think here it should describe what the function does rather > than what it does not. > Hmm, I have written it because it is important to explain the theory of concurrent aborts as that is not quite obvious. Also, the functionality is quite similar to DecodeCommit and the comments inside the function explain clearly if there is any difference so not sure what additional we can write, do you have any suggestions? > reorderbuffer.c: 2422: It looks like pg_indent has mangled the > comments, the numbering is no longer aligned. > Yeah, I had also noticed that but not sure if there is a better alternative because we don't want to change it after each pgindent run. We might want to use (a), (b) .. notation instead but otherwise, there is no big problem with how it is. > Patch 5: > worker.c: 753: Type: change "dont" to "don't" > Okay. > Patch 6: logicaldecoding.sgml > logicaldecoding example is no longer correct. This was true prior to > the changes done to replay prepared transactions after a restart. Now > the whole transaction will get decoded again after the commit > prepared. > > postgres=# COMMIT PREPARED 'test_prepared1'; > COMMIT PREPARED > postgres=# select * from > pg_logical_slot_get_changes('regression_slot', NULL, NULL, > 'two-phase-commit', '1'); > lsn | xid | data > -----------+-----+-------------------------------------------- > 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529 > (1 row) > Agreed. > Patch 8: > worker.c: 2798 : > worker.c: 3445 : disabling two-phase in tablesync worker. > considering new design of multiple commits in tablesync, do we need > to disable two-phase in tablesync? > No, but let Peter's patch get committed then we can change it. > Other than this I've noticed a few typos that are not in the patch but > in the surrounding code. > logical.c: 1383: Comment should mention stream_commit_cb not stream_abort_cb. > decode.c: 686 - Extra "it's" here: "because it's it happened" > Anything not related to this patch, please post in a separate email. Can you please update the patch for the points we agreed upon? -- With Regards, Amit Kapila.
On Tue, Dec 22, 2020 at 8:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Dec 22, 2020 at 2:51 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Sat, Dec 19, 2020 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > Okay, I have changed the rollback_prepare API as discussed above and > > > accordingly handle the case where rollback is received without prepare > > > in apply_handle_rollback_prepared. > > > > > > I have reviewed and tested your new patchset, I agree with all the > > changes that you have made and have tested quite a few scenarios and > > they seem to be working as expected. > > No major comments but some minor observations: > > > > Patch 1: > > logical.c: 984 > > Comment should be "rollback prepared" rather than "abort prepared". > > > > Agreed. Changed. > > > Patch 2: > > decode.c: 737: The comments in the header of DecodePrepare seem out of > > place, I think here it should describe what the function does rather > > than what it does not. > > > > Hmm, I have written it because it is important to explain the theory > of concurrent aborts as that is not quite obvious. Also, the > functionality is quite similar to DecodeCommit and the comments inside > the function explain clearly if there is any difference so not sure > what additional we can write, do you have any suggestions? I have slightly re-worded it. Have a look. > > > reorderbuffer.c: 2422: It looks like pg_indent has mangled the > > comments, the numbering is no longer aligned. > > > > Yeah, I had also noticed that but not sure if there is a better > alternative because we don't want to change it after each pgindent > run. We might want to use (a), (b) .. notation instead but otherwise, > there is no big problem with how it is. Leaving this as is. > > > Patch 5: > > worker.c: 753: Type: change "dont" to "don't" > > > > Okay. Changed. > > > Patch 6: logicaldecoding.sgml > > logicaldecoding example is no longer correct. This was true prior to > > the changes done to replay prepared transactions after a restart. Now > > the whole transaction will get decoded again after the commit > > prepared. > > > > postgres=# COMMIT PREPARED 'test_prepared1'; > > COMMIT PREPARED > > postgres=# select * from > > pg_logical_slot_get_changes('regression_slot', NULL, NULL, > > 'two-phase-commit', '1'); > > lsn | xid | data > > -----------+-----+-------------------------------------------- > > 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529 > > (1 row) > > > > Agreed. Changed. > > > Patch 8: > > worker.c: 2798 : > > worker.c: 3445 : disabling two-phase in tablesync worker. > > considering new design of multiple commits in tablesync, do we need > > to disable two-phase in tablesync? > > > > No, but let Peter's patch get committed then we can change it. OK, leaving it. > Can you please update the patch for the points we agreed upon? Changed and attached. regards, Ajin Cherian Fujitsu Australia
Attachment
- v34-0002-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch
- v34-0003-Refactor-spool-file-logic-in-worker.c.patch
- v34-0005-Add-support-for-apply-at-prepare-time-to-built-i.patch
- v34-0004-Track-replication-origin-progress-for-rollbacks.patch
- v34-0001-Extend-the-output-plugin-API-to-allow-decoding-o.patch
- v34-0006-Support-2PC-documentation.patch
- v34-0007-Support-2PC-txn-subscriber-tests.patch
- v34-0008-Support-2PC-txn-Subscription-option.patch
- v34-0009-Support-2PC-consistent-snapshot-isolation-tests.patch
- v34-0010-Support-2PC-txn-tests-for-concurrent-aborts.patch
On Wed, Dec 23, 2020 at 3:08 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > Can you please update the patch for the points we agreed upon? > > Changed and attached. > Thanks, I have looked at these patches again and it seems patches 0001 to 0004 are in good shape, and among those, v33-0001-Extend-the-output-plugin-API-to-allow-decoding-o is good to go. So, I am planning to push the first patch (0001*) sometime next week unless you or someone else has any comments on it. -- With Regards, Amit Kapila.
Hi, Amit-San On Thursday, Dec 24, 2020 2:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Dec 23, 2020 at 3:08 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > Can you please update the patch for the points we agreed upon? > > > > Changed and attached. > > > > Thanks, I have looked at these patches again and it seems patches 0001 to > 0004 are in good shape, and among those > v33-0001-Extend-the-output-plugin-API-to-allow-decoding-o is good to go. > So, I am planning to push the first patch (0001*) in next week sometime > unless you or someone else has any comments on it. I agree with this from the perspective of code quality for memory management. I reviewed the v33 patchset using valgrind and conclude that version 33 of the patchset has no problems in terms of memory management. This conclusion can be applied to v34 as well, because the differences between the two versions are really small. I compared the valgrind log files between master and master with the v33 patchset applied. I checked both the contrib/test_decoding tests and src/test/subscription, run under valgrind, of course. The first reason I reached this conclusion is that I didn't find any description of a memcheck error in the log files. I picked up the error message expressions from the valgrind documentation [1] and grepped for them, but there were no matches. Secondly, I surveyed the function stacks for valgrind's 3 types of memory leak, "Definitely lost", "Indirectly lost" and "Possibly lost", and it turned out that the patchset didn't add any new cause of memory leaks. [1] - https://valgrind.org/docs/manual/mc-manual.html#mc-manual.errormsgs Best Regards, Takamichi Osumi
Hi Ajin, On Wed, Dec 23, 2020 at 6:38 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Tue, Dec 22, 2020 at 8:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Dec 22, 2020 at 2:51 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > On Sat, Dec 19, 2020 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > Okay, I have changed the rollback_prepare API as discussed above and > > > > accordingly handle the case where rollback is received without prepare > > > > in apply_handle_rollback_prepared. > > > > > > > > > I have reviewed and tested your new patchset, I agree with all the > > > changes that you have made and have tested quite a few scenarios and > > > they seem to be working as expected. > > > No major comments but some minor observations: > > > > > > Patch 1: > > > logical.c: 984 > > > Comment should be "rollback prepared" rather than "abort prepared". > > > > > > > Agreed. > > Changed. > > > > > > Patch 2: > > > decode.c: 737: The comments in the header of DecodePrepare seem out of > > > place, I think here it should describe what the function does rather > > > than what it does not. > > > > > > > Hmm, I have written it because it is important to explain the theory > > of concurrent aborts as that is not quite obvious. Also, the > > functionality is quite similar to DecodeCommit and the comments inside > > the function explain clearly if there is any difference so not sure > > what additional we can write, do you have any suggestions? > > I have slightly re-worded it. Have a look. > > > > > > reorderbuffer.c: 2422: It looks like pg_indent has mangled the > > > comments, the numbering is no longer aligned. > > > > > > > Yeah, I had also noticed that but not sure if there is a better > > alternative because we don't want to change it after each pgindent > > run. We might want to use (a), (b) .. notation instead but otherwise, > > there is no big problem with how it is. > > Leaving this as is. > > > > > > Patch 5: > > > worker.c: 753: Type: change "dont" to "don't" > > > > > > > Okay. > > Changed. > > > > > > Patch 6: logicaldecoding.sgml > > > logicaldecoding example is no longer correct. This was true prior to > > > the changes done to replay prepared transactions after a restart. Now > > > the whole transaction will get decoded again after the commit > > > prepared. > > > > > > postgres=# COMMIT PREPARED 'test_prepared1'; > > > COMMIT PREPARED > > > postgres=# select * from > > > pg_logical_slot_get_changes('regression_slot', NULL, NULL, > > > 'two-phase-commit', '1'); > > > lsn | xid | data > > > -----------+-----+-------------------------------------------- > > > 0/168A060 | 529 | COMMIT PREPARED 'test_prepared1', txid 529 > > > (1 row) > > > > > > > Agreed. > > Changed. > > > > > > Patch 8: > > > worker.c: 2798 : > > > worker.c: 3445 : disabling two-phase in tablesync worker. > > > considering new design of multiple commits in tablesync, do we need > > > to disable two-phase in tablesync? > > > > > > > No, but let Peter's patch get committed then we can change it. > > OK, leaving it. > > > Can you please update the patch for the points we agreed upon? > > Changed and attached. Thank you for updating the patches! I realized that this patch is not registered yet for the next CommitFest[1] that starts in a couple of days. I found the old entry of this patch[2] but it's marked as "Returned with feedback". Although this patch is being reviewed actively, I suggest adding it before 2021-01-01 AoE[3] so that cfbot can also test your patch. 
Regards, [1] https://commitfest.postgresql.org/31/ [2] https://commitfest.postgresql.org/22/944/ [3] https://en.wikipedia.org/wiki/Anywhere_on_Earth -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
Hi Sawada-san, I think Amit has a plan to commit this patch-set in phases, so I will leave the CommitFest registration to him. I took the time to refactor the test_decoding isolation test for consistent snapshots so that it uses just 3 sessions rather than 4. Posting an updated patch-0009. regards, Ajin Cherian Fujitsu Australia
Attachment
On Tue, Dec 29, 2020 at 3:15 PM Ajin Cherian <itsajin@gmail.com> wrote: > > Hi Sawada-san, > > I think Amit has a plan to commit this patch-set in phases. > I have pushed the first patch and I would like to make a few changes in the second patch after which I will post the new version. I'll try to do that tomorrow if possible and register the patch. > I will > leave it to him to decide because I think he has a plan. > I took time to refactor the test_decoding isolation test for > consistent snapshot so that it uses just 3 sessions rather than 4. > Posting an updated patch-0009 > Thanks, I will look into this. -- With Regards, Amit Kapila.
On Wed, Dec 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Dec 29, 2020 at 3:15 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > Hi Sawada-san, > > > > I think Amit has a plan to commit this patch-set in phases. > > > > I have pushed the first patch and I would like to make a few changes > in the second patch after which I will post the new version. I'll try > to do that tomorrow if possible and register the patch. > Please find attached a rebased version of this patch-set. I have made a number of changes in v35-0001-Allow-decoding-at-prepare-time-in-ReorderBuffer. 1. Centralized the logic to decide whether to perform decoding at prepare time in the FilterPrepare function. 2. Changed comments atop DecodePrepare. I didn't much like the comments changed by Ajin in the last patch. 3. Merged the doc changes patch after some mostly cosmetic changes. I am planning to commit the first patch in this series early next week after reading it once more. -- With Regards, Amit Kapila.
Attachment
- v35-0001-Allow-decoding-at-prepare-time-in-ReorderBuffer.patch
- v35-0002-Refactor-spool-file-logic-in-worker.c.patch
- v35-0003-Track-replication-origin-progress-for-rollbacks.patch
- v35-0004-Add-support-for-apply-at-prepare-time-to-built-i.patch
- v35-0005-Support-2PC-txn-subscriber-tests.patch
- v35-0006-Support-2PC-txn-Subscription-option.patch
- v35-0007-Support-2PC-consistent-snapshot-isolation-tests.patch
- v35-0008-Support-2PC-txn-tests-for-concurrent-aborts.patch
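As a rough picture of the centralization mentioned in point 1, modeled on FilterByOrigin; the exact signature and placement are per the patch and may differ from this sketch:

    /*
     * Sketch: single place that decides whether a transaction is decoded
     * at PREPARE time.  Returns true if the PREPARE should be skipped, in
     * which case the whole transaction is decoded at COMMIT PREPARED time.
     */
    static bool
    FilterPrepare(LogicalDecodingContext *ctx, const char *gid)
    {
        /* The output plugin did not ask for two-phase decoding at all. */
        if (!ctx->twophase)
            return true;

        /*
         * Plugins without the optional filter callback get every prepared
         * transaction, mirroring how FilterByOrigin treats a missing
         * filter_by_origin_cb.
         */
        if (ctx->callbacks.filter_prepare_cb == NULL)
            return false;

        return filter_prepare_cb_wrapper(ctx, gid);
    }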
On Thu, Dec 31, 2020 at 10:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 30, 2020 at 6:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Dec 29, 2020 at 3:15 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > Hi Sawada-san, > > > > > > I think Amit has a plan to commit this patch-set in phases. > > > > > > > I have pushed the first patch and I would like to make a few changes > > in the second patch after which I will post the new version. I'll try > > to do that tomorrow if possible and register the patch. > > > > Please find attached a rebased version of this patch-set. > Registered in CF (https://commitfest.postgresql.org/31/2914/). -- With Regards, Amit Kapila.
On Thu, Dec 31, 2020 at 4:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > 3. Merged the doc changes patch after some changes mostly cosmetic. Some minor comments here: v35-0001 - logicaldecoding.sgml In the example section: Change "The following example shows SQL interface can be used to decode prepared transactions." to "The following example shows the SQL interface that can be used to decode prepared transactions." Then in "Two-phase commit support for Logical Decoding" page: Change "To support streaming of two-phase commands, an output plugin needs to provide the additional callbacks." to "To support streaming of two-phase commands, an output plugin needs to provide additional callbacks." Other than that, I have no more comments. regards, Ajin Cherian Fujitsu Australia
On Thu, Dec 31, 2020 at 12:31 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Thu, Dec 31, 2020 at 4:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > 3. Merged the doc changes patch after some changes mostly cosmetic. > Some minor comments here: > > v35-0001 - logicaldecoding.sgml > > In the example section: > Change "The following example shows SQL interface can be used to > decode prepared transactions." > to "The following example shows the SQL interface that can be used to > decode prepared transactions." > > Then in "Two-phase commit support for Logical Decoding" page: > Change "To support streaming of two-phase commands, an output plugin > needs to provide the additional callbacks." > to "To support streaming of two-phase commands, an output plugin needs > to provide additional callbacks." > > Other than that, I have no more comments. > Thanks, I have pushed the 0001* patch after making the above and a few other cosmetic modifications. -- With Regards, Amit Kapila.
On Tue, Dec 29, 2020 at 3:15 PM Ajin Cherian <itsajin@gmail.com> wrote: > > Hi Sawada-san, > > I think Amit has a plan to commit this patch-set in phases. I will > leave it to him to decide because I think he has a plan. > I took time to refactor the test_decoding isolation test for > consistent snapshot so that it uses just 3 sessions rather than 4. > Posting an updated patch-0009 > I have reviewed this test case patch and have the below comments: 1. +step "s1checkpoint" { CHECKPOINT; } ... +step "s2alter" { ALTER TABLE do_write ADD COLUMN addedbys2 int; } I don't see the need for the above steps and we should be able to generate the required scenario without these as well. Is there any reason to keep those? 2. "s3c""s1insert" space is missing between these two. 3. +# Force building of a consistent snapshot between a PREPARE and COMMIT PREPARED. +# Ensure that the whole transaction is decoded fresh at the time of COMMIT PREPARED. +permutation "s2b" "s2txid" "s1init" "s3b" "s3txid" "s2alter" "s2c" "s2b" "s2insert" "s2prepare" "s3c""s1insert" "s1checkpoint" "s1start" "s2commit" "s1start" I think we can update the above comments to indicate how and which important steps help us to realize the required scenario. See subxact_without_top.spec for reference. 4. +step "s2c" { COMMIT; } ... +step "s2prepare" { PREPARE TRANSACTION 'test1'; } +step "s2commit" { COMMIT PREPARED 'test1'; } s2c and s2commit seem to be confusing names as both sounds like doing the same thing. How about changing s2commit to s2cp and s2prepare to s2p? -- With Regards, Amit Kapila.
On Tue, Jan 5, 2021 at 5:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have reviewed this test case patch and have the below comments: > > 1. > +step "s1checkpoint" { CHECKPOINT; } > ... > +step "s2alter" { ALTER TABLE do_write ADD COLUMN addedbys2 int; } > > I don't see the need for the above steps and we should be able to > generate the required scenario without these as well. Is there any > reason to keep those? Removed. > > 2. > "s3c""s1insert" > > space is missing between these two. Updated. > > 3. > +# Force building of a consistent snapshot between a PREPARE and > COMMIT PREPARED. > +# Ensure that the whole transaction is decoded fresh at the time of > COMMIT PREPARED. > +permutation "s2b" "s2txid" "s1init" "s3b" "s3txid" "s2alter" "s2c" > "s2b" "s2insert" "s2prepare" "s3c""s1insert" "s1checkpoint" "s1start" > "s2commit" "s1start" > > I think we can update the above comments to indicate how and which > important steps help us to realize the required scenario. See > subxact_without_top.spec for reference. Added more comments to explain the state change of logical decoding. > 4. > +step "s2c" { COMMIT; } > ... > +step "s2prepare" { PREPARE TRANSACTION 'test1'; } > +step "s2commit" { COMMIT PREPARED 'test1'; } > > s2c and s2commit seem to be confusing names as both sounds like doing > the same thing. How about changing s2commit to s2cp and s2prepare to > s2p? Updated. I've addressed the above comments and the patch is attached. I've called it v36-0007. regards, Ajin Cherian Fujitsu Australia
Attachment
On Tue, Jan 5, 2021 at 2:11 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > I've addressed the above comments and the patch is attached. I've > called it v36-0007. > Thanks, I have pushed this after minor wordsmithing. -- With Regards, Amit Kapila.
On Tue, Dec 22, 2020 at 3:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Dec 22, 2020 at 2:51 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > Other than this I've noticed a few typos that are not in the patch but > > in the surrounding code. > > logical.c: 1383: Comment should mention stream_commit_cb not stream_abort_cb. > > decode.c: 686 - Extra "it's" here: "because it's it happened" > > > > Anything not related to this patch, please post in a separate email. > Pushed the fix for above reported typos. -- With Regards, Amit Kapila.
On Tue, Jan 5, 2021 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 5, 2021 at 2:11 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > I've addressed the above comments and the patch is attached. I've > > called it v36-0007. > > > > Thanks, I have pushed this after minor wordsmithing. > The test case is failing on one of the build farm machines. See email from Tom Lane [1]. The symptom clearly shows that we are decoding empty xacts which can happen due to background activity by autovacuum. I think we need a fix similar to what we have done in https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=82a0ba7707e010a29f5fe1a0020d963c82b8f1cb. I'll try to reproduce and provide a fix for this later today or tomorrow. [1] - https://www.postgresql.org/message-id/363512.1610171267%40sss.pgh.pa.us -- With Regards, Amit Kapila.
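For background, test_decoding's existing skip-empty-xacts behaviour is the usual remedy for this class of flakiness, and the referenced commit presumably applies the same pattern: BEGIN output is deferred until the first real change, so an empty transaction produces no output at all. A condensed sketch of that pattern (see contrib/test_decoding for the full version):

    /*
     * Condensed from contrib/test_decoding: with skip-empty-xacts enabled,
     * BEGIN is emitted lazily from the first change callback, so a
     * transaction with no changes (e.g. background autovacuum activity)
     * generates no output and cannot perturb test results.
     */
    static void
    pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                     Relation relation, ReorderBufferChange *change)
    {
        TestDecodingData *data = ctx->output_plugin_private;
        TestDecodingTxnData *txndata = txn->output_plugin_private;

        /* Emit the deferred BEGIN now that the xact turns out to have data. */
        if (data->skip_empty_xacts && !txndata->xact_wrote_changes)
            pg_output_begin(ctx, data, txn, false);
        txndata->xact_wrote_changes = true;

        /* ... formatting and writing of the actual change elided ... */
    }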
On Sat, Jan 9, 2021 at 12:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 5, 2021 at 4:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jan 5, 2021 at 2:11 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > > > > I've addressed the above comments and the patch is attached. I've > > > called it v36-0007. > > > > > > > Thanks, I have pushed this after minor wordsmithing. > > > > The test case is failing on one of the build farm machines. See email > from Tom Lane [1]. The symptom clearly shows that we are decoding > empty xacts which can happen due to background activity by autovacuum. > I think we need a fix similar to what we have done in > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=82a0ba7707e010a29f5fe1a0020d963c82b8f1cb. > > I'll try to reproduce and provide a fix for this later today or tomorrow. > I have pushed the fix. -- With Regards, Amit Kapila.
Please find attached the new patch set v37. This patch set v37* is now rebased to use the most recent tablesync patch from the other thread [1], i.e. notice that v37-0001 is an exact copy of v17-0001-Tablesync-Solution1.patch. Details of how the v37* patches relate to earlier patches are shown below:

======
v35-0001 -> committed -> NA
v17-0001-Tablesync-Solution1 -> (copy from [1]) -> v37-0001
v35-0002 -> (unchanged) -> v37-0002
v35-0003 -> (unchanged) -> v37-0003
v35-0004 -> (modify code, apply_handle_prepare changed for tablesync worker) -> v37-0004
v35-0005 -> (unchanged) -> v37-0005
v35-0006 -> (modify code, twophase mode is now same for tablesync/apply slots) -> v37-0006
v35-0007 -> v36-0007 -> committed -> NA
v35-0008 -> (unchanged) -> v37-0007
======

---- [1] https://www.postgresql.org/message-id/flat/CAA4eK1KHJxaZS-fod-0fey%3D0tq3%3DGkn4ho%3D8N4-5HWiCfu0H1A%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Attachment
- v37-0001-Tablesync-Solution1.patch
- v37-0003-Track-replication-origin-progress-for-rollbacks.patch
- v37-0002-Refactor-spool-file-logic-in-worker.c.patch
- v37-0004-Add-support-for-apply-at-prepare-time-to-built-i.patch
- v37-0005-Support-2PC-txn-subscriber-tests.patch
- v37-0006-Support-2PC-txn-Subscription-option.patch
- v37-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch
PSA the new patch set v38*. This patch set has been rebased to use the most recent tablesync patch from other thread [1] (i.e. notice that v38-0001 is an exact copy of that thread's tablesync patch v31) ---- [1] https://www.postgresql.org/message-id/flat/CAA4eK1KHJxaZS-fod-0fey%3D0tq3%3DGkn4ho%3D8N4-5HWiCfu0H1A%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Attachment
- v38-0001-Tablesync-V31.patch
- v38-0003-Track-replication-origin-progress-for-rollbacks.patch
- v38-0002-Refactor-spool-file-logic-in-worker.c.patch
- v38-0004-Add-support-for-apply-at-prepare-time-to-built-i.patch
- v38-0005-Support-2PC-txn-subscriber-tests.patch
- v38-0006-Support-2PC-txn-Subscription-option.patch
- v38-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch
On Wed, Feb 10, 2021 at 3:59 PM Peter Smith <smithpb2250@gmail.com> wrote: > > PSA the new patch set v38*. > > This patch set has been rebased to use the most recent tablesync patch > from other thread [1] > (i.e. notice that v38-0001 is an exact copy of that thread's tablesync > patch v31) > I see one problem which might lead to skipping prepared xacts for some of the subscriptions. The problem is that we skip the prepared xacts based on GID, and the same prepared transaction can arrive on the subscriber for different subscriptions. And even if we hadn't skipped the prepared xact, it would have led to the error "transaction identifier "p1" is already in use". See the scenario below:

On Publisher:
===========
CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
CREATE TABLE mytbl2(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

postgres=# BEGIN;
BEGIN
postgres=*# INSERT INTO mytbl1(somedata, text) VALUES (1, 1);
INSERT 0 1
postgres=*# INSERT INTO mytbl1(somedata, text) VALUES (1, 2);
INSERT 0 1
postgres=*# COMMIT;
COMMIT
postgres=# BEGIN;
BEGIN
postgres=*# INSERT INTO mytbl2(somedata, text) VALUES (1, 1);
INSERT 0 1
postgres=*# INSERT INTO mytbl2(somedata, text) VALUES (1, 2);
INSERT 0 1
postgres=*# Commit;
COMMIT
postgres=# CREATE PUBLICATION mypub1 FOR TABLE mytbl1;
CREATE PUBLICATION
postgres=# CREATE PUBLICATION mypub2 FOR TABLE mytbl2;
CREATE PUBLICATION

On Subscriber:
============
CREATE TABLE mytbl1(id SERIAL PRIMARY KEY, somedata int, text varchar(120));
CREATE TABLE mytbl2(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

postgres=# CREATE SUBSCRIPTION mysub1
postgres-# CONNECTION 'host=localhost port=5432 dbname=postgres'
postgres-# PUBLICATION mypub1;
NOTICE: created replication slot "mysub1" on publisher
CREATE SUBSCRIPTION
postgres=# CREATE SUBSCRIPTION mysub2
postgres-# CONNECTION 'host=localhost port=5432 dbname=postgres'
postgres-# PUBLICATION mypub2;
NOTICE: created replication slot "mysub2" on publisher
CREATE SUBSCRIPTION

On Publisher:
============
postgres=# Begin;
BEGIN
postgres=*# INSERT INTO mytbl1(somedata, text) VALUES (1, 3);
INSERT 0 1
postgres=*# INSERT INTO mytbl2(somedata, text) VALUES (1, 3);
INSERT 0 1
postgres=*# Prepare Transaction 'myprep1';

After this step, wait for a few seconds and then perform Commit Prepared 'myprep1'; on the publisher, and you will notice the following error in the subscriber log: "ERROR: prepared transaction with identifier "myprep1" does not exist"

One idea to avoid this is to use the subscription_id along with the GID on the subscriber for prepared xacts. Let me know if you have any better ideas to handle this. A few other minor comments on v38-0004-Add-support-for-apply-at-prepare-time-to-built-i: ======================================================================
1.
- * Mark the prepared transaction as valid. As soon as xact.c marks
- * MyProc as not running our XID (which it will do immediately after
- * this function returns), others can commit/rollback the xact.
+ * Mark the prepared transaction as valid. As soon as xact.c marks MyProc
+ * as not running our XID (which it will do immediately after this
+ * function returns), others can commit/rollback the xact.

Why this change in this patch? Is it due to pgindent? If so, you should exclude this change.
2. 
@@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn, pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT); - /* send the flags field (unused for now) */ + /* send the flags field */ pq_sendbyte(out, flags); Is there a reason to change the above comment? -- With Regards, Amit Kapila.
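For illustration, here is a minimal standalone C sketch of the "subscription_id along with GID" idea suggested above. The helper name and format string are invented for this demo (not code from any patch version), though the pg_gid_<subid>_<xid> shape matches the output the later v68 fix reports:

#include <stdio.h>
#include <stdint.h>

typedef uint32_t Oid;
typedef uint32_t TransactionId;

/*
 * Hypothetical subscriber-side GID builder: qualifying the generated
 * GID with the subscription OID keeps two subscriptions that receive
 * the same prepared transaction from colliding on one identifier.
 */
static void
build_subscriber_gid(Oid subid, TransactionId xid, char *gid, int szgid)
{
    snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}

int
main(void)
{
    char gid1[64];
    char gid2[64];

    /* The same publisher transaction (xid 543) arriving via two
     * different subscriptions now yields two distinct GIDs. */
    build_subscriber_gid(16389, 543, gid1, sizeof(gid1));
    build_subscriber_gid(16390, 543, gid2, sizeof(gid2));

    printf("%s\n%s\n", gid1, gid2);  /* pg_gid_16389_543, pg_gid_16390_543 */
    return 0;
}

Because the subscription OID is unique per subscriber database, a prepared transaction fanned out to two subscriptions no longer maps to a single identifier, so neither the skip-by-GID problem nor the "already in use" error can occur.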
On Thu, Feb 11, 2021 at 12:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Few other minor comments on > v38-0004-Add-support-for-apply-at-prepare-time-to-built-i: > ====================================================================== > 1. > - * Mark the prepared transaction as valid. As soon as xact.c marks > - * MyProc as not running our XID (which it will do immediately after > - * this function returns), others can commit/rollback the xact. > + * Mark the prepared transaction as valid. As soon as xact.c marks MyProc > + * as not running our XID (which it will do immediately after this > + * function returns), others can commit/rollback the xact. > > Why this change in this patch? Is it due to pgindent? If so, you need > to exclude this change? Fixed in V39. > > 2. > @@ -78,7 +78,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn, > > pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT); > > - /* send the flags field (unused for now) */ > + /* send the flags field */ > pq_sendbyte(out, flags); > > Is there a reason to change the above comment? Fixed in V39. ---------- Please find attached the new 2PC patch set v39* This fixes some recent feedback comments (see above). ---- Kind Regards, Peter Smith. Fujitsu Australia
Attachment
- v39-0002-Refactor-spool-file-logic-in-worker.c.patch
- v39-0003-Track-replication-origin-progress-for-rollbacks.patch
- v39-0004-Add-support-for-apply-at-prepare-time-to-built-i.patch
- v39-0005-Support-2PC-txn-subscriber-tests.patch
- v39-0001-Tablesync-V31.patch
- v39-0006-Support-2PC-txn-Subscription-option.patch
- v39-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch
Hi

On Thursday, February 11, 2021 5:10 PM Peter Smith <smithpb2250@gmail.com> wrote:
> Please find attached the new 2PC patch set v39*

I started to review the patch set, so let me give some comments I have at this moment.

(1)
File : v39-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch
Modification :

@@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
 }
 txndata->xact_wrote_changes = true;

+ /* For testing concurrent aborts */
+ test_concurrent_aborts(data);
+
 class_form = RelationGetForm(relation);
 tupdesc = RelationGetDescr(relation);

Comment : There are unnecessary whitespaces in comments like the above in v39-0007.
Please check pg_decode_change(), pg_decode_truncate(), and pg_decode_stream_truncate() as well.
I suggest you align the code format using pgindent.

(2)
File : v39-0006-Support-2PC-txn-Subscription-option.patch

@@ -213,6 +219,15 @@ parse_subscription_options(List *options,
 *streaming_given = true;
 *streaming = defGetBoolean(defel);
 }
+ else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
+ {
+ if (*twophase_given)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("conflicting or redundant options")));
+ *twophase_given = true;
+ *twophase = defGetBoolean(defel);
+ }

You can easily add a test for this in subscription.sql with duplicated two_phase options.

When I find something else, I'll let you know.

Best Regards,
Takamichi Osumi
On Fri, Feb 12, 2021 at 12:29 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote: > > On Thursday, February 11, 2021 5:10 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Please find attached the new 2PC patch set v39* > I started to review the patchset > so, let me give some comments I have at this moment. > > (1) > > File : v39-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch > Modification : > > @@ -620,6 +666,9 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > } > txndata->xact_wrote_changes = true; > > + /* For testing concurrent aborts */ > + test_concurrent_aborts(data); > + > class_form = RelationGetForm(relation); > tupdesc = RelationGetDescr(relation); > > Comment : There are unnecessary whitespaces in comments like above in v37-007 > Please check such as pg_decode_change(), pg_decode_truncate(), pg_decode_stream_truncate() as well. > I suggest you align the code formats by pgindent. > This patch (v39-0007-Support-2PC-txn-tests-for-concurrent-aborts.patch) is mostly for dev-testing purpose. We don't intend to commit as this has a lot of timing-dependent tests and I am not sure if it is valuable enough at this stage. So, we can ignore cosmetic comments in this patch for now. -- With Regards, Amit Kapila.
Please find attached the new patch set v40* The tablesync patch [1] was already committed [2], so the v39-0001 patch is no longer required. v40* has been rebased to HEAD. ---- [1] https://www.postgresql.org/message-id/flat/CAA4eK1KHJxaZS-fod-0fey%3D0tq3%3DGkn4ho%3D8N4-5HWiCfu0H1A%40mail.gmail.com [2] https://github.com/postgres/postgres/commit/ce0fdbfe9722867b7fad4d3ede9b6a6bfc51fb4e Kind Regards, Peter Smith. Fujitsu Australia
Attachment
- v40-0005-Support-2PC-txn-Subscription-option.patch
- v40-0001-Refactor-spool-file-logic-in-worker.c.patch
- v40-0002-Track-replication-origin-progress-for-rollbacks.patch
- v40-0003-Add-support-for-apply-at-prepare-time-to-built-i.patch
- v40-0004-Support-2PC-txn-subscriber-tests.patch
- v40-0006-Support-2PC-txn-tests-for-concurrent-aborts.patch
On Fri, Feb 12, 2021 at 5:59 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote: > (2) > > File : v39-0006-Support-2PC-txn-Subscription-option.patch > > @@ -213,6 +219,15 @@ parse_subscription_options(List *options, > *streaming_given = true; > *streaming = defGetBoolean(defel); > } > + else if (strcmp(defel->defname, "two_phase") == 0 && twophase) > + { > + if (*twophase_given) > + ereport(ERROR, > + (errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("conflicting or redundant options"))); > + *twophase_given = true; > + *twophase = defGetBoolean(defel); > + } > > You can add this test in subscription.sql easily with double twophase options. Thanks for the feedback. You are right. But in the pgoutput.c there are several other potential syntax errors "conflicting or redundant options" which are just like this "two_phase" one. e.g. there is the same error for options "proto_version", "publication_names", "binary", "streaming". AFAIK none of those other syntax errors had any regression tests. That is the reason why I did not include any new test for the "two_phase" option. So: a) should I add a new test per your feedback comment, or b) should I be consistent with the other similar errors, and not add the test? Of course it is easy to add a new test if you think option (a) is best. Thoughts? ----- Kind Regards, Peter Smith. Fujitsu Australia
Hi

On Tuesday, February 16, 2021 8:33 AM Peter Smith <smithpb2250@gmail.com> wrote:
> On Fri, Feb 12, 2021 at 5:59 PM osumi.takamichi@fujitsu.com
> <osumi.takamichi@fujitsu.com> wrote:
> > (2)
> >
> > File : v39-0006-Support-2PC-txn-Subscription-option.patch
> >
> > @@ -213,6 +219,15 @@ parse_subscription_options(List *options,
> > *streaming_given = true;
> > *streaming = defGetBoolean(defel);
> > }
> > + else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
> > + {
> > + if (*twophase_given)
> > + ereport(ERROR,
> > + (errcode(ERRCODE_SYNTAX_ERROR),
> > + errmsg("conflicting or redundant options")));
> > + *twophase_given = true;
> > + *twophase = defGetBoolean(defel);
> > + }
> >
> > You can add this test in subscription.sql easily with double twophase options.
>
> Thanks for the feedback. You are right.
>
> But in pgoutput.c there are several other potential syntax errors
> "conflicting or redundant options" which are just like this "two_phase" one.
> e.g. there is the same error for options "proto_version", "publication_names",
> "binary", "streaming".
>
> AFAIK none of those other syntax errors had any regression tests. That is the
> reason why I did not include any new test for the "two_phase" option.
>
> So:
> a) should I add a new test per your feedback comment, or
> b) should I be consistent with the other similar errors, and not add the test?
>
> Of course it is easy to add a new test if you think option (a) is best.
>
> Thoughts?

OK. Then we can assume such tests for the other options were previously regarded as unnecessary because the results are too obvious. Let's choose (b) to keep the patch set aligned with similar existing code. Thanks.

Best Regards,
Takamichi Osumi
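For illustration, a self-contained C model of the "*_given" guard pattern being discussed, showing how a repeated option trips the "conflicting or redundant options" error. All names here are invented for the demo; the real logic lives in parse_subscription_options / pgoutput:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Minimal model of the "*_given" guard used by the option parsers: a
 * second occurrence of the same option raises an error rather than
 * silently overwriting the first value.
 */
static void
parse_options(const char *options[], int noptions)
{
    bool twophase_given = false;
    bool twophase = false;

    for (int i = 0; i < noptions; i++)
    {
        if (strcmp(options[i], "two_phase") == 0)
        {
            if (twophase_given)
            {
                fprintf(stderr, "ERROR: conflicting or redundant options\n");
                exit(1);
            }
            twophase_given = true;
            twophase = true;
        }
    }
    printf("two_phase = %s\n", twophase ? "on" : "off");
}

int
main(void)
{
    const char *ok[] = {"two_phase"};
    const char *dup[] = {"two_phase", "two_phase"};

    parse_options(ok, 1);   /* two_phase = on */
    parse_options(dup, 2);  /* ERROR: conflicting or redundant options */
    return 0;
}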
Please find attached the new patch set v41* (v40* needed to be rebased to current HEAD) ---- Kind Regards, Peter Smith. Fujitsu Australia
Attachment
- v41-0001-Refactor-spool-file-logic-in-worker.c.patch
- v41-0005-Support-2PC-txn-Subscription-option.patch
- v41-0003-Add-support-for-apply-at-prepare-time-to-built-i.patch
- v41-0002-Track-replication-origin-progress-for-rollbacks.patch
- v41-0004-Support-2PC-txn-subscriber-tests.patch
- v41-0006-Support-2PC-txn-tests-for-concurrent-aborts.patch
On Thu, Feb 18, 2021 at 5:48 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please find attached the new patch set v41*
>

I see one issue here. Currently, when we create a subscription, we first launch the apply worker and create the main apply worker slot, and then launch table sync workers as required. Now, assume the apply worker slot is created, and after that we launch a tablesync worker, which will initiate its slot (sync_slot) creation. Then, on the publisher side, the situation is such that there is a prepared transaction that happens before we reach a consistent snapshot. We can assume the exact scenario we have in twophase_snapshot.spec, where we skip a prepared xact for this reason.

Because the WALSender corresponding to the apply worker is already running, it will be in a consistent state, so for it such a prepared xact can be decoded and sent to the subscriber. On the subscriber side, it can skip applying the data-modification operations because the corresponding rel is still not in a ready state (see should_apply_changes_for_rel and its callers), simply because the corresponding table sync worker is not finished yet. But the prepare will occur, and it will lead to a prepared transaction on the subscriber.

In this situation, the tablesync worker has skipped the prepare because the snapshot was not consistent, and then it exited because it is in sync with the apply worker. And the apply worker has skipped it because tablesync was in progress. Later, when Commit Prepared comes, the apply worker will simply commit the previously prepared transaction and we will never see the prepared transaction data. So, the basic premise is that we can't allow tablesync workers to skip prepared transactions (which can be processed by the apply worker) and then process later commits.

I have one idea to address this. When we get the first begin_prepare in the apply worker, we can check if there are any relations in the "not_ready" state, and if so then just wait till all the relations become in sync with the apply worker. This is to avoid the case where any of the tablesync workers might skip a prepared xact while the apply worker also skips the same. Now, it is possible that some tablesync worker has copied the data and moved the sync position ahead of where the current apply worker's position is. In such a case, we need to process transactions in the apply worker such that we can process commits if any, and write prepared transactions to file. For prepared transactions, we can take decisions only once the commit prepared for them has arrived.

--
With Regards,
Amit Kapila.
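For illustration, a toy standalone model of the proposed gate: on the first begin_prepare, the apply worker blocks until no relation remains in a "not_ready" state. The helper name and simple polling are stand-ins for the real catalog check and latch-based worker loop, not the patch's code:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for a catalog check of pg_subscription_rel states. */
static int not_ready_rels = 3;

static bool
AllRelationsReady(void)
{
    return not_ready_rels == 0;
}

int
main(void)
{
    /* On the first begin_prepare, block until no relation is left in a
     * "not_ready" state, so no tablesync worker can still be skipping
     * the prepared transaction. */
    while (!AllRelationsReady())
    {
        printf("begin_prepare: %d relation(s) not READY, waiting...\n",
               not_ready_rels);
        not_ready_rels--;     /* pretend a tablesync worker finished */
    }
    printf("all relations READY; safe to apply the prepared transaction\n");
    return 0;
}

The point of the gate is ordering: once every relation is READY, the apply worker and the tablesync workers agree on which prepared transactions have been seen, so a prepare can never be skipped by one and then committed blindly by the other.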
Please find attached the new patch set v42* This removes the (development only) patch v41-0006 which was causing some random cfbot fails. ---- Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Wed, Mar 17, 2021 at 11:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> 5. I have modified the comments atop worker.c to explain the design
> and some of the problems clearly. See attached. If you are fine with
> this, please include it in the next version of the patch.
>

I have further expanded these comments to explain the handling of prepared transactions for multiple subscriptions on the same server, especially when the same prepared transaction operates on tables for those subscriptions. See attached; this applies atop the patch sent by me in the last email. I am not sure, but I think it might be better to add something along those lines in the user-facing docs. What do you think?

Another comment:

+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\" 2PC is %s.",
+ MySubscription->name,
+ MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
+ MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
+ MySubscription->twophase == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
+ "?")));

I don't think this is required in the LOGs, maybe at some DEBUG level, because users can check this in pg_subscription. If we keep this message, there will be two consecutive messages like below in the logs for subscriptions that have the two_pc option enabled, which looks a bit odd.

LOG: logical replication apply worker for subscription "mysub" has started
LOG: logical replication apply worker for subscription "mysub" 2PC is ENABLED.

--
With Regards,
Amit Kapila.
Attachment
Please find attached the latest patch set v61*

Differences from v60* are:
* Rebased to HEAD @ today
* Addresses the following feedback issues:

----
Vignesh 12/Mar - https://www.postgresql.org/message-id/CALDaNm1p%3DKYcDc1s_Q0Lk2P8UYU-z4acW066gaeLfXvW_O-kBA%40mail.gmail.com
(61) Skipped. twophase_given could be a local variable.

----
Vignesh 16/Mar - https://www.postgresql.org/message-id/CALDaNm0qTRapggmUY_kgwNd14cec0i8mS5_PnrMcs_Y-_TXrgA%40mail.gmail.com
(68) Fixed. Removed obsolete psf typedefs from typedefs.h.
(69) Done. Updated comment wording.
(70) Fixed. Removed references to psf in comments. Restored the Assert to how it was before.
(71) Duplicate. See (73).
(72) Duplicate. See (86).

----
Amit 16/Mar - https://www.postgresql.org/message-id/CAA4eK1Kwah%2BMimFMR3jPY5cSqpGFVh5zfV2g4%3DgTphaPsacoLw%40mail.gmail.com
(73) Done. Removed comments referring to obsolete psf.
(76) Done. Removed whitespace changes unrelated to this patch set.
(77) Done. Updated comment of Alter Subscription ... REFRESH.
(84) Done. Removed the extra function AnyTablesyncsNotREADY.
(85) Done. Renamed the function UpdateTwoPhaseTriState.
(86) Fixed. Removed debugging code from the main patch.
(88) Done. Removed the unused table_states_all List.
(90) Fixed. Changed the log message to say "two_phase" instead of "2PC".

----
Vignesh 16/Mar - https://www.postgresql.org/message-id/CALDaNm11A5wL0E-GDtqWY00iFzgUPsPLfA%2BL0zi4SEokEVtoFQ%40mail.gmail.com
(92) Fixed. Replaced cache failure Assert with ERROR.
(93) Skipped. Suggested to remove the global variable for table_states_not_ready.

----
Amit 17/Mar - https://www.postgresql.org/message-id/CAA4eK1LNLA20ci3_qqNQv7BYRTy3HqiAsOfuieqo6tJ2GeYuJw%40mail.gmail.com
(95) Done. Renamed the pg_subscription column. New state values d/p/e. Updated PG docs.
(98) Done. Renamed the constant LOGICALREP_PROTO_2PC_VERSION_NUM.
(99) Fixed. Applied new (supplied) comments atop worker.c.

----
Vignesh 17/Mar
(100) Fixed. Applied patch (supplied) to fix a multiple-subscriber bug.

-----
Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
On Thu, Mar 18, 2021 at 5:20 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Please find attached the latest patch set v61* > Oops. Attaching the correct v61* patches this time... --- Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Thu, Mar 18, 2021 at 5:30 PM Peter Smith <smithpb2250@gmail.com> wrote:
> On Thu, Mar 18, 2021 at 5:20 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Please find attached the latest patch set v61*
> >

Please find attached the latest patch set v62
Differences from v61 are:
* Rebased to HEAD
* Addresses the following feedback issues:
Vignesh 12/Mar -
https://www.postgresql.org/message-id/CALDaNm1p%3DKYcDc1s_Q0Lk2P8UYU-z4acW066gaeLfXvW_O-kBA%40mail.gmail.com
(62) Fixed. Added assert for twophase alter check in maybe_reread_subscription(void)
(63) Fixed. Changed parse_output_parameters to disable two-phase and streaming combo
Amit 16 Mar - https://www.postgresql.org/message-id/CAA4eK1Kwah%2BMimFMR3jPY5cSqpGFVh5zfV2g4%3DgTphaPsacoLw%40mail.gmail.com
(74) Fixed. Modify comment about why not supporting combination of two-phase and streaming
(75) Fixed. Added more comments about creating slot with two-phase race conditions
(78) Skipped. Adding assert for two-phase variables getting reset, the logic has been changed, so skipping this.
(79) Changed. Reworded the comment about allowing decoding of prepared transaction (restoring iff)
(80) Fixed. Added & in the assignment for ctx->twophase, logic is also changed
(81) Fixed. Changed to conditional setting of two_phase_at only if two_phase is enabled.
(82) Fixed. Better explanation for the two_phase_at variable in snapbuild.c.
(83) Skipped. The comparison in ReorderBufferFinishPrepared was not changed; it was tested and it works.
The reason it works is that even if the Prepare is filtered out when two-phase is not enabled, once the tablesync is
over and the tables are in READY state, the apply worker and the walsender restart, and after the restart the prepare will
not be filtered out, but will be marked as a skipped prepare and also updated in ReorderBufferRememberPrepareInfo.
(87) Fixed. Added server version check before two-phase enabled startstream in ApplyWorkerMain.
(91) Fixed. Removed unused macros in reorderbuffer.h.
Amit 17/Mar - https://www.postgresql.org/message-id/CAA4eK1LNLA20ci3_qqNQv7BYRTy3HqiAsOfuieqo6tJ2GeYuJw%40mail.gmail.com
(96) Fixed. Removed the token for twophase in Start Replication Slot; instead used the twophase option. But kept the token
in Create Replication Slot, as we gave plugins the option to enable two-phase while creating a slot. This allows plugins without a table-synchronization phase
to handle two-phase from the start.
regards,
Ajin Cherian
Fujitsu Australia
Attachment
Missed the patch - 0001, resending.
On Thu, Mar 18, 2021 at 10:58 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Please find attached the latest patch set v62
>

regards,
Ajin Cherian
Fujitsu Australia
Attachment
Hi On Saturday, March 13, 2021 5:01 PM osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote: > On Friday, March 12, 2021 5:40 PM Peter Smith <smithpb2250@gmail.com> > > Please find attached the latest patch set v58* > Thank you for updating those. I'm testing the patchset and I think it's > preferable that you add simple two types of more tests in 020_twophase.pl > because those aren't checked by v58. > > (1) execute single PREPARE TRANSACTION > which affects several tables (connected to corresponding > publications) > at the same time and confirm they are synced correctly. > > (2) execute single PREPARE TRANSACTION which affects multiple > subscribers > and confirm they are synced correctly. > This doesn't mean cascading standbys like > 022_twophase_cascade.pl. > Imagine that there is one publisher and two subscribers to it. Attached a patch for those two tests. The patch works with v62. I tested this in a loop more than 100 times and showed no failure. Best Regards, Takamichi Osumi
Attachment
On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Missed the patch - 0001, resending.
>

I have made miscellaneous changes in the patch, which include improving comments, error messages, and miscellaneous coding improvements. The most notable one is that we don't need an additional parameter in walrcv_startstreaming if the two_phase option is set properly. My changes are in v63-0002-Misc-changes-by-Amit; if you are fine with those, then please merge them in the next version. I have omitted the dev-logs patch but feel free to submit it. I have one question:

@@ -538,10 +550,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
..
+ /* Set two_phase_at LSN only if it hasn't already been set. */
+ if (ctx->twophase && !MyReplicationSlot->data.two_phase_at)
+ {
+ MyReplicationSlot->data.two_phase_at = start_lsn;
+ slot->data.two_phase = true;
+ ReplicationSlotMarkDirty();
+ ReplicationSlotSave();
+ SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+ }

What if the walsender or apply worker restarts after setting two_phase_at/two_phase here and updating the two_phase state in pg_subscription? Won't we need to set SnapBuildSetTwoPhaseAt after restart as well? If so, we probably need an else if (ctx->twophase) { Assert(slot->data.two_phase_at); SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, slot->data.two_phase_at); }. Am I missing something?

--
With Regards,
Amit Kapila.
Attachment
On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Missed the patch - 0001, resending.
>
@@ -538,10 +550,21 @@ CreateDecodingContext(XLogRecPtr start_lsn,
..
+ /* Set two_phase_at LSN only if it hasn't already been set. */
+ if (ctx->twophase && !MyReplicationSlot->data.two_phase_at)
+ {
+ MyReplicationSlot->data.two_phase_at = start_lsn;
+ slot->data.two_phase = true;
+ ReplicationSlotMarkDirty();
+ ReplicationSlotSave();
+ SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
+ }
What if the walsender or apply worker restarts after setting
two_phase_at/two_phase here and updating the two_phase state in
pg_subscription? Won't we need to set SnapBuildSetTwoPhaseAt after
restart as well?
After a restart, two_phase_at will be set by calling AllocateSnapshotBuilder with two_phase_at
@@ -207,7 +207,7 @@ StartupDecodingContext(List *output_plugin_options,
ctx->reorder = ReorderBufferAllocate();
ctx->snapshot_builder =
AllocateSnapshotBuilder(ctx->reorder, xmin_horizon, start_lsn,
- need_full_snapshot, slot->data.initial_consistent_point);
+ need_full_snapshot, slot->data.two_phase_at);
and then in AllocateSnapshotBuilder:
@@ -309,7 +306,7 @@ AllocateSnapshotBuilder(ReorderBuffer *reorder,
builder->initial_xmin_horizon = xmin_horizon;
builder->start_decoding_at = start_lsn;
builder->building_full_snapshot = need_full_snapshot;
- builder->initial_consistent_point = initial_consistent_point;
+ builder->two_phase_at = two_phase_at;
regards,
Ajin Cherian
Fujitsu Australia
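A condensed, illustrative C model (not the actual PostgreSQL functions) of the two paths being discussed: on the first decoding start, two_phase_at is stamped onto the slot and pushed into the snapshot builder; after a restart the builder is re-seeded from the persisted slot value, so no extra SnapBuildSetTwoPhaseAt call is needed:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;      /* stand-in for PostgreSQL's LSN type */

typedef struct
{
    bool       two_phase;
    XLogRecPtr two_phase_at;      /* persisted with the slot */
} SlotData;

typedef struct
{
    XLogRecPtr two_phase_at;      /* snapshot builder's private copy */
} SnapBuild;

/* Models CreateDecodingContext(): stamp two_phase_at only the first time. */
static void
create_decoding_context(SlotData *slot, SnapBuild *builder,
                        bool ctx_twophase, XLogRecPtr start_lsn)
{
    if (ctx_twophase && slot->two_phase_at == 0)
    {
        slot->two_phase_at = start_lsn;    /* persisted: MarkDirty + Save */
        slot->two_phase = true;
        builder->two_phase_at = start_lsn; /* SnapBuildSetTwoPhaseAt */
    }
}

/* Models AllocateSnapshotBuilder(): after a restart the builder is
 * re-seeded from the slot, which is the answer given above. */
static SnapBuild
allocate_snapshot_builder(const SlotData *slot)
{
    SnapBuild b = { slot->two_phase_at };
    return b;
}

int
main(void)
{
    SlotData  slot = { false, 0 };
    SnapBuild builder = { 0 };

    create_decoding_context(&slot, &builder, true, 1000);
    printf("first start: builder two_phase_at = %llu\n",
           (unsigned long long) builder.two_phase_at);

    /* simulate a walsender restart */
    SnapBuild restarted = allocate_snapshot_builder(&slot);
    printf("after restart: builder two_phase_at = %llu\n",
           (unsigned long long) restarted.two_phase_at);
    return 0;
}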
Please find attached the latest patch set v64*

Differences from v62* are:
* Rebased to HEAD @ yesterday 19/Mar.
* Addresses the following feedback issues:

----
From Osumi-san 19/Mar - https://www.postgresql.org/message-id/OSBPR01MB4888930C23E17AF29EDB9D82ED689%40OSBPR01MB4888.jpnprd01.prod.outlook.com
(64) Done. New tests added. Supplied patch by Osumi-san.
(65) Done. New tests added. Supplied patch by Osumi-san.

----
From Amit 16/Mar - https://www.postgresql.org/message-id/CAA4eK1Kwah%2BMimFMR3jPY5cSqpGFVh5zfV2g4%3DgTphaPsacoLw%40mail.gmail.com
(89) Done. Added more comments explaining the AllTablesReady() implementation.

----
From Peter 17/Mar (internal)
(94) Done. Improved comment for the two_phase option parsing code.

----
From Amit 17/Mar - https://www.postgresql.org/message-id/CAA4eK1LNLA20ci3_qqNQv7BYRTy3HqiAsOfuieqo6tJ2GeYuJw%40mail.gmail.com
(97) Done. Improved comment for the two_phase option parsing code.

----
From Amit 18/Mar - https://www.postgresql.org/message-id/CAA4eK1J9A_9hsxE6m_1c6CsrMsBeeaRbaLX2P16ucJrpN25-EQ%40mail.gmail.com
(101) Done. Improved comment for worker.c. Applied supplied patch from Amit. No equivalent text was put in the PG docs at this time because we are still awaiting responses on the other thread [1] that might impact what we may want to write. Please raise a new feedback comment if/when you decide the PG docs should be updated.
(102) Fixed. Use a different log level for the subscription starting message.

----
From Amit 19/Mar (internal)
(104) Done. Renamed function AllTablesyncsREADY to AllTablesyncsReady.

----
From Amit 19/Mar - https://www.postgresql.org/message-id/CAA4eK1JLz7ypPdbkPjHQW5c9vOZO5onOwb%2BfSLsArHQjg6dNhQ%40mail.gmail.com
(105) Done. Miscellaneous fixes. Applied supplied patch from Amit.

-----
[1] https://www.postgresql.org/message-id/CALDaNm06R_ppr5ibwS1-FLDKGqUjHr-1VPdk-yJWU1TP_zLLig%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
On Sat, Mar 20, 2021 at 7:07 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian <itsajin@gmail.com> wrote: >> > >> > Missed the patch - 0001, resending. >> > >> >> >> @@ -538,10 +550,21 @@ CreateDecodingContext(XLogRecPtr start_lsn, >> .. >> + /* Set two_phase_at LSN only if it hasn't already been set. */ >> + if (ctx->twophase && !MyReplicationSlot->data.two_phase_at) >> + { >> + MyReplicationSlot->data.two_phase_at = start_lsn; >> + slot->data.two_phase = true; >> + ReplicationSlotMarkDirty(); >> + ReplicationSlotSave(); >> + SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn); >> + } >> >> What if the walsender or apply worker restarts after setting >> two_phase_at/two_phase here and updating the two_phase state in >> pg_subscription? Won't we need to set SnapBuildSetTwoPhaseAt after >> restart as well? > > > After a restart, two_phase_at will be set by calling AllocateSnapshotBuilder with two_phase_at > Okay, that makes sense. -- With Regards, Amit Kapila.
On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Missed the patch - 0001, resending.
>
I have made miscellaneous changes in the patch which includes
improving comments, error messages, and miscellaneous coding
improvements. The most notable one is that we don't need an additional
parameter in walrcv_startstreaming, if the two_phase option is set
properly. My changes are in v63-0002-Misc-changes-by-Amit, if you are
fine with those, then please merge them in the next version. I have
omitted the dev-logs patch but feel free to submit it. I have one
question:
I am fine with these changes. I see that Peter has already merged in these changes.
thanks,
Ajin Cherian
Fujitsu Australia
On Sat, Mar 20, 2021 at 10:09 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian <itsajin@gmail.com> wrote: >> > >> > Missed the patch - 0001, resending. >> > >> >> I have made miscellaneous changes in the patch which includes >> improving comments, error messages, and miscellaneous coding >> improvements. The most notable one is that we don't need an additional >> parameter in walrcv_startstreaming, if the two_phase option is set >> properly. My changes are in v63-0002-Misc-changes-by-Amit, if you are >> fine with those, then please merge them in the next version. I have >> omitted the dev-logs patch but feel free to submit it. I have one >> question: >> > > I am fine with these changes. I see that Peter has already merged in these changes. > I have further updated the patch to implement unique GID on the subscriber-side as discussed in the nearby thread [1]. That requires some changes in the test. Additionally, I have updated some comments and docs. Let me know what do you think about the changes? [1] - https://www.postgresql.org/message-id/CAA4eK1+opiV4aFTmWWUF9h_32=HfPOW9vZASHarT0UA5oBrtGw@mail.gmail.com -- With Regards, Amit Kapila.
Attachment
Hello On Sunday, March 21, 2021 4:37 PM Amit Kapila <amit.kapila16@gmail.com> > On Sat, Mar 20, 2021 at 10:09 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Sat, Mar 20, 2021 at 1:35 AM Amit Kapila <amit.kapila16@gmail.com> > wrote: > >> > >> On Fri, Mar 19, 2021 at 5:03 AM Ajin Cherian <itsajin@gmail.com> wrote: > >> > > >> > Missed the patch - 0001, resending. > >> > > >> > >> I have made miscellaneous changes in the patch which includes > >> improving comments, error messages, and miscellaneous coding > >> improvements. The most notable one is that we don't need an > >> additional parameter in walrcv_startstreaming, if the two_phase > >> option is set properly. My changes are in > >> v63-0002-Misc-changes-by-Amit, if you are fine with those, then > >> please merge them in the next version. I have omitted the dev-logs > >> patch but feel free to submit it. I have one > >> question: > >> > > > > I am fine with these changes. I see that Peter has already merged in these > changes. > > > > I have further updated the patch to implement unique GID on the > subscriber-side as discussed in the nearby thread [1]. That requires some > changes in the test. Thank you for your update. v65 didn't make any failure during make check-world. I've written additional tests for alter subscription using refresh for enabled subscription and two_phase = on. I wrote those as TAP tests because refresh requires enabled subscription and to get a subscription enabled, we need to set connect true as well. TAP tests are for having connection between sub and pub, and tests in subscription.sql are aligned with connect=false. Just in case, I ran 020_twophase.pl with this patch 100 times, based on v65 as well and didn't cause any failure. Please have a look at the attached patch. Best Regards, Takamichi Osumi
Attachment
On Sun, Mar 21, 2021 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have further updated the patch to implement unique GID on the > subscriber-side as discussed in the nearby thread [1]. That requires > some changes in the test. Additionally, I have updated some comments > and docs. Let me know what do you think about the changes? > Hi Amit. PSA a small collection of feedback patches you can apply on top of the patch v65-0001 if you decide they are OK. (There are all I have found after a first pass over all the recent changes). ------ Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Mon, Mar 22, 2021 at 6:27 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Sun, Mar 21, 2021 at 6:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I have further updated the patch to implement unique GID on the > > subscriber-side as discussed in the nearby thread [1]. That requires > > some changes in the test. Additionally, I have updated some comments > > and docs. Let me know what do you think about the changes? > > > > Hi Amit. > > PSA a small collection of feedback patches you can apply on top of the > patch v65-0001 if you decide they are OK. > > (There are all I have found after a first pass over all the recent changes). > I have spell-checked the content of v65-0001. PSA a couple more feedback patches to apply on top of v65-0001 if you decide they are ok. ---- Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Sunday, March 21, 2021 4:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have further updated the patch to implement unique GID on the
> subscriber-side as discussed in the nearby thread [1].

I did some tests (cross-version & synchronous) on the latest patch set v65*, and all tests passed. Here is the detail; please take it as a reference.

 Case | version of publisher | version of subscriber | two_phase option | synchronous | expect result  | result
------+----------------------+-----------------------+------------------+-------------+----------------+--------
   1  | 13                   | 14(patched)           | on               | no          | same as case3  | ok
   2  | 13                   | 14(patched)           | off              | no          | same as case3  | ok
   3  | 13                   | 14(unpatched)         | not support      | no          | -              | -
   4  | 14(patched)          | 13                    | not support      | no          | same as case5  | ok
   5  | 14(unpatched)        | 13                    | not support      | no          | -              | -
   6  | 13                   | 14(patched)           | on               | yes         | same as case8  | ok
   7  | 13                   | 14(patched)           | off              | yes         | same as case8  | ok
   8  | 13                   | 14(unpatched)         | not support      | yes         | -              | -
   9  | 14(patched)          | 13                    | not support      | yes         | same as case10 | ok
  10  | 14(unpatched)        | 13                    | not support      | yes         | -              | -

remark:
(1) cases 3, 5, 8, 10 are tested just for reference
(2) SQL executed in each case:
    scenario1: begin…commit
    scenario2: begin…prepare…commit

Regards,
Tang
On Mon, Mar 22, 2021 at 2:41 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Mon, Mar 22, 2021 at 6:27 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > Hi Amit. > > > > PSA a small collection of feedback patches you can apply on top of the > > patch v65-0001 if you decide they are OK. > > > > (There are all I have found after a first pass over all the recent changes). > > > > I have spell-checked the content of v65-0001. > > PSA a couple more feedback patches to apply on top of v65-0001 if you > decide they are ok. > I have incorporated all your changes and additionally made few more changes (a) got rid of LogicalRepBeginPrepareData and instead used LogicalRepPreparedTxnData, (b) made a number of changes in comments and docs, (c) ran pgindent, (d) modified tests to use standard wait_for_catch function and removed few tests to reduce the time and to keep regression tests reliable. -- With Regards, Amit Kapila.
Attachment
On Mon, Mar 22, 2021 at 11:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have incorporated all your changes and additionally made few more > changes (a) got rid of LogicalRepBeginPrepareData and instead used > LogicalRepPreparedTxnData, (b) made a number of changes in comments > and docs, (c) ran pgindent, (d) modified tests to use standard > wait_for_catch function and removed few tests to reduce the time and > to keep regression tests reliable. I checked all v65* / v66* differences and found only two trivial comment typos. PSA patches to fix those. ---- Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Tue, Mar 23, 2021 at 10:44 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Mon, Mar 22, 2021 at 11:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I have incorporated all your changes and additionally made few more > > changes (a) got rid of LogicalRepBeginPrepareData and instead used > > LogicalRepPreparedTxnData, (b) made a number of changes in comments > > and docs, (c) ran pgindent, (d) modified tests to use standard > > wait_for_catch function and removed few tests to reduce the time and > > to keep regression tests reliable. > > I checked all v65* / v66* differences and found only two trivial comment typos. > > PSA patches to fix those. > Hi Amit. PSA a patch to allow the ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work when two-phase tristate is PENDING. This is necessary for the pg_dump/pg_restore scenario, or for any other use-case where the subscription might start off having no tables. Please apply this on top of your v66-0001 (after applying the other Feedback patches I posted earlier today). ------ Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Tue, Mar 23, 2021 at 9:01 PM Peter Smith <smithpb2250@gmail.com> wrote:
Please apply this on top of your v66-0001 (after applying the other
Feedback patches I posted earlier today).
Applied all the above patches and did a 5-cascade test setup, and all the instances synced correctly. Test log attached.
regards,
Ajin Cherian
Fujitsu Australia
Attachment
On Tue, Mar 23, 2021 at 9:49 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Tue, Mar 23, 2021 at 9:01 PM Peter Smith <smithpb2250@gmail.com> wrote: >> >> >> >> Please apply this on top of your v66-0001 (after applying the other >> Feedback patches I posted earlier today). > > > Applied all the above patches and did a 5 cascade test set up and all the instances synced correctly. Test log attached. > FYI - Using the same v66* patch set (including yesterday's additional patches) I have run the subscription TAP tests 020 and 021 in a loop x 150. All passed ok. PSA the results file as evidence. ------ Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Tue, Mar 23, 2021 at 9:01 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Tue, Mar 23, 2021 at 10:44 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Mon, Mar 22, 2021 at 11:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > I have incorporated all your changes and additionally made few more > > > changes (a) got rid of LogicalRepBeginPrepareData and instead used > > > LogicalRepPreparedTxnData, (b) made a number of changes in comments > > > and docs, (c) ran pgindent, (d) modified tests to use standard > > > wait_for_catch function and removed few tests to reduce the time and > > > to keep regression tests reliable. > > > > I checked all v65* / v66* differences and found only two trivial comment typos. > > > > PSA patches to fix those. > > > > Hi Amit. > > PSA a patch to allow the ALTER SUBSCRIPTION ... REFRESH PUBLICATION to > work when two-phase tristate is PENDING. > > This is necessary for the pg_dump/pg_restore scenario, or for any > other use-case where the subscription might > start off having no tables. > > Please apply this on top of your v66-0001 (after applying the other > Feedback patches I posted earlier today). > PSA a small addition to the 66-0003 "Fix to allow REFRESH PUBLICATION" patch posted yesterday. This just updates the worker.c comment. ------ Kind Regards, Peter Smith. Fujitsu Australia.
Attachment
> I have incorporated all your changes and additionally made few more changes > (a) got rid of LogicalRepBeginPrepareData and instead used > LogicalRepPreparedTxnData, (b) made a number of changes in comments and > docs, (c) ran pgindent, (d) modified tests to use standard wait_for_catch > function and removed few tests to reduce the time and to keep regression > tests reliable. Hi, When reading the code, I found some comments related to the patch here. * XXX Now, this can even lead to a deadlock if the prepare * transaction is waiting to get it logically replicated for * distributed 2PC. Currently, we don't have an in-core * implementation of prepares for distributed 2PC but some * out-of-core logical replication solution can have such an * implementation. They need to inform users to not have locks * on catalog tables in such transactions. */ Since we will have in-core implementation of prepares, should we update the comments here ? Best regards, houzj
On Tue, Mar 23, 2021 at 3:31 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Tue, Mar 23, 2021 at 10:44 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > PSA patches to fix those. > > > > Hi Amit. > > PSA a patch to allow the ALTER SUBSCRIPTION ... REFRESH PUBLICATION to > work when two-phase tristate is PENDING. > > This is necessary for the pg_dump/pg_restore scenario, or for any > other use-case where the subscription might > start off having no tables. > + subrels = GetSubscriptionRelations(MySubscription->oid); + + /* + * If there are no tables then leave the state as PENDING, which + * allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work. + */ + become_two_phase_enabled = list_length(subrels) > 0; This code is similar at both the places it is used. Isn't it better to move this inside AllTablesyncsReady and if required then we can change the name of the function. -- With Regards, Amit Kapila.
On Wed, Mar 24, 2021 at 11:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Mar 23, 2021 at 3:31 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Tue, Mar 23, 2021 at 10:44 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > PSA patches to fix those. > > > > > > > Hi Amit. > > > > PSA a patch to allow the ALTER SUBSCRIPTION ... REFRESH PUBLICATION to > > work when two-phase tristate is PENDING. > > > > This is necessary for the pg_dump/pg_restore scenario, or for any > > other use-case where the subscription might > > start off having no tables. > > > > + subrels = GetSubscriptionRelations(MySubscription->oid); > + > + /* > + * If there are no tables then leave the state as PENDING, which > + * allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work. > + */ > + become_two_phase_enabled = list_length(subrels) > 0; > > This code is similar at both the places it is used. Isn't it better to > move this inside AllTablesyncsReady and if required then we can change > the name of the function. I agree. That way is better. PSA a patch which changes the AllTableSyncsReady function to now include the zero tables check. (This patch is to be applied on top of all previous patches) ------ Kind Regards, Peter Smith. Fujitsu Australia.
Attachment
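For illustration, a standalone C sketch (assumed shape, not the committed function) of folding the zero-tables check into AllTablesyncsReady(), as discussed above: with no subscribed tables the function reports "not ready", so the two-phase state stays PENDING and ALTER SUBSCRIPTION ... REFRESH PUBLICATION keeps working:

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for state normally read from pg_subscription_rel. */
static int ntables = 0;     /* tables known to the subscription */
static int nnot_ready = 0;  /* of those, how many are not yet READY */

/*
 * Sketch: two-phase may only flip from PENDING to ENABLED when the
 * subscription has at least one table and every table is READY.
 */
static bool
AllTablesyncsReady(void)
{
    return ntables > 0 && nnot_ready == 0;
}

int
main(void)
{
    printf("no tables      -> %s\n", AllTablesyncsReady() ? "ready" : "stay PENDING");

    ntables = 2; nnot_ready = 1;
    printf("1 of 2 syncing -> %s\n", AllTablesyncsReady() ? "ready" : "stay PENDING");

    nnot_ready = 0;
    printf("all synced     -> %s\n", AllTablesyncsReady() ? "ready" : "stay PENDING");
    return 0;
}

Centralizing the check means both call sites behave identically, which is the point of moving it inside the function.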
On Thu, Mar 25, 2021 at 1:40 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Wed, Mar 24, 2021 at 11:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Mar 23, 2021 at 3:31 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > On Tue, Mar 23, 2021 at 10:44 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > > > PSA patches to fix those. > > > > > > > > > > Hi Amit. > > > > > > PSA a patch to allow the ALTER SUBSCRIPTION ... REFRESH PUBLICATION to > > > work when two-phase tristate is PENDING. > > > > > > This is necessary for the pg_dump/pg_restore scenario, or for any > > > other use-case where the subscription might > > > start off having no tables. > > > > > > > + subrels = GetSubscriptionRelations(MySubscription->oid); > > + > > + /* > > + * If there are no tables then leave the state as PENDING, which > > + * allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work. > > + */ > > + become_two_phase_enabled = list_length(subrels) > 0; > > > > This code is similar at both the places it is used. Isn't it better to > > move this inside AllTablesyncsReady and if required then we can change > > the name of the function. > > I agree. That way is better. > > PSA a patch which changes the AllTableSyncsReady function to now > include the zero tables check. > > (This patch is to be applied on top of all previous patches) > > ------ PSA a patch which modifies the FetchTableStates function to use a more efficient way of testing if the subscription has any tables or not. (This patch is to be applied on top of all previous v66* patches posted) ------ Kind Regards, Peter Smith. Fujitsu Australia.
Attachment
On Thu, Mar 25, 2021 at 12:39 PM Peter Smith <smithpb2250@gmail.com> wrote: > > PSA a patch which modifies the FetchTableStates function to use a more > efficient way of testing if the subscription has any tables or not. > > (This patch is to be applied on top of all previous v66* patches posted) > I have incorporated all your incremental patches and fixed comments raised by Hou-San in the attached patch. -- With Regards, Amit Kapila.
Attachment
On Wed, Mar 24, 2021 at 3:59 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > > I have incorporated all your changes and additionally made few more changes > > (a) got rid of LogicalRepBeginPrepareData and instead used > > LogicalRepPreparedTxnData, (b) made a number of changes in comments and > > docs, (c) ran pgindent, (d) modified tests to use standard wait_for_catch > > function and removed few tests to reduce the time and to keep regression > > tests reliable. > > Hi, > > When reading the code, I found some comments related to the patch here. > > * XXX Now, this can even lead to a deadlock if the prepare > * transaction is waiting to get it logically replicated for > * distributed 2PC. Currently, we don't have an in-core > * implementation of prepares for distributed 2PC but some > * out-of-core logical replication solution can have such an > * implementation. They need to inform users to not have locks > * on catalog tables in such transactions. > */ > > Since we will have in-core implementation of prepares, should we update the comments here ? > Fixed this in the latest patch posted by me. I have additionally updated the docs to reflect the same. -- With Regards, Amit Kapila.
On Sun, Mar 21, 2021 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have further updated the patch to implement unique GID on the
> subscriber-side as discussed in the nearby thread [1]. That requires
> some changes in the test. Additionally, I have updated some comments
> and docs. Let me know what do you think about the changes?
>

+static void
+TwoPhaseTransactionGid(RepOriginId originid, TransactionId xid,
+ char *gid, int szgid)
+{
+ /* Origin and Transaction ids must be valid */
+ Assert(originid != InvalidRepOriginId);
+ Assert(TransactionIdIsValid(xid));
+
+ snprintf(gid, szgid, "pg_%u_%u", originid, xid);
+}

I found one issue in the current mechanism that we use to generate the GIDs. In one of the scenarios it will generate the same GIDs; the steps for it are given below:
----
setup 2 publishers and one subscriber with synchronous_standby_names
prepare txn 't1' on publisher1 (this prepared txn is prepared as pg_1_542 on the subscriber)
drop the subscription to publisher1
create a subscription to publisher2 (we have changed the subscription to subscribe to publisher2, which was earlier subscribing to publisher1)
prepare txn 't2' on publisher2 (this prepared txn also uses pg_1_542 on the subscriber even though the user has given a different gid)

This prepared txn keeps waiting to complete on the subscriber but never completes. Here the user uses a different gid for each prepared transaction, but they end up using the same gid at the subscriber. The subscriber keeps failing with:

2021-03-22 10:14:57.859 IST [73959] ERROR: transaction identifier "pg_1_542" is already in use
2021-03-22 10:14:57.860 IST [73868] LOG: background worker "logical replication worker" (PID 73959) exited with exit code 1

The attached file has the steps for it. This might be a rare scenario and may or may not be a user scenario. Should we handle it?

Regards,
Vignesh
Attachment
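For illustration, a standalone C reproduction of the collision arithmetic: with the pg_%u_%u format quoted above, a reused origin id plus an equal remote xid produce byte-identical GIDs, which is exactly the "already in use" failure shown in the log. The type and helper names are simplified stand-ins:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef uint32_t RepOriginId;
typedef uint32_t TransactionId;

/* Same format as the quoted TwoPhaseTransactionGid(). */
static void
two_phase_gid(RepOriginId originid, TransactionId xid, char *gid, int szgid)
{
    snprintf(gid, szgid, "pg_%u_%u", originid, xid);
}

int
main(void)
{
    char gid1[32];
    char gid2[32];

    /* publisher1's 't1': origin id 1, remote xid 542 */
    two_phase_gid(1, 542, gid1, sizeof(gid1));

    /* after drop/re-create, publisher2's txn reuses origin id 1 and
     * happens to carry the same remote xid */
    two_phase_gid(1, 542, gid2, sizeof(gid2));

    printf("%s vs %s -> %s\n", gid1, gid2,
           strcmp(gid1, gid2) == 0 ? "collision" : "distinct");
    return 0;
}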
Please find attached the latest patch set v68*

Differences from v67* are:

* Rebased to HEAD @ today.

* v68 fixes an issue reported by Vignesh [1] where a scenario was found which still was able to cause a generated GID clash. Using Vignesh's test script I could reproduce the problem exactly as described. The fix makes the GID unique by including the subid. Now the same script runs to normal completion and produces good/expected output:

 transaction |       gid        |           prepared            |  owner   | database
-------------+------------------+-------------------------------+----------+----------
         547 | pg_gid_16389_543 | 2021-03-30 10:32:36.87207+11  | postgres | postgres
         555 | pg_gid_16390_543 | 2021-03-30 10:32:48.087771+11 | postgres | postgres
(2 rows)

----
[1] https://www.postgresql.org/message-id/CALDaNm2ZnJeG23bE%2BgEOQEmXo8N%2Bfs2g4%3DxuH2u6nNcX0s9Jjg%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
On Tue, Mar 30, 2021 at 5:34 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Please find attached the latest patch set v68* > > Differences from v67* are: > > * Rebased to HEAD @ today. > > * v68 fixes an issue reported by Vignesh [1] where a scenario was > found which still was able to cause a generated GID clash. Using > Vignesh's test script I could reproduce the problem exactly as > described. The fix makes the GID unique by including the subid. Now > the same script runs to normal completion and produces good/expected > output: > > transaction | gid | prepared | > owner | database > -------------+------------------+-------------------------------+----------+---------- > 547 | pg_gid_16389_543 | 2021-03-30 10:32:36.87207+11 | > postgres | postgres > 555 | pg_gid_16390_543 | 2021-03-30 10:32:48.087771+11 | > postgres | postgres > (2 rows) > Thanks for the patch with the fix, the fix solves the issue reported. Regards, Vignesh
On Tue, Mar 30, 2021 at 5:34 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please find attached the latest patch set v68*
>

I think this patch is in much better shape than it was a few versions earlier, but I feel some more work and testing is still required. We can try to make it work with the streaming option and do something about empty prepare transactions to reduce the need for users to set a much higher value for max_prepared_xacts on subscribers. So, I propose to move it to the next CF. What do you think?

--
With Regards,
Amit Kapila.
On Thu, Apr 1, 2021 at 2:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Mar 30, 2021 at 5:34 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please find attached the latest patch set v68*
>
I think this patch is in much better shape than it was a few versions
earlier, but I feel some more work and testing is still required. We
can try to make it work with the streaming option and do something
about empty prepare transactions to reduce the need for users to set a
much higher value for max_prepared_xacts on subscribers. So, I propose
to move it to the next CF, what do you think?
I agree.
regards,
Ajin Cherian
Fujitsu Australia
On Thu, Apr 1, 2021 at 8:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Mar 30, 2021 at 5:34 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Please find attached the latest patch set v68* > > > > I think this patch is in much better shape than it was few versions > earlier but I feel still some more work and testing is required. We > can try to make it work with the streaming option and do something > about empty prepare transactions to reduce the need for users to set a > much higher value for max_prepared_xacts on subscribers. So, I propose > to move it to the next CF, what do you think? +1 for moving it to the next PG version. Regards, Vignesh
On Thu, Apr 1, 2021 at 4:58 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Thu, Apr 1, 2021 at 2:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Tue, Mar 30, 2021 at 5:34 AM Peter Smith <smithpb2250@gmail.com> wrote: >> > >> > Please find attached the latest patch set v68* >> > >> >> I think this patch is in much better shape than it was few versions >> earlier but I feel still some more work and testing is required. We >> can try to make it work with the streaming option and do something >> about empty prepare transactions to reduce the need for users to set a >> much higher value for max_prepared_xacts on subscribers. So, I propose >> to move it to the next CF, what do you think? >> > > I agree. OK, done. Moved to next CF here: https://commitfest.postgresql.org/33/2914/ ------ Kind Regards, Peter Smith. Fujitsu Australia.
Please find attached the latest patch set v69*

Differences from v68* are:

* Rebased to HEAD @ yesterday.
  There were some impacts caused by recently pushed patches [1] [2]

* The stream/prepare functionality and tests have been restored to be the same as they were in v48 [3].
  Previously, this code had been removed back in v49 [4] due to incompatibilities with the (now obsolete) psf design.

* TAP tests are now co-located in the same patch as the code they are testing.

----
[1] https://github.com/postgres/postgres/commit/531737ddad214cb8a675953208e2f3a6b1be122b
[2] https://github.com/postgres/postgres/commit/ac4645c0157fc5fcef0af8ff571512aa284a2cec
[3] https://www.postgresql.org/message-id/CAHut%2BPsr8f1tUttndgnkK_%3Da7w%3Dhsomw16SEOn6U68jSBKL9SQ%40mail.gmail.com
[4] https://www.postgresql.org/message-id/CAFPTHDZduc2fDzqd_L4vPmA2R%2B-e8nEbau9HseHHi82w%3Dp-uvQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
Please find attached the latest patch set v70*

Differences from v69* are:

* Rebased to HEAD @ today.
  Unfortunately, the v69 patch was broken due to a recent push [1]

----
[1] https://github.com/postgres/postgres/commit/82ed7748b710e3ddce3f7ebc74af80fe4869492f

Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
Please find attached the latest patch set v71*

Differences from v70* are:

* Rebased to HEAD @ yesterday.

* Functionality of v71 is identical to v70, but the patch has been split into two parts:
  0001 - 2PC core patch
  0002 - adds 2PC support for "streaming" transactions

----

Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
On Mon, Dec 14, 2020 at 8:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> 2.
> + /*
> + * Flags are determined from the state of the transaction. We know we
> + * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
> + * it's already marked as committed then it has to be COMMIT PREPARED (and
> + * likewise for abort / ROLLBACK PREPARED).
> + */
> + if (rbtxn_commit_prepared(txn))
> + flags = LOGICALREP_IS_COMMIT_PREPARED;
> + else if (rbtxn_rollback_prepared(txn))
> + flags = LOGICALREP_IS_ROLLBACK_PREPARED;
> + else
> + flags = LOGICALREP_IS_PREPARE;
>
> I don't like clubbing three different operations under one message
> LOGICAL_REP_MSG_PREPARE. It looks awkward to use new flags
> RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED in ReordeBuffer so
> that we can recognize these operations in corresponding callbacks. I
> think setting any flag in ReorderBuffer should not dictate the
> behavior in callbacks. Then also there are few things that are not
> common to those APIs like the patch has an Assert to say that the txn
> is marked with prepare flag for all three operations which I think is
> not true for Rollback Prepared after the restart. We don't ensure to
> set the Prepare flag if the Rollback Prepare happens after the
> restart. Then, we have to introduce separate flags to distinguish
> prepare/commit prepared/rollback prepared to distinguish multiple
> operations sent as protocol messages. Also, all these operations are
> mutually exclusive so it will be better to send separate messages for
> each of these and I have changed it accordingly in the attached patch.
>

While looking at the two-phase protocol messages (with a view to
documenting them) I noticed that the messages for
LOGICAL_REP_MSG_PREPARE, LOGICAL_REP_MSG_COMMIT_PREPARED, and
LOGICAL_REP_MSG_ROLLBACK_PREPARED all send and receive a flags byte
which *always* has the value 0.

----------
e.g.
uint8 flags = 0;
pq_sendbyte(out, flags);

and

/* read flags */
uint8 flags = pq_getmsgbyte(in);
if (flags != 0)
    elog(ERROR, "unrecognized flags %u in commit prepare message", flags);
----------

I think this patch version v31 is where the flags became redundant.

Is there some reason why these unused flags still remain in the
protocol code? Do you have any objection to me removing them?
Otherwise, it might seem strange to document a flag that has no
function.

------
Kind Regards,
Peter Smith.
Fujitsu Australia
On Fri, Apr 9, 2021 at 12:33 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Mon, Dec 14, 2020 at 8:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 2.
> > + /*
> > + * Flags are determined from the state of the transaction. We know we
> > + * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
> > + * it's already marked as committed then it has to be COMMIT PREPARED (and
> > + * likewise for abort / ROLLBACK PREPARED).
> > + */
> > + if (rbtxn_commit_prepared(txn))
> > + flags = LOGICALREP_IS_COMMIT_PREPARED;
> > + else if (rbtxn_rollback_prepared(txn))
> > + flags = LOGICALREP_IS_ROLLBACK_PREPARED;
> > + else
> > + flags = LOGICALREP_IS_PREPARE;
> >
> > I don't like clubbing three different operations under one message
> > LOGICAL_REP_MSG_PREPARE. It looks awkward to use new flags
> > RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED in ReordeBuffer so
> > that we can recognize these operations in corresponding callbacks. I
> > think setting any flag in ReorderBuffer should not dictate the
> > behavior in callbacks. Then also there are few things that are not
> > common to those APIs like the patch has an Assert to say that the txn
> > is marked with prepare flag for all three operations which I think is
> > not true for Rollback Prepared after the restart. We don't ensure to
> > set the Prepare flag if the Rollback Prepare happens after the
> > restart. Then, we have to introduce separate flags to distinguish
> > prepare/commit prepared/rollback prepared to distinguish multiple
> > operations sent as protocol messages. Also, all these operations are
> > mutually exclusive so it will be better to send separate messages for
> > each of these and I have changed it accordingly in the attached patch.
> >
>
> While looking at the two-phase protocol messages (with a view to
> documenting them) I noticed that the messages for
> LOGICAL_REP_MSG_PREPARE, LOGICAL_REP_MSG_COMMIT_PREPARED, and
> LOGICAL_REP_MSG_ROLLBACK_PREPARED all send and receive a flags byte
> which *always* has the value 0.
>
> ----------
> e.g.
> uint8 flags = 0;
> pq_sendbyte(out, flags);
>
> and
> /* read flags */
> uint8 flags = pq_getmsgbyte(in);
> if (flags != 0)
>     elog(ERROR, "unrecognized flags %u in commit prepare message", flags);
> ----------
>
> I think this patch version v31 is where the flags became redundant.
>

I think this has been kept for future use, similar to what we have in
logicalrep_write_commit. So, I think we can keep them unused for now.
We can document it similarly to the commit message ('C') [1].

[1] - https://www.postgresql.org/docs/devel/protocol-logicalrep-message-formats.html

--
With Regards,
Amit Kapila.
On Fri, Apr 9, 2021 at 6:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Apr 9, 2021 at 12:33 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Mon, Dec 14, 2020 at 8:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 2.
> > > + /*
> > > + * Flags are determined from the state of the transaction. We know we
> > > + * always get PREPARE first and then [COMMIT|ROLLBACK] PREPARED, so if
> > > + * it's already marked as committed then it has to be COMMIT PREPARED (and
> > > + * likewise for abort / ROLLBACK PREPARED).
> > > + */
> > > + if (rbtxn_commit_prepared(txn))
> > > + flags = LOGICALREP_IS_COMMIT_PREPARED;
> > > + else if (rbtxn_rollback_prepared(txn))
> > > + flags = LOGICALREP_IS_ROLLBACK_PREPARED;
> > > + else
> > > + flags = LOGICALREP_IS_PREPARE;
> > >
> > > I don't like clubbing three different operations under one message
> > > LOGICAL_REP_MSG_PREPARE. It looks awkward to use new flags
> > > RBTXN_COMMIT_PREPARED and RBTXN_ROLLBACK_PREPARED in ReordeBuffer so
> > > that we can recognize these operations in corresponding callbacks. I
> > > think setting any flag in ReorderBuffer should not dictate the
> > > behavior in callbacks. Then also there are few things that are not
> > > common to those APIs like the patch has an Assert to say that the txn
> > > is marked with prepare flag for all three operations which I think is
> > > not true for Rollback Prepared after the restart. We don't ensure to
> > > set the Prepare flag if the Rollback Prepare happens after the
> > > restart. Then, we have to introduce separate flags to distinguish
> > > prepare/commit prepared/rollback prepared to distinguish multiple
> > > operations sent as protocol messages. Also, all these operations are
> > > mutually exclusive so it will be better to send separate messages for
> > > each of these and I have changed it accordingly in the attached patch.
> > >
> >
> > While looking at the two-phase protocol messages (with a view to
> > documenting them) I noticed that the messages for
> > LOGICAL_REP_MSG_PREPARE, LOGICAL_REP_MSG_COMMIT_PREPARED, and
> > LOGICAL_REP_MSG_ROLLBACK_PREPARED all send and receive a flags byte
> > which *always* has the value 0.
> >
> > ----------
> > e.g.
> > uint8 flags = 0;
> > pq_sendbyte(out, flags);
> >
> > and
> > /* read flags */
> > uint8 flags = pq_getmsgbyte(in);
> > if (flags != 0)
> >     elog(ERROR, "unrecognized flags %u in commit prepare message", flags);
> > ----------
> >
> > I think this patch version v31 is where the flags became redundant.
> >
>
> I think this has been kept for future use, similar to what we have in
> logicalrep_write_commit. So, I think we can keep them unused for now.
> We can document it similarly to the commit message ('C') [1].
>
> [1] - https://www.postgresql.org/docs/devel/protocol-logicalrep-message-formats.html
>

Yeah, we can do that. And if nobody else gives feedback about this
then I will do exactly as you suggested.

But I don't understand why we are even trying to "future proof" the
protocol by keeping redundant flags lying around on the off-chance
that maybe one day they could be useful. Isn't that what the protocol
version number is for? e.g. If there did become some future need for
some flags then just add them at that time and bump the protocol
version.

And, even if we wanted to, I think we cannot use these existing flags
in the future without bumping the protocol version, because the current
protocol docs say that the flag value must be zero!

------
Kind Regards,
Peter Smith.
Fujitsu Australia
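[Whichever way this is resolved, the pattern under discussion is small. For concreteness, a purely hypothetical helper (not in the patch; the function name is invented here) that would centralize the read-side validation quoted above, since all three messages carry an identical always-zero byte:]

----------
/*
 * Hypothetical sketch only: validate the reserved flags byte that the
 * PREPARE, COMMIT PREPARED and ROLLBACK PREPARED messages all carry.
 * Per the protocol docs the byte is currently unused and must be zero.
 */
static uint8
logicalrep_read_reserved_flags(StringInfo in, const char *msgname)
{
	uint8		flags = pq_getmsgbyte(in);

	if (flags != 0)
		elog(ERROR, "unrecognized flags %u in %s message", flags, msgname);

	return flags;
}
----------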
Please find attached the latest patch set v72*

Differences from v71* are:

* Rebased to HEAD @ yesterday.

* The replication protocol version requirement for two-phase message support is bumped to version 3

* Documentation of protocol messages has been updated for two-phase messages, similar to [1]

----
[1] https://github.com/postgres/postgres/commit/15c1a9d9cb7604472d4823f48b64cdc02c441194

Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
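[As a sketch of what the version-3 bump above implies on each side. The fragments below are assembled from code quoted later in this thread; the constant name is assumed from the patch series, and the 140000 server-version check is the v72-era value, due to become 150000 once PG15 development opens.]

----------
/* Subscriber (walreceiver) side: only request two_phase from a
 * publisher that is new enough to understand the option. */
if (options->proto.logical.twophase &&
	PQserverVersion(conn->streamConn) >= 140000)
	appendStringInfoString(&cmd, ", two_phase 'on'");

/* Publisher (pgoutput) side: refuse two_phase on an older protocol. */
if (data->two_phase &&
	data->protocol_version < LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("requested proto_version=%d does not support two-phase commit, need %d or higher",
					data->protocol_version,
					LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
----------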
Please find attached the latest patch set v73*

Differences from v72* are:

* Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly)

* Minor documentation correction for protocol messages for Commit Prepared ('K')

* Non-functional code tidy (mostly proto.c) to reduce overloading of different meanings onto the same member names for prepare/commit times.

----

Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
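[A sketch of the member-name tidy-up mentioned above, assuming the u_op_time naming that appears in fragments quoted later in this thread; the surrounding ReorderBufferTXN fields are elided.]

----------
typedef struct ReorderBufferTXN
{
	/* ... other fields elided ... */

	/*
	 * Keep a union rather than overloading a single commit_time field, so
	 * the member name read at each call site matches its actual meaning.
	 */
	union
	{
		TimestampTz commit_time;	/* when the transaction committed */
		TimestampTz prepare_time;	/* when the transaction was prepared */
	}			u_op_time;
} ReorderBufferTXN;
----------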
On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Please find attached the latest patch set v73`* > > Differences from v72* are: > > * Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly) > > * Minor documentation correction for protocol messages for Commit Prepared ('K') > > * Non-functional code tidy (mostly proto.c) to reduce overloading > different meanings to same member names for prepare/commit times. Please find attached a re-posting of patch set v73* This is the same as yesterday's v73 but with a contrib module compile error fixed. (I have confirmed make check-world is OK for this patch set) ------ Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Wed, Apr 21, 2021 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Please find attached the latest patch set v73`* > > > > Differences from v72* are: > > > > * Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly) > > > > * Minor documentation correction for protocol messages for Commit Prepared ('K') > > > > * Non-functional code tidy (mostly proto.c) to reduce overloading > > different meanings to same member names for prepare/commit times. > > > Please find attached a re-posting of patch set v73* > > This is the same as yesterday's v73 but with a contrib module compile > error fixed. Thanks for the updated patch, few comments: 1) Should "final_lsn not set in begin message" be "prepare_lsn not set in begin message" +logicalrep_read_begin_prepare(StringInfo in, LogicalRepPreparedTxnData *begin_data) +{ + /* read fields */ + begin_data->prepare_lsn = pq_getmsgint64(in); + if (begin_data->prepare_lsn == InvalidXLogRecPtr) + elog(ERROR, "final_lsn not set in begin message"); 2) Should "These commands" be "ALTER SUBSCRIPTION ... REFRESH PUBLICATION and ALTER SUBSCRIPTION ... SET/ADD PUBLICATION ..." as copy_data cannot be specified with alter subscription .. drop publication. + These commands also cannot be executed with <literal>copy_data = true</literal> + when the subscription has <literal>two_phase</literal> commit enabled. See + column <literal>subtwophasestate</literal> of + <xref linkend="catalog-pg-subscription"/> to know the actual two-phase state. 3) <term>Byte1('A')</term> should be <term>Byte1('r')</term> as we have defined LOGICAL_REP_MSG_ROLLBACK_PREPARED as r. +<term>Rollback Prepared</term> +<listitem> +<para> + +<variablelist> + +<varlistentry> +<term>Byte1('A')</term> +<listitem><para> + Identifies this message as the rollback of a two-phase transaction message. +</para></listitem> +</varlistentry> 4) Should "Check if the prepared transaction with the given GID and lsn is around." be "Check if the prepared transaction with the given GID, lsn & timestamp is around." +/* + * LookupGXact + * Check if the prepared transaction with the given GID and lsn is around. + * + * Note that we always compare with the LSN where prepare ends because that is + * what is stored as origin_lsn in the 2PC file. + * + * This function is primarily used to check if the prepared transaction + * received from the upstream (remote node) already exists. Checking only GID + * is not sufficient because a different prepared xact with the same GID can + * exist on the same node. So, we are ensuring to match origin_lsn and + * origin_timestamp of prepared xact to avoid the possibility of a match of + * prepared xact from two different nodes. + */ 5) Should we change "The LSN of the prepare." to "The LSN of the begin prepare." +<term>Begin Prepare</term> +<listitem> +<para> + +<variablelist> + +<varlistentry> +<term>Byte1('b')</term> +<listitem><para> + Identifies this message as the beginning of a two-phase transaction message. +</para></listitem> +</varlistentry> + +<varlistentry> +<term>Int64</term> +<listitem><para> + The LSN of the prepare. 
+</para></listitem> +</varlistentry> 6) Similarly in cases of "Commit Prepared" and "Rollback Prepared" 7) No need to initialize has_subrels as we will always assign the value returned by HeapTupleIsValid +HasSubscriptionRelations(Oid subid) +{ + Relation rel; + int nkeys = 0; + ScanKeyData skey[2]; + SysScanDesc scan; + bool has_subrels = false; + + rel = table_open(SubscriptionRelRelationId, AccessShareLock); 8) We could include errhint, like errhint("Option \"two_phase\" specified more than once") to specify a more informative error message. + else if (strcmp(defel->defname, "two_phase") == 0) + { + if (two_phase_option_given) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"))); + two_phase_option_given = true; + + data->two_phase = defGetBoolean(defel); + } 9) We have a lot of function parameters for parse_subscription_options, should we change it to struct? @@ -69,7 +69,8 @@ parse_subscription_options(List *options, char **synchronous_commit, bool *refresh, bool *binary_given, bool *binary, - bool *streaming_given, bool *streaming) + bool *streaming_given, bool *streaming, + bool *twophase_given, bool *twophase) 10) Should we change " errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false, or use DROP/CREATE SUBSCRIPTION.")" to "errhint("Use ALTER SUBSCRIPTION ...SET/ADD PUBLICATION with refresh = false, or with copy_data = false.")" as we don't support copy_data in ALTER subscription ... DROP publication. + /* + * See ALTER_SUBSCRIPTION_REFRESH for details why this is + * not allowed. + */ + if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"), + errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = false, or with copy_data = false" + ", or use DROP/CREATE SUBSCRIPTION."))); 11) Should 14000 be 15000 as this feature will be committed in PG15 + if (options->proto.logical.twophase && + PQserverVersion(conn->streamConn) >= 140000) + appendStringInfoString(&cmd, ", two_phase 'on'"); 12) should we change "begin message" to "begin prepare message" + if (begin_data->prepare_lsn == InvalidXLogRecPtr) + elog(ERROR, "final_lsn not set in begin message"); + begin_data->end_lsn = pq_getmsgint64(in); + if (begin_data->end_lsn == InvalidXLogRecPtr) + elog(ERROR, "end_lsn not set in begin message"); 13) should we change "commit prepare message" to "commit prepared message" + if (flags != 0) + elog(ERROR, "unrecognized flags %u in commit prepare message", flags); + + /* read fields */ + prepare_data->commit_lsn = pq_getmsgint64(in); + if (prepare_data->commit_lsn == InvalidXLogRecPtr) + elog(ERROR, "commit_lsn is not set in commit prepared message"); + prepare_data->end_lsn = pq_getmsgint64(in); + if (prepare_data->end_lsn == InvalidXLogRecPtr) + elog(ERROR, "end_lsn is not set in commit prepared message"); + prepare_data->commit_time = pq_getmsgint64(in); 14) should we change "commit prepared message" to "rollback prepared message" +void +logicalrep_read_rollback_prepared(StringInfo in, + LogicalRepRollbackPreparedTxnData *rollback_data) +{ + /* read flags */ + uint8 flags = pq_getmsgbyte(in); + + if (flags != 0) + elog(ERROR, "unrecognized flags %u in rollback prepare message", flags); + + /* read fields */ + rollback_data->prepare_end_lsn = pq_getmsgint64(in); + if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr) + 
elog(ERROR, "prepare_end_lsn is not set in commit prepared message"); + rollback_data->rollback_end_lsn = pq_getmsgint64(in); + if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr) + elog(ERROR, "rollback_end_lsn is not set in commit prepared message"); + rollback_data->prepare_time = pq_getmsgint64(in); + rollback_data->rollback_time = pq_getmsgint64(in); + rollback_data->xid = pq_getmsgint(in, 4); + + /* read gid (copy it into a pre-allocated buffer) */ + strcpy(rollback_data->gid, pq_getmsgstring(in)); +} 15) We can include check pg_stat_replication_slots to verify if statistics is getting updated. +$node_publisher->safe_psql('postgres', " + BEGIN; + INSERT INTO tab_full VALUES (11); + PREPARE TRANSACTION 'test_prepared_tab_full';"); + +$node_publisher->wait_for_catchup($appname); + +# check that transaction is in prepared state on subscriber +my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;"); +is($result, qq(1), 'transaction is prepared on subscriber'); + +# check that 2PC gets committed on subscriber +$node_publisher->safe_psql('postgres', "COMMIT PREPARED 'test_prepared_tab_full';"); + +$node_publisher->wait_for_catchup($appname); Regards, Vignesh
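[Regarding comment 4 above, a simplified sketch of the matching rule that the quoted LookupGXact comment describes. The helper below is invented for illustration; the real function must also read origin_lsn/origin_timestamp back from the 2PC state data before it can compare.]

----------
/*
 * Illustrative only: a prepared transaction received from upstream matches
 * when the GID, the LSN where its PREPARE ended, and its prepare timestamp
 * all agree. GID alone is not enough, because a different prepared xact
 * with the same GID can exist (e.g. originating from another node).
 */
static bool
gxact_matches(const char *gxact_gid, XLogRecPtr origin_lsn,
			  TimestampTz origin_timestamp,
			  const char *gid, XLogRecPtr prepare_end_lsn,
			  TimestampTz prepare_timestamp)
{
	return strcmp(gxact_gid, gid) == 0 &&
		origin_lsn == prepare_end_lsn &&
		origin_timestamp == prepare_timestamp;
}
----------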
On Wed, Apr 21, 2021 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Please find attached the latest patch set v73`* > > > > Differences from v72* are: > > > > * Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly) > > > > * Minor documentation correction for protocol messages for Commit Prepared ('K') > > > > * Non-functional code tidy (mostly proto.c) to reduce overloading > > different meanings to same member names for prepare/commit times. > > > Please find attached a re-posting of patch set v73* Few comments when I was having a look at the tests added: 1) Can the below: +# check inserts are visible. 22 should be rolled back. 21 should be committed. +$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);"); +is($result, qq(1), 'Rows committed are on the subscriber'); +$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);"); +is($result, qq(0), 'Rows rolled back are not on the subscriber'); be changed to: $result = $node_subscriber->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);"); is($result, qq(21), 'Rows committed are on the subscriber'); And Test count need to be reduced to "use Test::More tests => 19" 2) we can change tx to transaction: +# check the tx state is prepared on subscriber(s) +$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;"); +is($result, qq(1), 'transaction is prepared on subscriber B'); +$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;"); +is($result, qq(1), 'transaction is prepared on subscriber C'); 3) There are few more instances present in the same file, those also can be changed. 4) Can the below: check inserts are visible at subscriber(s). # 22 should be rolled back. # 21 should be committed. $result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);"); is($result, qq(1), 'Rows committed are present on subscriber B'); $result = $node_B->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);"); is($result, qq(0), 'Rows rolled back are not present on subscriber B'); $result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (21);"); is($result, qq(1), 'Rows committed are present on subscriber C'); $result = $node_C->safe_psql('postgres', "SELECT count(*) FROM tab_full where a IN (22);"); is($result, qq(0), 'Rows rolled back are not present on subscriber C'); be changed to: $result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);"); is($result, qq(21), 'Rows committed are on the subscriber'); $result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where a IN (21,22);"); is($result, qq(21), 'Rows committed are on the subscriber'); And Test count need to be reduced to "use Test::More tests => 27" 5) should we change "Two phase commit" to "Two phase commit state" : + /* + * Binary, streaming, and two_phase are only supported in v14 and + * higher + */ if (pset.sversion >= 140000) appendPQExpBuffer(&buf, ", subbinary AS \"%s\"\n" - ", substream AS \"%s\"\n", + ", substream AS \"%s\"\n" + ", subtwophasestate AS \"%s\"\n", gettext_noop("Binary"), - gettext_noop("Streaming")); + gettext_noop("Streaming"), + gettext_noop("Two phase commit")); Regards, Vignesh
On Wed, Apr 21, 2021 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Please find attached the latest patch set v73`* > > > > Differences from v72* are: > > > > * Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly) > > > > * Minor documentation correction for protocol messages for Commit Prepared ('K') > > > > * Non-functional code tidy (mostly proto.c) to reduce overloading > > different meanings to same member names for prepare/commit times. > > > Please find attached a re-posting of patch set v73* > > This is the same as yesterday's v73 but with a contrib module compile > error fixed. Few comments on v73-0002-Add-prepare-API-support-for-streaming-transactio.patch patch: 1) There are slight differences in error message in case of Alter subscription ... drop publication, we can keep the error message similar: postgres=# ALTER SUBSCRIPTION mysub drop PUBLICATION mypub WITH (refresh = false, copy_data=true, two_phase=true); ERROR: unrecognized subscription parameter: "copy_data" postgres=# ALTER SUBSCRIPTION mysub drop PUBLICATION mypub WITH (refresh = false, two_phase=true, streaming=true); ERROR: cannot alter two_phase option 2) We are sending txn->xid twice, I felt we should send only once in logicalrep_write_stream_prepare: + /* transaction ID */ + Assert(TransactionIdIsValid(txn->xid)); + pq_sendint32(out, txn->xid); + + /* send the flags field */ + pq_sendbyte(out, flags); + + /* send fields */ + pq_sendint64(out, prepare_lsn); + pq_sendint64(out, txn->end_lsn); + pq_sendint64(out, txn->u_op_time.prepare_time); + pq_sendint32(out, txn->xid); + 3) We could remove xid and return prepare_data->xid +TransactionId +logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data) +{ + TransactionId xid; + uint8 flags; + + xid = pq_getmsgint(in, 4); 4) Here comments can be above apply_spooled_messages for better readability + /* + * 1. Replay all the spooled operations - Similar code as for + * apply_handle_stream_commit (i.e. non two-phase stream commit) + */ + + ensure_transaction(); + + nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn); + 5) Similarly this below comment can be above PrepareTransactionBlock + /* + * 2. Mark the transaction as prepared. - Similar code as for + * apply_handle_prepare (i.e. two-phase non-streamed prepare) + */ + + /* + * BeginTransactionBlock is necessary to balance the EndTransactionBlock + * called within the PrepareTransactionBlock below. + */ + BeginTransactionBlock(); + CommitTransactionCommand(); + + /* + * Update origin state so we can restart streaming from correct position + * in case of crash. + */ + replorigin_session_origin_lsn = prepare_data.end_lsn; + replorigin_session_origin_timestamp = prepare_data.prepare_time; + + PrepareTransactionBlock(gid); + CommitTransactionCommand(); + + pgstat_report_stat(false); 6) There is a lot of common code between apply_handle_stream_prepare and apply_handle_prepare, if possible try to have a common function to avoid fixing at both places. + /* + * 2. Mark the transaction as prepared. - Similar code as for + * apply_handle_prepare (i.e. two-phase non-streamed prepare) + */ + + /* + * BeginTransactionBlock is necessary to balance the EndTransactionBlock + * called within the PrepareTransactionBlock below. 
+ */ + BeginTransactionBlock(); + CommitTransactionCommand(); + + /* + * Update origin state so we can restart streaming from correct position + * in case of crash. + */ + replorigin_session_origin_lsn = prepare_data.end_lsn; + replorigin_session_origin_timestamp = prepare_data.prepare_time; + + PrepareTransactionBlock(gid); + CommitTransactionCommand(); + + pgstat_report_stat(false); + + store_flush_position(prepare_data.end_lsn); 7) two-phase commit is slightly misleading, we can just mention streaming prepare. + * PREPARE callback (for streaming two-phase commit). + * + * Notify the downstream to prepare the transaction. + */ +static void +pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx, + ReorderBufferTXN *txn, + XLogRecPtr prepare_lsn) 8) should we include Assert of in_streaming similar to other pgoutput_stream*** functions. +static void +pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx, + ReorderBufferTXN *txn, + XLogRecPtr prepare_lsn) +{ + Assert(rbtxn_is_streamed(txn)); + + OutputPluginUpdateProgress(ctx); + OutputPluginPrepareWrite(ctx, true); + logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn); + OutputPluginWrite(ctx, true); +} 9) Here also, we can verify that the transaction is streamed by checking the pg_stat_replication_slots. +# check that transaction is committed on subscriber +$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), count(c), count(d = 999) FROM test_tab"); +is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed on subscriber, and extra columns contain local defaults'); +$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;"); +is($result, qq(0), 'transaction is committed on subscriber'); Regards, Vignesh
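[To summarize the control flow being reviewed in comments 4-6 above, here is a condensed sketch of the streamed two-phase PREPARE path, assembled from the fragments quoted in this mail; locking, error handling and message parsing are elided, and the function name is marked as a sketch.]

----------
/* Sketch only: streamed two-phase PREPARE on the apply worker. */
static void
apply_handle_stream_prepare_sketch(TransactionId xid, char *gid,
								   LogicalRepPreparedTxnData *prepare_data)
{
	/* 1. Replay all the spooled operations, as in stream commit. */
	ensure_transaction();
	(void) apply_spooled_messages(xid, prepare_data->prepare_lsn);

	/* 2. Mark the transaction as prepared, as in non-streamed prepare. */
	BeginTransactionBlock();
	CommitTransactionCommand();

	/* Remember origin state so streaming can restart correctly on crash. */
	replorigin_session_origin_lsn = prepare_data->end_lsn;
	replorigin_session_origin_timestamp = prepare_data->prepare_time;

	PrepareTransactionBlock(gid);
	CommitTransactionCommand();

	pgstat_report_stat(false);
	store_flush_position(prepare_data->end_lsn);
}
----------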
Modified pgbench's "tpcb-like" builtin script as below to do two-phase commits, and then ran a 4-node cascaded replication setup.

	"BEGIN;\n"
	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
	"PREPARE TRANSACTION ':aid:';\n"
	"COMMIT PREPARED ':aid:';\n"

The tests ran fine and all 4 cascaded servers replicated the changes correctly. All the subscriptions were configured with two_phase enabled.

regards,
Ajin Cherian
Fujitsu Australia
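[For anyone reproducing this, a sketch of where such a change lives: pgbench's builtins are entries in a C table in src/bin/pgbench/pgbench.c, roughly of the shape below (struct and field names are paraphrased from pgbench.c; verify against the source). Alternatively, an unmodified pgbench can run the same script from a file via pgbench -f.]

----------
/* Paraphrased: each builtin is a name, a description, and the script. */
typedef struct BuiltinScript
{
	const char *name;			/* very short name for -b ... */
	const char *desc;			/* short description */
	const char *script;			/* actual pgbench script */
} BuiltinScript;

static const BuiltinScript builtin_script[] =
{
	{
		"tpcb-like",
		"<builtin: TPC-B (sort of)>",
		/* \set preamble elided; two-phase lines appended at the end */
		"BEGIN;\n"
		/* ... the UPDATE/SELECT/INSERT statements shown above ... */
		"PREPARE TRANSACTION ':aid:';\n"
		"COMMIT PREPARED ':aid:';\n"
	}
};
----------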
Please find attached the latest patch set v74* Differences from v73* are: * Rebased to HEAD @ 2 days ago. * v74 addresses most of the feedback comments from Vignesh posts [1][2][3]. * Please refer to the replies to [1][2][3] for details of what was fixed, and what was not. ---- [1] https://www.postgresql.org/message-id/CALDaNm1BxeN7NP9MjK4VRatjcXqWEO6K_35aVN0um%2B5AFdv-%2BA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CALDaNm2gWSYW5d%3DTCgqN04aV9bLFkXp8QOaE_iDXV9%3Dw1_%3DLpA%40mail.gmail.com [3] https://www.postgresql.org/message-id/CALDaNm0u%3DQGwd7jDAj-4u%3D7vvPn5rarFjBMCgfiJbDte55CWAA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Mon, Apr 26, 2021 at 9:22 PM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, Apr 21, 2021 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > Please find attached the latest patch set v73`* > > > > > > Differences from v72* are: > > > > > > * Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly) > > > > > > * Minor documentation correction for protocol messages for Commit Prepared ('K') > > > > > > * Non-functional code tidy (mostly proto.c) to reduce overloading > > > different meanings to same member names for prepare/commit times. > > > > > > Please find attached a re-posting of patch set v73* > > > > This is the same as yesterday's v73 but with a contrib module compile > > error fixed. > > Thanks for the updated patch, few comments: Thanks for your feedback comments, My replies are inline below. > 1) Should "final_lsn not set in begin message" be "prepare_lsn not set > in begin message" > +logicalrep_read_begin_prepare(StringInfo in, > LogicalRepPreparedTxnData *begin_data) > +{ > + /* read fields */ > + begin_data->prepare_lsn = pq_getmsgint64(in); > + if (begin_data->prepare_lsn == InvalidXLogRecPtr) > + elog(ERROR, "final_lsn not set in begin message"); > OK. Updated in v74. > 2) Should "These commands" be "ALTER SUBSCRIPTION ... REFRESH > PUBLICATION and ALTER SUBSCRIPTION ... SET/ADD PUBLICATION ..." as > copy_data cannot be specified with alter subscription .. drop > publication. > + These commands also cannot be executed with <literal>copy_data = > true</literal> > + when the subscription has <literal>two_phase</literal> commit enabled. See > + column <literal>subtwophasestate</literal> of > + <xref linkend="catalog-pg-subscription"/> to know the actual > two-phase state. OK. Updated in v74. While technically more correct, I think rewording it as suggested makes the doc harder to understand. But I have reworded it slightly to account for the fact that the copy_data setting is not possible with the DROP. > > 3) <term>Byte1('A')</term> should be <term>Byte1('r')</term> as we > have defined LOGICAL_REP_MSG_ROLLBACK_PREPARED as r. > +<term>Rollback Prepared</term> > +<listitem> > +<para> > + > +<variablelist> > + > +<varlistentry> > +<term>Byte1('A')</term> > +<listitem><para> > + Identifies this message as the rollback of a > two-phase transaction message. > +</para></listitem> > +</varlistentry> OK. Updated in v74. > > 4) Should "Check if the prepared transaction with the given GID and > lsn is around." be > "Check if the prepared transaction with the given GID, lsn & timestamp > is around." > +/* > + * LookupGXact > + * Check if the prepared transaction with the given GID > and lsn is around. > + * > + * Note that we always compare with the LSN where prepare ends because that is > + * what is stored as origin_lsn in the 2PC file. > + * > + * This function is primarily used to check if the prepared transaction > + * received from the upstream (remote node) already exists. Checking only GID > + * is not sufficient because a different prepared xact with the same GID can > + * exist on the same node. So, we are ensuring to match origin_lsn and > + * origin_timestamp of prepared xact to avoid the possibility of a match of > + * prepared xact from two different nodes. > + */ OK. Updated in v74. > > 5) Should we change "The LSN of the prepare." to "The LSN of the begin prepare." 
> +<term>Begin Prepare</term> > +<listitem> > +<para> > + > +<variablelist> > + > +<varlistentry> > +<term>Byte1('b')</term> > +<listitem><para> > + Identifies this message as the beginning of a > two-phase transaction message. > +</para></listitem> > +</varlistentry> > + > +<varlistentry> > +<term>Int64</term> > +<listitem><para> > + The LSN of the prepare. > +</para></listitem> > +</varlistentry> > Not updated. The PG Docs is correct as-is I think. > > 6) Similarly in cases of "Commit Prepared" and "Rollback Prepared" Not updated. AFAIK these are correct – it really is LSN of the PREPARE just like it says. > > 7) No need to initialize has_subrels as we will always assign the > value returned by HeapTupleIsValid > +HasSubscriptionRelations(Oid subid) > +{ > + Relation rel; > + int nkeys = 0; > + ScanKeyData skey[2]; > + SysScanDesc scan; > + bool has_subrels = false; > + > + rel = table_open(SubscriptionRelRelationId, AccessShareLock); OK. Updated in v74. > > 8) We could include errhint, like errhint("Option \"two_phase\" > specified more than once") to specify a more informative error > message. > + else if (strcmp(defel->defname, "two_phase") == 0) > + { > + if (two_phase_option_given) > + ereport(ERROR, > + (errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("conflicting > or redundant options"))); > + two_phase_option_given = true; > + > + data->two_phase = defGetBoolean(defel); > + } > Not updated. Yes, maybe it would be better like you say, but the code would then be inconsistent with every other option in this function. Perhaps your idea can be raised as a separate patch to fix all of them. > 9) We have a lot of function parameters for > parse_subscription_options, should we change it to struct? > @@ -69,7 +69,8 @@ parse_subscription_options(List *options, > char **synchronous_commit, > bool *refresh, > bool *binary_given, > bool *binary, > - bool > *streaming_given, bool *streaming) > + bool > *streaming_given, bool *streaming, > + bool > *twophase_given, bool *twophase) Not updated. This is not really related to the 2PC functionality so I think your idea might be good, but it should be done as a later refactoring patch after the 2PC patch is pushed. > > 10) Should we change " errhint("Use ALTER SUBSCRIPTION ...SET > PUBLICATION with refresh = false, or with copy_data = false, or use > DROP/CREATE SUBSCRIPTION.")" to "errhint("Use ALTER SUBSCRIPTION > ...SET/ADD PUBLICATION with refresh = false, or with copy_data = > false.")" as we don't support copy_data in ALTER subscription ... DROP > publication. > + /* > + * See > ALTER_SUBSCRIPTION_REFRESH for details why this is > + * not allowed. > + */ > + if (sub->twophasestate == > LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data) > + ereport(ERROR, > + > (errcode(ERRCODE_SYNTAX_ERROR), > + > errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed > when two_phase is enabled"), > + > errhint("Use ALTER SUBSCRIPTION ...SET PUBLICATION with refresh = > false, or with copy_data = false" > + > ", or use DROP/CREATE SUBSCRIPTION."))); > Not updated. The hint is saying that one workaround is to DROP and re-CREATE the SUBSCRIPTIPON. It doesn’t say anything about “support of copy_data in ALTER subscription ... DROP publication.” So I did not understand the point of your comment. > 11) Should 14000 be 15000 as this feature will be committed in PG15 > + if (options->proto.logical.twophase && > + PQserverVersion(conn->streamConn) >= 140000) > + appendStringInfoString(&cmd, ", two_phase 'on'"); > Not updated. 
This is already a known TODO task; I will do this as soon as PG15 development starts. > 12) should we change "begin message" to "begin prepare message" > + if (begin_data->prepare_lsn == InvalidXLogRecPtr) > + elog(ERROR, "final_lsn not set in begin message"); > + begin_data->end_lsn = pq_getmsgint64(in); > + if (begin_data->end_lsn == InvalidXLogRecPtr) > + elog(ERROR, "end_lsn not set in begin message"); OK. Updated in v74. > > 13) should we change "commit prepare message" to "commit prepared message" > + if (flags != 0) > + elog(ERROR, "unrecognized flags %u in commit prepare > message", flags); > + > + /* read fields */ > + prepare_data->commit_lsn = pq_getmsgint64(in); > + if (prepare_data->commit_lsn == InvalidXLogRecPtr) > + elog(ERROR, "commit_lsn is not set in commit prepared message"); > + prepare_data->end_lsn = pq_getmsgint64(in); > + if (prepare_data->end_lsn == InvalidXLogRecPtr) > + elog(ERROR, "end_lsn is not set in commit prepared message"); > + prepare_data->commit_time = pq_getmsgint64(in); > OK, updated in v74 > 14) should we change "commit prepared message" to "rollback prepared message" > +void > +logicalrep_read_rollback_prepared(StringInfo in, > + > LogicalRepRollbackPreparedTxnData *rollback_data) > +{ > + /* read flags */ > + uint8 flags = pq_getmsgbyte(in); > + > + if (flags != 0) > + elog(ERROR, "unrecognized flags %u in rollback prepare > message", flags); > + > + /* read fields */ > + rollback_data->prepare_end_lsn = pq_getmsgint64(in); > + if (rollback_data->prepare_end_lsn == InvalidXLogRecPtr) > + elog(ERROR, "prepare_end_lsn is not set in commit > prepared message"); > + rollback_data->rollback_end_lsn = pq_getmsgint64(in); > + if (rollback_data->rollback_end_lsn == InvalidXLogRecPtr) > + elog(ERROR, "rollback_end_lsn is not set in commit > prepared message"); > + rollback_data->prepare_time = pq_getmsgint64(in); > + rollback_data->rollback_time = pq_getmsgint64(in); > + rollback_data->xid = pq_getmsgint(in, 4); > + > + /* read gid (copy it into a pre-allocated buffer) */ > + strcpy(rollback_data->gid, pq_getmsgstring(in)); > +} OK. Updated in v74. > > 15) We can include check pg_stat_replication_slots to verify if > statistics is getting updated. > +$node_publisher->safe_psql('postgres', " > + BEGIN; > + INSERT INTO tab_full VALUES (11); > + PREPARE TRANSACTION 'test_prepared_tab_full';"); > + > +$node_publisher->wait_for_catchup($appname); > + > +# check that transaction is in prepared state on subscriber > +my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) > FROM pg_prepared_xacts;"); > +is($result, qq(1), 'transaction is prepared on subscriber'); > + > +# check that 2PC gets committed on subscriber > +$node_publisher->safe_psql('postgres', "COMMIT PREPARED > 'test_prepared_tab_full';"); > + > +$node_publisher->wait_for_catchup($appname); Not updated. But I recorded this as a TODO task - I agree we need to introduce some stats tests later. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Apr 27, 2021 at 1:41 PM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, Apr 21, 2021 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > Please find attached the latest patch set v73`* > > > > > > Differences from v72* are: > > > > > > * Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly) > > > > > > * Minor documentation correction for protocol messages for Commit Prepared ('K') > > > > > > * Non-functional code tidy (mostly proto.c) to reduce overloading > > > different meanings to same member names for prepare/commit times. > > > > > > Please find attached a re-posting of patch set v73* > > Few comments when I was having a look at the tests added: Thanks for your feedback comments. My replies are inline below. > 1) Can the below: > +# check inserts are visible. 22 should be rolled back. 21 should be committed. > +$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) > FROM tab_full where a IN (21);"); > +is($result, qq(1), 'Rows committed are on the subscriber'); > +$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) > FROM tab_full where a IN (22);"); > +is($result, qq(0), 'Rows rolled back are not on the subscriber'); > > be changed to: > $result = $node_subscriber->safe_psql('postgres', "SELECT a FROM > tab_full where a IN (21,22);"); > is($result, qq(21), 'Rows committed are on the subscriber'); > > And Test count need to be reduced to "use Test::More tests => 19" > OK. Updated in v74. > 2) we can change tx to transaction: > +# check the tx state is prepared on subscriber(s) > +$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM > pg_prepared_xacts;"); > +is($result, qq(1), 'transaction is prepared on subscriber B'); > +$result = $node_C->safe_psql('postgres', "SELECT count(*) FROM > pg_prepared_xacts;"); > +is($result, qq(1), 'transaction is prepared on subscriber C'); > OK. Updated in v74 > 3) There are few more instances present in the same file, those also > can be changed. OK. I found no others in the same file, but there were similar cases in the 021 TAP test. Those were also updated in v74/ > > 4) Can the below: > check inserts are visible at subscriber(s). > # 22 should be rolled back. > # 21 should be committed. > $result = $node_B->safe_psql('postgres', "SELECT count(*) FROM > tab_full where a IN (21);"); > is($result, qq(1), 'Rows committed are present on subscriber B'); > $result = $node_B->safe_psql('postgres', "SELECT count(*) FROM > tab_full where a IN (22);"); > is($result, qq(0), 'Rows rolled back are not present on subscriber B'); > $result = $node_C->safe_psql('postgres', "SELECT count(*) FROM > tab_full where a IN (21);"); > is($result, qq(1), 'Rows committed are present on subscriber C'); > $result = $node_C->safe_psql('postgres', "SELECT count(*) FROM > tab_full where a IN (22);"); > is($result, qq(0), 'Rows rolled back are not present on subscriber C'); > > be changed to: > $result = $node_B->safe_psql('postgres', "SELECT a FROM tab_full where > a IN (21,22);"); > is($result, qq(21), 'Rows committed are on the subscriber'); > $result = $node_C->safe_psql('postgres', "SELECT a FROM tab_full where > a IN (21,22);"); > is($result, qq(21), 'Rows committed are on the subscriber'); > > And Test count need to be reduced to "use Test::More tests => 27" > OK. Updated in v74. 
> 5) should we change "Two phase commit" to "Two phase commit state" : > + /* > + * Binary, streaming, and two_phase are only supported > in v14 and > + * higher > + */ > if (pset.sversion >= 140000) > appendPQExpBuffer(&buf, > ", subbinary > AS \"%s\"\n" > - ", substream > AS \"%s\"\n", > + ", substream > AS \"%s\"\n" > + ", > subtwophasestate AS \"%s\"\n", > > gettext_noop("Binary"), > - > gettext_noop("Streaming")); > + > gettext_noop("Streaming"), > + > gettext_noop("Two phase commit")); > Not updated. I think the column name is already the longest one and this just makes it even longer - far too long IMO. I am not sure what is better having the “state” suffix. After all, booleans are also states. Anyway, I did not make this change now but if people feel strongly about it then I can revisit it. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Apr 27, 2021 at 6:17 PM vignesh C <vignesh21@gmail.com> wrote: > > On Wed, Apr 21, 2021 at 12:13 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Tue, Apr 20, 2021 at 3:45 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > Please find attached the latest patch set v73`* > > > > > > Differences from v72* are: > > > > > > * Rebased to HEAD @ today (required because v72-0001 no longer applied cleanly) > > > > > > * Minor documentation correction for protocol messages for Commit Prepared ('K') > > > > > > * Non-functional code tidy (mostly proto.c) to reduce overloading > > > different meanings to same member names for prepare/commit times. > > > > > > Please find attached a re-posting of patch set v73* > > > > This is the same as yesterday's v73 but with a contrib module compile > > error fixed. > > Few comments on > v73-0002-Add-prepare-API-support-for-streaming-transactio.patch patch: Thanks for your feedback comments. My replies are inline below. > 1) There are slight differences in error message in case of Alter > subscription ... drop publication, we can keep the error message > similar: > postgres=# ALTER SUBSCRIPTION mysub drop PUBLICATION mypub WITH > (refresh = false, copy_data=true, two_phase=true); > ERROR: unrecognized subscription parameter: "copy_data" > postgres=# ALTER SUBSCRIPTION mysub drop PUBLICATION mypub WITH > (refresh = false, two_phase=true, streaming=true); > ERROR: cannot alter two_phase option OK. Updated in v74. > > 2) We are sending txn->xid twice, I felt we should send only once in > logicalrep_write_stream_prepare: > + /* transaction ID */ > + Assert(TransactionIdIsValid(txn->xid)); > + pq_sendint32(out, txn->xid); > + > + /* send the flags field */ > + pq_sendbyte(out, flags); > + > + /* send fields */ > + pq_sendint64(out, prepare_lsn); > + pq_sendint64(out, txn->end_lsn); > + pq_sendint64(out, txn->u_op_time.prepare_time); > + pq_sendint32(out, txn->xid); > + > OK. Updated in v74. > 3) We could remove xid and return prepare_data->xid > +TransactionId > +logicalrep_read_stream_prepare(StringInfo in, > LogicalRepPreparedTxnData *prepare_data) > +{ > + TransactionId xid; > + uint8 flags; > + > + xid = pq_getmsgint(in, 4); > OK. Updated in v74. > 4) Here comments can be above apply_spooled_messages for better readability > + /* > + * 1. Replay all the spooled operations - Similar code as for > + * apply_handle_stream_commit (i.e. non two-phase stream commit) > + */ > + > + ensure_transaction(); > + > + nchanges = apply_spooled_messages(xid, prepare_data.prepare_lsn); > + > Not done. It was deliberately commented this way because the part below the comment is what is in apply_handle_stream_commit. > 5) Similarly this below comment can be above PrepareTransactionBlock > + /* > + * 2. Mark the transaction as prepared. - Similar code as for > + * apply_handle_prepare (i.e. two-phase non-streamed prepare) > + */ > + > + /* > + * BeginTransactionBlock is necessary to balance the EndTransactionBlock > + * called within the PrepareTransactionBlock below. > + */ > + BeginTransactionBlock(); > + CommitTransactionCommand(); > + > + /* > + * Update origin state so we can restart streaming from correct position > + * in case of crash. > + */ > + replorigin_session_origin_lsn = prepare_data.end_lsn; > + replorigin_session_origin_timestamp = prepare_data.prepare_time; > + > + PrepareTransactionBlock(gid); > + CommitTransactionCommand(); > + > + pgstat_report_stat(false); > Not done. 
It is deliberately commented this way because the part below the comment is what is in apply_handle_prepare. > 6) There is a lot of common code between apply_handle_stream_prepare > and apply_handle_prepare, if possible try to have a common function to > avoid fixing at both places. > + /* > + * 2. Mark the transaction as prepared. - Similar code as for > + * apply_handle_prepare (i.e. two-phase non-streamed prepare) > + */ > + > + /* > + * BeginTransactionBlock is necessary to balance the EndTransactionBlock > + * called within the PrepareTransactionBlock below. > + */ > + BeginTransactionBlock(); > + CommitTransactionCommand(); > + > + /* > + * Update origin state so we can restart streaming from correct position > + * in case of crash. > + */ > + replorigin_session_origin_lsn = prepare_data.end_lsn; > + replorigin_session_origin_timestamp = prepare_data.prepare_time; > + > + PrepareTransactionBlock(gid); > + CommitTransactionCommand(); > + > + pgstat_report_stat(false); > + > + store_flush_position(prepare_data.end_lsn); > Not done. If you diff those functions there are really only ~ 10 statements in common so I felt it is more readable to keep it this way than to try to make a “common” function out of an arbitrary code fragment. > 7) two-phase commit is slightly misleading, we can just mention > streaming prepare. > + * PREPARE callback (for streaming two-phase commit). > + * > + * Notify the downstream to prepare the transaction. > + */ > +static void > +pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx, > + ReorderBufferTXN *txn, > + XLogRecPtr prepare_lsn) > OK. Updated in v74. > 8) should we include Assert of in_streaming similar to other > pgoutput_stream*** functions. > +static void > +pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx, > + ReorderBufferTXN *txn, > + XLogRecPtr prepare_lsn) > +{ > + Assert(rbtxn_is_streamed(txn)); > + > + OutputPluginUpdateProgress(ctx); > + OutputPluginPrepareWrite(ctx, true); > + logicalrep_write_stream_prepare(ctx->out, txn, prepare_lsn); > + OutputPluginWrite(ctx, true); > +} > Not done. AFAIK it is correct as-is. > 9) Here also, we can verify that the transaction is streamed by > checking the pg_stat_replication_slots. > +# check that transaction is committed on subscriber > +$result = $node_subscriber->safe_psql('postgres', "SELECT count(*), > count(c), count(d = 999) FROM test_tab"); > +is($result, qq(3334|3334|3334), 'Rows inserted by 2PC have committed > on subscriber, and extra columns contain local defaults'); > +$result = $node_subscriber->safe_psql('postgres', "SELECT count(*) > FROM pg_prepared_xacts;"); > +is($result, qq(0), 'transaction is committed on subscriber'); > Not done. If the purpose of this comment is just to confirm that the SQL INSERT of 5000 rows of md5 data exceeds 64K then I think we can simply take that as self-evident. We don’t need some SQL to confirm it. If the purpose of this is just to ensure that stats work properly with 2PC then I agree that there should be some test cases added for stats, but this has already been recorded elsewhere as a future TODO task. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Thu, Apr 29, 2021 at 2:23 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please find attached the latest patch set v74*
>
> Differences from v73* are:
>
> * Rebased to HEAD @ 2 days ago.
>
> * v74 addresses most of the feedback comments from Vignesh posts [1][2][3].
>

Thanks for the updated patch.
A few comments:

1) I felt skey[2] should be skey as we are just using one key here.

+	ScanKeyData skey[2];
+	SysScanDesc scan;
+	bool		has_subrels;
+
+	rel = table_open(SubscriptionRelRelationId, AccessShareLock);
+
+	ScanKeyInit(&skey[nkeys++],
+				Anum_pg_subscription_rel_srsubid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(subid));
+
+	scan = systable_beginscan(rel, InvalidOid, false,
+							  NULL, nkeys, skey);
+

2) I felt we can change the lsn data type from Int64 to XLogRecPtr

+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The LSN of the prepare.
+</para></listitem>
+</varlistentry>
+
+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>

3) I felt we can change the xid data type from Int32 to TransactionId

+<varlistentry>
+<term>Int32</term>
+<listitem><para>
+                Xid of the subtransaction (will be same as xid of the transaction for top-level
+                transactions).
+</para></listitem>
+</varlistentry>

4) Should we change this to "The end LSN of the prepared transaction"
just to avoid any confusion of it meaning commit/rollback.

+<varlistentry>
+<term>Int64</term>
+<listitem><para>
+                The end LSN of the transaction.
+</para></listitem>
+</varlistentry>

Similar problems related to comments 2 and 3 are being discussed at
[1], we can change it accordingly based on the conclusion in the other
thread.
[1] - https://www.postgresql.org/message-id/flat/CAHut%2BPs2JsSd_OpBR9kXt1Rt4bwyXAjh875gUpFw6T210ttO7Q%40mail.gmail.com#cf2a85d0623dcadfbb1204a196681313

Regards,
Vignesh
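[Putting comment 1 above together with comment 7 from the earlier review (assign has_subrels directly from HeapTupleIsValid), a sketch of how the simplified function could end up looking; it is assembled from the quoted fragments and is not necessarily the patch's final form.]

----------
/* Sketch: does the subscription have any relations? */
static bool
HasSubscriptionRelations(Oid subid)
{
	Relation	rel;
	ScanKeyData skey;
	SysScanDesc scan;
	bool		has_subrels;

	rel = table_open(SubscriptionRelRelationId, AccessShareLock);

	/* Only one key is needed, so no array and no nkeys counter. */
	ScanKeyInit(&skey,
				Anum_pg_subscription_rel_srsubid,
				BTEqualStrategyNumber, F_OIDEQ,
				ObjectIdGetDatum(subid));

	scan = systable_beginscan(rel, InvalidOid, false, NULL, 1, &skey);

	/* If even one tuple exists then the subscription has relations. */
	has_subrels = HeapTupleIsValid(systable_getnext(scan));

	systable_endscan(scan);
	table_close(rel, AccessShareLock);

	return has_subrels;
}
----------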
On Mon, May 10, 2021 at 1:31 PM vignesh C <vignesh21@gmail.com> wrote: > > 4) Should we change this to "The end LSN of the prepared transaction" > just to avoid any confusion of it meaning commit/rollback. > +<varlistentry> > +<term>Int64</term> > +<listitem><para> > + The end LSN of the transaction. > +</para></listitem> > +</varlistentry> > Can you please provide more details so I can be sure of the context of this feedback, e.g. there are multiple places that match that patch fragment provided. So was this suggestion to change all of them ( 'b', 'P', 'K' , 'r' of patch 0001; and also 'p' of patch 0002) ? ------ Kind Regards, Peter Smith. Fujitsu Australia.
On Mon, May 10, 2021 at 10:51 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Mon, May 10, 2021 at 1:31 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > 4) Should we change this to "The end LSN of the prepared transaction" > > just to avoid any confusion of it meaning commit/rollback. > > +<varlistentry> > > +<term>Int64</term> > > +<listitem><para> > > + The end LSN of the transaction. > > +</para></listitem> > > +</varlistentry> > > > > Can you please provide more details so I can be sure of the context of > this feedback, e.g. there are multiple places that match that patch > fragment provided. So was this suggestion to change all of them ( 'b', > 'P', 'K' , 'r' of patch 0001; and also 'p' of patch 0002) ? My suggestion was for all of them. Regards, Vignesh
Please find attached the latest patch set v75* Differences from v74* are: * Rebased to HEAD @ today. * v75 also addresses some of the feedback comments from Vignesh [1]. ---- [1] https://www.postgresql.org/message-id/CALDaNm3U4fGxTnQfaT1TqUkgX5c0CSDvmW12Bfksis8zB_XinA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Mon, May 10, 2021 at 1:31 PM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, Apr 29, 2021 at 2:23 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Please find attached the latest patch set v74* > > > > Differences from v73* are: > > > > * Rebased to HEAD @ 2 days ago. > > > > * v74 addresses most of the feedback comments from Vignesh posts [1][2][3]. > > > > Thanks for the updated patch. > Few comments: > 1) I felt skey[2] should be skey as we are just using one key here. > > + ScanKeyData skey[2]; > + SysScanDesc scan; > + bool has_subrels; > + > + rel = table_open(SubscriptionRelRelationId, AccessShareLock); > + > + ScanKeyInit(&skey[nkeys++], > + Anum_pg_subscription_rel_srsubid, > + BTEqualStrategyNumber, F_OIDEQ, > + ObjectIdGetDatum(subid)); > + > + scan = systable_beginscan(rel, InvalidOid, false, > + NULL, nkeys, skey); > + > Fixed in v75. > 2) I felt we can change lsn data type from Int64 to XLogRecPtr > +<varlistentry> > +<term>Int64</term> > +<listitem><para> > + The LSN of the prepare. > +</para></listitem> > +</varlistentry> > + > +<varlistentry> > +<term>Int64</term> > +<listitem><para> > + The end LSN of the transaction. > +</para></listitem> > +</varlistentry> Deferred. > > 3) I felt we can change lsn data type from Int32 to TransactionId > +<varlistentry> > +<term>Int32</term> > +<listitem><para> > + Xid of the subtransaction (will be same as xid of the > transaction for top-level > + transactions). > +</para></listitem> > +</varlistentry> Deferred. > > 4) Should we change this to "The end LSN of the prepared transaction" > just to avoid any confusion of it meaning commit/rollback. > +<varlistentry> > +<term>Int64</term> > +<listitem><para> > + The end LSN of the transaction. > +</para></listitem> > +</varlistentry> > Modified in v75 for message types 'b', 'P', 'K', 'r', 'p'. > Similar problems related to comments 2 and 3 are being discussed at > [1], we can change it accordingly based on the conclusion in the other > thread. > [1] - https://www.postgresql.org/message-id/flat/CAHut%2BPs2JsSd_OpBR9kXt1Rt4bwyXAjh875gUpFw6T210ttO7Q%40mail.gmail.com#cf2a85d0623dcadfbb1204a196681313 Yes, I will defer addressing those feedback comments 2 and 3 pending the outcome of your other patch of the above thread. ---------- Kind Regards, Peter Smith. Fujitsu Australia
On Thu, May 13, 2021 at 7:50 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please find attached the latest patch set v75*
>
> Differences from v74* are:
>
> * Rebased to HEAD @ today.
>
> * v75 also addresses some of the feedback comments from Vignesh [1].

Adding a patch to this patch-set that prevents empty transactions from
being sent to the subscriber/replica. This patch is based on the logic
that was proposed for empty transactions in the thread [1], and extends
it to handle empty prepared transactions as well, so empty prepared
transactions are also not sent to the subscriber/replica.

The patch also avoids sending COMMIT PREPARED / ROLLBACK PREPARED when
the prepared transaction itself was skipped, provided the COMMIT/ROLLBACK
happens prior to a restart of the walsender. If the COMMIT/ROLLBACK
PREPARED happens after a restart, the walsender will not be able to know
that the prepared transaction from before the restart was never sent; in
that case the apply worker of the subscription checks whether a matching
prepared transaction exists and, if it does not, silently ignores the
COMMIT PREPARED (the ROLLBACK PREPARED logic was already doing this).

Do have a look and let me know if you have any comments.

[1] - https://www.postgresql.org/message-id/CAFPTHDYegcoS3xjGBj0XHfcdZr6Y35%2BYG1jq79TBD1VCkK7v3A%40mail.gmail.com

regards,
Ajin Cherian
Fujitsu Australia.
Attachment
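[A hypothetical sketch of the publisher-side mechanics described above, under the assumption that the patch keeps per-transaction output-plugin state; all type, field and function names here are illustrative, not necessarily those used in the attached patch. The idea: defer BEGIN PREPARE until the first real change, and if nothing was ever sent, skip the PREPARE itself.]

----------
/* Illustrative per-transaction state hung off the output plugin. */
typedef struct PGOutputTxnData
{
	bool		sent_begin_txn; /* false until BEGIN PREPARE is sent */
} PGOutputTxnData;

/* Called before writing the first change of a two-phase transaction. */
static void
pgoutput_ensure_begin_prepare(LogicalDecodingContext *ctx,
							  ReorderBufferTXN *txn,
							  PGOutputTxnData *txndata)
{
	if (!txndata->sent_begin_txn)
	{
		OutputPluginPrepareWrite(ctx, false);
		logicalrep_write_begin_prepare(ctx->out, txn);
		OutputPluginWrite(ctx, false);
		txndata->sent_begin_txn = true;
	}
}

/* In the prepare callback: an empty transaction never sent its BEGIN
 * PREPARE, so skip sending the PREPARE too. */
static void
pgoutput_prepare_txn_sketch(LogicalDecodingContext *ctx,
							ReorderBufferTXN *txn,
							PGOutputTxnData *txndata,
							XLogRecPtr prepare_lsn)
{
	if (!txndata->sent_begin_txn)
		return;

	OutputPluginUpdateProgress(ctx);
	OutputPluginPrepareWrite(ctx, true);
	logicalrep_write_prepare(ctx->out, txn, prepare_lsn);
	OutputPluginWrite(ctx, true);
}
----------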
The above patch set had some changes missing, which resulted in some
TAP tests failing. Sending an updated patch set; keeping the patch set
version the same.

regards,
Ajin Cherian
Fujitsu Australia
Attachment
On Mon, May 17, 2021 at 6:10 PM Ajin Cherian <itsajin@gmail.com> wrote: > > The above patch had some changes missing which resulted in some tap > tests failing. Sending an updated patchset. Keeping the patchset > version the same. Thanks for the updated patch, the updated patch fixes the tap test failures. Regards, Vignesh
On Sun, May 16, 2021 at 12:07 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Thu, May 13, 2021 at 7:50 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Please find attached the latest patch set v75* > > > > Differences from v74* are: > > > > * Rebased to HEAD @ today. > > > > * v75 also addresses some of the feedback comments from Vignesh [1]. > > Adding a patch to this patch-set that avoids empty transactions from > being sent to the subscriber/replica. This patch is based on the > logic that was proposed for empty transactions in the thread [1]. This > patch uses that patch and handles empty prepared transactions > as well. So, this will avoid empty prepared transactions from being > sent to the subscriber/replica. This patch also avoids sending > COMMIT PREPARED /ROLLBACK PREPARED if the prepared transaction was > skipped provided the COMMIT /ROLLBACK happens > prior to a restart of the walsender. If the COMMIT/ROLLBACK PREPARED > happens after a restart, it will not be able know that the > prepared transaction prior to the restart was not sent, in this case > the apply worker of the subscription will check if a prepare of the > same type exists > and if it does not, it will silently ignore the COMMIT PREPARED > (ROLLBACK PREPARED logic was already doing this). > Do have a look and let me know if you have any comments. > > [1] - https://www.postgresql.org/message-id/CAFPTHDYegcoS3xjGBj0XHfcdZr6Y35%2BYG1jq79TBD1VCkK7v3A%40mail.gmail.com > Hi Ajin. I have applied the latest patch set v76*. The patches applied cleanly. All of the make, make check, and subscription TAP tests worked OK. Below are my REVIEW COMMENTS for the v76-0003 part. ========== 1. File: doc/src/sgml/logicaldecoding.sgml 1.1 @@ -862,11 +862,19 @@ typedef void (*LogicalDecodePrepareCB) (struct LogicalDecodingContext *ctx, The required <function>commit_prepared_cb</function> callback is called whenever a transaction <command>COMMIT PREPARED</command> has been decoded. The <parameter>gid</parameter> field, which is part of the - <parameter>txn</parameter> parameter, can be used in this callback. + <parameter>txn</parameter> parameter, can be used in this callback. The + parameters <parameter>prepare_end_lsn</parameter> and + <parameter>prepare_time</parameter> can be used to check if the plugin + has received this <command>PREPARE TRANSACTION</command> in which case + it can apply the rollback, otherwise, it can skip the rollback operation. The + <parameter>gid</parameter> alone is not sufficient because the downstream + node can have a prepared transaction with same identifier. This is in the commit prepared section, but that new text is referring to "it can apply to the rollback" etc. Is this deliberate text, or maybe a cut/paste error? ========== 2. File: src/backend/replication/pgoutput/pgoutput.c 2.1 @@ -76,6 +78,7 @@ static void pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx, static bool publications_valid; static bool in_streaming; +static bool in_prepared_txn; Wondering why this is a module static flag. That makes it look like it somehow applies globally to all the functions in this scope, but really I think this is just a txn property, right? - e.g. why not use another member of the private TXN data instead? or - e.g. why not use rbtxn_prepared(txn) macro? 
---------- 2.2 @@ -404,10 +410,32 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt, static void pgoutput_begin_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn) { + PGOutputTxnData *data = MemoryContextAllocZero(ctx->context, + sizeof(PGOutputTxnData)); + + (void)txn; /* keep compiler quiet */ I guess since the arg "txn" is now being used, the added statement to "keep compiler quiet" is redundant, so it should be removed. ---------- 2.3 +static void +pgoutput_begin(LogicalDecodingContext *ctx, ReorderBufferTXN *txn) +{ bool send_replication_origin = txn->origin_id != InvalidRepOriginId; + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; OutputPluginPrepareWrite(ctx, !send_replication_origin); logicalrep_write_begin(ctx->out, txn); + data->sent_begin_txn = true; I wondered if it is worth adding Assert(data); here? ---------- 2.4 @@ -422,8 +450,14 @@ static void pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, XLogRecPtr commit_lsn) { + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; + OutputPluginUpdateProgress(ctx); I wondered if it is worthwhile to add Assert(data); here also? ---------- 2.5 @@ -422,8 +450,14 @@ static void pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, XLogRecPtr commit_lsn) { + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; + OutputPluginUpdateProgress(ctx); + /* skip COMMIT message if nothing was sent */ + if (!data->sent_begin_txn) + return; Shouldn't this code also be freeing that allocated data? I think you do free it in similar functions later in this patch. ---------- 2.6 @@ -435,10 +469,31 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, static void pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn) { + PGOutputTxnData *data = MemoryContextAllocZero(ctx->context, + sizeof(PGOutputTxnData)); + + /* + * Don't send BEGIN message here. Instead, postpone it until the first + * change. In logical replication, a common scenario is to replicate a set + * of tables (instead of all tables) and transactions whose changes were on + * table(s) that are not published will produce empty transactions. These + * empty transactions will send BEGIN and COMMIT messages to subscribers, + * using bandwidth on something with little/no use for logical replication. + */ + data->sent_begin_txn = false; + txn->output_plugin_private = data; + in_prepared_txn = true; +} Apart from setting the in_prepared_txn = true; this is all identical code to pgoutput_begin_txn, so you could consider just delegating to that other function to save all the cut/paste data allocation and the big comment. Or maybe this way is better - I am not sure. ---------- 2.7 +static void +pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn) +{ bool send_replication_origin = txn->origin_id != InvalidRepOriginId; + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; OutputPluginPrepareWrite(ctx, !send_replication_origin); logicalrep_write_begin_prepare(ctx->out, txn); + data->sent_begin_txn = true; I wondered if it is worth adding Assert(data); here also? ---------- 2.8 @@ -453,11 +508,18 @@ static void pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, XLogRecPtr prepare_lsn) { + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; + OutputPluginUpdateProgress(ctx); I wondered if it is worth adding Assert(data); here also? 
---------- 2.9 @@ -465,12 +527,28 @@ pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, */ static void pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, - XLogRecPtr commit_lsn) + XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn, + TimestampTz prepare_time) { + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; + OutputPluginUpdateProgress(ctx); + /* + * skip sending COMMIT PREPARED message if prepared transaction + * has not been sent. + */ + if (data && !data->sent_begin_txn) + { + pfree(data); + return; + } + + if (data) + pfree(data); OutputPluginPrepareWrite(ctx, true); I think this pfree logic might be refactored more simply to just be done in one place. e.g. like: if (data) { bool skip = !data->sent_begin_txn; pfree(data); if (skip) return; } BTW, is it even possible to get into this function with NULL private data? Perhaps that should be an Assert(data) ? ---------- 2.10 @@ -483,8 +561,22 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, XLogRecPtr prepare_end_lsn, TimestampTz prepare_time) { + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; + OutputPluginUpdateProgress(ctx); + /* + * skip sending COMMIT PREPARED message if prepared transaction + * has not been sent. + */ + if (data && !data->sent_begin_txn) + { + pfree(data); + return; + } + + if (data) + pfree(data); Same comment as above for refactoring the pfree logic. ---------- 2.11 @@ -483,8 +561,22 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, XLogRecPtr prepare_end_lsn, TimestampTz prepare_time) { + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; + OutputPluginUpdateProgress(ctx); + /* + * skip sending COMMIT PREPARED message if prepared transaction + * has not been sent. + */ + if (data && !data->sent_begin_txn) + { + pfree(data); + return; + } + + if (data) + pfree(data); Is that comment correct, or is it a cut/paste error? Why does it say "COMMIT PREPARED" ? ---------- 2.12 @@ -613,6 +705,7 @@ pgoutput_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, Relation relation, ReorderBufferChange *change) { PGOutputData *data = (PGOutputData *) ctx->output_plugin_private; + PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private; MemoryContext old; I wondered if it is worth adding Assert(txndata); here also? ---------- 2.13 @@ -750,6 +852,7 @@ pgoutput_truncate(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, int nrelations, Relation relations[], ReorderBufferChange *change) { PGOutputData *data = (PGOutputData *) ctx->output_plugin_private; + PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private; MemoryContext old; I wondered if it is worth adding Assert(txndata); here also? ---------- 2.14 @@ -813,11 +925,15 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, const char *message) { PGOutputData *data = (PGOutputData *) ctx->output_plugin_private; + PGOutputTxnData *txndata; TransactionId xid = InvalidTransactionId; if (!data->messages) return; + if (txn && txn->output_plugin_private) + txndata = (PGOutputTxnData *) txn->output_plugin_private; + /* * Remember the xid for the message in streaming mode. See * pgoutput_change. 
@@ -825,6 +941,19 @@ pgoutput_message(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, if (in_streaming) xid = txn->xid; + /* output BEGIN if we haven't yet, avoid for streaming and non-transactional messages */ + if (!in_streaming && transactional) + { + txndata = (PGOutputTxnData *) txn->output_plugin_private; + if (!txndata->sent_begin_txn) + { + if (!in_prepared_txn) + pgoutput_begin(ctx, txn); + else + pgoutput_begin_prepare(ctx, txn); + } + } That code: + if (txn && txn->output_plugin_private) + txndata = (PGOutputTxnData *) txn->output_plugin_private; looked misplaced to me. Shouldn't all that be relocated inside the if block: + if (!in_streaming && transactional) And when you do that maybe the condition can be simplified because you could Assert(txn); ========== 3. File src/include/replication/pgoutput.h 3.1 @@ -30,4 +30,9 @@ typedef struct PGOutputData bool two_phase; } PGOutputData; +typedef struct PGOutputTxnData +{ + bool sent_begin_txn; /* flag indicating whether begin has been sent */ +} PGOutputTxnData; + Why is this typedef here? IIUC it is only used inside pgoutput.c, so shouldn't it be declared in that file also? ---------- 3.2 @@ -30,4 +30,9 @@ typedef struct PGOutputData bool two_phase; } PGOutputData; +typedef struct PGOutputTxnData +{ + bool sent_begin_txn; /* flag indicating whether begin has been sent */ +} PGOutputTxnData; + That is a new typedef, so maybe your patch should also update the src/tools/pgindent/typedefs.list to name this new typedef. ---------- Kind Regards, Peter Smith. Fujitsu Australia
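One possible shape for the pfree refactor suggested in comments 2.9-2.11 (the helper name is hypothetical; this is only a sketch of the suggestion, not the patch):
-------------
/*
 * Hypothetical helper: free the per-transaction state, if any, and
 * report whether the prepared-transaction message should be skipped
 * because no begin message was ever sent.
 */
static bool
pgoutput_clear_txn_data_and_skip(ReorderBufferTXN *txn)
{
    PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private;
    bool        skip = false;

    if (data)
    {
        skip = !data->sent_begin_txn;
        pfree(data);
        txn->output_plugin_private = NULL;
    }

    return skip;
}
-------------
Both pgoutput_commit_prepared_txn() and pgoutput_rollback_prepared_txn() could then start with "if (pgoutput_clear_txn_data_and_skip(txn)) return;", keeping the skip-and-free decision in one place.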
Hi Ajin >The above patch had some changes missing which resulted in some tap >tests failing. Sending an updated patchset. Keeping the patchset >version the same. Thanks for your patch. I hit a segmentation fault when using it. Please take a look at this. The steps to reproduce the problem are as follows. ------publisher------ create table test (a int primary key, b varchar); create publication pub for table test; ------subscriber------ create table test (a int primary key, b varchar); create subscription sub connection 'dbname=postgres' publication pub with(two_phase=on); Then, I prepare, commit, and rollback transactions and TRUNCATE the table in a SQL script as follows: ------------- BEGIN; INSERT INTO test SELECT i, md5(i::text) FROM generate_series(1, 10000) s(i); PREPARE TRANSACTION 't1'; COMMIT PREPARED 't1'; BEGIN; INSERT INTO test SELECT i, md5(i::text) FROM generate_series(10001, 20000) s(i); PREPARE TRANSACTION 't2'; ROLLBACK PREPARED 't2'; TRUNCATE test; ------------- To make sure the problem reproduces easily, I looped the above operations in my SQL file about 10 times; then I can reproduce it 100% and got a segmentation fault in the publisher log as follows: ------------- 2021-05-18 16:30:56.952 CST [548189] postmaster LOG: server process (PID 548222) was terminated by signal 11: Segmentation fault 2021-05-18 16:30:56.952 CST [548189] postmaster DETAIL: Failed process was running: START_REPLICATION SLOT "sub" LOGICAL 0/0 (proto_version '3', two_phase 'on', publication_names '"pub"') ------------- Here is the core dump information: ------------- #0 0x000000000090afe4 in pq_sendstring (buf=buf@entry=0x251ca80, str=0x0) at pqformat.c:199 #1 0x0000000000ab0a2b in logicalrep_write_begin_prepare (out=0x251ca80, txn=txn@entry=0x25346e8) at proto.c:124 #2 0x00007f9528842dd6 in pgoutput_begin_prepare (ctx=ctx@entry=0x2514700, txn=txn@entry=0x25346e8) at pgoutput.c:495 #3 0x00007f9528843f70 in pgoutput_truncate (ctx=0x2514700, txn=0x25346e8, nrelations=1, relations=0x262f678, change=0x25370b8) at pgoutput.c:905 #4 0x0000000000aa57cb in truncate_cb_wrapper (cache=<optimized out>, txn=<optimized out>, nrelations=<optimized out>, relations=<optimized out>, change=<optimized out>) at logical.c:1103 #5 0x0000000000abf333 in ReorderBufferApplyTruncate (streaming=false, change=0x25370b8, relations=0x262f678, nrelations=1, txn=0x25346e8, rb=0x2516710) at reorderbuffer.c:1918 #6 ReorderBufferProcessTXN (rb=rb@entry=0x2516710, txn=0x25346e8, commit_lsn=commit_lsn@entry=27517176, snapshot_now=<optimized out>, command_id=command_id@entry=0, streaming=streaming@entry=false) at reorderbuffer.c:2278 #7 0x0000000000ac0b14 in ReorderBufferReplay (txn=<optimized out>, rb=rb@entry=0x2516710, xid=xid@entry=738, commit_lsn=commit_lsn@entry=27517176, end_lsn=end_lsn@entry=27517544, commit_time=commit_time@entry=674644388404356, origin_id=0, origin_lsn=0) at reorderbuffer.c:2591 #8 0x0000000000ac1713 in ReorderBufferCommit (rb=0x2516710, xid=xid@entry=738, commit_lsn=27517176, end_lsn=27517544, commit_time=commit_time@entry=674644388404356, origin_id=origin_id@entry=0, origin_lsn=0) at reorderbuffer.c:2615 #9 0x0000000000a9f702 in DecodeCommit (ctx=ctx@entry=0x2514700, buf=buf@entry=0x7ffdd027c2b0, parsed=parsed@entry=0x7ffdd027c140, xid=xid@entry=738, two_phase=<optimized out>) at decode.c:742 #10 0x0000000000a9fc6c in DecodeXactOp (ctx=ctx@entry=0x2514700, buf=buf@entry=0x7ffdd027c2b0) at decode.c:278 #11 0x0000000000aa1b75 in LogicalDecodingProcessRecord (ctx=0x2514700, record=0x2514ac0) at decode.c:142 #12 0x0000000000af6db1 
in XLogSendLogical () at walsender.c:2876 #13 0x0000000000afb6aa in WalSndLoop (send_data=send_data@entry=0xaf6d49 <XLogSendLogical>) at walsender.c:2306 #14 0x0000000000afbdac in StartLogicalReplication (cmd=cmd@entry=0x24da288) at walsender.c:1206 #15 0x0000000000afd646 in exec_replication_command ( cmd_string=cmd_string@entry=0x2452570 "START_REPLICATION SLOT \"sub\" LOGICAL 0/0 (proto_version '3', two_phase 'on', publication_names '\"pub\"')") at walsender.c:1646 #16 0x0000000000ba3514 in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffdd027c560, dbname=<optimized out>, username=<optimized out>) at postgres.c:4482 #17 0x0000000000a7284a in BackendRun (port=port@entry=0x2477b60) at postmaster.c:4491 #18 0x0000000000a78bba in BackendStartup (port=port@entry=0x2477b60) at postmaster.c:4213 #19 0x0000000000a78ff9 in ServerLoop () at postmaster.c:1745 #20 0x0000000000a7bbdf in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x244bae0) at postmaster.c:1417 #21 0x000000000090dc80 in main (argc=3, argv=0x244bae0) at main.c:209 ------------- I noticed that it called the pgoutput_truncate function and the pgoutput_begin_prepare function. It seems odd because TRUNCATE is not in a prepared transaction in my case. I tried to debug this to learn more and found that in the pgoutput_truncate function, the value of in_prepared_txn was true. Later, it got a segmentation fault when it tried to get the gid in the logicalrep_write_begin_prepare function - it has no gid, so we got the segmentation fault. FYI: I also tested the case in synchronous mode, and it can execute successfully. So, I think the value of in_prepared_txn is sometimes incorrect in asynchronous mode. Maybe there's a better way to track this. Regards Tang
On Thu, May 13, 2021 at 3:20 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Please find attached the latest patch set v75* > Review comments for v75-0001-Add-support-for-prepared-transactions-to-built-i: =============================================================================== 1. - <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] } + <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] } Can we do some testing of the code related to this in some way? One random idea could be to change the current subscriber-side code just for testing purposes to see if this works. Can we enhance and use pg_recvlogical to test this? It is possible that if you address comment number 13 below, this can be tested with the Create Subscription command. 2. - belong to the same transaction. It also sends changes of large in-progress - transactions between a pair of Stream Start and Stream Stop messages. The - last stream of such a transaction contains Stream Commit or Stream Abort - message. + belong to the same transaction. Similarly, all messages between a pair of + Begin Prepare and Commit Prepared messages belong to the same transaction. I think here we need to write Prepare instead of Commit Prepared because Commit Prepared for a transaction can come at a later point in time and all the messages in-between won't belong to the same transaction. 3. +<!-- ==================== TWO_PHASE Messages ==================== --> + +<para> +The following messages (Begin Prepare, Prepare, Commit Prepared, Rollback Prepared) +are available since protocol version 3. +</para> I am not sure a marker like "TWO_PHASE Messages" is required here. We don't have any such marker for streaming messages. 4. +<varlistentry> +<term>Int64</term> +<listitem><para> + Timestamp of the prepare transaction. Isn't it better to write this description as "Prepare timestamp of the transaction" to match the similar description of the Commit timestamp? Also, there are similar occurrences in the patch at other places, change those as well. 5. +<term>Begin Prepare</term> +<listitem> +<para> ... +<varlistentry> +<term>Int32</term> +<listitem><para> + Xid of the subtransaction (will be same as xid of the transaction for top-level + transactions). The above description seems wrong to me. It should be Xid of the transaction as we won't receive the Xid of a subtransaction in the Begin message. The same applies to the prepare/commit prepared/rollback prepared transaction messages as well, so change that as well accordingly. 6. +<term>Byte1('P')</term> +<listitem><para> + Identifies this message as a two-phase prepare transaction message. +</para></listitem> In all the similar messages, we are using "Identifies the message as ...". I feel it is better to be consistent in this and similar messages in the patch. 7. 
+<varlistentry> + +<term>Rollback Prepared</term> +<listitem> .. +<varlistentry> +<term>Int64</term> +<listitem><para> + The LSN of the prepare. +</para></listitem> This should be the end LSN of the prepared transaction. 8. +bool +LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn, + TimestampTz origin_prepare_timestamp) .. .. + /* + * We are neither expecting the collisions of GXACTs (same gid) + * between publisher and subscribers nor the apply worker restarts + * after prepared xacts, The second part of the comment ".. nor the apply worker restarts after prepared xacts .." is no longer true after commit 8bdb1332eb[1]. So, we can remove it. 9. + /* + * Does the subscription have tables? + * + * If there were not-READY relations found then we know it does. But if + * table_state_no_ready was empty we still need to check again to see + * if there are 0 tables. + */ + has_subrels = (list_length(table_states_not_ready) > 0) || Typo in comments. /table_state_no_ready/table_state_not_ready 10. + if (!twophase) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("unrecognized subscription parameter: \"%s\"", defel->defname))); errmsg is not aligned properly. Can we make the error message clear, something like: "cannot change two_phase option" 11. @@ -69,7 +69,8 @@ parse_subscription_options(List *options, char **synchronous_commit, bool *refresh, bool *binary_given, bool *binary, - bool *streaming_given, bool *streaming) + bool *streaming_given, bool *streaming, + bool *twophase_given, bool *twophase) This function already has 14 parameters and this patch adds 2 new ones. Isn't it better to have a struct (ParseSubOptions) for these parameters? I think that might lead to some code churn but we can have that as a separate patch on top of which we can create the two_pc patch. 12. * The subscription two_phase commit implementation requires + * that replication has passed the initial table + * synchronization phase before the two_phase becomes properly + * enabled. Can we slightly modify the starting of this sentence as: "The subscription option 'two_phase' requires that ..." 13. @@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel) { Assert(slotname); - walrcv_create_slot(wrconn, slotname, false, + /* + * Even if two_phase is set, don't create the slot with + * two-phase enabled. Will enable it once all the tables are + * synced and ready. This avoids race-conditions like prepared + * transactions being skipped due to changes not being applied + * due to checks in should_apply_changes_for_rel() when + * tablesync for the corresponding tables are in progress. See + * comments atop worker.c. + */ + walrcv_create_slot(wrconn, slotname, false, false, Can't we enable two_phase if copy_data is false? Because in that case, all relations will be in a READY state. If we do that then we should also set two_phase state as 'enabled' during createsubscription. I think we need to be careful to check that connect option is given and copy_data is false before setting such a state. Now, I guess we may not be able to optimize this to not set 'enabled' state when the subscription has no rels. 14. + if (options->proto.logical.twophase && + PQserverVersion(conn->streamConn) >= 140000) + appendStringInfoString(&cmd, ", two_phase 'on'"); + We need to check 150000 here but for now, maybe we can add a comment similar to what you have added in ApplyWorkerMain to avoid forgetting this change. Probably a similar comment is required in pg_dump.c. 15. 
@@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn) /* fixed fields */ pq_sendint64(out, txn->final_lsn); - pq_sendint64(out, txn->commit_time); + pq_sendint64(out, txn->u_op_time.prepare_time); pq_sendint32(out, txn->xid); Why prepare_time here? It should be commit_time. We use prepare_time in begin_prepare not in begin. 16. +logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn, + XLogRecPtr commit_lsn) +{ + uint8 flags = 0; + + pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED); + + /* + * This should only ever happen for two-phase commit transactions. In + * which case we expect to have a valid GID. Additionally, the transaction + * must be prepared. See ReorderBufferFinishPrepared. + */ + Assert(txn->gid != NULL); + The second part of the comment ("Additionally, the transaction must be prepared") is no longer true. Also, we can combine the first two sentences here and at other places where a similar comment is used. 17. + union + { + TimestampTz commit_time; + TimestampTz prepare_time; + } u_op_time; I think it is better to name this union as xact_time or trans_time. [1] - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8bdb1332eb51837c15a10a972c179b84f654279e -- With Regards, Amit Kapila.
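To make comment 11 concrete, such a struct might look roughly like the sketch below. The last eight fields mirror the parameters visible in the quoted hunk; the struct name ParseSubOptions and the remaining fields are assumptions for illustration only:
-------------
/* Hypothetical bundling of the parse_subscription_options() outputs. */
typedef struct ParseSubOptions
{
    bool        connect;
    bool        enabled;
    bool        create_slot;
    char       *slot_name;
    bool        copy_data;
    char       *synchronous_commit;
    bool        refresh;
    bool        binary_given;
    bool        binary;
    bool        streaming_given;
    bool        streaming;
    bool        twophase_given;
    bool        twophase;
} ParseSubOptions;

/* The signature could then shrink to something like: */
extern void parse_subscription_options(List *options, ParseSubOptions *opts);
-------------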
Please find attached the latest patch set v77* Differences from v76* are: * Rebased to HEAD @ yesterday * v77* addresses most of Amit's recent feedback comments [1]; I will reply to that mail separately with the details. * The v77-0003 patch is temporarily omitted from this patch set. That will be re-added in v78* early next week. ---- [1] https://www.postgresql.org/message-id/CAA4eK1Jz64rwLyB6H7Z_SmEDouJ41KN42%3DVkVFp6JTpafJFG8Q%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Tue, May 18, 2021 at 9:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, May 13, 2021 at 3:20 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Please find attached the latest patch set v75* > > > > Review comments for v75-0001-Add-support-for-prepared-transactions-to-built-i: > =============================================================================== > 1. > - <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable > class="parameter">slot_name</replaceable> [ > <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ > <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> > <replaceable class="parameter">output_plugin</replaceable> [ > <literal>EXPORT_SNAPSHOT</literal> | > <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> > ] } > + <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable > class="parameter">slot_name</replaceable> [ > <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { > <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | > <literal>LOGICAL</literal> <replaceable > class="parameter">output_plugin</replaceable> [ > <literal>EXPORT_SNAPSHOT</literal> | > <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> > ] } > > > Can we do some testing of the code related to this in some way? One > random idea could be to change the current subscriber-side code just > for testing purposes to see if this works. Can we enhance and use > pg_recvlogical to test this? It is possible that if you address > comment number 13 below, this can be tested with Create Subscription > command. > TODO > 2. > - belong to the same transaction. It also sends changes of large in-progress > - transactions between a pair of Stream Start and Stream Stop messages. The > - last stream of such a transaction contains Stream Commit or Stream Abort > - message. > + belong to the same transaction. Similarly, all messages between a pair of > + Begin Prepare and Commit Prepared messages belong to the same transaction. > > I think here we need to write Prepare instead of Commit Prepared > because Commit Prepared for a transaction can come at a later point of > time and all the messages in-between won't belong to the same > transaction. > Fixed in v77-0001 > 3. > +<!-- ==================== TWO_PHASE Messages ==================== --> > + > +<para> > +The following messages (Begin Prepare, Prepare, Commit Prepared, > Rollback Prepared) > +are available since protocol version 3. > +</para> > > I am not sure here marker like "TWO_PHASE Messages" is required. We > don't have any such marker for streaming messages. > Fixed in v77-0001 > 4. > +<varlistentry> > +<term>Int64</term> > +<listitem><para> > + Timestamp of the prepare transaction. > > Isn't it better to write this description as "Prepare timestamp of the > transaction" to match with the similar description of Commit > timestamp. Also, there are similar occurances in the patch at other > places, change those as well. > Fixed in v77-0001, v77-0002 > 5. > +<term>Begin Prepare</term> > +<listitem> > +<para> > ... > +<varlistentry> > +<term>Int32</term> > +<listitem><para> > + Xid of the subtransaction (will be same as xid of the > transaction for top-level > + transactions). > > The above description seems wrong to me. It should be Xid of the > transaction as we won't receive Xid of subtransaction in Begin > message. The same applies to the prepare/commit prepared/rollback > prepared transaction messages as well, so change that as well > accordingly. 
> Fixed in v77-0001, v77-0002 > 6. > +<term>Byte1('P')</term> > +<listitem><para> > + Identifies this message as a two-phase prepare > transaction message. > +</para></listitem> > > In all the similar messages, we are using "Identifies the message as > ...". I feel it is better to be consistent in this and similar > messages in the patch. > Fixed in v77-0001, v77-0002 > 7. > +<varlistentry> > + > +<term>Rollback Prepared</term> > +<listitem> > .. > +<varlistentry> > +<term>Int64</term> > +<listitem><para> > + The LSN of the prepare. > +</para></listitem> > > This should be end LSN of the prepared transaction. > Fixed in v77-0001 > 8. > +bool > +LookupGXact(const char *gid, XLogRecPtr prepare_end_lsn, > + TimestampTz origin_prepare_timestamp) > .. > .. > + /* > + * We are neither expecting the collisions of GXACTs (same gid) > + * between publisher and subscribers nor the apply worker restarts > + * after prepared xacts, > > The second part of the comment ".. nor the apply worker restarts after > prepared xacts .." is no longer true after commit 8bdb1332eb[1]. So, > we can remove it. > Fixed in v77-0001 > 9. > + /* > + * Does the subscription have tables? > + * > + * If there were not-READY relations found then we know it does. But if > + * table_state_no_ready was empty we still need to check again to see > + * if there are 0 tables. > + */ > + has_subrels = (list_length(table_states_not_ready) > 0) || > > Typo in comments. /table_state_no_ready/table_state_not_ready > Fixed in v77-0001 > 10. > + if (!twophase) > + ereport(ERROR, > + (errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("unrecognized subscription parameter: \"%s\"", defel->defname))); > > errmsg is not aligned properly. Can we make the error message clear, > something like: "cannot change two_phase option" > Fixed in v77-0001. I fixed the alignment, but did not modify the message text. This message was already changed in v74 to make it more consistent with similar errors. Please see Vignesh feedback [1] comment #1. > 11. > @@ -69,7 +69,8 @@ parse_subscription_options(List *options, > char **synchronous_commit, > bool *refresh, > bool *binary_given, bool *binary, > - bool *streaming_given, bool *streaming) > + bool *streaming_given, bool *streaming, > + bool *twophase_given, bool *twophase) > > This function already has 14 parameters and this patch adds 2 new > ones. Isn't it better to have a struct (ParseSubOptions) for these > parameters? I think that might lead to some code churn but we can have > that as a separate patch on top of which we can create two_pc patch. > This same modification is already being addressed in another thread [2]. So we do nothing in this patch for now, but certainly this area needs to be rebased later after the other patch is pushed. > 12. > * The subscription two_phase commit implementation requires > + * that replication has passed the initial table > + * synchronization phase before the two_phase becomes properly > + * enabled. > > Can we slightly modify the starting of this sentence as:"The > subscription option 'two_phase' requires that ..." > Fixed in v77-0001 > 13. > @@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, > bool isTopLevel) > { > Assert(slotname); > > - walrcv_create_slot(wrconn, slotname, false, > + /* > + * Even if two_phase is set, don't create the slot with > + * two-phase enabled. Will enable it once all the tables are > + * synced and ready. 
This avoids race-conditions like prepared > + * transactions being skipped due to changes not being applied > + * due to checks in should_apply_changes_for_rel() when > + * tablesync for the corresponding tables are in progress. See > + * comments atop worker.c. > + */ > + walrcv_create_slot(wrconn, slotname, false, false, > > Can't we enable two_phase if copy_data is false? Because in that case, > all relations will be in a READY state. If we do that then we should > also set two_phase state as 'enabled' during createsubscription. I > think we need to be careful to check that connect option is given and > copy_data is false before setting such a state. Now, I guess we may > not be able to optimize this to not set 'enabled' state when the > subscription has no rels. > Fixed in v77-0001 > 14. > + if (options->proto.logical.twophase && > + PQserverVersion(conn->streamConn) >= 140000) > + appendStringInfoString(&cmd, ", two_phase 'on'"); > + > > We need to check 150000 here but for now, maybe we can add a comment > similar to what you have added in ApplyWorkerMain to avoid forgetting > this change. Probably a similar comment is required pg_dump.c. > Fixed in v77-0001 > 15. > @@ -49,7 +49,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn) > > /* fixed fields */ > pq_sendint64(out, txn->final_lsn); > - pq_sendint64(out, txn->commit_time); > + pq_sendint64(out, txn->u_op_time.prepare_time); > pq_sendint32(out, txn->xid); > > Why here prepare_time? It should be commit_time. We use prepare_time > in begin_prepare not in begin. > Fixed in v77-0001 > 16. > +logicalrep_write_commit_prepared(StringInfo out, ReorderBufferTXN *txn, > + XLogRecPtr commit_lsn) > +{ > + uint8 flags = 0; > + > + pq_sendbyte(out, LOGICAL_REP_MSG_COMMIT_PREPARED); > + > + /* > + * This should only ever happen for two-phase commit transactions. In > + * which case we expect to have a valid GID. Additionally, the transaction > + * must be prepared. See ReorderBufferFinishPrepared. > + */ > + Assert(txn->gid != NULL); > + > > The second part of the comment ("Additionally, the transaction must be > prepared) is no longer true. Also, we can combine the first two > sentences here and at other places where a similar comment is used. > Fixed in v77-0001, v77-0002 > 17. > + union > + { > + TimestampTz commit_time; > + TimestampTz prepare_time; > + } u_op_time; > > I think it is better to name this union as xact_time or trans_time. > Fixed in v77-0001, v77-0002 -------- [1] = https://www.postgresql.org/message-id/CALDaNm0u%3DQGwd7jDAj-4u%3D7vvPn5rarFjBMCgfiJbDte55CWAA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CALj2ACWEjphPsfpyX9M%2BRdqmoRwRbWVKMoW7Tx1o%2Bh%2BoNEs4pQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
On Fri, May 21, 2021 at 6:43 PM Peter Smith <smithpb2250@gmail.com> wrote: > Fixed in v77-0001, v77-0002 Attaching a new patch-set that rebases the patch, addresses review comments from Peter, and fixes a test failure reported by Tang. I've also added some new test cases, authored by Tang, into patch-2. I've addressed the following comments: On Tue, May 18, 2021 at 6:53 PM Peter Smith <smithpb2250@gmail.com> wrote: > > 1. File: doc/src/sgml/logicaldecoding.sgml > > 1.1 > > @@ -862,11 +862,19 @@ typedef void (*LogicalDecodePrepareCB) (struct > LogicalDecodingContext *ctx, > The required <function>commit_prepared_cb</function> callback is called > whenever a transaction <command>COMMIT PREPARED</command> has > been decoded. > The <parameter>gid</parameter> field, which is part of the > - <parameter>txn</parameter> parameter, can be used in this callback. > + <parameter>txn</parameter> parameter, can be used in this callback. The > + parameters <parameter>prepare_end_lsn</parameter> and > + <parameter>prepare_time</parameter> can be used to check if the plugin > + has received this <command>PREPARE TRANSACTION</command> in which case > + it can apply the rollback, otherwise, it can skip the rollback > operation. The > + <parameter>gid</parameter> alone is not sufficient because the downstream > + node can have a prepared transaction with same identifier. > > This is in the commit prepared section, but that new text is referring > to "it can apply to the rollback" etc. > Is this deliberate text, or maybe cut/paste error? > > ========== Fixed. > > 2. File: src/backend/replication/pgoutput/pgoutput.c > > 2.1 > > @@ -76,6 +78,7 @@ static void > pgoutput_stream_prepare_txn(LogicalDecodingContext *ctx, > > static bool publications_valid; > static bool in_streaming; > +static bool in_prepared_txn; > > Wondering why this is a module static flag. That makes it looks like > it somehow applies globally to all the functions in this scope, but > really I think this is just a txn property, right? > - e.g. why not use another member of the private TXN data instead? or > - e.g. why not use rbtxn_prepared(txn) macro? > > ---------- I've removed this flag and used the rbtxn_prepared(txn) macro. This also seems to fix the crash reported by Tang, where it was trying to send a "BEGIN PREPARE" as part of a non-prepared tx. I've changed the logic to rely on the prepared flag in the txn to decide if BEGIN needs to be sent or BEGIN PREPARE needs to be sent. 
> > 2.4 > > @@ -422,8 +450,14 @@ static void > pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > XLogRecPtr commit_lsn) > { > + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; > + > OutputPluginUpdateProgress(ctx); > > I wondered is it worthwhile to add Assert(data); here also? > > ---------- Added. > > 2.5 > @@ -422,8 +450,14 @@ static void > pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > XLogRecPtr commit_lsn) > { > + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; > + > OutputPluginUpdateProgress(ctx); > > + /* skip COMMIT message if nothing was sent */ > + if (!data->sent_begin_txn) > + return; > > Shouldn't this code also be freeing that allocated data? I think you > do free it in similar functions later in this patch. > > ---------- Modified this. > > 2.6 > > @@ -435,10 +469,31 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > static void > pgoutput_begin_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn) > { > + PGOutputTxnData *data = MemoryContextAllocZero(ctx->context, > + sizeof(PGOutputTxnData)); > + > + /* > + * Don't send BEGIN message here. Instead, postpone it until the first > + * change. In logical replication, a common scenario is to replicate a set > + * of tables (instead of all tables) and transactions whose changes were on > + * table(s) that are not published will produce empty transactions. These > + * empty transactions will send BEGIN and COMMIT messages to subscribers, > + * using bandwidth on something with little/no use for logical replication. > + */ > + data->sent_begin_txn = false; > + txn->output_plugin_private = data; > + in_prepared_txn = true; > +} > > Apart from setting the in_prepared_txn = true; this is all identical > code to pgoutput_begin_txn so you could consider just delegating to > call that other function to save all the cut/paste data allocation and > big comment. Or maybe this way is better - I am not sure. > > ---------- Updated this. > > 2.7 > > +static void > +pgoutput_begin_prepare(LogicalDecodingContext *ctx, ReorderBufferTXN *txn) > +{ > bool send_replication_origin = txn->origin_id != InvalidRepOriginId; > + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; > > OutputPluginPrepareWrite(ctx, !send_replication_origin); > logicalrep_write_begin_prepare(ctx->out, txn); > + data->sent_begin_txn = true; > > I wondered is it worth adding Assert(data); here also? > > ---------- Added Assert. > > 2.8 > > @@ -453,11 +508,18 @@ static void > pgoutput_prepare_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, > XLogRecPtr prepare_lsn) > { > + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; > + > OutputPluginUpdateProgress(ctx); > > I wondered is it worth adding Assert(data); here also? > > ---------- Added. > > 2.9 > > @@ -465,12 +527,28 @@ pgoutput_prepare_txn(LogicalDecodingContext > *ctx, ReorderBufferTXN *txn, > */ > static void > pgoutput_commit_prepared_txn(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > - XLogRecPtr commit_lsn) > + XLogRecPtr commit_lsn, XLogRecPtr prepare_end_lsn, > + TimestampTz prepare_time) > { > + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; > + > OutputPluginUpdateProgress(ctx); > > + /* > + * skip sending COMMIT PREPARED message if prepared transaction > + * has not been sent. 
> + */ > + if (data && !data->sent_begin_txn) > + { > + pfree(data); > + return; > + } > + > + if (data) > + pfree(data); > OutputPluginPrepareWrite(ctx, true); > > I think this pfree logic might be refactored more simply to just be > done in one place. e.g. like: > > if (data) > { > bool skip = !data->sent_begin_txn; > pfree(data); > if (skip) > return; > } > > BTW, is it even possible to get in this function with NULL private > data? Perhaps that should be an Assert(data) ? > > ---------- Changed the logic as per your suggestion, but did not add the Assert because you can come into this function with NULL private data; this can happen as the commit prepared for the transaction can come after a restart of the walsender, and the previously set-up private data is lost. This is only applicable for commit prepared and rollback prepared. > > 2.10 > > @@ -483,8 +561,22 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, > XLogRecPtr prepare_end_lsn, > TimestampTz prepare_time) > { > + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; > + > OutputPluginUpdateProgress(ctx); > > + /* > + * skip sending COMMIT PREPARED message if prepared transaction > + * has not been sent. > + */ > + if (data && !data->sent_begin_txn) > + { > + pfree(data); > + return; > + } > + > + if (data) > + pfree(data); > > Same comment as above for refactoring the pfree logic. > > ---------- Refactored. > > 2.11 > > @@ -483,8 +561,22 @@ pgoutput_rollback_prepared_txn(LogicalDecodingContext *ctx, > XLogRecPtr prepare_end_lsn, > TimestampTz prepare_time) > { > + PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private; > + > OutputPluginUpdateProgress(ctx); > > + /* > + * skip sending COMMIT PREPARED message if prepared transaction > + * has not been sent. > + */ > + if (data && !data->sent_begin_txn) > + { > + pfree(data); > + return; > + } > + > + if (data) > + pfree(data); > > Is that comment correct or cut/paste error? Why does it say "COMMIT PREPARED" ? > > ---------- Fixed. > > 2.12 > > @@ -613,6 +705,7 @@ pgoutput_change(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > Relation relation, ReorderBufferChange *change) > { > PGOutputData *data = (PGOutputData *) ctx->output_plugin_private; > + PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private; > MemoryContext old; > > I wondered is it worth adding Assert(txndata); here also? > > ---------- Added. 
> @@ -825,6 +941,19 @@ pgoutput_message(LogicalDecodingContext *ctx, > ReorderBufferTXN *txn, > if (in_streaming) > xid = txn->xid; > > + /* output BEGIN if we haven't yet, avoid for streaming and > non-transactional messages */ > + if (!in_streaming && transactional) > + { > + txndata = (PGOutputTxnData *) txn->output_plugin_private; > + if (!txndata->sent_begin_txn) > + { > + if (!in_prepared_txn) > + pgoutput_begin(ctx, txn); > + else > + pgoutput_begin_prepare(ctx, txn); > + } > + } > > That code: > + if (txn && txn->output_plugin_private) > + txndata = (PGOutputTxnData *) txn->output_plugin_private; > looked misplaced to me. > > Shouldn't all that be relocated to be put inside the if block: > + if (!in_streaming && transactional) > > And when you do that maybe the condition can be simplified because you could > Assert(txn); > > ========== Removed that redundant code but cannot add Assert here as in streaming and transactional messages, there will be no output_plugin_private. > > 3. File src/include/replication/pgoutput.h > > 3.1 > > @@ -30,4 +30,9 @@ typedef struct PGOutputData > bool two_phase; > } PGOutputData; > > +typedef struct PGOutputTxnData > +{ > + bool sent_begin_txn; /* flag indicating whether begin has been sent */ > +} PGOutputTxnData; > + > > Why is this typedef here? IIUC it is only used inside the pgoutput.c, > so shouldn't it be declared in that file also? > > ---------- Changed this accordingly. > > 3.2 > > @@ -30,4 +30,9 @@ typedef struct PGOutputData > bool two_phase; > } PGOutputData; > > +typedef struct PGOutputTxnData > +{ > + bool sent_begin_txn; /* flag indicating whether begin has been sent */ > +} PGOutputTxnData; > + > > That is a new typedef so maybe your patch also should update the > src/tools/pgindent/typedefs.list to name this new typedef. > > ---------- Added. regards, Ajin Cherian Fujitsu Australia
Attachment
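A minimal sketch of the reworked first-change dispatch described above, with the module-level flag replaced by the per-transaction rbtxn_prepared() test (the helper name here is hypothetical; pgoutput_begin and pgoutput_begin_prepare are the patch's functions as quoted earlier):
-------------
/*
 * Sketch: decide which begin message to send from the transaction
 * itself, not from a module-level static that can go stale across
 * transactions or walsender restarts.
 */
static void
pgoutput_send_begin_if_needed(LogicalDecodingContext *ctx,
                              ReorderBufferTXN *txn)
{
    PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;

    Assert(txndata);

    if (txndata->sent_begin_txn)
        return;

    if (rbtxn_prepared(txn))
        pgoutput_begin_prepare(ctx, txn);   /* two-phase: BEGIN PREPARE */
    else
        pgoutput_begin(ctx, txn);           /* plain: BEGIN */
}
-------------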
> > 13. > > @@ -507,7 +558,16 @@ CreateSubscription(CreateSubscriptionStmt *stmt, > > bool isTopLevel) > > { > > Assert(slotname); > > > > - walrcv_create_slot(wrconn, slotname, false, > > + /* > > + * Even if two_phase is set, don't create the slot with > > + * two-phase enabled. Will enable it once all the tables are > > + * synced and ready. This avoids race-conditions like prepared > > + * transactions being skipped due to changes not being applied > > + * due to checks in should_apply_changes_for_rel() when > > + * tablesync for the corresponding tables are in progress. See > > + * comments atop worker.c. > > + */ > > + walrcv_create_slot(wrconn, slotname, false, false, > > > > Can't we enable two_phase if copy_data is false? Because in that case, > > all relations will be in a READY state. If we do that then we should > > also set two_phase state as 'enabled' during createsubscription. I > > think we need to be careful to check that connect option is given and > > copy_data is false before setting such a state. Now, I guess we may > > not be able to optimize this to not set 'enabled' state when the > > subscription has no rels. > > > > Fixed in v77-0001 I noticed this modification in v77-0001 and executed "CREATE SUBSCRIPTION ... WITH (two_phase = on, copy_data = false)", but it crashed. ------------- postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres' PUBLICATION pub WITH(two_phase = on, copy_data = false); WARNING: relcache reference leak: relation "pg_subscription" not closed WARNING: snapshot 0x34278d0 still active NOTICE: created replication slot "sub" on publisher server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !?> ------------- There are two warnings and a segmentation fault in the subscriber log: ------------- 2021-05-24 15:08:32.435 CST [2848572] WARNING: relcache reference leak: relation "pg_subscription" not closed 2021-05-24 15:08:32.435 CST [2848572] WARNING: snapshot 0x32ce8b0 still active 2021-05-24 15:08:33.012 CST [2848555] LOG: server process (PID 2848572) was terminated by signal 11: Segmentation fault 2021-05-24 15:08:33.012 CST [2848555] DETAIL: Failed process was running: CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres' PUBLICATION pub WITH(two_phase = on, copy_data = false); ------------- The backtrace of the segmentation fault is attached. It happened in the table_close function; we got it because "CurrentResourceOwner" was NULL. I think it was related to the first warning, which reported "relcache reference leak". The backtrace information is attached, too. When updating the two-phase state in the CreateSubscription function, it released the resource owner and set "CurrentResourceOwner" to NULL in the CommitTransaction function. The second warning about "snapshot still active" also happened in the CommitTransaction function. It called the AtEOXact_Snapshot function, checked leftover snapshots, and reported the warning. I debugged and found the snapshot was added in the PortalRunUtility function by calling the PushActiveSnapshot function; the address of "ActiveSnapshot" at this time was the same as the address in the warning. In summary, when creating a subscription with two_phase = on and copy_data = false, it calls the UpdateTwoPhaseState function in the CreateSubscription function to set the two_phase state to 'enabled', and it checked and released the relcache entry and snapshot too early, so the crash happened. I think some change should be made to avoid it. Thoughts? 
FYI, I also tested the newly released v78* at [1]; the above problem still exists. [1] https://www.postgresql.org/message-id/CAFPTHDab56twVmC%2B0a%3DRNcRw4KuyFdqzW0JAcvJdS63n_fRnOQ%40mail.gmail.com Regards Tang
Attachment
On Tue, May 25, 2021 at 8:54 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Fri, May 21, 2021 at 6:43 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > Fixed in v77-0001, v77-0002 > > Attaching a new patch-set that rebases the patch, addresses review > comments from Peter as well as a test failure reported by Tang. I've > also added some new test case into patch-2 authored by Tang. Thanks for the updated patch, a few comments: 1) Should "The end LSN of the prepare." be changed to "end LSN of the prepare transaction."? --- a/doc/src/sgml/protocol.sgml +++ b/doc/src/sgml/protocol.sgml @@ -7538,6 +7538,13 @@ are available since protocol version 3. <varlistentry> <term>Int64</term> <listitem><para> + The end LSN of the prepare. +</para></listitem> +</varlistentry> +<varlistentry> + +<term>Int64</term> +<listitem><para> 2) Should the ";" be "," here? +++ b/doc/src/sgml/catalogs.sgml @@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l <row> <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>subtwophasestate</structfield> <type>char</type> + </para> + <para> + State code: + <literal>d</literal> = two_phase mode was not requested, so is disabled; + <literal>p</literal> = two_phase mode was requested, but is pending enablement; + <literal>e</literal> = two_phase mode was requested, and is enabled. + </para></entry> 3) Should end_lsn be commit_end_lsn? + prepare_data->commit_end_lsn = pq_getmsgint64(in); + if (prepare_data->commit_end_lsn == InvalidXLogRecPtr) elog(ERROR, "end_lsn is not set in commit prepared message"); + prepare_data->prepare_time = pq_getmsgint64(in); 4) This change is not required diff --git a/src/include/replication/pgoutput.h b/src/include/replication/pgoutput.h index 0dc460f..93c6731 100644 --- a/src/include/replication/pgoutput.h +++ b/src/include/replication/pgoutput.h @@ -29,5 +29,4 @@ typedef struct PGOutputData bool messages; bool two_phase; } PGOutputData; - #endif /* PGOUTPUT_H */ 5) Will the worker receive commit prepared/rollback prepared as we have skip logic to skip commit prepared / commit rollback in pgoutput_rollback_prepared_txn and pgoutput_commit_prepared_txn: + * It is possible that we haven't received the prepare because + * the transaction did not have any changes relevant to this + * subscription and was essentially an empty prepare. In which case, + * the walsender is optimized to drop the empty transaction and the + * accompanying prepare. Silently ignore if we don't find the prepared + * transaction. */ - replorigin_session_origin_lsn = prepare_data.end_lsn; - replorigin_session_origin_timestamp = prepare_data.commit_time; + if (LookupGXact(gid, prepare_data.prepare_end_lsn, + prepare_data.prepare_time)) + { 6) I'm not sure if we could add some tests for skipping empty prepare transactions; if possible, add a few tests. 7) We could add some debug-level log messages for the transactions that will be skipped. Regards, Vignesh
On Tue, May 25, 2021 at 4:41 PM tanghy.fnst@fujitsu.com <tanghy.fnst@fujitsu.com> wrote: > > Fixed in v77-0001 > > I noticed this modification in v77-0001 and executed "CREATE SUBSCRIPTION ... WITH (two_phase = on, copy_data = false)",but it crashed. > ------------- > postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres' PUBLICATION pub WITH(two_phase = on, copy_data = false); > WARNING: relcache reference leak: relation "pg_subscription" not closed > WARNING: snapshot 0x34278d0 still active > NOTICE: created replication slot "sub" on publisher > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. > !?> > ------------- > > There are two warnings and a segmentation fault in subscriber log: > ------------- > 2021-05-24 15:08:32.435 CST [2848572] WARNING: relcache reference leak: relation "pg_subscription" not closed > 2021-05-24 15:08:32.435 CST [2848572] WARNING: snapshot 0x32ce8b0 still active > 2021-05-24 15:08:33.012 CST [2848555] LOG: server process (PID 2848572) was terminated by signal 11: Segmentation fault > 2021-05-24 15:08:33.012 CST [2848555] DETAIL: Failed process was running: CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres'PUBLICATION pub WITH(two_phase = on, copy_data = false); > ------------- > Hi Tang, I've attached a patch that fixes this issue. Do test and confirm. regards, Ajin Cherian Fujitsu Australia
Attachment
On Wed, May 26, 2021 10:13 PM Ajin Cherian <itsajin@gmail.com> wrote: > > I've attached a patch that fixes this issue. Do test and confirm. > Thanks for your patch. I have tested and confirmed that the issue I reported has been fixed. Regards Tang
On Thu, May 27, 2021 at 11:20 AM tanghy.fnst@fujitsu.com <tanghy.fnst@fujitsu.com> wrote: > > On Wed, May 26, 2021 10:13 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > I've attached a patch that fixes this issue. Do test and confirm. > > > > Thanks for your patch. > I have tested and confirmed that the issue I reported has been fixed. Thanks for the confirmation. The problem seemed to be, as you reported, a table not being closed when a transaction was committed. This seems to be because the function UpdateTwoPhaseState was committing a transaction inside the function while the caller of UpdateTwoPhaseState had a table open in CreateSubscription. This function was newly included in the CreateSubscription code, to handle the new use case of two_phase being enabled on create subscription if "copy_data = false". I don't think CreateSubscription required this to be inside a transaction; the committing of the transaction was only meant for the place this function was originally created for, the apply worker code (ApplyWorkerMain()). So, I removed the committing of the transaction from inside the function UpdateTwoPhaseState(), and instead the transaction is started and committed before and after this function is invoked in the apply worker code. regards, Ajin Cherian Fujitsu Australia
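In other words, the calling pattern becomes something like the sketch below (the state-constant name is an assumption for illustration; the patch tracks the state as a char in pg_subscription):
-------------
/*
 * Sketch of the fix: transaction control lives at the call site, so
 * CreateSubscription() can call UpdateTwoPhaseState() from within its
 * own transaction without that transaction being committed early.
 * This is the apply-worker call site; the constant name is assumed.
 */
StartTransactionCommand();
UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED);
CommitTransactionCommand();
-------------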
On Wed, May 26, 2021 at 6:53 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, May 25, 2021 at 8:54 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Fri, May 21, 2021 at 6:43 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > > Fixed in v77-0001, v77-0002
> >
> > Attaching a new patch-set that rebases the patch, addresses review
> > comments from Peter as well as a test failure reported by Tang. I've
> > also added some new test case into patch-2 authored by Tang.
>
> Thanks for the updated patch, few comments:
> 1) Should "The end LSN of the prepare." be changed to "end LSN of the
> prepare transaction."?

No, this is the end LSN of the prepare. The prepare consists of multiple LSNs.

> 2) Should the ";" be "," here?
> +++ b/doc/src/sgml/catalogs.sgml
> @@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable><iteration
> count></replaceable>:<replaceable>&l
>
> <row>
> <entry role="catalog_table_entry"><para role="column_definition">
> + <structfield>subtwophasestate</structfield> <type>char</type>
> + </para>
> + <para>
> + State code:
> + <literal>d</literal> = two_phase mode was not requested, so is disabled;
> + <literal>p</literal> = two_phase mode was requested, but is
> pending enablement;
> + <literal>e</literal> = two_phase mode was requested, and is enabled.
> + </para></entry>

No, I think the ";" is correct here; it connects multiple parts of the sentence.

>
> 3) Should end_lsn be commit_end_lsn?
> + prepare_data->commit_end_lsn = pq_getmsgint64(in);
> + if (prepare_data->commit_end_lsn == InvalidXLogRecPtr)
> elog(ERROR, "end_lsn is not set in commit prepared message");
> + prepare_data->prepare_time = pq_getmsgint64(in);

Changed this.

>
> 4) This change is not required
>
> diff --git a/src/include/replication/pgoutput.h
> b/src/include/replication/pgoutput.h
> index 0dc460f..93c6731 100644
> --- a/src/include/replication/pgoutput.h
> +++ b/src/include/replication/pgoutput.h
> @@ -29,5 +29,4 @@ typedef struct PGOutputData
> bool messages;
> bool two_phase;
> } PGOutputData;
> -

Removed.

> #endif /* PGOUTPUT_H */
>
> 5) Will the worker receive commit prepared/rollback prepared as we
> have skip logic to skip commit prepared / commit rollback in
> pgoutput_rollback_prepared_txn and pgoutput_commit_prepared_txn:
>
> + * It is possible that we haven't received the prepare because
> + * the transaction did not have any changes relevant to this
> + * subscription and was essentially an empty prepare. In which case,
> + * the walsender is optimized to drop the empty transaction and the
> + * accompanying prepare. Silently ignore if we don't find the prepared
> + * transaction.
> */
> - replorigin_session_origin_lsn = prepare_data.end_lsn;
> - replorigin_session_origin_timestamp = prepare_data.commit_time;
> + if (LookupGXact(gid, prepare_data.prepare_end_lsn,
> + prepare_data.prepare_time))
> + {
>

Commit prepared will be skipped if it happens within the same walsender's lifetime, but if the walsender restarts it no longer knows about the skipped prepare. In that case the walsender will not skip the commit prepared; hence the logic in the apply worker for handling a stray commit prepared.

>
> 6) I'm not sure if we could add some tests for skip empty prepare
> transactions, if possible add few tests.

I've added a test case using pg_logical_slot_peek_binary_changes() for empty prepares; have a look.

>
> 7) We could add some debug level log messages for the transaction that
> will be skipped.

If this is for the test, I was able to add a test without debug messages.
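To make (5) concrete, the apply worker's commit-prepared handling ends up looking roughly like this (a sketch assembled from the hunk quoted above; prepare_data.prepare_end_lsn and prepare_data.prepare_time come from the patch, while the other field names here are assumptions, not final code):

    /* In apply_handle_commit_prepared(): */
    TwoPhaseTransactionGid(MySubscription->oid, prepare_data.xid,
                           gid, sizeof(gid));

    /*
     * The prepare may have been skipped as empty by a (possibly
     * restarted) walsender, so only commit the prepared transaction
     * if it actually exists locally.
     */
    if (LookupGXact(gid, prepare_data.prepare_end_lsn,
                    prepare_data.prepare_time))
    {
        /* ... set replication origin state, then ... */
        FinishPreparedTransaction(gid, true);   /* COMMIT PREPARED */
    }
    /* else: silently ignore the stray COMMIT PREPARED */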
regards, Ajin Cherian Fujitsu Australia
Attachment
On Fri, May 28, 2021 at 1:44 PM Ajin Cherian <itsajin@gmail.com> wrote: > Sorry, please ignore the previous patch-set. I attached the wrong files. Here's the correct patch-set. regards, Ajin Cherian Fujitsu Australia
Attachment
On Fri, May 28, 2021 at 9:14 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Wed, May 26, 2021 at 6:53 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Tue, May 25, 2021 at 8:54 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > On Fri, May 21, 2021 at 6:43 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > > Fixed in v77-0001, v77-0002 > > > > > > Attaching a new patch-set that rebases the patch, addresses review > > > comments from Peter as well as a test failure reported by Tang. I've > > > also added some new test case into patch-2 authored by Tang. > > > > Thanks for the updated patch, few comments: > > 1) Should "The end LSN of the prepare." be changed to "end LSN of the > > prepare transaction."? > > No, this is the end LSN of the prepare. The prepare consists of multiple LSNs. > > > 2) Should the ";" be "," here? > > +++ b/doc/src/sgml/catalogs.sgml > > @@ -7639,6 +7639,18 @@ SCRAM-SHA-256$<replaceable><iteration > > count></replaceable>:<replaceable>&l > > > > <row> > > <entry role="catalog_table_entry"><para role="column_definition"> > > + <structfield>subtwophasestate</structfield> <type>char</type> > > + </para> > > + <para> > > + State code: > > + <literal>d</literal> = two_phase mode was not requested, so is disabled; > > + <literal>p</literal> = two_phase mode was requested, but is > > pending enablement; > > + <literal>e</literal> = two_phase mode was requested, and is enabled. > > + </para></entry> > > no, I think the ";" is correct here, connecting multiple parts of the sentence. > > > > > 3) Should end_lsn be commit_end_lsn? > > + prepare_data->commit_end_lsn = pq_getmsgint64(in); > > + if (prepare_data->commit_end_lsn == InvalidXLogRecPtr) > > elog(ERROR, "end_lsn is not set in commit prepared message"); > > + prepare_data->prepare_time = pq_getmsgint64(in); > > Changed this. > > > > > 4) This change is not required > > > > diff --git a/src/include/replication/pgoutput.h > > b/src/include/replication/pgoutput.h > > index 0dc460f..93c6731 100644 > > --- a/src/include/replication/pgoutput.h > > +++ b/src/include/replication/pgoutput.h > > @@ -29,5 +29,4 @@ typedef struct PGOutputData > > bool messages; > > bool two_phase; > > } PGOutputData; > > - > > removed. > > > > #endif /* PGOUTPUT_H */ > > > > 5) Will the worker receive commit prepared/rollback prepared as we > > have skip logic to skip commit prepared / commit rollback in > > pgoutput_rollback_prepared_txn and pgoutput_commit_prepared_txn: > > > > + * It is possible that we haven't received the prepare because > > + * the transaction did not have any changes relevant to this > > + * subscription and was essentially an empty prepare. In which case, > > + * the walsender is optimized to drop the empty transaction and the > > + * accompanying prepare. Silently ignore if we don't find the prepared > > + * transaction. > > */ > > - replorigin_session_origin_lsn = prepare_data.end_lsn; > > - replorigin_session_origin_timestamp = prepare_data.commit_time; > > + if (LookupGXact(gid, prepare_data.prepare_end_lsn, > > + prepare_data.prepare_time)) > > + { > > > Commit prepared will be skipped if it happens in the same walsender's > lifetime. But if the walsender restarts it no longer > knows about the skipped prepare. In this case walsender will not skip > the commit prepared. Hence, the logic for handling > stray commit prepared in the apply worker. > > > > 6) I'm not sure if we could add some tests for skip empty prepare > > transactions, if possible add few tests. 
> > I've added a test case using pg_logical_slot_peek_binary_changes() for
> empty prepares
> have a look.
>
> >
> > 7) We could add some debug level log messages for the transaction that
> will be skipped.
>
> If this is for the test, I was able to add a test without debug messages.

The idea here is to include debug logs that will help in analyzing bugs reported from environments where debug access might not be available.

Thanks for fixing the comments and posting an updated patch.

Regards, Vignesh
On Thu, May 27, 2021 at 8:05 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Thanks for confirmation. The problem seemed to be as you reported a
> table not closed when a transaction was committed.
> This seems to be because the function UpdateTwoPhaseState was
> committing a transaction inside the function when the caller of
> UpdateTwoPhaseState had
> a table open in CreateSubscription. This function was newly included
> in the CreateSubscription code, to handle the new use case of
> two_phase being enabled on
> create subscription if "copy_data = false". I don't think
> CreateSubscription required this to be inside a transaction and the
> committing of transaction
> was only meant for where this function was originally created to be
> used in the apply worker code (ApplyWorkerMain()).
> So, I removed the committing of the transaction from inside the
> function UpdateTwoPhaseState() and instead started and committed the
> transaction
> prior to and after this function is invoked in the apply worker code.
>

You have made these changes in 0002 whereas they should be part of 0001.

One minor comment for 0001.
* Special case: if when tables were specified but copy_data is
+ * false then it is safe to enable two_phase up-front because
+ * those tables are already initially READY state. Note, if
+ * the subscription has no tables then enablement cannot be
+ * done here - we must leave the twophase state as PENDING, to
+ * allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.

Can we slightly modify this comment as: "Note that if tables were specified but copy_data is false then it is safe to enable two_phase up-front because those tables are already initially READY state. When the subscription has no tables, we leave the twophase state as PENDING, to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work."

Also, I don't see any test exercising this special case after you enable it. Is it covered by existing tests? If not, let's try to add a test for this.
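For reference, with that wording the special case would read roughly like this (a sketch; the enclosing condition and variable names are illustrative, not the exact patch code):

    /*
     * Note that if tables were specified but copy_data is false then
     * it is safe to enable two_phase up-front because those tables are
     * already initially READY state.  When the subscription has no
     * tables, we leave the twophase state as PENDING, to allow ALTER
     * SUBSCRIPTION ... REFRESH PUBLICATION to work.
     */
    if (twophase_enabled && !copy_data && tables != NIL)
        UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED);

-- With Regards, Amit Kapila.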
On Fri, May 28, 2021 at 4:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, May 27, 2021 at 8:05 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > Thanks for confirmation. The problem seemed to be as you reported a > > table not closed when a transaction was committed. > > This seems to be because the function UpdateTwoPhaseState was > > committing a transaction inside the function when the caller of > > UpdateTwoPhaseState had > > a table open in CreateSubscription. This function was newly included > > in the CreateSubscription code, to handle the new use case of > > two_phase being enabled on > > create subscription if "copy_data = false". I don't think > > CreateSubscription required this to be inside a transaction and the > > committing of transaction > > was only meant for where this function was originally created to be > > used in the apply worker code (ApplyWorkerMain()). > > So, I removed the committing of the transaction from inside the > > function UpdateTwoPhaseState() and instead started and committed the > > transaction > > prior to and after this function is invoked in the apply worker code. > > > > You have made these changes in 0002 whereas they should be part of 0001. > > One minor comment for 0001. > * Special case: if when tables were specified but copy_data is > + * false then it is safe to enable two_phase up-front because > + * those tables are already initially READY state. Note, if > + * the subscription has no tables then enablement cannot be > + * done here - we must leave the twophase state as PENDING, to > + * allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work. > > Can we slightly modify this comment as: "Note that if tables were > specified but copy_data is false then it is safe to enable two_phase > up-front because those tables are already initially READY state. When > the subscription has no tables, we leave the twophase state as > PENDING, to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work." > Created v81 - rebased to head and I have corrected the patch-set such that the fix as well as Tang's test cases are now part of patch-1. Also added this above minor comment update. regards, Ajin Cherian Fujitsu Australia
Attachment
On Fri, May 28, 2021 at 11:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > One minor comment for 0001. > * Special case: if when tables were specified but copy_data is > + * false then it is safe to enable two_phase up-front because > + * those tables are already initially READY state. Note, if > + * the subscription has no tables then enablement cannot be > + * done here - we must leave the twophase state as PENDING, to > + * allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work. > > Can we slightly modify this comment as: "Note that if tables were > specified but copy_data is false then it is safe to enable two_phase > up-front because those tables are already initially READY state. When > the subscription has no tables, we leave the twophase state as > PENDING, to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work." > > Also, I don't see any test after you enable this special case. Is it > covered by existing tests, if not then let's try to add a test for > this? > I see that Ajin's latest patch has addressed the other comments except for the above test case suggestion. I have again reviewed the first patch and have some comments. Comments on v81-0001-Add-support-for-prepared-transactions-to-built-i ============================================================================ 1. <para> The logical replication solution that builds distributed two phase commit using this feature can deadlock if the prepared transaction has locked - [user] catalog tables exclusively. They need to inform users to not have - locks on catalog tables (via explicit <command>LOCK</command> command) in - such transactions. + [user] catalog tables exclusively. To avoid this users must refrain from + having locks on catalog tables (via explicit <command>LOCK</command> command) + in such transactions. </para> This change doesn't belong to this patch. I see the proposed text could be considered as an improvement but still we can do this separately. We are already trying to improve things in this regard in the thread [1], so you can propose this change there. 2. +<varlistentry> +<term>Byte1('K')</term> +<listitem><para> + Identifies the message as the commit of a two-phase transaction message. +</para></listitem> +</varlistentry> + +<varlistentry> +<term>Int8</term> +<listitem><para> + Flags; currently unused (must be 0). +</para></listitem> +</varlistentry> + +<varlistentry> +<term>Int64</term> +<listitem><para> + The LSN of the commit. +</para></listitem> +</varlistentry> + +<varlistentry> +<term>Int64</term> +<listitem><para> + The end LSN of the commit transaction. +</para></listitem> +</varlistentry> Can we change the description of LSN's as "The LSN of the commit prepared." and "The end LSN of the commit prepared transaction." respectively? This will make their description different from regular commit and I think that defines them better. 3. +<varlistentry> +<term>Int64</term> +<listitem><para> + The end LSN of the rollback transaction. +</para></listitem> +</varlistentry> Similar to above, can we change the description here as: "The end LSN of the rollback prepared transaction."? 4. + * The exception to this restriction is when copy_data = + * false, because when copy_data is false the tablesync will + * start already in READY state and will exit directly without + * doing anything which could interfere with the apply + * worker's message handling. + * + * For more details see comments atop worker.c. 
+ */
+ if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data)
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when two_phase is enabled"),
+ errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false"
+ ", or use DROP/CREATE SUBSCRIPTION.")));

The above comment is a bit unclear because it seems you are saying there is some problem even when copy_data is false. Are you missing 'not' after 'could' in the comment?

5.
XXX Now, this can even lead to a deadlock if the prepare
 * transaction is waiting to get it logically replicated for
- * distributed 2PC. Currently, we don't have an in-core
- * implementation of prepares for distributed 2PC but some
- * out-of-core logical replication solution can have such an
- * implementation. They need to inform users to not have locks
- * on catalog tables in such transactions.
+ * distributed 2PC. This can be avoided by disallowing to
+ * prepare transactions that have locked [user] catalog tables
+ * exclusively.

Can we slightly modify this part of the comment as: "This can be avoided by disallowing to prepare transactions that have locked [user] catalog tables exclusively but as of now we ask users not to do such operation"?

6.
+AllTablesyncsReady(void)
+{
+ bool found_busy = false;
+ bool started_tx = false;
+ bool has_subrels = false;
+
+ /* We need up-to-date sync state info for subscription tables here. */
+ has_subrels = FetchTableStates(&started_tx);
+
+ found_busy = list_length(table_states_not_ready) > 0;
+
+ if (started_tx)
+ {
+ CommitTransactionCommand();
+ pgstat_report_stat(false);
+ }
+
+ /*
+ * When there are no tables, then return false.
+ * When no tablesyncs are busy, then all are READY
+ */
+ return has_subrels && !found_busy;
+}

Do we really need the found_busy variable in the above function? Can't we change the return to (has_subrels) && (table_states_not_ready == NIL)? If so, then also change the comments above the return.

7.
+/*
+ * Common code to fetch the up-to-date sync state info into the static lists.
+ *
+ * Returns true if subscription has 1 or more tables, else false.
+ */
+static bool
+FetchTableStates(bool *started_tx)

Can we update the comments to indicate that if this function starts the transaction then the caller is responsible for committing it?

8.
(errmsg("logical replication apply worker for subscription \"%s\" will restart so two_phase can be enabled",
+ MySubscription->name)));

Can we slightly change the message as: "logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled"?

9.
+void
+UpdateTwoPhaseState(Oid suboid, char new_state)
{
..
+ /* And update/set two_phase ENABLED */
+ values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state);
+ replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
..
}

The above comment seems wrong to me as we are updating the state as passed by the caller.
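For (6), the simplified version might look like this (a sketch, assuming the same helper names; not committed code):

    bool
    AllTablesyncsReady(void)
    {
        bool        started_tx = false;
        bool        has_subrels;

        /* We need up-to-date sync state info for subscription tables here. */
        has_subrels = FetchTableStates(&started_tx);

        if (started_tx)
        {
            CommitTransactionCommand();
            pgstat_report_stat(false);
        }

        /*
         * Return false when the subscription has no tables; otherwise all
         * tablesyncs are ready only when none of them is in a not-READY
         * state.
         */
        return has_subrels && (table_states_not_ready == NIL);
    }

[1] - https://www.postgresql.org/message-id/20210222222847.tpnb6eg3yiykzpky%40alap3.anarazel.de

-- With Regards, Amit Kapila.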
Please find attached the latest patch set v82* Differences from v81* are: * Rebased to HEAD @ yesterday * v82 addresses all of Amit's feedback comments from [1]; I will reply to that mail separately with any details. ---- [1] https://www.postgresql.org/message-id/CAA4eK1Jd9sqWtt5kEJZL1ehJB2y_DFnvDjY9vJ51k8Wq6XWVyw%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Mon, May 31, 2021 at 9:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, May 28, 2021 at 11:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > One minor comment for 0001. > > * Special case: if when tables were specified but copy_data is > > + * false then it is safe to enable two_phase up-front because > > + * those tables are already initially READY state. Note, if > > + * the subscription has no tables then enablement cannot be > > + * done here - we must leave the twophase state as PENDING, to > > + * allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work. > > > > Can we slightly modify this comment as: "Note that if tables were > > specified but copy_data is false then it is safe to enable two_phase > > up-front because those tables are already initially READY state. When > > the subscription has no tables, we leave the twophase state as > > PENDING, to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work." > > > > Also, I don't see any test after you enable this special case. Is it > > covered by existing tests, if not then let's try to add a test for > > this? > > > > I see that Ajin's latest patch has addressed the other comments except > for the above test case suggestion. Yes, this is a known pending task. > I have again reviewed the first > patch and have some comments. > > Comments on v81-0001-Add-support-for-prepared-transactions-to-built-i > ============================================================================ > 1. > <para> > The logical replication solution that builds distributed two > phase commit > using this feature can deadlock if the prepared transaction has locked > - [user] catalog tables exclusively. They need to inform users to not have > - locks on catalog tables (via explicit <command>LOCK</command> > command) in > - such transactions. > + [user] catalog tables exclusively. To avoid this users must refrain from > + having locks on catalog tables (via explicit > <command>LOCK</command> command) > + in such transactions. > </para> > > This change doesn't belong to this patch. I see the proposed text > could be considered as an improvement but still we can do this > separately. We are already trying to improve things in this regard in > the thread [1], so you can propose this change there. > OK. This change has been removed in v82, and a patch posted to other thread here [1] > 2. > +<varlistentry> > +<term>Byte1('K')</term> > +<listitem><para> > + Identifies the message as the commit of a two-phase > transaction message. > +</para></listitem> > +</varlistentry> > + > +<varlistentry> > +<term>Int8</term> > +<listitem><para> > + Flags; currently unused (must be 0). > +</para></listitem> > +</varlistentry> > + > +<varlistentry> > +<term>Int64</term> > +<listitem><para> > + The LSN of the commit. > +</para></listitem> > +</varlistentry> > + > +<varlistentry> > +<term>Int64</term> > +<listitem><para> > + The end LSN of the commit transaction. > +</para></listitem> > +</varlistentry> > > Can we change the description of LSN's as "The LSN of the commit > prepared." and "The end LSN of the commit prepared transaction." > respectively? This will make their description different from regular > commit and I think that defines them better. > > 3. > +<varlistentry> > +<term>Int64</term> > +<listitem><para> > + The end LSN of the rollback transaction. > +</para></listitem> > +</varlistentry> > > Similar to above, can we change the description here as: "The end LSN > of the rollback prepared transaction."? > > 4. 
> + * The exception to this restriction is when copy_data = > + * false, because when copy_data is false the tablesync will > + * start already in READY state and will exit directly without > + * doing anything which could interfere with the apply > + * worker's message handling. > + * > + * For more details see comments atop worker.c. > + */ > + if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED && copy_data) > + ereport(ERROR, > + (errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed > when two_phase is enabled"), > + errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false" > + ", or use DROP/CREATE SUBSCRIPTION."))); > > The above comment is a bit unclear because it seems you are saying > there is some problem even when copy_data is false. Are you missing > 'not' after 'could' in the comment? > > 5. > XXX Now, this can even lead to a deadlock if the prepare > * transaction is waiting to get it logically replicated for > - * distributed 2PC. Currently, we don't have an in-core > - * implementation of prepares for distributed 2PC but some > - * out-of-core logical replication solution can have such an > - * implementation. They need to inform users to not have locks > - * on catalog tables in such transactions. > + * distributed 2PC. This can be avoided by disallowing to > + * prepare transactions that have locked [user] catalog tables > + * exclusively. > > Can we slightly modify this part of the comment as: "This can be > avoided by disallowing to prepare transactions that have locked [user] > catalog tables exclusively but as of now we ask users not to do such > operation"? > > 6. > +AllTablesyncsReady(void) > +{ > + bool found_busy = false; > + bool started_tx = false; > + bool has_subrels = false; > + > + /* We need up-to-date sync state info for subscription tables here. */ > + has_subrels = FetchTableStates(&started_tx); > + > + found_busy = list_length(table_states_not_ready) > 0; > + > + if (started_tx) > + { > + CommitTransactionCommand(); > + pgstat_report_stat(false); > + } > + > + /* > + * When there are no tables, then return false. > + * When no tablesyncs are busy, then all are READY > + */ > + return has_subrels && !found_busy; > +} > > Do we really need found_busy variable in above function. Can't we > change the return as (has_subrels) && (table_states_not_ready != NIL)? > If so, then change the comments above return. > > 7. > +/* > + * Common code to fetch the up-to-date sync state info into the static lists. > + * > + * Returns true if subscription has 1 or more tables, else false. > + */ > +static bool > +FetchTableStates(bool *started_tx) > > Can we update comments indicating that if this function starts the > transaction then the caller is responsible to commit it? > > 8. > (errmsg("logical replication apply worker for subscription \"%s\" will > restart so two_phase can be enabled", > + MySubscription->name))); > > Can we slightly change the message as: "logical replication apply > worker for subscription \"%s\" will restart so that two_phase can be > enabled"? > > 9. > +void > +UpdateTwoPhaseState(Oid suboid, char new_state) > { > .. > + /* And update/set two_phase ENABLED */ > + values[Anum_pg_subscription_subtwophasestate - 1] = CharGetDatum(new_state); > + replaces[Anum_pg_subscription_subtwophasestate - 1] = true; > .. > } > > The above comment seems wrong to me as we are updating the state as > passed by the caller. 
>

All of the above reported issues (2-9) are addressed in the latest 2PC patch set v82.

------
[1] https://www.postgresql.org/message-id/CAHut%2BPuTjTp_WERO%3D3Ybp8snTgDpiZeNaxzZhN8ky8XMo4KFVQ%40mail.gmail.com

Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Jun 2, 2021 at 4:34 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please find attached the latest patch set v82*
>

A few comments on 0001:
====================
1.
+ /*
+ * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+ * called within the PrepareTransactionBlock below.
+ */
+ BeginTransactionBlock();
+ CommitTransactionCommand();
+
+ /*
+ * Update origin state so we can restart streaming from correct position
+ * in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data.end_lsn;
+ replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+ PrepareTransactionBlock(gid);
+ CommitTransactionCommand();

Here, calling CommitTransactionCommand() twice looks a bit odd. Before the first call, can we write a comment like "This is to complete the Begin command started by the previous call"?

2.
@@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext
 bool streaming;

 /*
- * Does the output plugin support two-phase decoding, and is it enabled?
+ * Does the output plugin support two-phase decoding.
 */
 bool twophase;

 /*
+ * Is two-phase option given by output plugin?
+ */
+ bool twophase_opt_given;
+
+ /*
 * State for writing output.

I think we can write a few comments as to why we need a separate twophase parameter here. The description of twophase_opt_given can be changed to: "Is two-phase option given by output plugin? This is to allow output plugins to enable two_phase at the start of streaming. We can't rely on twophase parameter that tells whether the plugin provides all the necessary two_phase APIs for this purpose." Feel free to add more to it.

3.
@@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
 MemoryContextSwitchTo(old_context);

 /*
- * We allow decoding of prepared transactions iff the two_phase option is
- * enabled at the time of slot creation.
+ * We allow decoding of prepared transactions when the two_phase is
+ * enabled at the time of slot creation, or when the two_phase option is
+ * given at the streaming start.
 */
- ctx->twophase &= MyReplicationSlot->data.two_phase;
+ ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
+
+ /* Mark slot to allow two_phase decoding if not already marked */
+ if (ctx->twophase && !slot->data.two_phase)
+ {
+ slot->data.two_phase = true;
+ ReplicationSlotMarkDirty();
+ ReplicationSlotSave();
+ }

Why do we need to change this during CreateInitDecodingContext, which is called at create_slot time? At that time, we don't need to consider any options and there is no need to toggle the slot's two_phase value.

4.
- /* Binary mode and streaming are only supported in v14 and higher */
+ /*
+ * Binary, streaming, and two_phase are only supported in v14 and
+ * higher
+ */

We can say v15 for two_phase.

5.
-#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
+#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3
+#define LOGICALREP_PROTO_MAX_VERSION_NUM 3

Isn't it better to define LOGICALREP_PROTO_MAX_VERSION_NUM as LOGICALREP_PROTO_TWOPHASE_VERSION_NUM instead of specifying the number directly?

6.
+/* Commit (and abort) information */
typedef struct LogicalRepCommitData
{
 XLogRecPtr commit_lsn;
@@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
 TimestampTz committime;
} LogicalRepCommitData;

Is there a reason for the above comment addition? If so, how is it related to this patch?

7.
+++ b/src/test/subscription/t/021_twophase.pl
@@ -0,0 +1,299 @@
+# logical replication of 2PC test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;

In the nearby test files, we have a Copyright notice like "# Copyright (c) 2021, PostgreSQL Global Development Group". We should add one to the new test files in this patch as well.

8.
+# Also wait for two-phase to be enabled
+my $twophase_query =
+ "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN ('e');";
+$node_subscriber->poll_query_until('postgres', $twophase_query)
+ or die "Timed out while waiting for subscriber to enable twophase";

Isn't it better to write this query as: "SELECT count(1) = 1 FROM pg_subscription WHERE subtwophasestate ='e';"? It looks a bit odd to use the NOT IN operator here. Similarly, change the same query used at another place in the patch.

9.
+# check that transaction is in prepared state on subscriber
+my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(1), 'transaction is prepared on subscriber');
+
+# Wait for the statistics to be updated
+$node_publisher->poll_query_until(
+ 'postgres', qq[
+ SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
+ WHERE slot_name = 'tap_sub'
+ AND total_txns > 0 AND total_bytes > 0;
+]) or die "Timed out while waiting for statistics to be updated";

I don't see the need to check for stats in this test. If we really want to test stats, then we can add a separate test in contrib/test_decoding/sql/stats, but I suggest leaving it out. Please do the same for the other stats tests in the patch.

10. I think you missed updating LogicalRepRollbackPreparedTxnData in typedefs.list.
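For (1), the sequence with the suggested comments would read like this (a sketch of the hunk quoted above with the comments added; no new logic):

    /*
     * BeginTransactionBlock() is necessary to balance the
     * EndTransactionBlock() called within PrepareTransactionBlock()
     * below.
     */
    BeginTransactionBlock();
    CommitTransactionCommand(); /* completes the Begin command above */

    /*
     * Update origin state so we can restart streaming from the correct
     * position in case of a crash.
     */
    replorigin_session_origin_lsn = prepare_data.end_lsn;
    replorigin_session_origin_timestamp = prepare_data.prepare_time;

    PrepareTransactionBlock(gid);
    CommitTransactionCommand(); /* completes the Prepare command */

-- With Regards, Amit Kapila.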
On Wed, Jun 2, 2021 at 9:04 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Please find attached the latest patch set v82* > Some suggested changes to the 0001 patch comments (and note also the typo "doumentation"): diff of before and after follows: 8c8 < built-in logical replication, we need to do the below things: --- > built-in logical replication, we need to do the following things: 16,17c16,17 < * Add a new SUBSCRIPTION option "two_phase" to allow users to enable it. < We enable the two_phase once the initial data sync is over. --- > * Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase > transactions. We enable the two_phase once the initial data sync is over. 23c23 < * Adds new subscription TAP tests, and new subscription.sql regression tests. --- > * Add new subscription TAP tests, and new subscription.sql regression tests. 25c25 < * Updates PG doumentation. --- > * Update PG documentation. 33c33 < * Prepare API for in-progress transactions is not supported. --- > * Prepare API for in-progress transactions. Regards, Greg Nancarrow Fujitsu Australia
Please find attached the latest patch set v83* Differences from v82* are: * Rebased to HEAD @ yesterday. This was necessary because some recent HEAD pushes broke the v82. * Adds a 2PC copy_data=false test case for [1]; * Addresses most of Amit's recent feedback comments from [2]; I will reply to that mail separately with the details. * Addresses Greg's feedback [3] about the patch 0001 commit comment ---- [1] https://www.postgresql.org/message-id/CAA4eK1K7qhqigORdEgqFTOPfj4r2%2BjV-uLc4-RCtgyDZwvbF8w%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAA4eK1%2B8L8h9qUQ6sS48EY0osfN7zs%3DZPqR6sE4eQxFhgwBxRw%40mail.gmail.com [3] https://www.postgresql.org/message-id/CAJcOf-cvn4EpSo4cD_9Awop72roKL1vnMtpURn1FnXv%2BgX5VPA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Thu, Jun 3, 2021 at 7:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jun 2, 2021 at 4:34 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Please find attached the latest patch set v82* > > > > Few comments on 0001: > ==================== > 1. > + /* > + * BeginTransactionBlock is necessary to balance the EndTransactionBlock > + * called within the PrepareTransactionBlock below. > + */ > + BeginTransactionBlock(); > + CommitTransactionCommand(); > + > + /* > + * Update origin state so we can restart streaming from correct position > + * in case of crash. > + */ > + replorigin_session_origin_lsn = prepare_data.end_lsn; > + replorigin_session_origin_timestamp = prepare_data.prepare_time; > + > + PrepareTransactionBlock(gid); > + CommitTransactionCommand(); > > Here, the call to CommitTransactionCommand() twice looks a bit odd. > Before the first call, can we write a comment like "This is to > complete the Begin command started by the previous call"? > Fixed in v83-0001 and v83-0002 > 2. > @@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext > bool streaming; > > /* > - * Does the output plugin support two-phase decoding, and is it enabled? > + * Does the output plugin support two-phase decoding. > */ > bool twophase; > > /* > + * Is two-phase option given by output plugin? > + */ > + bool twophase_opt_given; > + > + /* > * State for writing output. > > I think we can write few comments as to why we need a separate > twophase parameter here? The description of twophase_opt_given can be > changed to: "Is two-phase option given by output plugin? This is to > allow output plugins to enable two_phase at the start of streaming. We > can't rely on twophase parameter that tells whether the plugin > provides all the necessary two_phase APIs for this purpose." Feel free > to add more to it. > TODO > 3. > @@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin, > MemoryContextSwitchTo(old_context); > > /* > - * We allow decoding of prepared transactions iff the two_phase option is > - * enabled at the time of slot creation. > + * We allow decoding of prepared transactions when the two_phase is > + * enabled at the time of slot creation, or when the two_phase option is > + * given at the streaming start. > */ > - ctx->twophase &= MyReplicationSlot->data.two_phase; > + ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase); > + > + /* Mark slot to allow two_phase decoding if not already marked */ > + if (ctx->twophase && !slot->data.two_phase) > + { > + slot->data.two_phase = true; > + ReplicationSlotMarkDirty(); > + ReplicationSlotSave(); > + } > > Why do we need to change this during CreateInitDecodingContext which > is called at create_slot time? At that time, we don't need to consider > any options and there is no need to toggle slot's two_phase value. > > TODO > 4. > - /* Binary mode and streaming are only supported in v14 and higher */ > + /* > + * Binary, streaming, and two_phase are only supported in v14 and > + * higher > + */ > > We can say v15 for two_phase. > Fixed in v83-0001 > 5. > -#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > +#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM 3 > +#define LOGICALREP_PROTO_MAX_VERSION_NUM 3 > > Isn't it better to define LOGICALREP_PROTO_MAX_VERSION_NUM as > LOGICALREP_PROTO_TWOPHASE_VERSION_NUM instead of specifying directly > the number? > Fixed in v83-0001 > 6. 
> +/* Commit (and abort) information */
> typedef struct LogicalRepCommitData
> {
> XLogRecPtr commit_lsn;
> @@ -122,6 +132,48 @@ typedef struct LogicalRepCommitData
> TimestampTz committime;
> } LogicalRepCommitData;
>
> Is there a reason for the above comment addition? If so, how is it
> related to this patch?
>

The LogicalRepCommitData struct is used by the 0002 patch, and during implementation it was not clear what this struct was, so I added the missing comment (all other nearby typedefs except this one were commented). But it is not strictly related to anything in patch 0001, so I have moved this change into the v83-0002 patch.

> 7.
> +++ b/src/test/subscription/t/021_twophase.pl
> @@ -0,0 +1,299 @@
> +# logical replication of 2PC test
> +use strict;
> +use warnings;
> +use PostgresNode;
> +use TestLib;
>
> In the nearby test files, we have Copyright notice like "# Copyright
> (c) 2021, PostgreSQL Global Development Group". We should add one to
> the new test files in this patch as well.
>

Fixed in v83-0001 and v83-0002

> 8.
> +# Also wait for two-phase to be enabled
> +my $twophase_query =
> + "SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT
> IN ('e');";
> +$node_subscriber->poll_query_until('postgres', $twophase_query)
> + or die "Timed out while waiting for subscriber to enable twophase";
>
> Isn't it better to write this query as: "SELECT count(1) = 1 FROM
> pg_subscription WHERE subtwophasestate ='e';"? It looks a bit odd to
> use the NOT IN operator here. Similarly, change the same query used at
> another place in the patch.
>

Not changed. This way keeps all the test parts more independent of each other, doesn't it? E.g. without NOT, if there were other subscriptions in the same test file, then the expected result for 'e' might be 1 or 2 or 3 or whatever. Using NOT means you don't have to worry about any other test part. I think we have been bitten by similar state checks before, which is why it was written like this.

> 9.
> +# check that transaction is in prepared state on subscriber
> +my $result = $node_subscriber->safe_psql('postgres', "SELECT count(*)
> FROM pg_prepared_xacts;");
> +is($result, qq(1), 'transaction is prepared on subscriber');
> +
> +# Wait for the statistics to be updated
> +$node_publisher->poll_query_until(
> + 'postgres', qq[
> + SELECT count(slot_name) >= 1 FROM pg_stat_replication_slots
> + WHERE slot_name = 'tap_sub'
> + AND total_txns > 0 AND total_bytes > 0;
> +]) or die "Timed out while waiting for statistics to be updated";
>
> I don't see the need to check for stats in this test. If we really
> want to test stats then we can add a separate test in
> contrib\test_decoding\sql\stats but I suggest leaving it. Please do
> the same for other stats tests in the patch.
>

Removed statistics tests from v83-0001 and v83-0002

> 10. I think you missed to update LogicalRepRollbackPreparedTxnData in
> typedefs.list.
>

Fixed in v83-0001.

------
Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Jun 8, 2021 at 4:12 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Please find attached the latest patch set v83* > Some feedback for the v83 patch set: v83-0001: (1) doc/src/sgml/protocol.sgml (i) Remove extra space: BEFORE: + The transaction will be decoded and transmitted at AFTER: + The transaction will be decoded and transmitted at (ii) BEFORE: + contains Stream Commit or Stream Abort message. AFTER: + contains a Stream Commit or Stream Abort message. (iii) BEFORE: + The LSN of the commit prepared. AFTER: + The LSN of the commit prepared transaction. (iv) Should documentation say "prepared transaction" as opposed to "prepare transaction" ??? BEFORE: + The end LSN of the prepare transaction. AFTER: + The end LSN of the prepared transaction. (2) doc/src/sgml/ref/create_subscription.sgml (i) BEFORE: + The <literal>streaming</literal> option cannot be used along with + <literal>two_phase</literal> option. AFTER: + The <literal>streaming</literal> option cannot be used with the + <literal>two_phase</literal> option. (3) doc/src/sgml/ref/create_subscription.sgml (i) BEFORE: + prepared on publisher is decoded as normal transaction at commit. AFTER: + prepared on the publisher is decoded as a normal transaction at commit. (ii) BEFORE: + The <literal>two_phase</literal> option cannot be used along with + <literal>streaming</literal> option. AFTER: + The <literal>two_phase</literal> option cannot be used with the + <literal>streaming</literal> option. (4) src/backend/access/transam/twophase.c (i) BEFORE: + * Check if the prepared transaction with the given GID, lsn and timestamp + * is around. AFTER: + * Check if the prepared transaction with the given GID, lsn and timestamp + * exists. (5) src/backend/access/transam/twophase.c Question: Is: + * do this optimization if we encounter many collisions in GID meant to be: + * do this optimization if we encounter any collisions in GID ??? (6) src/backend/replication/logical/decode.c Grammar: BEFORE: + * distributed 2PC. This can be avoided by disallowing to + * prepare transactions that have locked [user] catalog tables + * exclusively but as of now we ask users not to do such + * operation. AFTER: + * distributed 2PC. This can be avoided by disallowing + * prepared transactions that have locked [user] catalog tables + * exclusively but as of now we ask users not to do such an + * operation. (7) src/backend/replication/logical/logical.c From the comment above it, it's not clear if the "&=" in the following line is intentional: + ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase); Also, the boolean conditions tested are in the reverse order of what is mentioned in that comment. Based on the comment, I would expect the following code: + ctx->twophase = (slot->data.two_phase || ctx->twophase_opt_given); Please check it, and maybe update the comment if "&=" is really intended. There are TWO places where this same code is used. (8) src/backend/replication/logical/tablesync.c In the following code, "has_subrels" should be a bool, not an int. +static bool +FetchTableStates(bool *started_tx) +{ + static int has_subrels = false; (9) src/backend/replication/logical/worker.c Mixed current/past tense: BEFORE: + * was still busy (see the condition of should_apply_changes_for_rel). The AFTER: + * is still busy (see the condition of should_apply_changes_for_rel). 
The

(10) 2 places:

BEFORE:
+ /* there is no transaction when COMMIT PREPARED is called */
AFTER:
+ /* There is no transaction when COMMIT PREPARED is called */

v83-0002:

(1) doc/src/sgml/protocol.sgml

BEFORE:
+ contains Stream Prepare or Stream Commit or Stream Abort message.
AFTER:
+ contains a Stream Prepare or Stream Commit or Stream Abort message.

v83-0003:

(1) src/backend/replication/pgoutput/pgoutput.c

(i) In pgoutput_commit_txn(), the following code that pfree()s a pointer in a struct, without then NULLing it out, seems dangerous to me (because what is to stop other code, either now or in the future, from subsequently referencing that freed data or perhaps trying to pfree() again?):

+ PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private;
+ bool skip;
+
+ Assert(data);
+ skip = !data->sent_begin_txn;
+ pfree(data);

I suggest adding the following line of code after the pfree():

+ txn->output_plugin_private = NULL;

(ii) In pgoutput_commit_prepared_txn(), there's the same type of code:

+ if (data)
+ {
+ bool skip = !data->sent_begin_txn;
+ pfree(data);
+ if (skip)
+ return;
+ }

I suggest adding the following line after the pfree() above:

+ txn->output_plugin_private = NULL;

(iii) Again, same thing in pgoutput_rollback_prepared_txn(): I suggest adding the following line after the pfree() above:

+ txn->output_plugin_private = NULL;
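Putting (i)-(iii) together, the suggested pattern in each of those three callbacks would be (a sketch using the names from the hunks above):

    PGOutputTxnData *data = (PGOutputTxnData *) txn->output_plugin_private;
    bool        skip;

    Assert(data);
    skip = !data->sent_begin_txn;
    pfree(data);
    txn->output_plugin_private = NULL; /* guard against use-after-free and double pfree */

    if (skip)
        return;

Regards, Greg Nancarrow Fujitsu Australia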
On Tue, Jun 8, 2021 at 4:19 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Thu, Jun 3, 2021 at 7:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jun 2, 2021 at 4:34 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > Please find attached the latest patch set v82* > > > Attaching patchset-v84 that addresses some of Amit's and Vignesh's comments: This patch-set also modifies the test case added for copy_data = false to check that two-phase transactions are decoded correctly. > > 2. > > @@ -85,11 +85,16 @@ typedef struct LogicalDecodingContext > > bool streaming; > > > > /* > > - * Does the output plugin support two-phase decoding, and is it enabled? > > + * Does the output plugin support two-phase decoding. > > */ > > bool twophase; > > > > /* > > + * Is two-phase option given by output plugin? > > + */ > > + bool twophase_opt_given; > > + > > + /* > > * State for writing output. > > > > I think we can write few comments as to why we need a separate > > twophase parameter here? The description of twophase_opt_given can be > > changed to: "Is two-phase option given by output plugin? This is to > > allow output plugins to enable two_phase at the start of streaming. We > > can't rely on twophase parameter that tells whether the plugin > > provides all the necessary two_phase APIs for this purpose." Feel free > > to add more to it. > > > > TODO Added comments here. > > 3. > > @@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin, > > MemoryContextSwitchTo(old_context); > > > > /* > > - * We allow decoding of prepared transactions iff the two_phase option is > > - * enabled at the time of slot creation. > > + * We allow decoding of prepared transactions when the two_phase is > > + * enabled at the time of slot creation, or when the two_phase option is > > + * given at the streaming start. > > */ > > - ctx->twophase &= MyReplicationSlot->data.two_phase; > > + ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase); > > + > > + /* Mark slot to allow two_phase decoding if not already marked */ > > + if (ctx->twophase && !slot->data.two_phase) > > + { > > + slot->data.two_phase = true; > > + ReplicationSlotMarkDirty(); > > + ReplicationSlotSave(); > > + } > > > > Why do we need to change this during CreateInitDecodingContext which > > is called at create_slot time? At that time, we don't need to consider > > any options and there is no need to toggle slot's two_phase value. > > > > > > TODO As part of the recent changes, we do turn on two_phase at create_slot time when the subscription is created with (copy_data = false, two_phase = on). So, this code is required. Amit: "1. 
- <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }
+ <term><literal>CREATE_REPLICATION_SLOT</literal> <replaceable class="parameter">slot_name</replaceable> [ <literal>TEMPORARY</literal> ] [ <literal>TWO_PHASE</literal> ] { <literal>PHYSICAL</literal> [ <literal>RESERVE_WAL</literal> ] | <literal>LOGICAL</literal> <replaceable class="parameter">output_plugin</replaceable> [ <literal>EXPORT_SNAPSHOT</literal> | <literal>NOEXPORT_SNAPSHOT</literal> | <literal>USE_SNAPSHOT</literal> ] }

Can we do some testing of the code related to this in some way? One random idea could be to change the current subscriber-side code just for testing purposes to see if this works. Can we enhance and use pg_recvlogical to test this? It is possible that if you address comment number 13 below, this can be tested with Create Subscription command."

Actually, this is tested by the test case added for CREATE SUBSCRIPTION with (copy_data = false), because in that case the slot is created with the two-phase option.

Vignesh's comment: "We could add some debug level log messages for the transaction that will be skipped."

Updated debug messages.

regards, Ajin Cherian Fujitsu Australia
Attachment
On Wed, Jun 9, 2021 at 10:34 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Tue, Jun 8, 2021 at 4:19 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > > 3.
> > > @@ -432,10 +432,19 @@ CreateInitDecodingContext(const char *plugin,
> > > MemoryContextSwitchTo(old_context);
> > >
> > > /*
> > > - * We allow decoding of prepared transactions iff the two_phase option is
> > > - * enabled at the time of slot creation.
> > > + * We allow decoding of prepared transactions when the two_phase is
> > > + * enabled at the time of slot creation, or when the two_phase option is
> > > + * given at the streaming start.
> > > */
> > > - ctx->twophase &= MyReplicationSlot->data.two_phase;
> > > + ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase);
> > > +
> > > + /* Mark slot to allow two_phase decoding if not already marked */
> > > + if (ctx->twophase && !slot->data.two_phase)
> > > + {
> > > + slot->data.two_phase = true;
> > > + ReplicationSlotMarkDirty();
> > > + ReplicationSlotSave();
> > > + }
> > >
> > > Why do we need to change this during CreateInitDecodingContext which
> > > is called at create_slot time? At that time, we don't need to consider
> > > any options and there is no need to toggle slot's two_phase value.
> > >
> > >
> >
> > TODO
>
> As part of the recent changes, we do turn on two_phase at create_slot time when
> the subscription is created with (copy_data = false, two_phase = on).
> So, this code is required.
>

But in that case, won't we deal with it via the value passed in CreateReplicationSlotCmd? It should be enabled after we call ReplicationSlotCreate.
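That is, something along these lines in the walsender's CREATE_REPLICATION_SLOT handling (a sketch; it assumes the two_phase flag has already been parsed from the command options):

    /* In CreateReplicationSlot(), for the logical case: */
    ReplicationSlotCreate(cmd->slotname, true,
                          cmd->temporary ? RS_TEMPORARY : RS_EPHEMERAL,
                          two_phase);

-- With Regards, Amit Kapila.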
On Wed, Jun 9, 2021 at 9:58 AM Greg Nancarrow <gregn4422@gmail.com> wrote: > > (5) src/backend/access/transam/twophase.c > > Question: > > Is: > > + * do this optimization if we encounter many collisions in GID > > meant to be: > > + * do this optimization if we encounter any collisions in GID > No, it should be fine if there are very few collisions. -- With Regards, Amit Kapila.
Please find attached the latest patch set v85*

Differences from v84* are:

* Rebased to HEAD @ 10/June.
* Addresses all of Greg's feedback comments [1], except:
 - Skipped (1).iii. I think this line in the documentation is OK as-is.
 - Skipped (5). Amit wrote [2] that this comment is OK as-is.
 - Every other item has been fixed exactly as suggested (or close to it).

KNOWN ISSUES: This v85 patch was built and tested using yesterday's master, but due to lots of recent activity in the replication area I expect it will be broken for HEAD very soon (if not already). I'll rebase it again ASAP to try to keep it in working order.

----
[1] https://www.postgresql.org/message-id/CAJcOf-fPcpe21RciPRn_56FwO6K_B%2BVcTZ2prAv4xvAk4cqYiQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAA4eK1J2XBSbWXcf9P0z30op%2BGL-cUrrqJuy-kFVmbjS1fx-eQ%40mail.gmail.com

Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Fri, Jun 11, 2021 at 6:34 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> KNOWN ISSUES: This v85 patch was built and tested using yesterday's
> master, but due to lots of recent activity in the replication area I
> expect it will be broken for HEAD very soon (if not already). I'll
> rebase it again ASAP to try to keep it in working order.
>

Please find attached the latest patch set v86*

Differences from v85* are:

* Rebased to HEAD @ today.
* Some recent pushes (e.g. [1][2][3]) in the replication area had broken the v85* patch. v86 is now working for the current HEAD.

NOTE: I only changed what was necessary to get the 2PC patches working again. Specifically, one of the pushes [3] changed a number of protocol Asserts into ereports, but this 2PC patch set also introduces a number of new Asserts. If you find that any of these new Asserts are of the same kind which should be changed to ereports (in keeping with [3]), then please report them in a future code review.

----
[1] https://github.com/postgres/postgres/commit/3a09d75b4f6cabc8331e228b6988dbfcd9afdfbe
[2] https://github.com/postgres/postgres/commit/d08237b5b494f96e72220bcef36a14a642969f16
[3] https://github.com/postgres/postgres/commit/fe6a20ce54cbbb6fcfe9f6675d563af836ae799a

Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Wed, Jun 16, 2021 at 9:08 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
>
> Please find attached the latest patch set v86*
>

A couple of comments:

(1) I think one of my suggested changes was missed (or was that intentional?):

BEFORE:
+ The LSN of the commit prepared.
AFTER:
+ The LSN of the commit prepared transaction.

(2) In light of Tom Lane's recent changes in:

fe6a20ce54cbbb6fcfe9f6675d563af836ae799a (Don't use Asserts to check for violations of replication protocol)

there appear to be some instances of such code in these patches.

For example, in the v86-0001 patch:

+/*
+ * Handle PREPARE message.
+ */
+static void
+apply_handle_prepare(StringInfo s)
+{
+ LogicalRepPreparedTxnData prepare_data;
+ char gid[GIDSIZE];
+
+ logicalrep_read_prepare(s, &prepare_data);
+
+ Assert(prepare_data.prepare_lsn == remote_final_lsn);

The above Assert() should be changed to something like:

+ if (prepare_data.prepare_lsn != remote_final_lsn)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROTOCOL_VIOLATION),
+ errmsg_internal("incorrect prepare LSN %X/%X in prepare message (expected %X/%X)",
+ LSN_FORMAT_ARGS(prepare_data.prepare_lsn),
+ LSN_FORMAT_ARGS(remote_final_lsn))));

Without being more familiar with this code, it's difficult for me to judge exactly how many such cases are in these patches.

Regards, Greg Nancarrow Fujitsu Australia
On Wed, Jun 16, 2021 at 9:08 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Fri, Jun 11, 2021 at 6:34 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > KNOWN ISSUES: This v85 patch was built and tested using yesterday's > > master, but due to lots of recent activity in the replication area I > > expect it will be broken for HEAD very soon (if not already). I'll > > rebase it again ASAP to try to keep it in working order. > > > > Please find attached the latest patch set v86* I've modified the patchset based on comments received on thread [1] for the CREATE_REPLICATION_SLOT changes. Based on the request from that thread, I've taken out those changes as two new patches (patch-1 and patch-2) and made this into 5 patches. I've also changed the logic to align with the changes in the command syntax. I've also addressed one pending comment from Amit about CreateInitDecodingContext, I've taken out the logic that sets slot->data.two_phase, and only kept the logic that sets ctx->twophase. Before: - ctx->twophase &= MyReplicationSlot->data.two_phase; + ctx->twophase &= (ctx->twophase_opt_given || slot->data.two_phase); + + /* Mark slot to allow two_phase decoding if not already marked */ + if (ctx->twophase && !slot->data.two_phase) + { + slot->data.two_phase = true; + ReplicationSlotMarkDirty(); + ReplicationSlotSave(); + } After: - ctx->twophase &= MyReplicationSlot->data.two_phase; + ctx->twophase &= slot->data.two_phase; [1] - https://postgr.es/m/64b9f783c6e125f18f88fbc0c0234e34e71d8639.camel@j-davis.com regards, Ajin Cherian Fujitsu Australia
Attachment
- v87-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch
- v87-0003-Add-support-for-prepared-transactions-to-built-i.patch
- v87-0004-Add-prepare-API-support-for-streaming-transactio.patch
- v87-0005-Skip-empty-transactions-for-logical-replication.patch
- v87-0002-Add-support-for-two-phase-decoding-in-pg_recvlog.patch
On Thu, Jun 17, 2021 at 6:22 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > > On Wed, Jun 16, 2021 at 9:08 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > Please find attached the latest patch set v86* > > > > A couple of comments: > > (1) I think one of my suggested changes was missed (or was that intentional?): > > BEFORE: > + The LSN of the commit prepared. > AFTER: > + The LSN of the commit prepared transaction. > No, not missed. I already dismissed that one and wrote about it when I posted v85 [1]. > > (2) In light of Tom Lane's recent changes in: > > fe6a20ce54cbbb6fcfe9f6675d563af836ae799a (Don't use Asserts to check > for violations of replication protocol) > > there appear to be some instances of such code in these patches. Yes, I already noted [2] there are likely to be such cases which need to be fixed. > > For example, in the v86-0001 patch: > > +/* > + * Handle PREPARE message. > + */ > +static void > +apply_handle_prepare(StringInfo s) > +{ > + LogicalRepPreparedTxnData prepare_data; > + char gid[GIDSIZE]; > + > + logicalrep_read_prepare(s, &prepare_data); > + > + Assert(prepare_data.prepare_lsn == remote_final_lsn); > > The above Assert() should be changed to something like: > > + if (prepare_data.prepare_lsn != remote_final_lsn) > + ereport(ERROR, > + (errcode(ERRCODE_PROTOCOL_VIOLATION), > + errmsg_internal("incorrect prepare LSN %X/%X in > prepare message (expected %X/%X)", > + LSN_FORMAT_ARGS(prepare_data.prepare_lsn), > + LSN_FORMAT_ARGS(remote_final_lsn)))); > > Without being more familiar with this code, it's difficult for me to > judge exactly how many of such cases are in these patches. Thanks for the above example. I will fix this one later, after receiving some more reviews and reports of other Assert cases just like this one. ------ [1] https://www.postgresql.org/message-id/CAHut%2BPvOVkiVBf4P5chdVSoVs5%3Da%3DF_GtTSHHoXDb4LiOM_8Qw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAHut%2BPvdio4%3DOE6cz5pr8VcJNcAgt5uGBPdKf-tnGEMa1mANGg%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
On Thu, Jun 17, 2021 at 7:40 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Wed, Jun 16, 2021 at 9:08 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Fri, Jun 11, 2021 at 6:34 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > KNOWN ISSUES: This v85 patch was built and tested using yesterday's > > > master, but due to lots of recent activity in the replication area I > > > expect it will be broken for HEAD very soon (if not already). I'll > > > rebase it again ASAP to try to keep it in working order. > > > > > > > Please find attached the latest patch set v86* > > > I've modified the patchset based on comments received on thread [1] > for the CREATE_REPLICATION_SLOT > changes. Based on the request from that thread, I've taken out those > changes as two new patches (patch-1 and patch-2) > and made this into 5 patches. I've also changed the logic to align > with the changes in the command syntax. Few comments: 1) This content is present in v87-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch and v87-0003-Add-support-for-prepared-transactions-to-built-i.patch, it can be removed from one of them <varlistentry> + <term><literal>TWO_PHASE</literal></term> + <listitem> + <para> + Specify that this logical replication slot supports decoding of two-phase + transactions. With this option, two-phase commands like + <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal> + and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted. + The transaction will be decoded and transmitted at + <literal>PREPARE TRANSACTION</literal> time. + </para> + </listitem> + </varlistentry> + + <varlistentry> 2) This change is not required, it can be removed: <sect1 id="logicaldecoding-example"> <title>Logical Decoding Examples</title> - <para> The following example demonstrates controlling logical decoding using the SQL interface. 3) We could add a comment marking example 1 at the beginning of example 1, and a comment with a description for the newly added example 2; that will clearly mark the examples. COMMIT 693 <keycombo action="simul"><keycap>Control</keycap><keycap>C</keycap></keycombo> $ pg_recvlogical -d postgres --slot=test --drop-slot + +$ pg_recvlogical -d postgres --slot=test --create-slot --two-phase +$ pg_recvlogical -d postgres --slot=test --start -f - 4) You could mention "Before you use two-phase commit commands, you must set max_prepared_transactions to at least 1" for example 2. $ pg_recvlogical -d postgres --slot=test --drop-slot + +$ pg_recvlogical -d postgres --slot=test --create-slot --two-phase +$ pg_recvlogical -d postgres --slot=test --start -f - 5) This should be before verbose, the options are documented alphabetically + <varlistentry> + <term><option>-t</option></term> + <term><option>--two-phase</option></term> + <listitem> + <para> + Enables two-phase decoding. This option should only be used with + <option>--create-slot</option> + </para> + </listitem> + </varlistentry> 6) This should be before verbose, the options are printed alphabetically printf(_(" -v, --verbose output verbose messages\n")); + printf(_(" -t, --two-phase enable two-phase decoding when creating a slot\n")); printf(_(" -V, --version output version information, then exit\n")); Regards, Vignesh
On Fri, Jun 18, 2021 at 7:43 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Thu, Jun 17, 2021 at 6:22 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > > > > For example, in the v86-0001 patch: > > > > +/* > > + * Handle PREPARE message. > > + */ > > +static void > > +apply_handle_prepare(StringInfo s) > > +{ > > + LogicalRepPreparedTxnData prepare_data; > > + char gid[GIDSIZE]; > > + > > + logicalrep_read_prepare(s, &prepare_data); > > + > > + Assert(prepare_data.prepare_lsn == remote_final_lsn); > > > > The above Assert() should be changed to something like: > > > > + if (prepare_data.prepare_lsn != remote_final_lsn) > > + ereport(ERROR, > > + (errcode(ERRCODE_PROTOCOL_VIOLATION), > > + errmsg_internal("incorrect prepare LSN %X/%X in > > prepare message (expected %X/%X)", > > + LSN_FORMAT_ARGS(prepare_data.prepare_lsn), > > + LSN_FORMAT_ARGS(remote_final_lsn)))); > > > > Without being more familiar with this code, it's difficult for me to > > judge exactly how many of such cases are in these patches. > > Thanks for the above example. I will fix this one later, after > receiving some more reviews and reports of other Assert cases just > like this one. > I think on similar lines below asserts also need to be changed. 1. +static void +apply_handle_begin_prepare(StringInfo s) +{ + LogicalRepPreparedTxnData begin_data; + char gid[GIDSIZE]; + + /* Tablesync should never receive prepare. */ + Assert(!am_tablesync_worker()); 2. +static void +TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid) +{ .. + Assert(TransactionIdIsValid(xid)); 3. +static void +apply_handle_stream_prepare(StringInfo s) +{ + int nchanges = 0; + LogicalRepPreparedTxnData prepare_data; + TransactionId xid; + char gid[GIDSIZE]; + .. .. + + /* Tablesync should never receive prepare. */ + Assert(!am_tablesync_worker()); -- With Regards, Amit Kapila.
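To make the conversion concrete for these remaining cases too, here is how the first one could look, following the same ereport() convention from Greg's example above (the exact message wording here is illustrative, not the final patch):

+ /* Tablesync should never receive prepare. */
+ if (am_tablesync_worker())
+     ereport(ERROR,
+             (errcode(ERRCODE_PROTOCOL_VIOLATION),
+              errmsg_internal("tablesync worker received a BEGIN PREPARE message")));

The same one-line Assert-to-ereport substitution applies to the stream-prepare case; the TwoPhaseTransactionGid() xid check is an internal invariant rather than a wire-protocol check, so it may reasonably stay an Assert or become an ereport depending on review.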
On Fri, Jun 18, 2021 at 3:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jun 18, 2021 at 7:43 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Thu, Jun 17, 2021 at 6:22 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > > > > > > For example, in the v86-0001 patch: > > > > > > +/* > > > + * Handle PREPARE message. > > > + */ > > > +static void > > > +apply_handle_prepare(StringInfo s) > > > +{ > > > + LogicalRepPreparedTxnData prepare_data; > > > + char gid[GIDSIZE]; > > > + > > > + logicalrep_read_prepare(s, &prepare_data); > > > + > > > + Assert(prepare_data.prepare_lsn == remote_final_lsn); > > > > > > The above Assert() should be changed to something like: > > > > > > + if (prepare_data.prepare_lsn != remote_final_lsn) > > > + ereport(ERROR, > > > + (errcode(ERRCODE_PROTOCOL_VIOLATION), > > > + errmsg_internal("incorrect prepare LSN %X/%X in > > > prepare message (expected %X/%X)", > > > + LSN_FORMAT_ARGS(prepare_data.prepare_lsn), > > > + LSN_FORMAT_ARGS(remote_final_lsn)))); > > > > > > Without being more familiar with this code, it's difficult for me to > > > judge exactly how many of such cases are in these patches. > > > > Thanks for the above example. I will fix this one later, after > > receiving some more reviews and reports of other Assert cases just > > like this one. > > > > I think on similar lines below asserts also need to be changed. > > 1. > +static void > +apply_handle_begin_prepare(StringInfo s) > +{ > + LogicalRepPreparedTxnData begin_data; > + char gid[GIDSIZE]; > + > + /* Tablesync should never receive prepare. */ > + Assert(!am_tablesync_worker()); > > 2. > +static void > +TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid) > +{ > .. > + Assert(TransactionIdIsValid(xid)); > > 3. > +static void > +apply_handle_stream_prepare(StringInfo s) > +{ > + int nchanges = 0; > + LogicalRepPreparedTxnData prepare_data; > + TransactionId xid; > + char gid[GIDSIZE]; > + > .. > .. > + > + /* Tablesync should never receive prepare. */ > + Assert(!am_tablesync_worker()); > Please find attached the latest patch set v88* Differences from v87* are: * Rebased to HEAD @ today. * Replaces several protocol Asserts with ereports (ERRCODE_PROTOCOL_VIOLATION) in patch 0003 and 0004, as reported by Greg [1] and Amit [2]. This is in keeping with the commit [3]. ---- [1] https://www.postgresql.org/message-id/CAHut%2BPuJKTNRjFre0VBufWMz9BEScC__nT%2BPUhbSaUNW2biPow%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAA4eK1JO3HsOurS988%3DJarej%3DAK6ChE1tLuMNP%3DAZCt6--hVrw%40mail.gmail.com [3] https://github.com/postgres/postgres/commit/fe6a20ce54cbbb6fcfe9f6675d563af836ae799a Kind Regards, Peter Smith. Fujitsu Australia
Attachment
- v88-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch
- v88-0002-Add-support-for-two-phase-decoding-in-pg_recvlog.patch
- v88-0003-Add-support-for-prepared-transactions-to-built-i.patch
- v88-0004-Add-prepare-API-support-for-streaming-transactio.patch
- v88-0005-Skip-empty-transactions-for-logical-replication.patch
On Mon, Jun 21, 2021 at 4:37 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Please find attached the latest patch set v88* > Some minor comments: (1) v88-0002 doc/src/sgml/logicaldecoding.sgml "examples shows" is not correct. I think there is only ONE example being referred to. BEFORE: + The following examples shows how logical decoding is controlled over the AFTER: + The following example shows how logical decoding is controlled over the (2) v88 - 0003 doc/src/sgml/ref/create_subscription.sgml (i) BEFORE: + to the subscriber on the PREPARE TRANSACTION. By default, the transaction + prepared on publisher is decoded as a normal transaction at commit. AFTER: + to the subscriber on the PREPARE TRANSACTION. By default, the transaction + prepared on the publisher is decoded as a normal transaction at commit time. (ii) src/backend/access/transam/twophase.c The double-bracketing is unnecessary: BEFORE: + if ((gxact->valid && strcmp(gxact->gid, gid) == 0)) AFTER: + if (gxact->valid && strcmp(gxact->gid, gid) == 0) (iii) src/backend/replication/logical/snapbuild.c Need to add some commas to make the following easier to read, and change "needs" to "need": BEFORE: + * The prepared transactions that were skipped because previously + * two-phase was not enabled or are not covered by initial snapshot needs + * to be sent later along with commit prepared and they must be before + * this point. AFTER: + * The prepared transactions, that were skipped because previously + * two-phase was not enabled or are not covered by initial snapshot, need + * to be sent later along with commit prepared and they must be before + * this point. (iv) src/backend/replication/logical/tablesync.c I think the convention used in Postgres code is to check for empty Lists using "== NIL" and non-empty Lists using "!= NIL". BEFORE: + if (table_states_not_ready && !last_start_times) AFTER: + if (table_states_not_ready != NIL && !last_start_times) BEFORE: + else if (!table_states_not_ready && last_start_times) AFTER: + else if (table_states_not_ready == NIL && last_start_times) Regards, Greg Nancarrow Fujitsu Australia
On Tue, Jun 22, 2021 at 3:36 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > Some minor comments: > > (1) > v88-0002 > > doc/src/sgml/logicaldecoding.sgml > > "examples shows" is not correct. > I think there is only ONE example being referred to. > > BEFORE: > + The following examples shows how logical decoding is controlled over the > AFTER: > + The following example shows how logical decoding is controlled over the > > fixed. > (2) > v88 - 0003 > > doc/src/sgml/ref/create_subscription.sgml > > (i) > > BEFORE: > + to the subscriber on the PREPARE TRANSACTION. By default, > the transaction > + prepared on publisher is decoded as a normal transaction at commit. > AFTER: > + to the subscriber on the PREPARE TRANSACTION. By default, > the transaction > + prepared on the publisher is decoded as a normal > transaction at commit time. > fixed. > (ii) > > src/backend/access/transam/twophase.c > > The double-bracketing is unnecessary: > > BEFORE: > + if ((gxact->valid && strcmp(gxact->gid, gid) == 0)) > AFTER: > + if (gxact->valid && strcmp(gxact->gid, gid) == 0) > fixed. > (iii) > > src/backend/replication/logical/snapbuild.c > > Need to add some commas to make the following easier to read, and > change "needs" to "need": > > BEFORE: > + * The prepared transactions that were skipped because previously > + * two-phase was not enabled or are not covered by initial snapshot needs > + * to be sent later along with commit prepared and they must be before > + * this point. > AFTER: > + * The prepared transactions, that were skipped because previously > + * two-phase was not enabled or are not covered by initial snapshot, need > + * to be sent later along with commit prepared and they must be before > + * this point. > fixed. > (iv) > > src/backend/replication/logical/tablesync.c > > I think the convention used in Postgres code is to check for empty > Lists using "== NIL" and non-empty Lists using "!= NIL". > > BEFORE: > + if (table_states_not_ready && !last_start_times) > AFTER: > + if (table_states_not_ready != NIL && !last_start_times) > > > BEFORE: > + else if (!table_states_not_ready && last_start_times) > AFTER: > + else if (table_states_not_ready == NIL && last_start_times) > fixed. Also fixed comments from Vignesh: 1) This content is present in v87-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch and v87-0003-Add-support-for-prepared-transactions-to-built-i.patch, it can be removed from one of them <varlistentry> + <term><literal>TWO_PHASE</literal></term> + <listitem> + <para> + Specify that this logical replication slot supports decoding of two-phase + transactions. With this option, two-phase commands like + <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal> + and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted. + The transaction will be decoded and transmitted at + <literal>PREPARE TRANSACTION</literal> time. + </para> + </listitem> + </varlistentry> + + <varlistentry> I don't see this duplicate content. 2) This change is not required, it can be removed: <sect1 id="logicaldecoding-example"> <title>Logical Decoding Examples</title> - <para> The following example demonstrates controlling logical decoding using the SQL interface. fixed this. 3) We could add a comment marking example 1 at the beginning of example 1, and a comment with a description for the newly added example 2; that will clearly mark the examples. added this. 5) This should be before verbose, the options are documented alphabetically fixed this.
regards, Ajin Cherian Fujitsu Australia
Attachment
- v89-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch
- v89-0002-Add-support-for-two-phase-decoding-in-pg_recvlog.patch
- v89-0003-Add-support-for-prepared-transactions-to-built-i.patch
- v89-0004-Add-prepare-API-support-for-streaming-transactio.patch
- v89-0005-Skip-empty-transactions-for-logical-replication.patch
On Wed, Jun 23, 2021 at 9:10 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Tue, Jun 22, 2021 at 3:36 PM Greg Nancarrow <gregn4422@gmail.com> wrote: > > > Some minor comments: > > > > (1) > > v88-0002 > > > > doc/src/sgml/logicaldecoding.sgml > > > > "examples shows" is not correct. > > I think there is only ONE example being referred to. > > > > BEFORE: > > + The following examples shows how logical decoding is controlled over the > > AFTER: > > + The following example shows how logical decoding is controlled over the > > > > > fixed. > > > (2) > > v88 - 0003 > > > > doc/src/sgml/ref/create_subscription.sgml > > > > (i) > > > > BEFORE: > > + to the subscriber on the PREPARE TRANSACTION. By default, > > the transaction > > + prepared on publisher is decoded as a normal transaction at commit. > > AFTER: > > + to the subscriber on the PREPARE TRANSACTION. By default, > > the transaction > > + prepared on the publisher is decoded as a normal > > transaction at commit time. > > > > fixed. > > > (ii) > > > > src/backend/access/transam/twophase.c > > > > The double-bracketing is unnecessary: > > > > BEFORE: > > + if ((gxact->valid && strcmp(gxact->gid, gid) == 0)) > > AFTER: > > + if (gxact->valid && strcmp(gxact->gid, gid) == 0) > > > > fixed. > > > (iii) > > > > src/backend/replication/logical/snapbuild.c > > > > Need to add some commas to make the following easier to read, and > > change "needs" to "need": > > > > BEFORE: > > + * The prepared transactions that were skipped because previously > > + * two-phase was not enabled or are not covered by initial snapshot needs > > + * to be sent later along with commit prepared and they must be before > > + * this point. > > AFTER: > > + * The prepared transactions, that were skipped because previously > > + * two-phase was not enabled or are not covered by initial snapshot, need > > + * to be sent later along with commit prepared and they must be before > > + * this point. > > > > fixed. > > > (iv) > > > > src/backend/replication/logical/tablesync.c > > > > I think the convention used in Postgres code is to check for empty > > Lists using "== NIL" and non-empty Lists using "!= NIL". > > > > BEFORE: > > + if (table_states_not_ready && !last_start_times) > > AFTER: > > + if (table_states_not_ready != NIL && !last_start_times) > > > > > > BEFORE: > > + else if (!table_states_not_ready && last_start_times) > > AFTER: > > + else if (table_states_not_ready == NIL && last_start_times) > > > > fixed. > > Also fixed comments from Vignesh: > > 1) This content is present in > v87-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch and > v87-0003-Add-support-for-prepared-transactions-to-built-i.patch, it > can be removed from one of them > <varlistentry> > + <term><literal>TWO_PHASE</literal></term> > + <listitem> > + <para> > + Specify that this logical replication slot supports decoding > of two-phase > + transactions. With this option, two-phase commands like > + <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT > PREPARED</literal> > + and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted. > + The transaction will be decoded and transmitted at > + <literal>PREPARE TRANSACTION</literal> time. > + </para> > + </listitem> > + </varlistentry> > + > + <varlistentry> > > I don't see this duplicate content. Thanks for the updated patch. 
The patch v89-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch has the following: + <term><literal>TWO_PHASE</literal></term> + <listitem> + <para> + Specify that this logical replication slot supports decoding of two-phase + transactions. With this option, two-phase commands like + <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal> + and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted. + The transaction will be decoded and transmitted at + <literal>PREPARE TRANSACTION</literal> time. + </para> + </listitem> + </varlistentry> The patch v89-0003-Add-support-for-prepared-transactions-to-built-i.patch has the following: + <term><literal>TWO_PHASE</literal></term> + <listitem> + <para> + Specify that this replication slot supports decode of two-phase + transactions. With this option, two-phase commands like + <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT PREPARED</literal> + and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted. + The transaction will be decoded and transmitted at + <literal>PREPARE TRANSACTION</literal> time. + </para> + </listitem> + </varlistentry> We can remove one of them. Regards, Vignesh
On Wed, Jun 23, 2021 at 3:18 PM vignesh C <vignesh21@gmail.com> wrote: > Thanks for the updated patch. > The patch v89-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch > has the following: > + <term><literal>TWO_PHASE</literal></term> > + <listitem> > + <para> > + Specify that this logical replication slot supports decoding > of two-phase > + transactions. With this option, two-phase commands like > + <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT > PREPARED</literal> > + and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted. > + The transaction will be decoded and transmitted at > + <literal>PREPARE TRANSACTION</literal> time. > + </para> > + </listitem> > + </varlistentry> > > The patch v89-0003-Add-support-for-prepared-transactions-to-built-i.patch > has the following: > + <term><literal>TWO_PHASE</literal></term> > + <listitem> > + <para> > + Specify that this replication slot supports decode of two-phase > + transactions. With this option, two-phase commands like > + <literal>PREPARE TRANSACTION</literal>, <literal>COMMIT > PREPARED</literal> > + and <literal>ROLLBACK PREPARED</literal> are decoded and transmitted. > + The transaction will be decoded and transmitted at > + <literal>PREPARE TRANSACTION</literal> time. > + </para> > + </listitem> > + </varlistentry> > > We can remove one of them. I missed this. Updated. Also fixed this comment below which I had missed in my last patch: >4) You could mention "Before you use two-phase commit commands, you >must set max_prepared_transactions to at least 1" for example 2. > $ pg_recvlogical -d postgres --slot=test --drop-slot >+ >+$ pg_recvlogical -d postgres --slot=test --create-slot --two-phase >+$ pg_recvlogical -d postgres --slot=test --start -f - Comment 6: >6) This should be before verbose, the options are printed alphabetically > printf(_(" -v, --verbose output verbose messages\n")); >+ printf(_(" -t, --two-phase enable two-phase decoding >when creating a slot\n")); > printf(_(" -V, --version output version information, >then exit\n")); This was also fixed in the last patch. regards, Ajin Cherian Fujitsu Australia
Attachment
- v90-0001-Add-option-to-set-two-phase-in-CREATE_REPLICATIO.patch
- v90-0002-Add-support-for-two-phase-decoding-in-pg_recvlog.patch
- v90-0003-Add-support-for-prepared-transactions-to-built-i.patch
- v90-0004-Add-prepare-API-support-for-streaming-transactio.patch
- v90-0005-Skip-empty-transactions-for-logical-replication.patch
On Wed, Jun 23, 2021 at 4:10 PM Ajin Cherian <itsajin@gmail.com> wrote: > The first two patches look mostly good to me. I have combined them into one and made some minor changes. (a) Removed opt_two_phase and related code from repl_gram.y as that is not required for this version of the patch. (b) made some changes in docs. Kindly check the attached and let me know if you have any comments? I am planning to push this first patch in the series tomorrow unless you or others have any comments. -- With Regards, Amit Kapila.
Attachment
On Tue, Jun 29, 2021 at 4:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jun 23, 2021 at 4:10 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > The first two patches look mostly good to me. I have combined them > into one and made some minor changes. (a) Removed opt_two_phase and > related code from repl_gram.y as that is not required for this version > of the patch. (b) made some changes in docs. Kindly check the attached > and let me know if you have any comments? I am planning to push this > first patch in the series tomorrow unless you or others have any > comments. The patch applies cleanly and tests pass. I reviewed the patch and have no comments; it looks good. regards, Ajin Cherian Fujitsu Australia
On Tue, Jun 29, 2021 at 12:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jun 23, 2021 at 4:10 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > The first two patches look mostly good to me. I have combined them > into one and made some minor changes. (a) Removed opt_two_phase and > related code from repl_gram.y as that is not required for this version > of the patch. (b) made some changes in docs. Kindly check the attached > and let me know if you have any comments? I am planning to push this > first patch in the series tomorrow unless you or others have any > comments. Thanks for the updated patch; it applies neatly and tests passed. If you are OK, one of the documentation changes could be slightly changed while committing: + <para> + Enables two-phase decoding. This option should only be used with + <option>--create-slot</option> + </para> to: + <para> + Enables two-phase decoding. This option should only be specified with + <option>--create-slot</option> option. + </para> Regards, Vignesh
On Tue, Jun 29, 2021 at 5:31 PM vignesh C <vignesh21@gmail.com> wrote: > > On Tue, Jun 29, 2021 at 12:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jun 23, 2021 at 4:10 PM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > > The first two patches look mostly good to me. I have combined them > > into one and made some minor changes. (a) Removed opt_two_phase and > > related code from repl_gram.y as that is not required for this version > > of the patch. (b) made some changes in docs. Kindly check the attached > > and let me know if you have any comments? I am planning to push this > > first patch in the series tomorrow unless you or others have any > > comments. > > Thanks for the updated patch; it applies neatly and tests passed. > If you are OK, one of the documentation changes could be slightly > changed while committing: > Pushed the patch after taking care of your suggestion. Now, the next step is to rebase the remaining patches and adapt some of the checks to PG-15. -- With Regards, Amit Kapila.
On Wed, Jun 30, 2021 at 6:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Pushed the patch after taking care of your suggestion. Now, the next > step is to rebase the remaining patches and adapt some of the checks > to PG-15. Please find attached the latest patch set v91* Differences from v90* are: * This is the first patch set for PG15 * Rebased to HEAD @ today. * Now the patch set has only 3 patches again because v90-0001, v90-0002 are already pushed [1] * Bumped all relevant server version checks to 150000 ---- [1] https://github.com/postgres/postgres/commit/cda03cfed6b8bd5f64567bccbc9578fba035691e Kind Regards, Peter Smith. Fujitsu Australia
Attachment
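As a side note on that last item (bumping the server version checks to 150000): a sketch of the shape such a gate takes in the apply worker is shown below. The option struct and field names here are assumptions for illustration, not necessarily what the patch uses.

/*
 * Sketch only: request two-phase decoding from the publisher only when
 * it is new enough to understand the option, so a newer subscriber keeps
 * working against an older publisher (names are illustrative).
 */
if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 150000)
    options.proto.logical.twophase = true;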
On Wed, Jun 30, 2021 at 7:47 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Wed, Jun 30, 2021 at 6:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Pushed the patch after taking care of your suggestion. Now, the next > > step is to rebase the remaining patches and adapt some of the checks > > to PG-15. > > Please find attached the latest patch set v91* > > Differences from v90* are: > > * This is the first patch set for PG15 > > * Rebased to HEAD @ today. > > * Now the patch set has only 3 patches again because v90-0001, > v90-0002 are already pushed [1] > > * Bumped all relevant server version checks to 150000 Adding a new patch (0004) to this patch set that handles skipping of empty streamed transactions; patch 0003 did not handle them. To support this, I added a new flag "sent_stream_start" to PGOutputTxnData. Also, transactions which do not have any data will not be stream committed, stream prepared or stream aborted. Please review and let me know if you have any comments. regards, Ajin Cherian Fujitsu Australia
Attachment
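As an illustration of the flag just described: below is a minimal sketch of the deferred stream-start idea. The flag name follows the mail, but the helper and its body are only a sketch of the approach, not the actual patch code.

typedef struct PGOutputTxnData
{
    bool    sent_begin_txn;     /* false until a BEGIN has been sent */
    bool    sent_stream_start;  /* false until a stream start has been sent */
} PGOutputTxnData;

/*
 * Sketch: send the stream start message lazily, only once the streamed
 * transaction turns out to contain a change worth sending, so entirely
 * empty streamed transactions produce no protocol traffic at all.
 */
static void
pgoutput_send_stream_start(LogicalDecodingContext *ctx,
                           ReorderBufferTXN *txn)
{
    PGOutputTxnData *txndata = (PGOutputTxnData *) txn->output_plugin_private;

    if (!txndata->sent_stream_start)
    {
        OutputPluginPrepareWrite(ctx, false);
        logicalrep_write_stream_start(ctx->out, txn->xid, true);
        OutputPluginWrite(ctx, false);
        txndata->sent_stream_start = true;
    }
}

With this shape, the stream commit/prepare/abort callbacks can simply check the flag and skip sending anything when no stream start ever went out.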
On Thursday, July 1, 2021 11:48 AM Ajin Cherian <itsajin@gmail.com> > > Adding a new patch (0004) to this patch set that handles skipping of > empty streamed transactions; patch 0003 did not handle them. > To support this, I added a new flag "sent_stream_start" to PGOutputTxnData. > Also, transactions which do not have any data will not be stream > committed, stream prepared or stream aborted. > Please review and let me know if you have any comments. > Thanks for your patch. I met an issue while using it. When a transaction contains TRUNCATE, the subscriber reported an error: "ERROR: no data left in message" and the data couldn't be replicated. Steps to reproduce the issue: (set logical_decoding_work_mem to 64kB at the publisher so that streaming could work.) ------publisher------ create table test (a int primary key, b varchar); create publication pub for table test; ------subscriber------ create table test (a int primary key, b varchar); create subscription sub connection 'dbname=postgres' publication pub with(two_phase=on, streaming=on); ------publisher------ BEGIN; TRUNCATE test; INSERT INTO test SELECT i, md5(i::text) FROM generate_series(1001, 6000) s(i); UPDATE test SET b = md5(b) WHERE mod(a,2) = 0; DELETE FROM test WHERE mod(a,3) = 0; COMMIT; The above case worked OK when the 0004 patch was removed, so I think it’s a problem with the 0004 patch. Please have a look. Regards Tang
Please find attached the latest patch set v93* Differences from v92* are: * Rebased to HEAD @ today. This rebase was made necessary by recent changes [1] to the parse_subscription_options function. ---- [1] https://github.com/postgres/postgres/commit/8aafb02616753f5c6c90bbc567636b73c0cbb9d4 Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Fri, Jul 2, 2021 at 8:18 PM tanghy.fnst@fujitsu.com <tanghy.fnst@fujitsu.com> wrote: > > Thanks for your patch. I met an issue while using it. When a transaction contains TRUNCATE, the subscriber reported an error: "ERROR: no data left in message" and the data couldn't be replicated. > > Steps to reproduce the issue: > > (set logical_decoding_work_mem to 64kB at the publisher so that streaming could work.) > > ------publisher------ > create table test (a int primary key, b varchar); > create publication pub for table test; > > ------subscriber------ > create table test (a int primary key, b varchar); > create subscription sub connection 'dbname=postgres' publication pub with(two_phase=on, streaming=on); > > ------publisher------ > BEGIN; > TRUNCATE test; > INSERT INTO test SELECT i, md5(i::text) FROM generate_series(1001, 6000) s(i); > UPDATE test SET b = md5(b) WHERE mod(a,2) = 0; > DELETE FROM test WHERE mod(a,3) = 0; > COMMIT; > > The above case worked OK when the 0004 patch was removed, so I think it’s a problem with the 0004 patch. Please have a look. Thanks for the test! I hadn't handled the case where sending the schema across was the first change of the transaction, as part of decoding the TRUNCATE command. In this test case, the schema was sent across without the stream start, hence the error on the apply worker. I have updated the patch with a fix. Please do a test and confirm. regards, Ajin Cherian Fujitsu Australia
Attachment
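In other words, the fix needs the stream start message to be on the wire before any schema message for the streamed transaction. A simplified sketch of that ordering follows; the real function takes more arguments, and the helper is the one sketched earlier, so treat this as an outline of the approach rather than the patch itself.

static void
maybe_send_schema(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                  Relation relation, RelationSyncEntry *relentry)
{
    /*
     * For a streamed transaction, make sure a stream start message has
     * been sent before any schema message; otherwise the apply worker
     * sees the schema outside of any stream and fails with
     * "no data left in message".
     */
    if (in_streaming)
        pgoutput_send_stream_start(ctx, txn);

    if (relentry->schema_sent)
        return;

    /* ... send the relation (and any ancestor) schema messages ... */
}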
On Tuesday, July 6, 2021 7:18 PM Ajin Cherian <itsajin@gmail.com> > > Thanks for the test! > I hadn't handled the case where sending the schema across was the first > change of the transaction, as part of decoding the > TRUNCATE command. In this test case, the schema was sent across > without the stream start, hence the error on the apply worker. > I have updated the patch with a fix. Please do a test and confirm. > Thanks for your patch. I have tested and confirmed that the issue was fixed. Regards Tang
On Tue, Jul 6, 2021 at 9:58 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Please find attached the latest patch set v93* > Thanks, I have gone through the 0001 patch and made a number of changes. (a) Removed some of the code which was leftover from previous versions, (b) Removed the Assert in apply_handle_begin_prepare() as I don't think that makes sense, (c) added/changed comments and made a few other cosmetic changes, (d) ran pgindent. Let me know what you think of the attached? -- With Regards, Amit Kapila.
Attachment
On Thu, Jul 8, 2021 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jul 6, 2021 at 9:58 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Please find attached the latest patch set v93* > > > > Thanks, I have gone through the 0001 patch and made a number of > changes. (a) Removed some of the code which was leftover from previous > versions, (b) Removed the Assert in apply_handle_begin_prepare() as I > don't think that makes sense, (c) added/changed comments and made a > few other cosmetic changes, (d) ran pgindent. > > Let me know what you think of the attached? The patch looks good to me, I don't have any comments. Regards, Vignesh
On Thu, Jul 8, 2021 at 10:08 PM vignesh C <vignesh21@gmail.com> wrote: > > On Thu, Jul 8, 2021 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jul 6, 2021 at 9:58 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > Please find attached the latest patch set v93* > > > > > > > Thanks, I have gone through the 0001 patch and made a number of > > changes. (a) Removed some of the code which was leftover from previous > > versions, (b) Removed the Assert in apply_handle_begin_prepare() as I > > don't think that makes sense, (c) added/changed comments and made a > > few other cosmetic changes, (d) ran pgindent. > > > > Let me know what you think of the attached? > > The patch looks good to me, I don't have any comments. I tried the v95-0001 patch. - The patch applied cleanly and all build / testing was OK. - The documentation also builds OK. - I checked all v95-0001 / v93-0001 differences and found no problems. - Furthermore, I noted that v95-0001 patch is passing the cfbot [1]. So this patch LGTM. ------ [1] http://cfbot.cputube.org/patch_33_2914.log Kind Regards, Peter Smith. Fujitsu Australia
On Fri, Jul 9, 2021 at 9:13 AM Peter Smith <smithpb2250@gmail.com> wrote: > I tried the v95-0001 patch. > > - The patch applied cleanly and all build / testing was OK. > - The documentation also builds OK. > - I checked all v95-0001 / v93-0001 differences and found no problems. > - Furthermore, I noted that v95-0001 patch is passing the cfbot [1]. > > So this patch LGTM. > Applied, reviewed and tested the patch. Also ran a 5-level cascaded standby setup running a modified pgbench that does two-phase commits, and it ran fine. Did some testing using empty transactions and no issues were found. The patch looks good to me. regards, Ajin Cherian
On Friday, July 9, 2021 2:56 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Fri, Jul 9, 2021 at 9:13 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > I tried the v95-0001 patch. > > > > - The patch applied cleanly and all build / testing was OK. > > - The documentation also builds OK. > > - I checked all v95-0001 / v93-0001 differences and found no problems. > > - Furthermore, I noted that v95-0001 patch is passing the cfbot [1]. > > > > So this patch LGTM. > > > > Applied, reviewed and tested the patch. > Also ran a 5-level cascaded standby setup running a modified pgbench > that does two-phase commits, and it ran fine. > Did some testing using empty transactions and no issues were found. > The patch looks good to me. I did some cross-version tests on patch v95 (publisher is PG14 and subscriber is PG15, or publisher is PG15 and subscriber is PG14; set two_phase option to on or off/default). It worked as expected, and data could be replicated correctly. Besides, I tested some scenarios using synchronized replication, and it worked fine in my cases. So this patch LGTM. Regards Tang
On Fri, Jul 9, 2021 at 4:43 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > The patch looks good to me, I don't have any comments. > > I tried the v95-0001 patch. > > - The patch applied cleanly and all build / testing was OK. > - The documentation also builds OK. > - I checked all v95-0001 / v93-0001 differences and found no problems. > - Furthermore, I noted that v95-0001 patch is passing the cfbot [1]. > > So this patch LGTM. > Thanks, I took another pass over it and made a few changes in docs and comments. I am planning to push this next week sometime (by 14th July) unless there are more comments from you or someone else. Just to summarize, this patch will add support for prepared transactions to built-in logical replication. To add support for streaming transactions at prepare time into the built-in logical replication, we need to do the following things: (a) Modify the output plugin (pgoutput) to implement the new two-phase API callbacks, by leveraging the extended replication protocol. (b) Modify the replication apply worker, to properly handle two-phase transactions by replaying them on prepare. (c) Add a new SUBSCRIPTION option "two_phase" to allow users to enable two-phase transactions. We enable the two_phase once the initial data sync is over. Refer to comments atop worker.c in the patch and commit message to see further details about this patch. After this patch, there is a follow-up patch to allow streaming and two-phase options together which I feel needs some more review and can be committed separately. -- With Regards, Amit Kapila.
Attachment
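To make point (b) of that summary concrete: a rough sketch of replaying a transaction on prepare in the apply worker is shown below, reusing the TwoPhaseTransactionGid() helper seen earlier in the thread. This is a simplified sketch of the approach, not the committed code.

static void
apply_handle_prepare_internal(LogicalRepPreparedTxnData *prepare_data)
{
    char        gid[GIDSIZE];

    /* Compute the GID for the two_phase transaction from subid + xid. */
    TwoPhaseTransactionGid(MySubscription->oid, prepare_data->xid,
                           gid, sizeof(gid));

    /*
     * BeginTransactionBlock() is needed to balance the
     * EndTransactionBlock() performed inside PrepareTransactionBlock()
     * below.
     */
    BeginTransactionBlock();
    CommitTransactionCommand();     /* completes the Begin command */

    /* PREPARE the replayed changes on the subscriber under that GID. */
    PrepareTransactionBlock(gid);
}

The prepared transaction then sits on the subscriber until the corresponding commit prepared (or rollback prepared) message arrives and is resolved by looking the GID up again.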
On Sun, Jul 11, 2021 at 8:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jul 9, 2021 at 4:43 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > The patch looks good to me, I don't have any comments. > > > > I tried the v95-0001 patch. > > > > - The patch applied cleanly and all build / testing was OK. > > - The documentation also builds OK. > > - I checked all v95-0001 / v93-0001 differences and found no problems. > > - Furthermore, I noted that v95-0001 patch is passing the cfbot [1]. > > > > So this patch LGTM. > > > > Thanks, I took another pass over it and made a few changes in docs and > comments. I am planning to push this next week sometime (by 14th July) > unless there are more comments from you or someone else. Just to > summarize, this patch will add support for prepared transactions to > built-in logical replication. To add support for streaming > transactions at prepare time into the > built-in logical replication, we need to do the following things: (a) > Modify the output plugin (pgoutput) to implement the new two-phase API > callbacks, by leveraging the extended replication protocol. (b) Modify > the replication apply worker, to properly handle two-phase > transactions by replaying them on prepare. (c) Add a new SUBSCRIPTION > option "two_phase" to allow users to enable > two-phase transactions. We enable the two_phase once the initial data > sync is over. Refer to comments atop worker.c in the patch and commit > message to see further details about this patch. After this patch, > there is a follow-up patch to allow streaming and two-phase options > together which I feel needs some more review and can be committed > separately. > FYI - I repeated the same verification of the v96-0001 patch as I did previously for v95-0001 - The v96 patch applied cleanly and all build / testing was OK. - The documentation also builds OK. - I checked the v95-0001 / v96-0001 differences and found no problems. - Furthermore, I noted that v96-0001 patch is passing the cfbot. LGTM. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Mon, Jul 12, 2021 at 9:14 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Sun, Jul 11, 2021 at 8:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Jul 9, 2021 at 4:43 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > > The patch looks good to me, I don't have any comments. > > > > > > I tried the v95-0001 patch. > > > > > > - The patch applied cleanly and all build / testing was OK. > > > - The documentation also builds OK. > > > - I checked all v95-0001 / v93-0001 differences and found no problems. > > > - Furthermore, I noted that v95-0001 patch is passing the cfbot [1]. > > > > > > So this patch LGTM. > > > > > > > Thanks, I took another pass over it and made a few changes in docs and > > comments. I am planning to push this next week sometime (by 14th July) > > unless there are more comments from you or someone else. Just to > > summarize, this patch will add support for prepared transactions to > > built-in logical replication. To add support for streaming > > transactions at prepare time into the > > built-in logical replication, we need to do the following things: (a) > > Modify the output plugin (pgoutput) to implement the new two-phase API > > callbacks, by leveraging the extended replication protocol. (b) Modify > > the replication apply worker, to properly handle two-phase > > transactions by replaying them on prepare. (c) Add a new SUBSCRIPTION > > option "two_phase" to allow users to enable > > two-phase transactions. We enable the two_phase once the initial data > > sync is over. Refer to comments atop worker.c in the patch and commit > > message to see further details about this patch. After this patch, > > there is a follow-up patch to allow streaming and two-phase options > > together which I feel needs some more review and can be committed > > separately. > > > > FYI - I repeated the same verification of the v96-0001 patch as I did > previously for v95-0001 > > - The v96 patch applied cleanly and all build / testing was OK. > - The documentation also builds OK. > - I checked the v95-0001 / v96-0001 differences and found no problems. > - Furthermore, I noted that v96-0001 patch is passing the cfbot. > > LGTM. > Pushed. Feel free to submit the remaining patches after rebase. Is it possible to post patches related to skipping empty transactions in the other thread [1] where that topic is being discussed? [1] - https://www.postgresql.org/message-id/CAMkU%3D1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q%40mail.gmail.com -- With Regards, Amit Kapila.
On Wed, Jul 14, 2021 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 12, 2021 at 9:14 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Sun, Jul 11, 2021 at 8:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Fri, Jul 9, 2021 at 4:43 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > > > > The patch looks good to me, I don't have any comments. > > > > > > > > I tried the v95-0001 patch. > > > > > > > > - The patch applied cleanly and all build / testing was OK. > > > > - The documentation also builds OK. > > > > - I checked all v95-0001 / v93-0001 differences and found no problems. > > > > - Furthermore, I noted that v95-0001 patch is passing the cfbot [1]. > > > > > > > > So this patch LGTM. > > > > > > > > > > Thanks, I took another pass over it and made a few changes in docs and > > > comments. I am planning to push this next week sometime (by 14th July) > > > unless there are more comments from you or someone else. Just to > > > summarize, this patch will add support for prepared transactions to > > > built-in logical replication. To add support for streaming > > > transactions at prepare time into the > > > built-in logical replication, we need to do the following things: (a) > > > Modify the output plugin (pgoutput) to implement the new two-phase API > > > callbacks, by leveraging the extended replication protocol. (b) Modify > > > the replication apply worker, to properly handle two-phase > > > transactions by replaying them on prepare. (c) Add a new SUBSCRIPTION > > > option "two_phase" to allow users to enable > > > two-phase transactions. We enable the two_phase once the initial data > > > sync is over. Refer to comments atop worker.c in the patch and commit > > > message to see further details about this patch. After this patch, > > > there is a follow-up patch to allow streaming and two-phase options > > > together which I feel needs some more review and can be committed > > > separately. > > > > > > > FYI - I repeated the same verification of the v96-0001 patch as I did > > previously for v95-0001 > > > > - The v96 patch applied cleanly and all build / testing was OK. > > - The documentation also builds OK. > > - I checked the v95-0001 / v96-0001 differences and found no problems. > > - Furthermore, I noted that v96-0001 patch is passing the cfbot. > > > > LGTM. > > > > Pushed. > > Feel free to submit the remaining patches after rebase. Is it possible > to post patches related to skipping empty transactions in the other > thread [1] where that topic is being discussed? > > [1] - https://www.postgresql.org/message-id/CAMkU%3D1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q%40mail.gmail.com > Please find attached the latest patch set v97* * Rebased v94* to HEAD @ today. This rebase was made necessary by the recent push of the first patch from this set. v94-0001 ==> already pushed [1] v94-0002 ==> v97-0001 v94-0003 ==> will be relocated to other thread [2] v94-0004 ==> this is omitted for now ---- [1] https://github.com/postgres/postgres/commit/a8fd13cab0ba815e9925dc9676e6309f699b5f72 [2] https://www.postgresql.org/message-id/CAMkU%3D1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Wed, Jul 14, 2021 at 2:03 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Wed, Jul 14, 2021 at 4:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jul 12, 2021 at 9:14 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > On Sun, Jul 11, 2021 at 8:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Fri, Jul 9, 2021 at 4:43 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > > > > > > The patch looks good to me, I don't have any comments. > > > > > > > > > > I tried the v95-0001 patch. > > > > > > > > > > - The patch applied cleanly and all build / testing was OK. > > > > > - The documentation also builds OK. > > > > > - I checked all v95-0001 / v93-0001 differences and found no problems. > > > > > - Furthermore, I noted that v95-0001 patch is passing the cfbot [1]. > > > > > > > > > > So this patch LGTM. > > > > > > > > > > > > > Thanks, I took another pass over it and made a few changes in docs and > > > > comments. I am planning to push this next week sometime (by 14th July) > > > > unless there are more comments from you or someone else. Just to > > > > summarize, this patch will add support for prepared transactions to > > > > built-in logical replication. To add support for streaming > > > > transactions at prepare time into the > > > > built-in logical replication, we need to do the following things: (a) > > > > Modify the output plugin (pgoutput) to implement the new two-phase API > > > > callbacks, by leveraging the extended replication protocol. (b) Modify > > > > the replication apply worker, to properly handle two-phase > > > > transactions by replaying them on prepare. (c) Add a new SUBSCRIPTION > > > > option "two_phase" to allow users to enable > > > > two-phase transactions. We enable the two_phase once the initial data > > > > sync is over. Refer to comments atop worker.c in the patch and commit > > > > message to see further details about this patch. After this patch, > > > > there is a follow-up patch to allow streaming and two-phase options > > > > together which I feel needs some more review and can be committed > > > > separately. > > > > > > > > > > FYI - I repeated the same verification of the v96-0001 patch as I did > > > previously for v95-0001 > > > > > > - The v96 patch applied cleanly and all build / testing was OK. > > > - The documentation also builds OK. > > > - I checked the v95-0001 / v96-0001 differences and found no problems. > > > - Furthermore, I noted that v96-0001 patch is passing the cfbot. > > > > > > LGTM. > > > > > > > Pushed. > > > > Feel free to submit the remaining patches after rebase. Is it possible > > to post patches related to skipping empty transactions in the other > > thread [1] where that topic is being discussed? > > > > [1] - https://www.postgresql.org/message-id/CAMkU%3D1yohp9-dv48FLoSPrMqYEyyS5ZWkaZGD41RJr10xiNo_Q%40mail.gmail.com > > > > > Please find attached the latest patch set v97* > > * Rebased v94* to HEAD @ today. Thanks for the updated patch; it applies cleanly and the tests pass. I had a couple of comments: 1) Should we include "stream_prepare_cb" here in logicaldecoding-streaming section of logicaldecoding.sgml documentation: To reduce the apply lag caused by large transactions, an output plugin may provide additional callback to support incremental streaming of in-progress transactions. 
There are multiple required streaming callbacks (stream_start_cb, stream_stop_cb, stream_abort_cb, stream_commit_cb and stream_change_cb) and two optional callbacks (stream_message_cb and stream_truncate_cb). 2) Should we add an example for stream_prepare_cb here in logicaldecoding-streaming section of logicaldecoding.sgml documentation: One example sequence of streaming callback calls for one transaction may look like this: stream_start_cb(...); <-- start of first block of changes stream_change_cb(...); stream_change_cb(...); stream_message_cb(...); stream_change_cb(...); ... stream_change_cb(...); stream_stop_cb(...); <-- end of first block of changes stream_start_cb(...); <-- start of second block of changes stream_change_cb(...); stream_change_cb(...); stream_change_cb(...); ... stream_message_cb(...); stream_change_cb(...); stream_stop_cb(...); <-- end of second block of changes stream_commit_cb(...); <-- commit of the streamed transaction Regards, Vignesh
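For a streamed two-phase transaction, the tail of that sequence would differ. A sketch of how such a sequence might end, using the callback names this patch set adds (hedged: the exact documented sequence is up to the patch, this just illustrates the idea):

stream_start_cb(...);    <-- start of the last block of changes
stream_change_cb(...);
...
stream_stop_cb(...);     <-- end of the last block of changes

stream_prepare_cb(...);  <-- prepare of the streamed two-phase transaction
...
commit_prepared_cb(...); <-- decoded when the COMMIT PREPARED arrives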
Amit Kapila <amit.kapila16@gmail.com> writes: > Pushed. Coverity thinks this has security issues, and I agree. /srv/coverity/git/pgsql-git/postgresql/src/backend/replication/logical/proto.c: 144 in logicalrep_read_begin_prepare() 143 /* read gid (copy it into a pre-allocated buffer) */ >>> CID 1487517: Security best practices violations (STRING_OVERFLOW) >>> You might overrun the 200-character fixed-size string "begin_data->gid" by copying the return value of "pq_getmsgstring" without checking the length. 144 strcpy(begin_data->gid, pq_getmsgstring(in)); 200 /* read gid (copy it into a pre-allocated buffer) */ >>> CID 1487515: Security best practices violations (STRING_OVERFLOW) >>> You might overrun the 200-character fixed-size string "prepare_data->gid" by copying the return value of "pq_getmsgstring" without checking the length. 201 strcpy(prepare_data->gid, pq_getmsgstring(in)); 256 /* read gid (copy it into a pre-allocated buffer) */ >>> CID 1487516: Security best practices violations (STRING_OVERFLOW) >>> You might overrun the 200-character fixed-size string "prepare_data->gid" by copying the return value of "pq_getmsgstring" without checking the length. 257 strcpy(prepare_data->gid, pq_getmsgstring(in)); 316 /* read gid (copy it into a pre-allocated buffer) */ >>> CID 1487519: Security best practices violations (STRING_OVERFLOW) >>> You might overrun the 200-character fixed-size string "rollback_data->gid" by copying the return value of "pq_getmsgstring" without checking the length. 317 strcpy(rollback_data->gid, pq_getmsgstring(in)); I think you'd be way better off making the gid fields be "char *" and pstrdup'ing the result of pq_getmsgstring. Another possibility perhaps is to use strlcpy, but I'd only go that way if it's important to constrain the received strings to 200 bytes. regards, tom lane
On Mon, Jul 19, 2021 at 1:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Amit Kapila <amit.kapila16@gmail.com> writes: > > Pushed. > > I think you'd be way better off making the gid fields be "char *" > and pstrdup'ing the result of pq_getmsgstring. Another possibility > perhaps is to use strlcpy, but I'd only go that way if it's important > to constrain the received strings to 200 bytes. > I think it is important to constrain length to 200 bytes for this case as here we receive a prepared transaction identifier which according to docs [1] has a max length of 200 bytes. Also, in ParseCommitRecord() and ParseAbortRecord(), we are using strlcpy with 200 as max length to copy prepare transaction identifier. So, I think it is better to use strlcpy here unless you or Peter feels otherwise. [1] - https://www.postgresql.org/docs/devel/sql-prepare-transaction.html -- With Regards, Amit Kapila.
On Mon, Jul 19, 2021 at 12:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 19, 2021 at 1:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Amit Kapila <amit.kapila16@gmail.com> writes: > > > Pushed. > > > > I think you'd be way better off making the gid fields be "char *" > > and pstrdup'ing the result of pq_getmsgstring. Another possibility > > perhaps is to use strlcpy, but I'd only go that way if it's important > > to constrain the received strings to 200 bytes. > > > > I think it is important to constrain length to 200 bytes for this case > as here we receive a prepared transaction identifier which according > to docs [1] has a max length of 200 bytes. Also, in > ParseCommitRecord() and ParseAbortRecord(), we are using strlcpy with > 200 as max length to copy prepare transaction identifier. So, I think > it is better to use strlcpy here unless you or Peter feels otherwise. > OK. I have addressed this reported [1] potential buffer overrun using the constraining strlcpy, because the GID limitation of 200 bytes is already mentioned in the documentation [2]. PSA. ------ [1] https://www.postgresql.org/message-id/161029.1626639923%40sss.pgh.pa.us [2] https://www.postgresql.org/docs/devel/sql-prepare-transaction.html Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Wed, Jul 14, 2021 at 6:33 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Please find attached the latest patch set v97* > I couldn't spot any significant issues in the v97-0001 patch, but do have the following trivial feedback comments: (1) doc/src/sgml/protocol.sgml Suggestion: BEFORE: + contains a Stream Prepare or Stream Commit or Stream Abort message. AFTER: + contains a Stream Prepare, Stream Commit or Stream Abort message. (2) src/backend/replication/logical/worker.c It seems a bit weird to add a forward declaration here, without a comment, like for the one immediately above it /* Compute GID for two_phase transactions */ static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid); - +static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn); (3) src/backend/replication/logical/worker.c Other DEBUG1 messages don't end with "." + elog(DEBUG1, "apply_handle_stream_prepare: replayed %d (all) changes.", nchanges); Regards, Greg Nancarrow Fujitsu Australia
On Mon, Jul 19, 2021 at 9:19 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Mon, Jul 19, 2021 at 12:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jul 19, 2021 at 1:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > Amit Kapila <amit.kapila16@gmail.com> writes: > > > > Pushed. > > > > > > I think you'd be way better off making the gid fields be "char *" > > > and pstrdup'ing the result of pq_getmsgstring. Another possibility > > > perhaps is to use strlcpy, but I'd only go that way if it's important > > > to constrain the received strings to 200 bytes. > > > > > > > I think it is important to constrain length to 200 bytes for this case > > as here we receive a prepared transaction identifier which according > > to docs [1] has a max length of 200 bytes. Also, in > > ParseCommitRecord() and ParseAbortRecord(), we are using strlcpy with > > 200 as max length to copy prepare transaction identifier. So, I think > > it is better to use strlcpy here unless you or Peter feels otherwise. > > > > OK. I have addressed this reported [1] potential buffer overrun > using the constraining strlcpy, because the GID limitation of 200 > bytes is already mentioned in the documentation [2]. > This will work but I think it is better to use sizeof on the gid buffer, as we do in ParseCommitRecord() and ParseAbortRecord(). Tomorrow, if due to some unforeseen reason we change the size of the gid buffer to be different than GIDSIZE, then it will still work seamlessly. -- With Regards, Amit Kapila.
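For reference, the constrained copy being discussed comes down to one line at each of the Coverity-flagged sites; shown here for the begin_prepare case, with the same pattern applying to the prepare, commit prepared and rollback prepared readers:

/* read gid (copy it into a pre-allocated buffer, bounded by the buffer itself) */
strlcpy(begin_data->gid, pq_getmsgstring(in), sizeof(begin_data->gid));

Using sizeof on the destination rather than a separate constant is what keeps the copy correct even if the buffer's declared size ever diverges from GIDSIZE.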
On Mon, Jul 19, 2021 at 4:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 19, 2021 at 9:19 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Mon, Jul 19, 2021 at 12:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Jul 19, 2021 at 1:55 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > > > > > Amit Kapila <amit.kapila16@gmail.com> writes: > > > > > Pushed. > > > > > > > > I think you'd be way better off making the gid fields be "char *" > > > > and pstrdup'ing the result of pq_getmsgstring. Another possibility > > > > perhaps is to use strlcpy, but I'd only go that way if it's important > > > > to constrain the received strings to 200 bytes. > > > > > > > > > > I think it is important to constrain length to 200 bytes for this case > > > as here we receive a prepared transaction identifier which according > > > to docs [1] has a max length of 200 bytes. Also, in > > > ParseCommitRecord() and ParseAbortRecord(), we are using strlcpy with > > > 200 as max length to copy prepare transaction identifier. So, I think > > > it is better to use strlcpy here unless you or Peter feels otherwise. > > > > > > > OK. I have addressed this reported [1] potential buffer overrun > > using the constraining strlcpy, because the GID limitation of 200 > > bytes is already mentioned in the documentation [2]. > > > > This will work but I think it is better to use sizeof on the gid buffer, > as we do in ParseCommitRecord() and ParseAbortRecord(). Tomorrow, if > due to some unforeseen reason we change the size of the gid buffer to > be different than GIDSIZE, then it will still work seamlessly. > Modified as requested. PSA patch v2. ------ Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Mon, Jul 19, 2021 at 1:00 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Mon, Jul 19, 2021 at 4:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > OK. I have implemented this reported [1] potential buffer overrun > > > using the constraining strlcpy, because the GID limitation of 200 > > > bytes is already mentioned in the documentation [2]. > > > > > > > This will work but I think it is better to use sizeof gid buffer as we > > are using in ParseCommitRecord() and ParseAbortRecord(). Tomorrow, if > > due to some unforeseen reason if we change the size of gid buffer to > > be different than the GIDSIZE then it will work seamlessly. > > > > Modified as requested. PSA patch v2. > LGTM. I'll push this tomorrow unless Tom or someone else has any comments. -- With Regards, Amit Kapila.
Please find attached the latest patch set v98* Patches: v97-0001 --> v98-0001 Differences: * Rebased to HEAD @ yesterday. * Code/Docs changes: 1. Fixed the same strcpy problem as reported by Tom Lane [1] for the previous 2PC patch. 2. Addressed all feedback suggestions given by Greg [2]. 3. Added some more documentation as suggested by Vignesh [3]. ---- [1] https://www.postgresql.org/message-id/161029.1626639923%40sss.pgh.pa.us [2] https://www.postgresql.org/message-id/CAJcOf-ckGONzyAj0Y70ju_tfLWF819JYb%3Ddv9p5AnoZxm50j0g%40mail.gmail.com [3] https://www.postgresql.org/message-id/CALDaNm0LVY5A98xrgaodynnj6c%3DWQ5%3DZMpauC44aRio7-jWBYQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Mon, Jul 19, 2021 at 3:28 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Wed, Jul 14, 2021 at 6:33 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Please find attached the latest patch set v97*
> >
>
> I couldn't spot any significant issues in the v97-0001 patch, but
> do have the following trivial feedback comments:
>
> (1) doc/src/sgml/protocol.sgml
> Suggestion:
>
> BEFORE:
> + contains a Stream Prepare or Stream Commit or Stream Abort message.
> AFTER:
> + contains a Stream Prepare, Stream Commit or Stream Abort message.
>
> (2) src/backend/replication/logical/worker.c
> It seems a bit weird to add a forward declaration here, without a
> comment, like for the one immediately above it
>
> /* Compute GID for two_phase transactions */
> static void TwoPhaseTransactionGid(Oid subid, TransactionId xid, char
> *gid, int szgid);
> -
> +static int apply_spooled_messages(TransactionId xid, XLogRecPtr lsn);
>
> (3) src/backend/replication/logical/worker.c
> Other DEBUG1 messages don't end with "."
>
> + elog(DEBUG1, "apply_handle_stream_prepare: replayed %d
> (all) changes.", nchanges);
>

Thanks for the feedback. All these are fixed as suggested in v98.

------
Kind Regards,
Peter Smith.
Fujitsu Australia
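[Editor's note] For readers following along, the forward declaration quoted in (2) is for the helper that derives a GID for an applied two_phase transaction on the subscriber. A minimal sketch of such a helper is below; the exact format string is an assumption for illustration, the point being only that the subscription OID plus the remote xid yield a unique identifier that fits the caller-supplied buffer:

/*
 * Sketch of a GID-derivation helper for two_phase transactions.
 * The "pg_gid_%u_%u" format is illustrative; any format works so long
 * as (subid, xid) maps to a unique GID within szgid bytes.
 */
static void
TwoPhaseTransactionGid(Oid subid, TransactionId xid, char *gid, int szgid)
{
	snprintf(gid, szgid, "pg_gid_%u_%u", subid, xid);
}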
On Fri, Jul 16, 2021 at 4:08 PM vignesh C <vignesh21@gmail.com> wrote:
> [...]
> Thanks for the updated patch; the patch applies cleanly and the tests pass.
> I had a couple of comments:
> 1) Should we include "stream_prepare_cb" here in the
> logicaldecoding-streaming section of the logicaldecoding.sgml
> documentation:
> To reduce the apply lag caused by large transactions, an output plugin
> may provide additional callback to support incremental streaming of
> in-progress transactions. There are multiple required streaming
> callbacks (stream_start_cb, stream_stop_cb, stream_abort_cb,
> stream_commit_cb and stream_change_cb) and two optional callbacks
> (stream_message_cb and stream_truncate_cb).
>

Modified in v98. The information about 'stream_prepare_cb' and friends is given in detail in section 49.10, so I added a link to that page.

> 2) Should we add an example for stream_prepare_cb here in the
> logicaldecoding-streaming section of the logicaldecoding.sgml
> documentation:
> One example sequence of streaming callback calls for one transaction
> may look like this:
>
> stream_start_cb(...); <-- start of first block of changes
> stream_change_cb(...);
> stream_change_cb(...);
> stream_message_cb(...);
> stream_change_cb(...);
> ...
> stream_change_cb(...);
> stream_stop_cb(...); <-- end of first block of changes
>
> stream_start_cb(...); <-- start of second block of changes
> stream_change_cb(...);
> stream_change_cb(...);
> stream_change_cb(...);
> ...
> stream_message_cb(...);
> stream_change_cb(...);
> stream_stop_cb(...); <-- end of second block of changes
>
> stream_commit_cb(...); <-- commit of the streamed transaction
>

Modified in v98. I felt it would be too verbose to add another full example since it would be 90% the same as the current example, so I have combined the information.

------
Kind Regards,
Peter Smith.
Fujitsu Australia
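[Editor's note] To make the combined example concrete: for a streamed transaction that is prepared rather than committed, the sequence quoted above ends with stream_prepare_cb(...) in place of stream_commit_cb(...), and a commit_prepared_cb(...) follows later as an ordinary non-streamed message. The callback itself is supplied by the output plugin; below is a minimal sketch in the style of test_decoding's stream callbacks (the function name and output text are illustrative assumptions, not the committed test_decoding code):

/*
 * Minimal sketch of a stream_prepare_cb implementation. The parameters
 * match the LogicalDecodeStreamPrepareCB callback signature; the body
 * just emits a textual marker for the prepare.
 */
static void
my_stream_prepare_cb(LogicalDecodingContext *ctx,
					 ReorderBufferTXN *txn,
					 XLogRecPtr prepare_lsn)
{
	OutputPluginPrepareWrite(ctx, true);
	appendStringInfo(ctx->out, "preparing streamed transaction '%s'",
					 txn->gid);
	OutputPluginWrite(ctx, true);
}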
On Tue, Jul 20, 2021 at 9:24 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please find attached the latest patch set v98*
>

Review comments:
================
1.
 /*
- * Handle STREAM COMMIT message.
+ * Common spoolfile processing.
+ * Returns how many changes were applied.
  */
-static void
-apply_handle_stream_commit(StringInfo s)
+static int
+apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)

Let's extract this common functionality (common to the current code and the patch) as a separate patch; I think we can commit it independently.

2.
apply_spooled_messages()
{
..
elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
nchanges, path);
..
}

You have this DEBUG1 message in apply_spooled_messages and its callers. You can remove it from the callers as the patch already has another debug message to indicate whether it is stream prepare or stream commit. Also, if this is the only reason to return nchanges from apply_spooled_messages() then we can get rid of that as well.

3.
+ /*
+ * 2. Mark the transaction as prepared. - Similar code as for
+ * apply_handle_prepare (i.e. two-phase non-streamed prepare)
+ */
+
+ /*
+ * BeginTransactionBlock is necessary to balance the EndTransactionBlock
+ * called within the PrepareTransactionBlock below.
+ */
+ BeginTransactionBlock();
+ CommitTransactionCommand(); /* Completes the preceding Begin command. */
+
+ /*
+ * Update origin state so we can restart streaming from correct position
+ * in case of crash.
+ */
+ replorigin_session_origin_lsn = prepare_data.end_lsn;
+ replorigin_session_origin_timestamp = prepare_data.prepare_time;
+
+ PrepareTransactionBlock(gid);

I think you can move this part into a common function apply_handle_prepare_internal. If that is possible then you might want to move this part into the common-functionality patch as mentioned in point 1.

4.
+ xid = logicalrep_read_stream_prepare(s, &prepare_data);
+ elog(DEBUG1, "received prepare for streamed transaction %u", xid);

It is better to have an empty line between the above code lines for the sake of clarity.

5.
+/* Commit (and abort) information */
 typedef struct LogicalRepCommitData

How is this structure related to abort? Even if it is, why does this comment belong to this patch?

6. Most of the code in logicalrep_write_stream_prepare() and logicalrep_write_prepare() is the same except for the message type. I think if we want we can handle both of them with a single message by setting some flag for the stream case, but probably there will be some additional checking required on the worker side. What do you think? If we want to keep them separate then at least we should keep the common functionality of logicalrep_write_*/logicalrep_read_* in separate functions. This way we will avoid minor inconsistencies between the stream and non-stream functions.

7.
+++ b/doc/src/sgml/protocol.sgml
@@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
    Begin Prepare and Prepare messages belong to the same transaction.
    It also sends changes of large in-progress transactions between a pair of
    Stream Start and Stream Stop messages. The last stream of such a transaction
-   contains a Stream Commit or Stream Abort message.
+   contains a Stream Prepare, Stream Commit or Stream Abort message.

I am not sure if it is correct to mention Stream Prepare here because after that we will send a commit prepared as well for such a transaction. So, I think we should remove this change.
8.
-ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
-
\dRs+

+ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);

Is there a reason for this change in the tests?

9.
I think there are a lot of streaming tests in 023_twophase_stream. Let's keep just one test for the crash-restart scenario (+# Check that 2PC COMMIT PREPARED is decoded properly on crash restart.) where both publisher and subscriber get restarted. I think others are covered in one way or another by other existing tests. Apart from that, I also don't see the need for the below tests:
# Do DELETE after PREPARE but before COMMIT PREPARED.
This is mostly the same as the previous test where the patch is testing Insert.
# Try 2PC transaction works using an empty GID literal
This is covered in 021_twophase.

10.
+++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
@@ -0,0 +1,271 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Test cascading logical replication of 2PC.

In the above comment, you might want to say something about streaming. In general, I am not sure it is really adding value to have this many streaming tests for the cascaded setup, doing the whole setup again after what we have already done in 022_twophase_cascade. I think it is sufficient to do just one or two streaming tests by enhancing 022_twophase_cascade; you can alter the subscription to enable streaming after doing the non-streaming tests.

11. Have you verified that all these tests went through the streaming code path? If not, you can temporarily enable the DEBUG message in apply_handle_stream_prepare() and see if all tests hit it.

--
With Regards,
Amit Kapila.
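[Editor's note] Regarding point 6 above, the kind of shared writer being suggested could look roughly like this (a sketch only: the flags byte, field order, and the prepare-time field name are assumptions pieced together from the snippets in this thread, not the committed code):

/*
 * Sketch of a common writer shared by logicalrep_write_prepare() and
 * logicalrep_write_stream_prepare(); only the leading message-type
 * byte differs between the two callers.
 */
static void
logicalrep_write_prepare_common(StringInfo out, char msgtype,
								ReorderBufferTXN *txn, XLogRecPtr prepare_lsn)
{
	pq_sendbyte(out, msgtype);		/* message type differs per caller */
	pq_sendbyte(out, 0);			/* flags (currently unused) */
	pq_sendint64(out, prepare_lsn);	/* LSN of the prepare */
	pq_sendint64(out, txn->end_lsn);	/* end LSN of the transaction */
	pq_sendint64(out, txn->xact_time.prepare_time);	/* field name assumed */
	pq_sendint32(out, txn->xid);
	pq_sendstring(out, txn->gid);	/* transaction GID */
}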
On Fri, Jul 23, 2021 at 8:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 20, 2021 at 9:24 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Please find attached the latest patch set v98*
> >
>
> Review comments:
> ================
[...]
> With Regards,
> Amit Kapila.

Thanks for your review comments. I have been working through them today and hope to post the v99* patches tomorrow.

------
Kind Regards,
Peter Smith.
Fujitsu Australia
On Fri, Jul 23, 2021 at 8:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 20, 2021 at 9:24 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Please find attached the latest patch set v98*
> >
>
> Review comments:
> ================

All the following review comments are addressed in the v99* patch set.

> 1.
>  /*
> - * Handle STREAM COMMIT message.
> + * Common spoolfile processing.
> + * Returns how many changes were applied.
>   */
> -static void
> -apply_handle_stream_commit(StringInfo s)
> +static int
> +apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
>
> Let's extract this common functionality (common to the current code and
> the patch) as a separate patch; I think we can commit it independently.
>

Done. Split patches as requested.

> 2.
> apply_spooled_messages()
> {
> ..
> elog(DEBUG1, "replayed %d (all) changes from file \"%s\"",
> nchanges, path);
> ..
> }
>
> You have this DEBUG1 message in apply_spooled_messages and its
> callers. You can remove it from the callers as the patch already has
> another debug message to indicate whether it is stream prepare or
> stream commit. Also, if this is the only reason to return nchanges
> from apply_spooled_messages() then we can get rid of that as well.
>

Done.

> 3.
> + /*
> + * 2. Mark the transaction as prepared. - Similar code as for
> + * apply_handle_prepare (i.e. two-phase non-streamed prepare)
> + */
> +
> + /*
> + * BeginTransactionBlock is necessary to balance the EndTransactionBlock
> + * called within the PrepareTransactionBlock below.
> + */
> + BeginTransactionBlock();
> + CommitTransactionCommand(); /* Completes the preceding Begin command. */
> +
> + /*
> + * Update origin state so we can restart streaming from correct position
> + * in case of crash.
> + */
> + replorigin_session_origin_lsn = prepare_data.end_lsn;
> + replorigin_session_origin_timestamp = prepare_data.prepare_time;
> +
> + PrepareTransactionBlock(gid);
>
> I think you can move this part into a common function
> apply_handle_prepare_internal. If that is possible then you might want
> to move this part into the common-functionality patch as mentioned in
> point 1.
>

Done. (The common function is included in patch 0001.)

> 4.
> + xid = logicalrep_read_stream_prepare(s, &prepare_data);
> + elog(DEBUG1, "received prepare for streamed transaction %u", xid);
>
> It is better to have an empty line between the above code lines for
> the sake of clarity.
>

Done.

> 5.
> +/* Commit (and abort) information */
>  typedef struct LogicalRepCommitData
>
> How is this structure related to abort? Even if it is, why does this
> comment belong to this patch?
>

OK. Removed this from the patch.

> 6. Most of the code in logicalrep_write_stream_prepare() and
> logicalrep_write_prepare() is the same except for the message type. I
> think if we want we can handle both of them with a single message by
> setting some flag for the stream case, but probably there will be some
> additional checking required on the worker side. What do you think? If
> we want to keep them separate then at least we should keep the common
> functionality of logicalrep_write_*/logicalrep_read_* in separate
> functions. This way we will avoid minor inconsistencies between the
> stream and non-stream functions.
>

Done. (The common functions are included in patch 0001.)

> 7.
> +++ b/doc/src/sgml/protocol.sgml
> @@ -2881,7 +2881,7 @@ The commands accepted in replication mode are:
>    Begin Prepare and Prepare messages belong to the same transaction.
>    It also sends changes of large in-progress transactions between a pair of
>    Stream Start and Stream Stop messages. The last stream of such a transaction
> -   contains a Stream Commit or Stream Abort message.
> +   contains a Stream Prepare, Stream Commit or Stream Abort message.
>
> I am not sure if it is correct to mention Stream Prepare here because
> after that we will send a commit prepared as well for such a
> transaction. So, I think we should remove this change.
>

Done.

> 8.
> -ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
> -
> \dRs+
>
> +ALTER SUBSCRIPTION regress_testsub SET (slot_name = NONE);
>
> Is there a reason for this change in the tests?
>

Yes, the setting of slot_name = NONE really belongs with the DROP SUBSCRIPTION. Similarly, the \dRs+ is done to test the effect of the setting of the streaming option (not the slot_name = NONE). Since I needed to add a new DROP SUBSCRIPTION (because now the streaming option works), I also refactored this existing test to make all the test formats consistent.

> 9.
> I think there are a lot of streaming tests in 023_twophase_stream.
> Let's keep just one test for the crash-restart scenario (+# Check that 2PC
> COMMIT PREPARED is decoded properly on crash restart.) where both
> publisher and subscriber get restarted. I think others are covered in
> one way or another by other existing tests. Apart from that, I also
> don't see the need for the below tests:
> # Do DELETE after PREPARE but before COMMIT PREPARED.
> This is mostly the same as the previous test where the patch is testing Insert.
> # Try 2PC transaction works using an empty GID literal
> This is covered in 021_twophase.
>

Done. Removed all the excessive tests as you suggested.

> 10.
> +++ b/src/test/subscription/t/024_twophase_cascade_stream.pl
> @@ -0,0 +1,271 @@
> +
> +# Copyright (c) 2021, PostgreSQL Global Development Group
> +
> +# Test cascading logical replication of 2PC.
>
> In the above comment, you might want to say something about streaming.
> In general, I am not sure it is really adding value to have this many
> streaming tests for the cascaded setup, doing the whole setup again
> after what we have already done in 022_twophase_cascade. I think it is
> sufficient to do just one or two streaming tests by enhancing
> 022_twophase_cascade; you can alter the subscription to enable streaming
> after doing the non-streaming tests.
>

Done. Removed the 024 TAP tests, and instead merged the streaming cascade tests into 022_twophase_cascade.pl as you suggested.

> 11. Have you verified that all these tests went through the streaming
> code path? If not, you can temporarily enable the DEBUG message in
> apply_handle_stream_prepare() and see if all tests hit it.
>

Yeah, that was done a very long time ago when the tests were first written. Anyway, just to be certain, I temporarily modified the code as suggested and confirmed from the logfiles that the tests are running through apply_handle_stream_prepare.

------
Kind Regards,
Peter Smith.
Fujitsu Australia.
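[Editor's note] Putting the pieces together, the stream-prepare handler in the split patches ends up with roughly the following shape. This is a condensed sketch assembled from the snippets quoted in this thread (error handling, statistics reporting, and flush-position bookkeeping are omitted, and the exact field and helper names are assumptions based on the quoted v99 code):

/*
 * Sketch: handle a STREAM PREPARE message by replaying the spooled
 * changes of the streamed transaction and then preparing it.
 */
static void
apply_handle_stream_prepare(StringInfo s)
{
	LogicalRepPreparedTxnData prepare_data;
	TransactionId xid;
	char		gid[GIDSIZE];

	xid = logicalrep_read_stream_prepare(s, &prepare_data);

	elog(DEBUG1, "received prepare for streamed transaction %u", xid);

	/* Compute the GID for this two_phase transaction. */
	TwoPhaseTransactionGid(MySubscription->oid, xid, gid, sizeof(gid));

	/* 1. Replay all the spooled operations. */
	apply_spooled_messages(xid, prepare_data.prepare_lsn);

	/* 2. Mark the transaction as prepared. */
	apply_handle_prepare_internal(&prepare_data, gid);

	CommitTransactionCommand();
}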
Please find attached the latest patch set v99*

v98-0001 --> split into v99-0001 + v99-0002

Differences:

* Rebased to HEAD @ yesterday.
* Addresses review comments from Amit [1] and splits the v98 patch as requested.

----
[1] https://www.postgresql.org/message-id/CAA4eK1%2BizpAybqpEFp8%2BRx%3DC1Z1H_XLcRod_WYjBRv2Rn%2BDO2w%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
On Tue, Jul 27, 2021 at 11:41 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please find attached the latest patch set v99*
>
> v98-0001 --> split into v99-0001 + v99-0002
>

Pushed the first refactoring patch after making a few modifications as below.

1.
- /* open the spool file for the committed transaction */
+ /* Open the spool file for the committed/prepared transaction */
  changes_filename(path, MyLogicalRepWorker->subid, xid);

In the above comment, we don't need to say prepared. It can be done as part of the second patch.

2.
+apply_handle_prepare_internal(LogicalRepPreparedTxnData *prepare_data, char *gid)

I don't think there is any need for this function to take gid as input. It can compute it by itself instead of the callers doing it.

3.
+static TransactionId
+logicalrep_read_prepare_common(StringInfo in, char *msgtype,
+        LogicalRepPreparedTxnData *prepare_data)

I don't think the above function needs to return the xid because it is already present as part of prepare_data. Even if it is required for some reason by the second patch, let's do it as part of that, but I don't think it is required for the second patch.

4.
 /*
- * Write PREPARE to the output stream.
+ * Common code for logicalrep_write_prepare and logicalrep_write_stream_prepare.
  */

Here and at another similar place, we don't need to refer to logicalrep_write_stream_prepare as that is part of the second patch.

Few comments on the 0002 patch:
==========================
1.
+# ---------------------
+# 2PC + STREAMING TESTS
+# ---------------------
+
+# Setup logical replication (streaming = on)
+
+$node_B->safe_psql('postgres', "
+       ALTER SUBSCRIPTION tap_sub_B
+       SET (streaming = on);");
+
+$node_C->safe_psql('postgres', "
+       ALTER SUBSCRIPTION tap_sub_C
+       SET (streaming = on)");
+
+# Wait for subscribers to finish initialization
+$node_A->wait_for_catchup($appname_B);
+$node_B->wait_for_catchup($appname_C);

This is not the right way to determine if the new streaming option is enabled on the publisher. Even if there is no restart of the apply workers (and walsender) after you have enabled the option, the above wait will succeed. You need to do something like below, as we do in 001_rep_changes.pl:

$oldpid = $node_publisher->safe_psql('postgres',
    "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
);
$node_subscriber->safe_psql('postgres',
    "ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH (copy_data = false)"
);
$node_publisher->poll_query_until('postgres',
    "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
) or die "Timed out while waiting for apply to restart";

2.
+# Create some pre-existing content on publisher (uses same DDL as 015_stream test)

Here, in the comments, I don't see the need to say "uses same DDL ...".

--
With Regards,
Amit Kapila.
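[Editor's note] To illustrate point 2 above, the helper can derive the GID internally instead of taking it as a parameter, roughly like this (a sketch under assumed names, grounded only in the snippets quoted in this thread):

/*
 * Sketch: mark the remote transaction as prepared, computing the GID
 * from the subscription OID and the remote xid inside the helper.
 */
static void
apply_handle_prepare_internal(LogicalRepPreparedTxnData *prepare_data)
{
	char		gid[GIDSIZE];

	/* Compute a GID that is unique per (subscription, remote xid). */
	TwoPhaseTransactionGid(MySubscription->oid, prepare_data->xid,
						   gid, sizeof(gid));

	/*
	 * BeginTransactionBlock is necessary to balance the
	 * EndTransactionBlock called within PrepareTransactionBlock below.
	 */
	BeginTransactionBlock();
	CommitTransactionCommand();

	/*
	 * Update origin state so we can restart streaming from the correct
	 * position in case of a crash.
	 */
	replorigin_session_origin_lsn = prepare_data->end_lsn;
	replorigin_session_origin_timestamp = prepare_data->prepare_time;

	PrepareTransactionBlock(gid);
}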
Please find attached the latest patch set v100*

v99-0002 --> v100-0001

Differences:

* Rebased to HEAD @ today (needed because some recent commits [1][2] broke v99)
* Addresses v99 review comments from Amit [3].

----
[1] https://github.com/postgres/postgres/commit/201a76183e2056c2217129e12d68c25ec9c559c8
[2] https://github.com/postgres/postgres/commit/91f9861242cd7dcf28fae216b1d8b47551c9159d
[3] https://www.postgresql.org/message-id/CAA4eK1%2BNzcz%3DhzZJO16ZcqA7hZRa4RzGRwL_XXM%2Bdin8ehROaQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
On Thu, Jul 29, 2021 at 9:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 27, 2021 at 11:41 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Please find attached the latest patch set v99*
> >
> > v98-0001 --> split into v99-0001 + v99-0002
> >
>
> Pushed the first refactoring patch after making a few modifications as below.
>
> 1.
> - /* open the spool file for the committed transaction */
> + /* Open the spool file for the committed/prepared transaction */
>   changes_filename(path, MyLogicalRepWorker->subid, xid);
>
> In the above comment, we don't need to say prepared. It can be done as
> part of the second patch.
>

Updated comment in v100.

> 2.
> +apply_handle_prepare_internal(LogicalRepPreparedTxnData *prepare_data, char *gid)
>
> I don't think there is any need for this function to take gid as
> input. It can compute it by itself instead of the callers doing it.
>

OK.

> 3.
> +static TransactionId
> +logicalrep_read_prepare_common(StringInfo in, char *msgtype,
> +        LogicalRepPreparedTxnData *prepare_data)
>
> I don't think the above function needs to return the xid because it is
> already present as part of prepare_data. Even if it is required for
> some reason by the second patch, let's do it as part of that, but I
> don't think it is required for the second patch.
>

OK.

> 4.
>  /*
> - * Write PREPARE to the output stream.
> + * Common code for logicalrep_write_prepare and logicalrep_write_stream_prepare.
>   */
>
> Here and at another similar place, we don't need to refer to
> logicalrep_write_stream_prepare as that is part of the second patch.
>

Updated comment in v100.

> Few comments on the 0002 patch:
> ==========================
> 1.
> +# ---------------------
> +# 2PC + STREAMING TESTS
> +# ---------------------
> +
> +# Setup logical replication (streaming = on)
> +
> +$node_B->safe_psql('postgres', "
> +       ALTER SUBSCRIPTION tap_sub_B
> +       SET (streaming = on);");
> +
> +$node_C->safe_psql('postgres', "
> +       ALTER SUBSCRIPTION tap_sub_C
> +       SET (streaming = on)");
> +
> +# Wait for subscribers to finish initialization
> +$node_A->wait_for_catchup($appname_B);
> +$node_B->wait_for_catchup($appname_C);
>
> This is not the right way to determine if the new streaming option is
> enabled on the publisher. Even if there is no restart of the apply workers
> (and walsender) after you have enabled the option, the above wait will
> succeed. You need to do something like below, as we do in
> 001_rep_changes.pl:
>
> $oldpid = $node_publisher->safe_psql('postgres',
>     "SELECT pid FROM pg_stat_replication WHERE application_name = 'tap_sub';"
> );
> $node_subscriber->safe_psql('postgres',
>     "ALTER SUBSCRIPTION tap_sub SET PUBLICATION tap_pub_ins_only WITH
> (copy_data = false)"
> );
> $node_publisher->poll_query_until('postgres',
>     "SELECT pid != $oldpid FROM pg_stat_replication WHERE application_name
> = 'tap_sub';"
> ) or die "Timed out while waiting for apply to restart";
>

Fixed in v100 as suggested.

> 2.
> +# Create some pre-existing content on publisher (uses same DDL as 015_stream test)
>
> Here, in the comments, I don't see the need to say "uses same DDL ...".
>

Fixed in v100. Comment removed.

------
Kind Regards,
Peter Smith.
Fujitsu Australia
On Fri, Jul 30, 2021 at 9:32 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please find attached the latest patch set v100*
>
> v99-0002 --> v100-0001
>
> Differences:
>
> * Rebased to HEAD @ today (needed because some recent commits [1][2] broke v99)
>

The patch applies neatly, the tests pass, and the documentation looks good. A few minor comments:

1) This blank line is not required:
+-- two_phase and streaming are compatible.
+CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, streaming = true, two_phase = true);
+

2) Some points have a punctuation mark and some don't; we can make this consistent:
+###############################
+# Test 2PC PREPARE / ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC
+# 3. Do rollback prepared.
+#
+# Expect data rolls back leaving only the original 2 rows.
+###############################

3) Similarly here too:
+###############################
+# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
+# 1. Table is deleted back to 2 rows which are replicated on subscriber.
+# 2. Data is streamed using 2PC.
+# 3. A single row INSERT is done which is after the PREPARE
+# 4. Then do a ROLLBACK PREPARED.
+#
+# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
+# (the original 2 + inserted 1)
+###############################

Regards,
Vignesh
On Fri, Jul 30, 2021 at 2:02 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please find attached the latest patch set v100*
>
> v99-0002 --> v100-0001
>

A few minor comments:

(1) doc/src/sgml/protocol.sgml

In the following description, is the word "large" really needed? Also "the message ... for a ... message" sounds a bit odd, as does "two-phase prepare".

What about the following:

BEFORE:
+ Identifies the message as a two-phase prepare for a large in-progress transaction message.
AFTER:
+ Identifies the message as a prepare for an in-progress two-phase transaction.

(2) src/backend/replication/logical/worker.c

Similarly formatted comments, but one uses a full stop and the other doesn't; it looks a bit odd since the lines are near each other.

* 1. Replay all the spooled operations - Similar code as for

* 2. Mark the transaction as prepared. - Similar code as for

(3) src/test/subscription/t/023_twophase_stream.pl

Shouldn't the following comment mention, for example, "with streaming" or something to that effect?

# logical replication of 2PC test

Regards,
Greg Nancarrow
Fujitsu Australia
On Friday, July 30, 2021 12:02 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please find attached the latest patch set v100*
>
> v99-0002 --> v100-0001
>

Thanks for your patch. A few comments on the test file:

1. src/test/subscription/t/022_twophase_cascade.pl

1.1
I saw your test cases for "PREPARE / COMMIT PREPARED" and "PREPARE with a nested ROLLBACK TO SAVEPOINT", but didn't see cases for "PREPARE / ROLLBACK PREPARED". Is it needless or just missing?

1.2
+# check inserts are visible at subscriber(s).
+# All the streamed data (prior to the SAVEPOINT) should be rolled back.
+# (3, 'foobar') should be committed.

I think it should be (9999, 'foobar') here.

1.3
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
+is($result, qq(1), 'Rows committed are present on subscriber B');
+$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
+

It seems the test is not finished yet. We didn't check the value of 'result'. Besides, maybe we should also check node_C, right?

1.4
+$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));

You see, the first line uses a TAB but the second line uses a space. Also, we could use only one statement to append these two settings to run the tests a bit faster. Thoughts? Something like:

$node_B->append_conf(
	'postgresql.conf', qq(
max_prepared_transactions = 10
logical_decoding_work_mem = 64kB
));

Regards
Tang
On Wed, Jul 14, 2021 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 12, 2021 at 9:14 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Pushed. > As reported by Michael [1], there is one test failure related to this commit. The failure is as below: # Failed test 'transaction is prepared on subscriber' # at t/021_twophase.pl line 324. # got: '1' # expected: '2' # Looks like you failed 1 test of 24. [12:14:02] t/021_twophase.pl .................. Dubious, test returned 1 (wstat 256, 0x100) Failed 1/24 subtests [12:14:12] t/022_twophase_cascade.pl .......... ok 10542 ms ( 0.00 usr 0.00 sys + 2.03 cusr 0.61 csys = 2.64 CPU) [12:14:31] t/100_bugs.pl ...................... ok 18550 ms ( 0.00 usr 0.00 sys + 3.85 cusr 1.36 csys = 5.21 CPU) [12:14:31] I think I know what's going wrong here. The corresponding test is: # Now do a prepare on publisher and check that it IS replicated $node_publisher->safe_psql('postgres', " BEGIN; INSERT INTO tab_copy VALUES (99); PREPARE TRANSACTION 'mygid';"); $node_publisher->wait_for_catchup($appname_copy); # Check that the transaction has been prepared on the subscriber, there will be 2 # prepared transactions for the 2 subscriptions. $result = $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM pg_prepared_xacts;"); is($result, qq(2), 'transaction is prepared on subscriber'); Here, the test is expecting 2 prepared transactions corresponding to two subscriptions but it waits for just one subscription via appname_copy. It should wait for the second subscription using $appname as well. What do you think? [1] - https://www.postgresql.org/message-id/YQP02%2B5yLCIgmdJY%40paquier.xyz -- With Regards, Amit Kapila.
On Sat, Jul 31, 2021 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Here, the test is expecting 2 prepared transactions corresponding to > two subscriptions but it waits for just one subscription via > appname_copy. It should wait for the second subscription using > $appname as well. > > What do you think? I agree with this analysis. The test needs to wait for both subscriptions to catch up. Attached is a patch that addresses this issue. regards, Ajin Cherian Fujitsu Australia
Attachment
On Sat, Jul 31, 2021 at 11:12 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Sat, Jul 31, 2021 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Here, the test is expecting 2 prepared transactions corresponding to
> > two subscriptions but it waits for just one subscription via
> > appname_copy. It should wait for the second subscription using
> > $appname as well.
> >
> > What do you think?
>
> I agree with this analysis. The test needs to wait for both
> subscriptions to catch up.
> Attached is a patch that addresses this issue.
>

LGTM. Unless Peter Smith has any comments or thinks otherwise, I'll push this on Monday.

--
With Regards,
Amit Kapila.
On Fri, Jul 30, 2021 at 9:32 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> Please find attached the latest patch set v100*
>

A few minor comments:

1.
CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, two_phase = true);

\dRs+
+
--fail - alter of two_phase option not supported.
ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);

Spurious line addition.

2.
+TransactionId
+logicalrep_read_stream_prepare(StringInfo in, LogicalRepPreparedTxnData *prepare_data)
+{
+ logicalrep_read_prepare_common(in, "stream prepare", prepare_data);
+
+ return prepare_data->xid;
+}

There is no need to return the TransactionId separately. The caller can use it from prepare_data, if required.

3.
extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
TransactionId *subxid);

+extern void logicalrep_write_stream_prepare(StringInfo out, ReorderBufferTXN *txn,
+ XLogRecPtr prepare_lsn);
+extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
+ LogicalRepPreparedTxnData *prepare_data);
+
+

Keep the order of the declarations the same as their definitions in proto.c, which means move these after logicalrep_read_rollback_prepared(), and be careful about extra blank lines.

--
With Regards,
Amit Kapila.
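[Editor's note] Following comment 2 above, the reader then simplifies to returning nothing, since the xid already travels inside prepare_data (the same function as quoted, with the return value dropped):

/*
 * Sketch: read a STREAM PREPARE message. The caller gets the xid via
 * prepare_data->xid, so there is nothing to return.
 */
void
logicalrep_read_stream_prepare(StringInfo in,
							   LogicalRepPreparedTxnData *prepare_data)
{
	logicalrep_read_prepare_common(in, "stream prepare", prepare_data);
}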
On Sat, Jul 31, 2021 at 11:12 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Sat, Jul 31, 2021 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Here, the test is expecting 2 prepared transactions corresponding to > > two subscriptions but it waits for just one subscription via > > appname_copy. It should wait for the second subscription using > > $appname as well. > > > > What do you think? > > I agree with this analysis. The test needs to wait for both > subscriptions to catch up. > Attached is a patch that addresses this issue. The changes look good to me. Regards, Vignesh
On Sun, Aug 1, 2021 at 3:05 AM vignesh C <vignesh21@gmail.com> wrote: > > On Sat, Jul 31, 2021 at 11:12 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > On Sat, Jul 31, 2021 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > Here, the test is expecting 2 prepared transactions corresponding to > > > two subscriptions but it waits for just one subscription via > > > appname_copy. It should wait for the second subscription using > > > $appname as well. > > > > > > What do you think? > > > > I agree with this analysis. The test needs to wait for both > > subscriptions to catch up. > > Attached is a patch that addresses this issue. > > The changes look good to me. > The patch to the test code posted by Ajin LGTM also. I applied the patch and re-ran the TAP subscription tests. All OK. ------ Kind Regards, Peter Smith. Fujitsu Australia
Please find attached the latest patch v101

Differences:

* Rebased to HEAD @ today.
* Addresses all v100 review comments from Vignesh [1], Greg [2], Tang [3], and Amit [4].

----
[1] https://www.postgresql.org/message-id/CALDaNm2N3qgSv3XyHW%2Bop_SJcLmz1s%3D0jJc-taxUmeEBXW5EPw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAJcOf-eGCg8s%2BtT_Mo5xKksAhA%3D%3D1QAH_Sj7SqBotHQhwapdEw%40mail.gmail.com
[3] https://www.postgresql.org/message-id/OS0PR01MB6113B6F3C88C3C11A2F62CFFFBEC9%40OS0PR01MB6113.jpnprd01.prod.outlook.com
[4] https://www.postgresql.org/message-id/CAA4eK1%2BVcNDUYSZVm3xNg4YLzaMqcZHqxznfbAYvJWoVzvLqFQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
Attachment
On Sat, Jul 31, 2021 at 9:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jul 30, 2021 at 9:32 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Please find attached the latest patch set v100*
> >
>
> A few minor comments:
> 1.
> CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
> false, two_phase = true);
>
> \dRs+
> +
> --fail - alter of two_phase option not supported.
> ALTER SUBSCRIPTION regress_testsub SET (two_phase = false);
>
> Spurious line addition.
>

OK. Fixed in v101.

> 2.
> +TransactionId
> +logicalrep_read_stream_prepare(StringInfo in,
> LogicalRepPreparedTxnData *prepare_data)
> +{
> + logicalrep_read_prepare_common(in, "stream prepare", prepare_data);
> +
> + return prepare_data->xid;
> +}
>
> There is no need to return the TransactionId separately. The caller can
> use it from prepare_data, if required.
>

OK. Modified in v101.

> 3.
> extern void logicalrep_read_stream_abort(StringInfo in, TransactionId *xid,
> TransactionId *subxid);
>
> +extern void logicalrep_write_stream_prepare(StringInfo out,
> ReorderBufferTXN *txn,
> + XLogRecPtr prepare_lsn);
> +extern TransactionId logicalrep_read_stream_prepare(StringInfo in,
> + LogicalRepPreparedTxnData *prepare_data);
> +
> +
>
> Keep the order of the declarations the same as their definitions in proto.c,
> which means move these after logicalrep_read_rollback_prepared(), and
> be careful about extra blank lines.
>

OK. Reordered in v101.

------
Kind Regards,
Peter Smith.
Fujitsu Australia
On Fri, Jul 30, 2021 at 6:25 PM tanghy.fnst@fujitsu.com <tanghy.fnst@fujitsu.com> wrote:
>
> On Friday, July 30, 2021 12:02 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Please find attached the latest patch set v100*
> >
> > v99-0002 --> v100-0001
> >
>
> Thanks for your patch. A few comments on the test file:
>
> 1. src/test/subscription/t/022_twophase_cascade.pl
>
> 1.1
> I saw your test cases for "PREPARE / COMMIT PREPARED" and "PREPARE with a nested ROLLBACK TO SAVEPOINT", but didn't see cases for "PREPARE / ROLLBACK PREPARED". Is it needless or just missing?
>

Yes, that test used to exist but it was removed in response to a previous review (see [1] comment #10; Amit said there were too many tests).

> 1.2
> +# check inserts are visible at subscriber(s).
> +# All the streamed data (prior to the SAVEPOINT) should be rolled back.
> +# (3, 'foobar') should be committed.
>
> I think it should be (9999, 'foobar') here.
>

Good catch. Fixed in v101.

> 1.3
> +$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab where b = 'foobar';");
> +is($result, qq(1), 'Rows committed are present on subscriber B');
> +$result = $node_B->safe_psql('postgres', "SELECT count(*) FROM test_tab;");
> +
>
> It seems the test is not finished yet. We didn't check the value of 'result'. Besides, maybe we should also check node_C, right?
>

Oops. Thanks for finding this! Fixed in v101 by adding the missing tests.

> 1.4
> +$node_B->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
> +$node_B->append_conf('postgresql.conf', qq(logical_decoding_work_mem = 64kB));
>
> You see, the first line uses a TAB but the second line uses a space.
> Also, we could use only one statement to append these two settings to run the tests a bit faster. Thoughts?
> Something like:
>
> $node_B->append_conf(
> 	'postgresql.conf', qq(
> max_prepared_transactions = 10
> logical_decoding_work_mem = 64kB
> ));
>

OK. In v101 I changed the config as you suggested for both the 022 and 023 TAP tests.

------
[1] https://www.postgresql.org/message-id/CAHut%2BPts_bWx_RrXu%2BYwbiJva33nTROoQQP5H4pVrF%2BNcCMkRA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.
On Fri, Jul 30, 2021 at 3:18 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, Jul 30, 2021 at 9:32 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Please find attached the latest patch set v100*
> >
> > v99-0002 --> v100-0001
> >
> > Differences:
> >
> > * Rebased to HEAD @ today (needed because some recent commits [1][2] broke v99)
> >
>
> The patch applies neatly, the tests pass, and the documentation looks good.
> A few minor comments:
> 1) This blank line is not required:
> +-- two_phase and streaming are compatible.
> +CREATE SUBSCRIPTION regress_testsub CONNECTION
> 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect =
> false, streaming = true, two_phase = true);
> +
>

Fixed in v101.

> 2) Some points have a punctuation mark and some don't; we can make this consistent:
> +###############################
> +# Test 2PC PREPARE / ROLLBACK PREPARED.
> +# 1. Table is deleted back to 2 rows which are replicated on subscriber.
> +# 2. Data is streamed using 2PC
> +# 3. Do rollback prepared.
> +#
> +# Expect data rolls back leaving only the original 2 rows.
> +###############################
>

Fixed in v101.

> 3) Similarly here too:
> +###############################
> +# Do INSERT after the PREPARE but before ROLLBACK PREPARED.
> +# 1. Table is deleted back to 2 rows which are replicated on subscriber.
> +# 2. Data is streamed using 2PC.
> +# 3. A single row INSERT is done which is after the PREPARE
> +# 4. Then do a ROLLBACK PREPARED.
> +#
> +# Expect the 2PC data rolls back leaving only 3 rows on the subscriber.
> +# (the original 2 + inserted 1)
> +###############################
>

Fixed in v101.

------
Kind Regards,
Peter Smith.
Fujitsu Australia
On Fri, Jul 30, 2021 at 4:33 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Fri, Jul 30, 2021 at 2:02 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > Please find attached the latest patch set v100*
> >
> > v99-0002 --> v100-0001
> >
>
> A few minor comments:
>
> (1) doc/src/sgml/protocol.sgml
>
> In the following description, is the word "large" really needed? Also
> "the message ... for a ... message" sounds a bit odd, as does
> "two-phase prepare".
>
> What about the following:
>
> BEFORE:
> + Identifies the message as a two-phase prepare for a
> large in-progress transaction message.
> AFTER:
> + Identifies the message as a prepare for an
> in-progress two-phase transaction.
>

Updated in v101. The other nearby messages refer to a “streamed transaction”, so I’ve changed this to say “Identifies the message as a two-phase prepare for a streamed transaction message.” (e.g. compare this text with the existing similar text for ‘P’).

BTW, I agree with you that "the message ... for a ... message" seems odd; it was written in this way only to be consistent with existing documentation, which all uses the same odd phrasing.

> (2) src/backend/replication/logical/worker.c
>
> Similarly formatted comments, but one uses a full stop and the other
> doesn't; it looks a bit odd since the lines are near each other.
>
> * 1. Replay all the spooled operations - Similar code as for
>
> * 2. Mark the transaction as prepared. - Similar code as for
>

Updated in v101 to make the comments consistent.

> (3) src/test/subscription/t/023_twophase_stream.pl
>
> Shouldn't the following comment mention, for example, "with streaming"
> or something to that effect?
>
> # logical replication of 2PC test
>

Fixed as suggested in v101.

------
Kind Regards,
Peter Smith.
Fujitsu Australia.
On Sun, Aug 1, 2021 at 3:51 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Sun, Aug 1, 2021 at 3:05 AM vignesh C <vignesh21@gmail.com> wrote: > > > > On Sat, Jul 31, 2021 at 11:12 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > On Sat, Jul 31, 2021 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > Here, the test is expecting 2 prepared transactions corresponding to > > > > two subscriptions but it waits for just one subscription via > > > > appname_copy. It should wait for the second subscription using > > > > $appname as well. > > > > > > > > What do you think? > > > > > > I agree with this analysis. The test needs to wait for both > > > subscriptions to catch up. > > > Attached is a patch that addresses this issue. > > > > The changes look good to me. > > > > The patch to the test code posted by Ajin LGTM also. > Pushed. -- With Regards, Amit Kapila.
Please find attached the latest patch set v102* Differences: * Rebased to HEAD @ today. * This is a documentation change only. A recent commit [1] has changed the documentation style for the message formats slightly to annotate the data types. For consistency, the same style change needs to be adopted for the newly added message of this patch. This same change also finally addresses some old review comments [2] from Vignesh. ---- [1] https://github.com/postgres/postgres/commit/a5cb4f9829fbfd68655543d2d371a18a8eb43b84 [2] https://www.postgresql.org/message-id/CALDaNm3U4fGxTnQfaT1TqUkgX5c0CSDvmW12Bfksis8zB_XinA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Attachment
On Mon, May 10, 2021 at 1:31 PM vignesh C <vignesh21@gmail.com> wrote:
> ...
>
> 2) I felt we can change the lsn data type from Int64 to XLogRecPtr:
> +<varlistentry>
> +<term>Int64</term>
> +<listitem><para>
> + The LSN of the prepare.
> +</para></listitem>
> +</varlistentry>
> +
> +<varlistentry>
> +<term>Int64</term>
> +<listitem><para>
> + The end LSN of the transaction.
> +</para></listitem>
> +</varlistentry>
>
> 3) I felt we can change the xid data type from Int32 to TransactionId:
> +<varlistentry>
> +<term>Int32</term>
> +<listitem><para>
> + Xid of the subtransaction (will be same as xid of the
> transaction for top-level
> + transactions).
> +</para></listitem>
> +</varlistentry>
>
> ...
>
> Similar problems related to comments 2 and 3 are being discussed at
> [1]; we can change it accordingly based on the conclusion in the other
> thread.
> [1] - https://www.postgresql.org/message-id/flat/CAHut%2BPs2JsSd_OpBR9kXt1Rt4bwyXAjh875gUpFw6T210ttO7Q%40mail.gmail.com#cf2a85d0623dcadfbb1204a196681313
>

Earlier today the other documentation patch mentioned above was committed by Tom Lane. The 2PC patch v102 now fixes your review comments 2 and 3 by matching the same datatype annotation style as that commit.

------
Kind Regards,
Peter Smith
Fujitsu Australia
On Tue, Aug 3, 2021 at 6:17 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Please find attached the latest patch set v102* > I have made minor modifications in the comments and docs, please see attached. Can you please check whether the names of contributors in the commit message are correct or do we need to change it? -- With Regards, Amit Kapila.
Attachment
On Tue, Aug 3, 2021 at 5:02 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Aug 3, 2021 at 6:17 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Please find attached the latest patch set v102* > > > > I have made minor modifications in the comments and docs, please see > attached. Can you please check whether the names of contributors in > the commit message are correct or do we need to change it? > I checked the differences between v102 and v103 and have no review comments about the latest changes. The commit message looks ok. I applied the v103 to the current HEAD; no errors. The build is ok. The make check is ok. The TAP subscription tests are ok. I also rebuilt the PG docs and verified rendering of the updated pages looks ok. The patch v103 LGTM. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Aug 3, 2021 at 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Aug 3, 2021 at 6:17 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Please find attached the latest patch set v102* > > > > I have made minor modifications in the comments and docs, please see > attached. Can you please check whether the names of contributors in > the commit message are correct or do we need to change it? The patch applies neatly, the tests pass and documentation built with the updates provided. I could not find any comments. The patch looks good to me. Regards, Vignesh
On Tuesday, August 3, 2021 6:03 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Aug 3, 2021 at 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Aug 3, 2021 at 6:17 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > Please find attached the latest patch set v102*
> > >
> >
> > I have made minor modifications in the comments and docs, please see
> > attached. Can you please check whether the names of contributors in
> > the commit message are correct or do we need to change it?
>
> The patch applies neatly, the tests pass and documentation built with
> the updates provided. I could not find any comments. The patch looks
> good to me.
>

I did some stress tests on the patch and found no issues. It also works well when using synchronized replication. So the patch LGTM.

Regards
Tang
On Wed, Aug 4, 2021 at 6:51 AM tanghy.fnst@fujitsu.com <tanghy.fnst@fujitsu.com> wrote:
>
> On Tuesday, August 3, 2021 6:03 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Aug 3, 2021 at 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Aug 3, 2021 at 6:17 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > Please find attached the latest patch set v102*
> > > >
> > >
> > > I have made minor modifications in the comments and docs, please see
> > > attached. Can you please check whether the names of contributors in
> > > the commit message are correct or do we need to change it?
> >
> > The patch applies neatly, the tests pass and documentation built with
> > the updates provided. I could not find any comments. The patch looks
> > good to me.
> >
>
> I did some stress tests on the patch and found no issues.
> It also works well when using synchronized replication.
> So the patch LGTM.
>

I have pushed this last patch in the series.

--
With Regards,
Amit Kapila.
On Wed, Aug 4, 2021 at 4:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have pushed this last patch in the series. > I have closed this CF entry. Thanks to everyone involved in this work! -- With Regards, Amit Kapila.
Hi,

On Mon, Aug 9, 2021 at 12:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Aug 4, 2021 at 4:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I have pushed this last patch in the series.
> >
>
> I have closed this CF entry. Thanks to everyone involved in this work!
>

I have a question about the two_phase column of the pg_replication_slots view: with this feature, pg_replication_slots has a new column two_phase:

                    View "pg_catalog.pg_replication_slots"
       Column        |  Type   | Collation | Nullable | Default
---------------------+---------+-----------+----------+---------
 slot_name           | name    |           |          |
 plugin              | name    |           |          |
 slot_type           | text    |           |          |
 datoid              | oid     |           |          |
 database            | name    |           |          |
 temporary           | boolean |           |          |
 active              | boolean |           |          |
 active_pid          | integer |           |          |
 xmin                | xid     |           |          |
 catalog_xmin        | xid     |           |          |
 restart_lsn         | pg_lsn  |           |          |
 confirmed_flush_lsn | pg_lsn  |           |          |
 wal_status          | text    |           |          |
 safe_wal_size       | bigint  |           |          |
 two_phase           | boolean |           |          |

According to the doc, the two_phase field has:

True if the slot is enabled for decoding prepared transactions. Always false for physical slots.

It's a bit unnatural to me that replication slots have such a property, since replication slots have been used to protect WAL and tuples that are required for logical decoding, physical replication, backup, etc. from removal. Also, it seems that even if a replication slot is created with two_phase = off, it's overwritten to on if the plugin enables the two-phase option. Is there any reason why we can turn this value on and off on the replication slot side, and is there any use case where the replication slot's two_phase is false and the plugin's two-phase option is on, and vice versa? I think that we can have replication slots always have a two_phase_at value and remove the two_phase field from the view.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Jan 4, 2022 at 9:00 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> According to the doc, the two_phase field has:
>
> True if the slot is enabled for decoding prepared transactions. Always
> false for physical slots.
>
> It's a bit unnatural to me that replication slots have such a property,
> since replication slots have been used to protect WAL and tuples
> that are required for logical decoding, physical replication,
> backup, etc. from removal. Also, it seems that even if a replication
> slot is created with two_phase = off, it's overwritten to on if the
> plugin enables the two-phase option. Is there any reason why we can turn
> this value on and off on the replication slot side, and is there any
> use case where the replication slot's two_phase is false and the
> plugin's two-phase option is on, and vice versa?
>

We enable two_phase only when we start streaming from the subscriber side. This is required because we can't enable it till the initial sync is complete; otherwise, it could lead to loss of data. See the comments atop worker.c (description under the title: TWO_PHASE TRANSACTIONS).

> I think that we can
> have replication slots always have a two_phase_at value and remove the
> two_phase field from the view.
>

I am not sure how that will work because we can allow streaming of prepared transactions only when it is enabled at CREATE SUBSCRIPTION time, the default for which is false.

--
With Regards,
Amit Kapila.