Thread: Synchronizing slots from primary to standby
I want to reactivate $subject. I took Petr Jelinek's patch from [0], rebased it, added a bit of testing. It basically works, but as mentioned in [0], there are various issues to work out.

The idea is that the standby runs a background worker to periodically fetch replication slot information from the primary. On failover, a logical subscriber would then ideally find up-to-date replication slots on the new publisher and can just continue normally.

The previous thread didn't have a lot of discussion, but I have gathered from off-line conversations that there is a wider agreement on this approach. So the next steps would be to make it more robust and configurable and documented.

As I said, I added a small test case to show that it works at all, but I think a lot more tests should be added. I have also found that this breaks some seemingly unrelated tests in the recovery test suite. I have disabled these here. I'm not sure if the patch actually breaks anything or if these are just differences in timing or implementation dependencies.

This patch adds a LIST_SLOTS replication command, but I think this could be replaced with just a SELECT FROM pg_replication_slots query now. (This patch is originally older than when you could run SELECT queries over the replication protocol.)

So, again, this isn't anywhere near ready, but there is already a lot here to gather feedback about how it works, how it should work, how to configure it, and how it fits into an overall replication and HA architecture.

[0]: https://www.postgresql.org/message-id/flat/3095349b-44d4-bf11-1b33-7eefb585d578%402ndquadrant.com
Attachment
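As a point of reference for the LIST_SLOTS-versus-SELECT question above, this is roughly the kind of query a standby could run instead, over a logical replication connection (replication=database), which accepts plain SQL; the column list is only a sketch of what a sync worker would need:

    SELECT slot_name, plugin, slot_type, database,
           restart_lsn, confirmed_flush_lsn, catalog_xmin
    FROM pg_catalog.pg_replication_slots
    WHERE slot_type = 'logical';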
On Sun, Oct 31, 2021 at 7:08 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > I want to reactivate $subject. I took Petr Jelinek's patch from [0], > rebased it, added a bit of testing. It basically works, but as > mentioned in [0], there are various issues to work out. Thank you for working on this feature! > > The idea is that the standby runs a background worker to periodically > fetch replication slot information from the primary. On failover, a > logical subscriber would then ideally find up-to-date replication slots > on the new publisher and can just continue normally. > > I have also found that this breaks some seemingly unrelated tests in > the recovery test suite. I have disabled these here. I'm not sure if > the patch actually breaks anything or if these are just differences in > timing or implementation dependencies. I haven’t looked at the patch deeply but regarding 007_sync_rep.pl, the tests seem to fail since the tests rely on the order of the wal sender array on the shared memory. Since a background worker for synchronizing replication slots periodically connects to the walsender on the primary and disconnects, it breaks the assumption of the order. Regarding 010_logical_decoding_timelines.pl, I guess that the patch breaks the test because the background worker for synchronizing slots on the replica periodically advances the replica's slot. I think we need to have a way to disable the slot synchronization or to specify the slot name to sync with the primary. I'm not sure we already discussed this topic but I think we need it at least for testing purposes. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
Hi all, Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes: > I want to reactivate $subject. I took Petr Jelinek's patch from [0], > rebased it, added a bit of testing. It basically works, but as > mentioned in [0], there are various issues to work out. Thanks for working on that topic, I believe it's an important part of Postgres' HA story. > The idea is that the standby runs a background worker to periodically fetch > replication slot information from the primary. On failover, a logical > subscriber would then ideally find up-to-date replication slots on the new > publisher and can just continue normally. Is there a case to be made about doing the same thing for physical replication slots too? That's what pg_auto_failover [1] does by default: it creates replication slots on every node for every other node, in a way that a standby Postgres instance now maintains a replication slot for the primary. This ensures that after a promotion, the standby knows to retain any and all WAL segments that the primary might need when rejoining, at pg_rewind time. > The previous thread didn't have a lot of discussion, but I have gathered > from off-line conversations that there is a wider agreement on this > approach. So the next steps would be to make it more robust and > configurable and documented. I suppose part of the configuration would then include taking care of physical slots. Some people might want to turn that off and use the Postgres 13+ ability to use the remote primary restore_command to fetch missing WAL files, instead. Well, people who have setup an archiving system, anyway. > As I said, I added a small test case to > show that it works at all, but I think a lot more tests should be added. I > have also found that this breaks some seemingly unrelated tests in the > recovery test suite. I have disabled these here. I'm not sure if the patch > actually breaks anything or if these are just differences in timing or > implementation dependencies. This patch adds a LIST_SLOTS replication > command, but I think this could be replaced with just a SELECT FROM > pg_replication_slots query now. (This patch is originally older than when > you could run SELECT queries over the replication protocol.) Given the admitted state of the patch, I didn't focus on tests. I could successfully apply the patch on-top of current master's branch, and cleanly compile and `make check`. Then I also updated pg_auto_failover to support Postgres 15devel [2] so that I could then `make NODES=3 cluster` there and play with the new replication command: $ psql -d "port=5501 replication=1" -c "LIST_SLOTS;" psql:/Users/dim/.psqlrc:24: ERROR: XX000: cannot execute SQL commands in WAL sender for physical replication LOCATION: exec_replication_command, walsender.c:1830 ... I'm not too sure about this idea of running SQL in a replication protocol connection that you're mentioning, but I suppose that's just me needing to brush up on the topic. > So, again, this isn't anywhere near ready, but there is already a lot here > to gather feedback about how it works, how it should work, how to configure > it, and how it fits into an overall replication and HA architecture. Maybe the first question about configuration would be about selecting which slots a standby should maintain from the primary. Is it all of the slots that exists on both the nodes, or a sublist of that? 
Is it possible to have a slot with the same name on a primary and a standby node, in a way that the standby's slot would be a completely separate entity from the primary's slot? If yes (I just don't know at the moment), well then, should we continue to allow that? Other aspects of the configuration might include a list of databases in which to make the new background worker active, and the polling delay, etc. Also, do we want to even consider having the slot management on a primary node depend on the ability to sync the advancing on one or more standby nodes? I'm not sure to see that one as a good idea, but maybe we want to kill it publically very early then ;-) Regards, -- dim Author of “The Art of PostgreSQL”, see https://theartofpostgresql.com [1]: https://github.com/citusdata/pg_auto_failover [2]: https://github.com/citusdata/pg_auto_failover/pull/838
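For readers unfamiliar with the pg_auto_failover arrangement Dimitri describes, a minimal sketch of the physical-slot side (slot and host names are illustrative, not from the patch): each node keeps a physical slot for its peer, so that after a promotion the new primary retains the WAL the old primary may need when it rejoins.

    -- run on each node, one slot per peer node
    SELECT pg_create_physical_replication_slot('peer_node_a', true);

    -- the peer then attaches to that slot when (re)joining as a standby:
    --   primary_conninfo  = 'host=new-primary.example.com user=repl'
    --   primary_slot_name = 'peer_node_a'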
On Sun, Oct 31, 2021 at 3:38 PM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > I want to reactivate $subject. I took Petr Jelinek's patch from [0], > rebased it, added a bit of testing. It basically works, but as > mentioned in [0], there are various issues to work out. > > The idea is that the standby runs a background worker to periodically > fetch replication slot information from the primary. On failover, a > logical subscriber would then ideally find up-to-date replication slots > on the new publisher and can just continue normally. > > The previous thread didn't have a lot of discussion, but I have gathered > from off-line conversations that there is a wider agreement on this > approach. So the next steps would be to make it more robust and > configurable and documented. As I said, I added a small test case to > show that it works at all, but I think a lot more tests should be added. > I have also found that this breaks some seemingly unrelated tests in > the recovery test suite. I have disabled these here. I'm not sure if > the patch actually breaks anything or if these are just differences in > timing or implementation dependencies. This patch adds a LIST_SLOTS > replication command, but I think this could be replaced with just a > SELECT FROM pg_replication_slots query now. (This patch is originally > older than when you could run SELECT queries over the replication protocol.) > > So, again, this isn't anywhere near ready, but there is already a lot > here to gather feedback about how it works, how it should work, how to > configure it, and how it fits into an overall replication and HA > architecture. > > > [0]: > https://www.postgresql.org/message-id/flat/3095349b-44d4-bf11-1b33-7eefb585d578%402ndquadrant.com Thanks for working on this patch. This feature will be useful as it avoids manual intervention during the failover. Here are some thoughts: 1) Instead of a new LIST_SLOT command, can't we use READ_REPLICATION_SLOT (slight modifications needs to be done to make it support logical replication slots and to get more information from the subscriber). 2) How frequently the new bg worker is going to sync the slot info? How can it ensure that the latest information exists say when the subscriber is down/crashed before it picks up the latest slot information? 3) Instead of the subscriber pulling the slot info, why can't the publisher (via the walsender or a new bg worker maybe?) push the latest slot info? I'm not sure we want to add more functionality to the walsender, if yes, isn't it going to be much simpler? 4) IIUC, the proposal works only for logical replication slots but do you also see the need for supporting some kind of synchronization of physical replication slots as well? IMO, we need a better and consistent way for both types of replication slots. If the walsender can somehow push the slot info from the primary (for physical replication slots)/publisher (for logical replication slots) to the standby/subscribers, this will be a more consistent and simplistic design. However, I'm not sure if this design is doable at all. Regards, Bharath Rupireddy.
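For reference on point 1, READ_REPLICATION_SLOT in the current PG 15 development tree reports only a few physical-slot fields (the slot type, restart_lsn, and the timeline), which is why it would need the extensions Bharath mentions before it could serve logical slot synchronization. A sketch of its use over a replication connection, with an illustrative slot name:

    $ psql -d "port=5501 replication=1" -c "READ_REPLICATION_SLOT standby1_slot;"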
> 3) Instead of the subscriber pulling the slot info, why can't the
> publisher (via the walsender or a new bg worker maybe?) push the
> latest slot info? I'm not sure we want to add more functionality to
> the walsender, if yes, isn't it going to be much simpler?
The standby pulling the information, or at least making the first attempt to connect to the primary, is a better design, as the primary doesn't need to spend its cycles repeatedly connecting to an unreachable standby. In fact, the primary wouldn't even need to know about its followers, for example log-shipping standbys.
On Mon, Nov 29, 2021 at 1:48 AM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
>
>> 3) Instead of the subscriber pulling the slot info, why can't the
>> publisher (via the walsender or a new bg worker maybe?) push the
>> latest slot info? I'm not sure we want to add more functionality to
>> the walsender, if yes, isn't it going to be much simpler?
>
> Standby pulling the information or at least making a first attempt to connect to the primary is a better design as primary doesn't need to spend its cycles repeatedly connecting to an unreachable standby. In fact, primary wouldn't even need to know the followers, for example followers / log shipping standbys

My idea was to let the existing walsender from the primary/publisher to send the slot info (both logical and physical replication slots) to the standby/subscriber, probably by piggybacking the slot info with the WAL currently it sends. Having said that, I don't know the feasibility of it. Anyways, I'm not in favour of having a new bg worker to just ship the slot info. The standby/subscriber, while making connection to primary/publisher, can choose to get the replication slot info.

As I said upthread, the problem I see with standby/subscriber pulling the info is that: how frequently the standby/subscriber is going to sync the slot info from primary/publisher? How can it ensure that the latest information exists say when the subscriber is down/crashed before it picks up the latest slot information?

IIUC, the initial idea proposed in this patch deals with only logical replication slots, not the physical replication slots; what I'm thinking is to have a generic way to deal with both of them.

Note: In the above description, I used primary-standby and publisher-subscriber to represent the physical and logical replication slots respectively.

Regards, Bharath Rupireddy.
On Mon, Nov 29, 2021 at 9:40 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Mon, Nov 29, 2021 at 1:48 AM SATYANARAYANA NARLAPURAM > <satyanarlapuram@gmail.com> wrote: > > > >> 3) Instead of the subscriber pulling the slot info, why can't the > >> publisher (via the walsender or a new bg worker maybe?) push the > >> latest slot info? I'm not sure we want to add more functionality to > >> the walsender, if yes, isn't it going to be much simpler? > > > > Standby pulling the information or at least making a first attempt to connect to the primary is a better design as primarydoesn't need to spend its cycles repeatedly connecting to an unreachable standby. In fact, primary wouldn't even needto know the followers, for example followers / log shipping standbys > > My idea was to let the existing walsender from the primary/publisher > to send the slot info (both logical and physical replication slots) to > the standby/subscriber, probably by piggybacking the slot info with > the WAL currently it sends. Having said that, I don't know the > feasibility of it. Anyways, I'm not in favour of having a new bg > worker to just ship the slot info. The standby/subscriber, while > making connection to primary/publisher, can choose to get the > replication slot info. I think it is possible that the standby is restoring the WAL directly from the archive location and there might not be any wal sender at time. So I think the idea of standby pulling the WAL looks better to me. > As I said upthread, the problem I see with standby/subscriber pulling > the info is that: how frequently the standby/subscriber is going to > sync the slot info from primary/publisher? How can it ensure that the > latest information exists say when the subscriber is down/crashed > before it picks up the latest slot information? Yeah that is a good question that how frequently the subscriber should fetch the slot information, I think that should be configurable values. And the time delay is more, the chances of losing the latest slot is more. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 29, 2021 at 11:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Nov 29, 2021 at 9:40 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Mon, Nov 29, 2021 at 1:48 AM SATYANARAYANA NARLAPURAM > > <satyanarlapuram@gmail.com> wrote: > > > > > >> 3) Instead of the subscriber pulling the slot info, why can't the > > >> publisher (via the walsender or a new bg worker maybe?) push the > > >> latest slot info? I'm not sure we want to add more functionality to > > >> the walsender, if yes, isn't it going to be much simpler? > > > > > > Standby pulling the information or at least making a first attempt to connect to the primary is a better design asprimary doesn't need to spend its cycles repeatedly connecting to an unreachable standby. In fact, primary wouldn't evenneed to know the followers, for example followers / log shipping standbys > > > > My idea was to let the existing walsender from the primary/publisher > > to send the slot info (both logical and physical replication slots) to > > the standby/subscriber, probably by piggybacking the slot info with > > the WAL currently it sends. Having said that, I don't know the > > feasibility of it. Anyways, I'm not in favour of having a new bg > > worker to just ship the slot info. The standby/subscriber, while > > making connection to primary/publisher, can choose to get the > > replication slot info. > > I think it is possible that the standby is restoring the WAL directly > from the archive location and there might not be any wal sender at > time. So I think the idea of standby pulling the WAL looks better to > me. My point was that why can't we let the walreceiver (of course users can configure it on the standby/subscriber) to choose whether or not to receive the replication (both physical and logical) slot info from the primary/publisher and if yes, the walsender(on the primary/publisher) sending it probably as a new WAL record or just piggybacking the replication slot info with any of the existing WAL records. Or simply a common bg worker (as opposed to the bg worker proposed originally in this thread which, IIUC, works for logical replication) running on standby/subscriber for getting both the physical and logical replication slots info. > > As I said upthread, the problem I see with standby/subscriber pulling > > the info is that: how frequently the standby/subscriber is going to > > sync the slot info from primary/publisher? How can it ensure that the > > latest information exists say when the subscriber is down/crashed > > before it picks up the latest slot information? > > Yeah that is a good question that how frequently the subscriber should > fetch the slot information, I think that should be configurable > values. And the time delay is more, the chances of losing the latest > slot is more. I agree that it should be configurable. Even if the primary/publisher is down/crashed, one can still compare the latest slot info from both the primary/publisher and standby/subscriber using a new tool pg_replslotdata proposed at [1] and see how far and which slots missed the latest replication slot info and probably drop those alone to recreate and retain other slots as is. [1] - https://www.postgresql.org/message-id/CALj2ACW0rV5gWK8A3m6_X62qH%2BVfaq5hznC%3Di0R5Wojt5%2Byhyw%40mail.gmail.com Regards, Bharath Rupireddy.
On Mon, Nov 29, 2021 at 12:19 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Mon, Nov 29, 2021 at 11:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Mon, Nov 29, 2021 at 9:40 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > On Mon, Nov 29, 2021 at 1:48 AM SATYANARAYANA NARLAPURAM > > > <satyanarlapuram@gmail.com> wrote: > > > > > > > >> 3) Instead of the subscriber pulling the slot info, why can't the > > > >> publisher (via the walsender or a new bg worker maybe?) push the > > > >> latest slot info? I'm not sure we want to add more functionality to > > > >> the walsender, if yes, isn't it going to be much simpler? > > > > > > > > Standby pulling the information or at least making a first attempt to connect to the primary is a better designas primary doesn't need to spend its cycles repeatedly connecting to an unreachable standby. In fact, primary wouldn'teven need to know the followers, for example followers / log shipping standbys > > > > > > My idea was to let the existing walsender from the primary/publisher > > > to send the slot info (both logical and physical replication slots) to > > > the standby/subscriber, probably by piggybacking the slot info with > > > the WAL currently it sends. Having said that, I don't know the > > > feasibility of it. Anyways, I'm not in favour of having a new bg > > > worker to just ship the slot info. The standby/subscriber, while > > > making connection to primary/publisher, can choose to get the > > > replication slot info. > > > > I think it is possible that the standby is restoring the WAL directly > > from the archive location and there might not be any wal sender at > > time. So I think the idea of standby pulling the WAL looks better to > > me. > > My point was that why can't we let the walreceiver (of course users > can configure it on the standby/subscriber) to choose whether or not > to receive the replication (both physical and logical) slot info from > the primary/publisher and if yes, the walsender(on the > primary/publisher) sending it probably as a new WAL record or just > piggybacking the replication slot info with any of the existing WAL > records. Okay, I thought your point was that the primary pushing is better over standby pulling the slot info, but now it seems that you also agree that standby pulling is better right? Now it appears your point is about whether we will use the same connection for pulling the slot information which we are using for streaming the data or any other connection? I mean in this patch also we are creating a replication connection and pulling the slot information over there, just point is we are starting a separate worker for pulling the slot information, and I think that approach is better as this will not impact the performance of the other replication connection which we are using for communicating the data. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 29, 2021 at 1:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Nov 29, 2021 at 12:19 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Mon, Nov 29, 2021 at 11:14 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Mon, Nov 29, 2021 at 9:40 AM Bharath Rupireddy > > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > > On Mon, Nov 29, 2021 at 1:48 AM SATYANARAYANA NARLAPURAM > > > > <satyanarlapuram@gmail.com> wrote: > > > > > > > > > >> 3) Instead of the subscriber pulling the slot info, why can't the > > > > >> publisher (via the walsender or a new bg worker maybe?) push the > > > > >> latest slot info? I'm not sure we want to add more functionality to > > > > >> the walsender, if yes, isn't it going to be much simpler? > > > > > > > > > > Standby pulling the information or at least making a first attempt to connect to the primary is a better designas primary doesn't need to spend its cycles repeatedly connecting to an unreachable standby. In fact, primary wouldn'teven need to know the followers, for example followers / log shipping standbys > > > > > > > > My idea was to let the existing walsender from the primary/publisher > > > > to send the slot info (both logical and physical replication slots) to > > > > the standby/subscriber, probably by piggybacking the slot info with > > > > the WAL currently it sends. Having said that, I don't know the > > > > feasibility of it. Anyways, I'm not in favour of having a new bg > > > > worker to just ship the slot info. The standby/subscriber, while > > > > making connection to primary/publisher, can choose to get the > > > > replication slot info. > > > > > > I think it is possible that the standby is restoring the WAL directly > > > from the archive location and there might not be any wal sender at > > > time. So I think the idea of standby pulling the WAL looks better to > > > me. > > > > My point was that why can't we let the walreceiver (of course users > > can configure it on the standby/subscriber) to choose whether or not > > to receive the replication (both physical and logical) slot info from > > the primary/publisher and if yes, the walsender(on the > > primary/publisher) sending it probably as a new WAL record or just > > piggybacking the replication slot info with any of the existing WAL > > records. > > Okay, I thought your point was that the primary pushing is better over > standby pulling the slot info, but now it seems that you also agree > that standby pulling is better right? Now it appears your point is > about whether we will use the same connection for pulling the slot > information which we are using for streaming the data or any other > connection? I mean in this patch also we are creating a replication > connection and pulling the slot information over there, just point is > we are starting a separate worker for pulling the slot information, > and I think that approach is better as this will not impact the > performance of the other replication connection which we are using for > communicating the data. The easiest way to implement this feature so far, is to use a common bg worker (as opposed to the bg worker proposed originally in this thread which, IIUC, works for logical replication) running on standby (in case of streaming replication with physical replication slots) or subscriber (in case of logical replication with logical replication slots) for getting both the physical and logical replication slots info from the primary or publisher. 
This bg worker requires at least two GUCs: 1) one to enable/disable the worker, and 2) one to define the slot sync interval (the bg worker fetches the slot info once per sync interval). Thoughts? Regards, Bharath Rupireddy.
On 31.10.21 11:08, Peter Eisentraut wrote: > I want to reactivate $subject. I took Petr Jelinek's patch from [0], > rebased it, added a bit of testing. It basically works, but as > mentioned in [0], there are various issues to work out. > > The idea is that the standby runs a background worker to periodically > fetch replication slot information from the primary. On failover, a > logical subscriber would then ideally find up-to-date replication slots > on the new publisher and can just continue normally. > So, again, this isn't anywhere near ready, but there is already a lot > here to gather feedback about how it works, how it should work, how to > configure it, and how it fits into an overall replication and HA > architecture. Here is an updated patch. The main changes are that I added two configuration parameters. The first, synchronize_slot_names, is set on the physical standby to specify which slots to sync from the primary. By default, it is empty. (This also fixes the recovery test failures that I had to disable in the previous patch version.) The second, standby_slot_names, is set on the primary. It holds back logical replication until the listed physical standbys have caught up. That way, when failover is necessary, the promoted standby is not behind the logical replication consumers. In principle, this works now, I think. I haven't made much progress in creating more test cases for this; that's something that needs more attention. It's worth pondering what the configuration language for standby_slot_names should be. Right now, it's just a list of slots that all need to be caught up. More complicated setups are conceivable. Maybe you have standbys S1 and S2 that are potential failover targets for logical replication consumers L1 and L2, and also standbys S3 and S4 that are potential failover targets for logical replication consumers L3 and L4. Viewed like that, this setting could be a replication slot setting. The setting might also have some relationship with synchronous_standby_names. Like, if you have synchronous_standby_names set, then that's a pretty good indication that you also want some or all of those standbys in standby_slot_names. (But note that one is slots and one is application names.) So there are a variety of possibilities.
Attachment
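To make the two new settings concrete, here is a minimal configuration sketch based on the description above; slot names and connection details are illustrative:

    # on the primary: hold back logical replication until these physical
    # standbys (identified by their replication slot names) have caught up
    standby_slot_names = 'standby1_slot'

    # on the physical standby: which logical slots to mirror from the
    # primary (empty by default, i.e. no synchronization)
    synchronize_slot_names = 'sub1_slot'
    primary_conninfo = 'host=primary.example.com user=repl'
    primary_slot_name = 'standby1_slot'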
On 24.11.21 07:11, Masahiko Sawada wrote: > I haven’t looked at the patch deeply but regarding 007_sync_rep.pl, > the tests seem to fail since the tests rely on the order of the wal > sender array on the shared memory. Since a background worker for > synchronizing replication slots periodically connects to the walsender > on the primary and disconnects, it breaks the assumption of the order. > Regarding 010_logical_decoding_timelines.pl, I guess that the patch > breaks the test because the background worker for synchronizing slots > on the replica periodically advances the replica's slot. I think we > need to have a way to disable the slot synchronization or to specify > the slot name to sync with the primary. I'm not sure we already > discussed this topic but I think we need it at least for testing > purposes. This has been addressed by patch v2 that adds such a setting.
On 24.11.21 17:25, Dimitri Fontaine wrote: > Is there a case to be made about doing the same thing for physical > replication slots too? It has been considered. At the moment, I'm not doing it, because it would add more code and complexity and it's not that important. But it could be added in the future. > Given the admitted state of the patch, I didn't focus on tests. I could > successfully apply the patch on-top of current master's branch, and > cleanly compile and `make check`. > > Then I also updated pg_auto_failover to support Postgres 15devel [2] so > that I could then `make NODES=3 cluster` there and play with the new > replication command: > > $ psql -d "port=5501 replication=1" -c "LIST_SLOTS;" > psql:/Users/dim/.psqlrc:24: ERROR: XX000: cannot execute SQL commands in WAL sender for physical replication > LOCATION: exec_replication_command, walsender.c:1830 > ... > > I'm not too sure about this idea of running SQL in a replication > protocol connection that you're mentioning, but I suppose that's just me > needing to brush up on the topic. FWIW, the way the replication command parser works, if there is a parse error, it tries to interpret the command as a plain SQL command. But that only works for logical replication connections. So in physical replication, if you try to run anything that does not parse, you will get this error. But that has nothing to do with this feature. The above command works for me, so maybe something else went wrong in your situation. > Maybe the first question about configuration would be about selecting > which slots a standby should maintain from the primary. Is it all of the > slots that exists on both the nodes, or a sublist of that? > > Is it possible to have a slot with the same name on a primary and a > standby node, in a way that the standby's slot would be a completely > separate entity from the primary's slot? If yes (I just don't know at > the moment), well then, should we continue to allow that? This has been added in v2. > Also, do we want to even consider having the slot management on a > primary node depend on the ability to sync the advancing on one or more > standby nodes? I'm not sure to see that one as a good idea, but maybe we > want to kill it publically very early then ;-) I don't know what you mean by this.
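To spell out the parser behavior described above, the difference is between the two kinds of replication connections (connection strings are illustrative; the error text is the one from Dimitri's session):

    # physical walsender: anything that does not parse as a replication
    # command falls through to this error
    $ psql -d "port=5501 replication=1" -c "SELECT 1;"
    ERROR:  cannot execute SQL commands in WAL sender for physical replication

    # logical walsender (replication=database): plain SQL is accepted, which
    # is what would allow SELECT ... FROM pg_replication_slots to replace
    # LIST_SLOTS
    $ psql -d "port=5501 dbname=postgres replication=database" -c "SELECT 1;"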
On 28.11.21 07:52, Bharath Rupireddy wrote: > 1) Instead of a new LIST_SLOT command, can't we use > READ_REPLICATION_SLOT (slight modifications needs to be done to make > it support logical replication slots and to get more information from > the subscriber). I looked at that but didn't see an obvious way to consolidate them. This is something we could look at again later. > 2) How frequently the new bg worker is going to sync the slot info? > How can it ensure that the latest information exists say when the > subscriber is down/crashed before it picks up the latest slot > information? The interval is currently hardcoded, but could be a configuration setting. In the v2 patch, there is a new setting that orders physical replication before logical so that the logical subscribers cannot get ahead of the physical standby. > 3) Instead of the subscriber pulling the slot info, why can't the > publisher (via the walsender or a new bg worker maybe?) push the > latest slot info? I'm not sure we want to add more functionality to > the walsender, if yes, isn't it going to be much simpler? This sounds like the failover slot feature, which was rejected.
Hello,

I started taking a brief look at the v2 patch, and it does appear to work for the basic case. The logical slot is synchronized across, and I can connect to the promoted standby and stream changes afterwards.

It's not clear to me what the correct behavior is when a logical slot has been synced to the replica and then gets deleted on the writer. Would we expect this to be propagated, or leave it up to the end-user to manage?

> + rawname = pstrdup(standby_slot_names);
> + SplitIdentifierString(rawname, ',', &namelist);
> +
> + while (true)
> + {
> + int wait_slots_remaining;
> + XLogRecPtr oldest_flush_pos = InvalidXLogRecPtr;
> + int rc;
> +
> + wait_slots_remaining = list_length(namelist);
> +
> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> + for (int i = 0; i < max_replication_slots; i++)
> + {

Even though standby_slot_names is PGC_SIGHUP, we never reload/re-process the value. If we have a wrong entry in there, the backend becomes stuck until we re-establish the logical connection. Adding "postmaster/interrupt.h" with ConfigReloadPending/ProcessConfigFile does seem to work.

Another thing I noticed is that once it starts waiting in this block, Ctrl+C doesn't seem to terminate the backend?

pg_recvlogical -d postgres -p 5432 --slot regression_slot --start -f -
..
^Cpg_recvlogical: error: unexpected termination of replication stream:

The logical backend connection is still present:

ps aux | grep 51263
hsuchen 51263 80.7 0.0 320180 14304 ? Rs 01:11 3:04 postgres: walsender hsuchen [local] START_REPLICATION

pstack 51263
#0 0x00007ffee99e79a5 in clock_gettime ()
#1 0x00007f8705e88246 in clock_gettime () from /lib64/libc.so.6
#2 0x000000000075f141 in WaitEventSetWait ()
#3 0x000000000075f565 in WaitLatch ()
#4 0x0000000000720aea in ReorderBufferProcessTXN ()
#5 0x00000000007142a6 in DecodeXactOp ()
#6 0x000000000071460f in LogicalDecodingProcessRecord ()

It can be terminated with a pg_terminate_backend though.

If we have a physical slot with name foo on the standby, and then a logical slot is created on the writer with the same slot_name, it does error out on the replica, although it prevents other slots from being synchronized, which is probably fine.

2021-12-16 02:10:29.709 UTC [73788] LOG: replication slot synchronization worker for database "postgres" has started
2021-12-16 02:10:29.713 UTC [73788] ERROR: cannot use physical replication slot for logical decoding
2021-12-16 02:10:29.714 UTC [73037] DEBUG: unregistering background worker "replication slot synchronization worker"

On 12/14/21, 2:26 PM, "Peter Eisentraut" <peter.eisentraut@enterprisedb.com> wrote:

On 28.11.21 07:52, Bharath Rupireddy wrote:
> 1) Instead of a new LIST_SLOT command, can't we use
> READ_REPLICATION_SLOT (slight modifications needs to be done to make
> it support logical replication slots and to get more information from
> the subscriber).

I looked at that but didn't see an obvious way to consolidate them. This is something we could look at again later.

> 2) How frequently the new bg worker is going to sync the slot info?
> How can it ensure that the latest information exists say when the
> subscriber is down/crashed before it picks up the latest slot
> information?

The interval is currently hardcoded, but could be a configuration setting.
In the v2 patch, there is a new setting that orders physical replication before logical so that the logical subscribers cannot get ahead of the physical standby. > 3) Instead of the subscriber pulling the slot info, why can't the > publisher (via the walsender or a new bg worker maybe?) push the > latest slot info? I'm not sure we want to add more functionality to > the walsender, if yes, isn't it going to be much simpler? This sounds like the failover slot feature, which was rejected.
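For what it's worth, a sketch of the reload handling suggested above, following the pattern other PostgreSQL wait loops use; this is not code from the patch, and the surrounding loop is abbreviated (the file already has what WaitLatch itself needs):

    #include "miscadmin.h"
    #include "postmaster/interrupt.h"
    #include "utils/guc.h"

    /* inside the wait loop of wait_for_standby_confirmation() */
    for (;;)
    {
        /* allow query cancel / pg_terminate_backend() to be processed */
        CHECK_FOR_INTERRUPTS();

        /* pick up changes to standby_slot_names after a SIGHUP */
        if (ConfigReloadPending)
        {
            ConfigReloadPending = false;
            ProcessConfigFile(PGC_SIGHUP);
            /* re-split standby_slot_names before re-checking the slots */
        }

        /* ... check slot positions, then wait as the patch already does ... */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         1000L, PG_WAIT_EXTENSION);
        ResetLatch(MyLatch);
    }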
Here is an updated patch to fix some build failures. No feature changes. On 14.12.21 23:12, Peter Eisentraut wrote: > On 31.10.21 11:08, Peter Eisentraut wrote: >> I want to reactivate $subject. I took Petr Jelinek's patch from [0], >> rebased it, added a bit of testing. It basically works, but as >> mentioned in [0], there are various issues to work out. >> >> The idea is that the standby runs a background worker to periodically >> fetch replication slot information from the primary. On failover, a >> logical subscriber would then ideally find up-to-date replication >> slots on the new publisher and can just continue normally. > >> So, again, this isn't anywhere near ready, but there is already a lot >> here to gather feedback about how it works, how it should work, how to >> configure it, and how it fits into an overall replication and HA >> architecture. > > Here is an updated patch. The main changes are that I added two > configuration parameters. The first, synchronize_slot_names, is set on > the physical standby to specify which slots to sync from the primary. By > default, it is empty. (This also fixes the recovery test failures that > I had to disable in the previous patch version.) The second, > standby_slot_names, is set on the primary. It holds back logical > replication until the listed physical standbys have caught up. That > way, when failover is necessary, the promoted standby is not behind the > logical replication consumers. > > In principle, this works now, I think. I haven't made much progress in > creating more test cases for this; that's something that needs more > attention. > > It's worth pondering what the configuration language for > standby_slot_names should be. Right now, it's just a list of slots that > all need to be caught up. More complicated setups are conceivable. > Maybe you have standbys S1 and S2 that are potential failover targets > for logical replication consumers L1 and L2, and also standbys S3 and S4 > that are potential failover targets for logical replication consumers L3 > and L4. Viewed like that, this setting could be a replication slot > setting. The setting might also have some relationship with > synchronous_standby_names. Like, if you have synchronous_standby_names > set, then that's a pretty good indication that you also want some or all > of those standbys in standby_slot_names. (But note that one is slots > and one is application names.) So there are a variety of possibilities.
Attachment
On Wed, Dec 15, 2021 at 7:13 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 31.10.21 11:08, Peter Eisentraut wrote: > > I want to reactivate $subject. I took Petr Jelinek's patch from [0], > > rebased it, added a bit of testing. It basically works, but as > > mentioned in [0], there are various issues to work out. > > > > The idea is that the standby runs a background worker to periodically > > fetch replication slot information from the primary. On failover, a > > logical subscriber would then ideally find up-to-date replication slots > > on the new publisher and can just continue normally. > > > So, again, this isn't anywhere near ready, but there is already a lot > > here to gather feedback about how it works, how it should work, how to > > configure it, and how it fits into an overall replication and HA > > architecture. > > The second, > standby_slot_names, is set on the primary. It holds back logical > replication until the listed physical standbys have caught up. That > way, when failover is necessary, the promoted standby is not behind the > logical replication consumers. I might be missing something but isn’t it okay even if the new primary server is behind the subscribers? IOW, even if two slot's LSNs (i.e., restart_lsn and confirm_flush_lsn) are behind the subscriber's remote LSN (i.e., pg_replication_origin.remote_lsn), the primary sends only transactions that were committed after the remote_lsn. So the subscriber can resume logical replication with the new primary without any data loss. The new primary should not be ahead of the subscribers because it forwards the logical replication start LSN to the slot’s confirm_flush_lsn in this case. But it cannot happen since the remote LSN of the subscriber’s origin is always updated first, then the confirm_flush_lsn of the slot on the primary is updated, and then the confirm_flush_lsn of the slot on the standby is synchronized. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
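For readers following the LSN bookkeeping in this argument, these are the positions being compared; a quick way to inspect them (queries are illustrative):

    -- on the subscriber: how far each origin has been replayed
    SELECT external_id, remote_lsn, local_lsn
    FROM pg_replication_origin_status;

    -- on the publisher (and, once synced, on the standby): the slot's state
    SELECT slot_name, restart_lsn, confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE slot_type = 'logical';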
> I might be missing something but isn’t it okay even if the new primary
> server is behind the subscribers? IOW, even if two slot's LSNs (i.e.,
> restart_lsn and confirm_flush_lsn) are behind the subscriber's remote
> LSN (i.e., pg_replication_origin.remote_lsn), the primary sends only
> transactions that were committed after the remote_lsn. So the
> subscriber can resume logical replication with the new primary without
> any data loss.

Maybe I'm misreading, but I thought the purpose of this was to make sure that the logical subscriber does not have data that has not been replicated to the new primary. The use-case I can think of would be if synchronous_commit were enabled and fail-over occurs. If we didn't have this set, isn't it possible that this logical subscriber has extra commits that aren't present on the newly promoted primary?

And sorry, I accidentally started a new thread in my last reply. Re-pasting some of my previous questions/comments:

wait_for_standby_confirmation does not update standby_slot_names once it's in a loop and can't be fixed with SIGHUP. Similarly, synchronize_slot_names isn't updated once the worker is launched.

If a logical slot was dropped on the writer, should the worker drop logical slots that it was previously synchronizing but are no longer present? Or should we leave that to the user to manage? I'm trying to think why users would want to sync logical slots to a reader but not have that be dropped as well if it's no longer present.

Is there a reason we're deciding to use one-worker syncing per database instead of one general worker that syncs across all the databases? I imagine I'm missing something obvious here.

As for how standby_slot_names should be configured, I'd prefer flexibility similar to what we have for synchronous_standby_names, since that seems the most analogous. It'd provide flexibility for failovers, which I imagine is the most common use-case.

On 1/20/22, 9:34 PM, "Masahiko Sawada" <sawada.mshk@gmail.com> wrote:

On Wed, Dec 15, 2021 at 7:13 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 31.10.21 11:08, Peter Eisentraut wrote: > > I want to reactivate $subject. I took Petr Jelinek's patch from [0], > > rebased it, added a bit of testing. It basically works, but as > > mentioned in [0], there are various issues to work out. > > > > The idea is that the standby runs a background worker to periodically > > fetch replication slot information from the primary. On failover, a > > logical subscriber would then ideally find up-to-date replication slots > > on the new publisher and can just continue normally. > > > So, again, this isn't anywhere near ready, but there is already a lot > > here to gather feedback about how it works, how it should work, how to > > configure it, and how it fits into an overall replication and HA > > architecture. > > The second, > standby_slot_names, is set on the primary. It holds back logical > replication until the listed physical standbys have caught up. That > way, when failover is necessary, the promoted standby is not behind the > logical replication consumers.

I might be missing something but isn’t it okay even if the new primary server is behind the subscribers?
IOW, even if two slot's LSNs (i.e., restart_lsn and confirm_flush_lsn) are behind the subscriber's remote LSN (i.e., pg_replication_origin.remote_lsn), the primary sends only transactions that were committed after the remote_lsn. So the subscriber can resume logical replication with the new primary without any data loss. The new primary should not be ahead of the subscribers because it forwards the logical replication start LSN to the slot’s confirm_flush_lsn in this case. But it cannot happen since the remote LSN of the subscriber’s origin is always updated first, then the confirm_flush_lsn of the slot on the primary is updated, and then the confirm_flush_lsn of the slot on the standby is synchronized. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/
On Sat, Jan 22, 2022 at 4:33 AM Hsu, John <hsuchen@amazon.com> wrote: > > > I might be missing something but isn’t it okay even if the new primary > > server is behind the subscribers? IOW, even if two slot's LSNs (i.e., > > restart_lsn and confirm_flush_lsn) are behind the subscriber's remote > > LSN (i.e., pg_replication_origin.remote_lsn), the primary sends only > > transactions that were committed after the remote_lsn. So the > > subscriber can resume logical replication with the new primary without > > any data loss. > > Maybe I'm misreading, but I thought the purpose of this to make > sure that the logical subscriber does not have data that has not been > replicated to the new primary. The use-case I can think of would be > if synchronous_commit were enabled and fail-over occurs. If > we didn't have this set, isn't it possible that this logical subscriber > has extra commits that aren't present on the newly promoted primary? > This is very much possible if the new primary used to be asynchronous standby. But, it seems like the current patch is trying to hold the logical replication until the data has been replicated to the physical standby when synchronous_slot_names is set. This will ensure that the logical subscriber is never ahead of the new primary. However, AFAIU that's not the primary use-case of this patch; instead this is to ensure that the logical subscribers continue getting data from the new primary when the failover occurs. > > If a logical slot was dropped on the writer, should the worker drop logical > slots that it was previously synchronizing but are no longer present? Or > should we leave that to the user to manage? I'm trying to think why users > would want to sync logical slots to a reader but not have that be dropped > as well if it's no longer present. > AFAIU this should be taken care of by the background worker used to synchronize the replication slot. -- With Regards, Ashutosh Sharma.
Hi, On 2022-01-03 14:46:52 +0100, Peter Eisentraut wrote: > From ec00dc6ab8bafefc00e9b1c78ac9348b643b8a87 Mon Sep 17 00:00:00 2001 > From: Peter Eisentraut <peter@eisentraut.org> > Date: Mon, 3 Jan 2022 14:43:36 +0100 > Subject: [PATCH v3] Synchronize logical replication slots from primary to > standby I've just skimmed the patch and the related threads. As far as I can tell this cannot be safely used without the conflict handling in [1], is that correct? Greetings, Andres Freund [1] https://postgr.es/m/CA%2BTgmoZd-JqNL1-R3RJ0jQRD%2B-dc94X0nPJgh%2BdwdDF0rFuE3g%40mail.gmail.com
Hi Andres, Are you talking about this scenario - what if the logical replication slot on the publisher is dropped, but is being referenced by the standby where the slot is synchronized? Should the redo function for the drop replication slot have the capability to drop it on standby and its subscribers (if any) as well? -- With Regards, Ashutosh Sharma. On Sun, Feb 6, 2022 at 1:29 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-01-03 14:46:52 +0100, Peter Eisentraut wrote: > > From ec00dc6ab8bafefc00e9b1c78ac9348b643b8a87 Mon Sep 17 00:00:00 2001 > > From: Peter Eisentraut <peter@eisentraut.org> > > Date: Mon, 3 Jan 2022 14:43:36 +0100 > > Subject: [PATCH v3] Synchronize logical replication slots from primary to > > standby > > I've just skimmed the patch and the related threads. As far as I can tell this > cannot be safely used without the conflict handling in [1], is that correct? > > > Greetings, > > Andres Freund > > [1] https://postgr.es/m/CA%2BTgmoZd-JqNL1-R3RJ0jQRD%2B-dc94X0nPJgh%2BdwdDF0rFuE3g%40mail.gmail.com > >
Hi, On 2022-02-07 13:38:38 +0530, Ashutosh Sharma wrote: > Are you talking about this scenario - what if the logical replication > slot on the publisher is dropped, but is being referenced by the > standby where the slot is synchronized? It's a bit hard to say, because neither in this thread nor in the patch I've found a clear description of what the syncing needs to & tries to guarantee. It might be that that was discussed in one of the precursor threads, but... Generally I don't think we can permit scenarios where a slot can be in a "corrupt" state, i.e. missing required catalog entries, after "normal" administrative commands (i.e. not mucking around in catalog entries / on-disk files). Even if the sequence of commands may be a bit weird. All such cases need to be either prevented or detected. As far as I can tell, the way this patch keeps slots on physical replicas "valid" is solely by reorderbuffer.c blocking during replay via wait_for_standby_confirmation(). Which means that if e.g. the standby_slot_names GUC differs from synchronize_slot_names on the physical replica, the slots synchronized on the physical replica are not going to be valid. Or if the primary drops its logical slots. > Should the redo function for the drop replication slot have the capability > to drop it on standby and its subscribers (if any) as well? Slots are not WAL logged (and shouldn't be). I think you pretty much need the recovery conflict handling infrastructure I referenced upthread, which recognized during replay if a record has a conflict with a slot on a standby. And then ontop of that you can build something like this patch. Greetings, Andres Freund
Hi, On 2022-01-03 14:46:52 +0100, Peter Eisentraut wrote: > +static void > +ApplyLauncherStartSlotSync(TimestampTz *last_start_time, long *wait_time) > +{ > [...] > + > + foreach(lc, slots) > + { > + WalRecvReplicationSlotData *slot_data = lfirst(lc); > + LogicalRepWorker *w; > + > + if (!OidIsValid(slot_data->database)) > + continue; > + > + LWLockAcquire(LogicalRepWorkerLock, LW_SHARED); > + w = logicalrep_worker_find(slot_data->database, InvalidOid, > + InvalidOid, false); > + LWLockRelease(LogicalRepWorkerLock); > + > + if (w == NULL) > + { > + *last_start_time = now; > + *wait_time = wal_retrieve_retry_interval; > + > + logicalrep_worker_launch(slot_data->database, InvalidOid, NULL, > + BOOTSTRAP_SUPERUSERID, InvalidOid); Do we really need a dedicated worker for each single slot? That seems excessively expensive. > +++ b/src/backend/replication/logical/reorderbuffer.c > [...] > +static void > +wait_for_standby_confirmation(XLogRecPtr commit_lsn) > +{ > + char *rawname; > + List *namelist; > + ListCell *lc; > + XLogRecPtr flush_pos = InvalidXLogRecPtr; > + > + if (strcmp(standby_slot_names, "") == 0) > + return; > + > + rawname = pstrdup(standby_slot_names); > + SplitIdentifierString(rawname, ',', &namelist); > + > + while (true) > + { > + int wait_slots_remaining; > + XLogRecPtr oldest_flush_pos = InvalidXLogRecPtr; > + int rc; > + > + wait_slots_remaining = list_length(namelist); > + > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > + for (int i = 0; i < max_replication_slots; i++) > + { > + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; > + bool inlist; > + > + if (!s->in_use) > + continue; > + > + inlist = false; > + foreach (lc, namelist) > + { > + char *name = lfirst(lc); > + if (strcmp(name, NameStr(s->data.name)) == 0) > + { > + inlist = true; > + break; > + } > + } > + if (!inlist) > + continue; > + > + SpinLockAcquire(&s->mutex); It doesn't seem like a good idea to perform O(max_replication_slots * #standby_slot_names) work on each decoded commit. Nor to SplitIdentifierString(pstrdup(standby_slot_names)) every time. > + if (s->data.database == InvalidOid) > + /* Physical slots advance restart_lsn on flush and ignore confirmed_flush_lsn */ > + flush_pos = s->data.restart_lsn; > + else > + /* For logical slots we must wait for commit and flush */ > + flush_pos = s->data.confirmed_flush; > + > + SpinLockRelease(&s->mutex); > + > + /* We want to find out the min(flush pos) over all named slots */ > + if (oldest_flush_pos == InvalidXLogRecPtr > + || oldest_flush_pos > flush_pos) > + oldest_flush_pos = flush_pos; > + > + if (flush_pos >= commit_lsn && wait_slots_remaining > 0) > + wait_slots_remaining --; > + } > + LWLockRelease(ReplicationSlotControlLock); > + > + if (wait_slots_remaining == 0) > + return; > + > + rc = WaitLatch(MyLatch, > + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, > + 1000L, PG_WAIT_EXTENSION); I don't think it's a good idea to block here like this - no walsender specific handling is going to happen. E.g. not sending replies to receivers will cause them to time out. And for the SQL functions this will cause blocking even though the interface expects to return when reaching the end of the WAL - which this pretty much is. I think this needs to be restructured so that you only do the checking of the "up to this point" position when needed, rather than every commit. 
We already *have* a check for not replaying further than the flushed WAL position, see the GetFlushRecPtr() calls in WalSndWaitForWal(), pg_logical_slot_get_changes_guts(). I think you'd basically need to integrate with that, rather than introduce blocking in reorderbuffer.c.

> + if (rc & WL_POSTMASTER_DEATH)
> + proc_exit(1);

Should use WL_EXIT_ON_PM_DEATH these days.

Greetings,

Andres Freund
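As a sketch of the kind of restructuring Andres suggests for the repeated parsing (this is not from the patch; the function and variable names are made up): cache the split list of standby_slot_names in TopMemoryContext and rebuild it only when the GUC's string value changes, instead of pstrdup() + SplitIdentifierString() on every decoded commit. Note that SplitIdentifierString() returns pointers into the string it is given, so that copy has to live as long as the list does.

    #include "utils/memutils.h"
    #include "utils/varlena.h"

    static char *cached_names_guc = NULL;   /* GUC value the cache was built from */
    static char *cached_names_raw = NULL;   /* copy the list elements point into */
    static List *cached_name_list = NIL;

    static List *
    standby_slot_name_list(void)
    {
        if (cached_names_guc == NULL ||
            strcmp(cached_names_guc, standby_slot_names) != 0)
        {
            MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext);

            list_free(cached_name_list);
            cached_name_list = NIL;
            if (cached_names_raw)
                pfree(cached_names_raw);
            if (cached_names_guc)
                pfree(cached_names_guc);

            cached_names_guc = pstrdup(standby_slot_names);
            cached_names_raw = pstrdup(standby_slot_names);
            (void) SplitIdentifierString(cached_names_raw, ',', &cached_name_list);

            MemoryContextSwitchTo(oldcxt);
        }

        return cached_name_list;
    }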
On Tue, Feb 8, 2022 at 2:02 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-02-07 13:38:38 +0530, Ashutosh Sharma wrote: > > Are you talking about this scenario - what if the logical replication > > slot on the publisher is dropped, but is being referenced by the > > standby where the slot is synchronized? > > It's a bit hard to say, because neither in this thread nor in the patch I've > found a clear description of what the syncing needs to & tries to > guarantee. It might be that that was discussed in one of the precursor > threads, but... > > Generally I don't think we can permit scenarios where a slot can be in a > "corrupt" state, i.e. missing required catalog entries, after "normal" > administrative commands (i.e. not mucking around in catalog entries / on-disk > files). Even if the sequence of commands may be a bit weird. All such cases > need to be either prevented or detected. > > > As far as I can tell, the way this patch keeps slots on physical replicas > "valid" is solely by reorderbuffer.c blocking during replay via > wait_for_standby_confirmation(). > > Which means that if e.g. the standby_slot_names GUC differs from > synchronize_slot_names on the physical replica, the slots synchronized on the > physical replica are not going to be valid. Or if the primary drops its > logical slots. > > > > Should the redo function for the drop replication slot have the capability > > to drop it on standby and its subscribers (if any) as well? > > Slots are not WAL logged (and shouldn't be). > > I think you pretty much need the recovery conflict handling infrastructure I > referenced upthread, which recognized during replay if a record has a conflict > with a slot on a standby. And then ontop of that you can build something like > this patch. > OK. Understood, thanks Andres. -- With Regards, Ashutosh Sharma.
On Tue, Feb 8, 2022 at 08:27:32PM +0530, Ashutosh Sharma wrote: > > Which means that if e.g. the standby_slot_names GUC differs from > > synchronize_slot_names on the physical replica, the slots synchronized on the > > physical replica are not going to be valid. Or if the primary drops its > > logical slots. > > > > > > > Should the redo function for the drop replication slot have the capability > > > to drop it on standby and its subscribers (if any) as well? > > > > Slots are not WAL logged (and shouldn't be). > > > > I think you pretty much need the recovery conflict handling infrastructure I > > referenced upthread, which recognized during replay if a record has a conflict > > with a slot on a standby. And then ontop of that you can build something like > > this patch. > > > > OK. Understood, thanks Andres. I would love to see this feature in PG 15. Can someone explain its current status? Thanks. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On 10.02.22 22:47, Bruce Momjian wrote:
> On Tue, Feb 8, 2022 at 08:27:32PM +0530, Ashutosh Sharma wrote:
>>> Which means that if e.g. the standby_slot_names GUC differs from
>>> synchronize_slot_names on the physical replica, the slots synchronized on the
>>> physical replica are not going to be valid. Or if the primary drops its
>>> logical slots.
>>>
>>>> Should the redo function for the drop replication slot have the capability
>>>> to drop it on standby and its subscribers (if any) as well?
>>>
>>> Slots are not WAL logged (and shouldn't be).
>>>
>>> I think you pretty much need the recovery conflict handling infrastructure I
>>> referenced upthread, which recognized during replay if a record has a conflict
>>> with a slot on a standby. And then ontop of that you can build something like
>>> this patch.
>>
>> OK. Understood, thanks Andres.
>
> I would love to see this feature in PG 15. Can someone explain its
> current status? Thanks.

The way I understand it:

1. This feature (probably) depends on the "Minimal logical decoding on standbys" patch. The details there aren't totally clear (to me). That patch had some activity lately, but I don't see it in a state that it's nearing readiness.

2. I think the way this (my) patch is currently written needs some refactoring about how we launch and manage workers. Right now, it's all mangled together with logical replication, since that is a convenient way to launch and manage workers, but it really doesn't need to be tied to logical replication, since it can also be used for other logical slots.

3. It's an open question how to configure this. My patch shows a very minimal configuration that allows you to keep all logical slots always behind one physical slot, which addresses one particular use case. In general, you might have things like: one set of logical slots should stay behind one physical slot, another set behind another physical slot, another set should not care, etc. This could turn into something like the synchronous replication feature, where it ends up with its own configuration language.

Each of these is clearly a significant job on its own.
On 05.02.22 20:59, Andres Freund wrote: > On 2022-01-03 14:46:52 +0100, Peter Eisentraut wrote: >> From ec00dc6ab8bafefc00e9b1c78ac9348b643b8a87 Mon Sep 17 00:00:00 2001 >> From: Peter Eisentraut<peter@eisentraut.org> >> Date: Mon, 3 Jan 2022 14:43:36 +0100 >> Subject: [PATCH v3] Synchronize logical replication slots from primary to >> standby > I've just skimmed the patch and the related threads. As far as I can tell this > cannot be safely used without the conflict handling in [1], is that correct? This or similar questions have been asked a few times about this or similar patches, but they always come with some doubt. If we think so, it would be useful perhaps if we could come up with test cases that would demonstrate why that other patch/feature is necessary. (I'm not questioning it personally, I'm just throwing out ideas here.)
On Fri, Feb 11, 2022 at 9:26 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > On 10.02.22 22:47, Bruce Momjian wrote: > > I would love to see this feature in PG 15. Can someone explain its > > current status? Thanks. > > The way I understand it: > ... Hi Peter, I'm starting to review this patch, and last time I checked I noticed it didn't seem to apply cleanly to master anymore. Would you be able to send a rebased version? Thanks, James Coleman
On Fri, Feb 11, 2022 at 9:26 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote: > > The way I understand it: > > 1. This feature (probably) depends on the "Minimal logical decoding on > standbys" patch. The details there aren't totally clear (to me). That > patch had some activity lately but I don't see it in a state that it's > nearing readiness. > > 2. I think the way this (my) patch is currently written needs some > refactoring about how we launch and manage workers. Right now, it's all > mangled together with logical replication, since that is a convenient > way to launch and manage workers, but it really doesn't need to be tied > to logical replication, since it can also be used for other logical slots. > > 3. It's an open question how to configure this. My patch show a very > minimal configuration that allows you to keep all logical slots always > behind one physical slot, which addresses one particular use case. In > general, you might have things like, one set of logical slots should > stay behind one physical slot, another set behind another physical slot, > another set should not care, etc. This could turn into something like > the synchronous replication feature, where it ends up with its own > configuration language. > > Each of these are clearly significant jobs on their own. > Thanks for bringing this topic back up again. I haven't gotten a chance to do any testing on the patch yet, but here are my initial notes from reviewing it: First, reusing the logical replication launcher seems a bit gross. It's obviously a pragmatic choice, but I find it confusing and likely to become only moreso given the fact there's nothing about slot syncing that's inherently limited to logical slots. Plus the feature currently is about syncing slots on a physical replica. So I think that probably should change. Second, it seems to me that the worker-per-DB architecture means that this is unworkable on a cluster with a large number of DBs. The original thread said that was because "logical slots are per database, walrcv_exec needs db connection, etc". As to the walrcv_exec, we're (re)connecting to the primary for each synchronization anyway, so that doesn't seem like a significant reason. I don't understand why logical slots being per-database means we have to do it this way. Is there something about the background worker architecture (I'm revealing my own ignorance here I suppose) that requires this? Also it seems that we reconnect to the primary every time we want to synchronize slots. Maybe that's OK, but it struck me as a bit odd, so I wanted to ask about it. Third, wait_for_standby_confirmation() needs a function comment. Andres noted this earlier, but it seems like we're doing quite a bit of work in this function for each commit. Some of it is obviously duplicative like the parsing of standby_slot_names. The waiting introduced also doesn't seem like a good idea. Andres also commented on that earlier; I'd echo his comments here absent an explanation of why it's preferable/necessary to do it this way. > + if (flush_pos >= commit_lsn && wait_slots_remaining > 0) > + wait_slots_remaining --; I might be missing something re: project style, but the space before the "--" looks odd to my eyes. > * Call either PREPARE (for two-phase transactions) or COMMIT (for > * regular ones). > */ > + > + wait_for_standby_confirmation(commit_lsn); > + > if (rbtxn_prepared(txn)) > rb->prepare(rb, txn, commit_lsn); > else It appears the addition of this call splits the comment from the code it goes with. 
> + * Wait for remote slot to pass localy reserved position. Typo ("localy" -> "locally"). This patch would be a significant improvement for us; I'm hoping we can see some activity on it. I'm also hoping to try to do some testing next week and see if I can poke any holes in the functionality (with the goal of verifying Andres's concerns about the safety without the minimal logical decoding on a replica patch). Thanks, James Coleman
Hi, On 2022-02-11 15:28:19 +0100, Peter Eisentraut wrote: > On 05.02.22 20:59, Andres Freund wrote: > > On 2022-01-03 14:46:52 +0100, Peter Eisentraut wrote: > > > From ec00dc6ab8bafefc00e9b1c78ac9348b643b8a87 Mon Sep 17 00:00:00 2001 > > > From: Peter Eisentraut<peter@eisentraut.org> > > > Date: Mon, 3 Jan 2022 14:43:36 +0100 > > > Subject: [PATCH v3] Synchronize logical replication slots from primary to > > > standby > > I've just skimmed the patch and the related threads. As far as I can tell this > > cannot be safely used without the conflict handling in [1], is that correct? > > This or similar questions have been asked a few times about this or similar > patches, but they always come with some doubt. I'm certain it's a problem - the only reason I couched it that way was that there could have been something clever in the patch preventing problems that I missed because I just skimmed it. > If we think so, it would be > useful perhaps if we could come up with test cases that would demonstrate > why that other patch/feature is necessary. (I'm not questioning it > personally, I'm just throwing out ideas here.) The patch as-is just breaks one of the fundamental guarantees necessary for logical decoding, that no row versions can be removed that are still required for logical decoding (signalled via catalog_xmin). So there needs to be an explicit mechanism upholding that guarantee, but there is not right now from what I can see. One piece of the referenced patchset is that it adds information about removed catalog rows to a few WAL records, and then verifies during replay that no record can be replayed that removes resources that are still needed. If such a conflict exists it's dealt with as a recovery conflict. That itself doesn't provide prevention against removal of required rows, but it provides detection. The prevention against removal can then be done using a physical replication slot with hot standby feedback or some other mechanism (e.g. the slot syncing mechanism could maintain a "placeholder" slot on the primary for all sync targets or something like that). Even if that infrastructure existed / was merged, the slot sync stuff would still need some very careful logic to protect against problems due to concurrent WAL replay and "synchronized slot" creation. But that's doable. Greetings, Andres Freund
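To make the catalog_xmin point above a bit more concrete, here is a small sketch (slot name invented) of how the horizon can be observed on both nodes and how the physical-slot-plus-hot-standby-feedback prevention mentioned above is typically wired up; this is standard PostgreSQL, not something the patch adds.
```
-- On either node: catalog_xmin is the horizon that catalog row removal must
-- not overtake while the slot is still needed for logical decoding.
SELECT slot_name, slot_type, catalog_xmin, restart_lsn
FROM pg_replication_slots;

-- Prevention as described above, on the primary (slot name is invented):
SELECT pg_create_physical_replication_slot('standby1_physical');

-- On the standby: point primary_slot_name at that slot and enable feedback.
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();
```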
Hello, The status of this patch has already been set to Returned with Feedback. However, I applied this patch on 27b77ecf9f and tested it, so I'm reporting the results. make installcheck-world passes. However, when I promote the standby server and run updates on the new primary server, the apply worker cannot start logical replication and emits the following errors. LOG: background worker "logical replication worker" (PID 14506) exited with exit code 1 LOG: logical replication apply worker for subscription "sub1" has started ERROR: terminating logical replication worker due to timeout LOG: background worker "logical replication worker" (PID 14535) exited with exit code 1 LOG: logical replication apply worker for subscription "sub1" has started It seems that the apply worker does not start because a walsender already exists on the new primary. Do you have any thoughts about what the cause might be? The test script is attached. regards, sho kato
Attachment
On Fri, Feb 18, 2022 at 5:23 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-02-11 15:28:19 +0100, Peter Eisentraut wrote: > > On 05.02.22 20:59, Andres Freund wrote: > > > On 2022-01-03 14:46:52 +0100, Peter Eisentraut wrote: > > > > From ec00dc6ab8bafefc00e9b1c78ac9348b643b8a87 Mon Sep 17 00:00:00 2001 > > > > From: Peter Eisentraut<peter@eisentraut.org> > > > > Date: Mon, 3 Jan 2022 14:43:36 +0100 > > > > Subject: [PATCH v3] Synchronize logical replication slots from primary to > > > > standby > > > I've just skimmed the patch and the related threads. As far as I can tell this > > > cannot be safely used without the conflict handling in [1], is that correct? > > > > This or similar questions have been asked a few times about this or similar > > patches, but they always come with some doubt. > > I'm certain it's a problem - the only reason I couched it was that there could > have been something clever in the patch preventing problems that I missed > because I just skimmed it. > > > > If we think so, it would be > > useful perhaps if we could come up with test cases that would demonstrate > > why that other patch/feature is necessary. (I'm not questioning it > > personally, I'm just throwing out ideas here.) > > The patch as-is just breaks one of the fundamental guarantees necessary for > logical decoding, that no rows versions can be removed that are still required > for logical decoding (signalled via catalog_xmin). So there needs to be an > explicit mechanism upholding that guarantee, but there is not right now from > what I can see. I've been working on adding test coverage to prove this out, but I've encountered the problem reported in [1]. My assumption (Andres, please correct me if I'm wrong) is that we should see issues with the following steps (given the primary, physical replica, and logical subscriber already created in the test): 1. Ensure both logical subscriber and physical replica are caught up 2. Disable logical subscription 3. Make a catalog change on the primary (currently renaming the primary key column) 4. Vacuum pg_class 5. Ensure physical replication is caught up 6. Stop primary and promote the replica 7. Write to the changed table 8. Update subscription to point to promoted replica 9. Re-enable logical subscription I'm attaching my test as an additional patch in the series for reference. Currently I have steps 3 and 4 commented out to show that the issues in [1] occur without any attempt to trigger the catalog xmin problem. This error seems pretty significant in that it indicates a fundamental lack of test coverage (the primary stated benefit of the patch is physical failover), and it is currently a blocker to testing more deeply. Thanks, James Coleman 1: https://www.postgresql.org/message-id/TYCPR01MB684949EA7AA904EE938548C79F3A9%40TYCPR01MB6849.jpnprd01.prod.outlook.com
Attachment
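For readers following along, the catalog-change portion of the steps above could look roughly like the SQL below; the table and column names are invented, and the TAP test in the attached patch is the authoritative version.
```
-- Step 3: a catalog change on the primary (invented table name).
ALTER TABLE test_tab RENAME COLUMN id TO id_renamed;
-- Step 4: allow the old catalog row versions to be removed.
VACUUM pg_class;
-- Step 6: after the standby has caught up, promote it (run on the standby).
SELECT pg_promote();
-- Step 7: write to the changed table on the new primary.
INSERT INTO test_tab (id_renamed) VALUES (1);
```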
On Thu, Feb 24, 2022 at 12:46 AM James Coleman <jtc331@gmail.com> wrote: > I've been working on adding test coverage to prove this out, but I've > encountered the problem reported in [1]. > > My assumption, but Andres please correct me if I'm wrong, that we > should see issues with the following steps (given the primary, > physical replica, and logical subscriber already created in the test): > > 1. Ensure both logical subscriber and physical replica are caught up > 2. Disable logical subscription > 3. Make a catalog change on the primary (currently renaming the > primary key column) > 4. Vacuum pg_class > 5. Ensure physical replication is caught up > 6. Stop primary and promote the replica > 7. Write to the changed table > 8. Update subscription to point to promoted replica > 9. Re-enable logical subscription > > I'm attaching my test as an additional patch in the series for > reference. Currently I have steps 3 and 4 commented out to show that > the issues in [1] occur without any attempt to trigger the catalog > xmin problem. > > Given this error seems pretty significant in terms of indicating > fundamental lack of test coverage (the primary stated benefit of the > patch is physical failover), and it currently is a blocker to testing > more deeply. A few of my initial concerns raised at [1] are these: 1) Instead of a new LIST_SLOT command, can't we use READ_REPLICATION_SLOT (slight modifications would be needed to make it support logical replication slots and to get more information from the subscriber)? 2) How frequently is the new bg worker going to sync the slot info? How can it ensure that the latest information exists if, say, the subscriber is down/crashed before it picks up the latest slot information? 4) IIUC, the proposal works only for logical replication slots, but do you also see the need for supporting some kind of synchronization of physical replication slots as well? IMO, we need a better and consistent way for both types of replication slots. If the walsender can somehow push the slot info from the primary (for physical replication slots)/publisher (for logical replication slots) to the standby/subscribers, this will be a more consistent and simpler design. However, I'm not sure if this design is doable at all. Can anyone help clarify these? [1] https://www.postgresql.org/message-id/CALj2ACUGNGfWRtwwZwT-Y6feEP8EtOMhVTE87rdeY14mBpsRUA%40mail.gmail.com Regards, Bharath Rupireddy.
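For comparison with the proposed LIST_SLOTS, this is roughly what READ_REPLICATION_SLOT looks like today: it has to be issued on a replication connection and, as noted in point 1, it currently only reports physical slots, so it would need extending for this use. The slot name below is invented.
```
-- Issued over a replication connection,
-- e.g. psql "dbname=postgres replication=database"
READ_REPLICATION_SLOT standby1_physical;
-- Returns: slot_type | restart_lsn | restart_tli
```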
Hi, I have spent little time trying to understand the concern raised by Andres and while doing so I could think of a couple of issues which I would like to share here. Although I'm not quite sure how inline these are with the problems seen by Andres. 1) Firstly, what if we come across a situation where the failover occurs when the confirmed flush lsn has been updated on primary, but is yet to be updated on the standby? I believe this may very well be the case especially considering that standby sends sql queries to the primary to synchronize the replication slots at regular intervals and if the primary dies just after updating the confirmed flush lsn of its logical subscribers then the standby may not be able to get this information/update from the primary which means we'll probably end up having a broken logical replication slot on the new primary. 2) Secondly, if the standby goes down, the logical subscribers will stop receiving new changes from the primary as per the design of this patch OR if standby lags behind the primary for whatever reason, it will have a direct impact on logical subscribers as well. -- With Regards, Ashutosh Sharma. On Sat, Feb 19, 2022 at 3:53 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-02-11 15:28:19 +0100, Peter Eisentraut wrote: > > On 05.02.22 20:59, Andres Freund wrote: > > > On 2022-01-03 14:46:52 +0100, Peter Eisentraut wrote: > > > > From ec00dc6ab8bafefc00e9b1c78ac9348b643b8a87 Mon Sep 17 00:00:00 2001 > > > > From: Peter Eisentraut<peter@eisentraut.org> > > > > Date: Mon, 3 Jan 2022 14:43:36 +0100 > > > > Subject: [PATCH v3] Synchronize logical replication slots from primary to > > > > standby > > > I've just skimmed the patch and the related threads. As far as I can tell this > > > cannot be safely used without the conflict handling in [1], is that correct? > > > > This or similar questions have been asked a few times about this or similar > > patches, but they always come with some doubt. > > I'm certain it's a problem - the only reason I couched it was that there could > have been something clever in the patch preventing problems that I missed > because I just skimmed it. > > > > If we think so, it would be > > useful perhaps if we could come up with test cases that would demonstrate > > why that other patch/feature is necessary. (I'm not questioning it > > personally, I'm just throwing out ideas here.) > > The patch as-is just breaks one of the fundamental guarantees necessary for > logical decoding, that no rows versions can be removed that are still required > for logical decoding (signalled via catalog_xmin). So there needs to be an > explicit mechanism upholding that guarantee, but there is not right now from > what I can see. > > One piece of the referenced patchset is that it adds information about removed > catalog rows to a few WAL records, and then verifies during replay that no > record can be replayed that removes resources that are still needed. If such a > conflict exists it's dealt with as a recovery conflict. > > That itself doesn't provide prevention against removal of required, but it > provides detection. The prevention against removal can then be done using a > physical replication slot with hot standby feedback or some other mechanism > (e.g. slot syncing mechanism could maintain a "placeholder" slot on the > primary for all sync targets or something like that). 
> > Even if that infrastructure existed / was merged, the slot sync stuff would > still need some very careful logic to protect against problems due to > concurrent WAL replay and "synchronized slot" creation. But that's doable. > > Greetings, > > Andres Freund > >
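A simple way to watch for the window described in point 1 is to compare the synced slot's position on both nodes around failover time; 'sub1' is an invented slot name, and the query is plain catalog access rather than anything the patch adds.
```
-- Run on the primary and on the standby; a standby value that lags the
-- primary's at promotion time is exactly the gap being raised above.
SELECT slot_name, confirmed_flush_lsn, restart_lsn, catalog_xmin
FROM pg_replication_slots
WHERE slot_name = 'sub1';
```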
Hi, On 2/11/22 3:26 PM, Peter Eisentraut wrote: > On 10.02.22 22:47, Bruce Momjian wrote: >> On Tue, Feb 8, 2022 at 08:27:32PM +0530, Ashutosh Sharma wrote: >>>> Which means that if e.g. the standby_slot_names GUC differs from >>>> synchronize_slot_names on the physical replica, the slots >>>> synchronized on the >>>> physical replica are not going to be valid. Or if the primary drops >>>> its >>>> logical slots. >>>> >>>> >>>>> Should the redo function for the drop replication slot have the >>>>> capability >>>>> to drop it on standby and its subscribers (if any) as well? >>>> >>>> Slots are not WAL logged (and shouldn't be). >>>> >>>> I think you pretty much need the recovery conflict handling >>>> infrastructure I >>>> referenced upthread, which recognized during replay if a record has >>>> a conflict >>>> with a slot on a standby. And then ontop of that you can build >>>> something like >>>> this patch. >>>> >>> >>> OK. Understood, thanks Andres. >> >> I would love to see this feature in PG 15. Can someone explain its >> current status? Thanks. > > The way I understand it: > > 1. This feature (probably) depends on the "Minimal logical decoding on > standbys" patch. The details there aren't totally clear (to me). That > patch had some activity lately but I don't see it in a state that it's > nearing readiness. > FWIW, a proposal has been submitted in [1] to add information in the WAL records in preparation for logical slot conflict handling. [1]: https://www.postgresql.org/message-id/178cf7da-9bd7-e328-9c49-e28ac4701352@gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 11/15/22 10:02 AM, Drouvot, Bertrand wrote: > Hi, > > On 2/11/22 3:26 PM, Peter Eisentraut wrote: >> On 10.02.22 22:47, Bruce Momjian wrote: >>> On Tue, Feb 8, 2022 at 08:27:32PM +0530, Ashutosh Sharma wrote: >>>>> Which means that if e.g. the standby_slot_names GUC differs from >>>>> synchronize_slot_names on the physical replica, the slots synchronized on the >>>>> physical replica are not going to be valid. Or if the primary drops its >>>>> logical slots. >>>>> >>>>> >>>>>> Should the redo function for the drop replication slot have the capability >>>>>> to drop it on standby and its subscribers (if any) as well? >>>>> >>>>> Slots are not WAL logged (and shouldn't be). >>>>> >>>>> I think you pretty much need the recovery conflict handling infrastructure I >>>>> referenced upthread, which recognized during replay if a record has a conflict >>>>> with a slot on a standby. And then ontop of that you can build something like >>>>> this patch. >>>>> >>>> >>>> OK. Understood, thanks Andres. >>> >>> I would love to see this feature in PG 15. Can someone explain its >>> current status? Thanks. >> >> The way I understand it: >> >> 1. This feature (probably) depends on the "Minimal logical decoding on standbys" patch. The details there aren't totally clear (to me). That patch had some activity lately but I don't see it in a state that it's nearing readiness. >> > > FWIW, a proposal has been submitted in [1] to add information in the WAL records in preparation for logical slot conflict handling. > > [1]: https://www.postgresql.org/message-id/178cf7da-9bd7-e328-9c49-e28ac4701352@gmail.com > Now that the "Minimal logical decoding on standby" patch series (mentioned up-thread) has been committed, I think we can resume working on this one ("Synchronizing slots from primary to standby"). I'll work on a rebase and share it once done (unless someone already started working on a rebase). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 4/14/23 3:22 PM, Drouvot, Bertrand wrote: > Now that the "Minimal logical decoding on standby" patch series (mentioned up-thread) has been > committed, I think we can resume working on this one ("Synchronizing slots from primary to standby"). > > I'll work on a rebase and share it once done (unless someone already started working on a rebase). > Please find attached V5 (a rebase of V4 posted up-thread). In addition to the "rebasing" work, the TAP test adds a test about conflict handling (logical slot invalidation) relying on the work done in the "Minimal logical decoding on standby" patch series. I did not look more at the patch (than what's was needed for the rebase) but plan to do so. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
On Mon, Apr 17, 2023 at 7:37 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Please find attached V5 (a rebase of V4 posted up-thread). > > In addition to the "rebasing" work, the TAP test adds a test about conflict handling (logical slot invalidation) > relying on the work done in the "Minimal logical decoding on standby" patch series. > > I did not look more at the patch (than what's was needed for the rebase) but plan to do so. > Are you still planning to continue working on this? Some miscellaneous comments from going through this patch are as follows: 1. Can you please try to explain the functionality of the overall patch somewhere in the form of comments and/or a commit message? 2. It seems that the initially synchronized list of slots is only used to launch a per-database worker to synchronize all the slots corresponding to that database. If so, then why do we need to fetch all the slot-related information via the LIST_SLOTS command? 3. As mentioned in the initial email, I think it would be better to replace the LIST_SLOTS command with a SELECT query. 4. How is the limit of sync_slot workers decided? Can we document such a piece of information? Do we need a new GUC to decide the number of workers? Ideally, it would be better to avoid a new GUC; can we use any existing logical-replication-worker-related GUC? 5. Can we separate out the functionality related to standby_slot_names in a separate patch, probably the first one? I think that will make the patch easier to review. 6. In libpqrcv_list_slots(), two-phase related slot information is not retrieved. Is there a reason for the same? 7. +static void +wait_for_standby_confirmation(XLogRecPtr commit_lsn) Some comments atop this function would make it easier to review. 8. +/*------------------------------------------------------------------------- + * slotsync.c + * PostgreSQL worker for synchronizing slots to a standby from primary + * + * Copyright (c) 2016-2018, PostgreSQL Global Development Group + * The copyright notice is out-of-date. 9. Why does synchronize_one_slot() compare MyReplicationSlot->data.restart_lsn with the value of confirmed_flush_lsn passed to it? Also, why does it do this only for new slots but not existing slots? 10. Can we somehow test if the restart_lsn is advanced properly after sync? I think it is important to ensure that because otherwise after standby's promotion, the subscriber can start syncing from the wrong position. -- With Regards, Amit Kapila.
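As a sketch for point 3, the SELECT that could replace the LIST_SLOTS command might look like the query below, run over a "replication=database" connection where SQL is allowed; the exact column list is an assumption about what the sync logic needs (note point 6 about two_phase).
```
SELECT slot_name, plugin, slot_type, datoid, database, two_phase,
       restart_lsn, confirmed_flush_lsn, catalog_xmin
FROM pg_replication_slots
WHERE slot_type = 'logical';
```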
Hi, On 6/16/23 11:56 AM, Amit Kapila wrote: > On Mon, Apr 17, 2023 at 7:37 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> Please find attached V5 (a rebase of V4 posted up-thread). >> >> In addition to the "rebasing" work, the TAP test adds a test about conflict handling (logical slot invalidation) >> relying on the work done in the "Minimal logical decoding on standby" patch series. >> >> I did not look more at the patch (than what's was needed for the rebase) but plan to do so. >> > > Are you still planning to continue working on this? Yes, I think it would be great to have such a feature in core. > Some miscellaneous > comments while going through this patch are as follows? Thanks! I'll look at them and will try to come back to you by mid of next week. Also I think we need to handle the case of invalidated replication slot(s): should we drop/recreate it/them? (as the main goal is to have sync slot(s) on the standby). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Jun 19, 2023 at 11:34 AM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Also I think we need to handle the case of invalidated replication slot(s): should > we drop/recreate it/them? (as the main goal is to have sync slot(s) on the standby). > Do you intend to ask what happens to logical slots invalidated (due to, say, max_slot_wal_keep_size) on the publisher? I think those should be invalidated on the standby too. Another thought: is there a chance that the slot on the standby gets invalidated due to a conflict (say, required rows removed on the primary)? I think in such cases the slot on the primary/publisher should have been dropped/invalidated by that time. BTW, does the patch handle the drop of logical slots on the standby when the same slot is dropped on the publisher/primary? -- With Regards, Amit Kapila.
Hi, On 6/19/23 12:03 PM, Amit Kapila wrote: > On Mon, Jun 19, 2023 at 11:34 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> Also I think we need to handle the case of invalidated replication slot(s): should >> we drop/recreate it/them? (as the main goal is to have sync slot(s) on the standby). >> > > Do you intend to ask what happens to logical slots invalidated (due to > say max_slot_wal_keep_size) on publisher? I think those should be > invalidated on standby too. Agree that it should behave that way. > Another thought whether there is chance > that the slot on standby gets invalidated due to conflict (say > required rows removed on primary)? That's the scenario I had in mind when asking the question above. > I think in such cases the slot on > primary/publisher should have been dropped/invalidated by that time. I don't think so. For example, such a scenario could occur: - there is no physical slot between the standby and the primary - the standby is shut down - logical decoding on the primary is moving forward and now there is vacuum operations that will conflict on the standby - the standby starts and reports the logical slot being invalidated (while it is not on the primary) In such a case (slot valid on the primary but invalidated on the standby) then I think we could drop and recreate the invalidated slot on the standby. > BTW, does the patch handles drop of logical slots on standby when the > same slot is dropped on publisher/primary? > from what I've seen, yes it looks like it behaves that way (will look closer). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
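With the "Minimal logical decoding on standby" work now committed, the scenario described above is visible on the standby through the slot catalog; 'sub1' is an invented name, and the drop-and-recreate handling being discussed would key off something like this.
```
-- On the standby: an invalidated synchronized slot shows up as conflicting
-- (and its WAL may be reported as lost), while the same slot on the primary
-- can still be perfectly valid.
SELECT slot_name, wal_status, conflicting
FROM pg_replication_slots
WHERE slot_name = 'sub1';
```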
On Mon, Jun 19, 2023 at 9:56 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 6/19/23 12:03 PM, Amit Kapila wrote: > > On Mon, Jun 19, 2023 at 11:34 AM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> Also I think we need to handle the case of invalidated replication slot(s): should > >> we drop/recreate it/them? (as the main goal is to have sync slot(s) on the standby). > >> > > > > Do you intend to ask what happens to logical slots invalidated (due to > > say max_slot_wal_keep_size) on publisher? I think those should be > > invalidated on standby too. > > Agree that it should behave that way. > > > Another thought whether there is chance > > that the slot on standby gets invalidated due to conflict (say > > required rows removed on primary)? > > That's the scenario I had in mind when asking the question above. > > > I think in such cases the slot on > > primary/publisher should have been dropped/invalidated by that time. > > I don't think so. > > For example, such a scenario could occur: > > - there is no physical slot between the standby and the primary > - the standby is shut down > - logical decoding on the primary is moving forward and now there is vacuum > operations that will conflict on the standby > - the standby starts and reports the logical slot being invalidated (while it is > not on the primary) > > In such a case (slot valid on the primary but invalidated on the standby) then I think we > could drop and recreate the invalidated slot on the standby. > Will it be safe? Because after recreating the slot, it will reserve the new WAL location and build the snapshot based on that which might miss some important information in the snapshot. For example, to update the slot's position with new information from the primary, the patch uses pg_logical_replication_slot_advance() which means it will process all records and update the snapshot via DecodeCommit->SnapBuildCommitTxn(). The other related thing is that do we somehow need to ensure that WAL is replayed on standby before moving the slot's position to the target location received from the primary? > > BTW, does the patch handles drop of logical slots on standby when the > > same slot is dropped on publisher/primary? > > > > from what I've seen, yes it looks like it behaves that way (will look closer). > Okay, I have asked because I don't see a call to ReplicationSlotDrop() in the patch. -- With Regards, Amit Kapila.
Hi, On 6/20/23 12:22 PM, Amit Kapila wrote: > On Mon, Jun 19, 2023 at 9:56 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> In such a case (slot valid on the primary but invalidated on the standby) then I think we >> could drop and recreate the invalidated slot on the standby. >> > > Will it be safe? Because after recreating the slot, it will reserve > the new WAL location and build the snapshot based on that which might > miss some important information in the snapshot. For example, to > update the slot's position with new information from the primary, the > patch uses pg_logical_replication_slot_advance() which means it will > process all records and update the snapshot via > DecodeCommit->SnapBuildCommitTxn(). Your concern is that the slot could have been consumed on the standby? I mean, if we suppose the "synchronized" slot can't be consumed on the standby then drop/recreate such an invalidated slot would be ok? Asking, because I'm not sure we should allow consumption of a "synchronized" slot until the standby gets promoted. When the patch has been initially proposed, logical decoding from a standby was not implemented yet. > The other related thing is that do we somehow need to ensure that WAL > is replayed on standby before moving the slot's position to the target > location received from the primary? Yeah, will check if this is currently done that way in the patch proposal. >>> BTW, does the patch handles drop of logical slots on standby when the >>> same slot is dropped on publisher/primary? >>> >> >> from what I've seen, yes it looks like it behaves that way (will look closer). >> > > Okay, I have asked because I don't see a call to ReplicationSlotDrop() > in the patch. > Right. I'd need to look closer to understand how it works (for the moment the "only" thing I've done was the re-base shared up-thread). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Jun 26, 2023 at 11:15 AM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 6/20/23 12:22 PM, Amit Kapila wrote: > > On Mon, Jun 19, 2023 at 9:56 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > >> In such a case (slot valid on the primary but invalidated on the standby) then I think we > >> could drop and recreate the invalidated slot on the standby. > >> > > > > Will it be safe? Because after recreating the slot, it will reserve > > the new WAL location and build the snapshot based on that which might > > miss some important information in the snapshot. For example, to > > update the slot's position with new information from the primary, the > > patch uses pg_logical_replication_slot_advance() which means it will > > process all records and update the snapshot via > > DecodeCommit->SnapBuildCommitTxn(). > > Your concern is that the slot could have been consumed on the standby? > > I mean, if we suppose the "synchronized" slot can't be consumed on the standby then > drop/recreate such an invalidated slot would be ok? > That also may not be sufficient because as soon as the slot is invalidated/dropped, the required WAL could be removed on standby. -- With Regards, Amit Kapila.
Hi, On 6/26/23 12:34 PM, Amit Kapila wrote: > On Mon, Jun 26, 2023 at 11:15 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 6/20/23 12:22 PM, Amit Kapila wrote: >>> On Mon, Jun 19, 2023 at 9:56 PM Drouvot, Bertrand >>> <bertranddrouvot.pg@gmail.com> wrote: >> >>>> In such a case (slot valid on the primary but invalidated on the standby) then I think we >>>> could drop and recreate the invalidated slot on the standby. >>>> >>> >>> Will it be safe? Because after recreating the slot, it will reserve >>> the new WAL location and build the snapshot based on that which might >>> miss some important information in the snapshot. For example, to >>> update the slot's position with new information from the primary, the >>> patch uses pg_logical_replication_slot_advance() which means it will >>> process all records and update the snapshot via >>> DecodeCommit->SnapBuildCommitTxn(). >> >> Your concern is that the slot could have been consumed on the standby? >> >> I mean, if we suppose the "synchronized" slot can't be consumed on the standby then >> drop/recreate such an invalidated slot would be ok? >> > > That also may not be sufficient because as soon as the slot is > invalidated/dropped, the required WAL could be removed on standby. > Yeah, I think once the slot is dropped we just have to wait for the slot to be re-created on the standby according to the new synchronize_slot_names GUC. Assuming the initial slot "creation" on the standby (coming from the synchronize_slot_names usage) is working "correctly" then it should also work "correctly" once the slot is dropped. If we agree that a synchronized slot can not/should not be consumed (will implement this behavior) then I think the proposed scenario above should make sense, do you agree? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Dear Drouvot, Hi, I'm also interested in the feature. The following are my high-level comments. I did not mention detailed notation issues because the patch is not at that stage yet. And I'm very sorry that I could not follow all of the discussion. 1. I thought that we should not reuse the logical replication launcher for another purpose. A background worker should have only one task. I wanted to ask the opinions of some other people... 2. I want to confirm the reason why a new replication command is added. IIUC the launcher connects to the primary using the primary_conninfo connection string, but that establishes a physical replication connection, so SQL cannot be executed over it. Is that right? Another approach, though not smart, would be to specify the target database via a GUC. What do you think? 3. You chose the per-db worker approach; however, that makes it difficult to extend the feature to support physical slots. This may be problematic. Were there any reasons for that choice? I suspected that ReplicationSlotCreate() or the advance functions might not be usable from other databases and that may be the reason, but I'm not sure. If these operations can be done without connecting to a specific database, I think the architecture can be changed. 4. Currently the launcher establishes the connection every time. Isn't it better to reuse the same one instead? The following comments assume this configuration, maybe the most straightforward one: primary->standby |->subscriber 5. After constructing the system, I dropped the subscription on the subscriber. In this case the logical slot on the primary was removed, but that was not replicated to the standby server. Is this workload supported or not? ``` $ psql -U postgres -p $port_sub -c "DROP SUBSCRIPTION sub" NOTICE: dropped replication slot "sub" on publisher DROP SUBSCRIPTION $ psql -U postgres -p $port_primary -c "SELECT * FROM pg_replication_slots" slot_name | plugin | slot_type | datoid | database |... -----------+----------+-----------+--------+----------+... (0 rows) $ psql -U postgres -p $port_standby -c "SELECT * FROM pg_replication_slots" slot_name | plugin | slot_type | datoid | database |... -----------+----------+-----------+--------+----------+... sub | pgoutput | logical | 5 | postgres |... (1 row) ``` 6. The current approach may delay the start point of the sync. Assume that the physical replication system is created first, and then the subscriber connects to the publisher node. In this case the launcher connects to the primary earlier than the apply worker and reads the slots. At that time there are no slots on the primary, so the launcher disconnects from the primary and waits for a period (up to 3 min). Even if the apply worker then creates the slot on the publisher, the launcher on the standby cannot notice that. The synchronization may start 3 min later. I'm not sure how to fix this, or whether it is acceptable. Thoughts? Best Regards, Hayato Kuroda FUJITSU LIMITED
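One possible direction for question 2, sketched purely as an assumption and not as what the patch currently does: carry a database name in primary_conninfo so that a worker on the standby can open a database-qualified connection to the primary and run SQL there; the physical walreceiver is expected to ignore dbname, so physical replication behaviour should be unchanged. The connection details below are invented.
```
-- On the standby:
ALTER SYSTEM SET primary_conninfo =
  'host=primary.example port=5432 user=replicator dbname=postgres';
SELECT pg_reload_conf();
```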
On Wed, Jun 28, 2023 at 12:19 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 6/26/23 12:34 PM, Amit Kapila wrote: > > On Mon, Jun 26, 2023 at 11:15 AM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> On 6/20/23 12:22 PM, Amit Kapila wrote: > >>> On Mon, Jun 19, 2023 at 9:56 PM Drouvot, Bertrand > >>> <bertranddrouvot.pg@gmail.com> wrote: > >> > >>>> In such a case (slot valid on the primary but invalidated on the standby) then I think we > >>>> could drop and recreate the invalidated slot on the standby. > >>>> > >>> > >>> Will it be safe? Because after recreating the slot, it will reserve > >>> the new WAL location and build the snapshot based on that which might > >>> miss some important information in the snapshot. For example, to > >>> update the slot's position with new information from the primary, the > >>> patch uses pg_logical_replication_slot_advance() which means it will > >>> process all records and update the snapshot via > >>> DecodeCommit->SnapBuildCommitTxn(). > >> > >> Your concern is that the slot could have been consumed on the standby? > >> > >> I mean, if we suppose the "synchronized" slot can't be consumed on the standby then > >> drop/recreate such an invalidated slot would be ok? > >> > > > > That also may not be sufficient because as soon as the slot is > > invalidated/dropped, the required WAL could be removed on standby. > > > > Yeah, I think once the slot is dropped we just have to wait for the slot to > be re-created on the standby according to the new synchronize_slot_names GUC. > > Assuming the initial slot "creation" on the standby (coming from the synchronize_slot_names usage) > is working "correctly" then it should also work "correctly" once the slot is dropped. > I also think so. > If we agree that a synchronized slot can not/should not be consumed (will implement this behavior) then > I think the proposed scenario above should make sense, do you agree? > Yeah, I also can't think of a use case for this. So, we can probably disallow it and document the same. I guess if we came across a use case for this, we can rethink allowing to consume the changes from synchronized slots. -- With Regards, Amit Kapila.
On Thu, Jun 29, 2023 at 3:52 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Drouvot, > > Hi, I'm also interested in the feature. Followings are my high-level comments. > I did not mention some detailed notations because this patch is not at the stage. > And very sorry that I could not follow all of this discussions. > > 1. I thought that we should not reuse logical replication launcher for another purpose. > The background worker should have only one task. I wanted to ask opinions some other people... > IIUC, the launcher will launch the sync slot workers corresponding to slots that need sync on standby and apply workers for active subscriptions on primary (which will be a subscriber in this context). If this is correct, then do you expect to launch a separate kind of standby launcher for sync slots? -- With Regards, Amit Kapila.
Hi, On 6/29/23 12:36 PM, Amit Kapila wrote: > On Wed, Jun 28, 2023 at 12:19 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> Yeah, I think once the slot is dropped we just have to wait for the slot to >> be re-created on the standby according to the new synchronize_slot_names GUC. >> >> Assuming the initial slot "creation" on the standby (coming from the synchronize_slot_names usage) >> is working "correctly" then it should also work "correctly" once the slot is dropped. >> > > I also think so. > >> If we agree that a synchronized slot can not/should not be consumed (will implement this behavior) then >> I think the proposed scenario above should make sense, do you agree? >> > > Yeah, I also can't think of a use case for this. So, we can probably > disallow it and document the same. I guess if we came across a use > case for this, we can rethink allowing to consume the changes from > synchronized slots. Yeah agree, I'll work on a new version that deals with invalidated slot that way and that ensures that a synchronized slot can't be consumed (until the standby gets promoted). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi Kuroda-san, On 6/29/23 12:22 PM, Hayato Kuroda (Fujitsu) wrote: > Dear Drouvot, > > Hi, I'm also interested in the feature. Followings are my high-level comments. > I did not mention some detailed notations because this patch is not at the stage. > And very sorry that I could not follow all of this discussions. > Thanks for looking at it and your feedback! All I've done so far is to provide a re-based version in April of the existing patch. I'll have a closer look at the code, at your feedback and Amit's one while working on the new version that will: - take care of slot invalidation - ensure that synchronized slot cant' be consumed until the standby gets promoted as discussed up-thread. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Jun 29, 2023 at 3:52 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > 2. I want to confirm the reason why new replication command is added. > Are you referring LIST_SLOTS command? If so, I don't think that is required and instead, we can use a query to fetch the required information. > IIUC the > launcher connects to primary by using primary_conninfo connection string, but > it establishes the physical replication connection so that any SQL cannot be executed. > Is it right? Another approach not to use is to specify the target database via > GUC, whereas not smart. How do you think? > 3. You chose the per-db worker approach, however, it is difficult to extend the > feature to support physical slots. This may be problematic. Was there any > reasons for that? I doubted ReplicationSlotCreate() or advance functions might > not be used from other databases and these may be reasons, but not sure. > If these operations can do without connecting to specific database, I think > the architecture can be changed. > I think this point needs some investigation but do we want just one worker that syncs all the slots? That may lead to lag in keeping the slots up-to-date. We probably need some tests. > 4. Currently the launcher establishes the connection every time. Isn't it better > to reuse the same one instead? > I feel it is not the launcher but a separate sync slot worker that establishes the connection. It is not clear to me what exactly you have in mind. Can you please explain a bit more? > Following comments are assumed the configuration, maybe the straightfoward: > > primary->standby > |->subscriber > > 5. After constructing the system, I dropped the subscription on the subscriber. > In this case the logical slot on primary was removed, but that was not replicated > to standby server. Did you support the workload or not? > This should work. > > 6. Current approach may delay the startpoint of sync. > > Assuming that physical replication system is created first, and then the > subscriber connects to the publisher node. In this case the launcher connects to > primary earlier than the apply worker, and reads the slot. At that time there are > no slots on primary, so launcher disconnects from primary and waits a time period (up to 3min). > Even if the apply worker creates the slot on publisher, but the launcher on standby > cannot notice that. The synchronization may start 3 min later. > I feel this should be based on some GUC like 'wal_retrieve_retry_interval' which we are already using in the launcher or probably a new one if that doesn't seem to match. -- With Regards, Amit Kapila.
On Fri, Jun 16, 2023 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 17, 2023 at 7:37 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Please find attached V5 (a rebase of V4 posted up-thread). > > > > In addition to the "rebasing" work, the TAP test adds a test about conflict handling (logical slot invalidation) > > relying on the work done in the "Minimal logical decoding on standby" patch series. > > > > I did not look more at the patch (than what's was needed for the rebase) but plan to do so. > > > > Are you still planning to continue working on this? Some miscellaneous > comments while going through this patch are as follows? > > 1. Can you please try to explain the functionality of the overall > patch somewhere in the form of comments and or commit message? IIUC, there are 2 core ideas of the feature: 1) It will never let the logical replication subscribers go ahead of physical replication standbys specified in standby_slot_names. It implements this by delaying decoding of commit records on the walsenders corresponding to logical replication subscribers on the primary until all the specified standbys confirm receiving the commit LSN. 2) The physical replication standbys will synchronize data of the specified logical replication slots (in synchronize_slot_names) from the primary, creating the logical replication slots if necessary. Since the logical replication subscribers will never go out of physical replication standbys, the standbys can safely synchronize the slots and keep the data necessary for subscribers to connect to it and work seamlessly even after a failover. If my understanding is right, I have few thoughts here: 1. All the logical walsenders are delayed on the primary - per wait_for_standby_confirmation() despite the user being interested in only a few of them via synchronize_slot_names. Shouldn't the delay be for just the slots specified in synchronize_slot_names? 2. I think we can split the patch like this - 0001 can be the logical walsenders delaying decoding on the primary unless standbys confirm, 0002 standby synchronizing the logical slots. 3. I think we need to change the GUC standby_slot_names to better reflect what it is used for - wait_for_replication_slot_names or wait_for_ 4. It allows specifying logical slots in standby_slot_names, meaning, it can disallow logical slots getting ahead of other logical slots specified in standby_slot_names. Should we allow this case with the thinking that if there's anyone using logical replication for failover (well, will anybody do that in production?). 5. Similar to above, it allows specifying physical slots in synchronize_slot_names. Should we disallow? I'm attaching the v6 patch, a rebased version of v5. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
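Core idea 1 can be eyeballed on the primary with two ordinary monitoring queries: the logical slots' confirmed_flush_lsn should never be ahead of the flush position reported for the standbys listed in standby_slot_names. This is just an illustration of the invariant, not part of the patch.
```
-- Positions handed out to logical subscribers:
SELECT slot_name, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';

-- Positions confirmed by the physical standbys:
SELECT application_name, flush_lsn
FROM pg_stat_replication;
```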
On Sun, Jul 9, 2023 at 1:01 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Fri, Jun 16, 2023 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > 1. Can you please try to explain the functionality of the overall > > patch somewhere in the form of comments and or commit message? > > IIUC, there are 2 core ideas of the feature: > > 1) It will never let the logical replication subscribers go ahead of > physical replication standbys specified in standby_slot_names. It > implements this by delaying decoding of commit records on the > walsenders corresponding to logical replication subscribers on the > primary until all the specified standbys confirm receiving the commit > LSN. > > 2) The physical replication standbys will synchronize data of the > specified logical replication slots (in synchronize_slot_names) from > the primary, creating the logical replication slots if necessary. > Since the logical replication subscribers will never go out of > physical replication standbys, the standbys can safely synchronize the > slots and keep the data necessary for subscribers to connect to it and > work seamlessly even after a failover. > > If my understanding is right, > This matches my understanding as well. > I have few thoughts here: > > 1. All the logical walsenders are delayed on the primary - per > wait_for_standby_confirmation() despite the user being interested in > only a few of them via synchronize_slot_names. Shouldn't the delay be > for just the slots specified in synchronize_slot_names? > 2. I think we can split the patch like this - 0001 can be the logical > walsenders delaying decoding on the primary unless standbys confirm, > 0002 standby synchronizing the logical slots. > Agreed with the above two points. > 3. I think we need to change the GUC standby_slot_names to better > reflect what it is used for - wait_for_replication_slot_names or > wait_for_ > I feel at this stage we can focus on getting the design and implementation correct. We can improve GUC names later once we are confident that the functionality is correct. > 4. It allows specifying logical slots in standby_slot_names, meaning, > it can disallow logical slots getting ahead of other logical slots > specified in standby_slot_names. Should we allow this case with the > thinking that if there's anyone using logical replication for failover > (well, will anybody do that in production?). > I think on the contrary we should prohibit this case. We can always extend this functionality later. > 5. Similar to above, it allows specifying physical slots in > synchronize_slot_names. Should we disallow? > We should prohibit that as well. -- With Regards, Amit Kapila.
On 14.04.23 15:22, Drouvot, Bertrand wrote: > Now that the "Minimal logical decoding on standby" patch series > (mentioned up-thread) has been > committed, I think we can resume working on this one ("Synchronizing > slots from primary to standby"). Maybe you have seen this extension that was released a few months ago: https://github.com/EnterpriseDB/pg_failover_slots . This contains the same functionality packaged as an extension. Maybe this can give some ideas about how this should behave and what options to provide etc. Note that pg_failover_slots doesn't use logical decoding on standby, because that would be too slow in practice. Earlier in this thread we had some discussion about which of the two approaches was preferred. Anyway, that's what's out there.
On Fri, Jun 16, 2023 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 17, 2023 at 7:37 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > 3. As mentioned in the initial email, I think it would be better to > replace LIST_SLOTS command with a SELECT query. > I had a look at this thread. I am interested in working on this and can spend some time addressing the comments given here. I tried to replace the LIST_SLOTS command with a SELECT query. Attached are a rebased patch and a PoC patch for LIST_SLOTS removal. For LIST_SLOTS cmd removal, below are the points where more analysis is needed. 1) I could not use the exposed libpqwalreceiver functions walrcv_exec/libpqrcv_exec in LogicalRepLauncher to run a select query instead of the LIST_SLOTS cmd. This is because libpqrcv_exec() needs a database connection, but in LogicalRepLauncher we do not have any (MyDatabaseId is not set), so the API gives an error. Thus, to make it work for the time being, I used 'libpqrcv_PQexec' which is not dependent upon a database connection. But since it is not exposed "yet" to other layers, I temporarily added the new code to libpqwalreceiver.c itself. In fact I reused the existing function wrapper libpqrcv_list_slots and changed the functionality to get the info using a select query rather than list_slots. 2) While using the connect API walrcv_connect/libpqrcv_connect(), we need to tell it whether it is for logical or physical replication. In the existing patch, where we were using the LIST_SLOTS cmd, we have this connection made with logical=false. But now, since we need to run a select query to get the same info, using a connection with logical=false gives an error on the primary while executing the select query: "ERROR: cannot execute SQL commands in WAL sender for physical replication". And thus in ApplyLauncherStartSlotSync(), I have changed the connect API to use logical=true for the time being. I noticed that in the existing patch, it was using logical=false in ApplyLauncherStartSlotSync() while logical=true in synchronize_slots(). Possibly due to the same fact: a logical=false connection will not allow synchronize_slots() to run a select query on the primary, while it worked for ApplyLauncherStartSlotSync() as it was running the list_slots cmd instead of a select query. I am exploring these points further to figure out which one is the correct way to deal with these. Meanwhile, I'm posting this WIP patch for early feedback. I will try to address other comments as well in the next versions. thanks Shveta
Attachment
On Thu, Jul 20, 2023 at 5:05 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Jun 16, 2023 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Apr 17, 2023 at 7:37 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > 3. As mentioned in the initial email, I think it would be better to > > replace LIST_SLOTS command with a SELECT query. > > > > I had a look at this thread. I am interested to work on this and can > spend some time addressing the comments given here. Thanks for your interest. Coincidentally, I started to split the patch into 2 recently - 0001 making the specified logical wal senders wait for specified standbys to ack, 0002 synchronize logical slots. I think I'll have these patches ready by early next week. For 0002, I'll consider your latest changes having LIST_SLOTS removed. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Jul 21, 2023 at 11:36 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Jul 20, 2023 at 5:05 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Fri, Jun 16, 2023 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Apr 17, 2023 at 7:37 PM Drouvot, Bertrand > > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > > > 3. As mentioned in the initial email, I think it would be better to > > > replace LIST_SLOTS command with a SELECT query. > > > > > > > I had a look at this thread. I am interested to work on this and can > > spend some time addressing the comments given here. > > Thanks for your interest. Coincidentally, I started to split the patch > into 2 recently - 0001 making the specified logical wal senders wait > for specified standbys to ack, 0002 synchronize logical slots. I think > I'll have these patches ready by early next week. For 0002, I'll > consider your latest changes having LIST_SLOTS removed. > Thanks Bharat for letting us know. It is okay to split the patch, it may definitely help to understand the modules better but shall we take a step back and try to reevaluate the design first before moving to other tasks? I analyzed more on the issues stated in [1] for replacing LIST_SLOTS with SELECT query. On rethinking, it might not be a good idea to replace this cmd with SELECT in Launcher code-path, because we do not have any database-connection in launcher and 'select' query needs one and thus we need to supply dbname to it. We may take the primary-dbname info in a new GUC from users, but I feel retaining LIST cmd is a better idea over adding a new GUC. But I do not see a reason why we should get complete replication-slots info in LIST command. The logic in Launcher is to get distinct database-ids info out of all the slots and start worker per database-id. Since we are only interested in database-id, I think we should change LIST_SLOTS to something like LIST_DBID_FOR_LOGICAL_SLOTS. This new command may get only unique database-ids for all the logical-slots (or the ones mentioned in synchronize_slot_names) from primary. By doing so, we can avoid huge network traffic in cases where the number of replication slots is quite high considering that max_replication_slots can go upto MAX_BACKENDS:2^18-1. So I plan to make this change where we retain LIST cmd over SELECT query but make this cmd's output restricted to only database-Ids. Thoughts? Secondly, I was thinking if the design proposed in the patch is the best one. No doubt, it is the most simplistic design and thus may prove very efficient for scenarios where we have a reasonable number of workers starting and each one actively busy in slots-synchronisation, handling almost equivalent load. But since we are starting one worker per database id, it may not be most efficient for cases where not all the databases are actively being used. We may have some workers (started for databases not in use) just waking up and sending queries to primary and then going back to sleep and in the process generating network traffic, while others may be heavily loaded to deal with large numbers of active slots for a heavily loaded database. I feel the design should be adaptable to load conditions i.e. if we have more number of actively used slots, then we should have more workers spawned to handle it and when the work is less then the number of spawned workers should be less. 
I have not thought it through thoroughly yet and am also not sure whether it will actually come out as a better one, but this or other such designs should be considered before we start fixing bugs in this patch. Kindly let me know if there are already discussions around this that I might have missed. Any feedback is appreciated. [1]: https://www.postgresql.org/message-id/CAJpy0uCMNz3XERP-Vzp-7rHFztJgc6d%2BxsmUVCqsxWPkZvQz0Q%40mail.gmail.com thanks Shveta
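As a point of comparison for the LIST_DBID_FOR_LOGICAL_SLOTS idea above, here is a minimal sketch of the catalog query such a command could boil down to on the primary; the slot-name filter and the slot names themselves are hypothetical, and with synchronize_slot_names='*' the filter would simply be dropped:

SELECT DISTINCT datoid, database
FROM pg_replication_slots
WHERE slot_type = 'logical'
  AND slot_name IN ('sub1_slot', 'sub2_slot');

One row per database that owns at least one slot of interest is all the launcher needs in order to decide how many per-database workers to start.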
On Fri, Jul 21, 2023 at 5:16 PM shveta malik <shveta.malik@gmail.com> wrote: > > Thanks Bharat for letting us know. It is okay to split the patch, it > may definitely help to understand the modules better but shall we take > a step back and try to reevaluate the design first before moving to > other tasks? Agree that design comes first. FWIW, I'm attaching the v9 patch set that I have with me. It can't be a perfect patch set unless the design is finalized. > I analyzed more on the issues stated in [1] for replacing LIST_SLOTS > with SELECT query. On rethinking, it might not be a good idea to > replace this cmd with SELECT in Launcher code-path I think there are open fundamental design aspects to settle before optimizing LIST_SLOTS; see below. I'm sure we can come back to this later. > Secondly, I was thinking if the design proposed in the patch is the > best one. No doubt, it is the most simplistic design and thus may > .......... Any feedback is appreciated. Here are my thoughts about this feature: Current design: 1. On primary, never allow walsenders associated with logical replication slots to go ahead of physical standbys that are candidates for future primary after failover. This enables subscribers to connect to new primary after failover. 2. On all candidate standbys, periodically sync logical slots from primary (creating the slots if necessary) with one slot sync worker per logical slot. Important considerations: 1. Does this design guarantee the row versions required by subscribers aren't removed on candidate standbys as raised here - https://www.postgresql.org/message-id/20220218222319.yozkbhren7vkjbi5%40alap3.anarazel.de? It seems safe with logical decoding on standbys feature. Also, a test-case from upthread is already in patch sets (in v9 too) https://www.postgresql.org/message-id/CAAaqYe9FdKODa1a9n%3Dqj%2Bw3NiB9gkwvhRHhcJNginuYYRCnLrg%40mail.gmail.com. However, we need to verify the use cases extensively. 2. All candidate standbys will start one slot sync worker per logical slot which might not be scalable. Is having one (or a few more - not necessarily one for each logical slot) worker for all logical slots enough? It seems safe to have one worker for all logical slots - it's not a problem even if the worker takes a bit of time to get to sync a logical slot on a candidate standby, because the standby is ensured to retain all the WAL and row versions required to decode and send to the logical slots. 3. Indefinite waiting of logical walsenders for candidate standbys may not be a good idea. Is having a timeout for logical walsenders a good idea? A problem with a timeout is that it can make logical slots unusable after failover. 4. All candidate standbys retain WAL required by logical slots. The amount of WAL retained may be huge if there's replication lag with the logical replication subscribers. This turns out to be a typical problem with replication, so there's nothing much this feature can do to prevent WAL file accumulation except for asking one to monitor replication lag and WAL file growth. 5. Logical subscribers' replication lag will depend on all candidate standbys' replication lag. If the candidate standbys are far behind the primary while the logical subscribers are close, the logical subscribers will still see replication lag. There's nothing much this feature can do to prevent this except for calling it out in documentation. 6.
This feature might need to prevent the GUCs from deviating on the primary and the candidate standbys - there's no point in syncing a logical slot on candidate standbys if the logical walsender related to it on the primary isn't keeping itself behind all the candidate standbys. If preventing this from happening proves to be tough, calling it out in documentation to keep the GUCs the same is a good start. 7. There are some important review comments provided upthread as far as this design and patches are concerned - https://www.postgresql.org/message-id/20220207204557.74mgbhowydjco4mh%40alap3.anarazel.de and https://www.postgresql.org/message-id/20220207203222.22aktwxrt3fcllru%40alap3.anarazel.de. I'm sure we can come back to these once the design is clear. Please feel free to add to the list if I'm missing anything. Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
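On points 4 and 5 above, the monitoring itself is nothing specific to this feature; a minimal sketch of the kind of query an operator could run on the primary to watch per-slot WAL retention (a standby would substitute pg_last_wal_replay_lsn() for pg_current_wal_lsn()):

SELECT slot_name, active, wal_status, safe_wal_size,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical';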
Attachment
On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Fri, Jul 21, 2023 at 5:16 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > Thanks Bharat for letting us know. It is okay to split the patch, it > > may definitely help to understand the modules better but shall we take > > a step back and try to reevaluate the design first before moving to > > other tasks? > > Agree that design comes first. FWIW, I'm attaching the v9 patch set > that I have with me. It can't be a perfect patch set unless the design > is finalized. > > > I analyzed more on the issues stated in [1] for replacing LIST_SLOTS > > with SELECT query. On rethinking, it might not be a good idea to > > replace this cmd with SELECT in Launcher code-path > > I think there are open fundamental design aspects, before optimizing > LIST_SLOTS, see below. I'm sure we can come back to this later. > > > Secondly, I was thinking if the design proposed in the patch is the > > best one. No doubt, it is the most simplistic design and thus may > > .......... Any feedback is appreciated. > > Here are my thoughts about this feature: > > Current design: > > 1. On primary, never allow walsenders associated with logical > replication slots to go ahead of physical standbys that are candidates > for future primary after failover. This enables subscribers to connect > to new primary after failover. > 2. On all candidate standbys, periodically sync logical slots from > primary (creating the slots if necessary) with one slot sync worker > per logical slot. > > Important considerations: > > 1. Does this design guarantee the row versions required by subscribers > aren't removed on candidate standbys as raised here - > https://www.postgresql.org/message-id/20220218222319.yozkbhren7vkjbi5%40alap3.anarazel.de? > > It seems safe with logical decoding on standbys feature. Also, a > test-case from upthread is already in patch sets (in v9 too) > https://www.postgresql.org/message-id/CAAaqYe9FdKODa1a9n%3Dqj%2Bw3NiB9gkwvhRHhcJNginuYYRCnLrg%40mail.gmail.com. > However, we need to verify the use cases extensively. > Agreed. > 2. All candidate standbys will start one slot sync worker per logical > slot which might not be scalable. > Yeah, that doesn't sound like a good idea but IIRC, the proposed patch is using one worker per database (for all slots corresponding to a database). > Is having one (or a few more - not > necessarily one for each logical slot) worker for all logical slots > enough? > I guess for a large number of slots the is a possibility of a large gap in syncing the slots which probably means we need to retain corresponding WAL for a much longer time on the primary. If we can prove that the gap won't be large enough to matter then this would be probably worth considering otherwise, I think we should find a way to scale the number of workers to avoid the large gap. -- With Regards, Amit Kapila.
On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Fri, Jul 21, 2023 at 5:16 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > Thanks Bharat for letting us know. It is okay to split the patch, it > > > may definitely help to understand the modules better but shall we take > > > a step back and try to reevaluate the design first before moving to > > > other tasks? > > > > Agree that design comes first. FWIW, I'm attaching the v9 patch set > > that I have with me. It can't be a perfect patch set unless the design > > is finalized. > > > > > I analyzed more on the issues stated in [1] for replacing LIST_SLOTS > > > with SELECT query. On rethinking, it might not be a good idea to > > > replace this cmd with SELECT in Launcher code-path > > > > I think there are open fundamental design aspects, before optimizing > > LIST_SLOTS, see below. I'm sure we can come back to this later. > > > > > Secondly, I was thinking if the design proposed in the patch is the > > > best one. No doubt, it is the most simplistic design and thus may > > > .......... Any feedback is appreciated. > > > > Here are my thoughts about this feature: > > > > Current design: > > > > 1. On primary, never allow walsenders associated with logical > > replication slots to go ahead of physical standbys that are candidates > > for future primary after failover. This enables subscribers to connect > > to new primary after failover. > > 2. On all candidate standbys, periodically sync logical slots from > > primary (creating the slots if necessary) with one slot sync worker > > per logical slot. > > > > Important considerations: > > > > 1. Does this design guarantee the row versions required by subscribers > > aren't removed on candidate standbys as raised here - > > https://www.postgresql.org/message-id/20220218222319.yozkbhren7vkjbi5%40alap3.anarazel.de? > > > > It seems safe with logical decoding on standbys feature. Also, a > > test-case from upthread is already in patch sets (in v9 too) > > https://www.postgresql.org/message-id/CAAaqYe9FdKODa1a9n%3Dqj%2Bw3NiB9gkwvhRHhcJNginuYYRCnLrg%40mail.gmail.com. > > However, we need to verify the use cases extensively. > > > > Agreed. > > > 2. All candidate standbys will start one slot sync worker per logical > > slot which might not be scalable. > > > > Yeah, that doesn't sound like a good idea but IIRC, the proposed patch > is using one worker per database (for all slots corresponding to a > database). > > > Is having one (or a few more - not > > necessarily one for each logical slot) worker for all logical slots > > enough? > > > > I guess for a large number of slots the is a possibility of a large > gap in syncing the slots which probably means we need to retain > corresponding WAL for a much longer time on the primary. If we can > prove that the gap won't be large enough to matter then this would be > probably worth considering otherwise, I think we should find a way to > scale the number of workers to avoid the large gap. > How about this: 1) On standby, spawn 1 worker per database in the start (as it is doing currently). 2) Maintain statistics on activity against each primary's database on standby by any means. Could be by maintaining 'last_synced_time' and 'last_activity_seen time'. The last_synced_time is updated every time we sync/recheck slots for that particular database. 
The 'last_activity_seen_time' changes only if we see any slot on that database where confirmed_flush or restart_lsn has actually changed from what was maintained already. 3) If at any moment we find that 'last_synced_time' - 'last_activity_seen_time' goes beyond a threshold, that means the DB is not currently active; add it to the list of inactive DBs. 4) The launcher, on the other hand, is always checking whether it needs to spawn an extra worker for any new DB. It will additionally check whether the number of inactive databases (maintained on the standby) has gone above some threshold; if so, it brings down the workers for those and starts a common worker which takes care of all such inactive databases (i.e. merges them all into one), while workers for active databases remain as they are (one per DB). Each worker maintains the list of DBs it is responsible for. 5) If, in the list of these inactive databases, we again find an active database using the above logic, then the launcher will spawn a separate worker for it. Pros: fewer workers on the standby, in line with the load on the primary; less poking of the primary by the standby, i.e. the standby will send queries to get slot info for all inactive DBs in one run instead of each worker sending such queries separately. Cons: we might see spawning and freeing of workers more frequently. Please let me know your thoughts on this. thanks Shveta
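One way to picture the 'last_activity_seen_time' bookkeeping sketched above: a per-database worker could run a probe like the following against the primary each cycle (the database name is hypothetical) and mark the database active only if restart_lsn or confirmed_flush_lsn moved for any of its slots since the previous cycle:

SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical'
  AND database = 'appdb';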
On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Fri, Jul 21, 2023 at 5:16 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > Thanks Bharat for letting us know. It is okay to split the patch, it > > may definitely help to understand the modules better but shall we take > > a step back and try to reevaluate the design first before moving to > > other tasks? > > Agree that design comes first. FWIW, I'm attaching the v9 patch set > that I have with me. It can't be a perfect patch set unless the design > is finalized. > Thanks for the patch and for summarizing all the issues here. I was going through the patch and found that we now need to maintain 'synchronize_slot_names' on both primary and standby, unlike the old way where it was maintained only on the standby. I am aware of the problem in the earlier implementation where each logical walsender/slot needed to wait for all standbys to catch up before sending changes to logical subscribers, even though that particular slot did not even need to be synced by any of the standbys. Now it is more restrictive. But is this 'synchronize_slot_names' per standby? If there are multiple standbys, each having different 'synchronize_slot_names' requirements, then how is the primary going to keep track of that? Please let me know if the scenario where standbys have different 'synchronize_slot_names' can never arise. thanks Shveta
On Wed, Jul 26, 2023 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > Is having one (or a few more - not > > > necessarily one for each logical slot) worker for all logical slots > > > enough? > > > > > > > I guess for a large number of slots the is a possibility of a large > > gap in syncing the slots which probably means we need to retain > > corresponding WAL for a much longer time on the primary. If we can > > prove that the gap won't be large enough to matter then this would be > > probably worth considering otherwise, I think we should find a way to > > scale the number of workers to avoid the large gap. > > > > How about this: > > 1) On standby, spawn 1 worker per database in the start (as it is > doing currently). > > 2) Maintain statistics on activity against each primary's database on > standby by any means. Could be by maintaining 'last_synced_time' and > 'last_activity_seen time'. The last_synced_time is updated every time > we sync/recheck slots for that particular database. The > 'last_activity_seen_time' changes only if we get any slot on that > database where actually confirmed_flush or say restart_lsn has changed > from what was maintained already. > > 3) If at any moment, we find that 'last_synced_time' - > 'last_activity_seen' goes beyond a threshold, that means that DB is > not active currently. Add it to list of inactive DB > I think we should also increase the next_sync_time if in current sync, there is no update. > 4) Launcher on the other hand is always checking if it needs to spawn > any other extra worker for any new DB. It will additionally check if > number of inactive databases (maintained on standby) has gone higher > (> some threshold), then it brings down the workers for those and > starts a common worker which takes care of all such inactive databases > (or merge all in 1), while workers for active databases remain as such > (i.e. one per db). Each worker maintains the list of DBs which it is > responsible for. > > 5) If in the list of these inactive databases, we again find any > active database using the above logic, then the launcher will spawn a > separate worker for that. > I wonder if we anyway some sort of design like this because we shouldn't allow to spawn as many workers as the number of databases. There has to be some existing or new GUC like max_sync_slot_workers which decided the number of workers. Overall, this sounds to be a more workload-adaptive approach as compared to the current one. -- With Regards, Amit Kapila.
On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jul 26, 2023 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy > > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > > Is having one (or a few more - not > > > > necessarily one for each logical slot) worker for all logical slots > > > > enough? > > > > > > > > > > I guess for a large number of slots the is a possibility of a large > > > gap in syncing the slots which probably means we need to retain > > > corresponding WAL for a much longer time on the primary. If we can > > > prove that the gap won't be large enough to matter then this would be > > > probably worth considering otherwise, I think we should find a way to > > > scale the number of workers to avoid the large gap. > > > > > > > How about this: > > > > 1) On standby, spawn 1 worker per database in the start (as it is > > doing currently). > > > > 2) Maintain statistics on activity against each primary's database on > > standby by any means. Could be by maintaining 'last_synced_time' and > > 'last_activity_seen time'. The last_synced_time is updated every time > > we sync/recheck slots for that particular database. The > > 'last_activity_seen_time' changes only if we get any slot on that > > database where actually confirmed_flush or say restart_lsn has changed > > from what was maintained already. > > > > 3) If at any moment, we find that 'last_synced_time' - > > 'last_activity_seen' goes beyond a threshold, that means that DB is > > not active currently. Add it to list of inactive DB > > > > I think we should also increase the next_sync_time if in current sync, > there is no update. +1 > > > 4) Launcher on the other hand is always checking if it needs to spawn > > any other extra worker for any new DB. It will additionally check if > > number of inactive databases (maintained on standby) has gone higher > > (> some threshold), then it brings down the workers for those and > > starts a common worker which takes care of all such inactive databases > > (or merge all in 1), while workers for active databases remain as such > > (i.e. one per db). Each worker maintains the list of DBs which it is > > responsible for. > > > > 5) If in the list of these inactive databases, we again find any > > active database using the above logic, then the launcher will spawn a > > separate worker for that. > > > > I wonder if we anyway some sort of design like this because we > shouldn't allow to spawn as many workers as the number of databases. > There has to be some existing or new GUC like max_sync_slot_workers > which decided the number of workers. > Currently it does not have any such GUC for sync-slot workers. It mainly uses the logical-rep-worker framework for the sync-slot worker part and thus it relies on 'max_logical_replication_workers' GUC. Also it errors out if 'max_replication_slots' is set to zero. I think it is not the correct way of doing things for sync-slot. We can have a new GUC (max_sync_slot_workers) as you suggested and if the number of databases < max_sync_slot_workers, then we can start 1 worker per dbid, else divide the work equally among the max sync-workers possible. And for inactive database cases, we can increase the next_sync_time rather than starting a special worker to handle all the inactive databases. Thoughts? thanks Shveta
On Wed, Jul 26, 2023 at 5:55 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Fri, Jul 21, 2023 at 5:16 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > Thanks Bharat for letting us know. It is okay to split the patch, it > > > may definitely help to understand the modules better but shall we take > > > a step back and try to reevaluate the design first before moving to > > > other tasks? > > > > Agree that design comes first. FWIW, I'm attaching the v9 patch set > > that I have with me. It can't be a perfect patch set unless the design > > is finalized. > > > > Thanks for the patch and summarizing all the issues here. I was going > through the patch and found that now we need to maintain > 'synchronize_slot_names' on both primary and standby unlike the old > way where it was maintained only on standby. I am aware of the problem > in earlier implementation where each logical walsender/slot needed to > wait for all standbys to catch-up before sending changes to logical > subscribers even though that particular slot is not even needed to be > synced by any of the standbys. Now it is more restrictive. But now, is > this 'synchronize_slot_names' per standby? If there are multiple > standbys each having different 'synchronize_slot_names' requirements, > then how primary is going to keep track of that? > Please let me know if that scenario can never arise where standbys can > have different 'synchronize_slot_names'. > Can we think of sending 'synchronize_slot_names' from standby to primary at the time of connection? I think we also need to ensure that if the user changes this value then we need to restart the sync slot worker to allow this information to be sent to the primary. We do something similar for apply worker in logical replication. -- With Regards, Amit Kapila.
On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > 2. All candidate standbys will start one slot sync worker per logical > > slot which might not be scalable. > > Yeah, that doesn't sound like a good idea but IIRC, the proposed patch > is using one worker per database (for all slots corresponding to a > database). Right. It's based on one worker for each database. > > Is having one (or a few more - not > > necessarily one for each logical slot) worker for all logical slots > > enough? > > I guess for a large number of slots the is a possibility of a large > gap in syncing the slots which probably means we need to retain > corresponding WAL for a much longer time on the primary. If we can > prove that the gap won't be large enough to matter then this would be > probably worth considering otherwise, I think we should find a way to > scale the number of workers to avoid the large gap. I think the gap is largely determined by the time taken to advance each slot and the amount of WAL that each logical slot moves ahead on the primary. I've measured the time it takes for pg_logical_replication_slot_advance with different amounts of WAL on my system. It took 2595ms/5091ms/31238ms to advance the slot by 3.7GB/7.3GB/13GB respectively. To put things into perspective here, imagine there are 3 logical slots to sync for a single slot sync worker and each of them is in need of advancing the slot by 3.7GB/7.3GB/13GB of WAL. The slot sync worker gets to slot 1 again after 2595ms+5091ms+31238ms (~40sec), gets to slot 2 again after the advance time of slot 1 plus the amount of WAL that the slot has moved ahead on the primary during those 40sec, gets to slot 3 again after the advance time of slots 1 and 2 plus the amount of WAL that the slot has moved ahead on the primary, and so on. If WAL generation on the primary is pretty fast, and if the logical slots move pretty fast on the primary, the time it takes for a single sync worker to sync a slot can increase. Now, let's think about what happens if there's a large gap, IOW, a logical slot on the standby is behind the logical slot on the primary by X amount of WAL. The standby needs to retain more WAL for sure. IIUC, the primary doesn't need to retain the WAL required for a logical slot on the standby, no? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
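For anyone wanting to reproduce a measurement like the one above: pg_logical_replication_slot_advance is the internal routine behind the SQL-callable pg_replication_slot_advance(), so a rough sketch is simply (the slot name is hypothetical, and the target LSN just catches the slot up to the current insert position):

-- with \timing enabled in psql:
SELECT pg_replication_slot_advance('my_logical_slot', pg_current_wal_lsn());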
On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I wonder if we anyway some sort of design like this because we > shouldn't allow to spawn as many workers as the number of databases. > There has to be some existing or new GUC like max_sync_slot_workers > which decided the number of workers. It seems reasonable to not have one slot sync worker for each database. IMV, the slot sync workers must be generic and independently manageable - generic in the sense that, given a database and primary conninfo, each worker must sync all the slots related to the given database; independently manageable in the sense that there is a separate GUC for the number of sync workers, and they are launchable directly and dynamically by the logical replication launcher. The work division amongst the sync workers can be simple: the logical replication launcher builds a shared memory structure based on the number of slots to sync and starts the sync workers dynamically, and each sync worker picks {dboid, slot name, conninfo} from the shared memory, syncs it and proceeds with other slots. Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Jul 27, 2023 at 12:13 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jul 26, 2023 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy > > > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > > > > Is having one (or a few more - not > > > > > necessarily one for each logical slot) worker for all logical slots > > > > > enough? > > > > > > > > > > > > > I guess for a large number of slots the is a possibility of a large > > > > gap in syncing the slots which probably means we need to retain > > > > corresponding WAL for a much longer time on the primary. If we can > > > > prove that the gap won't be large enough to matter then this would be > > > > probably worth considering otherwise, I think we should find a way to > > > > scale the number of workers to avoid the large gap. > > > > > > > > > > How about this: > > > > > > 1) On standby, spawn 1 worker per database in the start (as it is > > > doing currently). > > > > > > 2) Maintain statistics on activity against each primary's database on > > > standby by any means. Could be by maintaining 'last_synced_time' and > > > 'last_activity_seen time'. The last_synced_time is updated every time > > > we sync/recheck slots for that particular database. The > > > 'last_activity_seen_time' changes only if we get any slot on that > > > database where actually confirmed_flush or say restart_lsn has changed > > > from what was maintained already. > > > > > > 3) If at any moment, we find that 'last_synced_time' - > > > 'last_activity_seen' goes beyond a threshold, that means that DB is > > > not active currently. Add it to list of inactive DB > > > > > > > I think we should also increase the next_sync_time if in current sync, > > there is no update. > > +1 > > > > > > 4) Launcher on the other hand is always checking if it needs to spawn > > > any other extra worker for any new DB. It will additionally check if > > > number of inactive databases (maintained on standby) has gone higher > > > (> some threshold), then it brings down the workers for those and > > > starts a common worker which takes care of all such inactive databases > > > (or merge all in 1), while workers for active databases remain as such > > > (i.e. one per db). Each worker maintains the list of DBs which it is > > > responsible for. > > > > > > 5) If in the list of these inactive databases, we again find any > > > active database using the above logic, then the launcher will spawn a > > > separate worker for that. > > > > > > > I wonder if we anyway some sort of design like this because we > > shouldn't allow to spawn as many workers as the number of databases. > > There has to be some existing or new GUC like max_sync_slot_workers > > which decided the number of workers. > > > > Currently it does not have any such GUC for sync-slot workers. It > mainly uses the logical-rep-worker framework for the sync-slot worker > part and thus it relies on 'max_logical_replication_workers' GUC. Also > it errors out if 'max_replication_slots' is set to zero. I think it is > not the correct way of doing things for sync-slot. 
We can have a new > GUC (max_sync_slot_workers) as you suggested and if the number of > databases < max_sync_slot_workers, then we can start 1 worker per > dbid, else divide the work equally among the max sync-workers > possible. And for inactive database cases, we can increase the > next_sync_time rather than starting a special worker to handle all the > inactive databases. Thoughts? > Attaching the PoC patch (0003), which attempts to implement the basic infrastructure for the suggested design. Rebased the existing patches (0001 and 0002) as well. This patch adds a new GUC max_slot_sync_workers; the default and max values are kept at 2 and 50 respectively for this PoC patch. Now the replication launcher divides the work equally among these many slot-sync workers. Let us say there are multiple slots on the primary belonging to 10 DBs and the new GUC on the standby is set at the default value of 2; then each worker on the standby will manage 5 DBs individually and will keep syncing the slots for them. If a new DB is found by the replication launcher, it will assign this new DB to the worker currently handling the minimum number of DBs (or the first worker in case of an equal count), and that worker will pick up the new DB the next time it tries to sync the slots. I have kept the changes in a separate patch (0003) for ease of review. Since this is just a PoC patch, many things are yet to be done appropriately; I will cover those in the next versions. thanks Shveta
Attachment
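A minimal sketch of how a standby could be configured under the PoC described above; max_slot_sync_workers and synchronize_slot_names are the patch's GUCs, the slot names are hypothetical, and whether a reload suffices (rather than a restart) depends on the GUC context the patch finally settles on:

ALTER SYSTEM SET max_slot_sync_workers = 3;
ALTER SYSTEM SET synchronize_slot_names = 'sub1_slot, sub2_slot';
SELECT pg_reload_conf();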
On Fri, Jul 28, 2023 at 8:54 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I wonder if we anyway some sort of design like this because we > > shouldn't allow to spawn as many workers as the number of databases. > > There has to be some existing or new GUC like max_sync_slot_workers > > which decided the number of workers. > > It seems reasonable to not have one slot sync worker for each > database. IMV, the slot sync workers must be generic and independently > manageable - generic in the sense that given a database and primary > conninfo, each worker must sync all the slots related to the given > database, independently mangeable in the sense that separate GUC for > number of sync workers, launchable directly by logical replication > launcher dynamically. yes agreed. The patch v10-0003 attempts to do the same. > The work division amongst the sync workers can > be simple, the logical replication launcher builds a shared memory > structure based on number of slots to sync and starts the sync workers > dynamically, and each sync worker picks {dboid, slot name, conninfo} > from the shared memory, syncs it and proceeds with other slots. > Do you mean the logical replication launcher builds a shared memory structure based on the number of 'dbs' to sync as I understood from your initial comment? thanks Shveta
On Tue, Aug 1, 2023 at 5:01 PM shveta malik <shveta.malik@gmail.com> wrote: > > > The work division amongst the sync workers can > > be simple, the logical replication launcher builds a shared memory > > structure based on number of slots to sync and starts the sync workers > > dynamically, and each sync worker picks {dboid, slot name, conninfo} > > from the shared memory, syncs it and proceeds with other slots. > > Do you mean the logical replication launcher builds a shared memory > structure based > on the number of 'dbs' to sync as I understood from your initial comment? Yes. I haven't looked at the 0003 patch posted upthread. However, the standby must do the following at a minimum: - Make GUCs synchronize_slot_names and max_slot_sync_workers of PGC_POSTMASTER type needing postmaster restart when changed as they affect the number of slot sync workers. - LR (logical replication) launcher connects to primary to fetch the logical slots specified in synchronize_slot_names. This is a one-time task. - LR launcher prepares a dynamic shared memory (created via dsm_create) with some state like locks for IPC and an array of {slot_name, dboid_associated_with_slot, is_sync_in_progress} - maximum number of elements in the array is the number of slots specified in synchronize_slot_names. This is a one-time task. - LR launcher decides the *best* number of slot sync workers - (based on some perf numbers) it can just launch, say, one worker per 2 or 4 or 8 etc. slots. - Each slot sync worker then picks up a slot from the DSM, connects to primary using primary conn info, syncs it, and moves to another slot. Not having the capability of on-demand stop/launch of slot sync workers makes the above design simple IMO. Thoughts? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Aug 3, 2023 at 12:28 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Tue, Aug 1, 2023 at 5:01 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > The work division amongst the sync workers can > > > be simple, the logical replication launcher builds a shared memory > > > structure based on number of slots to sync and starts the sync workers > > > dynamically, and each sync worker picks {dboid, slot name, conninfo} > > > from the shared memory, syncs it and proceeds with other slots. > > > > Do you mean the logical replication launcher builds a shared memory > > structure based > > on the number of 'dbs' to sync as I understood from your initial comment? > > Yes. I haven't looked at the 0003 patch posted upthread. However, the > standby must do the following at a minimum: > > - Make GUCs synchronize_slot_names and max_slot_sync_workers of > PGC_POSTMASTER type needing postmaster restart when changed as they > affect the number of slot sync workers. I agree that max_slot_sync_workers should be allowed to change only during startup, but I strongly feel that synchronize_slot_names should be runtime modifiable. We should give that flexibility to the user. > - LR (logical replication) launcher connects to primary to fetch the > logical slots specified in synchronize_slot_names. This is a one-time > task. If synchronize_slot_names='*', we need to fetch the slot info at regular intervals even if it is not runtime modifiable. For the runtime-modifiable case, we obviously need to refetch it at regular intervals. > - LR launcher prepares a dynamic shared memory (created via > dsm_create) with some state like locks for IPC and an array of > {slot_name, dboid_associated_with_slot, is_sync_in_progress} - maximum > number of elements in the array is the number of slots specified in > synchronize_slot_names. This is a one-time task. Yes, we need dynamic shared memory, but it is not a one-time allocation. If it were a one-time allocation, then there would have been no need for DSM; plain shared memory allocation would have been enough. It is not a one-time allocation in any of the designs: if it is a slot-based design, the slots may keep varying for the '*' case, and if it is a DB-based design, the number of DBs may grow beyond the initially allocated memory, so we may need reallocation and a relaunch of the worker, and thus the need for DSM. > - LR launcher decides the *best* number of slot sync workers - (based > on some perf numbers) it can just launch, say, one worker per 2 or 4 > or 8 etc. slots. > - Each slot sync worker then picks up a slot from the DSM, connects to > primary using primary conn info, syncs it, and moves to another slot. > The design based on slots, i.e. the launcher dividing the slots among the available workers, could prove beneficial over DB-based division for a case where the number of slots per DB varies widely and we end up assigning all the DBs with fewer slots to one worker while all the heavily loaded DBs go to another. But other than this, I see a lot of pain points: 1) Since we are going to do slot-based syncing, query construction will be complex. We will have a query with a long 'where' clause: where slot_name in (slot1, slot2, slot3, ...). 2) The number of pings to the primary will be higher, as we are pinging it per slot instead of per DB. So the information which we could have fetched collectively in one query (if it was DB-based) is now split into multiple queries, assuming that there could be cases where slots belonging to the same DB end up being split among different workers.
3) If the number of slots < the max number of workers, how are we going to assign the workers? One slot per worker or all in one worker? If it is one slot per worker, it will again not be that efficient as it will result in more network traffic. This needs more thought and a design that can vary case by case. > Not having the capability of on-demand stop/launch of slot sync > workers makes the above design simple IMO. > We anyway need to relaunch workers when the DSM is reallocated in case the DBs (or say slots) exceed some initial allocation limit. thanks Shveta
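To make the query-construction point concrete, a rough sketch of the two fetch shapes being compared (slot and database names are hypothetical): under slot-based division each worker would build something like the first query with a potentially long IN list, while under DB-based division a worker covers all slots of its databases with the second:

SELECT slot_name, database, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical'
  AND slot_name IN ('slot_1', 'slot_7', 'slot_42');

SELECT slot_name, database, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical'
  AND database IN ('db1', 'db2');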
Hi, On 7/24/23 4:32 AM, Bharath Rupireddy wrote: > On Fri, Jul 21, 2023 at 5:16 PM shveta malik <shveta.malik@gmail.com> wrote: > Here are my thoughts about this feature: Thanks for looking at it! > > Important considerations: > > 1. Does this design guarantee the row versions required by subscribers > aren't removed on candidate standbys as raised here - > https://www.postgresql.org/message-id/20220218222319.yozkbhren7vkjbi5%40alap3.anarazel.de? > > It seems safe with logical decoding on standbys feature. Also, a > test-case from upthread is already in patch sets (in v9 too) > https://www.postgresql.org/message-id/CAAaqYe9FdKODa1a9n%3Dqj%2Bw3NiB9gkwvhRHhcJNginuYYRCnLrg%40mail.gmail.com. > However, we need to verify the use cases extensively. Agree. We also discussed up-thread that we'd have to drop any "sync" slots if they are invalidated. And they should be re-created based on the synchronize_slot_names. > Please feel free to add the list if I'm missing anything. > We'd also have to ensure that "sync" slots can't be consumed on the standby (this has been discussed up-thread). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 7/28/23 4:39 PM, Bharath Rupireddy wrote: > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >>> 2. All candidate standbys will start one slot sync worker per logical >>> slot which might not be scalable. >> >> Yeah, that doesn't sound like a good idea but IIRC, the proposed patch >> is using one worker per database (for all slots corresponding to a >> database). > > Right. It's based on one worker for each database. > >>> Is having one (or a few more - not >>> necessarily one for each logical slot) worker for all logical slots >>> enough? >> >> I guess for a large number of slots the is a possibility of a large >> gap in syncing the slots which probably means we need to retain >> corresponding WAL for a much longer time on the primary. If we can >> prove that the gap won't be large enough to matter then this would be >> probably worth considering otherwise, I think we should find a way to >> scale the number of workers to avoid the large gap. > > I think the gap is largely determined by the time taken to advance > each slot and the amount of WAL that each logical slot moves ahead on > primary. Sorry to be late, but I gave a second thought and I wonder if we really need this design. (i.e start a logical replication background worker on the standby to sync the slots). Wouldn't that be simpler to "just" update the sync slots "metadata" as the https://github.com/EnterpriseDB/pg_failover_slots module (mentioned by Peter up-thread) is doing? (making use of LogicalConfirmReceivedLocation(), LogicalIncreaseXminForSlot() and LogicalIncreaseRestartDecodingForSlot(), If I read synchronize_one_slot() correctly). > I've measured the time it takes for > pg_logical_replication_slot_advance with different amounts WAL on my > system. It took 2595ms/5091ms/31238ms to advance the slot by > 3.7GB/7.3GB/13GB respectively. To put things into perspective here, > imagine there are 3 logical slots to sync for a single slot sync > worker and each of them are in need of advancing the slot by > 3.7GB/7.3GB/13GB of WAL. The slot sync worker gets to slot 1 again > after 2595ms+5091ms+31238ms (~40sec), gets to slot 2 again after > advance time of slot 1 with amount of WAL that the slot has moved > ahead on primary during 40sec, gets to slot 3 again after advance time > of slot 1 and slot 2 with amount of WAL that the slot has moved ahead > on primary and so on. If WAL generation on the primary is pretty fast, > and if the logical slot moves pretty fast on the primary, the time it > takes for a single sync worker to sync a slot can increase. That would be way "faster" and we would probably not need to worry that much about the number of "sync" workers (if it/they "just" has/have to sync slot's "metadata") as proposed above. Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 7/28/23 4:39 PM, Bharath Rupireddy wrote: > > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >>> 2. All candidate standbys will start one slot sync worker per logical > >>> slot which might not be scalable. > >> > >> Yeah, that doesn't sound like a good idea but IIRC, the proposed patch > >> is using one worker per database (for all slots corresponding to a > >> database). > > > > Right. It's based on one worker for each database. > > > >>> Is having one (or a few more - not > >>> necessarily one for each logical slot) worker for all logical slots > >>> enough? > >> > >> I guess for a large number of slots the is a possibility of a large > >> gap in syncing the slots which probably means we need to retain > >> corresponding WAL for a much longer time on the primary. If we can > >> prove that the gap won't be large enough to matter then this would be > >> probably worth considering otherwise, I think we should find a way to > >> scale the number of workers to avoid the large gap. > > > > I think the gap is largely determined by the time taken to advance > > each slot and the amount of WAL that each logical slot moves ahead on > > primary. > > Sorry to be late, but I gave a second thought and I wonder if we really need this design. > (i.e start a logical replication background worker on the standby to sync the slots). > > Wouldn't that be simpler to "just" update the sync slots "metadata" > as the https://github.com/EnterpriseDB/pg_failover_slots module (mentioned by Peter > up-thread) is doing? > (making use of LogicalConfirmReceivedLocation(), LogicalIncreaseXminForSlot() > and LogicalIncreaseRestartDecodingForSlot(), If I read synchronize_one_slot() correctly). > Agreed. It would be simpler to just update the metadata. I think you have not got chance to review the latest posted patch ('v10-0003') yet, it does the same. But I do not quite get it as in how can we do it w/o starting a background worker? Even the failover-slots extension starts one background worker. The question here is how many background workers we need to have. Will one be sufficient or do we need one per db (as done earlier by the original patches in this thread) or are we good with dividing work among some limited number of workers? I feel syncing all slots in one worker may increase the lag between subsequent syncs for a particular slot and if the number of slots are huge, the chances of losing the slot-data is more in case of failure. Starting one worker per db also might not be that efficient as it will increase load on the system (both in terms of background worker and network traffic) especially for a case where the number of dbs are more. Thus starting max 'n' number of workers where 'n' is decided by GUC and dividing the work/DBs among these looks a better option to me. Please see the discussion in and around the email at [1] [1]: https://www.postgresql.org/message-id/CAJpy0uCT%2BnpL4eUvCWiV_MBEri9ixcUgJVDdsBCJSqLd0oD1fQ%40mail.gmail.com > > I've measured the time it takes for > > pg_logical_replication_slot_advance with different amounts WAL on my > > system. It took 2595ms/5091ms/31238ms to advance the slot by > > 3.7GB/7.3GB/13GB respectively. To put things into perspective here, > > imagine there are 3 logical slots to sync for a single slot sync > > worker and each of them are in need of advancing the slot by > > 3.7GB/7.3GB/13GB of WAL. 
The slot sync worker gets to slot 1 again > > after 2595ms+5091ms+31238ms (~40sec), gets to slot 2 again after > > advance time of slot 1 with amount of WAL that the slot has moved > > ahead on primary during 40sec, gets to slot 3 again after advance time > > of slot 1 and slot 2 with amount of WAL that the slot has moved ahead > > on primary and so on. If WAL generation on the primary is pretty fast, > > and if the logical slot moves pretty fast on the primary, the time it > > takes for a single sync worker to sync a slot can increase. > > That would be way "faster" and we would probably not need to > worry that much about the number of "sync" workers (if it/they "just" has/have > to sync slot's "metadata") as proposed above. > Agreed, we need not to worry about delay due to pg_logical_replication_slot_advance if we are only going to update a few simple things using the function calls as mentioned above. thanks Shveta
Hi, On 8/4/23 1:32 PM, shveta malik wrote: > On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> On 7/28/23 4:39 PM, Bharath Rupireddy wrote: >> Sorry to be late, but I gave a second thought and I wonder if we really need this design. >> (i.e start a logical replication background worker on the standby to sync the slots). >> >> Wouldn't that be simpler to "just" update the sync slots "metadata" >> as the https://github.com/EnterpriseDB/pg_failover_slots module (mentioned by Peter >> up-thread) is doing? >> (making use of LogicalConfirmReceivedLocation(), LogicalIncreaseXminForSlot() >> and LogicalIncreaseRestartDecodingForSlot(), If I read synchronize_one_slot() correctly). >> > > Agreed. It would be simpler to just update the metadata. I think you > have not got chance to review the latest posted patch ('v10-0003') > yet, it does the same. Thanks for the feedback! Yeah, I did not look at v10 in details and was looking at the email thread only. Indeed, I now see that 0003 does update the metadata in local_slot_advance(), that's great! > > But I do not quite get it as in how can we do it w/o starting a > background worker? Yeah, agree that we still need background workers. What I meant was to avoid to use "logical replication background worker" (aka through logicalrep_worker_launch()) to sync the slots. > The question here is how many background workers we > need to have. Will one be sufficient or do we need one per db (as done > earlier by the original patches in this thread) or are we good with > dividing work among some limited number of workers? > > I feel syncing all slots in one worker may increase the lag between > subsequent syncs for a particular slot and if the number of slots are > huge, the chances of losing the slot-data is more in case of failure. > Starting one worker per db also might not be that efficient as it will > increase load on the system (both in terms of background worker and > network traffic) especially for a case where the number of dbs are > more. Thus starting max 'n' number of workers where 'n' is decided by > GUC and dividing the work/DBs among these looks a better option to me. > Please see the discussion in and around the email at [1] > > [1]: https://www.postgresql.org/message-id/CAJpy0uCT%2BnpL4eUvCWiV_MBEri9ixcUgJVDdsBCJSqLd0oD1fQ%40mail.gmail.com Thanks for the link! If I read the email thread correctly, this discussion was before V10 (which is the first version making use of LogicalConfirmReceivedLocation(), LogicalIncreaseXminForSlot(), LogicalIncreaseRestartDecodingForSlot()) means before the metadata sync only has been implemented. While I agree that the approach to split the sync load among workers with the new max_slot_sync_workers GUC seems reasonable without the sync only feature (pre V10), I'm not sure that with the metadata sync only in place the extra complexity to manage multiple sync workers is needed. Maybe we should start some tests/benchmark with only one sync worker to get numbers and start from there? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Aug 1, 2023 at 4:52 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Jul 27, 2023 at 12:13 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Jul 26, 2023 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy > > > > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > > > > > > Is having one (or a few more - not > > > > > > necessarily one for each logical slot) worker for all logical slots > > > > > > enough? > > > > > > > > > > > > > > > > I guess for a large number of slots the is a possibility of a large > > > > > gap in syncing the slots which probably means we need to retain > > > > > corresponding WAL for a much longer time on the primary. If we can > > > > > prove that the gap won't be large enough to matter then this would be > > > > > probably worth considering otherwise, I think we should find a way to > > > > > scale the number of workers to avoid the large gap. > > > > > > > > > > > > > How about this: > > > > > > > > 1) On standby, spawn 1 worker per database in the start (as it is > > > > doing currently). > > > > > > > > 2) Maintain statistics on activity against each primary's database on > > > > standby by any means. Could be by maintaining 'last_synced_time' and > > > > 'last_activity_seen time'. The last_synced_time is updated every time > > > > we sync/recheck slots for that particular database. The > > > > 'last_activity_seen_time' changes only if we get any slot on that > > > > database where actually confirmed_flush or say restart_lsn has changed > > > > from what was maintained already. > > > > > > > > 3) If at any moment, we find that 'last_synced_time' - > > > > 'last_activity_seen' goes beyond a threshold, that means that DB is > > > > not active currently. Add it to list of inactive DB > > > > > > > > > > I think we should also increase the next_sync_time if in current sync, > > > there is no update. > > > > +1 > > > > > > > > > 4) Launcher on the other hand is always checking if it needs to spawn > > > > any other extra worker for any new DB. It will additionally check if > > > > number of inactive databases (maintained on standby) has gone higher > > > > (> some threshold), then it brings down the workers for those and > > > > starts a common worker which takes care of all such inactive databases > > > > (or merge all in 1), while workers for active databases remain as such > > > > (i.e. one per db). Each worker maintains the list of DBs which it is > > > > responsible for. > > > > > > > > 5) If in the list of these inactive databases, we again find any > > > > active database using the above logic, then the launcher will spawn a > > > > separate worker for that. > > > > > > > > > > I wonder if we anyway some sort of design like this because we > > > shouldn't allow to spawn as many workers as the number of databases. > > > There has to be some existing or new GUC like max_sync_slot_workers > > > which decided the number of workers. > > > > > > > Currently it does not have any such GUC for sync-slot workers. It > > mainly uses the logical-rep-worker framework for the sync-slot worker > > part and thus it relies on 'max_logical_replication_workers' GUC. Also > > it errors out if 'max_replication_slots' is set to zero. 
I think it is > > not the correct way of doing things for sync-slot. We can have a new > > GUC (max_sync_slot_workers) as you suggested and if the number of > > databases < max_sync_slot_workers, then we can start 1 worker per > > dbid, else divide the work equally among the max sync-workers > > possible. And for inactive database cases, we can increase the > > next_sync_time rather than starting a special worker to handle all the > > inactive databases. Thoughts? > > > > Attaching the PoC patch (0003) where attempts to implement the basic > infrastructure for the suggested design. Rebased the existing patches > (0001 and 0002) as well. > > This patch adds a new GUC max_slot_sync_workers; the default and max > value is kept at 2 and 50 respectively for this PoC patch. Now the > replication launcher divides the work equally among these many > slot-sync workers. Let us say there are multiple slots on primary > belonging to 10 DBs and say new GUC on standby is set at default value > of 2, then each worker on standby will manage 5 dbs individually and > will keep on synching the slots for them. If a new DB is found by > replication launcher, it will assign this new db to the worker > handling the minimum number of dbs currently (or first worker in case > of equal count) and that worker will pick up the new db the next time > it tries to sync the slots. > I have kept the changes in separate patches (003) for ease of review. > Since this is just a PoC patch, many things are yet to be done > appropriately, will cover those in next versions. > Attaching a new set of patches which attempts to implement the changes below: 1) The logical replication launcher now gets only the list of unique dbids belonging to slots in 'synchronize_slot_names' instead of getting all the slot data. This has been implemented using the new command LIST_DBID_FOR_LOGICAL_SLOTS. 2) The launcher assigns the DBs to sync slot workers. Each worker will have its own dbids list. Since the upper limit of this dbid count is not known, it is now allocated using DSM. The launcher initially allocates memory to hold 100 dbids for each worker. If this limit is exhausted, it reallocates this memory with the size incremented by 100 and relaunches the worker. This relaunched worker will continue to have the existing set of DBs it was managing earlier plus the new DB. Both these changes are in patch v11-0002. The earlier patch v10-0003 is now merged into 0002 itself. More on the standby-side design of this PoC patch can be found in the commit message of v11-0002. Thanks Ajin for working on (1). thanks Shveta
Attachment
On Mon, Aug 7, 2023 at 3:17 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 8/4/23 1:32 PM, shveta malik wrote: > > On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> On 7/28/23 4:39 PM, Bharath Rupireddy wrote: > > >> Sorry to be late, but I gave a second thought and I wonder if we really need this design. > >> (i.e start a logical replication background worker on the standby to sync the slots). > >> > >> Wouldn't that be simpler to "just" update the sync slots "metadata" > >> as the https://github.com/EnterpriseDB/pg_failover_slots module (mentioned by Peter > >> up-thread) is doing? > >> (making use of LogicalConfirmReceivedLocation(), LogicalIncreaseXminForSlot() > >> and LogicalIncreaseRestartDecodingForSlot(), If I read synchronize_one_slot() correctly). > >> > > > > Agreed. It would be simpler to just update the metadata. I think you > > have not got chance to review the latest posted patch ('v10-0003') > > yet, it does the same. > > Thanks for the feedback! Yeah, I did not look at v10 in details and was > looking at the email thread only. > > Indeed, I now see that 0003 does update the metadata in local_slot_advance(), > that's great! > > > > > But I do not quite get it as in how can we do it w/o starting a > > background worker? > > Yeah, agree that we still need background workers. > What I meant was to avoid to use "logical replication background worker" > (aka through logicalrep_worker_launch()) to sync the slots. > Agreed. That is why in v10,v11 patches, we have different infra for sync-slot worker i.e. it is not relying on "logical replication background worker" anymore. > > The question here is how many background workers we > > need to have. Will one be sufficient or do we need one per db (as done > > earlier by the original patches in this thread) or are we good with > > dividing work among some limited number of workers? > > > > I feel syncing all slots in one worker may increase the lag between > > subsequent syncs for a particular slot and if the number of slots are > > huge, the chances of losing the slot-data is more in case of failure. > > Starting one worker per db also might not be that efficient as it will > > increase load on the system (both in terms of background worker and > > network traffic) especially for a case where the number of dbs are > > more. Thus starting max 'n' number of workers where 'n' is decided by > > GUC and dividing the work/DBs among these looks a better option to me. > > Please see the discussion in and around the email at [1] > > > > [1]: https://www.postgresql.org/message-id/CAJpy0uCT%2BnpL4eUvCWiV_MBEri9ixcUgJVDdsBCJSqLd0oD1fQ%40mail.gmail.com > > Thanks for the link! If I read the email thread correctly, this discussion > was before V10 (which is the first version making use of LogicalConfirmReceivedLocation(), > LogicalIncreaseXminForSlot(), LogicalIncreaseRestartDecodingForSlot()) means > before the metadata sync only has been implemented. > > While I agree that the approach to split the sync load among workers with the new > max_slot_sync_workers GUC seems reasonable without the sync only feature (pre V10), > I'm not sure that with the metadata sync only in place the extra complexity to manage multiple > sync workers is needed. > > Maybe we should start some tests/benchmark with only one sync worker to get numbers > and start from there? Yes, we can do that performance testing to figure out the difference between the two modes. 
I will try to get some statistics on this. thanks Shveta
Hi, On 8/8/23 7:01 AM, shveta malik wrote: > On Mon, Aug 7, 2023 at 3:17 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> Hi, >> >> On 8/4/23 1:32 PM, shveta malik wrote: >>> On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand >>> <bertranddrouvot.pg@gmail.com> wrote: >>>> On 7/28/23 4:39 PM, Bharath Rupireddy wrote: >> > > Agreed. That is why in v10,v11 patches, we have different infra for > sync-slot worker i.e. it is not relying on "logical replication > background worker" anymore. yeah saw that, looks like the right way to go to me. >> Maybe we should start some tests/benchmark with only one sync worker to get numbers >> and start from there? > > Yes, we can do that performance testing to figure out the difference > between the two modes. I will try to get some statistics on this. > Great, thanks! Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Aug 8, 2023 at 11:11 AM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 8/8/23 7:01 AM, shveta malik wrote: > > On Mon, Aug 7, 2023 at 3:17 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> Hi, > >> > >> On 8/4/23 1:32 PM, shveta malik wrote: > >>> On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand > >>> <bertranddrouvot.pg@gmail.com> wrote: > >>>> On 7/28/23 4:39 PM, Bharath Rupireddy wrote: > >> > > > > Agreed. That is why in v10,v11 patches, we have different infra for > > sync-slot worker i.e. it is not relying on "logical replication > > background worker" anymore. > > yeah saw that, looks like the right way to go to me. > > >> Maybe we should start some tests/benchmark with only one sync worker to get numbers > >> and start from there? > > > > Yes, we can do that performance testing to figure out the difference > > between the two modes. I will try to get some statistics on this. > > > > Great, thanks! > We (myself and Ajin) performed the tests to compute the lag in standby slots as compared to primary slots with different number of slot-sync workers configured. 3 DBs were created, each with 30 tables and each table having one logical-pub/sub configured. So this made a total of 90 logical replication slots to be synced. Then the workload was run for aprox 10 mins. During this workload, at regular intervals, primary and standby slots' lsns were captured (from pg_replication_slots) and compared. At each capture, the intent was to know how much is each standby's slot lagging behind corresponding primary's slot by taking the distance between confirmed_flush_lsn of primary and standby slot. Then we took the average (integer value) of this distance over the span of 10 min workload and this is what we got: With max_slot_sync_workers=1, average-lag = 42290.3563 With max_slot_sync_workers=2, average-lag = 24585.1421 With max_slot_sync_workers=3, average-lag = 14964.9215 This shows that more workers have better chances to keep logical replication slots in sync for this case. Another statistics if it interests you is, we ran a frequency test as well (this by changing code, unit test sort of) to figure out the 'total number of times synchronization done' with different number of sync-slots workers configured. Same 3 DBs setup with each DB having 30 logical replication slots. With 'max_slot_sync_workers' set at 1, 2 and 3; total number of times synchronization done was 15874, 20205 and 23414 respectively. Note: this is not on the same machine where we captured lsn-gap data, it is on a little less efficient machine but gives almost the same picture. Next we are planning to capture this data for a lesser number of slots like 10,30,50 etc. It may happen that the benefit of multi-workers over single workers in such cases could be less, but let's have the data to verify that. Thanks Ajin for jointly working on this. thanks Shveta
On Mon, Aug 14, 2023 at 3:22 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Aug 8, 2023 at 11:11 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On 8/8/23 7:01 AM, shveta malik wrote: > > > On Mon, Aug 7, 2023 at 3:17 PM Drouvot, Bertrand > > > <bertranddrouvot.pg@gmail.com> wrote: > > >> > > >> Hi, > > >> > > >> On 8/4/23 1:32 PM, shveta malik wrote: > > >>> On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand > > >>> <bertranddrouvot.pg@gmail.com> wrote: > > >>>> On 7/28/23 4:39 PM, Bharath Rupireddy wrote: > > >> > > > > > > Agreed. That is why in v10,v11 patches, we have different infra for > > > sync-slot worker i.e. it is not relying on "logical replication > > > background worker" anymore. > > > > yeah saw that, looks like the right way to go to me. > > > > >> Maybe we should start some tests/benchmark with only one sync worker to get numbers > > >> and start from there? > > > > > > Yes, we can do that performance testing to figure out the difference > > > between the two modes. I will try to get some statistics on this. > > > > > > > Great, thanks! > > > > We (myself and Ajin) performed the tests to compute the lag in standby > slots as compared to primary slots with different number of slot-sync > workers configured. > > 3 DBs were created, each with 30 tables and each table having one > logical-pub/sub configured. So this made a total of 90 logical > replication slots to be synced. Then the workload was run for aprox 10 > mins. During this workload, at regular intervals, primary and standby > slots' lsns were captured (from pg_replication_slots) and compared. At > each capture, the intent was to know how much is each standby's slot > lagging behind corresponding primary's slot by taking the distance > between confirmed_flush_lsn of primary and standby slot. Then we took > the average (integer value) of this distance over the span of 10 min > workload and this is what we got: > I have attached the scripts for schema-setup, running workload and capturing lag. Please go through Readme for details. thanks Shveta
On Mon, Aug 14, 2023 at 8:38 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Aug 14, 2023 at 3:22 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Tue, Aug 8, 2023 at 11:11 AM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > Hi, > > > > > > On 8/8/23 7:01 AM, shveta malik wrote: > > > > On Mon, Aug 7, 2023 at 3:17 PM Drouvot, Bertrand > > > > <bertranddrouvot.pg@gmail.com> wrote: > > > >> > > > >> Hi, > > > >> > > > >> On 8/4/23 1:32 PM, shveta malik wrote: > > > >>> On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand > > > >>> <bertranddrouvot.pg@gmail.com> wrote: > > > >>>> On 7/28/23 4:39 PM, Bharath Rupireddy wrote: > > > >> > > > > > > > > Agreed. That is why in v10,v11 patches, we have different infra for > > > > sync-slot worker i.e. it is not relying on "logical replication > > > > background worker" anymore. > > > > > > yeah saw that, looks like the right way to go to me. > > > > > > >> Maybe we should start some tests/benchmark with only one sync worker to get numbers > > > >> and start from there? > > > > > > > > Yes, we can do that performance testing to figure out the difference > > > > between the two modes. I will try to get some statistics on this. > > > > > > > > > > Great, thanks! > > > > > > > We (myself and Ajin) performed the tests to compute the lag in standby > > slots as compared to primary slots with different number of slot-sync > > workers configured. > > > > 3 DBs were created, each with 30 tables and each table having one > > logical-pub/sub configured. So this made a total of 90 logical > > replication slots to be synced. Then the workload was run for aprox 10 > > mins. During this workload, at regular intervals, primary and standby > > slots' lsns were captured (from pg_replication_slots) and compared. At > > each capture, the intent was to know how much is each standby's slot > > lagging behind corresponding primary's slot by taking the distance > > between confirmed_flush_lsn of primary and standby slot. Then we took > > the average (integer value) of this distance over the span of 10 min > > workload and this is what we got: > > > > I have attached the scripts for schema-setup, running workload and > capturing lag. Please go through Readme for details. > > I did some more tests for 10,20 and 40 slots to calculate the average lsn distance between slots, comparing 1 worker and 3 workers. My results are as follows: 10 slots 1 worker: 5529.75527426 (average lsn distance between primary and standby per slot) 3 worker: 2224.57589134 20 slots 1 worker: 9592.87234043 3 worker: 3194.62933333 40 slots 1 worker: 20566.0933333 3 worker: 7885.80952381 90 slots 1 worker: 36706.8405797 3 worker: 10236.6393162 regards, Ajin Cherian Fujitsu Australia
Hi, On 8/14/23 11:52 AM, shveta malik wrote: > > We (myself and Ajin) performed the tests to compute the lag in standby > slots as compared to primary slots with different number of slot-sync > workers configured. > Thanks! > 3 DBs were created, each with 30 tables and each table having one > logical-pub/sub configured. So this made a total of 90 logical > replication slots to be synced. Then the workload was run for aprox 10 > mins. During this workload, at regular intervals, primary and standby > slots' lsns were captured (from pg_replication_slots) and compared. At > each capture, the intent was to know how much is each standby's slot > lagging behind corresponding primary's slot by taking the distance > between confirmed_flush_lsn of primary and standby slot. Then we took > the average (integer value) of this distance over the span of 10 min > workload Thanks for the explanations, make sense to me. > and this is what we got: > > With max_slot_sync_workers=1, average-lag = 42290.3563 > With max_slot_sync_workers=2, average-lag = 24585.1421 > With max_slot_sync_workers=3, average-lag = 14964.9215 > > This shows that more workers have better chances to keep logical > replication slots in sync for this case. > Agree. > Another statistics if it interests you is, we ran a frequency test as > well (this by changing code, unit test sort of) to figure out the > 'total number of times synchronization done' with different number of > sync-slots workers configured. Same 3 DBs setup with each DB having 30 > logical replication slots. With 'max_slot_sync_workers' set at 1, 2 > and 3; total number of times synchronization done was 15874, 20205 and > 23414 respectively. Note: this is not on the same machine where we > captured lsn-gap data, it is on a little less efficient machine but > gives almost the same picture > > Next we are planning to capture this data for a lesser number of slots > like 10,30,50 etc. It may happen that the benefit of multi-workers > over single workers in such cases could be less, but let's have the > data to verify that. > Thanks a lot for those numbers and for the testing! Do you think it would make sense to also get the number of using the pg_failover_slots module? (and compare the pg_failover_slots numbers with the "one worker" case here). Idea is to check if the patch does introduce some overhead as compare to pg_failover_slots. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Aug 17, 2023 at 11:44 AM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 8/14/23 11:52 AM, shveta malik wrote: > > > > > We (myself and Ajin) performed the tests to compute the lag in standby > > slots as compared to primary slots with different number of slot-sync > > workers configured. > > > > Thanks! > > > 3 DBs were created, each with 30 tables and each table having one > > logical-pub/sub configured. So this made a total of 90 logical > > replication slots to be synced. Then the workload was run for aprox 10 > > mins. During this workload, at regular intervals, primary and standby > > slots' lsns were captured (from pg_replication_slots) and compared. At > > each capture, the intent was to know how much is each standby's slot > > lagging behind corresponding primary's slot by taking the distance > > between confirmed_flush_lsn of primary and standby slot. Then we took > > the average (integer value) of this distance over the span of 10 min > > workload > > Thanks for the explanations, make sense to me. > > > and this is what we got: > > > > With max_slot_sync_workers=1, average-lag = 42290.3563 > > With max_slot_sync_workers=2, average-lag = 24585.1421 > > With max_slot_sync_workers=3, average-lag = 14964.9215 > > > > This shows that more workers have better chances to keep logical > > replication slots in sync for this case. > > > > Agree. > > > Another statistics if it interests you is, we ran a frequency test as > > well (this by changing code, unit test sort of) to figure out the > > 'total number of times synchronization done' with different number of > > sync-slots workers configured. Same 3 DBs setup with each DB having 30 > > logical replication slots. With 'max_slot_sync_workers' set at 1, 2 > > and 3; total number of times synchronization done was 15874, 20205 and > > 23414 respectively. Note: this is not on the same machine where we > > captured lsn-gap data, it is on a little less efficient machine but > > gives almost the same picture > > > > Next we are planning to capture this data for a lesser number of slots > > like 10,30,50 etc. It may happen that the benefit of multi-workers > > over single workers in such cases could be less, but let's have the > > data to verify that. > > > > Thanks a lot for those numbers and for the testing! > > Do you think it would make sense to also get the number of using > the pg_failover_slots module? (and compare the pg_failover_slots numbers with the > "one worker" case here). Idea is to check if the patch does introduce > some overhead as compare to pg_failover_slots. > Yes, definitely. We will work on that and share the numbers soon. thanks Shveta
On Thu, Aug 17, 2023 at 11:55 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Aug 17, 2023 at 11:44 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On 8/14/23 11:52 AM, shveta malik wrote: > > > > > > > > We (myself and Ajin) performed the tests to compute the lag in standby > > > slots as compared to primary slots with different number of slot-sync > > > workers configured. > > > > > > > Thanks! > > > > > 3 DBs were created, each with 30 tables and each table having one > > > logical-pub/sub configured. So this made a total of 90 logical > > > replication slots to be synced. Then the workload was run for aprox 10 > > > mins. During this workload, at regular intervals, primary and standby > > > slots' lsns were captured (from pg_replication_slots) and compared. At > > > each capture, the intent was to know how much is each standby's slot > > > lagging behind corresponding primary's slot by taking the distance > > > between confirmed_flush_lsn of primary and standby slot. Then we took > > > the average (integer value) of this distance over the span of 10 min > > > workload > > > > Thanks for the explanations, make sense to me. > > > > > and this is what we got: > > > > > > With max_slot_sync_workers=1, average-lag = 42290.3563 > > > With max_slot_sync_workers=2, average-lag = 24585.1421 > > > With max_slot_sync_workers=3, average-lag = 14964.9215 > > > > > > This shows that more workers have better chances to keep logical > > > replication slots in sync for this case. > > > > > > > Agree. > > > > > Another statistics if it interests you is, we ran a frequency test as > > > well (this by changing code, unit test sort of) to figure out the > > > 'total number of times synchronization done' with different number of > > > sync-slots workers configured. Same 3 DBs setup with each DB having 30 > > > logical replication slots. With 'max_slot_sync_workers' set at 1, 2 > > > and 3; total number of times synchronization done was 15874, 20205 and > > > 23414 respectively. Note: this is not on the same machine where we > > > captured lsn-gap data, it is on a little less efficient machine but > > > gives almost the same picture > > > > > > Next we are planning to capture this data for a lesser number of slots > > > like 10,30,50 etc. It may happen that the benefit of multi-workers > > > over single workers in such cases could be less, but let's have the > > > data to verify that. > > > > > > > Thanks a lot for those numbers and for the testing! > > > > Do you think it would make sense to also get the number of using > > the pg_failover_slots module? (and compare the pg_failover_slots numbers with the > > "one worker" case here). Idea is to check if the patch does introduce > > some overhead as compare to pg_failover_slots. > > > > Yes, definitely. We will work on that and share the numbers soon. > We are working on these tests. Meanwhile attaching the patches which attempt to implement below functionalities: 1) Remove extra slots on standby if those no longer exist on primary or are no longer part of synchronize_slot_names. 2) Make synchronize_slot_names user-modifiable. And due to change in 'synchronize_slot_names', if dbids list is reduced, then take care of removal of extra/old db-ids (if any) from workers db-list. Thanks Ajin for working on 1. Both the above changes are in patch-0002. There is a test failure in the recovery module due to these new changes, I am looking into it and will fix it in the next version. 
Improvements in pipeline:
a) Standby slots should not be consumable.
b) Optimization of the query which the standby sends to the primary. Currently it has a dbid filter and a slot-name filter. The slot-name filter can be optimized to carry only the slots belonging to the DBs assigned to the worker, rather than all of 'synchronize_slot_names'.
c) Analyze whether the naptime of the slot-sync worker can be auto-tuned. If there is no activity (i.e. slots are not advancing on the primary), increase the naptime of the slot-sync worker on the standby and decrease it again when activity resumes.

thanks
Shveta
On Thu, Aug 17, 2023 at 11:55 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Aug 17, 2023 at 11:44 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On 8/14/23 11:52 AM, shveta malik wrote: > > > > > > > > We (myself and Ajin) performed the tests to compute the lag in standby > > > slots as compared to primary slots with different number of slot-sync > > > workers configured. > > > > > > > Thanks! > > > > > 3 DBs were created, each with 30 tables and each table having one > > > logical-pub/sub configured. So this made a total of 90 logical > > > replication slots to be synced. Then the workload was run for aprox 10 > > > mins. During this workload, at regular intervals, primary and standby > > > slots' lsns were captured (from pg_replication_slots) and compared. At > > > each capture, the intent was to know how much is each standby's slot > > > lagging behind corresponding primary's slot by taking the distance > > > between confirmed_flush_lsn of primary and standby slot. Then we took > > > the average (integer value) of this distance over the span of 10 min > > > workload > > > > Thanks for the explanations, make sense to me. > > > > > and this is what we got: > > > > > > With max_slot_sync_workers=1, average-lag = 42290.3563 > > > With max_slot_sync_workers=2, average-lag = 24585.1421 > > > With max_slot_sync_workers=3, average-lag = 14964.9215 > > > > > > This shows that more workers have better chances to keep logical > > > replication slots in sync for this case. > > > > > > > Agree. > > > > > Another statistics if it interests you is, we ran a frequency test as > > > well (this by changing code, unit test sort of) to figure out the > > > 'total number of times synchronization done' with different number of > > > sync-slots workers configured. Same 3 DBs setup with each DB having 30 > > > logical replication slots. With 'max_slot_sync_workers' set at 1, 2 > > > and 3; total number of times synchronization done was 15874, 20205 and > > > 23414 respectively. Note: this is not on the same machine where we > > > captured lsn-gap data, it is on a little less efficient machine but > > > gives almost the same picture > > > > > > Next we are planning to capture this data for a lesser number of slots > > > like 10,30,50 etc. It may happen that the benefit of multi-workers > > > over single workers in such cases could be less, but let's have the > > > data to verify that. > > > > > > > Thanks a lot for those numbers and for the testing! > > > > Do you think it would make sense to also get the number of using > > the pg_failover_slots module? (and compare the pg_failover_slots numbers with the > > "one worker" case here). Idea is to check if the patch does introduce > > some overhead as compare to pg_failover_slots. > > > > Yes, definitely. We will work on that and share the numbers soon. > Here are the numbers for pg_failover_extension. Thank You Ajin for performing all the tests and providing the data offline. 
--------------------------------------
pg_failover_slots extension:
--------------------------------------
40 slots:
  default nap (60 sec): 12742133.96
  10ms nap:             19984.34
90 slots:
  default nap (60 sec): 10063342.72
  10ms nap:             34483.82

--------------------------------------
slot-sync-workers case (default 10ms nap for each test):
--------------------------------------
40 slots:
  1 worker:  20566.09
  3 workers: 7885.80
90 slots:
  1 worker:  36706.84
  3 workers: 10236.63

Observations:
1) The worker=1 case is slightly behind pg_failover_slots in our implementation (for the same naptime of 10ms). This is due to the support for the multi-worker design, where locks and DSM come into play. I will review this case for optimization.
2) The multi-worker case is clearly better in all tests.

A few points we observed while testing pg_failover_slots:
1) It has a naptime of 60 sec, which is on the higher side, so we see a huge lag in slots being synchronized; please see the default-nap readings above. The extension's default data is therefore not comparable to our default case, and for an apples-to-apples comparison we changed its naptime to 10ms.
2) It takes a long time to create slots. Every slot creation needs workload to be run on the primary, i.e. if after, say, the 4th slot creation there is no activity on the primary, it waits and does not proceed to create the rest of the slots; we had to make sure some activity ran on the primary in parallel with each slot creation on the standby. This happens because after each slot creation it checks whether 'remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn' and, if so, waits for the primary to catch up. The restart_lsn of a newly created slot is set at the XLOG-replay position, and when the standby is up to date in terms of data (i.e. all WAL streams are received and replayed) and no activity is going on on the primary, the restart_lsn on the standby for a newly created slot is the same as the confirmed_flush_lsn of that slot on the primary. So in order to proceed, it needs the restart_lsn on the primary to move forward. Does it make more sense to have a check which compares the confirmed_flush of the primary with the restart_lsn on the standby, i.e. wait for the primary to catch up only if 'remote_slot->confirmed_flush < MyReplicationSlot->data.restart_lsn'? That would mean we wait only if more operations were performed on the primary and the WAL has been received and replayed on the standby, while the slot on the primary has still not advanced, so we need to give the primary time to catch up.

thanks
Shveta
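To make the proposed check concrete, here is a sketch of the relaxed wait condition; the names follow the pg_failover_slots snippet quoted above ('remote_slot', 'MyReplicationSlot'), and the enclosing wait/retry loop is assumed rather than shown:

/*
 * Current behaviour (as quoted above): wait whenever
 *     remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn
 *
 * Proposed relaxation: wait only while the primary's confirmed_flush is
 * still behind the restart_lsn of the freshly created local slot.
 */
if (remote_slot->confirmed_flush < MyReplicationSlot->data.restart_lsn)
{
    /* primary slot has not advanced far enough yet; wait and retry */
}
else
{
    /* safe to continue with the newly created slot */
}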
On Thu, Aug 17, 2023 at 4:09 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Aug 17, 2023 at 11:55 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Thu, Aug 17, 2023 at 11:44 AM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > Hi, > > > > > > On 8/14/23 11:52 AM, shveta malik wrote: > > > > > > > > > > > We (myself and Ajin) performed the tests to compute the lag in standby > > > > slots as compared to primary slots with different number of slot-sync > > > > workers configured. > > > > > > > > > > Thanks! > > > > > > > 3 DBs were created, each with 30 tables and each table having one > > > > logical-pub/sub configured. So this made a total of 90 logical > > > > replication slots to be synced. Then the workload was run for aprox 10 > > > > mins. During this workload, at regular intervals, primary and standby > > > > slots' lsns were captured (from pg_replication_slots) and compared. At > > > > each capture, the intent was to know how much is each standby's slot > > > > lagging behind corresponding primary's slot by taking the distance > > > > between confirmed_flush_lsn of primary and standby slot. Then we took > > > > the average (integer value) of this distance over the span of 10 min > > > > workload > > > > > > Thanks for the explanations, make sense to me. > > > > > > > and this is what we got: > > > > > > > > With max_slot_sync_workers=1, average-lag = 42290.3563 > > > > With max_slot_sync_workers=2, average-lag = 24585.1421 > > > > With max_slot_sync_workers=3, average-lag = 14964.9215 > > > > > > > > This shows that more workers have better chances to keep logical > > > > replication slots in sync for this case. > > > > > > > > > > Agree. > > > > > > > Another statistics if it interests you is, we ran a frequency test as > > > > well (this by changing code, unit test sort of) to figure out the > > > > 'total number of times synchronization done' with different number of > > > > sync-slots workers configured. Same 3 DBs setup with each DB having 30 > > > > logical replication slots. With 'max_slot_sync_workers' set at 1, 2 > > > > and 3; total number of times synchronization done was 15874, 20205 and > > > > 23414 respectively. Note: this is not on the same machine where we > > > > captured lsn-gap data, it is on a little less efficient machine but > > > > gives almost the same picture > > > > > > > > Next we are planning to capture this data for a lesser number of slots > > > > like 10,30,50 etc. It may happen that the benefit of multi-workers > > > > over single workers in such cases could be less, but let's have the > > > > data to verify that. > > > > > > > > > > Thanks a lot for those numbers and for the testing! > > > > > > Do you think it would make sense to also get the number of using > > > the pg_failover_slots module? (and compare the pg_failover_slots numbers with the > > > "one worker" case here). Idea is to check if the patch does introduce > > > some overhead as compare to pg_failover_slots. > > > > > > > Yes, definitely. We will work on that and share the numbers soon. > > > > We are working on these tests. Meanwhile attaching the patches which > attempt to implement below functionalities: > > 1) Remove extra slots on standby if those no longer exist on primary > or are no longer part of synchronize_slot_names. > 2) Make synchronize_slot_names user-modifiable. And due to change in > 'synchronize_slot_names', if dbids list is reduced, then take care of > removal of extra/old db-ids (if any) from workers db-list. 
> > Thanks Ajin for working on 1. Both the above changes are in > patch-0002. There is a test failure in the recovery module due to > these new changes, I am looking into it and will fix it in the next > version. > > Improvements in pipeline: > a) standby slots should not be consumable. > b) optimization of query which standby sends to primary. Currently it > has dbid filter and slot-name filter. Slot-name filter can be > optimized to have only those slots which belong to DBs assigned to the > worker rather than having all 'synchronize_slot_names'. > c) analyze if the naptime of the slot-sync worker can be auto-tuned. > If there is no activity going on (i.e. slots are not advancing on > primary) then increase naptime of slot-sync worker on standby and > decrease it again when activity starts. > Please find the patches attached. 0002 has below changes: 1) The naptime of the worker is now tuned as per the activity on primary. Each worker starts with a naptime of 10ms and if no activity is observed on primary for some time, then naptime is increased to 10sec. And if activity is observed again, naptime is reduced back to 10ms. Each worker does it by choosing one slot (first one assigned to it) for monitoring purposes. If there is no change in lsn of that slot for say over 5 sync-checks, naptime is increased to 10sec and as soon as a change is observed, naptime is reduced back to 10ms. 2) The query sent by standby to primary to get slot info is written better. The query has filters : where DBID in (...) and slot_name in (..). Earlier the slot_name filter was carrying all the names mentioned in synchronize_slot_names (if it is not '*'). Now it mentions only the ones belonging to its own dbids except during the first run of the query. First run of the query is different since we are getting this info ('which slot belongs to which db') from standby only, thus the query will have all slots-names of 'synchronize_slot_names ' until slots are created on standby. This one-time longer query seems better over pinging primary to get this info. Changes to be done/analysed next: 1) find a way to distinguish between user created logical slots and synced ones. This is needed for below purposes: a) Avoid dropping user created slots by slot-sync worker. b) Unlike the user-created slots, synced slots should not be consumable. 2) Handling below corner scenarios: a) if a worker is exiting due to change in sync_slot_names which made dbids of that worker no longer valid, then that worker may leave behind some slots which should otherwise be dropped. b) if a worker is connected to a dbid and that dbid no longer exists. 3) Analyze if there is any interference with 'minimal logical decoding on standby' feature. thanks Shveta
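A minimal sketch of the naptime tuning described in point 1 above, assuming the monitor fields (confirmed_lsn, inactivity_count) of the patch's SlotSyncWorkerWatchSlot struct; the constants, the helper name and the exact thresholds are illustrative, not the patch's code:

#define SLOTSYNC_NAPTIME_MIN_MS        10
#define SLOTSYNC_NAPTIME_MAX_MS        10000
#define SLOTSYNC_INACTIVITY_THRESHOLD  5

static long
slotsync_adjust_naptime(SlotSyncWorkerWatchSlot *monitor,
                        XLogRecPtr latest_lsn, long naptime_ms)
{
    if (latest_lsn == monitor->confirmed_lsn)
    {
        /* the watched slot did not move in this cycle */
        if (++monitor->inactivity_count >= SLOTSYNC_INACTIVITY_THRESHOLD)
            naptime_ms = SLOTSYNC_NAPTIME_MAX_MS;
    }
    else
    {
        /* activity resumed: remember the new position and speed up again */
        monitor->confirmed_lsn = latest_lsn;
        monitor->inactivity_count = 0;
        naptime_ms = SLOTSYNC_NAPTIME_MIN_MS;
    }

    return naptime_ms;
}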
On Wed, Aug 23, 2023 at 3:38 PM shveta malik <shveta.malik@gmail.com> wrote: > I have reviewed the v12-0002 patch and I have some comments. I see the latest version posted sometime back and if any of this comment is already fixed in this version then feel free to ignore that. In general, code still needs a lot more comments to make it readable and in some places, code formatting is also not as per PG standard so that needs to be improved. There are some other specific comments as listed below 1. @@ -925,7 +936,7 @@ ApplyLauncherRegister(void) memset(&bgw, 0, sizeof(bgw)); bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION; - bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; + bgw.bgw_start_time = BgWorkerStart_ConsistentState; What is the reason for this change, can you add some comments? 2. ApplyLauncherShmemInit(void) { bool found; + bool foundSlotSync; Is there any specific reason to launch the sync worker from the logical launcher instead of making this independent? I mean in the future if we plan to sync physical slots as well then it wouldn't be an expandable design. 3. + /* + * Remember the old dbids before we stop and cleanup this worker + * as these will be needed in order to relaunch the worker. + */ + copied_dbcnt = worker->dbcount; + copied_dbids = (Oid *)palloc0(worker->dbcount * sizeof(Oid)); + + for (i = 0; i < worker->dbcount; i++) + copied_dbids[i] = worker->dbids[i]; probably we can just memcpy the memory containing the dbids. 4. + /* + * If worker is being reused, and there is vacancy in dbids array, + * just update dbids array and dbcount and we are done. + * But if dbids array is exhausted, stop the worker, reallocate + * dbids in dsm, relaunch the worker with same set of dbs as earlier + * plus the new db. + */ Why do we need to relaunch the worker, can't we just use dsa_pointer to expand the shared memory whenever required? 5. +static bool +WaitForSlotSyncWorkerAttach(SlotSyncWorker *worker, + uint16 generation, + BackgroundWorkerHandle *handle) this function is an exact duplicate of WaitForReplicationWorkerAttach except for the LWlock, why don't we use the same function by passing the LWLock as a parameter 6. +/* + * Attach Slot-sync worker to worker-slot assigned by launcher. + */ +void +slotsync_worker_attach(int slot) this is also very similar to the logicalrep_worker_attach function. Please check other similar functions and reuse them wherever possible Also, why this function is not registering the cleanup function on shmmem_exit? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
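For point 3 above, a minimal sketch of the memcpy-based copy being suggested, reusing the variable names from the quoted code (palloc0 is no longer needed since memcpy overwrites the whole array):

copied_dbcnt = worker->dbcount;
copied_dbids = (Oid *) palloc(worker->dbcount * sizeof(Oid));
memcpy(copied_dbids, worker->dbids, worker->dbcount * sizeof(Oid));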
On Wed, Aug 23, 2023 at 4:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 23, 2023 at 3:38 PM shveta malik <shveta.malik@gmail.com> wrote: > > > I have reviewed the v12-0002 patch and I have some comments. I see the > latest version posted sometime back and if any of this comment is > already fixed in this version then feel free to ignore that. > Thanks for the feedback. Please find my comments on a few. I will work on rest. > 2. > ApplyLauncherShmemInit(void) > { > bool found; > + bool foundSlotSync; > > Is there any specific reason to launch the sync worker from the > logical launcher instead of making this independent? > I mean in the future if we plan to sync physical slots as well then it > wouldn't be an expandable design. When we started working on this, it was reusing logical-apply worker infra, so I separated it from logical-apply worker but let it be managed by a replication launcher considering that only logical slots needed to be synced. I think this needs more thought and I would like to know from others as well before concluding anything here. > 5. > > +static bool > +WaitForSlotSyncWorkerAttach(SlotSyncWorker *worker, > + uint16 generation, > + BackgroundWorkerHandle *handle) > > this function is an exact duplicate of WaitForReplicationWorkerAttach > except for the LWlock, why don't we use the same function by passing > the LWLock as a parameter > Here workers (first argument) are different. We can always pass LWLock, but since workers are different, in order to merge the common functionality, we need to have some common worker structure between the two workers (apply and sync-slot) and pass that to functions which need to be merged (similar to NodeTag used in Insert/CreateStmt etc). But changing LogicalRepWorker() would mean changing applyworker/table-sync worker/parallel-apply-worker files. Since there are only two such functions which you pointed out (attach and wait_for_attach), I prefered to keep the functions as is until we conclude on where slot-sync worker functionality actually fits in. I can revisit these comments then. Or if you see any better way to do it, kindly let me know. > 6. > +/* > + * Attach Slot-sync worker to worker-slot assigned by launcher. > + */ > +void > +slotsync_worker_attach(int slot) > > this is also very similar to the logicalrep_worker_attach function. > > Please check other similar functions and reuse them wherever possible > > Also, why this function is not registering the cleanup function on shmmem_exit? > It is doing it in ReplSlotSyncMain() since we have dsm-seg there. Please see this: /* Primary initialization is complete. Now, attach to our slot. */ slotsync_worker_attach(worker_slot); before_shmem_exit(slotsync_worker_detach, PointerGetDatum(seg)); thanks Shveta
Wait a minute ...

From bac0fbef8b203c530e5117b0b7cfee13cfab78b9 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 22 Jul 2023 10:17:48 +0000
Subject: [PATCH v13 1/2] Allow logical walsenders to wait for physical standbys

@@ -2498,6 +2500,13 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	}
 	else
 	{
+		/*
+		 * Before we send out the last set of changes to logical decoding
+		 * output plugin, wait for specified streaming replication standby
+		 * servers (if any) to confirm receipt of WAL upto commit_lsn.
+		 */
+		WaitForStandbyLSN(commit_lsn);

OK, so we call this new function frequently enough -- once per transaction, if I read this correctly?  So ...

+void
+WaitForStandbyLSN(XLogRecPtr wait_for_lsn)
+{
...
+	/* "*" means all logical walsenders should wait for physical standbys. */
+	if (strcmp(synchronize_slot_names, "*") != 0)
+	{
+		bool shouldwait = false;
+
+		rawname = pstrdup(synchronize_slot_names);
+		SplitIdentifierString(rawname, ',', &elemlist);
+
+		foreach (l, elemlist)
+		{
+			char *name = lfirst(l);
+			if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0)
+			{
+				shouldwait = true;
+				break;
+			}
+		}
+
+		pfree(rawname);
+		rawname = NULL;
+		list_free(elemlist);
+		elemlist = NIL;
+
+		if (!shouldwait)
+			return;
+	}
+
+	rawname = pstrdup(standby_slot_names);
+	SplitIdentifierString(rawname, ',', &elemlist);

... do we really want to be doing the GUC string parsing every time through it?  This sounds like it could be a bottleneck, or at least slow things down.  Maybe we should think about caching this somehow.

--
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No renuncies a nada. No te aferres a nada."
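One possible shape for the caching being suggested here, as a sketch only: split the GUC once (at startup and whenever the value changes) into a backend-local list, and let WaitForStandbyLSN() walk the cached list instead of calling SplitIdentifierString() for every transaction. The helper and the cached-list variable below are hypothetical names, not the patch's code:

static List *cached_standby_slot_names = NIL;

static void
rebuild_standby_slot_names_cache(void)
{
    MemoryContext oldcxt = MemoryContextSwitchTo(TopMemoryContext);
    char       *rawname = pstrdup(standby_slot_names);
    List       *namelist = NIL;
    ListCell   *lc;

    list_free_deep(cached_standby_slot_names);
    cached_standby_slot_names = NIL;

    /* syntax was already validated when the GUC value was set */
    (void) SplitIdentifierString(rawname, ',', &namelist);

    /* copy the names, since namelist points into rawname */
    foreach(lc, namelist)
        cached_standby_slot_names = lappend(cached_standby_slot_names,
                                            pstrdup((char *) lfirst(lc)));

    pfree(rawname);
    list_free(namelist);
    MemoryContextSwitchTo(oldcxt);
}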
On Fri, Aug 25, 2023 at 11:09 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Wed, Aug 23, 2023 at 4:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Wed, Aug 23, 2023 at 3:38 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > I have reviewed the v12-0002 patch and I have some comments. I see the > > latest version posted sometime back and if any of this comment is > > already fixed in this version then feel free to ignore that. > > > > Thanks for the feedback. Please find my comments on a few. I will work on rest. > > > > 2. > > ApplyLauncherShmemInit(void) > > { > > bool found; > > + bool foundSlotSync; > > > > Is there any specific reason to launch the sync worker from the > > logical launcher instead of making this independent? > > I mean in the future if we plan to sync physical slots as well then it > > wouldn't be an expandable design. > > When we started working on this, it was reusing logical-apply worker > infra, so I separated it from logical-apply worker but let it be > managed by a replication launcher considering that only logical slots > needed to be synced. I think this needs more thought and I would like > to know from others as well before concluding anything here. > > > > 5. > > > > +static bool > > +WaitForSlotSyncWorkerAttach(SlotSyncWorker *worker, > > + uint16 generation, > > + BackgroundWorkerHandle *handle) > > > > this function is an exact duplicate of WaitForReplicationWorkerAttach > > except for the LWlock, why don't we use the same function by passing > > the LWLock as a parameter > > > > Here workers (first argument) are different. We can always pass > LWLock, but since workers are different, in order to merge the common > functionality, we need to have some common worker structure between > the two workers (apply and sync-slot) and pass that to functions which > need to be merged (similar to NodeTag used in Insert/CreateStmt etc). > But changing LogicalRepWorker() would mean changing > applyworker/table-sync worker/parallel-apply-worker files. Since there > are only two such functions which you pointed out (attach and > wait_for_attach), I prefered to keep the functions as is until we > conclude on where slot-sync worker functionality actually fits in. I > can revisit these comments then. Or if you see any better way to do > it, kindly let me know. > > > 6. > > +/* > > + * Attach Slot-sync worker to worker-slot assigned by launcher. > > + */ > > +void > > +slotsync_worker_attach(int slot) > > > > this is also very similar to the logicalrep_worker_attach function. > > > > Please check other similar functions and reuse them wherever possible > > > > Also, why this function is not registering the cleanup function on shmmem_exit? > > > > It is doing it in ReplSlotSyncMain() since we have dsm-seg there. > Please see this: > > /* Primary initialization is complete. Now, attach to our slot. */ > slotsync_worker_attach(worker_slot); > before_shmem_exit(slotsync_worker_detach, PointerGetDatum(seg)); > PFA new patch-set which attempts to fix these: a) Synced slots on standby are not consumable i.e. pg_logical_slot_get/peek_changes will give error on these while will work on user-created slots. b) User created slots on standby will not be dropped by slot-sync workers anymore. Earlier slot-sync worker was dropping all the slots which were not part of synchronize_slot_names. c) Now DSA is being used for dbids to facilitate memory extension if required without needing to restart the worker. 
Earlier, DSM alone was used, which required a restart of the worker whenever the allocated memory needed to be extended. These changes are in patch 0002.

Next in pipeline:
1. Handling of the corner scenarios I explained in: https://www.postgresql.org/message-id/CAJpy0uC%2B2agRtF3H%3Dn-hW5JkoPfaZkjPXJr%3D%3Dy3_PRE04dQvhw%40mail.gmail.com
2. Revisiting the comments (older ones in this thread as well as the latest ones) for patch 1.

thanks
Shveta
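A sketch of how the dbids array can now grow in place using the DSA API (dsa_allocate/dsa_get_address/dsa_free are the real API; the helper, the capacity argument and the fixed growth increment of 100 are illustrative, and locking is omitted):

static void
slotsync_add_dbid(SlotSyncWorker *worker, Oid dbid, uint32 capacity)
{
    if (worker->dbcount == capacity)
    {
        /* allocate a bigger chunk and copy the existing dbids over */
        dsa_pointer newdp = dsa_allocate(worker->dbids_dsa,
                                         (capacity + 100) * sizeof(Oid));
        Oid        *newids = (Oid *) dsa_get_address(worker->dbids_dsa, newdp);
        Oid        *oldids = (Oid *) dsa_get_address(worker->dbids_dsa,
                                                     worker->dbids_dp);

        memcpy(newids, oldids, worker->dbcount * sizeof(Oid));
        dsa_free(worker->dbids_dsa, worker->dbids_dp);
        worker->dbids_dp = newdp;
    }

    /* append the new database; the worker picks it up on its next cycle */
    ((Oid *) dsa_get_address(worker->dbids_dsa, worker->dbids_dp))[worker->dbcount++] = dbid;
}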
On Wed, Aug 23, 2023 at 4:21 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 23, 2023 at 3:38 PM shveta malik <shveta.malik@gmail.com> wrote: > > > I have reviewed the v12-0002 patch and I have some comments. I see the > latest version posted sometime back and if any of this comment is > already fixed in this version then feel free to ignore that. > > In general, code still needs a lot more comments to make it readable > and in some places, code formatting is also not as per PG standard so > that needs to be improved. > There are some other specific comments as listed below > Please see the latest patch-set (v14). Did some code-formatting, used pg_indent as well. Added more comments. Let me know specifically if some more comments or formatting is needed. > 1. > @@ -925,7 +936,7 @@ ApplyLauncherRegister(void) > memset(&bgw, 0, sizeof(bgw)); > bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | > BGWORKER_BACKEND_DATABASE_CONNECTION; > - bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; > + bgw.bgw_start_time = BgWorkerStart_ConsistentState; > > What is the reason for this change, can you add some comments? Sure, done. > > 2. > ApplyLauncherShmemInit(void) > { > bool found; > + bool foundSlotSync; > > Is there any specific reason to launch the sync worker from the > logical launcher instead of making this independent? > I mean in the future if we plan to sync physical slots as well then it > wouldn't be an expandable design. > > 3. > + /* > + * Remember the old dbids before we stop and cleanup this worker > + * as these will be needed in order to relaunch the worker. > + */ > + copied_dbcnt = worker->dbcount; > + copied_dbids = (Oid *)palloc0(worker->dbcount * sizeof(Oid)); > + > + for (i = 0; i < worker->dbcount; i++) > + copied_dbids[i] = worker->dbids[i]; > > probably we can just memcpy the memory containing the dbids. Done. > > 4. > + /* > + * If worker is being reused, and there is vacancy in dbids array, > + * just update dbids array and dbcount and we are done. > + * But if dbids array is exhausted, stop the worker, reallocate > + * dbids in dsm, relaunch the worker with same set of dbs as earlier > + * plus the new db. > + */ > > Why do we need to relaunch the worker, can't we just use dsa_pointer > to expand the shared memory whenever required? > Done. > 5. > > +static bool > +WaitForSlotSyncWorkerAttach(SlotSyncWorker *worker, > + uint16 generation, > + BackgroundWorkerHandle *handle) > > this function is an exact duplicate of WaitForReplicationWorkerAttach > except for the LWlock, why don't we use the same function by passing > the LWLock as a parameter > > 6. > +/* > + * Attach Slot-sync worker to worker-slot assigned by launcher. > + */ > +void > +slotsync_worker_attach(int slot) > > this is also very similar to the logicalrep_worker_attach function. > > Please check other similar functions and reuse them wherever possible > Will revisit these as stated in [1]. [1]: https://www.postgresql.org/message-id/CAJpy0uDeWjJj%2BU8nn%2BHbnGWkfY%2Bn-Bbw_kuHqgphETJ1Lucy%2BQ%40mail.gmail.com > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com
Hi Shveta. Here are some comments for patch v14-0002 The patch is large, so my code review is a WIP... more later next week... ====== GENERAL 1. Patch size The patch is 2700 lines. Is it possible to break this up into smaller self-contained parts to make the reviews more manageable? ~~~ 2. PG Docs I guess there are missing PG Docs for this patch. E.g there are new GUCs added but I see no documentation yet for them. ~ 3. Terms There are variations of what to call the sync worker - "Slot sync worker" or - "slot-sync worker" or - "slot synchronization worker" or - "slot-synchronization worker" - and others These are all in the comments and messages etc. Better to search/replace to make a consistent term everywhere. FWIW, I preferred just to call it "slot-sync worker". ~ 4. typedefs I think multiple new typedefs are added by this patch. IIUC, those should be included in the file typedef.list so the pg_indent will work properly. 5. max_slot_sync_workers GUC There is already a 'max_sync_workers_per_subscription', but that one is for "tablesync" workers. IMO it is potentially confusing now that both these GUCs have 'sync_workers' in the name. I think it would be less ambiguous to change your new GUC to 'max_slotsync_workers'. ====== Commit Message 6. Overview? I felt the information in this commit message is describing details of what changes are in this patch but there is no synopsis about the *purpose* of this patch as a whole. Eg. What is it for? It seemed like there should be some introductory paragraph up-front before describing all the specifics. ~~~ 7. For slots to be synchronised, another GUC is added: synchronize_slot_names: This is a runtime modifiable GUC. ~ If this is added by this patch then how come there is some SGML describing the same GUC in patch 14-0001? What is the relationship? ~~~ 8. Let us say slots mentioned in 'synchronize_slot_names' on primary belongs to 10 DBs and say the new GUC is set at default value of 2, then each worker will manage 5 dbs and will keep on synching the slots for them. ~ /the new GUC is set at default value of 2/'max_slot_sync_workers' is 2/ ~~~ 9. If a new DB is found by replication launcher, it will assign this new db to the worker handling the minimum number of dbs currently (or first worker in case of equal count) ~ Hmm. Isn't this only describing cases where max_slot_workers was exceeded? Otherwise, you should just launch a brand new sync-worker, right? ~~~ 10. Each worker slot will have its own dbids list. ~ It seems confusing to say "worker slot" when already talking about workers and slots. Can you reword that more like "Each slot-sync worker will have its own dbids list"? ====== src/backend/postmaster/bgworker.c ====== .../libpqwalreceiver/libpqwalreceiver.c 11. libpqrcv_list_db_for_logical_slots +/* + * List DB for logical slots + * + * It gets the list of unique DBIDs for logical slots mentioned in slot_names + * from primary. + */ +static List * +libpqrcv_list_db_for_logical_slots(WalReceiverConn *conn, Comment needs some minor tweaking. ~~~ 12. + if (strcmp(slot_names, "") != 0 && strcmp(slot_names, "*") != 0) + { + char *rawname; + List *namelist; + ListCell *lc; + + appendStringInfoChar(&s, ' '); + rawname = pstrdup(slot_names); + SplitIdentifierString(rawname, ',', &namelist); + foreach (lc, namelist) + { + if (lc != list_head(namelist)) + appendStringInfoChar(&s, ','); + appendStringInfo(&s, "%s", + quote_identifier(lfirst(lc))); + } + } /rawname/rawnames/ ~~~ 13. 
+ if (PQresultStatus(res) != PGRES_TUPLES_OK) + { + PQclear(res); + ereport(ERROR, + (errmsg("could not receive list of slots the primary server: %s", + pchomp(PQerrorMessage(conn->streamConn))))); + } /the primary server/from the primary server/ ~~~ 14. + if (PQnfields(res) < 1) + { + int nfields = PQnfields(res); + + PQclear(res); + ereport(ERROR, + (errmsg("invalid response from primary server"), + errdetail("Could not get list of slots: got %d fields, " + "expected %d or more fields.", + nfields, 1))); + } This code seems over-complicated. If it is < 1 then it can only be zero, right? So then what is the point of calculating and displaying the 'nfields' which can only be 0? ~~~ 15. + ntuples = PQntuples(res); + for (int i = 0; i < ntuples; i++) + { + + slot_data = palloc0(sizeof(WalRecvReplicationSlotDbData)); + if (!PQgetisnull(res, i, 0)) + slot_data->database = atooid(PQgetvalue(res, i, 0)); + + slot_data->last_sync_time = 0; + slotlist = lappend(slotlist, slot_data); + } 15a. Unnecessary blank line in for-block. ~~~ 15b. Unnecessary assignment to 'last_sync_time' because the whole structure was palloc0 just 2 lines above. ====== src/backend/replication/logical/Makefile ====== src/backend/replication/logical/launcher.c 16. +/* + * Initial and incremental allocation size for dbids array for each + * SlotSyncWorker in dynamic shared memory i.e. we start with this size + * and once it is exhausted, dbids is rellocated with size incremented + * by ALLOC_DB_PER_WORKER + */ +#define ALLOC_DB_PER_WORKER 100 I felt it might be simpler to just separate these values instead of having to describe how you make use of the same constant for 2 meanings For example, #define DB_PER_WORKER_ALLOC_INIT 100 #define DB_PER_WORKER_ALLOC_EXTRA 100 ~~~ 17. static TimestampTz ApplyLauncherGetWorkerStartTime(Oid subid); - /* Unnecessary whitespace change. ~~~ 18. ApplyLauncherShmemInit bool found; + bool foundSlotSync; I think it is simpler to just use the same 'found' variable again. ~~ 19. ApplyLauncherShmemInit + /* Allocate shared-memory for slot-sync workers pool now */ + LogicalRepCtx->ss_workers = (SlotSyncWorker *) + ShmemInitStruct("Replication slot synchronization workers", + mul_size(max_slot_sync_workers, sizeof(SlotSyncWorker)), + &foundSlotSync); + + if (!foundSlotSync) + { + int slot; + + for (slot = 0; slot < max_slot_sync_workers; slot++) + { + SlotSyncWorker *worker = &LogicalRepCtx->ss_workers[slot]; + + memset(worker, 0, sizeof(SlotSyncWorker)); + } + } Why is the memset in a loop? Can't we just zap the whole ss_workers array in one go using that same mul_size Size? SUGGESTION Size ssw_size = mul_size(max_slot_sync_workers, sizeof(SlotSyncWorker)); if (!found) memset(LogicalRepCtx->ss_workers, 0, ssw_size); ====== .../replication/logical/logicalfuncs.c ====== src/backend/replication/logical/meson.build ====== src/backend/replication/logical/slotsync.c ====== src/backend/replication/logical/tablesync.c ====== src/backend/replication/repl_gram.y ====== src/backend/replication/repl_scanner.l ====== src/backend/replication/slot.c ====== src/backend/replication/walsender.c ====== src/backend/storage/lmgr/lwlocknames.txt ====== .../utils/activity/wait_event_names.txt ====== src/backend/utils/misc/guc_tables.c 20. 
+ { + {"max_slot_sync_workers", + PGC_SIGHUP, + REPLICATION_STANDBY, + gettext_noop("Maximum number of slots synchronization workers " + "on a standby."), + NULL, + }, + &max_slot_sync_workers, + 2, 0, MAX_SLOT_SYNC_WORKER_LIMIT, + NULL, NULL, NULL + }, + /slots synchronization/slot synchronization/ OR /slots synchronization/slot-sync/ ====== src/backend/utils/misc/postgresql.conf.sample 21. +#max_slot_sync_workers = 2 # max number of slot synchronization workers Should this comment match the guc_tables.c text. E.g should it say "... on a standby" ====== src/include/commands/subscriptioncmds.h ====== src/include/nodes/replnodes.h ====== src/include/postmaster/bgworker_internals.h ====== src/include/replication/logicallauncher.h ====== src/include/replication/logicalworker.h ====== src/include/replication/slot.h 22. + + /* + * Is standby synced slot? + */ + bool synced; } ReplicationSlotPersistentData; Comment is unclear: - does it mean "has this primary slot been synsc to standby" ? - does it mean "this is a slot created by a sync-slot worker"? - something else? ====== src/include/replication/walreceiver.h 23. +/* + * Slot's DBids receiver from remote. + */ +typedef struct WalRecvReplicationSlotDbData +{ + Oid database; + TimestampTz last_sync_time; +} WalRecvReplicationSlotDbData; + Is that comment correct? Or should it be more like "The slot's DBid received from remote.". Anyway, that comment seems more for the 'database' field only, not a structure-level comment. ~~~ 24. walrcv_get_conninfo_fn walrcv_get_conninfo; walrcv_get_senderinfo_fn walrcv_get_senderinfo; walrcv_identify_system_fn walrcv_identify_system; + walrcv_list_db_for_logical_slots_fn walrcv_list_db_for_logical_slots; walrcv_server_version_fn walrcv_server_version; walrcv_readtimelinehistoryfile_fn walrcv_readtimelinehistoryfile; walrcv_startstreaming_fn walrcv_startstreaming; This function name doesn't seem consistent with the existing names. Something like 'walrcv_get_dbinfo_for_logical_slots_fn' might be better? ====== src/include/replication/worker_internal.h 25. +typedef struct SlotSyncWorkerWatchSlot +{ + NameData slot_name; + XLogRecPtr confirmed_lsn; + int inactivity_count; +} SlotSyncWorkerWatchSlot; I did not find any reference to this typedef except in the following struct for SlotSyncWorker. So why not just make this a nested structure within 'SlotSyncWorker' instead? ~~~ 26. +typedef struct SlotSyncWorker +{ + /* Time at which this worker was launched. */ + TimestampTz launch_time; + + /* Indicates if this slot is used or free. */ + bool in_use; + + /* The slot in worker pool to which it is attached */ + int slot; + + /* Increased every time the slot is taken by new worker. */ + uint16 generation; + + /* Pointer to proc array. NULL if not running. */ + PGPROC *proc; + + /* User to use for connection (will be same as owner of subscription). */ + Oid userid; + + /* Database id to connect to. */ + Oid dbid; + + /* Count of Database ids it manages */ + uint32 dbcount; + + /* DSA for dbids */ + dsa_area *dbids_dsa; + + /* dsa_pointer for database ids it manages */ + dsa_pointer dbids_dp; + + /* Mutex to access dbids in dsa */ + slock_t mutex; + + /* Info about slot being monitored for worker's naptime purpose */ + SlotSyncWorkerWatchSlot monitor; +} SlotSyncWorker; There seems an awful lot about this struct which is common with 'LogicalRepWorker' struct. It seems a shame not to make use of the commonality instead of all the cut/paste here. E.g. 
Can it be rearranged so all these common fields are shared: - launch_time - in_use - slot - generation - proc - userid - dbid ====== src/include/storage/lwlock.h 27. LWTRANCHE_LAUNCHER_HASH, - LWTRANCHE_FIRST_USER_DEFINED + LWTRANCHE_FIRST_USER_DEFINED, + LWTRANCHE_SLOT_SYNC_DSA } BuiltinTrancheIds; Isn't 'LWTRANCHE_FIRST_USER_DEFINED' supposed to the be last enum? ====== src/test/recovery/meson.build ====== src/test/recovery/t/051_slot_sync.pl 28. + +# Copyright (c) 2021, PostgreSQL Global Development Group + +use strict; Wrong copyright date ~~~ 29. +my $node_primary = PostgreSQL::Test::Cluster->new('primary'); +my $node_phys_standby = PostgreSQL::Test::Cluster->new('phys_standby'); +my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); 29a. Can't all the subroutines be up-front? Then this can move to be with the other node initialisation code that comets next. ~ 29b. Add a comment something like # Setup nodes ~~~ 30. +# Check conflicting status in pg_replication_slots. +sub check_slots_conflicting_status +{ + my $res = $node_phys_standby->safe_psql( + 'postgres', qq( + select bool_and(conflicting) from pg_replication_slots;)); + + is($res, 't', + "Logical slot is reported as conflicting"); +} Doesn't bool_and() mean returns false if only some but not all slots are conflicting - is that intentional?> Or is this sub-routine only expecting to test one slot, in which case maybe the SQL should include also the 'slot_name'? ~~~ 31. +$node_primary->start; +$node_primary->psql('postgres', q{SELECT pg_create_physical_replication_slot('pslot1');}); + +$node_primary->backup('backup'); + +$node_phys_standby->init_from_backup($node_primary, 'backup', has_streaming => 1); +$node_phys_standby->append_conf('postgresql.conf', q{ +synchronize_slot_names = '*' +primary_slot_name = 'pslot1' +hot_standby_feedback = off +}); +$node_phys_standby->start; + +$node_primary->safe_psql('postgres', "CREATE TABLE t1 (a int PRIMARY KEY)"); +$node_primary->safe_psql('postgres', "INSERT INTO t1 VALUES (1), (2), (3)"); The comments seem mostly to describe details about what are the expectations at each test step. IMO there also needs to be a larger "overview" comment to describe more generally *what* this is testing, and *how* it is testing it. e.g. it is hard to understand the test without being already familiar with the patch. ------ Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Aug 30, 2023 at 9:29 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> PFA new patch-set which attempts to fix these:
>

PFA v15, which implements the changes below:

1) It parses synchronize_slot_names and standby_slot_names and caches the lists to avoid repeated parsing. This parsing is done at walsender startup on the primary and at slot-sync worker startup on the standby, and then on each SIGHUP.

2) Handles slot invalidation:
2.1) If a slot is invalidated on the primary, it is now invalidated on the standby as well. The standby gets the invalidation info from the primary using a new system function 'pg_get_invalidation_cause(slotname)'.
2.2) If a slot is invalidated on the standby alone, it is dropped and recreated as per synchronize_slot_names in the next sync cycle.

3) The test file 051_slot_sync.pl is removed from patch 2 for the time being. It was testing whether the logical slot on the standby becomes conflicting once the slot on the primary is removed by 'Drop Subscription' and the WAL needed by the logical slot on the standby is flushed on the primary (with hot_standby_feedback=off). But as per the current implementation, we drop the slot on the standby as soon as the subscription is dropped on the primary, so the test case no longer serves the purpose for which it was added. A correct set of test cases will be added going forward.

4) Addresses most of the comments by Peter.

Change 1 is in patch01 along with patch02, the rest are in patch02 alone. Thank You Ajin for assisting with the above changes.

Next in the pipeline:
1) Currently it allows specifying logical slots in standby_slot_names; this should be prohibited.
2) We need to ensure that WAL is replayed on the standby before moving the slot's position to the target location received from the primary.
3) Rest of the comments upthread.

thanks
Shveta
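For item 1, the usual background-worker reload pattern would look roughly like this; ConfigReloadPending, ProcessConfigFile and WaitLatch are the standard mechanisms, while rebuild_slot_name_cache(), synchronize_slots_once() and naptime_ms are hypothetical placeholders for the patch's actual routines (shutdown handling omitted):

for (;;)
{
    if (ConfigReloadPending)
    {
        ConfigReloadPending = false;
        ProcessConfigFile(PGC_SIGHUP);
        /* re-split synchronize_slot_names / standby_slot_names here */
        rebuild_slot_name_cache();
    }

    /* one pass of slot synchronization work */
    synchronize_slots_once();

    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     naptime_ms, PG_WAIT_EXTENSION);
    ResetLatch(MyLatch);
}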
Attachment
On Fri, Sep 1, 2023 at 1:59 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Shveta. Here are some comments for patch v14-0002 > > The patch is large, so my code review is a WIP... more later next week... > Thanks Peter for the feedback. I have tried to address most of these in v15. Please find my response inline for the ones which I have not addressed. > ====== > GENERAL > > 1. Patch size > > The patch is 2700 lines. Is it possible to break this up into smaller > self-contained parts to make the reviews more manageable? > Currently, patches are created based on work done on primary and standby. Patch 001 for primary-side implementation and 002 for standby side. Let me think more on this and see if the changes can be segregated further. > > 26. > +typedef struct SlotSyncWorker > +{ > + /* Time at which this worker was launched. */ > + TimestampTz launch_time; > + > + /* Indicates if this slot is used or free. */ > + bool in_use; > + > + /* The slot in worker pool to which it is attached */ > + int slot; > + > + /* Increased every time the slot is taken by new worker. */ > + uint16 generation; > + > + /* Pointer to proc array. NULL if not running. */ > + PGPROC *proc; > + > + /* User to use for connection (will be same as owner of subscription). */ > + Oid userid; > + > + /* Database id to connect to. */ > + Oid dbid; > + > + /* Count of Database ids it manages */ > + uint32 dbcount; > + > + /* DSA for dbids */ > + dsa_area *dbids_dsa; > + > + /* dsa_pointer for database ids it manages */ > + dsa_pointer dbids_dp; > + > + /* Mutex to access dbids in dsa */ > + slock_t mutex; > + > + /* Info about slot being monitored for worker's naptime purpose */ > + SlotSyncWorkerWatchSlot monitor; > +} SlotSyncWorker; > > There seems an awful lot about this struct which is common with > 'LogicalRepWorker' struct. > > It seems a shame not to make use of the commonality instead of all the > cut/paste here. > > E.g. Can it be rearranged so all these common fields are shared: > - launch_time > - in_use > - slot > - generation > - proc > - userid > - dbid > > ====== > src/include/storage/lwlock.h Sure, I had this in mind along with previous comments where it was suggested to merge similar functions like WaitForReplicationWorkerAttach, WaitForSlotSyncWorkerAttach etc. That merging could only be possible if we try to merge the common part of these structures. This is WIP, will be addressed in the next version. > > 27. > LWTRANCHE_LAUNCHER_HASH, > - LWTRANCHE_FIRST_USER_DEFINED > + LWTRANCHE_FIRST_USER_DEFINED, > + LWTRANCHE_SLOT_SYNC_DSA > } BuiltinTrancheIds; > > Isn't 'LWTRANCHE_FIRST_USER_DEFINED' supposed to the be last enum? > > ====== > src/test/recovery/meson.build > > ====== > src/test/recovery/t/051_slot_sync.pl > I have currently removed this file from the patch. Please see my comments (pt 3) here: https://mail.google.com/mail/u/0/?ik=52d5778aba&view=om&permmsgid=msg-a:r-2984462571505788980 thanks Shveta > 28. > + > +# Copyright (c) 2021, PostgreSQL Global Development Group > + > +use strict; > > Wrong copyright date > > ~~~ > > 29. > +my $node_primary = PostgreSQL::Test::Cluster->new('primary'); > +my $node_phys_standby = PostgreSQL::Test::Cluster->new('phys_standby'); > +my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); > > 29a. > Can't all the subroutines be up-front? Then this can move to be with > the other node initialisation code that comets next. > > ~ > > 29b. > Add a comment something like # Setup nodes > > ~~~ > > 30. 
> +# Check conflicting status in pg_replication_slots. > +sub check_slots_conflicting_status > +{ > + my $res = $node_phys_standby->safe_psql( > + 'postgres', qq( > + select bool_and(conflicting) from pg_replication_slots;)); > + > + is($res, 't', > + "Logical slot is reported as conflicting"); > +} > > Doesn't bool_and() mean returns false if only some but not all slots > are conflicting - is that intentional?> Or is this sub-routine only > expecting to test one slot, in which case maybe the SQL should include > also the 'slot_name'? > > ~~~ > > 31. > +$node_primary->start; > +$node_primary->psql('postgres', q{SELECT > pg_create_physical_replication_slot('pslot1');}); > + > +$node_primary->backup('backup'); > + > +$node_phys_standby->init_from_backup($node_primary, 'backup', > has_streaming => 1); > +$node_phys_standby->append_conf('postgresql.conf', q{ > +synchronize_slot_names = '*' > +primary_slot_name = 'pslot1' > +hot_standby_feedback = off > +}); > +$node_phys_standby->start; > + > +$node_primary->safe_psql('postgres', "CREATE TABLE t1 (a int PRIMARY KEY)"); > +$node_primary->safe_psql('postgres', "INSERT INTO t1 VALUES (1), (2), (3)"); > > The comments seem mostly to describe details about what are the > expectations at each test step. > > IMO there also needs to be a larger "overview" comment to describe > more generally *what* this is testing, and *how* it is testing it. > e.g. it is hard to understand the test without being already familiar > with the patch. > > ------ > Kind Regards, > Peter Smith. > Fujitsu Australia
On Fri, Aug 25, 2023 at 2:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Wait a minute ... > > From bac0fbef8b203c530e5117b0b7cfee13cfab78b9 Mon Sep 17 00:00:00 2001 > From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> > Date: Sat, 22 Jul 2023 10:17:48 +0000 > Subject: [PATCH v13 1/2] Allow logical walsenders to wait for physical > standbys > > @@ -2498,6 +2500,13 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, > } > else > { > + /* > + * Before we send out the last set of changes to logical decoding > + * output plugin, wait for specified streaming replication standby > + * servers (if any) to confirm receipt of WAL upto commit_lsn. > + */ > + WaitForStandbyLSN(commit_lsn); > > OK, so we call this new function frequently enough -- once per > transaction, if I read this correctly? So ... > > +void > +WaitForStandbyLSN(XLogRecPtr wait_for_lsn) > +{ > ... > > + /* "*" means all logical walsenders should wait for physical standbys. */ > + if (strcmp(synchronize_slot_names, "*") != 0) > + { > + bool shouldwait = false; > + > + rawname = pstrdup(synchronize_slot_names); > + SplitIdentifierString(rawname, ',', &elemlist); > + > + foreach (l, elemlist) > + { > + char *name = lfirst(l); > + if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0) > + { > + shouldwait = true; > + break; > + } > + } > + > + pfree(rawname); > + rawname = NULL; > + list_free(elemlist); > + elemlist = NIL; > + > + if (!shouldwait) > + return; > + } > + > + rawname = pstrdup(standby_slot_names); > + SplitIdentifierString(rawname, ',', &elemlist); > > ... do we really want to be doing the GUC string parsing every time > through it? This sounds like it could be a bottleneck, or at least slow > things down. Maybe we should think about caching this somehow. > Yes, these parsed lists are now cached. Please see v15 (https://www.postgresql.org/message-id/CAJpy0uAuzbzvcjpnzFTiWuDBctnH-SDZC6AZabPX65x9GWBrjQ%40mail.gmail.com) thanks Shveta
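Regarding the caching that replaced the per-transaction SplitIdentifierString() calls quoted above: the shape of the fix is to parse the GUC value once and re-parse only when a reload invalidates the cache. Below is a standalone sketch of just that shape; it is not the patch's code. The names are hypothetical, the comma splitting is deliberately simplistic (none of the real identifier-quoting rules), and a plain flag stands in for SIGHUP/assign-hook processing.

```
/* Standalone sketch: cache a parsed name list, re-parse only on "reload". */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

static char **cached_names = NULL;  /* cached, already-parsed list */
static int    cached_count = 0;
static bool   list_changed = true;  /* set true on SIGHUP in the real code */

/* Split a comma-separated value into the cache (simplified, no quoting rules). */
static void
cache_name_list(const char *raw)
{
    char   *copy = strdup(raw);
    char   *tok;

    for (int i = 0; i < cached_count; i++)
        free(cached_names[i]);
    free(cached_names);
    cached_names = NULL;
    cached_count = 0;

    for (tok = strtok(copy, ","); tok != NULL; tok = strtok(NULL, ","))
    {
        cached_names = realloc(cached_names, (cached_count + 1) * sizeof(char *));
        cached_names[cached_count++] = strdup(tok);
    }
    free(copy);
    list_changed = false;
}

/* Per-transaction check: no string parsing unless the GUC changed. */
static bool
name_in_list(const char *guc_value, const char *name)
{
    if (list_changed)
        cache_name_list(guc_value);

    for (int i = 0; i < cached_count; i++)
        if (strcmp(cached_names[i], name) == 0)
            return true;
    return false;
}

int
main(void)
{
    printf("%d\n", name_in_list("slot_a,slot_b", "slot_b"));  /* parses once */
    printf("%d\n", name_in_list("slot_a,slot_b", "slot_c"));  /* reuses cache */
    return 0;
}
```

In the backend the splitting would of course still go through SplitIdentifierString() and the cache would be refreshed from the GUC machinery rather than a global flag; the sketch only shows why the hot path no longer touches the raw string.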
On Thu, Sep 7, 2023 at 8:29 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Aug 25, 2023 at 2:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > Wait a minute ... > > > > From bac0fbef8b203c530e5117b0b7cfee13cfab78b9 Mon Sep 17 00:00:00 2001 > > From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> > > Date: Sat, 22 Jul 2023 10:17:48 +0000 > > Subject: [PATCH v13 1/2] Allow logical walsenders to wait for physical > > standbys > > > > @@ -2498,6 +2500,13 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, > > } > > else > > { > > + /* > > + * Before we send out the last set of changes to logical decoding > > + * output plugin, wait for specified streaming replication standby > > + * servers (if any) to confirm receipt of WAL upto commit_lsn. > > + */ > > + WaitForStandbyLSN(commit_lsn); > > > > OK, so we call this new function frequently enough -- once per > > transaction, if I read this correctly? So ... > > > > +void > > +WaitForStandbyLSN(XLogRecPtr wait_for_lsn) > > +{ > > ... > > > > + /* "*" means all logical walsenders should wait for physical standbys. */ > > + if (strcmp(synchronize_slot_names, "*") != 0) > > + { > > + bool shouldwait = false; > > + > > + rawname = pstrdup(synchronize_slot_names); > > + SplitIdentifierString(rawname, ',', &elemlist); > > + > > + foreach (l, elemlist) > > + { > > + char *name = lfirst(l); > > + if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0) > > + { > > + shouldwait = true; > > + break; > > + } > > + } > > + > > + pfree(rawname); > > + rawname = NULL; > > + list_free(elemlist); > > + elemlist = NIL; > > + > > + if (!shouldwait) > > + return; > > + } > > + > > + rawname = pstrdup(standby_slot_names); > > + SplitIdentifierString(rawname, ',', &elemlist); > > > > ... do we really want to be doing the GUC string parsing every time > > through it? This sounds like it could be a bottleneck, or at least slow > > things down. Maybe we should think about caching this somehow. > > > > Yes, these parsed lists are now cached. Please see v15 > (https://www.postgresql.org/message-id/CAJpy0uAuzbzvcjpnzFTiWuDBctnH-SDZC6AZabPX65x9GWBrjQ%40mail.gmail.com) > > thanks > Shveta Patches (v15) were no longer applying to HEAD, rebased those and addressed below along-with: 1) Fixed an issue in slots-invalidation code-path on standby. Thanks Ajin for testing the patch and finding the issue. 2) Ensure that WAL is replayed on standby before moving the slot's position to the target location received from the primary. 3) Some code restructuring in slotsync.c thanks Shveta
Attachment
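Item (2) in the v16 note above (do not move the slot's position until the standby has replayed that WAL) boils down to a guard like the one sketched below. This is a standalone illustration with stand-in types, values and function names, not the patch's code; in the real worker the "retry" is simply the next sync cycle.

```
/* Standalone sketch: only advance a synced slot up to locally replayed WAL. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the backend typedef */

/*
 * The remote (primary) slot position may only be applied locally if this
 * standby has already replayed at least that far; otherwise the worker
 * must wait and try again later.
 */
static bool
can_advance_slot(XLogRecPtr remote_confirmed_lsn, XLogRecPtr local_replay_lsn)
{
    return remote_confirmed_lsn <= local_replay_lsn;
}

int
main(void)
{
    XLogRecPtr  remote_confirmed_lsn = 0x2000300;   /* from the primary's slot */
    XLogRecPtr  local_replay_lsn = 0x2000100;       /* standby replay progress */

    if (can_advance_slot(remote_confirmed_lsn, local_replay_lsn))
        printf("advance local slot to %llX\n",
               (unsigned long long) remote_confirmed_lsn);
    else
        printf("standby has not replayed up to %llX yet; retry next cycle\n",
               (unsigned long long) remote_confirmed_lsn);
    return 0;
}
```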
Dear Shveta, I have resumed checking the thread. Here are my high-level comments. Sorry if these have already been discussed. 01. General I think documentation can be added beyond just the GUCs. How about adding examples for combinations of physical and logical replication? You can say that the physical primary can also be a publisher and that slots on the primary/standby are synchronized. 02. General standby_slot_names ensures that the physical standby is always ahead of the subscriber, but I think it may not be sufficient. There is a possibility that the primary server does not have any physical slots. In this case the physical standby may be behind the subscriber and the system may be confused when a failover occurs. Can't we specify the name of the standby via application_name or something? 03. General In this architecture, the syncslot workers are launched per db and they independently connect to the primary, right? I'm not sure whether it is efficient, but I came up with another architecture - only one worker (syncslot receiver) connects to the primary and the other workers (syncslot workers) receive info from it and do the updates. This can reduce the number of connections, so it may slightly improve the network latency. What do you think? 04. General The test file recovery/t/051_slot_sync.pl is missing. 04. ReplSlotSyncMain Does the worker have to connect to a specific database? ``` /* Connect to our database. */ BackgroundWorkerInitializeConnectionByOid(MySlotSyncWorker->dbid, MySlotSyncWorker->userid, 0); ``` 05. SlotSyncInitSlotNamesLst() "Lst" should be "List". Best Regards, Hayato Kuroda FUJITSU LIMITED
On Fri, Sep 8, 2023 at 1:54 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Sep 7, 2023 at 8:29 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Fri, Aug 25, 2023 at 2:15 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > Wait a minute ... > > > > > > From bac0fbef8b203c530e5117b0b7cfee13cfab78b9 Mon Sep 17 00:00:00 2001 > > > From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> > > > Date: Sat, 22 Jul 2023 10:17:48 +0000 > > > Subject: [PATCH v13 1/2] Allow logical walsenders to wait for physical > > > standbys > > > > > > @@ -2498,6 +2500,13 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, > > > } > > > else > > > { > > > + /* > > > + * Before we send out the last set of changes to logical decoding > > > + * output plugin, wait for specified streaming replication standby > > > + * servers (if any) to confirm receipt of WAL upto commit_lsn. > > > + */ > > > + WaitForStandbyLSN(commit_lsn); > > > > > > OK, so we call this new function frequently enough -- once per > > > transaction, if I read this correctly? So ... > > > > > > +void > > > +WaitForStandbyLSN(XLogRecPtr wait_for_lsn) > > > +{ > > > ... > > > > > > + /* "*" means all logical walsenders should wait for physical standbys. */ > > > + if (strcmp(synchronize_slot_names, "*") != 0) > > > + { > > > + bool shouldwait = false; > > > + > > > + rawname = pstrdup(synchronize_slot_names); > > > + SplitIdentifierString(rawname, ',', &elemlist); > > > + > > > + foreach (l, elemlist) > > > + { > > > + char *name = lfirst(l); > > > + if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0) > > > + { > > > + shouldwait = true; > > > + break; > > > + } > > > + } > > > + > > > + pfree(rawname); > > > + rawname = NULL; > > > + list_free(elemlist); > > > + elemlist = NIL; > > > + > > > + if (!shouldwait) > > > + return; > > > + } > > > + > > > + rawname = pstrdup(standby_slot_names); > > > + SplitIdentifierString(rawname, ',', &elemlist); > > > > > > ... do we really want to be doing the GUC string parsing every time > > > through it? This sounds like it could be a bottleneck, or at least slow > > > things down. Maybe we should think about caching this somehow. > > > > > > > Yes, these parsed lists are now cached. Please see v15 > > (https://www.postgresql.org/message-id/CAJpy0uAuzbzvcjpnzFTiWuDBctnH-SDZC6AZabPX65x9GWBrjQ%40mail.gmail.com) > > > > thanks > > Shveta > > Patches (v15) were no longer applying to HEAD, rebased those and > addressed below along-with: > > 1) Fixed an issue in slots-invalidation code-path on standby. Thanks > Ajin for testing the patch and finding the issue. > 2) Ensure that WAL is replayed on standby before moving the slot's > position to the target location received from the primary. > 3) Some code restructuring in slotsync.c > > thanks > Shveta There were cfbot failures on v16 patches: --presence of 051_slot_sync.pl in meson.build even though the file is removed. --usage of uint in launcher.c Fixed above and attached v16_2_0001/0002 patches again. thanks Shveta
Attachment
On Fri, Sep 8, 2023 at 4:40 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Shveta, > > I resumed to check the thread. Here are my high-level comments. > Sorry if you have been already discussed. Thanks Kuroda-san for the feedback. > > 01. General > > I think the documentation can be added, not only GUCs. How about adding examples > for combinations of physical and logical replications? You can say that both of > physical primary can be publisher and slots on primary/standby are synchronized. > I did not fully understand this. Can you please state a clear example. We are only synchronizing logical replication slots in this draft and that too on physical standby from primary. So the last statement is not completely true. > 02. General > > standby_slot_names ensures that physical standby is always ahead subscriber, but I > think it may be not sufficient. There is a possibility that primary server does > not have any physical slots.So it expects a slot to be present. > In this case the physical standby may be behind the > subscriber and the system may be confused when the failover is occured. Currently there is a check in slot-sync worker which mandates that there is a physical slot present between primary and standby for this feature to proceed.So that confusion state will not arise. + /* WalRcvData is not set or primary_slot_name is not set yet */ + if (!WalRcv || WalRcv->slotname[0] == '\0') + return naptime; >Can't we specify the name of standby via application_name or something? So do you mean that in absence of a physical slot (if we plan to support that), we let primary know about standby(slots-synchronization client) through application_name? I am not sure about this. Will think more on this. I would like to know others' opinion on this as well. > > 03. General > > In this architecture, the syncslot worker is launched per db and they > independently connects to primary, right? Not completely true. Each slotsync worker is responsible for managing N dbs. Here 'N' = 'Number of distinct dbs for slots in synchronize_slot_names'/ 'number of max_slotsync_workers configured' for cases where dbcount exceeds workers configured. And if dbcount < max_slotsync_workers, then we launch only that many workers equal to dbcount and each worker manages a single db. Each worker independently connects to primary. Currently it makes a connection multiple times, I am optimizing it to make connection only once and then after each SIGHUP assuming 'primary_conninfo' may change. This change will be in the next version. >I'm not sure it is efficient, but I > come up with another architecture - only a worker (syncslot receiver)connects > to the primary and other workers (syncslot worker) receives infos from it and > updates. This can reduce the number of connections so that it may slightly > improve the latency of network. How do you think? > I feel it may help in reducing network latency, but not sure if it could be more efficient in keeping the lsns in sync. I feel it may introduce lag due to the fact that only one worker is getting all the info from primary and the actual synchronizing workers are waiting on that worker. This lag may be more when the number of slots are huge. We have run some performance tests on the design implemented currently, please have a look at emails around [1] and [2]. > 04. General > > test file recovery/t/051_slot_sync.pl is missing. > yes, it was removed. Please see point3 at [3] > 04. 
ReplSlotSyncMain > > Does the worker have to connect to the specific database? > > > > > > ``` > > /* Connect to our database. */ > > BackgroundWorkerInitializeConnectionByOid(MySlotSyncWorker->dbid, > > MySlotSyncWorker->userid, > > 0); > > ``` > > Since we are using the libpq public interface 'walrcv_exec=libpqrcv_exec' to connect to the primary, this needs a database connection. It errors out in the absence of 'MyDatabaseId'. Do you think a db-connection can have some downsides? > > > > 05. SlotSyncInitSlotNamesLst() > > > > "Lst" should be "List". > > Okay, I will change this in the next version. > > ========== > > [1]: https://www.postgresql.org/message-id/CAJpy0uD2F43avuXy_yQv7Wa3kpUwioY_Xn955xdmd6vX0ME6%3Dg%40mail.gmail.com > [2]: https://www.postgresql.org/message-id/CAJpy0uD%3DDevMxTwFVsk_%3DxHqYNH8heptwgW6AimQ9fbRmx4ioQ%40mail.gmail.com > [3]: https://www.postgresql.org/message-id/CAJpy0uAuzbzvcjpnzFTiWuDBctnH-SDZC6AZabPX65x9GWBrjQ%40mail.gmail.com > > thanks > Shveta
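The db-to-worker distribution described in the reply above (one worker per db up to max_slotsync_workers, otherwise the dbs shared out, with a new db going to the worker handling the fewest dbs and to the first such worker on ties, as the commit message later spells out) can be made concrete with the standalone sketch below. The constant, the dbids and all names are illustrative only and combine the two descriptions given in this thread.

```
/* Standalone sketch of distributing slot databases across sync workers. */
#include <stdio.h>

#define MAX_SLOTSYNC_WORKERS 4      /* stands in for the max_slotsync_workers GUC */

int
main(void)
{
    unsigned int dbids[] = {16384, 16385, 16386, 16387, 16388, 16389};
    int          dbcount = sizeof(dbids) / sizeof(dbids[0]);

    /* Launch one worker per db, but never more than the configured maximum. */
    int          nworkers = (dbcount < MAX_SLOTSYNC_WORKERS) ? dbcount : MAX_SLOTSYNC_WORKERS;
    int          per_worker_count[MAX_SLOTSYNC_WORKERS] = {0};

    /*
     * Assign each db to the worker currently handling the fewest dbs
     * (the first such worker on ties), as the launcher is described to do.
     */
    for (int d = 0; d < dbcount; d++)
    {
        int target = 0;

        for (int w = 1; w < nworkers; w++)
            if (per_worker_count[w] < per_worker_count[target])
                target = w;

        per_worker_count[target]++;
        printf("dbid %u -> worker %d\n", dbids[d], target);
    }

    return 0;
}
```

With six dbs and four workers this prints assignments 0,1,2,3,0,1, i.e. the load difference between workers never exceeds one db.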
Dear Shveta, Sorry for the late response. > Thanks Kuroda-san for the feedback. > > > > 01. General > > > > I think the documentation can be added, not only GUCs. How about adding > examples > > for combinations of physical and logical replications? You can say that both of > > physical primary can be publisher and slots on primary/standby are > synchronized. > > > > I did not fully understand this. Can you please state a clear example. > We are only synchronizing logical replication slots in this draft and > that too on physical standby from primary. So the last statement is > not completely true. I was expecting a new subsection to be added in "Log-Shipping Standby Servers". I think we can add info like the following: * a logical replication publisher can also be replicated * For that, a physical replication slot must be defined on the primary * Then we can set up standby_slot_names (on the primary) and synchronize_slot_names (on both servers). * slots are synchronized automatically > > 02. General > > > > standby_slot_names ensures that physical standby is always ahead subscriber, > but I > > think it may be not sufficient. There is a possibility that primary server does > > not have any physical slots.So it expects a slot to be present. > > In this case the physical standby may be behind the > > subscriber and the system may be confused when the failover is occured. > > Currently there is a check in slot-sync worker which mandates that > there is a physical slot present between primary and standby for this > feature to proceed.So that confusion state will not arise. > + /* WalRcvData is not set or primary_slot_name is not set yet */ > + if (!WalRcv || WalRcv->slotname[0] == '\0') > + return naptime; Right, but I wanted to know why it is needed. One motivation seemed to be knowing the WAL location of the physical standby, but I thought that struct WalSnd.apply could also be used. Is it bad to assume that the physical walsender always exists? > >Can't we specify the name of standby via application_name or something? > > So do you mean that in absence of a physical slot (if we plan to > support that), we let primary know about standby(slots-synchronization > client) through application_name? Yes, that is what I considered. > > > > 03. General > > > > In this architecture, the syncslot worker is launched per db and they > > independently connects to primary, right? > > Not completely true. Each slotsync worker is responsible for managing > N dbs. Here 'N' = 'Number of distinct dbs for slots in > synchronize_slot_names'/ 'number of max_slotsync_workers configured' > for cases where dbcount exceeds workers configured. > And if dbcount < max_slotsync_workers, then we launch only that many > workers equal to dbcount and each worker manages a single db. Each > worker independently connects to primary. Currently it makes a > connection multiple times, I am optimizing it to make connection only > once and then after each SIGHUP assuming 'primary_conninfo' may > change. This change will be in the next version. > > > >I'm not sure it is efficient, but I > > come up with another architecture - only a worker (syncslot receiver)connects > > to the primary and other workers (syncslot worker) receives infos from it and > > updates. This can reduce the number of connections so that it may slightly > > improve the latency of network. How do you think? > > > > I feel it may help in reducing network latency, but not sure if it > could be more efficient in keeping the lsns in sync.
I feel it may > introduce lag due to the fact that only one worker is getting all the > info from primary and the actual synchronizing workers are waiting on > that worker. This lag may be more when the number of slots are huge. > We have run some performance tests on the design implemented > currently, please have a look at emails around [1] and [2]. Thank you for the explanation! Yeah, I agree that the other point might become a bottleneck. It could be revisited in the future, but currently we do not have to consider it... > > 04. ReplSlotSyncMain > > > > Does the worker have to connect to the specific database? > > > > > > ``` > > /* Connect to our database. */ > > > BackgroundWorkerInitializeConnectionByOid(MySlotSyncWorker->dbid, > > > MySlotSyncWorker->userid, > > > 0); > > ``` > > Since we are using libpq public interface 'walrcv_exec=libpqrcv_exec' > to connect to primary, this needs database connection. It errors out > in the absence of 'MyDatabaseId'. Do you think db-connection can have > some downsides? > I considered that we should not grant privileges to access more data than necessary. It might be better if we can avoid connecting to a specific database. But I'm not sure that we should have to add a new walreceiver API to handle it. FYI, I checked the physical walreceiver for reference, but it is not a background worker, so it did not help. And the following are further comments. 1. I considered the combination of this feature with initial data sync, and found an issue. What do you think? Assume that the name of the subscription is specified in "synchronize_slot_names". The synchronization of each table is separated into two transactions: 1. In the first transaction, a logical replication slot (pg_XXX_sync_XXX...) is created and tuples are COPYed. 2. In the second transaction, changes since the first transaction are streamed and applied. If the primary crashes between 1 and 2 and the standby is promoted, the tablesync worker would execute "START_REPLICATION SLOT pg_XXX_sync_XXX..." against the promoted server, but fail because such a slot does not exist. Is this a problem we should solve? The above can be reproduced by adding a sleep(). 2. Do we have to add some rules in the "Configuration Settings" section? 3. You can run pgindent at your convenience. Best Regards, Hayato Kuroda FUJITSU LIMITED
On Mon, Sep 11, 2023 at 9:49 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Sep 8, 2023 at 4:40 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Shveta, > > > > I resumed to check the thread. Here are my high-level comments. > > Sorry if you have been already discussed. > > Thanks Kuroda-san for the feedback. > > > > 01. General > > > > I think the documentation can be added, not only GUCs. How about adding examples > > for combinations of physical and logical replications? You can say that both of > > physical primary can be publisher and slots on primary/standby are synchronized. > > > > I did not fully understand this. Can you please state a clear example. > We are only synchronizing logical replication slots in this draft and > that too on physical standby from primary. So the last statement is > not completely true. > > > 02. General > > > > standby_slot_names ensures that physical standby is always ahead subscriber, but I > > think it may be not sufficient. There is a possibility that primary server does > > not have any physical slots.So it expects a slot to be present. > > In this case the physical standby may be behind the > > subscriber and the system may be confused when the failover is occured. > > Currently there is a check in slot-sync worker which mandates that > there is a physical slot present between primary and standby for this > feature to proceed.So that confusion state will not arise. > + /* WalRcvData is not set or primary_slot_name is not set yet */ > + if (!WalRcv || WalRcv->slotname[0] == '\0') > + return naptime; > > >Can't we specify the name of standby via application_name or something? > > So do you mean that in absence of a physical slot (if we plan to > support that), we let primary know about standby(slots-synchronization > client) through application_name? I am not sure about this. Will think > more on this. I would like to know others' opinion on this as well. > > > > > 03. General > > > > In this architecture, the syncslot worker is launched per db and they > > independently connects to primary, right? > > Not completely true. Each slotsync worker is responsible for managing > N dbs. Here 'N' = 'Number of distinct dbs for slots in > synchronize_slot_names'/ 'number of max_slotsync_workers configured' > for cases where dbcount exceeds workers configured. > And if dbcount < max_slotsync_workers, then we launch only that many > workers equal to dbcount and each worker manages a single db. Each > worker independently connects to primary. Currently it makes a > connection multiple times, I am optimizing it to make connection only > once and then after each SIGHUP assuming 'primary_conninfo' may > change. This change will be in the next version. > > > >I'm not sure it is efficient, but I > > come up with another architecture - only a worker (syncslot receiver)connects > > to the primary and other workers (syncslot worker) receives infos from it and > > updates. This can reduce the number of connections so that it may slightly > > improve the latency of network. How do you think? > > > > I feel it may help in reducing network latency, but not sure if it > could be more efficient in keeping the lsns in sync. I feel it may > introduce lag due to the fact that only one worker is getting all the > info from primary and the actual synchronizing workers are waiting on > that worker. This lag may be more when the number of slots are huge. 
> We have run some performance tests on the design implemented > currently, please have a look at emails around [1] and [2]. > > > 04. General > > > > test file recovery/t/051_slot_sync.pl is missing. > > > > yes, it was removed. Please see point3 at [3] > > > > 04. ReplSlotSyncMain > > > > Does the worker have to connect to the specific database? > > > > > > ``` > > /* Connect to our database. */ > > BackgroundWorkerInitializeConnectionByOid(MySlotSyncWorker->dbid, > > MySlotSyncWorker->userid, > > 0); > > ``` > > Since we are using libpq public interface 'walrcv_exec=libpqrcv_exec' > to connect to primary, this needs database connection. It errors out > in the absence of 'MyDatabaseId'. Do you think db-connection can have > some downsides? > > > > > 05. SlotSyncInitSlotNamesLst() > > > > "Lst" should be "List". > > > > Okay, I will change this in the next version. > > ========== > > [1]: https://www.postgresql.org/message-id/CAJpy0uD2F43avuXy_yQv7Wa3kpUwioY_Xn955xdmd6vX0ME6%3Dg%40mail.gmail.com > [2]: https://www.postgresql.org/message-id/CAJpy0uD%3DDevMxTwFVsk_%3DxHqYNH8heptwgW6AimQ9fbRmx4ioQ%40mail.gmail.com > [3]: https://www.postgresql.org/message-id/CAJpy0uAuzbzvcjpnzFTiWuDBctnH-SDZC6AZabPX65x9GWBrjQ%40mail.gmail.com > > thanks > Shveta PFA v17. It has below changes: 1) There was a common portion between SlotSync worker and LogicalRep worker structures. The common portion is now moved to WorkerHeader. The common functions are merged. 2) Connection to primary is made once in the beginning in both slotSync worker as well as launcher. Earlier it was before each sync cycle. 3) SpinLock Removed. Earlier LWlock was used for shared-memory access by workers and then there was extra Spinlock for dbids access in DSM, which is removed now. LWLock alone seems enough to maintain the consistency. 4) In 'alter system standby_slot_names', we can not give non-existing slot-names or logical slots now. Earlier it was accepting everything. This specific change is in patch1, rest in patch2. Thanks Ajin for working on 1. Next, I plan to review patch01 and the existing feedback about it. Until now focus was patch02. thanks Shveta
Attachment
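The struct consolidation mentioned in change (1) of the v17 note above can be pictured with the sketch below. It is only an illustration of the idea, not the patch's definitions: the stand-in typedefs replace backend types so the snippet compiles on its own, the field lists are truncated, and the member name 'hdr' follows a later review suggestion rather than the patch (which uses 'header').

```
/* Standalone sketch: fields shared by logical-rep and slot-sync workers
 * pulled into a common header that both structs embed. */
#include <stdbool.h>
#include <stdint.h>

typedef int64_t TimestampTz;    /* stand-in for the backend typedef */
typedef unsigned int Oid;       /* stand-in for the backend typedef */
typedef struct PGPROC PGPROC;   /* opaque here; a real struct in the backend */

typedef struct WorkerHeader
{
    TimestampTz launch_time;    /* when this worker was launched */
    bool        in_use;         /* is this pool entry taken? */
    uint16_t    generation;     /* bumped each time the entry is reused */
    PGPROC     *proc;           /* backend proc, NULL if not running */
    Oid         userid;         /* user to connect as */
    Oid         dbid;           /* database to connect to */
} WorkerHeader;

/* Logical replication apply worker: header plus apply-specific state. */
typedef struct LogicalRepWorker
{
    WorkerHeader hdr;
    Oid         subid;          /* subscription this worker serves */
    /* ... remaining apply-worker fields elided ... */
} LogicalRepWorker;

/* Slot-sync worker: the same header plus its per-db bookkeeping. */
typedef struct SlotSyncWorker
{
    WorkerHeader hdr;
    int         slot;           /* index in the worker pool */
    uint32_t    dbcount;        /* number of databases it manages */
    /* ... dsa pointers and monitoring info elided ... */
} SlotSyncWorker;

int
main(void)
{
    LogicalRepWorker lw = {0};
    SlotSyncWorker sw = {0};

    /* Common code can now touch either worker through the shared header. */
    lw.hdr.in_use = true;
    sw.hdr.in_use = true;
    return (lw.hdr.in_use && sw.hdr.in_use) ? 0 : 1;
}
```

This is what lets functions like the two WaitFor*WorkerAttach variants mentioned earlier in the thread collapse into one routine operating on the shared header.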
On Wed, Sep 6, 2023 at 2:18 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Sep 1, 2023 at 1:59 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Hi Shveta. Here are some comments for patch v14-0002 > > > > The patch is large, so my code review is a WIP... more later next week... > > > > Thanks Peter for the feedback. I have tried to address most of these > in v15. Please find my response inline for the ones which I have not > addressed. > > > > > 26. > > +typedef struct SlotSyncWorker > > +{ > > + /* Time at which this worker was launched. */ > > + TimestampTz launch_time; > > + > > + /* Indicates if this slot is used or free. */ > > + bool in_use; > > + > > + /* The slot in worker pool to which it is attached */ > > + int slot; > > + > > + /* Increased every time the slot is taken by new worker. */ > > + uint16 generation; > > + > > + /* Pointer to proc array. NULL if not running. */ > > + PGPROC *proc; > > + > > + /* User to use for connection (will be same as owner of subscription). */ > > + Oid userid; > > + > > + /* Database id to connect to. */ > > + Oid dbid; > > + > > + /* Count of Database ids it manages */ > > + uint32 dbcount; > > + > > + /* DSA for dbids */ > > + dsa_area *dbids_dsa; > > + > > + /* dsa_pointer for database ids it manages */ > > + dsa_pointer dbids_dp; > > + > > + /* Mutex to access dbids in dsa */ > > + slock_t mutex; > > + > > + /* Info about slot being monitored for worker's naptime purpose */ > > + SlotSyncWorkerWatchSlot monitor; > > +} SlotSyncWorker; > > > > There seems an awful lot about this struct which is common with > > 'LogicalRepWorker' struct. > > > > It seems a shame not to make use of the commonality instead of all the > > cut/paste here. > > > > E.g. Can it be rearranged so all these common fields are shared: > > - launch_time > > - in_use > > - slot > > - generation > > - proc > > - userid > > - dbid > > > > Sure, I had this in mind along with previous comments where it was > suggested to merge similar functions like > WaitForReplicationWorkerAttach, WaitForSlotSyncWorkerAttach etc. That > merging could only be possible if we try to merge the common part of > these structures. This is WIP, will be addressed in the next version. > This has been addressed in version-17 now. thanks Shveta
On Wed, Sep 13, 2023 at 4:54 PM shveta malik <shveta.malik@gmail.com> wrote: > > PFA v17. It has below changes: > @@ -2498,6 +2500,13 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, } else { + /* + * Before we send out the last set of changes to logical decoding + * output plugin, wait for specified streaming replication standby + * servers (if any) to confirm receipt of WAL upto commit_lsn. + */ + WaitForStandbyLSN(commit_lsn); It seems the first patch has a wait logic for every commit. I think it is better to integrate this wait with WalSndWaitForWal() as suggested by Andres in his email[1]. [1] - https://www.postgresql.org/message-id/20220207204557.74mgbhowydjco4mh%40alap3.anarazel.de -- With Regards, Amit Kapila.
Hi. Here are some review comments for v17-0002. This is a WIP and a long way from complete, but I wanted to send what I have so far (while it is still current with your latest posted patches). ====== 1. GENERAL - loop variable declaration There are some code examples like below where the loop variable is declared within the for. AFAIK this style of declaration is atypical for the PG source. + /* Find unused worker slot. */ + for (int i = 0; i < max_slotsync_workers; i++) Search/Replace. ~~~ 2. GENERAL - from primary There are multiple examples in messages and comments that say "from primary". I felt most would be better to say "from the primary". Search/Replace. ~~~ 3. GENERAL - pg_indent There are lots of examples of function arguments like "* worker" (with space) which changed to "*worker" (without space) in v16 and then changed back to "* worker" with space in v17. Can all these toggles be cleaned up by running pg_indent? ====== Commit message. 4. This patch attempts to implement logical replication slots synchronization from primary server to physical standby so that logical subscribers are not blocked after failover. Now-on, all the logical replication slots created on primary (assuming configurations are appropriate) are automatically created on physical standbys and are synced periodically. This has been acheived by starting slot-sync worker(s) on standby server which pings primary at regular intervals to get the logical slots information and create/update the slots locally. SUGGESTION (just minor rewording) This patch implements synchronization of logical replication slots from the primary server to the physical standby so that logical subscribers are not blocked after failover. All the logical replication slots on the primary (assuming configurations are appropriate) are automatically created on the physical standbys and are synced periodically. Slot-sync worker(s) on the standby server ping the primary at regular intervals to get the necessary logical slot information and create/update the slots locally. ~ 5. For max number of slot-sync workers on standby, new GUC max_slotsync_workers has been added, default value and max value is kept at 2 and 50 respectively. This parameter can only be set at server start. 5a. SUGGESTION (minor rewording) A new GUC 'max_slotsync_workers' defines the maximum number of slot-sync workers on the standby: default value = 2, max value = 50. This parameter can only be set at server start ~ 5b. Actually, I think mentioning the values 2 and 50 here might be too much detail, but I left it anyway. Consider removing that. ~~~ 6. Now replication launcher on physical standby queries primary to get list of dbids which belong to slots mentioned in GUC 'synchronize_slot_names'. Once it gets the dbids, if dbids < max_slotsync_workers, it starts only that many workers and if dbids > max_slotsync_workers, it starts max_slotsync_workers and divides the work equally among them. Each worker is then responsible to keep on syncing the concerned logical slots belonging to the DBs assigned to it. ~ 6a. SUGGESTION (first sentence) Now the replication launcher on the physical standby queries primary to get the list of dbids that belong to the... ~ 6b. "concerned" ?? ~~~ 7. Let us say slots mentioned in 'synchronize_slot_names' on primary belongs to 4 DBs and say 'max_slotsync_workers' is 4, then a new worker will be launched for each db. 
If a new logical slot with a different DB is found by replication launcher, it will assign this new db to the worker handling the minimum number of dbs currently (or first worker in case of equal count). ~ /Let us say/For example, let's say/ ~~~ 8. The naptime of worker is tuned as per the activity on primary. Each worker starts with naptime of 10ms and if no activity is observed on primary for some time, then naptime is increased to 10sec. And if activity is observed again, naptime is reduced back to 10ms. Each worker does it by choosing one slot (first one assigned to it) for monitoring purpose. If there is no change in lsn of that slot for say over 10 sync-checks, naptime is increased to 10sec and as soon as a change is observed, naptime is reduced back to 10ms. ~ /as per the activity on primary/according to the activity on the primary/ /is observed on primary/is observed on the primary/ /Each worker does it by choosing one slot/Each worker uses one slot/ ~~~ 9. If there is any change in synchronize_slot_names, then the slots which are no longer part of it or the ones which no longer exist on primary will be dropped by slot-sync workers on physical standbys. ~ 9a. /on primary/on the primary/ /which no longer exist/that no longer exist/ ~ 9b. I didn't really understand why this says "or the ones which no longer exist". IIUC (from prior paragraph) such slots would already be invalidated/removed by the sync-slot worker in due course -- i.e. we don't need to wait for some change to the 'synchronize_slot_names' list to trigger that deletion, right? ====== doc/src/sgml/config.sgml 10. + <varlistentry id="guc-max-slotsync-workers" xreflabel="max_slotsync_workers"> + <term><varname>max_slotsync_workers</varname> (<type>integer</type>) + <indexterm> + <primary><varname>max_slotsync_workers</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + Specifies maximum number of slot synchronization workers. + </para> + <para> + Slot synchronization workers are taken from the pool defined by + <varname>max_worker_processes</varname>. + </para> + <para> + The default value is 2. This parameter can only be set at server + start. + </para> + </listitem> + </varlistentry> This looks OK, but IMO there also needs some larger description (here or elsewhere?) about this feature more generally. Otherwise, why would the user change the 'max_slotsync_workers' when there is nothing to say "slot synchronization workers" are for? ====== src/backend/postmaster/bgworker.c 11. { "ApplyWorkerMain", ApplyWorkerMain }, + { + "ReplSlotSyncMain", ReplSlotSyncMain + }, { "ParallelApplyWorkerMain", ParallelApplyWorkerMain }, ~ I thought this entry point name/function should include the word "Worker" same as for the others. ====== .../libpqwalreceiver/libpqwalreceiver.c 12. +/* + * Get DB info for logical slots + * + * It gets the DBIDs for slot_names from primary. The list obatined has no + * duplicacy of DBIds. + */ +static List * +libpqrcv_get_dbinfo_for_logical_slots(WalReceiverConn *conn, + const char *slot_names) 12a. typo /obatined/ SUGGESTION The returned list has no duplicates. ~ 12b. I did not recognise any part of the function logic ensuring no duplicates are returned. IIUC it is actually within the logic of LIST_DBID_FOR_LOGICAL_SLOTS that this is handled, so maybe the comment can mention that. ~~~ 13. 
libpqrcv_get_dbinfo_for_logical_slots + if (PQnfields(res) != 1) + { + int nfields = PQnfields(res); + + PQclear(res); + ereport(ERROR, + (errmsg("invalid response from primary server"), + errdetail("Could not get list of slots: got %d fields, " + "expected %d fields.", + nfields, 1))); + } Something seems not right about the message. The "expected" part plurality is wrong, and if it can only be 1 then why use substitution? ====== src/backend/replication/logical/Makefile OK ====== src/backend/replication/logical/launcher.c 14. slot_sync_worker_stop +static void +slot_sync_worker_stop(SlotSyncWorker *worker) +{ + + Assert(LWLockHeldByMeInMode(SlotSyncWorkerLock, LW_SHARED)); ... + LWLockAcquire(SlotSyncWorkerLock, LW_SHARED); + } + +} Unnecessary whitespace at the top and bottom of this function. ~~~ 15. slot_sync_worker_launch_or_reuse + /* Find unused worker slot. */ + for (int i = 0; i < max_slotsync_workers; i++) loop variable declaration. ~~~ 16. slot_sync_worker_launch_or_reuse + if (!worker) + { + for (int i = 0; i < max_slotsync_workers; i++) loop variable declaration. ~~~ 17. slot_sync_remove_obsolete_dbs + /* Traverse slot-sync-workers to validate the DBs */ + for (int widx = 0; widx < max_slotsync_workers; widx++) + { loop variable declaration. ~ 18. + for (int dbidx = 0; dbidx < worker->dbcount;) + { loop variable declaration ~ 19. + for (int i = dbidx; i < worker->dbcount; i++) + { loop variable declaration ~ 20. + /* If dbcount for any worker has become 0, shut it down */ + for (int widx = 0; widx < max_slotsync_workers; widx++) + { loop variable declaration ~ 21. + } + +} + Unnecessary whitespace at the end of the function body ~~~ 22. ApplyLauncherStartSubs +static void +ApplyLauncherStartSubs(long *wait_time) +{ Missing function comment. ====== .../replication/logical/logicalfuncs.c OK ====== src/backend/replication/logical/meson.build OK ====== src/backend/replication/logical/slotsync.c 23. +/*------------------------------------------------------------------------- + * slotsync.c + * PostgreSQL worker for synchronizing slots to a standby from primary + * + * Copyright (c) 2016-2018, PostgreSQL Global Development Group + * Wrong copyright date? ~~~ 24. + * This file contains the code for slot-sync worker on physical standby that + * fetches the logical replication slots information from primary server + * (PrimaryConnInfo) and creates the slots on standby and synchronizes them + * periodically. It synchronizes only the slots configured in + * 'synchronize_slot_names'. SUGGESTION This file contains the code for slot-sync workers on physical standby to fetch logical replication slot information from the primary server (PrimaryConnInfo), create the slots on the standby, and synchronize them periodically. Slot-sync workers only synchronize slots configured in 'synchronize_slot_names'. ~~~ 25. + * It takes a nap of WORKER_DEFAULT_NAPTIME before every next synchronization. + * If there is no acitivity observed on primary for sometime, it increases the + * naptime to WORKER_INACTIVITY_NAPTIME and as soon as any activity is observed, + * it brings back the naptime to default value. SUGGESTION (2nd sentence) If there is no activity observed on the primary for some time, the naptime is increased to WORKER_INACTIVITY_NAPTIME, but if any activity is observed, the naptime reverts to the default value. ~~~ 26. 
+typedef struct RemoteSlot +{ + char *name; + char *plugin; + char *database; + bool two_phase; + bool conflicting; + XLogRecPtr restart_lsn; + XLogRecPtr confirmed_lsn; + TransactionId catalog_xmin; + + /* RS_INVAL_NONE if valid, or the reason of invalidation */ + ReplicationSlotInvalidationCause invalidated; +} RemoteSlot; This deserves at least a struct-level comment. ~~~ 27. +/* + * Inactivity Threshold Count before increasing naptime of worker. + * + * If the lsn of slot being monitored did not change for these many times, + * then increase naptime of current worker from WORKER_DEFAULT_NAPTIME to + * WORKER_INACTIVITY_NAPTIME. + */ +#define WORKER_INACTIVITY_THRESHOLD 10 I felt this constant would be better expressed as a time interval instead of a magic number. You can easily derive that loop count anyway in the code logic. e.g. here the comment would be "If the lsn of the slot being monitored did not change for XXXms then...". ~~~ 28. wait_for_primary_slot_catchup +/* + * Wait for remote slot to pass localy reserved position. + */ +static void +wait_for_primary_slot_catchup(WalReceiverConn *wrconn, char *slot_name, + XLogRecPtr min_lsn) /localy/locally/ ~~~ 29. wait_for_primary_slot_catchup + ereport(ERROR, + (errmsg("slot \"%s\" disapeared from primary", + slot_name))); /disapeared/disappeared/ ~~~ 30. ReplSlotSyncMain + if (!dsa) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("could not map dynamic shared memory " + "segment for slot-sync worker"))); + + + /* Primary initialization is complete. Now, attach to our slot. */ Unnecessary double whitespace. ====== src/backend/replication/logical/tablesync.c OK ====== src/backend/replication/repl_gram.y OK ====== src/backend/replication/repl_scanner.l OK ====== src/backend/replication/slot.c ====== src/backend/replication/slotfuncs.c 31. +/* + * SQL function for getting invalidation cause of a slot. + * + * Returns ReplicationSlotInvalidationCause enum value for valid slot_name; + * returns NULL if slot with given name is not found. + */ +Datum +pg_get_invalidation_cause(PG_FUNCTION_ARGS) +{ + Name name = PG_GETARG_NAME(0); + ReplicationSlotInvalidationCause cause; + int slotno; + + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (slotno = 0; slotno < max_replication_slots; slotno++) + { + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[slotno]; + if (strcmp(NameStr(s->data.name), NameStr(*name)) == 0) + { + cause = s->data.invalidated; + PG_RETURN_INT16(cause); + } + } + LWLockRelease(ReplicationSlotControlLock); + + PG_RETURN_NULL(); +} 31a. There seems no check if the slot actually is invalidated. I guess in that case the function just returns the enum value RS_INVAL_NONE, but should that be mentioned in the function header comment? ~ 31b. Seems a poor choice of function name -- does not even have the word "slot" in the name (??). ~ 31c. IMO it is better to have a blankline after the declaration in the loop. ~ 31b. Might be simpler just to remove that 'cause' variable. It's not doing much. ====== src/backend/replication/walsender.c 32. ListSlotDatabaseOIDs +/* + * Handle the LIST_SLOT_DATABASE_OIDS command. + */ +static void +ListSlotDatabaseOIDs(ListDBForLogicalSlotsCmd *cmd) 32a. The function-level comment seems too terse. Just saying "handle the command" does not describe what this function is actually doing and how it does it. ~ 32b. Is "LIST_SLOT_DATABASE_OIDS" even the correct name? I don't see that anywhere else in this patch. 
AFAICT it should be "LIST_DBID_FOR_LOGICAL_SLOTS". ~ 33. ListSlotDatabaseOIDs - comments The comments in the body of this function are inconsistent begining uppercase/lowercase ~ 34. ListSlotDatabaseOIDs - sorting/logic Maybe explain better the reason for having the qsort and other logic. TBH, I was not sure of the necessity for the names lists and the sorting and bsearch logic. AFAICT these are all *only* used to check for uniqueness and existence of the slot name. So I was wondering if a hashmap keyed by the slot name might be more appropriate and also simpler than all this list sorting/searching. ~~ 35. ListSlotDatabaseOIDs + for (int slotno = 0; slotno < max_replication_slots; slotno++) + { loop variable declaration ====== src/backend/storage/lmgr/lwlock.c OK ====== src/backend/storage/lmgr/lwlocknames.txt OK ====== .../utils/activity/wait_event_names.txt TODO ====== src/backend/utils/misc/guc_tables.c OK ====== src/backend/utils/misc/postgresql.conf.sample 36. # primary to streaming replication standby server +#max_slotsync_workers = 2 # max number of slot synchronization workers on a standby IMO it is better to say "maximum" instead of "max" in the comment. (make sure the GUC description text is identical) ====== src/include/catalog/pg_proc.dat 37. +{ oid => '6312', descr => 'get invalidate cause of a replication slot', + proname => 'pg_get_invalidation_cause', provolatile => 's', proisstrict => 't', + prorettype => 'int2', proargtypes => 'name', + prosrc => 'pg_get_invalidation_cause' }, 37a. SUGGESTION (descr) what caused the replication slot to become invalid ~ 37b 'pg_get_invalidation_cause' seemed like a poor function name because it doesn't have any context -- not even the word "slot" in it. ====== src/include/commands/subscriptioncmds.h OK ====== src/include/nodes/replnodes.h OK ====== src/include/postmaster/bgworker_internals.h 38. #define MAX_PARALLEL_WORKER_LIMIT 1024 +#define MAX_SLOT_SYNC_WORKER_LIMIT 50 Consider SLOTSYNC instead of SLOT_SYNC for consistency with other names of this worker. ====== OK ====== src/include/replication/logicalworker.h 39. extern void ApplyWorkerMain(Datum main_arg); extern void ParallelApplyWorkerMain(Datum main_arg); extern void TablesyncWorkerMain(Datum main_arg); +extern void ReplSlotSyncMain(Datum main_arg); The name is not consistent with others nearby. At least it should include the word "Worker" like everything else does. ====== src/include/replication/slot.h OK ====== src/include/replication/walreceiver.h 40. +/* + * Slot's DBid related data + */ +typedef struct WalRcvRepSlotDbData +{ + Oid database; /* Slot's DBid received from remote */ + TimestampTz last_sync_time; /* The last time we tried to launch sync + * worker for above Dbid */ +} WalRcvRepSlotDbData; + Is that comment about field 'last_sync_time' correct? I thought this field is the last time the slot was synced -- not the last time the worker was launched. ====== src/include/replication/worker_internal.h 41. - /* User to use for connection (will be same as owner of subscription). */ + /* User to use for connection (will be same as owner of subscription + * in case of LogicalRep worker). */ Oid userid; +} WorkerHeader; 41a. This is not the normal style for a multi-line comment. ~ 41b. I wondered if the name "WorkerHeader" is just a bit *too* generic and might cause future trouble because of the vague name. ~~~ 42. +typedef struct LogicalRepWorker +{ + WorkerHeader header; + + /* What type of worker is this? */ + LogicalRepWorkerType type; /* Subscription id for the worker. 
*/ Oid subid; @@ -77,7 +84,7 @@ typedef struct LogicalRepWorker * would be created for each transaction which will be deleted after the * transaction is finished. */ - FileSet *stream_fileset; + struct FileSet *stream_fileset; /* * PID of leader apply worker if this slot is used for a parallel apply @@ -96,6 +103,32 @@ typedef struct LogicalRepWorker TimestampTz reply_time; } LogicalRepWorker; 42a. I suggest having some struct-level comments. ~ 42b. The field name 'header' is propagated all over the place. So, IMO calling it 'hdr' instead of 'header' might be slightly less intrusive. I think there are lots of precedents for calling headers as 'hdr'. ~ 42c. Why was the FileSet field changed to struct FileSet? Aren't the struct/typedef defined in the same place? ~~~ 43. +typedef struct SlotSyncWorker +{ + WorkerHeader header; + + /* The slot in worker pool to which it is attached */ + int slot; + + /* Count of Database ids it manages */ + uint32 dbcount; + + /* DSA for dbids */ + dsa_area *dbids_dsa; + + /* dsa_pointer for database ids it manages */ + dsa_pointer dbids_dp; + + /* Info about slot being monitored for worker's naptime purpose */ + struct SlotSyncWorkerWatchSlot + { + NameData slot_name; + XLogRecPtr confirmed_lsn; + int inactivity_count; + } monitoring_info; + +} SlotSyncWorker; 43a. I suggest having some struct-level comments. ~ 43b. IMO it will avoid ambiguities to be more explicit in the comments instead of just saying "it" everywhere. + /* The slot in worker pool to which it is attached */ + /* Count of Database ids it manages */ + /* dsa_pointer for database ids it manages */ ~ 43c. There is inconsistent wording and case in these comments. Just pick one term to use everywhere. "Database ids" "database ids" "dbids" ~~~ 44. GENERAL = restructuring of common structs in worker_internal.h The field name 'header' is propagated all over the place. It is OK, and I guess there is no choice, but IMO calling it 'hdr' instead of 'header' might be slightly less intrusive. I think there are lots of precedents for calling headers as 'hdr'. ====== src/include/storage/lwlock.h ====== src/tools/pgindent/typedefs.list 45. Missing the typedef WorkerHeader? ====== Kind Regards, Peter Smith. Fujitsu Australia
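Review point 27 above asks for the naptime bump to be driven by an elapsed-time threshold rather than a loop counter. A standalone sketch of that variant is below; the 10ms/10s values mirror the naptimes described for the patch, while the 1-second inactivity threshold and all function/variable names are purely illustrative.

```
/* Standalone sketch: time-based naptime backoff for a slot-sync worker. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define WORKER_DEFAULT_NAPTIME_MS       10      /* active: poll every 10ms */
#define WORKER_INACTIVITY_NAPTIME_MS    10000   /* idle: poll every 10s */
#define WORKER_INACTIVITY_THRESHOLD_MS  1000    /* bump after 1s without LSN movement */

static long
now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/* Decide the next naptime from when the monitored slot's LSN last moved. */
static long
compute_naptime(bool lsn_advanced, long *last_activity_ms)
{
    long now = now_ms();

    if (lsn_advanced)
    {
        *last_activity_ms = now;            /* activity seen: snap back to fast polling */
        return WORKER_DEFAULT_NAPTIME_MS;
    }

    if (now - *last_activity_ms >= WORKER_INACTIVITY_THRESHOLD_MS)
        return WORKER_INACTIVITY_NAPTIME_MS;    /* idle long enough: slow down */

    return WORKER_DEFAULT_NAPTIME_MS;
}

int
main(void)
{
    long last_activity = now_ms();

    printf("naptime: %ldms\n", compute_naptime(true, &last_activity));
    printf("naptime: %ldms\n", compute_naptime(false, &last_activity));
    return 0;
}
```

Expressed this way, the "10 sync-checks" constant disappears: if a check count is ever needed it can be derived as threshold divided by naptime, which is the review's point about avoiding the magic number.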
On Wed, Sep 13, 2023 at 4:48 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Shveta, > > Sorry for the late response. > > > Thanks Kuroda-san for the feedback. > > > > > > 01. General > > > > > > I think the documentation can be added, not only GUCs. How about adding > > examples > > > for combinations of physical and logical replications? You can say that both of > > > physical primary can be publisher and slots on primary/standby are > > synchronized. > > > > > > > I did not fully understand this. Can you please state a clear example. > > We are only synchronizing logical replication slots in this draft and > > that too on physical standby from primary. So the last statement is > > not completely true. > > I expected to add a new subsection in "Log-Shipping Standby Servers". I think we > can add like following infos: > > * logical replication publisher can be also replicated > * For that, a physical repliation slot must be defined on primar > * Then we can set up standby_slot_names(on primary) and synchronize_slot_names > (on both server). > * slots are synchronized automatically Sure. I am trying to find the right place in this section to add this info. I will try to address this in coming versions. > > > > 02. General > > > > > > standby_slot_names ensures that physical standby is always ahead subscriber, > > but I > > > think it may be not sufficient. There is a possibility that primary server does > > > not have any physical slots.So it expects a slot to be present. > > > In this case the physical standby may be behind the > > > subscriber and the system may be confused when the failover is occured. > > > > Currently there is a check in slot-sync worker which mandates that > > there is a physical slot present between primary and standby for this > > feature to proceed.So that confusion state will not arise. > > + /* WalRcvData is not set or primary_slot_name is not set yet */ > > + if (!WalRcv || WalRcv->slotname[0] == '\0') > > + return naptime; > > Right, but I wanted to know why it is needed. One motivation seemed to know the > WAL location of physical standby, but I thought that struct WalSnd.apply could > be also used. Is it bad to assume that the physical walsender always exists? > We do not plan to target this case where physical slot is not created between primary and physical-standby in the first draft. In such a case, slot-synchronization will be skipped for the time being. We can extend this functionality (if needed) later. > > >Can't we specify the name of standby via application_name or something? > > > > So do you mean that in absence of a physical slot (if we plan to > > support that), we let primary know about standby(slots-synchronization > > client) through application_name? > > Yes, it is what I considered. > > > > > > > 03. General > > > > > > In this architecture, the syncslot worker is launched per db and they > > > independently connects to primary, right? > > > > Not completely true. Each slotsync worker is responsible for managing > > N dbs. Here 'N' = 'Number of distinct dbs for slots in > > synchronize_slot_names'/ 'number of max_slotsync_workers configured' > > for cases where dbcount exceeds workers configured. > > And if dbcount < max_slotsync_workers, then we launch only that many > > workers equal to dbcount and each worker manages a single db. Each > > worker independently connects to primary. 
Currently it makes a > > connection multiple times, I am optimizing it to make connection only > > once and then after each SIGHUP assuming 'primary_conninfo' may > > change. This change will be in the next version. > > > > > > >I'm not sure it is efficient, but I > > > come up with another architecture - only a worker (syncslot receiver)connects > > > to the primary and other workers (syncslot worker) receives infos from it and > > > updates. This can reduce the number of connections so that it may slightly > > > improve the latency of network. How do you think? > > > > > > > I feel it may help in reducing network latency, but not sure if it > > could be more efficient in keeping the lsns in sync. I feel it may > > introduce lag due to the fact that only one worker is getting all the > > info from primary and the actual synchronizing workers are waiting on > > that worker. This lag may be more when the number of slots are huge. > > We have run some performance tests on the design implemented > > currently, please have a look at emails around [1] and [2]. > > Thank you for teaching! Yeah, I agreed that another point might be a bottleneck. > It could be recalled in future, but currently we do not have to consider... > > > > 04. ReplSlotSyncMain > > > > > > Does the worker have to connect to the specific database? > > > > > > > > > ``` > > > /* Connect to our database. */ > > > > > BackgroundWorkerInitializeConnectionByOid(MySlotSyncWorker->dbid, > > > > > MySlotSyncWorker->userid, > > > > > 0); > > > ``` > > > > Since we are using libpq public interface 'walrcv_exec=libpqrcv_exec' > > to connect to primary, this needs database connection. It errors out > > in the absence of 'MyDatabaseId'. Do you think db-connection can have > > some downsides? > > > > I considered that we should not grant privileges to access data more than necessary. > It might be better if we can avoid to connect to the specific database. But I'm > not sure that we should have to add new walreceiver API to handle it. FYI, I > checked the physical walreceiver to refer it, but it was not background worker > so that it was no meaning. > If this needs to be done, we need to have new walreceiver APIs around that (no db connection) which are currently exposed to front-end through libpq-fe.h but not exposed to the backend. I am not sure about the feasibility for that and the effort needed there. So currently we plan to go by db-connection as allowed by current libpq-walreceiver APIs. > > And followings are further comments. > > 1. > I considered the combination with the feature and initial data sync, and found an > issue. How do you think? Assuming that the name of subscription is specified as > "synchronize_slot_names". > > A synchronization of each tables is separated into two transactions: > > 1. In a first transaction, a logical replication slot (pg_XXX_sync_XXX...)is > created and tuples are COPYd. > 2. In a second transaction, changes from the first transaction are streamed by > and applied. > > If the primary crashed between 1 and 2 and standby is promoted, the tablesync > worker would execute "START_REPLICATION SLOT pg_XXX_sync_XXX..." to promoted > server, but fail because such a slot does not exist. > > Is this a problem we should solve? Above can be reproduced by adding sleep(). I will try to reproduce this scenario to understand it better. Allow me some more time. > > 2. > Do we have to add some rules in "Configuration Settings" section? Sure. I will review this further and see if we can add anything there. > > 3. 
> You can run pgindent in your timing. > I have done it in v17. > > Best Regards, > Hayato Kuroda > FUJITSU LIMITED
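To make the dbcount/max_slotsync_workers split described in the reply to comment 03 above concrete, here is a small self-contained sketch (plain C, not patch code; the function name and the example numbers are made up for illustration):

```c
#include <stdio.h>

/*
 * Sketch only: with 'dbcount' distinct databases among the slots in
 * synchronize_slot_names and 'max_workers' slot-sync workers configured,
 * each launched worker manages either floor(dbcount/nworkers) databases
 * or one more.
 */
static int
dbs_for_worker(int dbcount, int max_workers, int worker_no)
{
	int		nworkers = dbcount < max_workers ? dbcount : max_workers;
	int		base = dbcount / nworkers;
	int		extra = dbcount % nworkers;

	/* the first 'extra' workers take one additional database */
	return worker_no < extra ? base + 1 : base;
}

int
main(void)
{
	/* e.g. 10 databases, max_slotsync_workers = 4 => 3, 3, 2, 2 */
	for (int w = 0; w < 4; w++)
		printf("worker %d manages %d dbs\n", w, dbs_for_worker(10, 4, w));
	return 0;
}
```

So with 10 databases in synchronize_slot_names and max_slotsync_workers = 4, four workers are launched managing 3, 3, 2 and 2 databases respectively; with 3 databases, only 3 workers are launched, one per database.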
On Wed, Sep 13, 2023 at 5:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Sep 13, 2023 at 4:54 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > PFA v17. It has below changes: > > > > @@ -2498,6 +2500,13 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > } > else > { > + /* > + * Before we send out the last set of changes to logical decoding > + * output plugin, wait for specified streaming replication standby > + * servers (if any) to confirm receipt of WAL upto commit_lsn. > + */ > + WaitForStandbyLSN(commit_lsn); > > It seems the first patch has a wait logic for every commit. I think it > is better to integrate this wait with WalSndWaitForWal() as suggested > by Andres in his email[1]. > > [1] - https://www.postgresql.org/message-id/20220207204557.74mgbhowydjco4mh%40alap3.anarazel.de > > -- Sure Amit. PFA v18. It addresses below: 1) patch001: wait for physical-standby confirmation logic is now integrated with WalSndWaitForWal(). Now walsender waits for physical standby's confirmation to take changes upto RecentFlushPtr in WalSndWaitForWal(). This allows walsender to send the changes to logical subscribers one by one which are already covered in RecentFlushPtr without needing to wait on every commit for physical standby confirmation. 2) if synchronize_slot_names set on physical standby has physical slot name, primary's walsender on receiving that will error out. This is currently done in ListSlotDatabaseOIDs(), but it needs to be moved to logic where standby will send synchronize_slot_names to be set on primary and primary will validate that first. GUC synchronize_slot_names will be removed from primary. This arrangement to be done in next version. 3) Peter's comment dated Sep15. thanks Shveta
Attachment
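To make the v18 behaviour described above concrete, here is a heavily simplified sketch of the wait that WalSndWaitForWal() could perform before letting a logical walsender send changes up to RecentFlushPtr. This is not the patch's actual code: standby_slot_names_list is assumed to be a cached list of the GUC value, and spinlock handling around the slot fields is omitted.

```c
/*
 * Sketch only: wait until every physical slot named in standby_slot_names
 * has confirmed receipt of WAL up to 'wait_for_lsn'.
 * PhysicalConfirmReceivedLocation() advances a physical slot's restart_lsn,
 * so that is the field checked here.
 */
static void
WaitForStandbyConfirmation(XLogRecPtr wait_for_lsn)
{
	for (;;)
	{
		bool		all_caught_up = true;
		ListCell   *lc;

		foreach(lc, standby_slot_names_list)	/* assumed cached GUC list */
		{
			ReplicationSlot *slot =
				SearchNamedReplicationSlot((const char *) lfirst(lc), true);

			if (slot == NULL || !SlotIsPhysical(slot) ||
				slot->data.restart_lsn < wait_for_lsn)
			{
				all_caught_up = false;
				break;
			}
		}

		if (all_caught_up)
			return;

		CHECK_FOR_INTERRUPTS();

		/* XXX: 1s polling retry as in v18; a wakeup mechanism would be nicer */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 1000L,
						 WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION);
		ResetLatch(MyLatch);
	}
}
```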
On Fri, Sep 15, 2023 at 2:22 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi. Here are some review comments for v17-0002. > Thanks Peter for the feedback. I have addressed most of these in v18 except 2. Please find my comments for the ones not addressed. > This is a WIP and a long way from complete, but I wanted to send what > I have so far (while it is still current with your latest posted > patches). > > ====== > > 34. ListSlotDatabaseOIDs - sorting/logic > > Maybe explain better the reason for having the qsort and other logic. > > TBH, I was not sure of the necessity for the names lists and the > sorting and bsearch logic. AFAICT these are all *only* used to check > for uniqueness and existence of the slot name. So I was wondering if a > hashmap keyed by the slot name might be more appropriate and also > simpler than all this list sorting/searching. > Pending. I will revisit this soon and will let you know more on this. IMO, it was done to optimize the search as slot_names list could be pretty huge if max_replication_slots is set to high value. > ~~ > > 35. ListSlotDatabaseOIDs > > + for (int slotno = 0; slotno < max_replication_slots; slotno++) > + { > > loop variable declaration > > > ====== > src/backend/storage/lmgr/lwlock.c > OK > > ====== > src/backend/storage/lmgr/lwlocknames.txt > OK > > ====== > .../utils/activity/wait_event_names.txt > TODO > > ====== > src/backend/utils/misc/guc_tables.c > OK > > ====== > src/backend/utils/misc/postgresql.conf.sample > > 36. > # primary to streaming replication standby server > +#max_slotsync_workers = 2 # max number of slot synchronization > workers on a standby > > IMO it is better to say "maximum" instead of "max" in the comment. > > (make sure the GUC description text is identical) > > ====== > src/include/catalog/pg_proc.dat > > 37. > +{ oid => '6312', descr => 'get invalidate cause of a replication slot', > + proname => 'pg_get_invalidation_cause', provolatile => 's', > proisstrict => 't', > + prorettype => 'int2', proargtypes => 'name', > + prosrc => 'pg_get_invalidation_cause' }, > > 37a. > SUGGESTION (descr) > what caused the replication slot to become invalid > > ~ > > 37b > 'pg_get_invalidation_cause' seemed like a poor function name because > it doesn't have any context -- not even the word "slot" in it. > > ====== > src/include/commands/subscriptioncmds.h > OK > > ====== > src/include/nodes/replnodes.h > OK > > ====== > src/include/postmaster/bgworker_internals.h > > 38. > #define MAX_PARALLEL_WORKER_LIMIT 1024 > +#define MAX_SLOT_SYNC_WORKER_LIMIT 50 > > Consider SLOTSYNC instead of SLOT_SYNC for consistency with other > names of this worker. > > ====== > OK > > ====== > src/include/replication/logicalworker.h > > 39. > extern void ApplyWorkerMain(Datum main_arg); > extern void ParallelApplyWorkerMain(Datum main_arg); > extern void TablesyncWorkerMain(Datum main_arg); > +extern void ReplSlotSyncMain(Datum main_arg); > > The name is not consistent with others nearby. At least it should > include the word "Worker" like everything else does. > > ====== > src/include/replication/slot.h > OK > > ====== > src/include/replication/walreceiver.h > > 40. > +/* > + * Slot's DBid related data > + */ > +typedef struct WalRcvRepSlotDbData > +{ > + Oid database; /* Slot's DBid received from remote */ > + TimestampTz last_sync_time; /* The last time we tried to launch sync > + * worker for above Dbid */ > +} WalRcvRepSlotDbData; > + > > > Is that comment about field 'last_sync_time' correct? 
I thought this > field is the last time the slot was synced -- not the last time the > worker was launched. Sorry for confusion. Comment is correct, the name is misleading. I have changed the name in v18. > > ====== > src/include/replication/worker_internal.h > > 41. > - /* User to use for connection (will be same as owner of subscription). */ > + /* User to use for connection (will be same as owner of subscription > + * in case of LogicalRep worker). */ > Oid userid; > +} WorkerHeader; > > 41a. > > This is not the normal style for a multi-line comment. > > ~ > > 41b. > I wondered if the name "WorkerHeader" is just a bit *too* generic and > might cause future trouble because of the vague name. > I agree. Can you please suggest a better name for it? thanks Shveta
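As an illustration of the hashmap idea in comment 34, here is a minimal sketch using the backend's dynahash API; the entry struct and function names are invented for the example and are not taken from the patch:

```c
/*
 * Sketch only: track slot names in a hash table keyed by name instead of
 * sorting a list and using bsearch for uniqueness/existence checks.
 */
typedef struct SlotNameEntry
{
	char		slotname[NAMEDATALEN];	/* hash key */
} SlotNameEntry;

static HTAB *
build_slot_name_hash(void)
{
	HASHCTL		ctl;

	ctl.keysize = NAMEDATALEN;
	ctl.entrysize = sizeof(SlotNameEntry);

	return hash_create("slot name hash", max_replication_slots, &ctl,
					   HASH_ELEM | HASH_STRINGS);
}

/* uniqueness/existence check then becomes a single hash_search() call */
static bool
slot_name_seen_before(HTAB *htab, const char *name)
{
	bool		found;

	(void) hash_search(htab, name, HASH_ENTER, &found);
	return found;
}
```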
On Tue, Sep 19, 2023 at 10:29 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Sep 15, 2023 at 2:22 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Hi. Here are some review comments for v17-0002. > > > > Thanks Peter for the feedback. I have addressed most of these in v18 > except 2. Please find my comments for the ones not addressed. > > > This is a WIP and a long way from complete, but I wanted to send what > > I have so far (while it is still current with your latest posted > > patches). > > > > ====== > > > > > 34. ListSlotDatabaseOIDs - sorting/logic > > > > Maybe explain better the reason for having the qsort and other logic. > > > > TBH, I was not sure of the necessity for the names lists and the > > sorting and bsearch logic. AFAICT these are all *only* used to check > > for uniqueness and existence of the slot name. So I was wondering if a > > hashmap keyed by the slot name might be more appropriate and also > > simpler than all this list sorting/searching. > > > > Pending. I will revisit this soon and will let you know more on this. > IMO, it was done to optimize the search as slot_names list could be > pretty huge if max_replication_slots is set to high value. > > > ~~ > > > > 35. ListSlotDatabaseOIDs > > > > + for (int slotno = 0; slotno < max_replication_slots; slotno++) > > + { > > > > loop variable declaration > > > > > > ====== > > src/backend/storage/lmgr/lwlock.c > > OK > > > > ====== > > src/backend/storage/lmgr/lwlocknames.txt > > OK > > > > ====== > > .../utils/activity/wait_event_names.txt > > TODO > > > > ====== > > src/backend/utils/misc/guc_tables.c > > OK > > > > ====== > > src/backend/utils/misc/postgresql.conf.sample > > > > 36. > > # primary to streaming replication standby server > > +#max_slotsync_workers = 2 # max number of slot synchronization > > workers on a standby > > > > IMO it is better to say "maximum" instead of "max" in the comment. > > > > (make sure the GUC description text is identical) > > > > ====== > > src/include/catalog/pg_proc.dat > > > > 37. > > +{ oid => '6312', descr => 'get invalidate cause of a replication slot', > > + proname => 'pg_get_invalidation_cause', provolatile => 's', > > proisstrict => 't', > > + prorettype => 'int2', proargtypes => 'name', > > + prosrc => 'pg_get_invalidation_cause' }, > > > > 37a. > > SUGGESTION (descr) > > what caused the replication slot to become invalid > > > > ~ > > > > 37b > > 'pg_get_invalidation_cause' seemed like a poor function name because > > it doesn't have any context -- not even the word "slot" in it. > > > > ====== > > src/include/commands/subscriptioncmds.h > > OK > > > > ====== > > src/include/nodes/replnodes.h > > OK > > > > ====== > > src/include/postmaster/bgworker_internals.h > > > > 38. > > #define MAX_PARALLEL_WORKER_LIMIT 1024 > > +#define MAX_SLOT_SYNC_WORKER_LIMIT 50 > > > > Consider SLOTSYNC instead of SLOT_SYNC for consistency with other > > names of this worker. > > > > ====== > > OK > > > > ====== > > src/include/replication/logicalworker.h > > > > 39. > > extern void ApplyWorkerMain(Datum main_arg); > > extern void ParallelApplyWorkerMain(Datum main_arg); > > extern void TablesyncWorkerMain(Datum main_arg); > > +extern void ReplSlotSyncMain(Datum main_arg); > > > > The name is not consistent with others nearby. At least it should > > include the word "Worker" like everything else does. > > > > ====== > > src/include/replication/slot.h > > OK > > > > ====== > > src/include/replication/walreceiver.h > > > > 40. 
> > +/* > > + * Slot's DBid related data > > + */ > > +typedef struct WalRcvRepSlotDbData > > +{ > > + Oid database; /* Slot's DBid received from remote */ > > + TimestampTz last_sync_time; /* The last time we tried to launch sync > > + * worker for above Dbid */ > > +} WalRcvRepSlotDbData; > > + > > > > > > Is that comment about field 'last_sync_time' correct? I thought this > > field is the last time the slot was synced -- not the last time the > > worker was launched. > > Sorry for confusion. Comment is correct, the name is misleading. I > have changed the name in v18. > > > > > ====== > > src/include/replication/worker_internal.h > > > > 41. > > - /* User to use for connection (will be same as owner of subscription). */ > > + /* User to use for connection (will be same as owner of subscription > > + * in case of LogicalRep worker). */ > > Oid userid; > > +} WorkerHeader; > > > > 41a. > > > > This is not the normal style for a multi-line comment. > > > > ~ > > > > 41b. > > I wondered if the name "WorkerHeader" is just a bit *too* generic and > > might cause future trouble because of the vague name. > > > > I agree. Can you please suggest a better name for it? > > > thanks > Shveta Currently in patch001, synchronize_slot_names is a GUC on both primary and physical standby. This GUC tells which all logical slots need to be synced on physical standbys from the primary. Ideally it should be a GUC on physical standby alone and each physical standby should be able to communicate the value to the primary (considering the value may vary for different physical replicas of the same primary). The primary on the other hand should be able to take UNION of these values and let the logical walsenders (belonging to the slots in UNION synchronize_slots_names) wait for physical standbys for confirmation before sending those changes to logical subscribers. The intent is logical subscribers should never be ahead of physical standbys. So in order to implement this i.e. each physical standby communicating synchronize_slot_names individually to primary, we need to maintain the resultant/union value in shared-memory on primary so that each of the logical walsenders can read these values. For the sake of less complexity around this involved shared-memory, it will be good to have synchronize_slot_names as PGC_POSTMASTER GUC parameter on physical standby rather than a PGC_SIGHUP one. Making it PGC_POSTMASTER on physical standby will make primary aware of the fact that slot-sync connection from physical standby is going down and thus we now need to invalidate the UNION synchronize_slot_names and compute it fresh from rest of the connections from physical-standby's which are still alive. Also when any new connection for slot-sync purpose comes from physical standby, we need to compute the UNION synchronize_slot_names list again. The synchronize_slots_names invalidation mechanism on primary will become connection based. Any thoughts? Does PGC_POSTMASTER over PGC_SIGHUP seem a reasonable choice here? thanks Shveta
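Purely to visualize the shared-memory bookkeeping being proposed here (none of these structures exist in the posted patches; the layout and names are assumptions):

```c
/*
 * Sketch only: per-standby entry the primary could keep in shared memory so
 * that logical walsenders can compute the union of synchronize_slot_names
 * reported by all connected physical standbys.  The entry would be
 * invalidated when that standby's slot-sync connection goes away, as
 * discussed above.
 */
typedef struct StandbySyncSlotNames
{
	bool		in_use;			/* slot-sync connection currently alive? */
	NameData	standby_slot;	/* physical slot of the reporting standby */
	int			nslots;			/* number of slot names reported */
	NameData	slot_names[FLEXIBLE_ARRAY_MEMBER];	/* reported slot names */
} StandbySyncSlotNames;
```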
On Thu, Sep 21, 2023 at 9:16 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Sep 19, 2023 at 10:29 AM shveta malik <shveta.malik@gmail.com> wrote: > > Currently in patch001, synchronize_slot_names is a GUC on both primary > and physical standby. This GUC tells which all logical slots need to > be synced on physical standbys from the primary. Ideally it should be > a GUC on physical standby alone and each physical standby should be > able to communicate the value to the primary (considering the value > may vary for different physical replicas of the same primary). The > primary on the other hand should be able to take UNION of these values > and let the logical walsenders (belonging to the slots in UNION > synchronize_slots_names) wait for physical standbys for confirmation > before sending those changes to logical subscribers. The intent is > logical subscribers should never be ahead of physical standbys. > Before getting into the details of 'synchronize_slot_names', I would like to know whether we really need the second GUC 'standby_slot_names'. Can't we simply allow all the logical wal senders corresponding to 'synchronize_slot_names' to wait for just the physical standby(s) (physical slot corresponding to such physical standby) that have sent ' synchronize_slot_names'list? We should have one physical standby slot corresponding to one physical standby. -- With Regards, Amit Kapila.
PFA v19 patches with the below changes: 1) Now for slot synchronization to work, the user must specify dbname in primary_conninfo on physical standbys. This dbname is used by the slot-sync worker a) for its own connection to a db (this db connection is needed by the libpqwalreceiver APIs) b) to connect to the primary in order to get slot-info. In the absence of this dbname in primary_conninfo, the slot-sync worker will error out. 2) slotsync_worker_stop() is now merged into logicalrep_worker_stop_internal(). Some other changes are also made as per Peter's suggestions. 3) There was a bug in patch001 wherein the wrong lsn position was passed to WaitForStandbyConfirmation (record-loc instead of RecentFlushPtr), leading to the logical subscriber getting ahead of physical standbys in some cases. It is fixed now. This will most probably fix the cfbot failure. The first 2 changes are in patch0002 and the third one is in patch001. Thank you, Ajin, for working on 1 and 2. thanks Shveta
Attachment
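As a concrete example of change 1), the standby would need its connection string to carry a dbname, roughly like this (host, user and dbname are placeholder values):

```
# postgresql.conf on the physical standby (illustrative values)
primary_conninfo = 'host=primary.example.com port=5432 user=replicator dbname=postgres'
```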
Hi, Thanks for all the work that has been done on this feature, and sorry to have been quiet on it for so long. On 9/18/23 12:22 PM, shveta malik wrote: > On Wed, Sep 13, 2023 at 4:48 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: >> Right, but I wanted to know why it is needed. One motivation seemed to know the >> WAL location of physical standby, but I thought that struct WalSnd.apply could >> be also used. Is it bad to assume that the physical walsender always exists? >> > > We do not plan to target this case where physical slot is not created > between primary and physical-standby in the first draft. In such a > case, slot-synchronization will be skipped for the time being. We can > extend this functionality (if needed) later. > I do think it's needed to extend this functionality. Having physical slot created sounds like a (too?) strong requirement as: - It has not been a requirement for Logical decoding on standby so that could sounds weird to require it for sync slot (while it's not allowed to logical decode from sync slots) - One could want to limit the WAL space used on the primary It seems that the "skipping sync as primary_slot_name not set." warning message is emitted every 10ms, that seems too verbose to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Sep 22, 2023 at 6:01 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Thanks for all the work that has been done on this feature, and sorry > to have been quiet on it for so long. > > On 9/18/23 12:22 PM, shveta malik wrote: > > On Wed, Sep 13, 2023 at 4:48 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > >> Right, but I wanted to know why it is needed. One motivation seemed to know the > >> WAL location of physical standby, but I thought that struct WalSnd.apply could > >> be also used. Is it bad to assume that the physical walsender always exists? > >> > > > > We do not plan to target this case where physical slot is not created > > between primary and physical-standby in the first draft. In such a > > case, slot-synchronization will be skipped for the time being. We can > > extend this functionality (if needed) later. > > > > I do think it's needed to extend this functionality. Having physical slot > created sounds like a (too?) strong requirement as: > > - It has not been a requirement for Logical decoding on standby so that could sounds weird > to require it for sync slot (while it's not allowed to logical decode from sync slots) > There is a difference here that we also need to prevent removal of rows required by sync_slots. That could be achieved by physical slot (and hot_standby_feedback). So, having a requirement to have physical slot doesn't sound too unreasonable to me. Otherwise, we need to invent some new mechanism of having some sort of placeholder slot to avoid removal of required rows. I guess we can always extend the functionality in later version as Shveta mentioned. Now, if we have somewhat simpler way to achieve prevention of removal of rows then it is fine otherwise let's focus on getting other parts correct considering this is already a reasonably big and complex patch. Thanks for looking into this work and your feedback will definetely help in moving this work forward. -- With Regards, Amit Kapila.
FYI -- v19 failed to apply cleanly with the latest HEAD. [postgres@CentOS7-x64 oss_postgres_misc]$ git apply ../patches_misc/v19-0001-Allow-logical-walsenders-to-wait-for-physical-st.patch error: patch failed: src/test/recovery/meson.build:44 error: src/test/recovery/meson.build: patch does not apply ------ Kind Regards, Peter Smith. Fujitsu Australia
On Fri, Sep 22, 2023 at 6:01 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > Thanks for all the work that has been done on this feature, and sorry > to have been quiet on it for so long. Thanks for looking into this. > > On 9/18/23 12:22 PM, shveta malik wrote: > > On Wed, Sep 13, 2023 at 4:48 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > >> Right, but I wanted to know why it is needed. One motivation seemed to know the > >> WAL location of physical standby, but I thought that struct WalSnd.apply could > >> be also used. Is it bad to assume that the physical walsender always exists? > >> > > > > We do not plan to target this case where physical slot is not created > > between primary and physical-standby in the first draft. In such a > > case, slot-synchronization will be skipped for the time being. We can > > extend this functionality (if needed) later. > > > > I do think it's needed to extend this functionality. Having physical slot > created sounds like a (too?) strong requirement as: > > - It has not been a requirement for Logical decoding on standby so that could sounds weird > to require it for sync slot (while it's not allowed to logical decode from sync slots) > > - One could want to limit the WAL space used on the primary > > It seems that the "skipping sync as primary_slot_name not set." warning message is emitted > every 10ms, that seems too verbose to me. > You are right, the warning msg is way too frequent. I will optimize it in the next version. thanks Shveta
On Mon, Sep 25, 2023 at 12:14 PM Peter Smith <smithpb2250@gmail.com> wrote: > > FYI -- v19 failed to apply cleanly with the latest HEAD. > > [postgres@CentOS7-x64 oss_postgres_misc]$ git apply > ../patches_misc/v19-0001-Allow-logical-walsenders-to-wait-for-physical-st.patch > error: patch failed: src/test/recovery/meson.build:44 > error: src/test/recovery/meson.build: patch does not apply > > ------ Rebased the patch; the updated version is attached as v19_2. regards, Ajin Cherian
Attachment
On Fri, Sep 22, 2023 at 3:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Sep 21, 2023 at 9:16 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Tue, Sep 19, 2023 at 10:29 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > Currently in patch001, synchronize_slot_names is a GUC on both primary > > and physical standby. This GUC tells which all logical slots need to > > be synced on physical standbys from the primary. Ideally it should be > > a GUC on physical standby alone and each physical standby should be > > able to communicate the value to the primary (considering the value > > may vary for different physical replicas of the same primary). The > > primary on the other hand should be able to take UNION of these values > > and let the logical walsenders (belonging to the slots in UNION > > synchronize_slots_names) wait for physical standbys for confirmation > > before sending those changes to logical subscribers. The intent is > > logical subscribers should never be ahead of physical standbys. > > > > Before getting into the details of 'synchronize_slot_names', I would > like to know whether we really need the second GUC > 'standby_slot_names'. Can't we simply allow all the logical wal > senders corresponding to 'synchronize_slot_names' to wait for just the > physical standby(s) (physical slot corresponding to such physical > standby) that have sent ' synchronize_slot_names'list? We should have > one physical standby slot corresponding to one physical standby. > yes, with the new approach (to be implemented next) where we plan to send synchronize_slot_names from each physical standby to primary, the standby_slot_names GUC should no longer be needed on primary. The physical standbys sending requests should automatically become the ones to be waited for confirmation on the primary. thanks Shveta
Hi, On 9/23/23 3:38 AM, Amit Kapila wrote: > On Fri, Sep 22, 2023 at 6:01 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> Thanks for all the work that has been done on this feature, and sorry >> to have been quiet on it for so long. >> >> On 9/18/23 12:22 PM, shveta malik wrote: >>> On Wed, Sep 13, 2023 at 4:48 PM Hayato Kuroda (Fujitsu) >>> <kuroda.hayato@fujitsu.com> wrote: >>>> Right, but I wanted to know why it is needed. One motivation seemed to know the >>>> WAL location of physical standby, but I thought that struct WalSnd.apply could >>>> be also used. Is it bad to assume that the physical walsender always exists? >>>> >>> >>> We do not plan to target this case where physical slot is not created >>> between primary and physical-standby in the first draft. In such a >>> case, slot-synchronization will be skipped for the time being. We can >>> extend this functionality (if needed) later. >>> >> >> I do think it's needed to extend this functionality. Having physical slot >> created sounds like a (too?) strong requirement as: >> >> - It has not been a requirement for Logical decoding on standby so that could sounds weird >> to require it for sync slot (while it's not allowed to logical decode from sync slots) >> > > There is a difference here that we also need to prevent removal of > rows required by sync_slots. That could be achieved by physical slot > (and hot_standby_feedback). So, having a requirement to have physical > slot doesn't sound too unreasonable to me. Otherwise, we need to > invent some new mechanism of having some sort of placeholder slot to > avoid removal of required rows. Thinking about it, I wonder if removal of required rows is even possible given that: - we don't allow to logical decode from a sync slot - sync slot catalog_xmin <= its primary counter part catalog_xmin - its primary counter part prevents rows removal thanks to its own catalog_xmin - a sync slot is removed as soon as its primary counter part is removed In that case I'm not sure how rows removal on the primary could lead to remove rows required by a sync slot. Am I missing something? Do you have a scenario in mind? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Dear Ajin, Shveta, Thank you for rebasing the patch set! Here are new comments for v19_2-0001. 01. WalSndWaitForStandbyNeeded() ``` if (SlotIsPhysical(MyReplicationSlot)) return false; ``` Is there a possibility that physical walsenders call this function? IIUC following is a stacktrace for the function, so the only logical walsenders use it. If so, it should be Assert() instead of an if statement. logical_read_xlog_page() WalSndWaitForWal() WalSndWaitForStandbyNeeded() 02. WalSndWaitForStandbyNeeded() Can we set shouldwait in SlotSyncInitConfig()? synchronize_slot_names_list is searched whenever the function is called, but it is not changed automatically. If the slotname is compared with the list in the SlotSyncInitConfig(), the liner search can be reduced. 03. WalSndWaitForStandbyConfirmation() We should add ProcessRepliesIfAny() during the loop, otherwise the walsender overlooks the death of an apply worker. 04. WalSndWaitForStandbyConfirmation() Not sure, but do we have to return early if walsenders got PROCSIG_WALSND_INIT_STOPPING signal? I thought that if physical walsenders get stuck, logical walsenders wait forever. At that time we cannot stop the primary server even if "pg_ctl stop" is executed. 05. SlotSyncInitConfig() Why don't we free the memory for rawname, old standby_slot_names_list, and synchronize_slot_names_list? They seem to be overwritten. 06. SlotSyncInitConfig() Both physical and logical walsenders call the func, but physical one do not use lists, right? If so, can we add a quick exit for physical walsenders? Or, we should carefully remove where physical calls it. 07. StartReplication() I think we do not have to call SlotSyncInitConfig(). Alternative approach is written in above. 08. the other Also, I found the unexpected behavior after both 0001 and 0002 were applied. Was it normal or not? 1. constructed below setup (ensured that logical slot existed on secondary) 2. stopped the primary 3. promoted the secondary server 4. disabled a subscription once 5. changed the connection string for subscriber 6. Inserted data to new primary 7. enabled the subscription again 8. got an ERROR: replication slot "sub" does not exist I expected that the logical replication would be restarted, but it could not. Was it real issue or my fault? The error would appear in secondary.log. ``` Setup: primary--->secondary | | subscriber ``` Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
Hi, On 9/25/23 10:44 AM, Drouvot, Bertrand wrote: > Hi, > > On 9/23/23 3:38 AM, Amit Kapila wrote: >> On Fri, Sep 22, 2023 at 6:01 PM Drouvot, Bertrand >> <bertranddrouvot.pg@gmail.com> wrote: >> There is a difference here that we also need to prevent removal of >> rows required by sync_slots. That could be achieved by physical slot >> (and hot_standby_feedback). So, having a requirement to have physical >> slot doesn't sound too unreasonable to me. Otherwise, we need to >> invent some new mechanism of having some sort of placeholder slot to >> avoid removal of required rows. > > Thinking about it, I wonder if removal of required rows is even possible > given that: > > - we don't allow to logical decode from a sync slot > - sync slot catalog_xmin <= its primary counter part catalog_xmin > - its primary counter part prevents rows removal thanks to its own catalog_xmin > - a sync slot is removed as soon as its primary counter part is removed > > In that case I'm not sure how rows removal on the primary could lead to remove rows > required by a sync slot. Am I missing something? Do you have a scenario in mind? Please forget the above questions, it's in fact pretty easy to remove rows on the primary that would be needed by a sync slot. I do agree that having a requirement to have physical slot does not sound unreasonable then. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Here are some more review comments for the patch v19-0002. This is a WIP.... these review comments are all for the file slotsync.c ====== src/backend/replication/logical/slotsync.c 1. wait_for_primary_slot_catchup + WalRcvExecResult *res; + TupleTableSlot *slot; + Oid slotRow[1] = {LSNOID}; + StringInfoData cmd; + bool isnull; + XLogRecPtr restart_lsn; + + for (;;) + { + int rc; I could not recognize a reason why 'rc' is declared within the loop, but none of the other local variables are. Personally, I'd declare all variables at the deepest scope (e.g. inside the for loop). ~~~ 2. get_local_synced_slot_names +/* + * Get list of local logical slot names which are synchronized from + * primary and belongs to one of the DBs passed in. + */ +static List * +get_local_synced_slot_names(Oid *dbids) +{ IIUC, this function gets called only from the drop_obsolete_slots() function. But I thought this list of local slot names (i.e. for the dbids that this worker is handling) would be something that perhaps could the initialized one time for the worker, instead of it being re-calculated every single time the slots processing/dropping happens. Isn't the current code expending too much effort recalculating over and over but giving back the same list every time? ~~~ 3. get_local_synced_slot_names + for (int i = 0; i < max_replication_slots; i++) + { + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; + + /* Check if it is logical synchronized slot */ + if (s->in_use && SlotIsLogical(s) && s->data.synced) + { + for (int j = 0; j < MySlotSyncWorker->dbcount; j++) + { Loop variables are not declared in the common PG code way. ~~~ 4. slot_exists_locally +static bool +slot_exists_locally(List *remote_slots, ReplicationSlot *local_slot, + bool *locally_invalidated) +{ + ListCell *cell; + + foreach(cell, remote_slots) + { + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell); + + if (strcmp(remote_slot->name, NameStr(local_slot->data.name)) == 0) + { + /* + * if remote slot is marked as non-conflicting (i.e. not + * invalidated) but local slot is marked as invalidated, then set + * the bool. + */ + if (!remote_slot->conflicting && + SlotIsLogical(local_slot) && + local_slot->data.invalidated != RS_INVAL_NONE) + *locally_invalidated = true; + + return true; + } + } + + return false; +} Why is there a SlotIsLogical(local_slot) check buried in this function? How is slot_exists_locally() getting called with a non-logical local_slot? Shouldn't that have been screened out long before here? ~~~ 5. use_slot_in_query +static bool +use_slot_in_query(char *slot_name, Oid *dbids) There are multiple non-standard for-loop variable declarations in this function. ~~~ 6. compute_naptime + * The first slot managed by each worker is chosen for monitoring purpose. + * If the lsn of that slot changes during each sync-check time, then the + * nap time is kept at regular value of WORKER_DEFAULT_NAPTIME_MS. + * When no lsn change is observed for WORKER_INACTIVITY_THRESHOLD_MS + * time, then the nap time is increased to WORKER_INACTIVITY_NAPTIME_MS. + * This nap time is brought back to WORKER_DEFAULT_NAPTIME_MS as soon as + * lsn change is observed. 6a. /regular value/the regular value/ /for WORKER_INACTIVITY_THRESHOLD_MS time/within the threshold period (WORKER_INACTIVITY_THRESHOLD_MS)/ ~ 6b. /as soon as lsn change is observed./as soon as another lsn change is observed./ ~~~ 7. + * The caller is supposed to ignore return-value of 0. The 0 value is returned + * for the slots other that slot being monitored. 
+ */ +static long +compute_naptime(RemoteSlot *remote_slot) This rule about the returning 0 seemed hacky to me. IMO this would be a better API to pass long *naptime (which this function either updates or doesn't update, depending on this being the "monitored" slot. Knowing the current naptime is also useful to improve the function logic (see the next review comment below). Also, since this function is really only toggling naptime between 2 values, it would be helpful to assert that Assert(*naptime == WORKER_DEFAULT_NAPTIME_MS || *naptime == WORKER_INACTIVITY_NAPTIME_MS); ~~~ 8. + if (NameStr(MySlotSyncWorker->monitoring_info.slot_name)[0] == '\0') + { + /* + * First time, just update the name and lsn and return regular + * nap time. Start comparison from next time onward. + */ + strcpy(NameStr(MySlotSyncWorker->monitoring_info.slot_name), + remote_slot->name); I wasn't sure why it was necessary to identify the "monitoring" slot by name. Why doesn't the compute_naptime just get called only for the 1st slot found in the tuple loop instead of all the strcmp business trying to match monitor names? And, if the monitored slot gets "dropped", then so what; next time another slot will be the first tuple so will automatically take its place, right? ~~~ 9. + /* + * If new received lsn (remote one) is different from what we have in + * our local slot, then update last_update_time. + */ + if (MySlotSyncWorker->monitoring_info.confirmed_lsn != + remote_slot->confirmed_lsn) + MySlotSyncWorker->monitoring_info.last_update_time = now; + + MySlotSyncWorker->monitoring_info.confirmed_lsn = + remote_slot->confirmed_lsn; Doesn't it make more sense to also put that 'confirmed_lsn' assignment under the same condition? e.g. No need to overwrite the same value again. ~~~ 10. + /* If the inactivity time reaches the threshold, increase nap time */ + if (TimestampDifferenceExceeds(MySlotSyncWorker->monitoring_info.last_update_time, + now, WORKER_INACTIVITY_THRESHOLD_MS)) + return WORKER_INACTIVITY_NAPTIME_MS; + else + return WORKER_DEFAULT_NAPTIME_MS; + } Somehow this feels overcomplicated to me. In reality, the naptime is only toggling between 2 values (DEFAULT and INACTIVITY) so we should never need to be testing TimestampDifferenceExceeds again and again on subsequent calls (there might be 1000s of them) Once naptime is WORKER_INACTIVITY_NAPTIME_MS we know to reset it back to WORKER_DEFAULT_NAPTIME_MS only if (MySlotSyncWorker->monitoring_info.confirmed_lsn != remote_slot->confirmed_lsn) is detected. Basically, I think the algorithm should be like the code below: TimestampTz now = GetCurrentTimestamp(); if (MySlotSyncWorker->monitoring_info.confirmed_lsn != remote_slot->confirmed_lsn) { MySlotSyncWorker->monitoring_info.last_update_time = now; MySlotSyncWorker->monitoring_info.confirmed_lsn = remote_slot->confirmed_lsn; /* Something changed; reset naptime to default. */ *naptime = WORKER_DEFAULT_NAPTIME_MS; } else { if (*naptime == WORKER_DEFAULT_NAPTIME_MS) { /* If the inactivity time reaches the threshold, increase nap time. */ if (TimestampDifferenceExceeds(MySlotSyncWorker->monitoring_info.last_update_time, now, WORKER_INACTIVITY_THRESHOLD_MS)) *naptime = WORKER_INACTIVITY_NAPTIME_MS; } } ~~~ 11. get_remote_invalidation_cause +/* + * Get Remote Slot's invalidation cause. + * + * This gets invalidation cause of remote slot. + */ +static ReplicationSlotInvalidationCause +get_remote_invalidation_cause(WalReceiverConn *wrconn, char *slot_name) +{ Isn't that function comment just repeating itself? ~~~ 12. 
+ initStringInfo(&cmd); + appendStringInfo(&cmd, + "select pg_get_slot_invalidation_cause(%s)", + quote_literal_cstr(slot_name)); Use uppercase "SELECT" for consistency with other SQL. ~~~ 13. + /* Make things live outside TX context */ + MemoryContextSwitchTo(oldctx); + + initStringInfo(&cmd); + appendStringInfo(&cmd, + "select pg_get_slot_invalidation_cause(%s)", + quote_literal_cstr(slot_name)); + res = walrcv_exec(wrconn, cmd.data, 1, slotRow); + pfree(cmd.data); + + CommitTransactionCommand(); + + /* Switch to oldctx we saved */ + MemoryContextSwitchTo(oldctx); There are 2x MemoryContextSwitchTo(oldctx) here. Is that deliberate? ~~~ 14. + if (res->status != WALRCV_OK_TUPLES) + ereport(ERROR, + (errmsg("could not fetch invalidation cuase for slot \"%s\" from" + " primary: %s", slot_name, res->err))); typo /cuase/cause/ ~~~ 15. + slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); + if (!tuplestore_gettupleslot(res->tuplestore, true, false, slot)) + ereport(ERROR, + (errmsg("slot \"%s\" disapeared from the primary", + slot_name))); typo /disapeared/disappeared/ ~~~ 16. drop_obsolete_slots +/* + * Drop obsolete slots + * + * Drop the slots which no longer need to be synced i.e. these either + * do not exist on primary or are no longer part of synchronize_slot_names. + * + * Also drop the slots which are valid on primary and got invalidated + * on standby due to conflict (say required rows removed on primary). + * The assumption is, these will get recreated in next sync-cycle and + * it is okay to drop and recreate such slots as long as these are not + * consumable on standby (which is the case currently). + */ /which no/that no/ /which are/that are/ /these will/that these will/ /and got invalidated/that got invalidated/ ~~~ 17. + /* If this slot is being monitored, clean-up the monitoring info */ + if (strcmp(NameStr(local_slot->data.name), + NameStr(MySlotSyncWorker->monitoring_info.slot_name)) == 0) + { + MemSet(NameStr(MySlotSyncWorker->monitoring_info.slot_name), 0, NAMEDATALEN); + MySlotSyncWorker->monitoring_info.confirmed_lsn = 0; + MySlotSyncWorker->monitoring_info.last_update_time = 0; + } Maybe it is better to assign InvalidXLogRecPtr instead of 0 to the cleared lsn. ~ Alternatively, consider just zapping the entire monitoring_info structure in one go: MemSet(&MySlotSyncWorker->monitoring_info, 0, sizeof(MySlotSyncWorker->monitoring_info)); ~~~ 18. construct_slot_query (calling use_slot_in_query) This separation of functions (use_slot_in_query / construct_slot_query) seems awkward to me. The use_slot_in_query() function is only called by construct_slot_query(). I felt it might be simpler to keep all the logical with the construct_slot_query(). Furthermore, it seemed strange to iterate all the DBs (to populate the "WHERE database IN" clause) and then iterate all the DBs multiple times again in use_slot_in_query (looking for slots to populate the "AND slot_name IN (" clause). Maybe I misunderstand the reason for this structuring, but IMO it would be simpler code to keep all the logic in construct_slot_query() like: a. Initialize with empty dblist, empty slotlist. b. Iterate all dbids - constructing the dblist as you go - constructing the slot list as you go (if synchronize_slot_names is not "" or "*") c. Finally, build the query: basic + dblist-clause + optional slotlist-clause ~~~ 19. construct_slot_query Why does this function return a boolean? I only see it returns true, but never false. ~~~ 20. 
+ { + ListCell *lc; + bool first_slot = true; + + + foreach(lc, sync_slot_names_list) Unnecessary blank line. ~~~ 21. synchronize_one_slot +/* + * Synchronize single slot to given position. + * + * This creates new slot if there is no existing one and updates the + * metadata of existing slots as per the data received from the primary. + */ +static void +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) /creates new slot/creates a new slot/ /metadata of existing slots/metadata of the slot/ ~~~ 22 + /* Search for the named slot and mark it active if we find it. */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (int i = 0; i < max_replication_slots; i++) + { + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; + + if (!s->in_use) + continue; + + if (strcmp(NameStr(s->data.name), remote_slot->name) == 0) + { + found = true; + break; + } + } + LWLockRelease(ReplicationSlotControlLock); 22a. "and mark it active if we find it." -- What code here is marking anything active? ~ 22b. Uncommon style of loop variable declaration ~ 22c. IMO it is over-complicated code; e.g. same loop can be written like this: SUGGESTION for (i = 0; i < max_replication_slots && !found; i++) { ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; if (s->in_use) found = (strcmp(NameStr(s->data.name), remote_slot->name) == 0); } ~~~ 23. synchronize_slots + /* Construct query to get slots info from the primary */ + initStringInfo(&s); + if (!construct_slot_query(&s, dbids)) + { + pfree(s.data); + CommitTransactionCommand(); + LWLockRelease(SlotSyncWorkerLock); + return naptime; + } As noted elsewhere, it seems construct_slot_query() will never return false and so this block of code is unreachable. ~~~ 24. + /* Create list of remote slot names to be used by drop_obsolete_slots */ + remote_slot_list = lappend(remote_slot_list, remote_slot); This is a list of slots, not just slot names. ~~~ 25. + /* + * Update nap time in case of non-zero value returned. The zero value + * is returned if remote_slot is not the one being monitored. + */ + value = compute_naptime(remote_slot); + if (value) + naptime = value; If the compute_naptime API is changed as suggested in a prior review comment then this can be simplified to something like: SUGGESTION: /* Update nap time as required depending on slot activity. */ compute_naptime(remote_slot, &naptime); ~~~ 26. + /* + * Drop local slots which no longer need to be synced i.e. these either do + * not exist on primary or are no longer part of synchronize_slot_names. + */ + drop_obsolete_slots(dbids, remote_slot_list); /which no longer/that no longer/ I thought it might be better to omit the "i.e." part. Just leave it to the function-header of drop_obsolete_slots for a detailed explanation about *which* slots are candidates for dropping. ~ 27. + /* We are done, free remot_slot_list elements */ + foreach(cell, remote_slot_list) + { + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell); + + pfree(remote_slot); + } 27a. /remot_slot_list/remote_slot_list/ ~ 27b. Isn't this just the same as the one-liner: list_free_deep(remote_slot_list); ~~~ 28. +/* + * Initialize the list from raw synchronize_slot_names and cache it, in order + * to avoid parsing it repeatedly. Done at slot-sync worker startup and after + * each SIGHUP. 
+ */ +static void +SlotSyncInitSlotNamesList() +{ + char *rawname; + + if (strcmp(synchronize_slot_names, "") != 0 && + strcmp(synchronize_slot_names, "*") != 0) + { + rawname = pstrdup(synchronize_slot_names); + SplitIdentifierString(rawname, ',', &sync_slot_names_list); + } +} 28a. Why this static function name is camel-case, unlike all the others? ~ 28b. What about when the sync_slot_names_list changes from value to "" or "*". Shouldn't this function be setting sync_slot_names_list = NIL for that scenario? ~~~ 29. remote_connect +/* + * Connect to remote (primary) server. + * + * This uses primary_conninfo in order to connect to primary. For slot-sync + * to work, primary_conninfo is expected to have dbname as well. + */ +static WalReceiverConn * +remote_connect() 29a. I felt it might be more helpful to say "GUC primary_conninfo" instead of just 'primary_conninfo' the first time this is mentioned. ~ 29b. /connect to primary/connect to the primary/ ~ 29c. /is expected to have/is required to specify/ ~~~ 30. reconnect_if_needed +/* + * Reconnect to remote (primary) server if PrimaryConnInfo got changed. + */ +static WalReceiverConn * +reconnect_if_needed(WalReceiverConn *wrconn_prev, char *conninfo_prev) /got changed/has changed/ ~~~ 31. +static WalReceiverConn * +reconnect_if_needed(WalReceiverConn *wrconn_prev, char *conninfo_prev) +{ + WalReceiverConn *wrconn = NULL; + + /* If no change in PrimaryConnInfo, return previous connection itself */ + if (strcmp(conninfo_prev, PrimaryConnInfo) == 0) + return wrconn_prev; + + walrcv_disconnect(wrconn); + wrconn = remote_connect(); + return wrconn; +} /return previous/return the previous/ Disconnect NULL is a bug isn't it? Don't you mean to disconnect 'wrconn_prev'? ~~~ 32. slotsync_worker_detach +/* + * Detach the worker from DSM and update 'proc' and 'in_use'. + * Logical replication launcher will come to know using these + * that the worker has shutdown. + */ +static void +slotsync_worker_detach(int code, Datum arg) +{ + dsa_detach((dsa_area *) DatumGetPointer(arg)); + LWLockAcquire(SlotSyncWorkerLock, LW_EXCLUSIVE); + MySlotSyncWorker->hdr.in_use = false; + MySlotSyncWorker->hdr.proc = NULL; + LWLockRelease(SlotSyncWorkerLock); +} I expected this function to be in the same module as slotsync_worker_attach. It seems a bit strange to have them separated. ~~~ 33. ReplSlotSyncMain + ereport(ERROR, + (errmsg("The dbname not specified in primary_conninfo, skipping" + " slots synchronization"), + errhint("Specify dbname in primary_conninfo for slots" + " synchronization to proceed"))); /not specified in/was not specified in/ /slots synchronization/slot synchronization/ (??) -- there are multiple of these ~ 34. + /* + * Connect to the database specified by user in PrimaryConnInfo. We need + * database connection for walrcv_exec to work. Please see comments atop + * libpqrcv_exec. + */ /database connection/a database connection/ ~~~ 35. + /* Reconnect if primary_conninfo got changed */ + if (config_reloaded) + wrconn = reconnect_if_needed(wrconn, conninfo_prev); SUGGESTION Reconnect if GUC primary_conninfo has changed. ~ 36. + /* + * The slot-sync worker must not get here because it will only stop when + * it receives a SIGINT from the logical replication launcher, or when + * there is an error. None of these cases will allow the code to reach + * here. + */ + Assert(false); 36a. /must not/cannot/ 36b. "None of these cases will allow the code to reach here." <-- redundant sentence ====== Kind Regards, Peter Smith. Fujitsu Australia
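Regarding comments 28a/28b (and Kuroda-san's earlier point about freeing the old list), one possible shape for that function is sketched below; it keeps the patch's GUC and list names, but the reset/free handling shown is only illustrative:

```c
static List *sync_slot_names_list = NIL;
static char *sync_slot_names_raw = NULL;	/* backing storage for the list */

/*
 * Sketch only: re-parse synchronize_slot_names at worker startup and after
 * SIGHUP, making sure the cached list is reset when the new value is "" or
 * "*" (comment 28b) and the old memory is released.  The list cells returned
 * by SplitIdentifierString() point into the raw string, so free the list
 * first and the backing storage afterwards.
 */
static void
slotsync_init_slot_names_list(void)
{
	list_free(sync_slot_names_list);
	sync_slot_names_list = NIL;
	if (sync_slot_names_raw)
		pfree(sync_slot_names_raw);
	sync_slot_names_raw = NULL;

	if (strcmp(synchronize_slot_names, "") != 0 &&
		strcmp(synchronize_slot_names, "*") != 0)
	{
		sync_slot_names_raw = pstrdup(synchronize_slot_names);
		if (!SplitIdentifierString(sync_slot_names_raw, ',',
								   &sync_slot_names_list))
			ereport(ERROR,
					(errmsg("invalid list syntax in parameter \"%s\"",
							"synchronize_slot_names")));
	}
}
```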
Hi, On 9/19/23 6:50 AM, shveta malik wrote: > On Wed, Sep 13, 2023 at 5:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Wed, Sep 13, 2023 at 4:54 PM shveta malik <shveta.malik@gmail.com> wrote: >>> >>> PFA v17. It has below changes: >>> >> >> @@ -2498,6 +2500,13 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, >> ReorderBufferTXN *txn, >> } >> else >> { >> + /* >> + * Before we send out the last set of changes to logical decoding >> + * output plugin, wait for specified streaming replication standby >> + * servers (if any) to confirm receipt of WAL upto commit_lsn. >> + */ >> + WaitForStandbyLSN(commit_lsn); >> >> It seems the first patch has a wait logic for every commit. I think it >> is better to integrate this wait with WalSndWaitForWal() as suggested >> by Andres in his email[1]. >> >> [1] - https://www.postgresql.org/message-id/20220207204557.74mgbhowydjco4mh%40alap3.anarazel.de >> >> -- > > Sure Amit. PFA v18. It addresses below: > > 1) patch001: wait for physical-standby confirmation logic is now > integrated with WalSndWaitForWal(). Now walsender waits for physical > standby's confirmation to take changes upto RecentFlushPtr in > WalSndWaitForWal(). This allows walsender to send the changes to > logical subscribers one by one which are already covered in > RecentFlushPtr without needing to wait on every commit for physical > standby confirmation. + /* XXX: Is waiting for 1 second before retrying enough or more or less? */ + (void) WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + 1000L, + WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION); I think it would be better to let the physical walsender(s) wake up those logical walsender(s) (instead of waiting for 1 sec or such). Maybe we could introduce a new CV that would broadcast in PhysicalConfirmReceivedLocation() when restart_lsn is changed, what do you think? Still regarding preventing the logical replication to go ahead of physical replication standbys specified in standby_slot_names: we currently don't impose this limitation to pg_logical_slot_get_changes and friends (that don't start a dedicated walsender). Shouldn't we also prevent them to go ahead of physical replication standbys specified in standby_slot_names? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
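A rough sketch of the condition-variable idea (the CV field, its placement in WalSndCtl, and the helper check are assumptions, not code from any posted patch):

```c
/*
 * Sketch only: let physical walsenders wake the waiting logical walsenders
 * instead of the latter polling every second.  Assumes a ConditionVariable
 * added somewhere in shared memory (e.g. in WalSndCtlData), which the
 * current patches do not have.
 */

/* In PhysicalConfirmReceivedLocation(), after restart_lsn is advanced: */
ConditionVariableBroadcast(&WalSndCtl->physical_confirm_cv);

/* In the logical walsender, replacing the 1s WaitLatch() retry: */
ConditionVariablePrepareToSleep(&WalSndCtl->physical_confirm_cv);
while (!StandbysConfirmedUpto(wait_for_lsn))	/* hypothetical check */
	ConditionVariableSleep(&WalSndCtl->physical_confirm_cv,
						   WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION);
ConditionVariableCancelSleep();
```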
PFA v20. The changes are: 1) The launcher now checks hot_standby_feedback (along with the presence of a physical slot) before launching slot-sync workers and skips the sync if it is off. 2) Other validity checks (for primary_slot_name, dbname in primary_conninfo, etc.) are now moved to the launcher before we even launch slot-sync workers. This will fix the frequent WARNING messages appearing in the log file as reported by Bertrand. 3) Now we stop all the slot-sync workers in case any of the related GUCs has changed and then relaunch them in the next sync-cycle as per the new values and after performing the validity checks again. 4) This patch also fixes a few bugs in wait_for_primary_slot_catchup(): 4.1) This function was not coming out of the wait gracefully on the standby's promotion; it is fixed now. 4.2) The checks to start the wait were not correct. These have been fixed now. 4.3) If the slot (on which we are waiting) is invalidated on the primary in the meantime, this function was not handling that scenario and was not aborting the wait. Handled now. 5) Addressed most of the comments (dated Sep 25) given by Kuroda-san on patch 0001. The first 4 changes are in patch002 while the last one is in patch001. thanks Shveta
Attachment
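Putting the v20 preconditions together, the standby-side settings that the launcher now validates would look roughly like this (all values are placeholders):

```
# postgresql.conf on the physical standby (illustrative values)
primary_conninfo      = 'host=primary.example.com user=replicator dbname=postgres'
primary_slot_name     = 'standby1_slot'   # physical slot created on the primary
hot_standby_feedback  = on                # checked by the launcher before syncing
```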
On Mon, Sep 25, 2023 at 7:46 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Ajin, Shveta, > > Thank you for rebasing the patch set! Here are new comments for v19_2-0001. > Thank You Kuroda-san for the feedback. Most of these are addressed in v20. Please find my response inline. > 01. WalSndWaitForStandbyNeeded() > > ``` > if (SlotIsPhysical(MyReplicationSlot)) > return false; > ``` > > Is there a possibility that physical walsenders call this function? > IIUC following is a stacktrace for the function, so the only logical walsenders use it. > If so, it should be Assert() instead of an if statement. > > logical_read_xlog_page() > WalSndWaitForWal() > WalSndWaitForStandbyNeeded() It will only be called from logical-walsenders. Modified as you suggested. > > 02. WalSndWaitForStandbyNeeded() > > Can we set shouldwait in SlotSyncInitConfig()? synchronize_slot_names_list is > searched whenever the function is called, but it is not changed automatically. > If the slotname is compared with the list in the SlotSyncInitConfig(), the > liner search can be reduced. standby_slot_names and synchronize_slot_names will be removed in the next version as per discussion in [1] and thus SlotSyncInitConfig() will not be needed. It will be replaced by new functionality. So I am currently leaving it as is. > > 03. WalSndWaitForStandbyConfirmation() > > We should add ProcessRepliesIfAny() during the loop, otherwise the walsender > overlooks the death of an apply worker. > Done. > 04. WalSndWaitForStandbyConfirmation() > > Not sure, but do we have to return early if walsenders got PROCSIG_WALSND_INIT_STOPPING > signal? I thought that if physical walsenders get stuck, logical walsenders wait > forever. At that time we cannot stop the primary server even if "pg_ctl stop" > is executed. > yes, right. I have added CHECK_FOR_INTERRUPTS() and 'got_STOPPING' handling now which I think should suffice to process PROCSIG_WALSND_INIT_STOPPING. > 05. SlotSyncInitConfig() > > Why don't we free the memory for rawname, old standby_slot_names_list, and synchronize_slot_names_list? > They seem to be overwritten. > Skipped for the time being due to reasons stated in pt 2. > 06. SlotSyncInitConfig() > > Both physical and logical walsenders call the func, but physical one do not use > lists, right? If so, can we add a quick exit for physical walsenders? > Or, we should carefully remove where physical calls it. > > 07. StartReplication() > > I think we do not have to call SlotSyncInitConfig(). > Alternative approach is written in above. > I have removed it from StartReplication() > 08. the other > > Also, I found the unexpected behavior after both 0001 and 0002 were applied. > Was it normal or not? > > 1. constructed below setup > (ensured that logical slot existed on secondary) > 2. stopped the primary > 3. promoted the secondary server > 4. disabled a subscription once > 5. changed the connection string for subscriber > 6. Inserted data to new primary > 7. enabled the subscription again > 8. got an ERROR: replication slot "sub" does not exist > > I expected that the logical replication would be restarted, but it could not. > Was it real issue or my fault? The error would appear in secondary.log. > > ``` > Setup: > primary--->secondary > | > | > subscriber > ``` I have attached the new test-script (v2), can you please try that on the v20 set of patches. We should let the slot creation complete first on standby and then try promotion. I have added a few extra lines in v2 of your script for the same. 
In the test case, the primary's restart_lsn was lagging behind the new slot's restart_lsn on the standby, and thus the standby was waiting for the primary to catch up. Meanwhile, the standby got promoted and thus the slot creation got aborted. That is the reason you were not able to get logical replication working on the new primary. v20 has improved handling and better logging for such a case. Please try the attached test-script on v20. [1]: https://www.postgresql.org/message-id/CAJpy0uA%2Bt3XP2M0qtEmrOG1gSwHghjHPno5AtwTXM-94-%2Bc6JQ%40mail.gmail.com thanks Shveta
Attachment
On Wed, Sep 27, 2023 at 3:13 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 9/19/23 6:50 AM, shveta malik wrote: > > > > 1) patch001: wait for physical-standby confirmation logic is now > > integrated with WalSndWaitForWal(). Now walsender waits for physical > > standby's confirmation to take changes upto RecentFlushPtr in > > WalSndWaitForWal(). This allows walsender to send the changes to > > logical subscribers one by one which are already covered in > > RecentFlushPtr without needing to wait on every commit for physical > > standby confirmation. > > + /* XXX: Is waiting for 1 second before retrying enough or more or less? */ > + (void) WaitLatch(MyLatch, > + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > + 1000L, > + WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION); > > I think it would be better to let the physical walsender(s) wake up those logical > walsender(s) (instead of waiting for 1 sec or such). Maybe we could introduce a new CV that would > broadcast in PhysicalConfirmReceivedLocation() when restart_lsn is changed, what do you think? > Yes, I also think there should be some way for physical walsender to wake up logical walsenders instead of just waiting. By the way, do you think we need a GUC like standby_slot_names (please see discussion [1])? > Still regarding preventing the logical replication to go ahead of > physical replication standbys specified in standby_slot_names: we currently don't impose this > limitation to pg_logical_slot_get_changes and friends (that don't start a dedicated walsender). > > Shouldn't we also prevent them to go ahead of physical replication standbys specified in standby_slot_names? > Yes, I also think similar handling is required in pg_logical_slot_get_changes_guts(). We do call GetFlushRecPtr(), so the handling similar to what the patch is trying to do in WalSndWaitForWal() can be done. [1] - https://www.postgresql.org/message-id/CAJpy0uA%2Bt3XP2M0qtEmrOG1gSwHghjHPno5AtwTXM-94-%2Bc6JQ%40mail.gmail.com -- With Regards, Amit Kapila.
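For reference, a sketch of where such handling could hook into pg_logical_slot_get_changes_guts() (WaitForStandbyConfirmation() is the helper name from patch 0001, but this placement is only illustrative and not from any posted patch):

```c
/*
 * Sketch only: inside pg_logical_slot_get_changes_guts(), after computing
 * how far decoding may go, wait for the configured physical standbys before
 * decoding up to that point, mirroring what WalSndWaitForWal() does for
 * walsenders in patch 0001.
 */
end_of_wal = GetFlushRecPtr(NULL);

/* Do not decode past what the physical standbys have confirmed. */
WaitForStandbyConfirmation(end_of_wal);

/* ... existing decoding loop continues until end_of_wal ... */
```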
Hi, On 9/25/23 6:10 AM, shveta malik wrote: > On Fri, Sep 22, 2023 at 3:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Thu, Sep 21, 2023 at 9:16 AM shveta malik <shveta.malik@gmail.com> wrote: >>> >>> On Tue, Sep 19, 2023 at 10:29 AM shveta malik <shveta.malik@gmail.com> wrote: >>> >>> Currently in patch001, synchronize_slot_names is a GUC on both primary >>> and physical standby. This GUC tells which all logical slots need to >>> be synced on physical standbys from the primary. Ideally it should be >>> a GUC on physical standby alone and each physical standby should be >>> able to communicate the value to the primary (considering the value >>> may vary for different physical replicas of the same primary). The >>> primary on the other hand should be able to take UNION of these values >>> and let the logical walsenders (belonging to the slots in UNION >>> synchronize_slots_names) wait for physical standbys for confirmation >>> before sending those changes to logical subscribers. The intent is >>> logical subscribers should never be ahead of physical standbys. >>> >> >> Before getting into the details of 'synchronize_slot_names', I would >> like to know whether we really need the second GUC >> 'standby_slot_names'. Can't we simply allow all the logical wal >> senders corresponding to 'synchronize_slot_names' to wait for just the >> physical standby(s) (physical slot corresponding to such physical >> standby) that have sent ' synchronize_slot_names'list? We should have >> one physical standby slot corresponding to one physical standby. >> > > yes, with the new approach (to be implemented next) where we plan to > send synchronize_slot_names from each physical standby to primary, the > standby_slot_names GUC should no longer be needed on primary. The > physical standbys sending requests should automatically become the > ones to be waited for confirmation on the primary. > I think that standby_slot_names could be used to do some filtering (means for which standby(s) we don't want the logical replication on the primary to go ahead and for which standby(s) one would allow it). I think that removing the GUC would: - remove this flexibility - probably open corner cases like: what if a standby is down? would that mean that synchronize_slot_names not being send to the primary would allow the decoding on the primary to go ahead? So, I'm not sure we should remove this GUC. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 9/25/23 6:10 AM, shveta malik wrote: > > On Fri, Sep 22, 2023 at 3:48 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> On Thu, Sep 21, 2023 at 9:16 AM shveta malik <shveta.malik@gmail.com> wrote: > >>> > >>> On Tue, Sep 19, 2023 at 10:29 AM shveta malik <shveta.malik@gmail.com> wrote: > >>> > >>> Currently in patch001, synchronize_slot_names is a GUC on both primary > >>> and physical standby. This GUC tells which all logical slots need to > >>> be synced on physical standbys from the primary. Ideally it should be > >>> a GUC on physical standby alone and each physical standby should be > >>> able to communicate the value to the primary (considering the value > >>> may vary for different physical replicas of the same primary). The > >>> primary on the other hand should be able to take UNION of these values > >>> and let the logical walsenders (belonging to the slots in UNION > >>> synchronize_slots_names) wait for physical standbys for confirmation > >>> before sending those changes to logical subscribers. The intent is > >>> logical subscribers should never be ahead of physical standbys. > >>> > >> > >> Before getting into the details of 'synchronize_slot_names', I would > >> like to know whether we really need the second GUC > >> 'standby_slot_names'. Can't we simply allow all the logical wal > >> senders corresponding to 'synchronize_slot_names' to wait for just the > >> physical standby(s) (physical slot corresponding to such physical > >> standby) that have sent ' synchronize_slot_names'list? We should have > >> one physical standby slot corresponding to one physical standby. > >> > > > > yes, with the new approach (to be implemented next) where we plan to > > send synchronize_slot_names from each physical standby to primary, the > > standby_slot_names GUC should no longer be needed on primary. The > > physical standbys sending requests should automatically become the > > ones to be waited for confirmation on the primary. > > > > I think that standby_slot_names could be used to do some filtering (means > for which standby(s) we don't want the logical replication on the primary to go > ahead and for which standby(s) one would allow it). > Isn't it implicit that the physical standby that has requested 'synchronize_slot_names' should be ahead of their corresponding logical walsenders? Otherwise, after the switchover to the new physical standby, the logical subscriptions won't work. > I think that removing the GUC would: > > - remove this flexibility > I think if required we can add such a GUC later as well. Asking users to set more parameters also makes the feature less attractive, so I am trying to see if we can avoid this GUC. > - probably open corner cases like: what if a standby is down? would that mean > that synchronize_slot_names not being send to the primary would allow the decoding > on the primary to go ahead? > Good question. BTW, irrespective of whether we have 'standby_slot_names' parameters or not, how should we behave if standby is down? Say, if 'synchronize_slot_names' is only specified on standby then in such a situation primary won't be even aware that some of the logical walsenders need to wait. OTOH, one can say that users should configure 'synchronize_slot_names' on both primary and standby but note that this value could be different for different standby's, so we can't configure it on primary. -- With Regards, Amit Kapila.
Hi, On 9/29/23 1:33 PM, Amit Kapila wrote: > On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> >> I think that standby_slot_names could be used to do some filtering (means >> for which standby(s) we don't want the logical replication on the primary to go >> ahead and for which standby(s) one would allow it). >> > > Isn't it implicit that the physical standby that has requested > 'synchronize_slot_names' should be ahead of their corresponding > logical walsenders? Otherwise, after the switchover to the new > physical standby, the logical subscriptions won't work. Right, but the idea was to let the flexibility to bypass this constraint. Use case was to avoid a physical standby being down preventing the decoding on the primary. > >> I think that removing the GUC would: >> >> - remove this flexibility >> > > I think if required we can add such a GUC later as well. Asking users > to set more parameters also makes the feature less attractive, so I am > trying to see if we can avoid this GUC. Agree but I think we have to address the standby being down case. > >> - probably open corner cases like: what if a standby is down? would that mean >> that synchronize_slot_names not being send to the primary would allow the decoding >> on the primary to go ahead? >> > > Good question. BTW, irrespective of whether we have > 'standby_slot_names' parameters or not, how should we behave if > standby is down? Say, if 'synchronize_slot_names' is only specified on > standby then in such a situation primary won't be even aware that some > of the logical walsenders need to wait. Exactly, that's why I was thinking keeping standby_slot_names to address this scenario. In such a case one could simply decide to keep or remove the associated physical replication slot from standby_slot_names. Keep would mean "wait" and removing would mean allow to decode on the primary. > OTOH, one can say that users > should configure 'synchronize_slot_names' on both primary and standby > but note that this value could be different for different standby's, > so we can't configure it on primary. > Yeah, I think that's a good use case for standby_slot_names, what do you think? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Dear Shveta, Thank you for updating the patch! I found another ERROR due to the slot removal. Is this a real issue?

1. applied add_sleep.txt, which emulates the case where the tablesync worker gets stuck and the primary crashes during the initial sync.
2. executed test_0925_v2.sh (which you attached in [1])
3. the secondary could not start logical replication because the slot was not created (log files are also attached).

Here is my analysis. The cause is that the slotsync worker aborts the slot creation on the secondary server because the secondary's restart_lsn is ahead of the primary's. IIUC this can occur when a tablesync worker finishes the initial copy before the walsender streams changes. In this case, the relstate of the worker is set to SUBREL_STATE_CATCHUP and the apply worker waits till the relation becomes SUBREL_STATE_SYNCDONE. From here the slot on the primary will not be updated until the relation is caught up. If some changes come in and the primary crashes at that time, the slotsync worker will abort the slot creation.

Anyway, the following are my comments. I have not checked detailed conventions yet; that can be done in a later stage.

~~~~~~~~~~~~~~~~
For 0001:

=====
WalSndWaitForStandbyConfirmation()

```
+ /* If postmaster asked us to stop, don't wait anymore */
+ if (got_STOPPING)
+ break;
```

I have considered this again, and it may still have an issue: logical walsenders may break from the loop before physical walsenders send the WAL. This can happen because both physical and logical walsenders would get PROCSIG_WALSND_INIT_STOPPING. I think a function like WalSndWaitStopping() is needed, which waits until physical walsenders become WALSNDSTATE_STOPPING or exit. Thoughts?

WalSndWaitForStandbyConfirmation()

```
+ standby_slot_cpy = list_copy(standby_slot_names_list);
```

I found that standby_slot_names_list and standby_slot_cpy are not updated even if the GUC is updated. Is this acceptable? Will this still be the case after you refactor the patch? What would happen when synchronize_slot_names is updated on the secondary while the primary is executing this?

WalSndWaitForStandbyConfirmation()

```
+
+ goto retry;
```

I checked other uses of "goto retry;", but I could not find a pattern where no return clause exists after the goto (exception: void functions). I also think the current style seems a bit strange. How about using an outer loop like while (list_length(standby_slot_cpy))? (A sketch of this is included after this message.)

=====
slot.h

```
+extern void WaitForStandbyLSN(XLogRecPtr wait_for_lsn);
```

WaitForStandbyLSN() does not exist.

~~~~~~~~~~~~~~~~
For 0002:

=====
General

The patch requires that primary_conninfo contain the dbname, but this conflicts with the documentation, which says:

```
...Do not specify a database name in the primary_conninfo string.
```

I confirmed [^a] that it is harmless for primary_conninfo to have a dbname, but at least the description must be fixed.

General

I found that the primary server outputs a huge amount of logs when log_min_duration_statement = 0. This is because the slotsync worker sends an SQL query every 10ms, in wait_for_primary_slot_catchup(). Is there any good way to suppress it? Or should we be patient?

=====

```
+{ oid => '6312', descr => 'what caused the replication slot to become invalid',
```

How did you determine the OID? IIRC, features under development should use OIDs in the range 8000-9999. See src/include/catalog/unused_oids.

=====
LogicalRepCtxStruct

```
 /* Background workers. */
+ SlotSyncWorker *ss_workers; /* slot-sync workers */
 LogicalRepWorker workers[FLEXIBLE_ARRAY_MEMBER];
```

It's OK for now, but can we combine them into one array? IIUC there is no possibility that both kinds of processes exist at the same time, and they have the same components, so they may be able to share the same array. That would remove an attribute, but it may make the code a bit harder to read.

WaitForReplicationWorkerAttach() and logicalrep_worker_stop_internal()

I could not find other cases that take an "LWLock *" as an argument (exception: functions in lwlock.c). Is it sufficient to check RecoveryInProgress() instead of passing it as an argument?

=====
wait_for_primary_slot_catchup()

```
+ /* Check if this standby is promoted while we are waiting */
+ if (!RecoveryInProgress())
+ {
+ /*
+ * The remote slot didn't pass the locally reserved position at
+ * the time of local promotion, so it's not safe to use.
+ */
+ ereport(
+ WARNING,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg(
+ "slot-sync wait for slot %s interrupted by promotion, "
+ "slot creation aborted", remote_slot->name)));
+ pfree(cmd.data);
+ return false;
+ }
```

This part would not be executed if the promote signal is sent after the primary server crashes. I think walrcv_exec() will detect the failure first. The function must be wrapped in PG_TRY() and the message must be emitted in PG_CATCH(). There may be other approaches.

wait_for_primary_slot_catchup()

```
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ WORKER_DEFAULT_NAPTIME_MS,
+ WAIT_EVENT_REPL_SLOTSYNC_MAIN);
```

A new wait event can be added.

[1]: https://www.postgresql.org/message-id/CAJpy0uDD%2B9aJnDx9fBfvLvxJtxA7qqoAys4fo6h1tq1b_0_A7Q%40mail.gmail.com

[^a] Regarding the secondary side, libpqrcv_connect() does not do anything special even if primary_conninfo has dbname="XXX". It adds parameters like "replication=true" and sends a startup packet. As for the primary side, the startup packet is consumed in ProcessStartupPacket(). It checks whether the process should be a walsender or not (line 2204). Then (line 2290) port->database_name[0] is set to '\0' in the case of a walsender. The value is used for setting the process title in BackendInitialize(). Also, InitPostgres() does set some global variables like MyDatabaseId, but that does not happen when the process is a walsender.

Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
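As a rough illustration of the loop restructuring suggested in the review above (replacing "goto retry" with an outer while loop), here is a sketch. It assumes the patch's standby_slot_names_list and the patch's wait-event name; StandbySlotConfirmedLSN() is a made-up helper standing in for however the patch reads a standby slot's confirmed position, so this is not the actual patch code.

```c
#include "postgres.h"

#include "access/xlogdefs.h"
#include "miscadmin.h"
#include "nodes/pg_list.h"
#include "storage/latch.h"
#include "utils/wait_event.h"

/* assumed to come from the patch */
extern List *standby_slot_names_list;
/* made-up helper: confirmed position of the named physical slot */
extern XLogRecPtr StandbySlotConfirmedLSN(const char *slot_name);

static void
WalSndWaitForStandbyConfirmation(XLogRecPtr wait_lsn)
{
	List	   *remaining = list_copy(standby_slot_names_list);

	while (list_length(remaining) > 0)
	{
		ListCell   *lc;

		/* drop every slot that has already confirmed wait_lsn */
		foreach(lc, remaining)
		{
			char	   *slot_name = (char *) lfirst(lc);

			if (StandbySlotConfirmedLSN(slot_name) >= wait_lsn)
				remaining = foreach_delete_current(remaining, lc);
		}

		if (remaining == NIL)
			break;

		/* sleep until woken (or the timeout elapses), then re-check */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 1000L,
						 WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}

	list_free(remaining);
}
```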
On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 9/29/23 1:33 PM, Amit Kapila wrote: > > On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > > > >> - probably open corner cases like: what if a standby is down? would that mean > >> that synchronize_slot_names not being send to the primary would allow the decoding > >> on the primary to go ahead? > >> > > > > Good question. BTW, irrespective of whether we have > > 'standby_slot_names' parameters or not, how should we behave if > > standby is down? Say, if 'synchronize_slot_names' is only specified on > > standby then in such a situation primary won't be even aware that some > > of the logical walsenders need to wait. > > Exactly, that's why I was thinking keeping standby_slot_names to address > this scenario. In such a case one could simply decide to keep or remove > the associated physical replication slot from standby_slot_names. Keep would > mean "wait" and removing would mean allow to decode on the primary. > > > OTOH, one can say that users > > should configure 'synchronize_slot_names' on both primary and standby > > but note that this value could be different for different standby's, > > so we can't configure it on primary. > > > > Yeah, I think that's a good use case for standby_slot_names, what do you think? > But, even if we keep 'standby_slot_names' for this purpose, the primary doesn't know the value of 'synchronize_slot_names' once the standby is down and or the primary is restarted. So, how will we know which logical WAL senders needs to wait for 'standby_slot_names'? -- With Regards, Amit Kapila.
Dear Shveta, While investigating more, I found that the launcher crashes while executing the attached script; please see the attachment. In this script, the subscriber was also the publisher. Both subscriber and subscriber2 referred to the same replication slot, which was synchronized by the slotsync worker. I was not quite sure whether the synchronization should occur in this case, but at the very least it must not dump core. The secondary server crashed.

primary ---> secondary
    |             |
subscriber    subscriber2

I checked the stack trace and found that the launcher crashed in slotsync_remove_obsolete_dbs().

```
(gdb) bt
#0  0x0000000000b310a9 in check_for_freed_segments (area=0x3a4ec68) at ../postgres/src/backend/utils/mmgr/dsa.c:2248
#1  0x0000000000b2e856 in dsa_get_address (area=0x3a4ec68, dp=16384) at ../postgres/src/backend/utils/mmgr/dsa.c:959
#2  0x00000000008a2bb5 in slotsync_remove_obsolete_dbs (remote_dbs=0x1fcea70) at ../postgres/src/backend/replication/logical/launcher.c:1615
#3  0x00000000008a318d in ApplyLauncherStartSlotSync (wait_time=0x7ffe15cd57a8, wrconn=0x1f82ec0) at ../postgres/src/backend/replication/logical/launcher.c:1799
#4  0x00000000008a3667 in ApplyLauncherMain (main_arg=0) at ../postgres/src/backend/replication/logical/launcher.c:1967
#5  0x0000000000863aef in StartBackgroundWorker () at ../postgres/src/backend/postmaster/bgworker.c:867
#6  0x000000000086e260 in do_start_bgworker (rw=0x1f6b4e0) at ../postgres/src/backend/postmaster/postmaster.c:5740
#7  0x000000000086e649 in maybe_start_bgworkers () at ../postgres/src/backend/postmaster/postmaster.c:5964
#8  0x000000000086953d in ServerLoop () at ../postgres/src/backend/postmaster/postmaster.c:1852
#9  0x0000000000868c42 in PostmasterMain (argc=3, argv=0x1f3e240) at ../postgres/src/backend/postmaster/postmaster.c:1465
#10 0x000000000075ad5f in main (argc=3, argv=0x1f3e240) at ../postgres/src/backend/main/main.c:198
```

Best Regards, Hayato Kuroda FUJITSU LIMITED
Attachment
Hi, On 10/3/23 12:54 PM, Amit Kapila wrote: > On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 9/29/23 1:33 PM, Amit Kapila wrote: >>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand >>> <bertranddrouvot.pg@gmail.com> wrote: >>>> >>> >>>> - probably open corner cases like: what if a standby is down? would that mean >>>> that synchronize_slot_names not being send to the primary would allow the decoding >>>> on the primary to go ahead? >>>> >>> >>> Good question. BTW, irrespective of whether we have >>> 'standby_slot_names' parameters or not, how should we behave if >>> standby is down? Say, if 'synchronize_slot_names' is only specified on >>> standby then in such a situation primary won't be even aware that some >>> of the logical walsenders need to wait. >> >> Exactly, that's why I was thinking keeping standby_slot_names to address >> this scenario. In such a case one could simply decide to keep or remove >> the associated physical replication slot from standby_slot_names. Keep would >> mean "wait" and removing would mean allow to decode on the primary. >> >>> OTOH, one can say that users >>> should configure 'synchronize_slot_names' on both primary and standby >>> but note that this value could be different for different standby's, >>> so we can't configure it on primary. >>> >> >> Yeah, I think that's a good use case for standby_slot_names, what do you think? >> > > But, even if we keep 'standby_slot_names' for this purpose, the > primary doesn't know the value of 'synchronize_slot_names' once the > standby is down and or the primary is restarted. So, how will we know > which logical WAL senders needs to wait for 'standby_slot_names'? > Yeah right, I also think we'd need: - synchronize_slot_names on both primary and standby But now we would need to take care of different standby having different values ( as you said up-thread).... Thinking out loud: What about a single GUC on the primary (not standby_slot_names nor synchronize_slot_names) but say logical_slots_wait_for_standby that could be a list of say "logical_slot_name:physical_slot". I think this GUC would help us define each walsender behavior (should the standby(s) be up or down): - don't wait if its associated logical_slot is not listed in this GUC - or wait based on its associated "list" of mapped physical slots (would probably have to deal with the min restart_lsn for all the corresponding mapped ones). I don't think we can avoid having to define at least one GUC on the primary (at least to handle the case of standby(s) being down). Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
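To make the proposed "logical_slot_name:physical_slot" mapping format concrete, here is a small standalone parser for a value in that shape. It is purely illustrative of the format being discussed; the GUC itself (and its name) is only a proposal in this thread.

```c
/* Parse a value such as "slot_a:phys1, slot_b:phys2" into pairs. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(void)
{
	char		value[] = "slot_a:phys1, slot_b:phys2";
	char	   *saveptr;

	for (char *tok = strtok_r(value, ",", &saveptr);
		 tok != NULL;
		 tok = strtok_r(NULL, ",", &saveptr))
	{
		while (*tok == ' ')		/* trim leading spaces */
			tok++;

		char	   *colon = strchr(tok, ':');

		if (colon == NULL)
		{
			fprintf(stderr, "invalid entry: \"%s\"\n", tok);
			return EXIT_FAILURE;
		}
		*colon = '\0';
		printf("logical slot \"%s\" waits for physical slot \"%s\"\n",
			   tok, colon + 1);
	}
	return EXIT_SUCCESS;
}
```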
On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 10/3/23 12:54 PM, Amit Kapila wrote: > > On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> On 9/29/23 1:33 PM, Amit Kapila wrote: > >>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand > >>> <bertranddrouvot.pg@gmail.com> wrote: > >>>> > >>> > >>>> - probably open corner cases like: what if a standby is down? would that mean > >>>> that synchronize_slot_names not being send to the primary would allow the decoding > >>>> on the primary to go ahead? > >>>> > >>> > >>> Good question. BTW, irrespective of whether we have > >>> 'standby_slot_names' parameters or not, how should we behave if > >>> standby is down? Say, if 'synchronize_slot_names' is only specified on > >>> standby then in such a situation primary won't be even aware that some > >>> of the logical walsenders need to wait. > >> > >> Exactly, that's why I was thinking keeping standby_slot_names to address > >> this scenario. In such a case one could simply decide to keep or remove > >> the associated physical replication slot from standby_slot_names. Keep would > >> mean "wait" and removing would mean allow to decode on the primary. > >> > >>> OTOH, one can say that users > >>> should configure 'synchronize_slot_names' on both primary and standby > >>> but note that this value could be different for different standby's, > >>> so we can't configure it on primary. > >>> > >> > >> Yeah, I think that's a good use case for standby_slot_names, what do you think? > >> > > > > But, even if we keep 'standby_slot_names' for this purpose, the > > primary doesn't know the value of 'synchronize_slot_names' once the > > standby is down and or the primary is restarted. So, how will we know > > which logical WAL senders needs to wait for 'standby_slot_names'? > > > > Yeah right, I also think we'd need: > > - synchronize_slot_names on both primary and standby > > But now we would need to take care of different standby having different values ( > as you said up-thread).... > > Thinking out loud: What about a single GUC on the primary (not standby_slot_names nor > synchronize_slot_names) but say logical_slots_wait_for_standby that could be a list of say > "logical_slot_name:physical_slot". > > I think this GUC would help us define each walsender behavior (should the standby(s) > be up or down): > It may help in defining the walsender's behaviour better for sure. But the problem I see once we start defining sync-slot-names on primary (in any form whether as independent GUC or as above mapping GUC) is that it needs to be then in sync with standbys, as each standby for sure needs to maintain its own sync-slot-names GUC to make it aware of what all it needs to sync. This brings us to the original question of how do we actually keep these configurations in sync between primary and standby if we plan to maintain it on both? > - don't wait if its associated logical_slot is not listed in this GUC > - or wait based on its associated "list" of mapped physical slots (would probably > have to deal with the min restart_lsn for all the corresponding mapped ones). > > I don't think we can avoid having to define at least one GUC on the primary (at least to > handle the case of standby(s) being down). > > Thoughts? > > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
On Tue, Oct 3, 2023 at 9:27 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On 10/3/23 12:54 PM, Amit Kapila wrote: > > > On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand > > > <bertranddrouvot.pg@gmail.com> wrote: > > >> > > >> On 9/29/23 1:33 PM, Amit Kapila wrote: > > >>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand > > >>> <bertranddrouvot.pg@gmail.com> wrote: > > >>>> > > >>> > > >>>> - probably open corner cases like: what if a standby is down? would that mean > > >>>> that synchronize_slot_names not being send to the primary would allow the decoding > > >>>> on the primary to go ahead? > > >>>> > > >>> > > >>> Good question. BTW, irrespective of whether we have > > >>> 'standby_slot_names' parameters or not, how should we behave if > > >>> standby is down? Say, if 'synchronize_slot_names' is only specified on > > >>> standby then in such a situation primary won't be even aware that some > > >>> of the logical walsenders need to wait. > > >> > > >> Exactly, that's why I was thinking keeping standby_slot_names to address > > >> this scenario. In such a case one could simply decide to keep or remove > > >> the associated physical replication slot from standby_slot_names. Keep would > > >> mean "wait" and removing would mean allow to decode on the primary. > > >> > > >>> OTOH, one can say that users > > >>> should configure 'synchronize_slot_names' on both primary and standby > > >>> but note that this value could be different for different standby's, > > >>> so we can't configure it on primary. > > >>> > > >> > > >> Yeah, I think that's a good use case for standby_slot_names, what do you think? > > >> > > > > > > But, even if we keep 'standby_slot_names' for this purpose, the > > > primary doesn't know the value of 'synchronize_slot_names' once the > > > standby is down and or the primary is restarted. So, how will we know > > > which logical WAL senders needs to wait for 'standby_slot_names'? > > > > > > > Yeah right, I also think we'd need: > > > > - synchronize_slot_names on both primary and standby > > > > But now we would need to take care of different standby having different values ( > > as you said up-thread).... > > > > Thinking out loud: What about a single GUC on the primary (not standby_slot_names nor > > synchronize_slot_names) but say logical_slots_wait_for_standby that could be a list of say > > "logical_slot_name:physical_slot". > > > > I think this GUC would help us define each walsender behavior (should the standby(s) > > be up or down): > > > > It may help in defining the walsender's behaviour better for sure. But > the problem I see once we start defining sync-slot-names on primary > (in any form whether as independent GUC or as above mapping GUC) is > that it needs to be then in sync with standbys, as each standby for > sure needs to maintain its own sync-slot-names GUC to make it aware of > what all it needs to sync. Yes, I also think so. Also, defining such a GUC where user wants to sync all the slots which would normally be the case would be a night mare for the users. > > This brings us to the original question of > how do we actually keep these configurations in sync between primary > and standby if we plan to maintain it on both? 
> > > > - don't wait if its associated logical_slot is not listed in this GUC > > - or wait based on its associated "list" of mapped physical slots (would probably > > have to deal with the min restart_lsn for all the corresponding mapped ones). > > > > I don't think we can avoid having to define at least one GUC on the primary (at least to > > handle the case of standby(s) being down). > > How about an alternate scheme where we define sync_slot_names on standby but then store the physical_slot_name in the corresponding logical slot (ReplicationSlotPersistentData) to be synced? So, the standby will send the list of 'sync_slot_names' and the primary will add the physical standby's slot_name in each of the corresponding sync_slot. Now, if we do this then even after restart, we should be able to know for which physical slot each logical slot needs to wait. We can even provide an SQL API to reset the value of standby_slot_names in logical slots as a way to unblock decoding in case of emergency (for example, corresponding when physical standby never comes up). -- With Regards, Amit Kapila.
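A rough, fragment-level sketch of what that alternate scheme could look like, purely for illustration: the standby_slot_name field in ReplicationSlotPersistentData and the SQL-callable reset function below are assumptions, not existing PostgreSQL code, and error handling is omitted.

```c
#include "postgres.h"

#include "fmgr.h"
#include "replication/slot.h"
#include "storage/spin.h"
#include "utils/builtins.h"

PG_FUNCTION_INFO_V1(pg_reset_standby_slot_name);

/*
 * Emergency unblock: clear the physical slot name stored in a logical slot,
 * so its walsender no longer waits for that standby.  Assumes a new
 * NameData standby_slot_name field in ReplicationSlotPersistentData.
 */
Datum
pg_reset_standby_slot_name(PG_FUNCTION_ARGS)
{
	Name		slotname = PG_GETARG_NAME(0);

	ReplicationSlotAcquire(NameStr(*slotname), true);

	SpinLockAcquire(&MyReplicationSlot->mutex);
	namestrcpy(&MyReplicationSlot->data.standby_slot_name, "");
	SpinLockRelease(&MyReplicationSlot->mutex);

	/* persist the change so it survives a restart */
	ReplicationSlotMarkDirty();
	ReplicationSlotSave();
	ReplicationSlotRelease();

	PG_RETURN_VOID();
}
```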
Here are some review comments for v20-0002. ====== 1. GENERAL - errmsg/elog messages There are a a lot of minor problems and/or quirks across all the message texts. Here is a summary of some I found: ERROR errmsg("could not receive list of slots from the primary server: %s", errmsg("invalid response from primary server"), errmsg("invalid connection string syntax: %s", errmsg("replication slot-sync worker slot %d is empty, cannot attach", errmsg("replication slot-sync worker slot %d is already used by another worker, cannot attach", errmsg("replication slot-sync worker slot %d is already used by another worker, cannot attach", errmsg("could not connect to the primary server: %s", errmsg("operation not permitted on replication slots on standby which are synchronized from primary"))); /primary/the primary/ errmsg("could not fetch invalidation cuase for slot \"%s\" from primary: %s", /cuase/cause/ /primary/the primary/ errmsg("slot \"%s\" disapeared from the primary", /disapeared/disappeared/ errmsg("could not fetch slot info from the primary: %s", errmsg("could not connect to the primary server: %s", err))); errmsg("could not map dynamic shared memory segment for slot-sync worker"))); errmsg("physical replication slot %s found in synchronize_slot_names", slot name not quoted? --- WARNING errmsg("out of background worker slots"), errmsg("Replication slot-sync worker failed to attach to worker-pool slot %d", case? errmsg("Removed database %d from replication slot-sync worker %d; dbcount now: %d", case? errmsg("Skipping slots synchronization as primary_slot_name is not set.")); case? errmsg("Skipping slots synchronization as hot_standby_feedback is off.")); case? errmsg("Skipping slots synchronization as dbname is not specified in primary_conninfo.")); case? errmsg("slot-sync wait for slot %s interrupted by promotion, slot creation aborted", errmsg("could not fetch slot info for slot \"%s\" from primary: %s", /primary/the primary/ errmsg("slot \"%s\" disappeared from the primary, aborting slot creation", errmsg("slot \"%s\" invalidated on primary, aborting slot creation", errmsg("slot-sync for slot %s interrupted by promotion, sync not possible", slot name not quoted? errmsg("skipping sync of slot \"%s\" as the received slot-sync lsn %X/%X is ahead of the standby position %X/%X", errmsg("not synchronizing slot %s; synchronization would move it backward", slot name not quoted? /backward/backwards/ --- LOG errmsg("Added database %d to replication slot-sync worker %d; dbcount now: %d", errmsg("Added database %d to replication slot-sync worker %d; dbcount now: %d", errmsg("Stopping replication slot-sync worker %d", errmsg("waiting for remote slot \"%s\" LSN (%u/%X) and catalog xmin (%u) to pass local slot LSN (%u/%X) and and catalog xmin (%u)", errmsg("wait over for remote slot \"%s\" as its LSN (%X/%X)and catalog xmin (%u) has now passed local slot LSN (%X/%X) and catalog xmin (%u)", missing spaces? elog(LOG, "Dropped replication slot \"%s\" ", extra space? why this one is elog but others are not? elog(LOG, "Replication slot-sync worker %d is shutting down on receiving SIGINT", MySlotSyncWorker->slot); case? why this one is elog but others are not? elog(LOG, "Replication slot-sync worker %d started", worker_slot); case? why this one is elog but others are not? ---- DEBUG1 errmsg("allocated dsa for slot-sync worker for dbcount: %d" worker number not given? should be elog? errmsg_internal("logical replication launcher started") should be elog? 
---- DEBUG2 elog(DEBUG2, "slot-sync worker%d's query:%s \n", missing space after 'worker' extra space before \n ====== .../libpqwalreceiver/libpqwalreceiver.c 2. libpqrcv_get_dbname_from_conninfo +/* + * Get database name from primary conninfo. + * + * If dbanme is not found in connInfo, return NULL value. + * The caller should take care of handling NULL value. + */ +static char * +libpqrcv_get_dbname_from_conninfo(const char *connInfo) 2a. /dbanme/dbname/ ~ 2b. "The caller should take care of handling NULL value." IMO this is not very useful; it's like saying "caller must handle function return values". ~~~ 3. + for (opt = opts; opt->keyword != NULL; ++opt) + { + /* Ignore connection options that are not present. */ + if (opt->val == NULL) + continue; + + if (strcmp(opt->keyword, "dbname") == 0 && opt->val[0] != '\0') + { + dbname = pstrdup(opt->val); + } + } 3a. If there are multiple "dbname" in the conninfo then it will be the LAST one that is returned. Judging by my quick syntax experiment (below) this seemed like the correct thing to do, but I think there should be some comment to explain about it. test_sub=# create subscription sub1 connection 'dbname=foo dbname=bar dbname=test_pub' publication pub1; 2023-09-28 19:15:15.012 AEST [23997] WARNING: subscriptions created by regression test cases should have names starting with "regress_" WARNING: subscriptions created by regression test cases should have names starting with "regress_" NOTICE: created replication slot "sub1" on publisher CREATE SUBSCRIPTION ~ 3b. The block brackets {} are not needed for the single statement. ~ 3c. Since there is only one keyword of interest here it seemed overkill to have a separate 'continue' check. Why not do everything in one line: for (opt = opts; opt->keyword != NULL; ++opt) { if (strcmp(opt->keyword, "dbname") == 0 && opt->val && opt->val[0] != '\0') dbname = pstrdup(opt->val); } ====== src/backend/replication/logical/launcher.c 4. +/* + * The local variables to store the current values of slot-sync related GUCs + * before each ConfigReload. + */ +static char *PrimaryConnInfoPreReload = NULL; +static char *PrimarySlotNamePreReload = NULL; +static char *SyncSlotNamesPreReload = NULL; /The local variables/Local variables/ ~~~ 5. fwd declare static void logicalrep_worker_cleanup(LogicalRepWorker *worker); +static void slotsync_worker_cleanup(SlotSyncWorker *worker); static int logicalrep_pa_worker_count(Oid subid); 5a. Hmmn, I think there were lot more added static functions than just this one. e.g. what about all these? static SlotSyncWorker *slotsync_worker_find static dsa_handle slotsync_dsa_setup static bool slotsync_worker_launch_or_reuse static void slotsync_worker_stop_internal static void slotsync_workers_stop static void slotsync_remove_obsolete_dbs static WalReceiverConn *primary_connect static void SaveCurrentSlotSyncConfigs static bool SlotSyncConfigsChanged static void ApplyLauncherStartSlotSync static void ApplyLauncherStartSubs ~ 5b. There are inconsistent name style used for the new static functions -- e.g. snake_case versus CamelCase. ~~~ 6. WaitForReplicationWorkerAttach int rc; + bool is_slotsync_worker = (lock == SlotSyncWorkerLock) ? true : false; This seemed a hacky way to distinguish the sync-slot workers from other kinds of workers. Wouldn't it be better to pass another parameter to this function? ~~~ 7. slotsync_worker_attach It looks like almost a clone of the logicalrep_worker_attach. Seems a shame if cannot make use of common code. ~~~ 8. 
slotsync_worker_find + * Walks the slot-sync workers pool and searches for one that matches given + * dbid. Since one worker can manage multiple dbs, so it walks the db array in + * each worker to find the match. 8a. SUGGESTION Searches the slot-sync worker pool for the worker who manages the specified dbid. Because a worker can manage multiple dbs, also walk the db array of each worker to find the match. ~ 8b. Should the comment also say something like "Returns NULL if no matching worker is found." ~~~ 9. + /* Search for attached worker for a given dbid */ SUGGESTION Search for an attached worker managing the given dbid. ~~~ 10. +{ + int i; + SlotSyncWorker *res = NULL; + Oid *dbids; + + Assert(LWLockHeldByMeInMode(SlotSyncWorkerLock, LW_SHARED)); + + /* Search for attached worker for a given dbid */ + for (i = 0; i < max_slotsync_workers; i++) + { + SlotSyncWorker *w = &LogicalRepCtx->ss_workers[i]; + int cnt; + + if (!w->hdr.in_use) + continue; + + dbids = (Oid *) dsa_get_address(w->dbids_dsa, w->dbids_dp); + for (cnt = 0; cnt < w->dbcount; cnt++) + { + Oid wdbid = dbids[cnt]; + + if (wdbid == dbid) + { + res = w; + break; + } + } + + /* If worker is found, break the outer loop */ + if (res) + break; + } + + return res; +} IMO this logical can be simplified a lot: - by not using the 'res' variable; directly return instead. - also moved the 'dbids' declaration. - and 'cnt' variable seems not meaningful; replace with 'dbidx' for the db array index IMO. For example (25 lines instead of 35 lines) { int i; Assert(LWLockHeldByMeInMode(SlotSyncWorkerLock, LW_SHARED)); /* Search for an attached worker managing the given dbid. */ for (i = 0; i < max_slotsync_workers; i++) { SlotSyncWorker *w = &LogicalRepCtx->ss_workers[i]; int dbidx; Oid *dbids; if (!w->hdr.in_use) continue; dbids = (Oid *) dsa_get_address(w->dbids_dsa, w->dbids_dp); for (dbidx = 0; dbidx < w->dbcount; dbidx++) { if (dbids[dbidx] == dbid) return w; } } return NULL; } ~~~ 11. slot_sync_dsa_setup +/* + * Setup DSA for slot-sync worker. + * + * DSA is needed for dbids array. Since max number of dbs a worker can manage + * is not known, so initially fixed size to hold DB_PER_WORKER_ALLOC_INIT + * dbs is allocated. If this size is exhausted, it can be extended using + * dsa free and allocate routines. + */ +static dsa_handle +slotsync_dsa_setup(SlotSyncWorker *worker, int alloc_db_count) 11a. SUGGESTION DSA is used for the dbids array. Because the maximum number of dbs a worker can manage is not known, initially enough memory for DB_PER_WORKER_ALLOC_INIT dbs is allocated. If this size is exhausted, it can be extended using dsa free and allocate routines. ~ 11b. It doesn't make sense for the comment to say DB_PER_WORKER_ALLOC_INIT is the initial allocation, but then the function has a parameter 'alloc_db_count' (which is always passed as DB_PER_WORKER_ALLOC_INIT). IMO revemo the 2nd parameter from this function and hardwire the initial allocation same as what the function comment says. ~~~ 12. + /* Be sure any memory allocated by DSA routines is persistent. */ + oldcontext = MemoryContextSwitchTo(TopMemoryContext); /Be sure any memory/Ensure the memory/ ~~~ 13. slotsync_worker_launch_or_reuse +/* + * Slot-sync worker launch or reuse + * + * Start new slot-sync background worker from the pool of available workers + * going by max_slotsync_workers count. If the worker pool is exhausted, + * reuse the existing worker with minimum number of dbs. The idea is to + * always distribute the dbs equally among launched workers. 
+ * If initially allocated dbids array is exhausted for the selected worker, + * reallocate the dbids array with increased size and copy the existing + * dbids to it and assign the new one as well. + * + * Returns true on success, false on failure. + */ /going by/limited by/ (??) ~~~ 14. + BackgroundWorker bgw; + BackgroundWorkerHandle *bgw_handle; + uint16 generation; + SlotSyncWorker *worker = NULL; + uint32 mindbcnt = 0; + uint32 alloc_count = 0; + uint32 copied_dbcnt = 0; + Oid *copied_dbids = NULL; + int worker_slot = -1; + dsa_handle handle; + Oid *dbids; + int i; + bool attach; IIUC many of these variables can be declared at a different scope in this function, so they will be closer to where they are used. ~~~ 15. + /* + * We need to do the modification of the shared memory under lock so that + * we have consistent view. + */ + LWLockAcquire(SlotSyncWorkerLock, LW_EXCLUSIVE); The current comment seems too much. SUGGESTION The shared memory must only be modified under lock. ~~~ 16. + /* Find unused worker slot. */ + for (i = 0; i < max_slotsync_workers; i++) + { + SlotSyncWorker *w = &LogicalRepCtx->ss_workers[i]; + + if (!w->hdr.in_use) + { + worker = w; + worker_slot = i; + break; + } + } + + /* + * If all the workers are currently in use. Find the one with minimum + * number of dbs and use that. + */ + if (!worker) + { + for (i = 0; i < max_slotsync_workers; i++) + { + SlotSyncWorker *w = &LogicalRepCtx->ss_workers[i]; + + if (i == 0) + { + mindbcnt = w->dbcount; + worker = w; + worker_slot = i; + } + else if (w->dbcount < mindbcnt) + { + mindbcnt = w->dbcount; + worker = w; + worker_slot = i; + } + } + } Why not combine these 2 loops, to avoid iterating over the same slots twice? Then, exit the loop immediately if unused worker found, otherwise if reach the end of loop having not found anything unused then you will already know the one having least dbs. ~~~ 17. + /* Remember the old dbids before we reallocate dsa. */ + copied_dbcnt = worker->dbcount; + copied_dbids = (Oid *) palloc0(worker->dbcount * sizeof(Oid)); + memcpy(copied_dbids, dbids, worker->dbcount * sizeof(Oid)); 17a. Who frees this copied_dbids memory when you are finished needed it. It seems allocated in the TopMemoryContext so IIUC this is a leak. ~ 17b. These are the 'old' values. Not the 'copied' values. The copied_xxx variable names seem misleading. ~~~ 18. + /* Prepare the new worker. */ + worker->hdr.launch_time = GetCurrentTimestamp(); + worker->hdr.in_use = true; If a new worker is required then the launch_time is set like above. + { + slot_db_data->last_launch_time = now; + + slotsync_worker_launch_or_reuse(slot_db_data->database); + } Meanwhile, at the caller of slotsync_worker_launch_or_reuse(), the dbid launch_time was already set as well. And those two timestamps are almost (but not quite) the same value. Isn't that a bit strange? ~~~ 19. + /* Initial DSA setup for dbids array to hold DB_PER_WORKER_ALLOC_INIT dbs */ + handle = slotsync_dsa_setup(worker, DB_PER_WORKER_ALLOC_INIT); + dbids = (Oid *) dsa_get_address(worker->dbids_dsa, worker->dbids_dp); + + dbids[worker->dbcount++] = dbid; Where was this worker->dbcount assigned to 0? Maybe it's better to do this explicity under the "/* Prepare the new worker. */" comment. ~~~ 20. 
+ if (!attach) + ereport(WARNING, + (errmsg("Replication slot-sync worker failed to attach to " + "worker-pool slot %d", worker_slot))); + + /* Attach is done, now safe to log that the worker is managing dbid */ + if (attach) + ereport(LOG, + (errmsg("Added database %d to replication slot-sync " + "worker %d; dbcount now: %d", + dbid, worker_slot, worker->dbcount))); 20a. IMO this should be coded as "if (attach) ...; else ..." ~ 99b. In other code if it failed to register then slotsync_worker_cleanup code is called. How come similar code is not done when fails to attach? ~~~ 21. slotsync_worker_stop_internal +/* + * Internal function to stop the slot-sync worker and wait until it detaches + * from the slot-sync worker-pool slot. + */ +static void +slotsync_worker_stop_internal(SlotSyncWorker *worker) IIUC this function does a bit more than what the function comment says. IIUC (again) I think the "detached" worker slot will still be flagged as 'inUse' but this function then does the extra step of calling slotsync_worker_cleanup() function to make the worker slot available for next process that needs it, am I correct? In this regard, this function seems a lot more like logicalrep_worker_detach() function comment, so there seems some kind of muddling of the different function names here... (??). ~~~ 22. slotsync_remove_obsolete_dbs This function says: +/* + * Slot-sync workers remove obsolete DBs from db-list + * + * If the DBIds fetched from the primary are lesser than the ones being managed + * by slot-sync workers, remove extra dbs from worker's db-list. This may happen + * if some slots are removed on primary but 'synchronize_slot_names' has not + * been changed yet. + */ +static void +slotsync_remove_obsolete_dbs(List *remote_dbs) But, there was another similar logic function too: +/* + * Drop obsolete slots + * + * Drop the slots which no longer need to be synced i.e. these either + * do not exist on primary or are no longer part of synchronize_slot_names. + * + * Also drop the slots which are valid on primary and got invalidated + * on standby due to conflict (say required rows removed on primary). + * The assumption is, these will get recreated in next sync-cycle and + * it is okay to drop and recreate such slots as long as these are not + * consumable on standby (which is the case currently). + */ +static void +drop_obsolete_slots(Oid *dbids, List *remote_slot_list) Those function header comments suggest these have a lot of overlapping functionality. Can't those 2 functions be combined? Or maybe one delegate to the other? ~~~ 23. + ListCell *lc; + Oid *dbids; + int widx; + int dbidx; + int i; Scope of some of these variable declarations can be different so they are declared closer to where they are used. ~~~ 24. + /* If not found, then delete this db from worker's db-list */ + if (!found) + { + for (i = dbidx; i < worker->dbcount; i++) + { + /* Shift the DBs and get rid of wdbid */ + if (i < (worker->dbcount - 1)) + dbids[i] = dbids[i + 1]; + } IIUC, that shift/loop could just have been a memmove() call to remove one Oid element. ~~~ 25. + /* If dbcount for any worker has become 0, shut it down */ + for (widx = 0; widx < max_slotsync_workers; widx++) + { + SlotSyncWorker *worker = &LogicalRepCtx->ss_workers[widx]; + + if (worker->hdr.in_use && !worker->dbcount) + slotsync_worker_stop_internal(worker); + } Is it safe to stop this unguarded by SlotSyncWorkerLock locking? Is there a window where another dbid decides to reuse this worker at the same time this process is about to stop it? 
~~~ 26. primary_connect +/* + * Connect to primary server for slotsync purpose and return the connection + * info. Disconnect previous connection if provided in wrconn_prev. + */ /primary server/the primary server/ ~~~ 27. + if (!RecoveryInProgress()) + return NULL; + + if (max_slotsync_workers == 0) + return NULL; + + if (strcmp(synchronize_slot_names, "") == 0) + return NULL; + + /* The primary_slot_name is not set */ + if (!WalRcv || WalRcv->slotname[0] == '\0') + { + ereport(WARNING, + errmsg("Skipping slots synchronization as primary_slot_name " + "is not set.")); + return NULL; + } + + /* The hot_standby_feedback must be ON for slot-sync to work */ + if (!hot_standby_feedback) + { + ereport(WARNING, + errmsg("Skipping slots synchronization as hot_standby_feedback " + "is off.")); + return NULL; + } How come some of these checks giving WARNING that slot synchronization will be skipped, but others are just silently returning NULL? ~~~ 28. SaveCurrentSlotSyncConfigs +static void +SaveCurrentSlotSyncConfigs() +{ + PrimaryConnInfoPreReload = pstrdup(PrimaryConnInfo); + PrimarySlotNamePreReload = pstrdup(WalRcv->slotname); + SyncSlotNamesPreReload = pstrdup(synchronize_slot_names); +} Shouldn't this code also do pfree first? Otherwise these will slowly leak every time this function is called, right? ~~~ 29. SlotSyncConfigsChanged +static bool +SlotSyncConfigsChanged() +{ + if (strcmp(PrimaryConnInfoPreReload, PrimaryConnInfo) != 0) + return true; + + if (strcmp(PrimarySlotNamePreReload, WalRcv->slotname) != 0) + return true; + + if (strcmp(SyncSlotNamesPreReload, synchronize_slot_names) != 0) + return true; I felt those can all be combined to have 1 return instead of 3. ~~~ 30. + /* + * If we have reached this stage, it means original value of + * hot_standby_feedback was 'true', so consider it changed if 'false' now. + */ + if (!hot_standby_feedback) + return true; "If we have reached this stage" seems a bit vague. Can this have some more explanation? And, maybe also an Assert(hot_standby_feedback); is helpful in the calling code (before the config is reloaded)? ~~~ 31. ApplyLauncherStartSlotSync + * It connects to primary, get the list of DBIDs for slots configured in + * synchronize_slot_names. It then launces the slot-sync workers as per + * max_slotsync_workers and then assign the DBs equally to the workers + * launched. + */ SUGGESTION (fix typos etc) Connect to the primary, to get the list of DBIDs for slots configured in synchronize_slot_names. Then launch slot-sync workers (limited by max_slotsync_workers) where the DBs are distributed equally among those workers. ~~~ 32. +static void +ApplyLauncherStartSlotSync(long *wait_time, WalReceiverConn *wrconn) Why does this function even have 'Apply' in the name when it is nothing to do with an apply worker; looks like some cut/paste hangover. How about calling it something like 'LaunchSlotSyncWorkers' ~~~ 33. + /* If connection is NULL due to lack of correct configurations, return */ + if (!wrconn) + return; IMO it would be better to Assert wrconn in this function. If it is NULL then it should be checked a the caller, otherwise it just raises more questions -- like "who logged the warning about bad configuration" etc (which I already questions the NULL returns of primary_connect. ~~~ 34. + if (!OidIsValid(slot_db_data->database)) + continue; This represents some kind of integrity error doesn't it? Is it really OK just to silently skip such a thing? ~~~ 35. + /* + * If the worker is eligible to start now, launch it. 
Otherwise, + * adjust wait_time so that we'll wake up as soon as it can be + * started. + * + * Each apply worker can only be restarted once per + * wal_retrieve_retry_interval, so that errors do not cause us to + * repeatedly restart the worker as fast as possible. + */ 35a. I found the "we" part of "so that we'll wake up..." to be a bit misleading. There is no waiting in this function; that wait value is handed back to the caller to deal with. TBH, I did not really understand why it is even necessary tp separate the waiting calculation *per-worker* like this. It seems to overcomplicate things and it might even give results like 1st worker is not started but last works is started (if enough time elapsed in the loop). Why can't all this wait logic be done one time up front, and either (a) start all necessary workers, or (b) start none of them and wait a bit longer. ~ 35b. "Each apply worker". Why is this talking about "apply" workers? Maybe cut/paste error? ~~~ 36. + last_launch_tried = slot_db_data->last_launch_time; + now = GetCurrentTimestamp(); + if (last_launch_tried == 0 || + (elapsed = TimestampDifferenceMilliseconds(last_launch_tried, now)) >= + wal_retrieve_retry_interval) + { + slot_db_data->last_launch_time = now; + + slotsync_worker_launch_or_reuse(slot_db_data->database); + } + else + { + *wait_time = Min(*wait_time, + wal_retrieve_retry_interval - elapsed); + } 36a. IMO this might be simpler if you add another variable like bool 'launch_now': last_launch_tried = ... now = ... elapsed = ... launch_now = elapsed >= wal_retrieve_retry_interval; ~ 36b. Do you really care about checking "last_launch_tried == 0"; If it really is zero, then I thought the elapsed check should be enough. ~ 36c. Does this 'last_launch_time' really need to be in some shared memory? Won't a static variable suffice? ~~~ 37. ApplyLauncherStartSubs Wouldn't a better name for the function be something like 'LaunchSubscriptionApplyWorker'? (it is a better match for the suggested LaunchSlotSyncWorkers) ~~~ 38. ApplyLauncherMain Now that this is not only for Apply worker but also for SlotSync workers, maybe this function should be renamed as just LauncherMain, or something equally generic? ~~~ 39. + load_file("libpqwalreceiver", false); + + wrconn = primary_connect(NULL); + This connection did not exist in the HEAD code so I think it is added only for the slot-sync logic. IIUC it is still doing nothing for the non-slot-sync cases because primary_connect will silently return in that case: + if (!RecoveryInProgress()) + return NULL; IMO this is too sneaky, and it is misleading to see the normal apply worker launch apparently ccnnecting to something when it is not really doing so AFAIK. I think these conditions should be done explicity here at the caller to remove any such ambiguity. ~~~ 40. + if (!RecoveryInProgress()) + ApplyLauncherStartSubs(&wait_time); + else + ApplyLauncherStartSlotSync(&wait_time, wrconn); 40a. IMO this is deserving of a comment to explain why RecoveryInProgress means to perform the slot-synchronization. ~ 40b. Also, better to have positive check RecoveryInProgress() instead of !RecoveryInProgress() ~~~ 41. if (ConfigReloadPending) { + bool ssConfigChanged = false; + + SaveCurrentSlotSyncConfigs(); + ConfigReloadPending = false; ProcessConfigFile(PGC_SIGHUP); + + /* + * Stop the slot-sync workers if any of the related GUCs changed. + * These will be relaunched as per the new values during next + * sync-cycle. 
+ */ + ssConfigChanged = SlotSyncConfigsChanged(); + if (ssConfigChanged) + slotsync_workers_stop(); + + /* Reconnect in case primary_conninfo has changed */ + wrconn = primary_connect(wrconn); } } ~ 41a. The 'ssConfigChanged' assignement at declaration is not needed. Indeed, the whole variable is not really necessary because it is used only once. ~ 41b. /as per the new values/using the new values/ ~ 41c. + /* Reconnect in case primary_conninfo has changed */ + wrconn = primary_connect(wrconn); To avoid unnecessary reconnections, shouldn't this be done only if (ssConfigChanged). In fact, assuming the comment is correct, reconnect only if (strcmp(PrimaryConnInfoPreReload, PrimaryConnInfo) != 0) ====== src/backend/replication/logical/slotsync.c 42. wait_for_primary_slot_catchup + ereport(LOG, + errmsg("waiting for remote slot \"%s\" LSN (%u/%X) and catalog xmin" + " (%u) to pass local slot LSN (%u/%X) and and catalog xmin (%u)", + remote_slot->name, + LSN_FORMAT_ARGS(remote_slot->restart_lsn), + remote_slot->catalog_xmin, + LSN_FORMAT_ARGS(MyReplicationSlot->data.restart_lsn), + MyReplicationSlot->data.catalog_xmin)); AFAIK it is usual for the LSN format string to be %X/%X (not %u/%X like here). ~~~ 43. + appendStringInfo(&cmd, + "SELECT restart_lsn, confirmed_flush_lsn, catalog_xmin" + " FROM pg_catalog.pg_replication_slots" + " WHERE slot_name = %s", + quote_literal_cstr(remote_slot->name)); double space before FROM? ~~~ 44. synchronize_one_slot + /* + * We might not have the WALs retained locally corresponding to + * remote's restart_lsn if our local restart_lsn and/or local + * catalog_xmin is ahead of remote's one. And thus we can not create + * the local slot in sync with primary as that would mean moving local + * slot backward. Thus wait for primary's restart_lsn and catalog_xmin + * to catch up with the local ones and then do the sync. + */ + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn || + TransactionIdPrecedes(remote_slot->catalog_xmin, + MyReplicationSlot->data.catalog_xmin)) + { + if (!wait_for_primary_slot_catchup(wrconn, remote_slot)) + { + /* + * The remote slot didn't catch up to locally reserved + * position + */ + ReplicationSlotRelease(); + CommitTransactionCommand(); + return; + } SUGGESTION (comment is slightly simplified) If the local restart_lsn and/or local catalog_xmin is ahead of those on the remote then we cannot create the local slot in sync with primary because that would mean moving local slot backwards. In this case we will wait for primary's restart_lsn and catalog_xmin to catch up with the local one before attempting the sync. ====== Kind Regards, Peter Smith. Fujitsu Australia
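As a small standalone demonstration of the memmove() suggestion in review comment 24 above (removing one entry from the dbids array by shifting the tail down, instead of an element-by-element loop):

```c
#include <stdio.h>
#include <string.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

/* Remove the element at position idx by shifting the tail down one slot. */
static void
remove_dbid_at(Oid *dbids, int *dbcount, int idx)
{
	memmove(&dbids[idx], &dbids[idx + 1],
			(*dbcount - idx - 1) * sizeof(Oid));
	(*dbcount)--;
}

int
main(void)
{
	Oid			dbids[] = {16384, 16385, 16386, 16387};
	int			dbcount = 4;

	remove_dbid_at(dbids, &dbcount, 1);	/* drop 16385 */

	for (int i = 0; i < dbcount; i++)
		printf("%u\n", dbids[i]);
	return 0;
}
```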
On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Oct 3, 2023 at 9:27 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > Hi, > > > > > > On 10/3/23 12:54 PM, Amit Kapila wrote: > > > > On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand > > > > <bertranddrouvot.pg@gmail.com> wrote: > > > >> > > > >> On 9/29/23 1:33 PM, Amit Kapila wrote: > > > >>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand > > > >>> <bertranddrouvot.pg@gmail.com> wrote: > > > >>>> > > > >>> > > > >>>> - probably open corner cases like: what if a standby is down? would that mean > > > >>>> that synchronize_slot_names not being send to the primary would allow the decoding > > > >>>> on the primary to go ahead? > > > >>>> > > > >>> > > > >>> Good question. BTW, irrespective of whether we have > > > >>> 'standby_slot_names' parameters or not, how should we behave if > > > >>> standby is down? Say, if 'synchronize_slot_names' is only specified on > > > >>> standby then in such a situation primary won't be even aware that some > > > >>> of the logical walsenders need to wait. > > > >> > > > >> Exactly, that's why I was thinking keeping standby_slot_names to address > > > >> this scenario. In such a case one could simply decide to keep or remove > > > >> the associated physical replication slot from standby_slot_names. Keep would > > > >> mean "wait" and removing would mean allow to decode on the primary. > > > >> > > > >>> OTOH, one can say that users > > > >>> should configure 'synchronize_slot_names' on both primary and standby > > > >>> but note that this value could be different for different standby's, > > > >>> so we can't configure it on primary. > > > >>> > > > >> > > > >> Yeah, I think that's a good use case for standby_slot_names, what do you think? > > > >> > > > > > > > > But, even if we keep 'standby_slot_names' for this purpose, the > > > > primary doesn't know the value of 'synchronize_slot_names' once the > > > > standby is down and or the primary is restarted. So, how will we know > > > > which logical WAL senders needs to wait for 'standby_slot_names'? > > > > > > > > > > Yeah right, I also think we'd need: > > > > > > - synchronize_slot_names on both primary and standby > > > > > > But now we would need to take care of different standby having different values ( > > > as you said up-thread).... > > > > > > Thinking out loud: What about a single GUC on the primary (not standby_slot_names nor > > > synchronize_slot_names) but say logical_slots_wait_for_standby that could be a list of say > > > "logical_slot_name:physical_slot". > > > > > > I think this GUC would help us define each walsender behavior (should the standby(s) > > > be up or down): > > > > > > > It may help in defining the walsender's behaviour better for sure. But > > the problem I see once we start defining sync-slot-names on primary > > (in any form whether as independent GUC or as above mapping GUC) is > > that it needs to be then in sync with standbys, as each standby for > > sure needs to maintain its own sync-slot-names GUC to make it aware of > > what all it needs to sync. > > Yes, I also think so. Also, defining such a GUC where user wants to > sync all the slots which would normally be the case would be a night > mare for the users. 
> > > > > This brings us to the original question of > > how do we actually keep these configurations in sync between primary > > and standby if we plan to maintain it on both? > > > > > > > - don't wait if its associated logical_slot is not listed in this GUC > > > - or wait based on its associated "list" of mapped physical slots (would probably > > > have to deal with the min restart_lsn for all the corresponding mapped ones). > > > > > > I don't think we can avoid having to define at least one GUC on the primary (at least to > > > handle the case of standby(s) being down). > > > > > How about an alternate scheme where we define sync_slot_names on > standby but then store the physical_slot_name in the corresponding > logical slot (ReplicationSlotPersistentData) to be synced? So, the > standby will send the list of 'sync_slot_names' and the primary will > add the physical standby's slot_name in each of the corresponding > sync_slot. Now, if we do this then even after restart, we should be > able to know for which physical slot each logical slot needs to wait. > We can even provide an SQL API to reset the value of > standby_slot_names in logical slots as a way to unblock decoding in > case of emergency (for example, corresponding when physical standby > never comes up). > Looks like a better approach to me. It solves most of the pain points like: 1) Avoids the need of multiple GUCs 2) Primary and standby need not to worry to be in sync if we maintain sync-slot-names GUC on both 3) User still gets the flexibility to remove a standby from wait-lost of primary's logical-walsenders' using reset SQL API. Now some initial thoughts: 1) Since each logical slot could be needed to be synched by multiple physical-standbys, so in ReplicationSlotPersistentData, we need to hold a list of standby's name. So this brings us to question as in how much shall we allocate initially in shared-memory? Shall it be for max_replication_slots (worst case scenario) in each ReplicationSlotPersistentData to hold physical-standby names? 2) If standby sends '*', then we need to update each logical-slot with that standby-name. Or do we have better way to deal with '*'? Need to think more on this. JFYI, on the similar line, currently in ReplicationSlotPersistentData, we are maintaining a flag for slot-sync feature which is: bool synced; /* Is this a slot created by a sync-slot worker? */ This flag currently holds significance only on physical-standby. This has been added to distinguish between a slot created by user for logical decoding purpose and the ones being synced from primary. It is needed when we have to choose obsolete slots (synced ones) to drop on standby or block get_changes on standby for synced slots. It can be reused on primary for above approach if needed. thanks Shveta
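On point 2 (handling '*'): one simple option is to treat a lone '*' as matching every slot at lookup time. A minimal illustrative helper, assuming the GUC value has already been split into a List of name strings (this is not patch code, and the function name is made up):

```c
#include "postgres.h"

#include "nodes/pg_list.h"

/* Does synchronize_slot_names (already split into a List) cover slot_name? */
static bool
slot_in_sync_list(const char *slot_name, List *sync_slot_names)
{
	ListCell   *lc;

	/* a single "*" entry means: synchronize every logical slot */
	if (list_length(sync_slot_names) == 1 &&
		strcmp((char *) linitial(sync_slot_names), "*") == 0)
		return true;

	foreach(lc, sync_slot_names)
	{
		if (strcmp((char *) lfirst(lc), slot_name) == 0)
			return true;
	}

	return false;
}
```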
On Wed, Oct 4, 2023 at 9:56 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Oct 3, 2023 at 9:27 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand > > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > > Hi, > > > > > > > > On 10/3/23 12:54 PM, Amit Kapila wrote: > > > > > On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand > > > > > <bertranddrouvot.pg@gmail.com> wrote: > > > > >> > > > > >> On 9/29/23 1:33 PM, Amit Kapila wrote: > > > > >>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand > > > > >>> <bertranddrouvot.pg@gmail.com> wrote: > > > > >>>> > > > > >>> > > > > >>>> - probably open corner cases like: what if a standby is down? would that mean > > > > >>>> that synchronize_slot_names not being send to the primary would allow the decoding > > > > >>>> on the primary to go ahead? > > > > >>>> > > > > >>> > > > > >>> Good question. BTW, irrespective of whether we have > > > > >>> 'standby_slot_names' parameters or not, how should we behave if > > > > >>> standby is down? Say, if 'synchronize_slot_names' is only specified on > > > > >>> standby then in such a situation primary won't be even aware that some > > > > >>> of the logical walsenders need to wait. > > > > >> > > > > >> Exactly, that's why I was thinking keeping standby_slot_names to address > > > > >> this scenario. In such a case one could simply decide to keep or remove > > > > >> the associated physical replication slot from standby_slot_names. Keep would > > > > >> mean "wait" and removing would mean allow to decode on the primary. > > > > >> > > > > >>> OTOH, one can say that users > > > > >>> should configure 'synchronize_slot_names' on both primary and standby > > > > >>> but note that this value could be different for different standby's, > > > > >>> so we can't configure it on primary. > > > > >>> > > > > >> > > > > >> Yeah, I think that's a good use case for standby_slot_names, what do you think? > > > > >> > > > > > > > > > > But, even if we keep 'standby_slot_names' for this purpose, the > > > > > primary doesn't know the value of 'synchronize_slot_names' once the > > > > > standby is down and or the primary is restarted. So, how will we know > > > > > which logical WAL senders needs to wait for 'standby_slot_names'? > > > > > > > > > > > > > Yeah right, I also think we'd need: > > > > > > > > - synchronize_slot_names on both primary and standby > > > > > > > > But now we would need to take care of different standby having different values ( > > > > as you said up-thread).... > > > > > > > > Thinking out loud: What about a single GUC on the primary (not standby_slot_names nor > > > > synchronize_slot_names) but say logical_slots_wait_for_standby that could be a list of say > > > > "logical_slot_name:physical_slot". > > > > > > > > I think this GUC would help us define each walsender behavior (should the standby(s) > > > > be up or down): > > > > > > > > > > It may help in defining the walsender's behaviour better for sure. But > > > the problem I see once we start defining sync-slot-names on primary > > > (in any form whether as independent GUC or as above mapping GUC) is > > > that it needs to be then in sync with standbys, as each standby for > > > sure needs to maintain its own sync-slot-names GUC to make it aware of > > > what all it needs to sync. > > > > Yes, I also think so. 
Also, defining such a GUC where user wants to > > sync all the slots which would normally be the case would be a night > > mare for the users. > > > > > > > > This brings us to the original question of > > > how do we actually keep these configurations in sync between primary > > > and standby if we plan to maintain it on both? > > > > > > > > > > - don't wait if its associated logical_slot is not listed in this GUC > > > > - or wait based on its associated "list" of mapped physical slots (would probably > > > > have to deal with the min restart_lsn for all the corresponding mapped ones). > > > > > > > > I don't think we can avoid having to define at least one GUC on the primary (at least to > > > > handle the case of standby(s) being down). > > > > > > > > How about an alternate scheme where we define sync_slot_names on > > standby but then store the physical_slot_name in the corresponding > > logical slot (ReplicationSlotPersistentData) to be synced? So, the > > standby will send the list of 'sync_slot_names' and the primary will > > add the physical standby's slot_name in each of the corresponding > > sync_slot. Now, if we do this then even after restart, we should be > > able to know for which physical slot each logical slot needs to wait. > > We can even provide an SQL API to reset the value of > > standby_slot_names in logical slots as a way to unblock decoding in > > case of emergency (for example, corresponding when physical standby > > never comes up). > > > > > Looks like a better approach to me. It solves most of the pain points like: > 1) Avoids the need of multiple GUCs > 2) Primary and standby need not to worry to be in sync if we maintain > sync-slot-names GUC on both > 3) User still gets the flexibility to remove a standby from wait-lost > of primary's logical-walsenders' using reset SQL API. > > Now some initial thoughts: > 1) Since each logical slot could be needed to be synched by multiple > physical-standbys, so in ReplicationSlotPersistentData, we need to > hold a list of standby's name. So this brings us to question as in how > much shall we allocate initially in shared-memory? Shall it be for > max_replication_slots (worst case scenario) in each > ReplicationSlotPersistentData to hold physical-standby names? > > 2) If standby sends '*', then we need to update each logical-slot with > that standby-name. Or do we have better way to deal with '*'? Need to > think more on this. > > JFYI, on the similar line, currently in ReplicationSlotPersistentData, > we are maintaining a flag for slot-sync feature which is: > > bool synced; /* Is this a slot created by a > sync-slot worker? */ > > This flag currently holds significance only on physical-standby. This > has been added to distinguish between a slot created by user for > logical decoding purpose and the ones being synced from primary. It is > needed when we have to choose obsolete slots (synced ones) to drop on > standby or block get_changes on standby for synced slots. It can be > reused on primary for above approach if needed. > > thanks > Shveta The most simplistic approach would be: 1) maintain standby_slot_names GUC on primary 2) maintain synchronize_slot_names GUC on physical standby alone. On primary, let all logical-walsenders wait on physical-standbys configured in standby_slot_names GUC. This will work and will avoid all the complexity involved in designs discussed above. 
But this simplistic approach comes with disadvantages such as the following: 1) Even if the associated slot of a logical walsender is not part of synchronize_slot_names of any of the physical standbys, it still waits for all the configured standbys to finish. 2) If the associated slot of a logical walsender is part of synchronize_slot_names of standby1, it still waits on standby2, standby3, etc. to finish, i.e. it waits on the rest of the standbys configured in standby_slot_names even though they have not marked that logical slot in their synchronize_slot_names. So we need to weigh our options here. thanks Shveta
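For concreteness, the simplistic scheme above amounts to something like the settings below. standby_slot_names and synchronize_slot_names are the GUCs proposed in this thread rather than existing parameters, and the slot names are made up:

    -- On the primary: logical walsenders wait for these physical slots to
    -- confirm before sending decoded changes to subscribers (proposed GUC).
    ALTER SYSTEM SET standby_slot_names = 'standby1_phys, standby2_phys';
    SELECT pg_reload_conf();

    -- On each physical standby: the logical slots its sync worker should keep
    -- in sync with the primary (proposed GUC).
    ALTER SYSTEM SET synchronize_slot_names = 'sub1_slot';
    SELECT pg_reload_conf();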
On Mon, Oct 2, 2023 at 4:29 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Shveta, > > Thank you for updating the patch! > > I found another ERROR due to the slot removal. Is this a real issue? > > 1. applied add_sleep.txt, which emulated the case the tablesync worker stucked > and the primary crashed during the > initial sync. > 2. executed test_0925_v2.sh (You attached in [1]) > 3. secondary could not start the logical replication because the slot was not > created (log files were also attached). > > > Here is my analysis. The cause is that the slotsync worker aborts the slot creation > on secondary server because the restart_lsn of secondary ahead the primary's one. > IIUC it can be occurred when tablesync workers finishes initial copy before > walsenders stream changes. In this case, the relstate of the worker is set to > SUBREL_STATE_CATCHUP and the apply worker waits till the relation becomes > SUBREL_STATE_SYNCDONE. From here the slot on primary will not be updated until > the relation is caught up. If some changes are come and the primary crashes at > that time, the syncslot worker will abort the slot creation. > Kuroda-San, we need to let slot-creation on standby finish before we start expecting it to support logical replication on failover. In the current case, as you stated the slot-creation itself is aborted and thus it can not support logical-replication later. We are currently trying to think of possibilities to advance remote_lsn on primary internally by slot-sync workers in order to accelerate slot-creation on standby for cases where slot-creation is stuck due to primary's restart_lsn lagging behind standby's restart_lsn. But till then, the way to proceed for testing is to execute workload on primary for such cases in order to accelerate slot-creation. thanks Shveta
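As a minimal sketch of that testing workaround, assuming a throwaway table on the primary, something like the following is enough to generate WAL activity so that the primary's slot positions can advance and the standby's slot creation can complete:

    -- Run on the primary purely to produce some WAL and slot activity for testing.
    CREATE TABLE IF NOT EXISTS slot_sync_workload (id int);
    INSERT INTO slot_sync_workload SELECT generate_series(1, 1000);
    SELECT pg_switch_wal();  -- optionally force a WAL segment switch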
Hi, On 10/4/23 6:26 AM, shveta malik wrote: > On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Tue, Oct 3, 2023 at 9:27 PM shveta malik <shveta.malik@gmail.com> wrote: >>> >>> On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand >>> <bertranddrouvot.pg@gmail.com> wrote: >>>> >>>> Hi, >>>> >>>> On 10/3/23 12:54 PM, Amit Kapila wrote: >>>>> On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand >>>>> <bertranddrouvot.pg@gmail.com> wrote: >>>>>> >>>>>> On 9/29/23 1:33 PM, Amit Kapila wrote: >>>>>>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand >>>>>>> <bertranddrouvot.pg@gmail.com> wrote: >>>>>>>> >>>>>>> >>>>>>>> - probably open corner cases like: what if a standby is down? would that mean >>>>>>>> that synchronize_slot_names not being send to the primary would allow the decoding >>>>>>>> on the primary to go ahead? >>>>>>>> >>>>>>> >>>>>>> Good question. BTW, irrespective of whether we have >>>>>>> 'standby_slot_names' parameters or not, how should we behave if >>>>>>> standby is down? Say, if 'synchronize_slot_names' is only specified on >>>>>>> standby then in such a situation primary won't be even aware that some >>>>>>> of the logical walsenders need to wait. >>>>>> >>>>>> Exactly, that's why I was thinking keeping standby_slot_names to address >>>>>> this scenario. In such a case one could simply decide to keep or remove >>>>>> the associated physical replication slot from standby_slot_names. Keep would >>>>>> mean "wait" and removing would mean allow to decode on the primary. >>>>>> >>>>>>> OTOH, one can say that users >>>>>>> should configure 'synchronize_slot_names' on both primary and standby >>>>>>> but note that this value could be different for different standby's, >>>>>>> so we can't configure it on primary. >>>>>>> >>>>>> >>>>>> Yeah, I think that's a good use case for standby_slot_names, what do you think? >>>>>> >>>>> >>>>> But, even if we keep 'standby_slot_names' for this purpose, the >>>>> primary doesn't know the value of 'synchronize_slot_names' once the >>>>> standby is down and or the primary is restarted. So, how will we know >>>>> which logical WAL senders needs to wait for 'standby_slot_names'? >>>>> >>>> >>>> Yeah right, I also think we'd need: >>>> >>>> - synchronize_slot_names on both primary and standby >>>> >>>> But now we would need to take care of different standby having different values ( >>>> as you said up-thread).... >>>> >>>> Thinking out loud: What about a single GUC on the primary (not standby_slot_names nor >>>> synchronize_slot_names) but say logical_slots_wait_for_standby that could be a list of say >>>> "logical_slot_name:physical_slot". >>>> >>>> I think this GUC would help us define each walsender behavior (should the standby(s) >>>> be up or down): >>>> >>> >>> It may help in defining the walsender's behaviour better for sure. But >>> the problem I see once we start defining sync-slot-names on primary >>> (in any form whether as independent GUC or as above mapping GUC) is >>> that it needs to be then in sync with standbys, as each standby for >>> sure needs to maintain its own sync-slot-names GUC to make it aware of >>> what all it needs to sync. >> >> Yes, I also think so. Also, defining such a GUC where user wants to >> sync all the slots which would normally be the case would be a night >> mare for the users. >> >>> >>> This brings us to the original question of >>> how do we actually keep these configurations in sync between primary >>> and standby if we plan to maintain it on both? 
>>> >>> >>>> - don't wait if its associated logical_slot is not listed in this GUC >>>> - or wait based on its associated "list" of mapped physical slots (would probably >>>> have to deal with the min restart_lsn for all the corresponding mapped ones). >>>> >>>> I don't think we can avoid having to define at least one GUC on the primary (at least to >>>> handle the case of standby(s) being down). >>>> >> >> How about an alternate scheme where we define sync_slot_names on >> standby but then store the physical_slot_name in the corresponding >> logical slot (ReplicationSlotPersistentData) to be synced? So, the >> standby will send the list of 'sync_slot_names' and the primary will >> add the physical standby's slot_name in each of the corresponding >> sync_slot. Now, if we do this then even after restart, we should be >> able to know for which physical slot each logical slot needs to wait. >> We can even provide an SQL API to reset the value of >> standby_slot_names in logical slots as a way to unblock decoding in >> case of emergency (for example, corresponding when physical standby >> never comes up). >> > > > Looks like a better approach to me. It solves most of the pain points like: > 1) Avoids the need of multiple GUCs > 2) Primary and standby need not to worry to be in sync if we maintain > sync-slot-names GUC on both > 3) User still gets the flexibility to remove a standby from wait-lost > of primary's logical-walsenders' using reset SQL API. > Fully agree. > Now some initial thoughts: > 1) Since each logical slot could be needed to be synched by multiple > physical-standbys, so in ReplicationSlotPersistentData, we need to > hold a list of standby's name. So this brings us to question as in how > much shall we allocate initially in shared-memory? Shall it be for > max_replication_slots (worst case scenario) in each > ReplicationSlotPersistentData to hold physical-standby names? > Yeah, and even if we do the opposite means add the 'to-sync' logical replication slot in the ReplicationSlotPersistentData of the physical slot(s) the questions still remain (as a physical standby could want to sync multiples slots) > 2) If standby sends '*', then we need to update each logical-slot with > that standby-name. Or do we have better way to deal with '*'? Need to > think more on this. > > JFYI, on the similar line, currently in ReplicationSlotPersistentData, > we are maintaining a flag for slot-sync feature which is: > > bool synced; /* Is this a slot created by a > sync-slot worker? */ > > This flag currently holds significance only on physical-standby. This > has been added to distinguish between a slot created by user for > logical decoding purpose and the ones being synced from primary. BTW, what about having this "user visible" through pg_replication_slots? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
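To illustrate that last suggestion, exposing the flag could look roughly like the query below when run on a standby; the synced column is only the proposed addition being discussed here, not something pg_replication_slots currently provides:

    -- Hypothetical 'synced' column: distinguishes slots copied from the primary
    -- by the slot-sync worker from logical slots created locally by a user.
    SELECT slot_name, database, synced
      FROM pg_catalog.pg_replication_slots
     WHERE slot_type = 'logical';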
Hi, On 10/4/23 7:00 AM, shveta malik wrote: > On Wed, Oct 4, 2023 at 9:56 AM shveta malik <shveta.malik@gmail.com> wrote: > The most simplistic approach would be: > > 1) maintain standby_slot_names GUC on primary > 2) maintain synchronize_slot_names GUC on physical standby alone. > > On primary, let all logical-walsenders wait on physical-standbys > configured in standby_slot_names GUC. This will work and will avoid > all the complexity involved in designs discussed above. But this > simplistic approach comes with disadvantages like below: > > 1) Even if the associated slot of logical-walsender is not part of > synchronize_slot_names of any of the physical-standbys, it is still > waiting for all the configured standbys to finish. That's right. Currently (with the walsender waiting an arbitrary amount of time) that sounds like a -1. But if we're going with a new CV approach (like the one proposed in [1]), that might not be so terrible. Though I don't feel comfortable with waiting for no reason (even if it is possibly only for a short amount of time). > 2) If associated slot of logical walsender is part of > synchronize_slot_names of standby1, it is still waiting on standby2,3 > etc to finish i.e. waiting on rest of the standbys configured in > standby_slot_names which have not even marked that logical slot in > their synchronize_slot_names. > Same thoughts as above for 1) > So we need to weigh our options here. > With the simplistic approach, if a standby goes down, that would impact unrelated walsenders on the primary until the standby's associated physical slot is removed from standby_slot_names, and I don't feel comfortable with this behavior. So, I'm +1 for the ReplicationSlotPersistentData approach proposed by Amit. [1]: https://www.postgresql.org/message-id/CAA4eK1LNjgL6Lghgu1PcDfuoOfa8Ug4J7Uv-H%3DBPP8Wgf1%2BpOw%40mail.gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Oct 4, 2023 at 11:55 AM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 10/4/23 6:26 AM, shveta malik wrote: > > On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> > >> How about an alternate scheme where we define sync_slot_names on > >> standby but then store the physical_slot_name in the corresponding > >> logical slot (ReplicationSlotPersistentData) to be synced? So, the > >> standby will send the list of 'sync_slot_names' and the primary will > >> add the physical standby's slot_name in each of the corresponding > >> sync_slot. Now, if we do this then even after restart, we should be > >> able to know for which physical slot each logical slot needs to wait. > >> We can even provide an SQL API to reset the value of > >> standby_slot_names in logical slots as a way to unblock decoding in > >> case of emergency (for example, corresponding when physical standby > >> never comes up). > >> > > > > > > Looks like a better approach to me. It solves most of the pain points like: > > 1) Avoids the need of multiple GUCs > > 2) Primary and standby need not to worry to be in sync if we maintain > > sync-slot-names GUC on both As per my understanding of this approach, we don't want 'sync-slot-names' to be set on the primary. Do you have a different understanding? > > 3) User still gets the flexibility to remove a standby from wait-lost > > of primary's logical-walsenders' using reset SQL API. > > > > Fully agree. > > > Now some initial thoughts: > > 1) Since each logical slot could be needed to be synched by multiple > > physical-standbys, so in ReplicationSlotPersistentData, we need to > > hold a list of standby's name. So this brings us to question as in how > > much shall we allocate initially in shared-memory? Shall it be for > > max_replication_slots (worst case scenario) in each > > ReplicationSlotPersistentData to hold physical-standby names? > > > > Yeah, and even if we do the opposite means add the 'to-sync' > logical replication slot in the ReplicationSlotPersistentData of the physical > slot(s) the questions still remain (as a physical standby could want to > sync multiples slots) > I think we don't need to allocate the entire max_replication_slots array in ReplicationSlotPersistentData. We should design something like the variable amount of data to be written on disk should be represented similar to what we do with variable TransactionIds in SnapBuildOnDisk. Now, we also need to store the list of standby's in-memory either shared or local memory of walsender. I think storing it in shared-memory say in ReplicationSlot has the advantage that we can easily set that via physical walsender and it may be easier to maintain both for manually created logical slots and logical slots associated with logical walsenders. But still this needs some thoughts as to what is the best way to store this information. > > 2) If standby sends '*', then we need to update each logical-slot with > > that standby-name. Or do we have better way to deal with '*'? Need to > > think more on this. > > I can't see any better way. > > JFYI, on the similar line, currently in ReplicationSlotPersistentData, > > we are maintaining a flag for slot-sync feature which is: > > > > bool synced; /* Is this a slot created by a > > sync-slot worker? */ > > > > This flag currently holds significance only on physical-standby. This > > has been added to distinguish between a slot created by user for > > logical decoding purpose and the ones being synced from primary. 
> > BTW, what about having this "user visible" through pg_replication_slots? > We can do that. -- With Regards, Amit Kapila.
On Wed, Oct 4, 2023 at 5:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Oct 4, 2023 at 11:55 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > On 10/4/23 6:26 AM, shveta malik wrote: > > > On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > >> > > >> > > >> How about an alternate scheme where we define sync_slot_names on > > >> standby but then store the physical_slot_name in the corresponding > > >> logical slot (ReplicationSlotPersistentData) to be synced? So, the > > >> standby will send the list of 'sync_slot_names' and the primary will > > >> add the physical standby's slot_name in each of the corresponding > > >> sync_slot. Now, if we do this then even after restart, we should be > > >> able to know for which physical slot each logical slot needs to wait. > > >> We can even provide an SQL API to reset the value of > > >> standby_slot_names in logical slots as a way to unblock decoding in > > >> case of emergency (for example, corresponding when physical standby > > >> never comes up). > > >> > > > > > > > > > Looks like a better approach to me. It solves most of the pain points like: > > > 1) Avoids the need of multiple GUCs > > > 2) Primary and standby need not to worry to be in sync if we maintain > > > sync-slot-names GUC on both > > As per my understanding of this approach, we don't want > 'sync-slot-names' to be set on the primary. Do you have a different > understanding? > Same understanding. We do not need it to be set on primary by user. It will be GUC on standby and standby will convey it to primary. > > > 3) User still gets the flexibility to remove a standby from wait-lost > > > of primary's logical-walsenders' using reset SQL API. > > > > > > > Fully agree. > > > > > Now some initial thoughts: > > > 1) Since each logical slot could be needed to be synched by multiple > > > physical-standbys, so in ReplicationSlotPersistentData, we need to > > > hold a list of standby's name. So this brings us to question as in how > > > much shall we allocate initially in shared-memory? Shall it be for > > > max_replication_slots (worst case scenario) in each > > > ReplicationSlotPersistentData to hold physical-standby names? > > > > > > > Yeah, and even if we do the opposite means add the 'to-sync' > > logical replication slot in the ReplicationSlotPersistentData of the physical > > slot(s) the questions still remain (as a physical standby could want to > > sync multiples slots) > > > > I think we don't need to allocate the entire max_replication_slots > array in ReplicationSlotPersistentData. We should design something > like the variable amount of data to be written on disk should be > represented similar to what we do with variable TransactionIds in > SnapBuildOnDisk. Now, we also need to store the list of standby's > in-memory either shared or local memory of walsender. I think storing > it in shared-memory say in ReplicationSlot has the advantage that we > can easily set that via physical walsender and it may be easier to > maintain both for manually created logical slots and logical slots > associated with logical walsenders. But still this needs some thoughts > as to what is the best way to store this information. > Thanks for the idea, I will review this. thanks Shveta
On Wed, Oct 4, 2023 at 12:08 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 10/4/23 7:00 AM, shveta malik wrote: > > On Wed, Oct 4, 2023 at 9:56 AM shveta malik <shveta.malik@gmail.com> wrote: > > > The most simplistic approach would be: > > > > 1) maintain standby_slot_names GUC on primary > > 2) maintain synchronize_slot_names GUC on physical standby alone. > > > > On primary, let all logical-walsenders wait on physical-standbys > > configured in standby_slot_names GUC. This will work and will avoid > > all the complexity involved in designs discussed above. But this > > simplistic approach comes with disadvantages like below: > > > > 1) Even if the associated slot of logical-walsender is not part of > > synchronize_slot_names of any of the physical-standbys, it is still > > waiting for all the configured standbys to finish. > > That's right. Currently (with walsender waiting an arbitrary amount of time) > that sounds like a -1. But if we're going with a new CV approach (like proposed > in [1], that might not be so terrible). Though I don't feel comfortable with > waiting for no reasons (even if this is for a short amount of time possible). > Agreed. Not a good idea to block each logical walsender. > > 2) If associated slot of logical walsender is part of > > synchronize_slot_names of standby1, it is still waiting on standby2,3 > > etc to finish i.e. waiting on rest of the standbys configured in > > standby_slot_names which have not even marked that logical slot in > > their synchronize_slot_names. > > > > Same thoughts as above for 1) > > > So we need to weigh our options here. > > > > With the simplistic approach, if a standby goes down that would impact non related > walsenders on the primary until the standby's associated physical slot is removed > from standby_slot_names and I don't feel comfortable wit this behavior. > > So, I'm +1 for the ReplicationSlotPersistentData approach proposed by Amit. yes, +1 for ReplicationSlotPersistentData approach. Will start detailed analysis on that approach now. thanks Shveta
Hi, On 10/4/23 1:50 PM, shveta malik wrote: > On Wed, Oct 4, 2023 at 5:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Wed, Oct 4, 2023 at 11:55 AM Drouvot, Bertrand >> <bertranddrouvot.pg@gmail.com> wrote: >>> >>> On 10/4/23 6:26 AM, shveta malik wrote: >>>> On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >>>>> >>>>> >>>>> How about an alternate scheme where we define sync_slot_names on >>>>> standby but then store the physical_slot_name in the corresponding >>>>> logical slot (ReplicationSlotPersistentData) to be synced? So, the >>>>> standby will send the list of 'sync_slot_names' and the primary will >>>>> add the physical standby's slot_name in each of the corresponding >>>>> sync_slot. Now, if we do this then even after restart, we should be >>>>> able to know for which physical slot each logical slot needs to wait. >>>>> We can even provide an SQL API to reset the value of >>>>> standby_slot_names in logical slots as a way to unblock decoding in >>>>> case of emergency (for example, corresponding when physical standby >>>>> never comes up). >>>>> >>>> >>>> >>>> Looks like a better approach to me. It solves most of the pain points like: >>>> 1) Avoids the need of multiple GUCs >>>> 2) Primary and standby need not to worry to be in sync if we maintain >>>> sync-slot-names GUC on both >> >> As per my understanding of this approach, we don't want >> 'sync-slot-names' to be set on the primary. Do you have a different >> understanding? >> > > Same understanding. We do not need it to be set on primary by user. It > will be GUC on standby and standby will convey it to primary. +1, same understanding here. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Sep 27, 2023 at 2:37 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some more review comments for the patch v19-0002. > > This is a WIP.... these review comments are all for the file slotsync.c > > ====== > src/backend/replication/logical/slotsync.c > > 1. wait_for_primary_slot_catchup > > + WalRcvExecResult *res; > + TupleTableSlot *slot; > + Oid slotRow[1] = {LSNOID}; > + StringInfoData cmd; > + bool isnull; > + XLogRecPtr restart_lsn; > + > + for (;;) > + { > + int rc; > > I could not recognize a reason why 'rc' is declared within the loop, > but none of the other local variables are. Personally, I'd declare all > variables at the deepest scope (e.g. inside the for loop). > fixed. > ~~~ > > 2. get_local_synced_slot_names > > +/* > + * Get list of local logical slot names which are synchronized from > + * primary and belongs to one of the DBs passed in. > + */ > +static List * > +get_local_synced_slot_names(Oid *dbids) > +{ > > IIUC, this function gets called only from the drop_obsolete_slots() > function. But I thought this list of local slot names (i.e. for the > dbids that this worker is handling) would be something that perhaps > could the initialized one time for the worker, instead of it being > re-calculated every single time the slots processing/dropping happens. > Isn't the current code expending too much effort recalculating over > and over but giving back the same list every time? > The reason this is being done is because the dblist could be changed at any time by the launcher, which requires us to recalculate the list of slots specific to each workers dblist. > ~~~ > > 3. get_local_synced_slot_names > > + for (int i = 0; i < max_replication_slots; i++) > + { > + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; > + > + /* Check if it is logical synchronized slot */ > + if (s->in_use && SlotIsLogical(s) && s->data.synced) > + { > + for (int j = 0; j < MySlotSyncWorker->dbcount; j++) > + { > > Loop variables are not declared in the common PG code way. > fixed. > ~~~ > > 4. slot_exists_locally > > +static bool > +slot_exists_locally(List *remote_slots, ReplicationSlot *local_slot, > + bool *locally_invalidated) > +{ > + ListCell *cell; > + > + foreach(cell, remote_slots) > + { > + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell); > + > + if (strcmp(remote_slot->name, NameStr(local_slot->data.name)) == 0) > + { > + /* > + * if remote slot is marked as non-conflicting (i.e. not > + * invalidated) but local slot is marked as invalidated, then set > + * the bool. > + */ > + if (!remote_slot->conflicting && > + SlotIsLogical(local_slot) && > + local_slot->data.invalidated != RS_INVAL_NONE) > + *locally_invalidated = true; > + > + return true; > + } > + } > + > + return false; > +} > > Why is there a SlotIsLogical(local_slot) check buried in this > function? How is slot_exists_locally() getting called with a > non-logical local_slot? Shouldn't that have been screened out long > before here? > Removed that because it is redundant. > ~~~ > > 5. use_slot_in_query > > +static bool > +use_slot_in_query(char *slot_name, Oid *dbids) > > There are multiple non-standard for-loop variable declarations in this function. > fixed. > ~~~ > > 6. compute_naptime > > + * The first slot managed by each worker is chosen for monitoring purpose. > + * If the lsn of that slot changes during each sync-check time, then the > + * nap time is kept at regular value of WORKER_DEFAULT_NAPTIME_MS. 
> + * When no lsn change is observed for WORKER_INACTIVITY_THRESHOLD_MS > + * time, then the nap time is increased to WORKER_INACTIVITY_NAPTIME_MS. > + * This nap time is brought back to WORKER_DEFAULT_NAPTIME_MS as soon as > + * lsn change is observed. > > 6a. > /regular value/the regular value/ > > /for WORKER_INACTIVITY_THRESHOLD_MS time/within the threshold period > (WORKER_INACTIVITY_THRESHOLD_MS)/ > Fixed. > ~ > > 6b. > /as soon as lsn change is observed./as soon as another lsn change is observed./ > fixed. > ~~~ > > 7. > + * The caller is supposed to ignore return-value of 0. The 0 value is returned > + * for the slots other that slot being monitored. > + */ > +static long > +compute_naptime(RemoteSlot *remote_slot) > > This rule about the returning 0 seemed hacky to me. IMO this would be > a better API to pass long *naptime (which this function either updates > or doesn't update, depending on this being the "monitored" slot. > Knowing the current naptime is also useful to improve the function > logic (see the next review comment below). > > Also, since this function is really only toggling naptime between 2 > values, it would be helpful to assert that > > Assert(*naptime == WORKER_DEFAULT_NAPTIME_MS || *naptime == > WORKER_INACTIVITY_NAPTIME_MS); > fixed. > ~~~ > > 8. > + if (NameStr(MySlotSyncWorker->monitoring_info.slot_name)[0] == '\0') > + { > + /* > + * First time, just update the name and lsn and return regular > + * nap time. Start comparison from next time onward. > + */ > + strcpy(NameStr(MySlotSyncWorker->monitoring_info.slot_name), > + remote_slot->name); > > I wasn't sure why it was necessary to identify the "monitoring" slot > by name. Why doesn't the compute_naptime just get called only for the > 1st slot found in the tuple loop instead of all the strcmp business > trying to match monitor names? > > And, if the monitored slot gets "dropped", then so what; next time > another slot will be the first tuple so will automatically take its > place, right? > Yes, that is correct. Fixed as commented. > ~~~ > > 9. > + /* > + * If new received lsn (remote one) is different from what we have in > + * our local slot, then update last_update_time. > + */ > + if (MySlotSyncWorker->monitoring_info.confirmed_lsn != > + remote_slot->confirmed_lsn) > + MySlotSyncWorker->monitoring_info.last_update_time = now; > + > + MySlotSyncWorker->monitoring_info.confirmed_lsn = > + remote_slot->confirmed_lsn; > > Doesn't it make more sense to also put that 'confirmed_lsn' assignment > under the same condition? e.g. No need to overwrite the same value > again. > Fixed. > ~~~ > > 10. > + /* If the inactivity time reaches the threshold, increase nap time */ > + if (TimestampDifferenceExceeds(MySlotSyncWorker->monitoring_info.last_update_time, > + now, WORKER_INACTIVITY_THRESHOLD_MS)) > + return WORKER_INACTIVITY_NAPTIME_MS; > + else > + return WORKER_DEFAULT_NAPTIME_MS; > + } > > Somehow this feels overcomplicated to me. > > In reality, the naptime is only toggling between 2 values (DEFAULT and > INACTIVITY) so we should never need to be testing > TimestampDifferenceExceeds again and again on subsequent calls (there > might be 1000s of them) > > Once naptime is WORKER_INACTIVITY_NAPTIME_MS we know to reset it back > to WORKER_DEFAULT_NAPTIME_MS only if > (MySlotSyncWorker->monitoring_info.confirmed_lsn != > remote_slot->confirmed_lsn) is detected. 
> > Basically, I think the algorithm should be like the code below: > > TimestampTz now = GetCurrentTimestamp(); > > if (MySlotSyncWorker->monitoring_info.confirmed_lsn != > remote_slot->confirmed_lsn) > { > MySlotSyncWorker->monitoring_info.last_update_time = now; > MySlotSyncWorker->monitoring_info.confirmed_lsn = remote_slot->confirmed_lsn; > > /* Something changed; reset naptime to default. */ > *naptime = WORKER_DEFAULT_NAPTIME_MS; > } > else > { > if (*naptime == WORKER_DEFAULT_NAPTIME_MS) > { > /* If the inactivity time reaches the threshold, increase nap time. */ > if (TimestampDifferenceExceeds(MySlotSyncWorker->monitoring_info.last_update_time, > now, WORKER_INACTIVITY_THRESHOLD_MS)) > *naptime = WORKER_INACTIVITY_NAPTIME_MS; > } > } > Fixed as suggested. > ~~~ > > 11. get_remote_invalidation_cause > > +/* > + * Get Remote Slot's invalidation cause. > + * > + * This gets invalidation cause of remote slot. > + */ > +static ReplicationSlotInvalidationCause > +get_remote_invalidation_cause(WalReceiverConn *wrconn, char *slot_name) > +{ > > Isn't that function comment just repeating itself? > Fixed. > ~~~ > > 12. > + initStringInfo(&cmd); > + appendStringInfo(&cmd, > + "select pg_get_slot_invalidation_cause(%s)", > + quote_literal_cstr(slot_name)); > > Use uppercase "SELECT" for consistency with other SQL. > Fixed. > ~~~ > > 13. > + /* Make things live outside TX context */ > + MemoryContextSwitchTo(oldctx); > + > + initStringInfo(&cmd); > + appendStringInfo(&cmd, > + "select pg_get_slot_invalidation_cause(%s)", > + quote_literal_cstr(slot_name)); > + res = walrcv_exec(wrconn, cmd.data, 1, slotRow); > + pfree(cmd.data); > + > + CommitTransactionCommand(); > + > + /* Switch to oldctx we saved */ > + MemoryContextSwitchTo(oldctx); > > There are 2x MemoryContextSwitchTo(oldctx) here. Is that deliberate? > Yes, that is required as both start transaction and commit transaction could change memory context. > ~~~ > > 14. > + if (res->status != WALRCV_OK_TUPLES) > + ereport(ERROR, > + (errmsg("could not fetch invalidation cuase for slot \"%s\" from" > + " primary: %s", slot_name, res->err))); > > typo /cuase/cause/ > fixed. > ~~~ > > 15. > + slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); > + if (!tuplestore_gettupleslot(res->tuplestore, true, false, slot)) > + ereport(ERROR, > + (errmsg("slot \"%s\" disapeared from the primary", > + slot_name))); > > typo /disapeared/disappeared/ > > ~~~ > > > 16. drop_obsolete_slots > > +/* > + * Drop obsolete slots > + * > + * Drop the slots which no longer need to be synced i.e. these either > + * do not exist on primary or are no longer part of synchronize_slot_names. > + * > + * Also drop the slots which are valid on primary and got invalidated > + * on standby due to conflict (say required rows removed on primary). > + * The assumption is, these will get recreated in next sync-cycle and > + * it is okay to drop and recreate such slots as long as these are not > + * consumable on standby (which is the case currently). > + */ > > /which no/that no/ > > /which are/that are/ > > /these will/that these will/ > > /and got invalidated/that got invalidated/ > Fixed. > ~~~ > > 17. 
> + /* If this slot is being monitored, clean-up the monitoring info */ > + if (strcmp(NameStr(local_slot->data.name), > + NameStr(MySlotSyncWorker->monitoring_info.slot_name)) == 0) > + { > + MemSet(NameStr(MySlotSyncWorker->monitoring_info.slot_name), 0, NAMEDATALEN); > + MySlotSyncWorker->monitoring_info.confirmed_lsn = 0; > + MySlotSyncWorker->monitoring_info.last_update_time = 0; > + } > > Maybe it is better to assign InvalidXLogRecPtr instead of 0 to the cleared lsn. > Removed this as the slot_name is no longer required in this structure. > ~ > > Alternatively, consider just zapping the entire monitoring_info > structure in one go: > MemSet(&MySlotSyncWorker->monitoring_info, 0, > sizeof(MySlotSyncWorker->monitoring_info)); > Code removed. > ~~~ > > 18. construct_slot_query (calling use_slot_in_query) > > This separation of functions (use_slot_in_query / > construct_slot_query) seems awkward to me. The use_slot_in_query() > function is only called by construct_slot_query(). I felt it might be > simpler to keep all the logical with the construct_slot_query(). > > Furthermore, it seemed strange to iterate all the DBs (to populate the > "WHERE database IN" clause) and then iterate all the DBs multiple > times again in use_slot_in_query (looking for slots to populate the > "AND slot_name IN (" clause). > > Maybe I misunderstand the reason for this structuring, but IMO it > would be simpler code to keep all the logic in construct_slot_query() > like: > > a. Initialize with empty dblist, empty slotlist. > b. Iterate all dbids > - constructing the dblist as you go > - constructing the slot list as you go (if synchronize_slot_names is > not "" or "*") > c. Finally, build the query: basic + dblist-clause + optional slotlist-clause > This I feel will make it more complicated as to get dbid of slot, we need to search hash, which requires locking, so keeping that seperate. > ~~~ > > 19. construct_slot_query > > Why does this function return a boolean? I only see it returns true, > but never false. > Fixed. > ~~~ > > 20. > + { > + ListCell *lc; > + bool first_slot = true; > + > + > + foreach(lc, sync_slot_names_list) > > Unnecessary blank line. > > ~~~ > > 21. synchronize_one_slot > > +/* > + * Synchronize single slot to given position. > + * > + * This creates new slot if there is no existing one and updates the > + * metadata of existing slots as per the data received from the primary. > + */ > +static void > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) > > /creates new slot/creates a new slot/ > > /metadata of existing slots/metadata of the slot/ > > ~~~ > > 22 > > + /* Search for the named slot and mark it active if we find it. */ > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > + for (int i = 0; i < max_replication_slots; i++) > + { > + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; > + > + if (!s->in_use) > + continue; > + > + if (strcmp(NameStr(s->data.name), remote_slot->name) == 0) > + { > + found = true; > + break; > + } > + } > + LWLockRelease(ReplicationSlotControlLock); > 22a. > "and mark it active if we find it." -- What code here is marking > anything active? > > ~ > > 22b. > Uncommon style of loop variable declaration > Fixed all above. > ~ > > 22c. > IMO it is over-complicated code; e.g. 
same loop can be written like this: > > SUGGESTION > for (i = 0; i < max_replication_slots && !found; i++) > { > ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; > > if (s->in_use) > found = (strcmp(NameStr(s->data.name), remote_slot->name) == 0); > } > Fixed as suggested. > ~~~ > > 23. synchronize_slots > > + /* Construct query to get slots info from the primary */ > + initStringInfo(&s); > + if (!construct_slot_query(&s, dbids)) > + { > + pfree(s.data); > + CommitTransactionCommand(); > + LWLockRelease(SlotSyncWorkerLock); > + return naptime; > + } > > As noted elsewhere, it seems construct_slot_query() will never return > false and so this block of code is unreachable. > Removed this code. > ~~~ > > 24. > + /* Create list of remote slot names to be used by drop_obsolete_slots */ > + remote_slot_list = lappend(remote_slot_list, remote_slot); > > This is a list of slots, not just slot names. > Fixed. > ~~~ > > 25. > + /* > + * Update nap time in case of non-zero value returned. The zero value > + * is returned if remote_slot is not the one being monitored. > + */ > + value = compute_naptime(remote_slot); > + if (value) > + naptime = value; > > If the compute_naptime API is changed as suggested in a prior review > comment then this can be simplified to something like: > > SUGGESTION: > /* Update nap time as required depending on slot activity. */ > compute_naptime(remote_slot, &naptime); > Fixed. > ~~~ > > 26. > + /* > + * Drop local slots which no longer need to be synced i.e. these either do > + * not exist on primary or are no longer part of synchronize_slot_names. > + */ > + drop_obsolete_slots(dbids, remote_slot_list); > > /which no longer/that no longer/ > > I thought it might be better to omit the "i.e." part. Just leave it to > the function-header of drop_obsolete_slots for a detailed explanation > about *which* slots are candidates for dropping. > Fixed. > ~ > > 27. > + /* We are done, free remot_slot_list elements */ > + foreach(cell, remote_slot_list) > + { > + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell); > + > + pfree(remote_slot); > + } > > 27a. > /remot_slot_list/remote_slot_list/ > Fixed. > ~ > > 27b. > Isn't this just the same as the one-liner: > > list_free_deep(remote_slot_list); > > ~~~ > > 28. > +/* > + * Initialize the list from raw synchronize_slot_names and cache it, in order > + * to avoid parsing it repeatedly. Done at slot-sync worker startup and after > + * each SIGHUP. > + */ > +static void > +SlotSyncInitSlotNamesList() > +{ > + char *rawname; > + > + if (strcmp(synchronize_slot_names, "") != 0 && > + strcmp(synchronize_slot_names, "*") != 0) > + { > + rawname = pstrdup(synchronize_slot_names); > + SplitIdentifierString(rawname, ',', &sync_slot_names_list); > + } > +} > > 28a. > Why this static function name is camel-case, unlike all the others? > Fixed. > ~ > > 28b. > What about when the sync_slot_names_list changes from value to "" or > "*". Shouldn't this function be setting sync_slot_names_list = NIL for > that scenario? > I modified this logic to free sync_slot_names_list prior to setting and initializing it to NIL. > ~~~ > > 29. remote_connect > > +/* > + * Connect to remote (primary) server. > + * > + * This uses primary_conninfo in order to connect to primary. For slot-sync > + * to work, primary_conninfo is expected to have dbname as well. > + */ > +static WalReceiverConn * > +remote_connect() > > 29a. 
> I felt it might be more helpful to say "GUC primary_conninfo" instead > of just 'primary_conninfo' the first time this is mentioned. > fixed. > ~ > > 29b. > /connect to primary/connect to the primary/ > > ~ > > 29c. > /is expected to have/is required to specify/ > > ~~~ > > 30. reconnect_if_needed > > +/* > + * Reconnect to remote (primary) server if PrimaryConnInfo got changed. > + */ > +static WalReceiverConn * > +reconnect_if_needed(WalReceiverConn *wrconn_prev, char *conninfo_prev) > > /got changed/has changed/ > > ~~~ > > 31. > +static WalReceiverConn * > +reconnect_if_needed(WalReceiverConn *wrconn_prev, char *conninfo_prev) > +{ > + WalReceiverConn *wrconn = NULL; > + > + /* If no change in PrimaryConnInfo, return previous connection itself */ > + if (strcmp(conninfo_prev, PrimaryConnInfo) == 0) > + return wrconn_prev; > + > + walrcv_disconnect(wrconn); > + wrconn = remote_connect(); > + return wrconn; > +} > > /return previous/return the previous/ > > Disconnect NULL is a bug isn't it? Don't you mean to disconnect 'wrconn_prev'? > Fixed > ~~~ > > 32. slotsync_worker_detach > > +/* > + * Detach the worker from DSM and update 'proc' and 'in_use'. > + * Logical replication launcher will come to know using these > + * that the worker has shutdown. > + */ > +static void > +slotsync_worker_detach(int code, Datum arg) > +{ > + dsa_detach((dsa_area *) DatumGetPointer(arg)); > + LWLockAcquire(SlotSyncWorkerLock, LW_EXCLUSIVE); > + MySlotSyncWorker->hdr.in_use = false; > + MySlotSyncWorker->hdr.proc = NULL; > + LWLockRelease(SlotSyncWorkerLock); > +} > > I expected this function to be in the same module as > slotsync_worker_attach. It seems a bit strange to have them separated. > Both now are part of launcher.c file > ~~~ > > 33. ReplSlotSyncMain > > + ereport(ERROR, > + (errmsg("The dbname not specified in primary_conninfo, skipping" > + " slots synchronization"), > + errhint("Specify dbname in primary_conninfo for slots" > + " synchronization to proceed"))); > > /not specified in/was not specified in/ > > /slots synchronization/slot synchronization/ (??) -- there are multiple of these > > ~ > > 34. > + /* > + * Connect to the database specified by user in PrimaryConnInfo. We need > + * database connection for walrcv_exec to work. Please see comments atop > + * libpqrcv_exec. > + */ > > /database connection/a database connection/ > > ~~~ > > 35. > + /* Reconnect if primary_conninfo got changed */ > + if (config_reloaded) > + wrconn = reconnect_if_needed(wrconn, conninfo_prev); > > SUGGESTION > Reconnect if GUC primary_conninfo has changed. > > ~ > > 36. > + /* > + * The slot-sync worker must not get here because it will only stop when > + * it receives a SIGINT from the logical replication launcher, or when > + * there is an error. None of these cases will allow the code to reach > + * here. > + */ > + Assert(false); > > 36a. > /must not/cannot/ > > 36b. > "None of these cases will allow the code to reach here." <-- redundant sentence > Fixed all above. This patch-set also fixes the crash reported by Kuroda-san, thanks to Shveta for that fix. regards, Ajin Cherian Fujitsu Australia
Attachment
Hi Ajin. Thanks for addressing my previous review comments from v19. I checked all the changes. Below are a few follow-up remarks. On Thu, Oct 5, 2023 at 7:54 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Wed, Sep 27, 2023 at 2:37 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Here are some more review comments for the patch v19-0002. > > 3. get_local_synced_slot_names > > > > + for (int i = 0; i < max_replication_slots; i++) > > + { > > + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; > > + > > + /* Check if it is logical synchronized slot */ > > + if (s->in_use && SlotIsLogical(s) && s->data.synced) > > + { > > + for (int j = 0; j < MySlotSyncWorker->dbcount; j++) > > + { > > > > Loop variables are not declared in the common PG code way. > > > > fixed. Yes, new declarations were added, but some of them (e.g. 'j') could have been declared at a lower scope closer to where they are being used. > > 5. use_slot_in_query > > > > +static bool > > +use_slot_in_query(char *slot_name, Oid *dbids) > > > > There are multiple non-standard for-loop variable declarations in this function. > > > > fixed. Yes, new declarations were added, but some of them (e.g. 'j') could have been declared at a lower scope closer to where they are being used. > > 11. get_remote_invalidation_cause > > > > +/* > > + * Get Remote Slot's invalidation cause. > > + * > > + * This gets invalidation cause of remote slot. > > + */ > > +static ReplicationSlotInvalidationCause > > +get_remote_invalidation_cause(WalReceiverConn *wrconn, char *slot_name) > > +{ > > > > Isn't that function comment just repeating itself? > > > > Fixed. /remote slot./the remote slot./ > > 27. > > + /* We are done, free remot_slot_list elements */ > > + foreach(cell, remote_slot_list) > > + { > > + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell); > > + > > + pfree(remote_slot); > > + } > > > > 27a. > > /remot_slot_list/remote_slot_list/ > > > > Fixed. > > > ~ > > > > 27b. > > Isn't this just the same as the one-liner: > > > > list_free_deep(remote_slot_list); It looks like the #27b comment was accidentally missed (??) > > 29. remote_connect > > > > +/* > > + * Connect to remote (primary) server. > > + * > > + * This uses primary_conninfo in order to connect to primary. For slot-sync > > + * to work, primary_conninfo is expected to have dbname as well. > > + */ > > +static WalReceiverConn * > > +remote_connect() > > > > 29a. > > I felt it might be more helpful to say "GUC primary_conninfo" instead > > of just 'primary_conninfo' the first time this is mentioned. > > > > fixed. The changed v21 comment now refers to "GUC PrimaryConnInfo" but I think that is wrong. The GUC really is called "primary_conninfo" --- PrimaryConnInfo is just the code static variable name. ====== Kind Regards, Peter Smith. Fujitsu Australia
On 2023-Sep-27, Peter Smith wrote: > 3. get_local_synced_slot_names > > + for (int i = 0; i < max_replication_slots; i++) > + { > + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; > + > + /* Check if it is logical synchronized slot */ > + if (s->in_use && SlotIsLogical(s) && s->data.synced) > + { > + for (int j = 0; j < MySlotSyncWorker->dbcount; j++) > + { > > Loop variables are not declared in the common PG code way. Note that since we added C99 as a mandatory requirement for compilers in commit d9dd406fe281, we've been using declarations in loop initializers (see 143290efd079). We have almost 500 occurrences of this already. Older code, obviously, does not use them, but that's no reason not to introduce them in new code. I think they make the code a bit leaner, so I suggest to use these liberally. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ Officer Krupke, what are we to do? Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")
On Fri, Oct 6, 2023 at 2:07 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2023-Sep-27, Peter Smith wrote: > > > 3. get_local_synced_slot_names > > > > + for (int i = 0; i < max_replication_slots; i++) > > + { > > + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; > > + > > + /* Check if it is logical synchronized slot */ > > + if (s->in_use && SlotIsLogical(s) && s->data.synced) > > + { > > + for (int j = 0; j < MySlotSyncWorker->dbcount; j++) > > + { > > > > Loop variables are not declared in the common PG code way. > > Note that since we added C99 as a mandatory requirement for compilers in > commit d9dd406fe281, we've been using declarations in loop initializers > (see 143290efd079). We have almost 500 occurrences of this already. > Older code, obviously, does not use them, but that's no reason not to > introduce them in new code. I think they make the code a bit leaner, so > I suggest to use these liberally. > Okay, we will. Thanks for letting us know. thanks Shveta
On Wed, Oct 4, 2023 at 5:34 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 10/4/23 1:50 PM, shveta malik wrote: > > On Wed, Oct 4, 2023 at 5:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> On Wed, Oct 4, 2023 at 11:55 AM Drouvot, Bertrand > >> <bertranddrouvot.pg@gmail.com> wrote: > >>> > >>> On 10/4/23 6:26 AM, shveta malik wrote: > >>>> On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >>>>> > >>>>> > >>>>> How about an alternate scheme where we define sync_slot_names on > >>>>> standby but then store the physical_slot_name in the corresponding > >>>>> logical slot (ReplicationSlotPersistentData) to be synced? So, the > >>>>> standby will send the list of 'sync_slot_names' and the primary will > >>>>> add the physical standby's slot_name in each of the corresponding > >>>>> sync_slot. Now, if we do this then even after restart, we should be > >>>>> able to know for which physical slot each logical slot needs to wait. > >>>>> We can even provide an SQL API to reset the value of > >>>>> standby_slot_names in logical slots as a way to unblock decoding in > >>>>> case of emergency (for example, corresponding when physical standby > >>>>> never comes up). > >>>>> > >>>> > >>>> > >>>> Looks like a better approach to me. It solves most of the pain points like: > >>>> 1) Avoids the need of multiple GUCs > >>>> 2) Primary and standby need not to worry to be in sync if we maintain > >>>> sync-slot-names GUC on both > >> > >> As per my understanding of this approach, we don't want > >> 'sync-slot-names' to be set on the primary. Do you have a different > >> understanding? > >> > > > > Same understanding. We do not need it to be set on primary by user. It > > will be GUC on standby and standby will convey it to primary. > > +1, same understanding here. > At PGConf NYC, I had a brief discussion on this topic with Andres where yet another approach to achieve this came up. Have a parameter like enable_failover at the slot level (this will be persistent information). Users can set it during the create/alter subscription or via pg_create_logical_replication_slot(). Also, on physical standby, there will be a parameter like enable_syncslot. All the physical standbys that have set enable_syncslot will receive all the logical slots that are marked as enable_failover. To me, whether to sync a particular slot is a slot-level property, so defining it in this new way seems reasonable. I think this will simplify the scheme a bit but still, the list of physical standby's for which logical slots wait during decoding needs to be maintained as we thought. But, how about with the above two parameters (enable_failover and enable_syncslot), we have standby_slot_names defined on the primary. That avoids the need to store the list of standby_slot_names in logical slots and simplifies the implementation quite a bit, right? Now, one can think if we have a parameter like 'standby_slot_names' then why do we need enable_syncslot on physical standby but that will be required to invoke sync worker which will pull logical slot's information? The advantage of having standby_slot_names defined on primary is that we can selectively wait on the subset of physical standbys where we are syncing the slots. 
I think this will be something similar to 'synchronous_standby_names' in the sense that the physical standbys mentioned in standby_slot_names will behave as synchronous copies with respect to slots, and after failover the user can switch to one of these physical standbys while the others start following the new master/publisher. Thoughts? -- With Regards, Amit Kapila.
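Put in concrete (and entirely tentative) terms, the scheme sketched in this mail might be used roughly as follows. The failover option, the enable_syncslot and standby_slot_names GUCs are only the names proposed here, and the connection string, publication, subscription and slot names are invented:

    -- On the subscriber: mark the subscription's slot as one that must survive
    -- failover (proposed option, not available at the time of this discussion).
    CREATE SUBSCRIPTION sub1
        CONNECTION 'host=primary dbname=appdb'
        PUBLICATION pub1
        WITH (failover = true);

    -- On the primary: physical standbys whose slots the logical walsenders must
    -- wait for before sending decoded changes (proposed GUC).
    ALTER SYSTEM SET standby_slot_names = 'standby1_phys';

    -- On each physical standby that should receive failover-enabled slots
    -- (proposed GUC).
    ALTER SYSTEM SET enable_syncslot = on;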
On Fri, Oct 6, 2023 at 7:37 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2023-Sep-27, Peter Smith wrote: > > > 3. get_local_synced_slot_names > > > > + for (int i = 0; i < max_replication_slots; i++) > > + { > > + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; > > + > > + /* Check if it is logical synchronized slot */ > > + if (s->in_use && SlotIsLogical(s) && s->data.synced) > > + { > > + for (int j = 0; j < MySlotSyncWorker->dbcount; j++) > > + { > > > > Loop variables are not declared in the common PG code way. > > Note that since we added C99 as a mandatory requirement for compilers in > commit d9dd406fe281, we've been using declarations in loop initializers > (see 143290efd079). We have almost 500 occurrences of this already. > Older code, obviously, does not use them, but that's no reason not to > introduce them in new code. I think they make the code a bit leaner, so > I suggest to use these liberally. > I also prefer the C99 style, but I had misunderstood there was still a convention to keep using the old style for code consistency (e.g. many new patches I see still seem to use the old style). Thanks for confirming that C99 loop variables are fine for any new code. @Shveta/Ajin - please ignore/revert all my old review comments about this point. ====== Kind Regards, Peter Smith. Fujitsu Australia.
Hi, On 10/6/23 6:48 PM, Amit Kapila wrote: > On Wed, Oct 4, 2023 at 5:34 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 10/4/23 1:50 PM, shveta malik wrote: >>> On Wed, Oct 4, 2023 at 5:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> >>>> On Wed, Oct 4, 2023 at 11:55 AM Drouvot, Bertrand >>>> <bertranddrouvot.pg@gmail.com> wrote: >>>>> >>>>> On 10/4/23 6:26 AM, shveta malik wrote: >>>>>> On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >>>>>>> >>>>>>> >>>>>>> How about an alternate scheme where we define sync_slot_names on >>>>>>> standby but then store the physical_slot_name in the corresponding >>>>>>> logical slot (ReplicationSlotPersistentData) to be synced? So, the >>>>>>> standby will send the list of 'sync_slot_names' and the primary will >>>>>>> add the physical standby's slot_name in each of the corresponding >>>>>>> sync_slot. Now, if we do this then even after restart, we should be >>>>>>> able to know for which physical slot each logical slot needs to wait. >>>>>>> We can even provide an SQL API to reset the value of >>>>>>> standby_slot_names in logical slots as a way to unblock decoding in >>>>>>> case of emergency (for example, corresponding when physical standby >>>>>>> never comes up). >>>>>>> >>>>>> >>>>>> >>>>>> Looks like a better approach to me. It solves most of the pain points like: >>>>>> 1) Avoids the need of multiple GUCs >>>>>> 2) Primary and standby need not to worry to be in sync if we maintain >>>>>> sync-slot-names GUC on both >>>> >>>> As per my understanding of this approach, we don't want >>>> 'sync-slot-names' to be set on the primary. Do you have a different >>>> understanding? >>>> >>> >>> Same understanding. We do not need it to be set on primary by user. It >>> will be GUC on standby and standby will convey it to primary. >> >> +1, same understanding here. >> > > At PGConf NYC, I had a brief discussion on this topic with Andres > where yet another approach to achieve this came up. Great! > Have a parameter > like enable_failover at the slot level (this will be persistent > information). Users can set it during the create/alter subscription or > via pg_create_logical_replication_slot(). Also, on physical standby, > there will be a parameter like enable_syncslot. All the physical > standbys that have set enable_syncslot will receive all the logical > slots that are marked as enable_failover. To me, whether to sync a > particular slot is a slot-level property, so defining it in this new > way seems reasonable. Yeah, as this is a slot-level property, I agree that this seems reasonable. Also that sounds more natural to me with this approach. The primary is really the one that "drives" which slots can be synced. I like it. One could also set enable_failover while creating a logical slot on a physical standby (so that cascading standbys could also have "extra slot" to sync as compare to "level 1" standbys). > > I think this will simplify the scheme a bit but still, the list of > physical standby's for which logical slots wait during decoding needs > to be maintained as we thought. Right. > But, how about with the above two > parameters (enable_failover and enable_syncslot), we have > standby_slot_names defined on the primary. That avoids the need to > store the list of standby_slot_names in logical slots and simplifies > the implementation quite a bit, right? Agree. 
> Now, one can think if we have a > parameter like 'standby_slot_names' then why do we need > enable_syncslot on physical standby but that will be required to > invoke sync worker which will pull logical slot's information? Yes, and enable_syncslot on the standby could also be used to "pause" the sync on standbys (by disabling the parameter) if one would want to (without the need to modify anything on the primary). > The > advantage of having standby_slot_names defined on primary is that we > can selectively wait on the subset of physical standbys where we are > syncing the slots. Yeah, and this flexibility/filtering looks somewhat mandatory to me. > I think this will be something similar to > 'synchronous_standby_names' in the sense that the physical standbys > mentioned in standby_slot_names will behave as synchronous copies with > respect to slots and after failover user can switch to one of these > physical standby and others can start following new master/publisher. > > Thoughts? I like the idea and I think that's the one that seems the more reasonable to me. I'd vote for this idea with: - standby_slot_names on the primary (could also be set on standbys in case of cascading context) - enable_failover at logical slot creation + API to enable/disable it at wish - enable_syncslot on the standbys Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
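Assuming enable_syncslot ends up being reloadable, pausing synchronization on one standby without touching the primary could then be as simple as the following (hypothetical; the parameter is only proposed at this point):

    -- on the standby: pause slot synchronization
    ALTER SYSTEM SET enable_syncslot = off;
    SELECT pg_reload_conf();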
On Mon, Oct 9, 2023 at 10:51 AM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > I like the idea and I think that's the one that seems the more reasonable > to me. I'd vote for this idea with: > > - standby_slot_names on the primary (could also be set on standbys in case of > cascading context) > - enable_failover at logical slot creation + API to enable/disable it at wish > - enable_syncslot on the standbys > Thanks, these definitions sound reasonable to me. -- With Regards, Amit Kapila.
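Under that scheme, the standby and subscriber sides would then be configured roughly as follows (a sketch; the parameter and option names are the proposed ones and may still change, and the connection values are examples):

    # postgresql.conf on the physical standby
    enable_syncslot = on

    -- on the subscriber, mark the subscription's slot for failover
    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=primary dbname=appdb'
        PUBLICATION mypub
        WITH (enable_failover = true);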
PFA the v22 patch set. It has the below changes: patch 001: 1) The physical walsender now wakes up logical walsender(s) using a new CV, as suggested in [1]. 2) pg_logical_slot_get_changes (and other such get/peek functions) now also wait for standby confirmation. patch 002: 1) New column (synced_slot) added to pg_replication_slots to indicate whether a slot is a synced slot or a user-created one. 2) Any attempt to run pg_drop_replication_slot() on a synced slot will result in an error. 3) Addressed some of Peter's comments dated Oct 4 and Kuroda-san's comments dated Oct 2. Thanks Hou-san for working on the changes of patch 001. [1]: https://www.postgresql.org/message-id/a539e247-30c8-4d5c-b561-07d0949cc960%40gmail.com thanks Shveta
Attachment
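To illustrate the 0002 behaviour above on a standby, one would see something like the following (the slot name is hypothetical; the error wording is the patch's current text and may still change):

    -- which local slots were synchronized from the primary?
    SELECT slot_name, slot_type, synced_slot
    FROM pg_replication_slots;

    -- dropping a synced slot is rejected while in recovery
    SELECT pg_drop_replication_slot('mysub_slot');
    ERROR:  cannot drop replication slot
    DETAIL:  This slot is being synced from primary.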
On Wed, Oct 4, 2023 at 8:53 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v20-0002. > Thanks Peter for the feedback. Comments from 31 till end are addressed in v22. First 30 comments will be addressed in the next version. > ====== > 1. GENERAL - errmsg/elog messages > > There are a a lot of minor problems and/or quirks across all the > message texts. Here is a summary of some I found: > > ERROR > errmsg("could not receive list of slots from the primary server: %s", > errmsg("invalid response from primary server"), > errmsg("invalid connection string syntax: %s", > errmsg("replication slot-sync worker slot %d is empty, cannot attach", > errmsg("replication slot-sync worker slot %d is already used by > another worker, cannot attach", > errmsg("replication slot-sync worker slot %d is already used by > another worker, cannot attach", > errmsg("could not connect to the primary server: %s", > > errmsg("operation not permitted on replication slots on standby which > are synchronized from primary"))); > /primary/the primary/ > > errmsg("could not fetch invalidation cuase for slot \"%s\" from primary: %s", > /cuase/cause/ > /primary/the primary/ > > errmsg("slot \"%s\" disapeared from the primary", > /disapeared/disappeared/ > > errmsg("could not fetch slot info from the primary: %s", > errmsg("could not connect to the primary server: %s", err))); > errmsg("could not map dynamic shared memory segment for slot-sync worker"))); > > errmsg("physical replication slot %s found in synchronize_slot_names", > slot name not quoted? > --- > > WARNING > errmsg("out of background worker slots"), > > errmsg("Replication slot-sync worker failed to attach to worker-pool slot %d", > case? > > errmsg("Removed database %d from replication slot-sync worker %d; > dbcount now: %d", > case? > > errmsg("Skipping slots synchronization as primary_slot_name is not set.")); > case? > > errmsg("Skipping slots synchronization as hot_standby_feedback is off.")); > case? > > errmsg("Skipping slots synchronization as dbname is not specified in > primary_conninfo.")); > case? > > errmsg("slot-sync wait for slot %s interrupted by promotion, slot > creation aborted", > > errmsg("could not fetch slot info for slot \"%s\" from primary: %s", > /primary/the primary/ > > errmsg("slot \"%s\" disappeared from the primary, aborting slot creation", > errmsg("slot \"%s\" invalidated on primary, aborting slot creation", > > errmsg("slot-sync for slot %s interrupted by promotion, sync not possible", > slot name not quoted? > > errmsg("skipping sync of slot \"%s\" as the received slot-sync lsn > %X/%X is ahead of the standby position %X/%X", > > errmsg("not synchronizing slot %s; synchronization would move it backward", > slot name not quoted? > /backward/backwards/ > > --- > > LOG > errmsg("Added database %d to replication slot-sync worker %d; dbcount now: %d", > errmsg("Added database %d to replication slot-sync worker %d; dbcount now: %d", > errmsg("Stopping replication slot-sync worker %d", > errmsg("waiting for remote slot \"%s\" LSN (%u/%X) and catalog xmin > (%u) to pass local slot LSN (%u/%X) and and catalog xmin (%u)", > > errmsg("wait over for remote slot \"%s\" as its LSN (%X/%X)and catalog > xmin (%u) has now passed local slot LSN (%X/%X) and catalog xmin > (%u)", > missing spaces? > > elog(LOG, "Dropped replication slot \"%s\" ", > extra space? > why this one is elog but others are not? 
> > elog(LOG, "Replication slot-sync worker %d is shutting down on > receiving SIGINT", MySlotSyncWorker->slot); > case? > why this one is elog but others are not? > > elog(LOG, "Replication slot-sync worker %d started", worker_slot); > case? > why this one is elog but others are not? > ---- > > DEBUG1 > errmsg("allocated dsa for slot-sync worker for dbcount: %d" > worker number not given? > should be elog? > > errmsg_internal("logical replication launcher started") > should be elog? > > ---- > > DEBUG2 > elog(DEBUG2, "slot-sync worker%d's query:%s \n", > missing space after 'worker' > extra space before \n > > ====== > .../libpqwalreceiver/libpqwalreceiver.c > > 2. libpqrcv_get_dbname_from_conninfo > > +/* > + * Get database name from primary conninfo. > + * > + * If dbanme is not found in connInfo, return NULL value. > + * The caller should take care of handling NULL value. > + */ > +static char * > +libpqrcv_get_dbname_from_conninfo(const char *connInfo) > > 2a. > /dbanme/dbname/ > > ~ > > 2b. > "The caller should take care of handling NULL value." > > IMO this is not very useful; it's like saying "caller must handle > function return values". > > ~~~ > > 3. > + for (opt = opts; opt->keyword != NULL; ++opt) > + { > + /* Ignore connection options that are not present. */ > + if (opt->val == NULL) > + continue; > + > + if (strcmp(opt->keyword, "dbname") == 0 && opt->val[0] != '\0') > + { > + dbname = pstrdup(opt->val); > + } > + } > > 3a. > If there are multiple "dbname" in the conninfo then it will be the > LAST one that is returned. > > Judging by my quick syntax experiment (below) this seemed like the > correct thing to do, but I think there should be some comment to > explain about it. > > test_sub=# create subscription sub1 connection 'dbname=foo dbname=bar > dbname=test_pub' publication pub1; > 2023-09-28 19:15:15.012 AEST [23997] WARNING: subscriptions created > by regression test cases should have names starting with "regress_" > WARNING: subscriptions created by regression test cases should have > names starting with "regress_" > NOTICE: created replication slot "sub1" on publisher > CREATE SUBSCRIPTION > > ~ > > 3b. > The block brackets {} are not needed for the single statement. > > ~ > > 3c. > Since there is only one keyword of interest here it seemed overkill to > have a separate 'continue' check. Why not do everything in one line: > > for (opt = opts; opt->keyword != NULL; ++opt) > { > if (strcmp(opt->keyword, "dbname") == 0 && opt->val && opt->val[0] != '\0') > dbname = pstrdup(opt->val); > } > > ====== > src/backend/replication/logical/launcher.c > > 4. > +/* > + * The local variables to store the current values of slot-sync related GUCs > + * before each ConfigReload. > + */ > +static char *PrimaryConnInfoPreReload = NULL; > +static char *PrimarySlotNamePreReload = NULL; > +static char *SyncSlotNamesPreReload = NULL; > > /The local variables/Local variables/ > > ~~~ > > 5. fwd declare > > static void logicalrep_worker_cleanup(LogicalRepWorker *worker); > +static void slotsync_worker_cleanup(SlotSyncWorker *worker); > static int logicalrep_pa_worker_count(Oid subid); > > 5a. > Hmmn, I think there were lot more added static functions than just this one. > > e.g. what about all these? 
> static SlotSyncWorker *slotsync_worker_find > static dsa_handle slotsync_dsa_setup > static bool slotsync_worker_launch_or_reuse > static void slotsync_worker_stop_internal > static void slotsync_workers_stop > static void slotsync_remove_obsolete_dbs > static WalReceiverConn *primary_connect > static void SaveCurrentSlotSyncConfigs > static bool SlotSyncConfigsChanged > static void ApplyLauncherStartSlotSync > static void ApplyLauncherStartSubs > > ~ > > 5b. > There are inconsistent name style used for the new static functions -- > e.g. snake_case versus CamelCase. > > ~~~ > > 6. WaitForReplicationWorkerAttach > > int rc; > + bool is_slotsync_worker = (lock == SlotSyncWorkerLock) ? true : false; > > This seemed a hacky way to distinguish the sync-slot workers from > other kinds of workers. Wouldn't it be better to pass another > parameter to this function? > > ~~~ > > 7. slotsync_worker_attach > > It looks like almost a clone of the logicalrep_worker_attach. Seems a > shame if cannot make use of common code. > > ~~~ > > 8. slotsync_worker_find > > + * Walks the slot-sync workers pool and searches for one that matches given > + * dbid. Since one worker can manage multiple dbs, so it walks the db array in > + * each worker to find the match. > > 8a. > SUGGESTION > Searches the slot-sync worker pool for the worker who manages the > specified dbid. Because a worker can manage multiple dbs, also walk > the db array of each worker to find the match. > > ~ > > 8b. > Should the comment also say something like "Returns NULL if no > matching worker is found." > > ~~~ > > 9. > + /* Search for attached worker for a given dbid */ > > SUGGESTION > Search for an attached worker managing the given dbid. > > ~~~ > > 10. > +{ > + int i; > + SlotSyncWorker *res = NULL; > + Oid *dbids; > + > + Assert(LWLockHeldByMeInMode(SlotSyncWorkerLock, LW_SHARED)); > + > + /* Search for attached worker for a given dbid */ > + for (i = 0; i < max_slotsync_workers; i++) > + { > + SlotSyncWorker *w = &LogicalRepCtx->ss_workers[i]; > + int cnt; > + > + if (!w->hdr.in_use) > + continue; > + > + dbids = (Oid *) dsa_get_address(w->dbids_dsa, w->dbids_dp); > + for (cnt = 0; cnt < w->dbcount; cnt++) > + { > + Oid wdbid = dbids[cnt]; > + > + if (wdbid == dbid) > + { > + res = w; > + break; > + } > + } > + > + /* If worker is found, break the outer loop */ > + if (res) > + break; > + } > + > + return res; > +} > > IMO this logical can be simplified a lot: > - by not using the 'res' variable; directly return instead. > - also moved the 'dbids' declaration. > - and 'cnt' variable seems not meaningful; replace with 'dbidx' for > the db array index IMO. > > For example (25 lines instead of 35 lines) > > { > int i; > > Assert(LWLockHeldByMeInMode(SlotSyncWorkerLock, LW_SHARED)); > > /* Search for an attached worker managing the given dbid. */ > for (i = 0; i < max_slotsync_workers; i++) > { > SlotSyncWorker *w = &LogicalRepCtx->ss_workers[i]; > int dbidx; > Oid *dbids; > > if (!w->hdr.in_use) > continue; > > dbids = (Oid *) dsa_get_address(w->dbids_dsa, w->dbids_dp); > for (dbidx = 0; dbidx < w->dbcount; dbidx++) > { > if (dbids[dbidx] == dbid) > return w; > } > } > > return NULL; > } > > ~~~ > > 11. slot_sync_dsa_setup > > +/* > + * Setup DSA for slot-sync worker. > + * > + * DSA is needed for dbids array. Since max number of dbs a worker can manage > + * is not known, so initially fixed size to hold DB_PER_WORKER_ALLOC_INIT > + * dbs is allocated. 
If this size is exhausted, it can be extended using > + * dsa free and allocate routines. > + */ > +static dsa_handle > +slotsync_dsa_setup(SlotSyncWorker *worker, int alloc_db_count) > > 11a. > SUGGESTION > DSA is used for the dbids array. Because the maximum number of dbs a > worker can manage is not known, initially enough memory for > DB_PER_WORKER_ALLOC_INIT dbs is allocated. If this size is exhausted, > it can be extended using dsa free and allocate routines. > > ~ > > 11b. > It doesn't make sense for the comment to say DB_PER_WORKER_ALLOC_INIT > is the initial allocation, but then the function has a parameter > 'alloc_db_count' (which is always passed as DB_PER_WORKER_ALLOC_INIT). > IMO revemo the 2nd parameter from this function and hardwire the > initial allocation same as what the function comment says. > > ~~~ > > 12. > + /* Be sure any memory allocated by DSA routines is persistent. */ > + oldcontext = MemoryContextSwitchTo(TopMemoryContext); > > /Be sure any memory/Ensure the memory/ > > ~~~ > > 13. slotsync_worker_launch_or_reuse > > +/* > + * Slot-sync worker launch or reuse > + * > + * Start new slot-sync background worker from the pool of available workers > + * going by max_slotsync_workers count. If the worker pool is exhausted, > + * reuse the existing worker with minimum number of dbs. The idea is to > + * always distribute the dbs equally among launched workers. > + * If initially allocated dbids array is exhausted for the selected worker, > + * reallocate the dbids array with increased size and copy the existing > + * dbids to it and assign the new one as well. > + * > + * Returns true on success, false on failure. > + */ > > /going by/limited by/ (??) > > ~~~ > > 14. > + BackgroundWorker bgw; > + BackgroundWorkerHandle *bgw_handle; > + uint16 generation; > + SlotSyncWorker *worker = NULL; > + uint32 mindbcnt = 0; > + uint32 alloc_count = 0; > + uint32 copied_dbcnt = 0; > + Oid *copied_dbids = NULL; > + int worker_slot = -1; > + dsa_handle handle; > + Oid *dbids; > + int i; > + bool attach; > > IIUC many of these variables can be declared at a different scope in > this function, so they will be closer to where they are used. > > ~~~ > > 15. > + /* > + * We need to do the modification of the shared memory under lock so that > + * we have consistent view. > + */ > + LWLockAcquire(SlotSyncWorkerLock, LW_EXCLUSIVE); > > The current comment seems too much. > > SUGGESTION > The shared memory must only be modified under lock. > > ~~~ > > 16. > + /* Find unused worker slot. */ > + for (i = 0; i < max_slotsync_workers; i++) > + { > + SlotSyncWorker *w = &LogicalRepCtx->ss_workers[i]; > + > + if (!w->hdr.in_use) > + { > + worker = w; > + worker_slot = i; > + break; > + } > + } > + > + /* > + * If all the workers are currently in use. Find the one with minimum > + * number of dbs and use that. > + */ > + if (!worker) > + { > + for (i = 0; i < max_slotsync_workers; i++) > + { > + SlotSyncWorker *w = &LogicalRepCtx->ss_workers[i]; > + > + if (i == 0) > + { > + mindbcnt = w->dbcount; > + worker = w; > + worker_slot = i; > + } > + else if (w->dbcount < mindbcnt) > + { > + mindbcnt = w->dbcount; > + worker = w; > + worker_slot = i; > + } > + } > + } > > Why not combine these 2 loops, to avoid iterating over the same slots > twice? Then, exit the loop immediately if unused worker found, > otherwise if reach the end of loop having not found anything unused > then you will already know the one having least dbs. > > ~~~ > > 17. > + /* Remember the old dbids before we reallocate dsa. 
*/ > + copied_dbcnt = worker->dbcount; > + copied_dbids = (Oid *) palloc0(worker->dbcount * sizeof(Oid)); > + memcpy(copied_dbids, dbids, worker->dbcount * sizeof(Oid)); > > 17a. > Who frees this copied_dbids memory when you are finished needed it. It > seems allocated in the TopMemoryContext so IIUC this is a leak. > > ~ > > 17b. > These are the 'old' values. Not the 'copied' values. The copied_xxx > variable names seem misleading. > > ~~~ > > 18. > + /* Prepare the new worker. */ > + worker->hdr.launch_time = GetCurrentTimestamp(); > + worker->hdr.in_use = true; > > If a new worker is required then the launch_time is set like above. > > + { > + slot_db_data->last_launch_time = now; > + > + slotsync_worker_launch_or_reuse(slot_db_data->database); > + } > > Meanwhile, at the caller of slotsync_worker_launch_or_reuse(), the > dbid launch_time was already set as well. And those two timestamps are > almost (but not quite) the same value. Isn't that a bit strange? > > ~~~ > > 19. > + /* Initial DSA setup for dbids array to hold DB_PER_WORKER_ALLOC_INIT dbs */ > + handle = slotsync_dsa_setup(worker, DB_PER_WORKER_ALLOC_INIT); > + dbids = (Oid *) dsa_get_address(worker->dbids_dsa, worker->dbids_dp); > + > + dbids[worker->dbcount++] = dbid; > > Where was this worker->dbcount assigned to 0? > > Maybe it's better to do this explicity under the "/* Prepare the new > worker. */" comment. > > ~~~ > > 20. > + if (!attach) > + ereport(WARNING, > + (errmsg("Replication slot-sync worker failed to attach to " > + "worker-pool slot %d", worker_slot))); > + > + /* Attach is done, now safe to log that the worker is managing dbid */ > + if (attach) > + ereport(LOG, > + (errmsg("Added database %d to replication slot-sync " > + "worker %d; dbcount now: %d", > + dbid, worker_slot, worker->dbcount))); > > 20a. > IMO this should be coded as "if (attach) ...; else ..." > > ~ > > 99b. > In other code if it failed to register then slotsync_worker_cleanup > code is called. How come similar code is not done when fails to > attach? > > ~~~ > > 21. slotsync_worker_stop_internal > > +/* > + * Internal function to stop the slot-sync worker and wait until it detaches > + * from the slot-sync worker-pool slot. > + */ > +static void > +slotsync_worker_stop_internal(SlotSyncWorker *worker) > > IIUC this function does a bit more than what the function comment > says. IIUC (again) I think the "detached" worker slot will still be > flagged as 'inUse' but this function then does the extra step of > calling slotsync_worker_cleanup() function to make the worker slot > available for next process that needs it, am I correct? > > In this regard, this function seems a lot more like > logicalrep_worker_detach() function comment, so there seems some kind > of muddling of the different function names here... (??). > > ~~~ > > 22. slotsync_remove_obsolete_dbs > > This function says: > +/* > + * Slot-sync workers remove obsolete DBs from db-list > + * > + * If the DBIds fetched from the primary are lesser than the ones being managed > + * by slot-sync workers, remove extra dbs from worker's db-list. This > may happen > + * if some slots are removed on primary but 'synchronize_slot_names' has not > + * been changed yet. > + */ > +static void > +slotsync_remove_obsolete_dbs(List *remote_dbs) > > But, there was another similar logic function too: > > +/* > + * Drop obsolete slots > + * > + * Drop the slots which no longer need to be synced i.e. these either > + * do not exist on primary or are no longer part of synchronize_slot_names. 
> + * > + * Also drop the slots which are valid on primary and got invalidated > + * on standby due to conflict (say required rows removed on primary). > + * The assumption is, these will get recreated in next sync-cycle and > + * it is okay to drop and recreate such slots as long as these are not > + * consumable on standby (which is the case currently). > + */ > +static void > +drop_obsolete_slots(Oid *dbids, List *remote_slot_list) > > Those function header comments suggest these have a lot of overlapping > functionality. > > Can't those 2 functions be combined? Or maybe one delegate to the other? > > ~~~ > > 23. > + ListCell *lc; > + Oid *dbids; > + int widx; > + int dbidx; > + int i; > > Scope of some of these variable declarations can be different so they > are declared closer to where they are used. > > ~~~ > > 24. > + /* If not found, then delete this db from worker's db-list */ > + if (!found) > + { > + for (i = dbidx; i < worker->dbcount; i++) > + { > + /* Shift the DBs and get rid of wdbid */ > + if (i < (worker->dbcount - 1)) > + dbids[i] = dbids[i + 1]; > + } > > IIUC, that shift/loop could just have been a memmove() call to remove > one Oid element. > > ~~~ > > 25. > + /* If dbcount for any worker has become 0, shut it down */ > + for (widx = 0; widx < max_slotsync_workers; widx++) > + { > + SlotSyncWorker *worker = &LogicalRepCtx->ss_workers[widx]; > + > + if (worker->hdr.in_use && !worker->dbcount) > + slotsync_worker_stop_internal(worker); > + } > > Is it safe to stop this unguarded by SlotSyncWorkerLock locking? Is > there a window where another dbid decides to reuse this worker at the > same time this process is about to stop it? > > ~~~ > > 26. primary_connect > > +/* > + * Connect to primary server for slotsync purpose and return the connection > + * info. Disconnect previous connection if provided in wrconn_prev. > + */ > > /primary server/the primary server/ > > ~~~ > > 27. > + if (!RecoveryInProgress()) > + return NULL; > + > + if (max_slotsync_workers == 0) > + return NULL; > + > + if (strcmp(synchronize_slot_names, "") == 0) > + return NULL; > + > + /* The primary_slot_name is not set */ > + if (!WalRcv || WalRcv->slotname[0] == '\0') > + { > + ereport(WARNING, > + errmsg("Skipping slots synchronization as primary_slot_name " > + "is not set.")); > + return NULL; > + } > + > + /* The hot_standby_feedback must be ON for slot-sync to work */ > + if (!hot_standby_feedback) > + { > + ereport(WARNING, > + errmsg("Skipping slots synchronization as hot_standby_feedback " > + "is off.")); > + return NULL; > + } > > How come some of these checks giving WARNING that slot synchronization > will be skipped, but others are just silently returning NULL? > > ~~~ > > 28. SaveCurrentSlotSyncConfigs > > +static void > +SaveCurrentSlotSyncConfigs() > +{ > + PrimaryConnInfoPreReload = pstrdup(PrimaryConnInfo); > + PrimarySlotNamePreReload = pstrdup(WalRcv->slotname); > + SyncSlotNamesPreReload = pstrdup(synchronize_slot_names); > +} > > Shouldn't this code also do pfree first? Otherwise these will slowly > leak every time this function is called, right? > > ~~~ > > 29. SlotSyncConfigsChanged > > +static bool > +SlotSyncConfigsChanged() > +{ > + if (strcmp(PrimaryConnInfoPreReload, PrimaryConnInfo) != 0) > + return true; > + > + if (strcmp(PrimarySlotNamePreReload, WalRcv->slotname) != 0) > + return true; > + > + if (strcmp(SyncSlotNamesPreReload, synchronize_slot_names) != 0) > + return true; > > I felt those can all be combined to have 1 return instead of 3. > > ~~~ > > 30. 
> + /* > + * If we have reached this stage, it means original value of > + * hot_standby_feedback was 'true', so consider it changed if 'false' now. > + */ > + if (!hot_standby_feedback) > + return true; > > "If we have reached this stage" seems a bit vague. Can this have some > more explanation? And, maybe also an Assert(hot_standby_feedback); is > helpful in the calling code (before the config is reloaded)? > > ~~~ > > 31. ApplyLauncherStartSlotSync > > + * It connects to primary, get the list of DBIDs for slots configured in > + * synchronize_slot_names. It then launces the slot-sync workers as per > + * max_slotsync_workers and then assign the DBs equally to the workers > + * launched. > + */ > > SUGGESTION (fix typos etc) > Connect to the primary, to get the list of DBIDs for slots configured > in synchronize_slot_names. Then launch slot-sync workers (limited by > max_slotsync_workers) where the DBs are distributed equally among > those workers. > > ~~~ > > 32. > +static void > +ApplyLauncherStartSlotSync(long *wait_time, WalReceiverConn *wrconn) > > Why does this function even have 'Apply' in the name when it is > nothing to do with an apply worker; looks like some cut/paste > hangover. How about calling it something like 'LaunchSlotSyncWorkers' > > ~~~ > > 33. > + /* If connection is NULL due to lack of correct configurations, return */ > + if (!wrconn) > + return; > > IMO it would be better to Assert wrconn in this function. If it is > NULL then it should be checked a the caller, otherwise it just raises > more questions -- like "who logged the warning about bad > configuration" etc (which I already questions the NULL returns of > primary_connect. > > ~~~ > > 34. > + if (!OidIsValid(slot_db_data->database)) > + continue; > > This represents some kind of integrity error doesn't it? Is it really > OK just to silently skip such a thing? > > ~~~ > > 35. > + /* > + * If the worker is eligible to start now, launch it. Otherwise, > + * adjust wait_time so that we'll wake up as soon as it can be > + * started. > + * > + * Each apply worker can only be restarted once per > + * wal_retrieve_retry_interval, so that errors do not cause us to > + * repeatedly restart the worker as fast as possible. > + */ > > 35a. > I found the "we" part of "so that we'll wake up..." to be a bit > misleading. There is no waiting in this function; that wait value is > handed back to the caller to deal with. TBH, I did not really > understand why it is even necessary tp separate the waiting > calculation *per-worker* like this. It seems to overcomplicate things > and it might even give results like 1st worker is not started but last > works is started (if enough time elapsed in the loop). Why can't all > this wait logic be done one time up front, and either (a) start all > necessary workers, or (b) start none of them and wait a bit longer. > > ~ > > 35b. > "Each apply worker". Why is this talking about "apply" workers? Maybe > cut/paste error? > > ~~~ > > 36. > + last_launch_tried = slot_db_data->last_launch_time; > + now = GetCurrentTimestamp(); > + if (last_launch_tried == 0 || > + (elapsed = TimestampDifferenceMilliseconds(last_launch_tried, now)) >= > + wal_retrieve_retry_interval) > + { > + slot_db_data->last_launch_time = now; > + > + slotsync_worker_launch_or_reuse(slot_db_data->database); > + } > + else > + { > + *wait_time = Min(*wait_time, > + wal_retrieve_retry_interval - elapsed); > + } > > 36a. > IMO this might be simpler if you add another variable like bool 'launch_now': > > last_launch_tried = ... 
> now = ... > elapsed = ... > launch_now = elapsed >= wal_retrieve_retry_interval; > > ~ > > 36b. > Do you really care about checking "last_launch_tried == 0"; If it > really is zero, then I thought the elapsed check should be enough. > > ~ > > 36c. > Does this 'last_launch_time' really need to be in some shared memory? > Won't a static variable suffice? > > > ~~~ > > 37. ApplyLauncherStartSubs > > Wouldn't a better name for the function be something like > 'LaunchSubscriptionApplyWorker'? (it is a better match for the > suggested LaunchSlotSyncWorkers) > > > ~~~ > > 38. ApplyLauncherMain > > Now that this is not only for Apply worker but also for SlotSync > workers, maybe this function should be renamed as just LauncherMain, > or something equally generic? > > ~~~ > > 39. > + load_file("libpqwalreceiver", false); > + > + wrconn = primary_connect(NULL); > + > > This connection did not exist in the HEAD code so I think it is added > only for the slot-sync logic. IIUC it is still doing nothing for the > non-slot-sync cases because primary_connect will silently return in > that case: > > + if (!RecoveryInProgress()) > + return NULL; > > IMO this is too sneaky, and it is misleading to see the normal apply > worker launch apparently ccnnecting to something when it is not really > doing so AFAIK. I think these conditions should be done explicity here > at the caller to remove any such ambiguity. > > ~~~ > > 40. > + if (!RecoveryInProgress()) > + ApplyLauncherStartSubs(&wait_time); > + else > + ApplyLauncherStartSlotSync(&wait_time, wrconn); > > 40a. > IMO this is deserving of a comment to explain why RecoveryInProgress > means to perform the slot-synchronization. > > ~ > > 40b. > Also, better to have positive check RecoveryInProgress() instead of > !RecoveryInProgress() > > ~~~ > > 41. > if (ConfigReloadPending) > { > + bool ssConfigChanged = false; > + > + SaveCurrentSlotSyncConfigs(); > + > ConfigReloadPending = false; > ProcessConfigFile(PGC_SIGHUP); > + > + /* > + * Stop the slot-sync workers if any of the related GUCs changed. > + * These will be relaunched as per the new values during next > + * sync-cycle. > + */ > + ssConfigChanged = SlotSyncConfigsChanged(); > + if (ssConfigChanged) > + slotsync_workers_stop(); > + > + /* Reconnect in case primary_conninfo has changed */ > + wrconn = primary_connect(wrconn); > } > } > > ~ > > 41a. > The 'ssConfigChanged' assignement at declaration is not needed. > Indeed, the whole variable is not really necessary because it is used > only once. > > ~ > > 41b. > /as per the new values/using the new values/ > > ~ > > 41c. > + /* Reconnect in case primary_conninfo has changed */ > + wrconn = primary_connect(wrconn); > > To avoid unnecessary reconnections, shouldn't this be done only if > (ssConfigChanged). > > In fact, assuming the comment is correct, reconnect only if > (strcmp(PrimaryConnInfoPreReload, PrimaryConnInfo) != 0) > > > ====== > src/backend/replication/logical/slotsync.c > > 42. wait_for_primary_slot_catchup > > + ereport(LOG, > + errmsg("waiting for remote slot \"%s\" LSN (%u/%X) and catalog xmin" > + " (%u) to pass local slot LSN (%u/%X) and and catalog xmin (%u)", > + remote_slot->name, > + LSN_FORMAT_ARGS(remote_slot->restart_lsn), > + remote_slot->catalog_xmin, > + LSN_FORMAT_ARGS(MyReplicationSlot->data.restart_lsn), > + MyReplicationSlot->data.catalog_xmin)); > > AFAIK it is usual for the LSN format string to be %X/%X (not %u/%X like here). > > ~~~ > > 43. 
> + appendStringInfo(&cmd, > + "SELECT restart_lsn, confirmed_flush_lsn, catalog_xmin" > + " FROM pg_catalog.pg_replication_slots" > + " WHERE slot_name = %s", > + quote_literal_cstr(remote_slot->name)); > > double space before FROM? > > ~~~ > > 44. synchronize_one_slot > > + /* > + * We might not have the WALs retained locally corresponding to > + * remote's restart_lsn if our local restart_lsn and/or local > + * catalog_xmin is ahead of remote's one. And thus we can not create > + * the local slot in sync with primary as that would mean moving local > + * slot backward. Thus wait for primary's restart_lsn and catalog_xmin > + * to catch up with the local ones and then do the sync. > + */ > + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn || > + TransactionIdPrecedes(remote_slot->catalog_xmin, > + MyReplicationSlot->data.catalog_xmin)) > + { > + if (!wait_for_primary_slot_catchup(wrconn, remote_slot)) > + { > + /* > + * The remote slot didn't catch up to locally reserved > + * position > + */ > + ReplicationSlotRelease(); > + CommitTransactionCommand(); > + return; > + } > > > SUGGESTION (comment is slightly simplified) > If the local restart_lsn and/or local catalog_xmin is ahead of those > on the remote then we cannot create the local slot in sync with > primary because that would mean moving local slot backwards. In this > case we will wait for primary's restart_lsn and catalog_xmin to catch > up with the local one before attempting the sync. > > ====== > Kind Regards, > Peter Smith. > Fujitsu Australia
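Regarding comment 24 above, the single memmove() would look something like this (a sketch; the subsequent adjustment of worker->dbcount is not shown, matching the quoted snippet):

    /* Remove the entry at dbidx by shifting the tail of the array down one */
    memmove(&dbids[dbidx], &dbids[dbidx + 1],
            (worker->dbcount - dbidx - 1) * sizeof(Oid));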
On Mon, Oct 9, 2023 at 3:24 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Fri, Oct 6, 2023 at 7:37 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > On 2023-Sep-27, Peter Smith wrote: > > > > > 3. get_local_synced_slot_names > > > > > > + for (int i = 0; i < max_replication_slots; i++) > > > + { > > > + ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i]; > > > + > > > + /* Check if it is logical synchronized slot */ > > > + if (s->in_use && SlotIsLogical(s) && s->data.synced) > > > + { > > > + for (int j = 0; j < MySlotSyncWorker->dbcount; j++) > > > + { > > > > > > Loop variables are not declared in the common PG code way. > > > > Note that since we added C99 as a mandatory requirement for compilers in > > commit d9dd406fe281, we've been using declarations in loop initializers > > (see 143290efd079). We have almost 500 occurrences of this already. > > Older code, obviously, does not use them, but that's no reason not to > > introduce them in new code. I think they make the code a bit leaner, so > > I suggest to use these liberally. > > > > I also prefer the C99 style, but I had misunderstood there was still a > convention to keep using the old style for code consistency (e.g. many > new patches I see still seem to use the old style). > > Thanks for confirming that C99 loop variables are fine for any new code. > > @Shveta/Ajin - please ignore/revert all my old review comments about this point. > Sure, reverted all such changes in v22. Now we have declarations in loop initializers. thanks Shveta
On Mon, Oct 2, 2023 at 4:29 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Shveta, > > Thank you for updating the patch! Thanks for the feedback Kuroda-san. I have addressed most of these in v22. Please find my comments inline. > > I found another ERROR due to the slot removal. Is this a real issue? > > 1. applied add_sleep.txt, which emulated the case the tablesync worker stucked > and the primary crashed during the > initial sync. > 2. executed test_0925_v2.sh (You attached in [1]) > 3. secondary could not start the logical replication because the slot was not > created (log files were also attached). > > > Here is my analysis. The cause is that the slotsync worker aborts the slot creation > on secondary server because the restart_lsn of secondary ahead the primary's one. > IIUC it can be occurred when tablesync workers finishes initial copy before > walsenders stream changes. In this case, the relstate of the worker is set to > SUBREL_STATE_CATCHUP and the apply worker waits till the relation becomes > SUBREL_STATE_SYNCDONE. From here the slot on primary will not be updated until > the relation is caught up. If some changes are come and the primary crashes at > that time, the syncslot worker will abort the slot creation. > > > Anyway, followings are my comments. > I have not checked detailed conventions yet. It should be done in later stage. > > > ~~~~~~~~~~~~~~~~ > > For 0001: > === > > WalSndWaitForStandbyConfirmation() > > ``` > + /* If postmaster asked us to stop, don't wait anymore */ > + if (got_STOPPING) > + break; > ``` > > I have considered again, and it may still have an issue: logical walsenders may > break from the loop before physical walsenders send WALs. This may be occurred > because both physical and logical walsenders would get PROCSIG_WALSND_INIT_STOPPING. > > I think a function like WalSndWaitStopping() must be needed, which waits until > physical walsenders become WALSNDSTATE_STOPPING or exit. Thought? > > WalSndWaitForStandbyConfirmation() > > ``` > + standby_slot_cpy = list_copy(standby_slot_names_list); > ``` > > I found that standby_slot_names_list and standby_slot_cpy would not be updated > even if the GUC was updated. Is this acceptable? Won't it be occurred after you > refactor the patch? > What would be occurred when synchronize_slot_names is updated on secondary > while primary executes this? > > WalSndWaitForStandbyConfirmation() > > ``` > + > + goto retry; > ``` Yes, there could be a problem here. I will review it. Allow some more time for this. > > I checked other "goto retry;", but I could not find the pattern that the return > clause does not exist after the goto (exception: void function). I also think > that current style seems a bit strange. How about using an outer loop like > While (list_length(standby_slot_cpy))? > > ===== > > slot.h > > ``` > +extern void WaitForStandbyLSN(XLogRecPtr wait_for_lsn); > ``` > > WaitForStandbyLSN() does not exist. > Done. > ~~~~~~~~~~~~~~~~ > > For 0002: > ===== > > General > > The patch requires that primary_conninfo must contain the dbname, but it > conflicts with documentation. It says: > > ``` > ...Do not specify a database name in the primary_conninfo string. > ``` > > I confirmed [^a] it is harmless that primary_conninfo has dbname, but at least > the description must be fixed. > Done. > General > > I found that primary server output huge amount of logs when the log_min_duration_messages = 0. > This ie because slotsync worker sends an SQL per 10ms, in wait_for_primary_slot_catchup(). 
> Is there any good way to suppress it? Or, should we be patient? > I will review it to see if we can do anything here. > ===== > > ``` > +{ oid => '6312', descr => 'what caused the replication slot to become invalid', > ``` > > How did you determine the oid? IIRC, developping features should use oids in > the range 8000-9999. See src/include/catalog/unused_oids. > Corrected it. > ===== > > LogicalRepCtxStruct > > ``` > /* Background workers. */ > + SlotSyncWorker *ss_workers; /* slot-sync workers */ > LogicalRepWorker workers[FLEXIBLE_ARRAY_MEMBER]; > ``` > > It's OK for now, but can we combine them into an array? IIUC there is no > possibility to exist both of processes and they have same component, so it may > be able to be same. It can reduce an attribute but may lead some > difficulties to read. I feel it will add to more confusion and should be kept separate. > > WaitForReplicationWorkerAttach() and logicalrep_worker_stop_internal() > > I could not find cases that has "LWLock *" as an argument (exception: functions in lwlock.c). > Is it sufficient to check RecoveryInProgress() instead of specifying as arguments? > I feel it should be argument based. If not lock-based then a different arg perhaps. Let us say it is in the process of starting a worker and it failed and now it wants to do cleanup. It should do the cleanup of the worker it attempted to start instead of doing it based on 'RecoveryInProgress'. Latter's value may change if standby is promoted in between resulting in an attempt to do cleanup of the wrong type of worker. > ===== > > wait_for_primary_slot_catchup() > > ``` > + /* Check if this standby is promoted while we are waiting */ > + if (!RecoveryInProgress()) > + { > + /* > + * The remote slot didn't pass the locally reserved position at > + * the time of local promotion, so it's not safe to use. > + */ > + ereport( > + WARNING, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg( > + "slot-sync wait for slot %s interrupted by promotion, " > + "slot creation aborted", remote_slot->name))); > + pfree(cmd.data); > + return false; > + } > ``` > > The part would not be executed if the promote signal is sent after the primary > server crashes. I think walrcv_exec() will detect the failure first. > The function must be wrapped by PG_TRY() and the message must be emitted in > PG_CATCH(). There may be other approaches. walrcv_exec() may fail because of other reasons too. So generalising it to failure msg due to a promotion might not be the correct thing to do. We check if standby is promoted just before walrcv_exec(), so I feel that should suffice. > > wait_for_primary_slot_catchup() > > ``` > + rc = WaitLatch(MyLatch, > + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, > + WORKER_DEFAULT_NAPTIME_MS, > + WAIT_EVENT_REPL_SLOTSYNC_MAIN); > ``` > > New wait event can be added. Done. > > > [1]: https://www.postgresql.org/message-id/CAJpy0uDD%2B9aJnDx9fBfvLvxJtxA7qqoAys4fo6h1tq1b_0_A7Q%40mail.gmail.com > [^a] > > Regarding the secondary side, the libpqrcv_connect() does not do special things > even if the primary_conninfo has dbname="XXX". It adds parameters like > "replication=true" and sends a startup packet. > > As for the primary side, the startup packet is consumed in ProcessStartupPacket(). > It checks whether the process should be a walsender or not (line 2204). > > Then (line 2290) the port->database_name[0] is set as '\0' in case of walsender. > The value is used for setting the process title in BackendInitialize(). 
> > Also, InitPostgres() really sets some global variables like MyDatabaseId, > but it is not occurred when the process is walsender. > This is expected behaviour. The presence of dbname in primary_conninfo should not affect the physical streaming connection; it should only be used for the slotsync worker's connection. That is the case currently: when logical=false (i.e. for physical streaming), we ignore dbname during connection, and when logical=true (slotsync connection), we consider using it. > Best Regards, > Hayato Kuroda > FUJITSU LIMITED >
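Putting the connection requirements above together, a standby that wants slot synchronization needs settings along these lines (values are examples; the dbname is used only by the slot-sync worker's connection and is ignored for physical streaming):

    # postgresql.conf on the standby
    primary_conninfo = 'host=primary port=5432 user=repluser dbname=postgres'
    primary_slot_name = 'sb1_slot'
    hot_standby_feedback = on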
On Mon, Oct 9, 2023 at 9:34 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Wed, Oct 4, 2023 at 8:53 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Here are some review comments for v20-0002. > > > > Thanks Peter for the feedback. Comments from 31 till end are addressed > in v22. First 30 comments will be addressed in the next version. > Thanks for addressing my previous comments. I checked those and went through other changes in v22-0002 to give a few more review comments below. I understand there are some design changes coming soon regarding the use of GUCs so maybe a few of these comments will become redundant. ====== doc/src/sgml/config.sgml 1. A password needs to be provided too, if the sender demands password authentication. It can be provided in the <varname>primary_conninfo</varname> string, or in a separate - <filename>~/.pgpass</filename> file on the standby server (use - <literal>replication</literal> as the database name). - Do not specify a database name in the - <varname>primary_conninfo</varname> string. + <filename>~/.pgpass</filename> file on the standby server. + </para> + <para> + Specify a database name in <varname>primary_conninfo</varname> string + to allow synchronization of slots from the primary to standby. This + dbname will only be used for slots synchronization purpose and will + be irrelevant for streaming. </para> 1a. "Specify a database name in...". Shouldn't that say "Specify dbname in..."? ~ 1b. BEFORE This dbname will only be used for slots synchronization purpose and will be irrelevant for streaming. SUGGESTION This will only be used for slot synchronization. It is ignored for streaming. ====== doc/src/sgml/system-views.sgml 2. pg_replication_slots + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>synced_slot</structfield> <type>bool</type> + </para> + <para> + True if this logical slot is created on physical standby as part of + slot-synchronization from primary server. Always false for physical slots. + </para></entry> + </row> /on physical standby/on the physical standby/ /from primary server/from the primary server/ ====== src/backend/replication/logical/launcher.c 3. LaunchSlotSyncWorkers + /* + * If we failed to launch this slotsync worker, return and try + * launching rest of the workers in next sync cycle. But change + * launcher's wait time to minimum of wal_retrieve_retry_interval and + * default wait time to try next sync-cycle sooner. + */ 3a. Use consistent terms -- choose "sync cycle" or "sync-cycle" ~ 3b. Is it correct to just say "rest of the workers"; won't it also try to relaunch this same failed worker again? ~~~ 4. LauncherMain + /* + * Stop the slot-sync workers if any of the related GUCs changed. + * These will be relaunched using the new values during next + * sync-cycle. Also revalidate the new configurations and + * reconnect. + */ + if (SlotSyncConfigsChanged()) + { + slotsync_workers_stop(); + + if (wrconn) + walrcv_disconnect(wrconn); + + if (RecoveryInProgress()) + wrconn = slotsync_remote_connect(); + } Was it overkill to disconnect/reconnect every time any of those GUCs changed? Or is it enough to do that only if the PrimaryConnInfoPreReload was changed? ====== src/backend/replication/logical/logical.c 5. CreateDecodingContext + /* + * Do not allow consumption of a "synchronized" slot until the standby + * gets promoted. 
+ */ + if (RecoveryInProgress() && slot->data.synced) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot use replication slot \"%s\" for logical decoding", + NameStr(slot->data.name)), + errdetail("This slot is being synced from primary."), + errhint("Specify another replication slot."))); + /from primary/from the primary/ ====== src/backend/replication/logical/slotsync.c 6. use_slot_in_query + /* + * Return TRUE if either slot is not yet created on standby or if it + * belongs to one of the dbs passed in dbids. + */ + if (!slot_found || relevant_db) + return true; + + return false; Same as single line: return (!slot_found || relevant_db); ~~~ 7. synchronize_one_slot + /* + * If the local restart_lsn and/or local catalog_xmin is ahead of + * those on the remote then we cannot create the local slot in sync + * with primary because that would mean moving local slot backwards + * and we might not have WALs retained for old lsns. In this case we + * will wait for primary's restart_lsn and catalog_xmin to catch up + * with the local one before attempting the sync. + */ /moving local slot/moving the local slot/ /with primary/with the primary/ /wait for primary's/wait for the primary's/ ~~~ 8. ProcessSlotSyncInterrupts + if (ConfigReloadPending) + { + ConfigReloadPending = false; + + /* Save the PrimaryConnInfo before reloading */ + *conninfo_prev = pstrdup(PrimaryConnInfo); If the configuration keeps changing then there might be a slow leak here because I didn't notice anywhere where this strdup'ed string is getting freed. Is that something worth worrying about? ====== src/backend/replication/slot.c 9. ReplicationSlotDrop + /* + * Do not allow users to drop the slots which are currently being synced + * from the primary to standby. + */ + if (user_cmd && RecoveryInProgress() && MyReplicationSlot->data.synced) + { + ReplicationSlotRelease(); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot drop replication slot"), + errdetail("This slot is being synced from primary."))); + } + 9a. /to standby/to the standby/ ~ 9b. Shouldn't the errmsg name the slot? Otherwise, the message might not be so useful. ~ 9c. /synced from primary/synced from the primary/ ====== src/backend/replication/walsender.c 10. ListSlotDatabaseOIDs + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); + for (slotno = 0; slotno < max_replication_slots; slotno++) + { + ReplicationSlot *slot = &ReplicationSlotCtl->replication_slots[slotno]; This is all new code so you can use C99 for loop variable declaration here. ~~~ 11. + /* If synchronize_slot_names is '*', then skip physical slots */ + if (SlotIsPhysical(slot)) + continue; + Some mental gymnastics are needed to understand how this code means " synchronize_slot_names is '*'". IMO it would be easier to understand if the previous "if (numslot_names)" was rewritten as if/else. ====== .../utils/activity/wait_event_names.txt 12. RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL to arrive, during streaming recovery." +REPL_SLOTSYNC_MAIN "Waiting in main loop of worker for synchronizing slots to a standby from primary." +REPL_SLOTSYNC_PRIMARY_CATCHP "Waiting for primary to catch-up in worker for synchronizing slots to a standby from primary." SYSLOGGER_MAIN "Waiting in main loop of syslogger process." 12a. Maybe those descriptions can be simplified a bit? SUGGESTION REPL_SLOTSYNC_MAIN "Waiting in the main loop of slot-sync worker." 
REPL_SLOTSYNC_PRIMARY_CATCHP "Waiting for the primary to catch up, in slot-sync worker." ~ 12b. typo? /REPL_SLOTSYNC_PRIMARY_CATCHP/REPL_SLOTSYNC_PRIMARY_CATCHUP/ ====== src/include/replication/walreceiver.h 13. WalRcvRepSlotDbData +/* + * Slot's DBid related data + */ +typedef struct WalRcvRepSlotDbData +{ + Oid database; /* Slot's DBid received from remote */ +} WalRcvRepSlotDbData; Just calling this new field 'database' seems odd. Searching PG src I found typical fields/variables like this one are called 'databaseid', or 'dboid', or 'dbid', or 'db_id' etc. ====== Kind Regards, Peter Smith. Fujitsu Australia
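For comment 9b, one way to name the slot while keeping the early ReplicationSlotRelease() would be to capture the name first, roughly like this (illustrative wording only, not the patch's actual text):

    char    slotname[NAMEDATALEN];

    /* Capture the name before releasing the slot, then report it */
    strlcpy(slotname, NameStr(MyReplicationSlot->data.name), NAMEDATALEN);
    ReplicationSlotRelease();
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("cannot drop replication slot \"%s\"", slotname),
             errdetail("This slot is being synced from the primary.")));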
On Mon, Oct 9, 2023 at 10:51 AM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 10/6/23 6:48 PM, Amit Kapila wrote: > > On Wed, Oct 4, 2023 at 5:34 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> On 10/4/23 1:50 PM, shveta malik wrote: > >>> On Wed, Oct 4, 2023 at 5:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >>>> > >>>> On Wed, Oct 4, 2023 at 11:55 AM Drouvot, Bertrand > >>>> <bertranddrouvot.pg@gmail.com> wrote: > >>>>> > >>>>> On 10/4/23 6:26 AM, shveta malik wrote: > >>>>>> On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >>>>>>> > >>>>>>> > >>>>>>> How about an alternate scheme where we define sync_slot_names on > >>>>>>> standby but then store the physical_slot_name in the corresponding > >>>>>>> logical slot (ReplicationSlotPersistentData) to be synced? So, the > >>>>>>> standby will send the list of 'sync_slot_names' and the primary will > >>>>>>> add the physical standby's slot_name in each of the corresponding > >>>>>>> sync_slot. Now, if we do this then even after restart, we should be > >>>>>>> able to know for which physical slot each logical slot needs to wait. > >>>>>>> We can even provide an SQL API to reset the value of > >>>>>>> standby_slot_names in logical slots as a way to unblock decoding in > >>>>>>> case of emergency (for example, corresponding when physical standby > >>>>>>> never comes up). > >>>>>>> > >>>>>> > >>>>>> > >>>>>> Looks like a better approach to me. It solves most of the pain points like: > >>>>>> 1) Avoids the need of multiple GUCs > >>>>>> 2) Primary and standby need not to worry to be in sync if we maintain > >>>>>> sync-slot-names GUC on both > >>>> > >>>> As per my understanding of this approach, we don't want > >>>> 'sync-slot-names' to be set on the primary. Do you have a different > >>>> understanding? > >>>> > >>> > >>> Same understanding. We do not need it to be set on primary by user. It > >>> will be GUC on standby and standby will convey it to primary. > >> > >> +1, same understanding here. > >> > > > > At PGConf NYC, I had a brief discussion on this topic with Andres > > where yet another approach to achieve this came up. > > Great! > > > Have a parameter > > like enable_failover at the slot level (this will be persistent > > information). Users can set it during the create/alter subscription or > > via pg_create_logical_replication_slot(). Also, on physical standby, > > there will be a parameter like enable_syncslot. All the physical > > standbys that have set enable_syncslot will receive all the logical > > slots that are marked as enable_failover. To me, whether to sync a > > particular slot is a slot-level property, so defining it in this new > > way seems reasonable. > > Yeah, as this is a slot-level property, I agree that this seems reasonable. > > Also that sounds more natural to me with this approach. The primary > is really the one that "drives" which slots can be synced. I like it. > > One could also set enable_failover while creating a logical slot on a physical > standby (so that cascading standbys could also have "extra slot" to sync as > compare to "level 1" standbys). > > > > > I think this will simplify the scheme a bit but still, the list of > > physical standby's for which logical slots wait during decoding needs > > to be maintained as we thought. > > Right. > > > But, how about with the above two > > parameters (enable_failover and enable_syncslot), we have > > standby_slot_names defined on the primary. 
That avoids the need to > > store the list of standby_slot_names in logical slots and simplifies > > the implementation quite a bit, right? > > Agree. > > > Now, one can think if we have a > > parameter like 'standby_slot_names' then why do we need > > enable_syncslot on physical standby but that will be required to > > invoke sync worker which will pull logical slot's information? > > yes and enable_sync slot on the standby could also be used to "pause" > the sync on standbys (by disabling the parameter) if one would want to > (without the need to modify anything on the primary). > > > The > > advantage of having standby_slot_names defined on primary is that we > > can selectively wait on the subset of physical standbys where we are > > syncing the slots. > > Yeah and this flexibility/filtering looks somehow mandatory to me. > > > I think this will be something similar to > > 'synchronous_standby_names' in the sense that the physical standbys > > mentioned in standby_slot_names will behave as synchronous copies with > > respect to slots and after failover user can switch to one of these > > physical standby and others can start following new master/publisher. > > > > Thoughts? > > I like the idea and I think that's the one that seems the more reasonable > to me. I'd vote for this idea with: > > - standby_slot_names on the primary (could also be set on standbys in case of > cascading context) > - enable_failover at logical slot creation + API to enable/disable it at wish > - enable_syncslot on the standbys > Thank you, Amit and Bertrand, for the feedback on the new design. PFA v23 patch set which attempts to implement the new proposed design to handle sync candidates: a) The synchronize_slot_names GUC is removed. Instead, the 'enable_failover' property is added at the slot level, which is persistent. It can be set by the user using the create-subscription command, e.g.: create subscription mysub connection '....' publication mypub WITH (enable_failover = true); b) New GUC enable_syncslot is added on standbys to enable/disable slot-sync on standbys c) standby_slot_names is maintained on the primary. The patch 002 also addresses Peter's comments dated Oct 6 and Oct 10. Thank you, Ajin, for implementing the 'create subscription' command changes to support the 'enable_failover' syntax. This patch has not implemented the items below yet; they will be done in the next version: --Provide support to set/alter enable_failover using alter-subscription and pg_create_logical_replication_slot --Changes needed to support slot-synchronization on cascading standbys --Display "enable_failover" property in pg_replication_slots. I think it makes sense to do this. thanks Shveta
Attachment
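To make the v23 proposal concrete, here is a minimal configuration sketch of how the three pieces described above are meant to fit together. The host, user, slot, subscription, and publication names are placeholders, and the GUC and option spellings are taken from the patch description in this thread, not from a released server:

    # primary, postgresql.conf: physical slots of the standbys that logical
    # decoding of failover-enabled slots should wait for
    standby_slot_names = 'standby1_phys_slot'

    # physical standby, postgresql.conf: turn slot synchronization on
    enable_syncslot = on
    hot_standby_feedback = on                # expected to be on for slot sync
    primary_slot_name = 'standby1_phys_slot'
    # dbname is used only by the slot-sync worker; it is ignored for
    # physical streaming
    primary_conninfo = 'host=primary port=5432 user=repl dbname=postgres'

    -- logical subscriber: mark the subscription's slot as a failover candidate
    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=primary port=5432 user=repl dbname=postgres'
        PUBLICATION mypub
        WITH (enable_failover = true);

With this in place, slots marked enable_failover are the ones the standby's sync worker copies, and standby_slot_names controls which physical standbys the primary waits for, so that logical replication can be resumed from the new primary after a failover.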
On Tue, Oct 10, 2023 at 12:52 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Mon, Oct 9, 2023 at 9:34 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Wed, Oct 4, 2023 at 8:53 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > Here are some review comments for v20-0002. > > > > > > > Thanks Peter for the feedback. Comments from 31 till end are addressed > > in v22. First 30 comments will be addressed in the next version. > > > > Thanks for addressing my previous comments. > > I checked those and went through other changes in v22-0002 to give a > few more review comments below. > Thank You for your feedback. I have addressed these in v23. > I understand there are some design changes coming soon regarding the > use of GUCs so maybe a few of these comments will become redundant. > > ====== > doc/src/sgml/config.sgml > > 1. > A password needs to be provided too, if the sender demands password > authentication. It can be provided in the > <varname>primary_conninfo</varname> string, or in a separate > - <filename>~/.pgpass</filename> file on the standby server (use > - <literal>replication</literal> as the database name). > - Do not specify a database name in the > - <varname>primary_conninfo</varname> string. > + <filename>~/.pgpass</filename> file on the standby server. > + </para> > + <para> > + Specify a database name in <varname>primary_conninfo</varname> string > + to allow synchronization of slots from the primary to standby. This > + dbname will only be used for slots synchronization purpose and will > + be irrelevant for streaming. > </para> > > 1a. > "Specify a database name in...". Shouldn't that say "Specify dbname in..."? > > ~ > > 1b. > BEFORE > This dbname will only be used for slots synchronization purpose and > will be irrelevant for streaming. > > SUGGESTION > This will only be used for slot synchronization. It is ignored for streaming. > > ====== > doc/src/sgml/system-views.sgml > > 2. pg_replication_slots > > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>synced_slot</structfield> <type>bool</type> > + </para> > + <para> > + True if this logical slot is created on physical standby as part of > + slot-synchronization from primary server. Always false for > physical slots. > + </para></entry> > + </row> > > /on physical standby/on the physical standby/ > > /from primary server/from the primary server/ > > ====== > src/backend/replication/logical/launcher.c > > 3. LaunchSlotSyncWorkers > > + /* > + * If we failed to launch this slotsync worker, return and try > + * launching rest of the workers in next sync cycle. But change > + * launcher's wait time to minimum of wal_retrieve_retry_interval and > + * default wait time to try next sync-cycle sooner. > + */ > > 3a. > Use consistent terms -- choose "sync cycle" or "sync-cycle" > > ~ > > 3b. > Is it correct to just say "rest of the workers"; won't it also try to > relaunch this same failed worker again? > > ~~~ > > 4. LauncherMain > > + /* > + * Stop the slot-sync workers if any of the related GUCs changed. > + * These will be relaunched using the new values during next > + * sync-cycle. Also revalidate the new configurations and > + * reconnect. > + */ > + if (SlotSyncConfigsChanged()) > + { > + slotsync_workers_stop(); > + > + if (wrconn) > + walrcv_disconnect(wrconn); > + > + if (RecoveryInProgress()) > + wrconn = slotsync_remote_connect(); > + } > > Was it overkill to disconnect/reconnect every time any of those GUCs > changed? 
Or is it enough to do that only if the > PrimaryConnInfoPreReload was changed? > > ====== > src/backend/replication/logical/logical.c > > 5. CreateDecodingContext > > + /* > + * Do not allow consumption of a "synchronized" slot until the standby > + * gets promoted. > + */ > + if (RecoveryInProgress() && slot->data.synced) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot use replication slot \"%s\" for logical decoding", > + NameStr(slot->data.name)), > + errdetail("This slot is being synced from primary."), > + errhint("Specify another replication slot."))); > + > > /from primary/from the primary/ > > ====== > src/backend/replication/logical/slotsync.c > > 6. use_slot_in_query > > + /* > + * Return TRUE if either slot is not yet created on standby or if it > + * belongs to one of the dbs passed in dbids. > + */ > + if (!slot_found || relevant_db) > + return true; > + > + return false; > > Same as single line: > > return (!slot_found || relevant_db); > > ~~~ > > 7. synchronize_one_slot > > + /* > + * If the local restart_lsn and/or local catalog_xmin is ahead of > + * those on the remote then we cannot create the local slot in sync > + * with primary because that would mean moving local slot backwards > + * and we might not have WALs retained for old lsns. In this case we > + * will wait for primary's restart_lsn and catalog_xmin to catch up > + * with the local one before attempting the sync. > + */ > > /moving local slot/moving the local slot/ > > /with primary/with the primary/ > > /wait for primary's/wait for the primary's/ > > ~~~ > > 8. ProcessSlotSyncInterrupts > > + if (ConfigReloadPending) > + { > + ConfigReloadPending = false; > + > + /* Save the PrimaryConnInfo before reloading */ > + *conninfo_prev = pstrdup(PrimaryConnInfo); > > If the configuration keeps changing then there might be a slow leak > here because I didn't notice anywhere where this strdup'ed string is > getting freed. Is that something worth worrying about? > > ====== > src/backend/replication/slot.c > > 9. ReplicationSlotDrop > > + /* > + * Do not allow users to drop the slots which are currently being synced > + * from the primary to standby. > + */ > + if (user_cmd && RecoveryInProgress() && MyReplicationSlot->data.synced) > + { > + ReplicationSlotRelease(); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot drop replication slot"), > + errdetail("This slot is being synced from primary."))); > + } > + > > 9a. > /to standby/to the standby/ > > ~ > > 9b. > Shouldn't the errmsg name the slot? Otherwise, the message might not > be so useful. > > ~ > > 9c. > /synced from primary/synced from the primary/ > > ====== > src/backend/replication/walsender.c > > > 10. ListSlotDatabaseOIDs > > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > + for (slotno = 0; slotno < max_replication_slots; slotno++) > + { > + ReplicationSlot *slot = &ReplicationSlotCtl->replication_slots[slotno]; > > This is all new code so you can use C99 for loop variable declaration here. > > ~~~ > > 11. > + /* If synchronize_slot_names is '*', then skip physical slots */ > + if (SlotIsPhysical(slot)) > + continue; > + > > > Some mental gymnastics are needed to understand how this code means " > synchronize_slot_names is '*'". > > IMO it would be easier to understand if the previous "if > (numslot_names)" was rewritten as if/else. > > ====== > .../utils/activity/wait_event_names.txt > > 12. 
> RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL > to arrive, during streaming recovery." > +REPL_SLOTSYNC_MAIN "Waiting in main loop of worker for synchronizing > slots to a standby from primary." > +REPL_SLOTSYNC_PRIMARY_CATCHP "Waiting for primary to catch-up in > worker for synchronizing slots to a standby from primary." > SYSLOGGER_MAIN "Waiting in main loop of syslogger process." > > 12a. > Maybe those descriptions can be simplified a bit? > > SUGGESTION > REPL_SLOTSYNC_MAIN "Waiting in the main loop of slot-sync worker." > REPL_SLOTSYNC_PRIMARY_CATCHP "Waiting for the primary to catch up, in > slot-sync worker." > > ~ > > 12b. > typo? > > /REPL_SLOTSYNC_PRIMARY_CATCHP/REPL_SLOTSYNC_PRIMARY_CATCHUP/ > > ====== > src/include/replication/walreceiver.h > > 13. WalRcvRepSlotDbData > > +/* > + * Slot's DBid related data > + */ > +typedef struct WalRcvRepSlotDbData > +{ > + Oid database; /* Slot's DBid received from remote */ > +} WalRcvRepSlotDbData; > > Just calling this new field 'database' seems odd. Searching PG src I > found typical fields/variables like this one are called 'databaseid', > or 'dboid', or 'dbid', or 'db_id' etc. > > ====== > Kind Regards, > Peter Smith. > Fujitsu Australia
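As a small illustration of the synced_slot column discussed in comment 2 above, a query like the following on the physical standby would show which logical slots were created there by slot synchronization. The column name follows the quoted documentation change; this is only a sketch against the patched view:

    -- on the physical standby
    SELECT slot_name, database, slot_type, synced_slot
      FROM pg_replication_slots
     WHERE synced_slot;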
On Tue, Oct 10, 2023 at 12:52 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Mon, Oct 9, 2023 at 9:34 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Wed, Oct 4, 2023 at 8:53 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > Here are some review comments for v20-0002. > > > > > > > Thanks Peter for the feedback. Comments from 31 till end are addressed > > in v22. First 30 comments will be addressed in the next version. > > > > Thanks for addressing my previous comments. > > I checked those and went through other changes in v22-0002 to give a > few more review comments below. > > I understand there are some design changes coming soon regarding the > use of GUCs so maybe a few of these comments will become redundant. > > ====== > doc/src/sgml/config.sgml > > 1. > A password needs to be provided too, if the sender demands password > authentication. It can be provided in the > <varname>primary_conninfo</varname> string, or in a separate > - <filename>~/.pgpass</filename> file on the standby server (use > - <literal>replication</literal> as the database name). > - Do not specify a database name in the > - <varname>primary_conninfo</varname> string. > + <filename>~/.pgpass</filename> file on the standby server. > + </para> > + <para> > + Specify a database name in <varname>primary_conninfo</varname> string > + to allow synchronization of slots from the primary to standby. This > + dbname will only be used for slots synchronization purpose and will > + be irrelevant for streaming. > </para> > > 1a. > "Specify a database name in...". Shouldn't that say "Specify dbname in..."? > > ~ > > 1b. > BEFORE > This dbname will only be used for slots synchronization purpose and > will be irrelevant for streaming. > > SUGGESTION > This will only be used for slot synchronization. It is ignored for streaming. > > ====== > doc/src/sgml/system-views.sgml > > 2. pg_replication_slots > > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>synced_slot</structfield> <type>bool</type> > + </para> > + <para> > + True if this logical slot is created on physical standby as part of > + slot-synchronization from primary server. Always false for > physical slots. > + </para></entry> > + </row> > > /on physical standby/on the physical standby/ > > /from primary server/from the primary server/ > > ====== > src/backend/replication/logical/launcher.c > > 3. LaunchSlotSyncWorkers > > + /* > + * If we failed to launch this slotsync worker, return and try > + * launching rest of the workers in next sync cycle. But change > + * launcher's wait time to minimum of wal_retrieve_retry_interval and > + * default wait time to try next sync-cycle sooner. > + */ > > 3a. > Use consistent terms -- choose "sync cycle" or "sync-cycle" > > ~ > > 3b. > Is it correct to just say "rest of the workers"; won't it also try to > relaunch this same failed worker again? > > ~~~ > > 4. LauncherMain > > + /* > + * Stop the slot-sync workers if any of the related GUCs changed. > + * These will be relaunched using the new values during next > + * sync-cycle. Also revalidate the new configurations and > + * reconnect. > + */ > + if (SlotSyncConfigsChanged()) > + { > + slotsync_workers_stop(); > + > + if (wrconn) > + walrcv_disconnect(wrconn); > + > + if (RecoveryInProgress()) > + wrconn = slotsync_remote_connect(); > + } > > Was it overkill to disconnect/reconnect every time any of those GUCs > changed? Or is it enough to do that only if the > PrimaryConnInfoPreReload was changed? 
> The intent is to re-validate all the related GUCs and then decide if we want to carry on with the slot-sync task or leave it as is. > ====== > src/backend/replication/logical/logical.c > > 5. CreateDecodingContext > > + /* > + * Do not allow consumption of a "synchronized" slot until the standby > + * gets promoted. > + */ > + if (RecoveryInProgress() && slot->data.synced) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot use replication slot \"%s\" for logical decoding", > + NameStr(slot->data.name)), > + errdetail("This slot is being synced from primary."), > + errhint("Specify another replication slot."))); > + > > /from primary/from the primary/ > > ====== > src/backend/replication/logical/slotsync.c > > 6. use_slot_in_query > > + /* > + * Return TRUE if either slot is not yet created on standby or if it > + * belongs to one of the dbs passed in dbids. > + */ > + if (!slot_found || relevant_db) > + return true; > + > + return false; > > Same as single line: > > return (!slot_found || relevant_db); > > ~~~ > > 7. synchronize_one_slot > > + /* > + * If the local restart_lsn and/or local catalog_xmin is ahead of > + * those on the remote then we cannot create the local slot in sync > + * with primary because that would mean moving local slot backwards > + * and we might not have WALs retained for old lsns. In this case we > + * will wait for primary's restart_lsn and catalog_xmin to catch up > + * with the local one before attempting the sync. > + */ > > /moving local slot/moving the local slot/ > > /with primary/with the primary/ > > /wait for primary's/wait for the primary's/ > > ~~~ > > 8. ProcessSlotSyncInterrupts > > + if (ConfigReloadPending) > + { > + ConfigReloadPending = false; > + > + /* Save the PrimaryConnInfo before reloading */ > + *conninfo_prev = pstrdup(PrimaryConnInfo); > > If the configuration keeps changing then there might be a slow leak > here because I didn't notice anywhere where this strdup'ed string is > getting freed. Is that something worth worrying about? > > ====== > src/backend/replication/slot.c > > 9. ReplicationSlotDrop > > + /* > + * Do not allow users to drop the slots which are currently being synced > + * from the primary to standby. > + */ > + if (user_cmd && RecoveryInProgress() && MyReplicationSlot->data.synced) > + { > + ReplicationSlotRelease(); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot drop replication slot"), > + errdetail("This slot is being synced from primary."))); > + } > + > > 9a. > /to standby/to the standby/ > > ~ > > 9b. > Shouldn't the errmsg name the slot? Otherwise, the message might not > be so useful. > > ~ > > 9c. > /synced from primary/synced from the primary/ > > ====== > src/backend/replication/walsender.c > > > 10. ListSlotDatabaseOIDs > > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > + for (slotno = 0; slotno < max_replication_slots; slotno++) > + { > + ReplicationSlot *slot = &ReplicationSlotCtl->replication_slots[slotno]; > > This is all new code so you can use C99 for loop variable declaration here. > > ~~~ > > 11. > + /* If synchronize_slot_names is '*', then skip physical slots */ > + if (SlotIsPhysical(slot)) > + continue; > + > > > Some mental gymnastics are needed to understand how this code means " > synchronize_slot_names is '*'". > > IMO it would be easier to understand if the previous "if > (numslot_names)" was rewritten as if/else. > > ====== > .../utils/activity/wait_event_names.txt > > 12. 
> RECOVERY_WAL_STREAM "Waiting in main loop of startup process for WAL > to arrive, during streaming recovery." > +REPL_SLOTSYNC_MAIN "Waiting in main loop of worker for synchronizing > slots to a standby from primary." > +REPL_SLOTSYNC_PRIMARY_CATCHP "Waiting for primary to catch-up in > worker for synchronizing slots to a standby from primary." > SYSLOGGER_MAIN "Waiting in main loop of syslogger process." > > 12a. > Maybe those descriptions can be simplified a bit? > > SUGGESTION > REPL_SLOTSYNC_MAIN "Waiting in the main loop of slot-sync worker." > REPL_SLOTSYNC_PRIMARY_CATCHP "Waiting for the primary to catch up, in > slot-sync worker." > > ~ > > 12b. > typo? > > /REPL_SLOTSYNC_PRIMARY_CATCHP/REPL_SLOTSYNC_PRIMARY_CATCHUP/ > > ====== > src/include/replication/walreceiver.h > > 13. WalRcvRepSlotDbData > > +/* > + * Slot's DBid related data > + */ > +typedef struct WalRcvRepSlotDbData > +{ > + Oid database; /* Slot's DBid received from remote */ > +} WalRcvRepSlotDbData; > > Just calling this new field 'database' seems odd. Searching PG src I > found typical fields/variables like this one are called 'databaseid', > or 'dboid', or 'dbid', or 'db_id' etc. > > ====== > Kind Regards, > Peter Smith. > Fujitsu Australia
On Thu, Oct 12, 2023 at 9:18 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Oct 9, 2023 at 10:51 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On 10/6/23 6:48 PM, Amit Kapila wrote: > > > On Wed, Oct 4, 2023 at 5:34 PM Drouvot, Bertrand > > > <bertranddrouvot.pg@gmail.com> wrote: > > >> > > >> On 10/4/23 1:50 PM, shveta malik wrote: > > >>> On Wed, Oct 4, 2023 at 5:00 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > >>>> > > >>>> On Wed, Oct 4, 2023 at 11:55 AM Drouvot, Bertrand > > >>>> <bertranddrouvot.pg@gmail.com> wrote: > > >>>>> > > >>>>> On 10/4/23 6:26 AM, shveta malik wrote: > > >>>>>> On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > >>>>>>> > > >>>>>>> > > >>>>>>> How about an alternate scheme where we define sync_slot_names on > > >>>>>>> standby but then store the physical_slot_name in the corresponding > > >>>>>>> logical slot (ReplicationSlotPersistentData) to be synced? So, the > > >>>>>>> standby will send the list of 'sync_slot_names' and the primary will > > >>>>>>> add the physical standby's slot_name in each of the corresponding > > >>>>>>> sync_slot. Now, if we do this then even after restart, we should be > > >>>>>>> able to know for which physical slot each logical slot needs to wait. > > >>>>>>> We can even provide an SQL API to reset the value of > > >>>>>>> standby_slot_names in logical slots as a way to unblock decoding in > > >>>>>>> case of emergency (for example, corresponding when physical standby > > >>>>>>> never comes up). > > >>>>>>> > > >>>>>> > > >>>>>> > > >>>>>> Looks like a better approach to me. It solves most of the pain points like: > > >>>>>> 1) Avoids the need of multiple GUCs > > >>>>>> 2) Primary and standby need not to worry to be in sync if we maintain > > >>>>>> sync-slot-names GUC on both > > >>>> > > >>>> As per my understanding of this approach, we don't want > > >>>> 'sync-slot-names' to be set on the primary. Do you have a different > > >>>> understanding? > > >>>> > > >>> > > >>> Same understanding. We do not need it to be set on primary by user. It > > >>> will be GUC on standby and standby will convey it to primary. > > >> > > >> +1, same understanding here. > > >> > > > > > > At PGConf NYC, I had a brief discussion on this topic with Andres > > > where yet another approach to achieve this came up. > > > > Great! > > > > > Have a parameter > > > like enable_failover at the slot level (this will be persistent > > > information). Users can set it during the create/alter subscription or > > > via pg_create_logical_replication_slot(). Also, on physical standby, > > > there will be a parameter like enable_syncslot. All the physical > > > standbys that have set enable_syncslot will receive all the logical > > > slots that are marked as enable_failover. To me, whether to sync a > > > particular slot is a slot-level property, so defining it in this new > > > way seems reasonable. > > > > Yeah, as this is a slot-level property, I agree that this seems reasonable. > > > > Also that sounds more natural to me with this approach. The primary > > is really the one that "drives" which slots can be synced. I like it. > > > > One could also set enable_failover while creating a logical slot on a physical > > standby (so that cascading standbys could also have "extra slot" to sync as > > compare to "level 1" standbys). 
> > > > > > > > I think this will simplify the scheme a bit but still, the list of > > > physical standby's for which logical slots wait during decoding needs > > > to be maintained as we thought. > > > > Right. > > > > > But, how about with the above two > > > parameters (enable_failover and enable_syncslot), we have > > > standby_slot_names defined on the primary. That avoids the need to > > > store the list of standby_slot_names in logical slots and simplifies > > > the implementation quite a bit, right? > > > > Agree. > > > > > Now, one can think if we have a > > > parameter like 'standby_slot_names' then why do we need > > > enable_syncslot on physical standby but that will be required to > > > invoke sync worker which will pull logical slot's information? > > > > yes and enable_sync slot on the standby could also be used to "pause" > > the sync on standbys (by disabling the parameter) if one would want to > > (without the need to modify anything on the primary). > > > > > The > > > advantage of having standby_slot_names defined on primary is that we > > > can selectively wait on the subset of physical standbys where we are > > > syncing the slots. > > > > Yeah and this flexibility/filtering looks somehow mandatory to me. > > > > > I think this will be something similar to > > > 'synchronous_standby_names' in the sense that the physical standbys > > > mentioned in standby_slot_names will behave as synchronous copies with > > > respect to slots and after failover user can switch to one of these > > > physical standby and others can start following new master/publisher. > > > > > > Thoughts? > > > > I like the idea and I think that's the one that seems the more reasonable > > to me. I'd vote for this idea with: > > > > - standby_slot_names on the primary (could also be set on standbys in case of > > cascading context) > > - enable_failover at logical slot creation + API to enable/disable it at wish > > - enable_syncslot on the standbys > > > > Thank You Amit and Bertrand for feedback on the new design. > > PFA v23 patch set which attempts to implement the new proposed design > to handle sync candidates: > a) The synchronize_slot_names GUC is removed. Instead the > 'enable_failover' property is added at the slot level which is > persistent. It can be set by the user using create-subscription > command. eg: create subscription mysub connection '....' publication > mypub WITH (enable_failover = true); > b) New GUC enable_syncslot is added on standbys to enable disable > slot-sync on standbys > c) standby_slot_names are maintained on primary. > > The patch 002 also addresses Peter's comments dated Oct 6 and Oct10. > > Thank You Ajin for implementing 'create subscription' cmd changes to > support 'enable_failover' syntax. > > This patch has not implemented below yet, it will be done in next version: > --Provide support to set/alter enable_failover using > alter-subscription and pg_create_logical_replication_slot > --Changes needed to support slot-synchronization on cascading standbys > --Display "enable_failover" property in pg_replication_slots. I think > it makes sense to do this. > > thanks > Shveta PFA v24 patch set which has below changes: 1) 'enable_failover' displayed in pg_replication_slots. 2) Support for 'enable_failover' in pg_create_logical_replication_slot(). It is an optional argument with default value false. 3) Addressed pending comments (1-30) from Peter in [1]. 4) Fixed an issue in patch002 due to which even slots with enable_failover=false were getting synced. 
The changes for 1 and 2 are in patch001, while 3 and 4 are in patch002. Thanks, Ajin, for working on 1 and 3. [1]: https://www.postgresql.org/message-id/CAHut%2BPtbb3Ydx40a0p7Qovvp-4cC4ZCDreGRjmFzou8mjh2PmA%40mail.gmail.com Next to do: --Support for enable_failover in alter-subscription. --Support for slot-sync on cascading standbys. thanks Shveta
Attachment
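A short sketch of the SQL-level pieces added in v24, assuming both the optional function argument and the view column are exposed under the name enable_failover; the slot and plugin names are placeholders:

    -- create a failover-enabled logical slot directly; the new argument
    -- defaults to false when omitted
    SELECT pg_create_logical_replication_slot('myslot', 'pgoutput',
                                               enable_failover => true);

    -- confirm the property is displayed in the view
    SELECT slot_name, enable_failover
      FROM pg_replication_slots
     WHERE slot_name = 'myslot';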
On Wed, Oct 4, 2023 at 2:23 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v20-0002. > These comments below have been addressed in patch v24 posted by Shveta. > ====== > 1. GENERAL - errmsg/elog messages > > There are a a lot of minor problems and/or quirks across all the > message texts. Here is a summary of some I found: > > ERROR > errmsg("could not receive list of slots from the primary server: %s", > errmsg("invalid response from primary server"), > errmsg("invalid connection string syntax: %s", > errmsg("replication slot-sync worker slot %d is empty, cannot attach", > errmsg("replication slot-sync worker slot %d is already used by > another worker, cannot attach", > errmsg("replication slot-sync worker slot %d is already used by > another worker, cannot attach", > errmsg("could not connect to the primary server: %s", > > errmsg("operation not permitted on replication slots on standby which > are synchronized from primary"))); > /primary/the primary/ > == comment no longer part of patch. > errmsg("could not fetch invalidation cuase for slot \"%s\" from primary: %s", > /cuase/cause/ > /primary/the primary/ > == fixed > errmsg("slot \"%s\" disapeared from the primary", > /disapeared/disappeared/ > == fixed > errmsg("could not fetch slot info from the primary: %s", > errmsg("could not connect to the primary server: %s", err))); > errmsg("could not map dynamic shared memory segment for slot-sync worker"))); > > errmsg("physical replication slot %s found in synchronize_slot_names", > slot name not quoted? > --- == comment no longer part of patch > > WARNING > errmsg("out of background worker slots"), > > errmsg("Replication slot-sync worker failed to attach to worker-pool slot %d", > case? > > errmsg("Removed database %d from replication slot-sync worker %d; > dbcount now: %d", > case? > > errmsg("Skipping slots synchronization as primary_slot_name is not set.")); > case? > > errmsg("Skipping slots synchronization as hot_standby_feedback is off.")); > case? > > errmsg("Skipping slots synchronization as dbname is not specified in > primary_conninfo.")); > case? > > errmsg("slot-sync wait for slot %s interrupted by promotion, slot > creation aborted", > > errmsg("could not fetch slot info for slot \"%s\" from primary: %s", > /primary/the primary/ > > errmsg("slot \"%s\" disappeared from the primary, aborting slot creation", > errmsg("slot \"%s\" invalidated on primary, aborting slot creation", > > errmsg("slot-sync for slot %s interrupted by promotion, sync not possible", > slot name not quoted? > == fixed > errmsg("skipping sync of slot \"%s\" as the received slot-sync lsn > %X/%X is ahead of the standby position %X/%X", > > errmsg("not synchronizing slot %s; synchronization would move it backward", > slot name not quoted? > /backward/backwards/ > == comment is no longer part of the patch. > --- > > LOG > errmsg("Added database %d to replication slot-sync worker %d; dbcount now: %d", > errmsg("Added database %d to replication slot-sync worker %d; dbcount now: %d", > errmsg("Stopping replication slot-sync worker %d", > errmsg("waiting for remote slot \"%s\" LSN (%u/%X) and catalog xmin > (%u) to pass local slot LSN (%u/%X) and and catalog xmin (%u)", > > errmsg("wait over for remote slot \"%s\" as its LSN (%X/%X)and catalog > xmin (%u) has now passed local slot LSN (%X/%X) and catalog xmin > (%u)", > missing spaces? > == fixed > elog(LOG, "Dropped replication slot \"%s\" ", > extra space? > why this one is elog but others are not? 
> > elog(LOG, "Replication slot-sync worker %d is shutting down on > receiving SIGINT", MySlotSyncWorker->slot); > case? > why this one is elog but others are not? > > elog(LOG, "Replication slot-sync worker %d started", worker_slot); > case? > why this one is elog but others are not? > ---- > == changed these to ereports. > DEBUG1 > errmsg("allocated dsa for slot-sync worker for dbcount: %d" > worker number not given? > should be elog? > > errmsg_internal("logical replication launcher started") > should be elog? > == changed to elog > ---- > > DEBUG2 > elog(DEBUG2, "slot-sync worker%d's query:%s \n", > missing space after 'worker' > extra space before \n > > ====== > .../libpqwalreceiver/libpqwalreceiver.c > > 2. libpqrcv_get_dbname_from_conninfo > > +/* > + * Get database name from primary conninfo. > + * > + * If dbanme is not found in connInfo, return NULL value. > + * The caller should take care of handling NULL value. > + */ > +static char * > +libpqrcv_get_dbname_from_conninfo(const char *connInfo) > > 2a. > /dbanme/dbname/ > == fixed > ~ > > 2b. > "The caller should take care of handling NULL value." > > IMO this is not very useful; it's like saying "caller must handle > function return values". > == removed > ~~~ > > 3. > + for (opt = opts; opt->keyword != NULL; ++opt) > + { > + /* Ignore connection options that are not present. */ > + if (opt->val == NULL) > + continue; > + > + if (strcmp(opt->keyword, "dbname") == 0 && opt->val[0] != '\0') > + { > + dbname = pstrdup(opt->val); > + } > + } > > 3a. > If there are multiple "dbname" in the conninfo then it will be the > LAST one that is returned. > > Judging by my quick syntax experiment (below) this seemed like the > correct thing to do, but I think there should be some comment to > explain about it. > > test_sub=# create subscription sub1 connection 'dbname=foo dbname=bar > dbname=test_pub' publication pub1; > 2023-09-28 19:15:15.012 AEST [23997] WARNING: subscriptions created > by regression test cases should have names starting with "regress_" > WARNING: subscriptions created by regression test cases should have > names starting with "regress_" > NOTICE: created replication slot "sub1" on publisher > CREATE SUBSCRIPTION > == added a comment saying that the last dbname would be selected. > ~ > > 3b. > The block brackets {} are not needed for the single statement. > == fixed > ~ > > 3c. > Since there is only one keyword of interest here it seemed overkill to > have a separate 'continue' check. Why not do everything in one line: > > for (opt = opts; opt->keyword != NULL; ++opt) > { > if (strcmp(opt->keyword, "dbname") == 0 && opt->val && opt->val[0] != '\0') > dbname = pstrdup(opt->val); > } > == fixed. > ====== > src/backend/replication/logical/launcher.c > > 4. > +/* > + * The local variables to store the current values of slot-sync related GUCs > + * before each ConfigReload. > + */ > +static char *PrimaryConnInfoPreReload = NULL; > +static char *PrimarySlotNamePreReload = NULL; > +static char *SyncSlotNamesPreReload = NULL; > > /The local variables/Local variables/ > == fixed. > ~~~ > > 5. fwd declare > > static void logicalrep_worker_cleanup(LogicalRepWorker *worker); > +static void slotsync_worker_cleanup(SlotSyncWorker *worker); > static int logicalrep_pa_worker_count(Oid subid); > > 5a. > Hmmn, I think there were lot more added static functions than just this one. > > e.g. what about all these? 
> static SlotSyncWorker *slotsync_worker_find > static dsa_handle slotsync_dsa_setup > static bool slotsync_worker_launch_or_reuse > static void slotsync_worker_stop_internal > static void slotsync_workers_stop > static void slotsync_remove_obsolete_dbs > static WalReceiverConn *primary_connect > static void SaveCurrentSlotSyncConfigs > static bool SlotSyncConfigsChanged > static void ApplyLauncherStartSlotSync > static void ApplyLauncherStartSubs > > ~ > > 5b. > There are inconsistent name style used for the new static functions -- > e.g. snake_case versus CamelCase. > == fixed. > ~~~ > > 6. WaitForReplicationWorkerAttach > > int rc; > + bool is_slotsync_worker = (lock == SlotSyncWorkerLock) ? true : false; > > This seemed a hacky way to distinguish the sync-slot workers from > other kinds of workers. Wouldn't it be better to pass another > parameter to this function? > == This was discussed and this seemed to be a simple way of doing this. > ~~~ > > 7. slotsync_worker_attach > > It looks like almost a clone of the logicalrep_worker_attach. Seems a > shame if cannot make use of common code. > == this was attempted but was found to require a lot of if conditions. > ~~~ > > 8. slotsync_worker_find > > + * Walks the slot-sync workers pool and searches for one that matches given > + * dbid. Since one worker can manage multiple dbs, so it walks the db array in > + * each worker to find the match. > > 8a. > SUGGESTION > Searches the slot-sync worker pool for the worker who manages the > specified dbid. Because a worker can manage multiple dbs, also walk > the db array of each worker to find the match. > > ~ > > 8b. > Should the comment also say something like "Returns NULL if no > matching worker is found." > == fixed > ~~~ > > 9. > + /* Search for attached worker for a given dbid */ > > SUGGESTION > Search for an attached worker managing the given dbid. > == fixed > ~~~ > > 10. > +{ > + int i; > + SlotSyncWorker *res = NULL; > + Oid *dbids; > + > + Assert(LWLockHeldByMeInMode(SlotSyncWorkerLock, LW_SHARED)); > + > + /* Search for attached worker for a given dbid */ > + for (i = 0; i < max_slotsync_workers; i++) > + { > + SlotSyncWorker *w = &LogicalRepCtx->ss_workers[i]; > + int cnt; > + > + if (!w->hdr.in_use) > + continue; > + > + dbids = (Oid *) dsa_get_address(w->dbids_dsa, w->dbids_dp); > + for (cnt = 0; cnt < w->dbcount; cnt++) > + { > + Oid wdbid = dbids[cnt]; > + > + if (wdbid == dbid) > + { > + res = w; > + break; > + } > + } > + > + /* If worker is found, break the outer loop */ > + if (res) > + break; > + } > + > + return res; > +} > > IMO this logical can be simplified a lot: > - by not using the 'res' variable; directly return instead. > - also moved the 'dbids' declaration. > - and 'cnt' variable seems not meaningful; replace with 'dbidx' for > the db array index IMO. > > For example (25 lines instead of 35 lines) > > { > int i; > > Assert(LWLockHeldByMeInMode(SlotSyncWorkerLock, LW_SHARED)); > > /* Search for an attached worker managing the given dbid. */ > for (i = 0; i < max_slotsync_workers; i++) > { > SlotSyncWorker *w = &LogicalRepCtx->ss_workers[i]; > int dbidx; > Oid *dbids; > > if (!w->hdr.in_use) > continue; > > dbids = (Oid *) dsa_get_address(w->dbids_dsa, w->dbids_dp); > for (dbidx = 0; dbidx < w->dbcount; dbidx++) > { > if (dbids[dbidx] == dbid) > return w; > } > } > > return NULL; > } > == fixed > ~~~ > > 11. slot_sync_dsa_setup > > +/* > + * Setup DSA for slot-sync worker. > + * > + * DSA is needed for dbids array. 
Since max number of dbs a worker can manage > + * is not known, so initially fixed size to hold DB_PER_WORKER_ALLOC_INIT > + * dbs is allocated. If this size is exhausted, it can be extended using > + * dsa free and allocate routines. > + */ > +static dsa_handle > +slotsync_dsa_setup(SlotSyncWorker *worker, int alloc_db_count) > > 11a. > SUGGESTION > DSA is used for the dbids array. Because the maximum number of dbs a > worker can manage is not known, initially enough memory for > DB_PER_WORKER_ALLOC_INIT dbs is allocated. If this size is exhausted, > it can be extended using dsa free and allocate routines. > == fixed > ~ > > 11b. > It doesn't make sense for the comment to say DB_PER_WORKER_ALLOC_INIT > is the initial allocation, but then the function has a parameter > 'alloc_db_count' (which is always passed as DB_PER_WORKER_ALLOC_INIT). > IMO revemo the 2nd parameter from this function and hardwire the > initial allocation same as what the function comment says. > == fixed > ~~~ > > 12. > + /* Be sure any memory allocated by DSA routines is persistent. */ > + oldcontext = MemoryContextSwitchTo(TopMemoryContext); > > /Be sure any memory/Ensure the memory/ > == fixed > ~~~ > > 13. slotsync_worker_launch_or_reuse > > +/* > + * Slot-sync worker launch or reuse > + * > + * Start new slot-sync background worker from the pool of available workers > + * going by max_slotsync_workers count. If the worker pool is exhausted, > + * reuse the existing worker with minimum number of dbs. The idea is to > + * always distribute the dbs equally among launched workers. > + * If initially allocated dbids array is exhausted for the selected worker, > + * reallocate the dbids array with increased size and copy the existing > + * dbids to it and assign the new one as well. > + * > + * Returns true on success, false on failure. > + */ > > /going by/limited by/ (??) > == fixed > ~~~ > > 14. > + BackgroundWorker bgw; > + BackgroundWorkerHandle *bgw_handle; > + uint16 generation; > + SlotSyncWorker *worker = NULL; > + uint32 mindbcnt = 0; > + uint32 alloc_count = 0; > + uint32 copied_dbcnt = 0; > + Oid *copied_dbids = NULL; > + int worker_slot = -1; > + dsa_handle handle; > + Oid *dbids; > + int i; > + bool attach; > > IIUC many of these variables can be declared at a different scope in > this function, so they will be closer to where they are used. > == fixed > ~~~ > > 15. > + /* > + * We need to do the modification of the shared memory under lock so that > + * we have consistent view. > + */ > + LWLockAcquire(SlotSyncWorkerLock, LW_EXCLUSIVE); > > The current comment seems too much. > > SUGGESTION > The shared memory must only be modified under lock. > == fixed > ~~~ > > 16. > + /* Find unused worker slot. */ > + for (i = 0; i < max_slotsync_workers; i++) > + { > + SlotSyncWorker *w = &LogicalRepCtx->ss_workers[i]; > + > + if (!w->hdr.in_use) > + { > + worker = w; > + worker_slot = i; > + break; > + } > + } > + > + /* > + * If all the workers are currently in use. Find the one with minimum > + * number of dbs and use that. > + */ > + if (!worker) > + { > + for (i = 0; i < max_slotsync_workers; i++) > + { > + SlotSyncWorker *w = &LogicalRepCtx->ss_workers[i]; > + > + if (i == 0) > + { > + mindbcnt = w->dbcount; > + worker = w; > + worker_slot = i; > + } > + else if (w->dbcount < mindbcnt) > + { > + mindbcnt = w->dbcount; > + worker = w; > + worker_slot = i; > + } > + } > + } > > Why not combine these 2 loops, to avoid iterating over the same slots > twice? 
Then, exit the loop immediately if unused worker found, > otherwise if reach the end of loop having not found anything unused > then you will already know the one having least dbs. > == fixed > ~~~ > > 17. > + /* Remember the old dbids before we reallocate dsa. */ > + copied_dbcnt = worker->dbcount; > + copied_dbids = (Oid *) palloc0(worker->dbcount * sizeof(Oid)); > + memcpy(copied_dbids, dbids, worker->dbcount * sizeof(Oid)); > > 17a. > Who frees this copied_dbids memory when you are finished needed it. It > seems allocated in the TopMemoryContext so IIUC this is a leak. > == fixed > ~ > > 17b. > These are the 'old' values. Not the 'copied' values. The copied_xxx > variable names seem misleading. > == fixed > ~~~ > > 18. > + /* Prepare the new worker. */ > + worker->hdr.launch_time = GetCurrentTimestamp(); > + worker->hdr.in_use = true; > > If a new worker is required then the launch_time is set like above. > > + { > + slot_db_data->last_launch_time = now; > + > + slotsync_worker_launch_or_reuse(slot_db_data->database); > + } > > Meanwhile, at the caller of slotsync_worker_launch_or_reuse(), the > dbid launch_time was already set as well. And those two timestamps are > almost (but not quite) the same value. Isn't that a bit strange? > == in the caller, the purpose of the timestamp is to calculate how long to wait before retrying. > ~~~ > > 19. > + /* Initial DSA setup for dbids array to hold DB_PER_WORKER_ALLOC_INIT dbs */ > + handle = slotsync_dsa_setup(worker, DB_PER_WORKER_ALLOC_INIT); > + dbids = (Oid *) dsa_get_address(worker->dbids_dsa, worker->dbids_dp); > + > + dbids[worker->dbcount++] = dbid; > > Where was this worker->dbcount assigned to 0? > > Maybe it's better to do this explicity under the "/* Prepare the new > worker. */" comment. > == dbcount is assigned 0 in the function called two lines above - slotsync_dsa_setup() > ~~~ > > 20. > + if (!attach) > + ereport(WARNING, > + (errmsg("Replication slot-sync worker failed to attach to " > + "worker-pool slot %d", worker_slot))); > + > + /* Attach is done, now safe to log that the worker is managing dbid */ > + if (attach) > + ereport(LOG, > + (errmsg("Added database %d to replication slot-sync " > + "worker %d; dbcount now: %d", > + dbid, worker_slot, worker->dbcount))); > > 20a. > IMO this should be coded as "if (attach) ...; else ..." > == fixed. > ~ > > 99b. > In other code if it failed to register then slotsync_worker_cleanup > code is called. How come similar code is not done when fails to > attach? > == WaitForReplicationWorkerAttach does the cleanup before returning false. > ~~~ > > 21. slotsync_worker_stop_internal > > +/* > + * Internal function to stop the slot-sync worker and wait until it detaches > + * from the slot-sync worker-pool slot. > + */ > +static void > +slotsync_worker_stop_internal(SlotSyncWorker *worker) > > IIUC this function does a bit more than what the function comment > says. IIUC (again) I think the "detached" worker slot will still be > flagged as 'inUse' but this function then does the extra step of > calling slotsync_worker_cleanup() function to make the worker slot > available for next process that needs it, am I correct? > > In this regard, this function seems a lot more like > logicalrep_worker_detach() function comment, so there seems some kind > of muddling of the different function names here... (??). > == modified the comment to mention the cleanup. > ~~~ > > 22. 
slotsync_remove_obsolete_dbs > > This function says: > +/* > + * Slot-sync workers remove obsolete DBs from db-list > + * > + * If the DBIds fetched from the primary are lesser than the ones being managed > + * by slot-sync workers, remove extra dbs from worker's db-list. This > may happen > + * if some slots are removed on primary but 'synchronize_slot_names' has not > + * been changed yet. > + */ > +static void > +slotsync_remove_obsolete_dbs(List *remote_dbs) > > But, there was another similar logic function too: > > +/* > + * Drop obsolete slots > + * > + * Drop the slots which no longer need to be synced i.e. these either > + * do not exist on primary or are no longer part of synchronize_slot_names. > + * > + * Also drop the slots which are valid on primary and got invalidated > + * on standby due to conflict (say required rows removed on primary). > + * The assumption is, these will get recreated in next sync-cycle and > + * it is okay to drop and recreate such slots as long as these are not > + * consumable on standby (which is the case currently). > + */ > +static void > +drop_obsolete_slots(Oid *dbids, List *remote_slot_list) > > Those function header comments suggest these have a lot of overlapping > functionality. > > Can't those 2 functions be combined? Or maybe one delegate to the other? > == One is called by the launcher, and the other is called by the slotsync worker. While one prunes the list of dbs that needs to be passed to each slot-sync worker, the other prunes the list of slots each slot-sync worker handles in its dblist. Both are different. > ~~~ > > 23. > + ListCell *lc; > + Oid *dbids; > + int widx; > + int dbidx; > + int i; > > Scope of some of these variable declarations can be different so they > are declared closer to where they are used. > == fixed > ~~~ > > 24. > + /* If not found, then delete this db from worker's db-list */ > + if (!found) > + { > + for (i = dbidx; i < worker->dbcount; i++) > + { > + /* Shift the DBs and get rid of wdbid */ > + if (i < (worker->dbcount - 1)) > + dbids[i] = dbids[i + 1]; > + } > > IIUC, that shift/loop could just have been a memmove() call to remove > one Oid element. > == fixed > ~~~ > > 25. > + /* If dbcount for any worker has become 0, shut it down */ > + for (widx = 0; widx < max_slotsync_workers; widx++) > + { > + SlotSyncWorker *worker = &LogicalRepCtx->ss_workers[widx]; > + > + if (worker->hdr.in_use && !worker->dbcount) > + slotsync_worker_stop_internal(worker); > + } > > Is it safe to stop this unguarded by SlotSyncWorkerLock locking? Is > there a window where another dbid decides to reuse this worker at the > same time this process is about to stop it? > == Only the launcher can do this, and there is only one launcher. > ~~~ > > 26. primary_connect > > +/* > + * Connect to primary server for slotsync purpose and return the connection > + * info. Disconnect previous connection if provided in wrconn_prev. > + */ > > /primary server/the primary server/ > == fixed > ~~~ > > 27. 
> + if (!RecoveryInProgress()) > + return NULL; > + > + if (max_slotsync_workers == 0) > + return NULL; > + > + if (strcmp(synchronize_slot_names, "") == 0) > + return NULL; > + > + /* The primary_slot_name is not set */ > + if (!WalRcv || WalRcv->slotname[0] == '\0') > + { > + ereport(WARNING, > + errmsg("Skipping slots synchronization as primary_slot_name " > + "is not set.")); > + return NULL; > + } > + > + /* The hot_standby_feedback must be ON for slot-sync to work */ > + if (!hot_standby_feedback) > + { > + ereport(WARNING, > + errmsg("Skipping slots synchronization as hot_standby_feedback " > + "is off.")); > + return NULL; > + } > > How come some of these checks giving WARNING that slot synchronization > will be skipped, but others are just silently returning NULL? > == primary_slot_name and hot_standby_feedback are not GUCs exclusive to slot synchronization, they are previously existing - so warning only for them. The others are specific to slot synchronization, so if users set them (which shows that the user intends to use sync-slot), then warning to let the user know that these others also need to be set. > ~~~ > > 28. SaveCurrentSlotSyncConfigs > > +static void > +SaveCurrentSlotSyncConfigs() > +{ > + PrimaryConnInfoPreReload = pstrdup(PrimaryConnInfo); > + PrimarySlotNamePreReload = pstrdup(WalRcv->slotname); > + SyncSlotNamesPreReload = pstrdup(synchronize_slot_names); > +} > > Shouldn't this code also do pfree first? Otherwise these will slowly > leak every time this function is called, right? > == fixed > ~~~ > > 29. SlotSyncConfigsChanged > > +static bool > +SlotSyncConfigsChanged() > +{ > + if (strcmp(PrimaryConnInfoPreReload, PrimaryConnInfo) != 0) > + return true; > + > + if (strcmp(PrimarySlotNamePreReload, WalRcv->slotname) != 0) > + return true; > + > + if (strcmp(SyncSlotNamesPreReload, synchronize_slot_names) != 0) > + return true; > > I felt those can all be combined to have 1 return instead of 3. > == fixed > ~~~ > > 30. > + /* > + * If we have reached this stage, it means original value of > + * hot_standby_feedback was 'true', so consider it changed if 'false' now. > + */ > + if (!hot_standby_feedback) > + return true; > > "If we have reached this stage" seems a bit vague. Can this have some > more explanation? And, maybe also an Assert(hot_standby_feedback); is > helpful in the calling code (before the config is reloaded)? > == rewrote this without that comment. regards, Ajin Cherian Fujitsu Australia
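To illustrate the rule settled on for comment 3a above (when dbname appears more than once, the last occurrence is used), a standby configuration like the following would have the slot-sync worker connect to postgres rather than foo; the host and user values are placeholders:

    # postgresql.conf on the physical standby
    primary_conninfo = 'host=primary user=repl dbname=foo dbname=postgres'
    # the slot-sync worker connects to "postgres" (the last dbname);
    # physical streaming ignores the dbname entirely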
FYI - the latest patch failed to apply. [postgres@CentOS7-x64 oss_postgres_misc]$ git apply ../patches_misc/v24-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch error: patch failed: src/include/utils/guc_hooks.h:160 error: src/include/utils/guc_hooks.h: patch does not apply ====== Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Oct 17, 2023 at 12:44 PM Peter Smith <smithpb2250@gmail.com> wrote: > > FYI - the latest patch failed to apply. > > [postgres@CentOS7-x64 oss_postgres_misc]$ git apply > ../patches_misc/v24-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch > error: patch failed: src/include/utils/guc_hooks.h:160 > error: src/include/utils/guc_hooks.h: patch does not apply Rebased v24. PFA. thanks Shveta
Attachment
Hi, On 10/13/23 10:35 AM, shveta malik wrote: > On Thu, Oct 12, 2023 at 9:18 AM shveta malik <shveta.malik@gmail.com> wrote: >> > > PFA v24 patch set which has below changes: > > 1) 'enable_failover' displayed in pg_replication_slots. > 2) Support for 'enable_failover' in > pg_create_logical_replication_slot(). It is an optional argument with > default value false. > 3) Addressed pending comments (1-30) from Peter in [1]. > 4) Fixed an issue in patch002 due to which even slots with > enable_failover=false were getting synced. > > The changes for 1 and 2 are in patch001 while 3 and 4 are in patch0002 > > Thanks Ajin, for working on 1 and 3. Thanks for the hard work! + if (RecoveryInProgress()) + wrconn = slotsync_remote_connect(NULL); does produce at compilation time: launcher.c:1916:40: warning: too many arguments in call to 'slotsync_remote_connect' wrconn = slotsync_remote_connect(NULL); Looking at 0001: commit message: "is added at the slot level which will be persistent information" what about "which is persistent information" ? Code: + True if this logical slot is enabled to be synced to the physical standbys + so that logical replication is not blocked after failover. Always false + for physical slots. Not sure "not blocked" is the right wording. "can be resumed from the new primary" maybe? +static void +ProcessRepliesAndTimeOut(void) +{ + CHECK_FOR_INTERRUPTS(); + + /* Process any requests or signals received recently */ + if (ConfigReloadPending) + { + ConfigReloadPending = false; + ProcessConfigFile(PGC_SIGHUP); + SyncRepInitConfig(); + SlotSyncInitConfig(); + } Do we want to do this at each place ProcessRepliesAndTimeOut() is being called? I mean before this change it was not done in ProcessPendingWrites(). + * Wait for physical standby to confirm receiving give lsn. typo? s/give/given/ diff --git a/src/test/recovery/t/050_verify_slot_order.pl b/src/test/recovery/t/050_verify_slot_order.pl new file mode 100644 index 0000000000..25b3d5aac2 --- /dev/null +++ b/src/test/recovery/t/050_verify_slot_order.pl @@ -0,0 +1,145 @@ + +# Copyright (c) 2023, PostgreSQL Global Development Group + Regarding the TAP tests, should we also add some testing related to enable_failover being set in pg_create_logical_replication_slot() and pg_logical_slot_get_changes() behavior too? Please note that current comments are coming while "quickly" going through 0001. I'm planning to have a closer look at 0001 and 0002 too. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
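On the TAP-test suggestion, the behaviors such tests could pin down might look roughly like this at the SQL level. The slot name is a placeholder, the enable_failover argument name is assumed from the patch description, and the error text follows the snippet quoted earlier in the thread rather than a released server:

    -- on the primary: a failover-enabled slot created via SQL
    SELECT pg_create_logical_replication_slot('failover_slot', 'test_decoding',
                                               enable_failover => true);
    -- with standby_slot_names set, consuming this slot is expected to wait
    -- until the listed physical standbys have confirmed the WAL position
    SELECT * FROM pg_logical_slot_get_changes('failover_slot', NULL, NULL);

    -- on the standby: consuming a synced copy of the slot should fail until
    -- promotion, e.g. "cannot use replication slot ... for logical decoding"
    SELECT * FROM pg_logical_slot_get_changes('failover_slot', NULL, NULL);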
On Tue, Oct 17, 2023 at 9:06 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 10/13/23 10:35 AM, shveta malik wrote: > > On Thu, Oct 12, 2023 at 9:18 AM shveta malik <shveta.malik@gmail.com> wrote: > >> > > > > PFA v24 patch set which has below changes: > > > > 1) 'enable_failover' displayed in pg_replication_slots. > > 2) Support for 'enable_failover' in > > pg_create_logical_replication_slot(). It is an optional argument with > > default value false. > > 3) Addressed pending comments (1-30) from Peter in [1]. > > 4) Fixed an issue in patch002 due to which even slots with > > enable_failover=false were getting synced. > > > > The changes for 1 and 2 are in patch001 while 3 and 4 are in patch0002 > > > > Thanks Ajin, for working on 1 and 3. > > Thanks for the hard work! > Thanks for the feedback. I will try to address these in the next 1-2 versions. > + if (RecoveryInProgress()) > + wrconn = slotsync_remote_connect(NULL); > > does produce at compilation time: > > launcher.c:1916:40: warning: too many arguments in call to 'slotsync_remote_connect' > wrconn = slotsync_remote_connect(NULL); > > Looking at 0001: > > commit message: > > "is added at the slot level which > will be persistent information" > > what about "which is persistent information" ? > > Code: > > + True if this logical slot is enabled to be synced to the physical standbys > + so that logical replication is not blocked after failover. Always false > + for physical slots. > > Not sure "not blocked" is the right wording. "can be resumed from the new primary" maybe? > > +static void > +ProcessRepliesAndTimeOut(void) > +{ > + CHECK_FOR_INTERRUPTS(); > + > + /* Process any requests or signals received recently */ > + if (ConfigReloadPending) > + { > + ConfigReloadPending = false; > + ProcessConfigFile(PGC_SIGHUP); > + SyncRepInitConfig(); > + SlotSyncInitConfig(); > + } > > Do we want to do this at each place ProcessRepliesAndTimeOut() is being > called? I mean before this change it was not done in ProcessPendingWrites(). > Are you referring to ConfigReload stuff ? I see that even in ProcessPendingWrites(), we do it after WalSndWait(). Now only the order is changed, it is before WalSndWait() now. > + * Wait for physical standby to confirm receiving give lsn. > > typo? s/give/given/ > > > diff --git a/src/test/recovery/t/050_verify_slot_order.pl b/src/test/recovery/t/050_verify_slot_order.pl > new file mode 100644 > index 0000000000..25b3d5aac2 > --- /dev/null > +++ b/src/test/recovery/t/050_verify_slot_order.pl > @@ -0,0 +1,145 @@ > + > +# Copyright (c) 2023, PostgreSQL Global Development Group > + > > Regarding the TAP tests, should we also add some testing related to enable_failover being set > in pg_create_logical_replication_slot() and pg_logical_slot_get_changes() behavior too? > Sure, will do it. > Please note that current comments are coming while > "quickly" going through 0001. > > I'm planning to have a closer look at 0001 and 0002 too. > Yes, that will be really helpful. Thanks. thanks Shveta
On Tue, Oct 17, 2023 at 9:06 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 10/13/23 10:35 AM, shveta malik wrote: > > On Thu, Oct 12, 2023 at 9:18 AM shveta malik <shveta.malik@gmail.com> wrote: > >> > > Code: > > + True if this logical slot is enabled to be synced to the physical standbys > + so that logical replication is not blocked after failover. Always false > + for physical slots. > > Not sure "not blocked" is the right wording. "can be resumed from the new primary" maybe? > Yeah, your proposed wording sounds better. Also, I think we should document the impact of not doing so because I think the replication can continue after failover but it may lead to data inconsistency. BTW, I noticed that the code for Create Subscription is updated but not the corresponding docs. By looking at other parameters like password_required, streaming, two_phase where true or false indicates whether that option is enabled or not, I am thinking about whether enable_failover is an appropriate name for this option. The other option name that comes to mind is 'failover' where true indicates that the corresponding subscription will be enabled for failover. What do you think? -- With Regards, Amit Kapila.
On Wed, Oct 18, 2023 at 10:20 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Oct 17, 2023 at 9:06 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > On 10/13/23 10:35 AM, shveta malik wrote: > > > On Thu, Oct 12, 2023 at 9:18 AM shveta malik <shveta.malik@gmail.com> wrote: > > >> > > > > Code: > > > > + True if this logical slot is enabled to be synced to the physical standbys > > + so that logical replication is not blocked after failover. Always false > > + for physical slots. > > > > Not sure "not blocked" is the right wording. "can be resumed from the new primary" maybe? > > > > Yeah, your proposed wording sounds better. Also, I think we should > document the impact of not doing so because I think the replication > can continue after failover but it may lead to data inconsistency. > > BTW, I noticed that the code for Create Subscription is updated but > not the corresponding docs. By looking at other parameters like > password_required, streaming, two_phase where true or false indicates > whether that option is enabled or not, I am thinking about whether > enable_failover is an appropriate name for this option. The other > option name that comes to mind is 'failover' where true indicates that > the corresponding subscription will be enabled for failover. What do > you think? +1. 'failover' seems more in sync with other options' names. > -- > With Regards, > Amit Kapila.
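With the 'failover' naming settled on above, the user-facing calls sketched in the patch set would look roughly like this. Slot, subscription, publication and connection names are purely illustrative, and the position of the optional failover argument is as proposed in v24/v25, not a released interface:

    -- logical slot with the proposed optional failover flag (defaults to false)
    SELECT pg_create_logical_replication_slot('myslot', 'test_decoding', false, false, true);

    -- subscription that requests a failover-enabled slot on the publisher
    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=primary dbname=postgres'
        PUBLICATION mypub
        WITH (failover = true);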
On Tue, Oct 17, 2023 at 2:01 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Oct 17, 2023 at 12:44 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > FYI - the latest patch failed to apply. > > > > [postgres@CentOS7-x64 oss_postgres_misc]$ git apply > > ../patches_misc/v24-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch > > error: patch failed: src/include/utils/guc_hooks.h:160 > > error: src/include/utils/guc_hooks.h: patch does not apply > > Rebased v24. PFA. > Few comments: ============== 1. + List of physical replication slots that logical replication with failover + enabled waits for. /logical replication/logical replication slots 2. If + <varname>enable_syncslot</varname> is not enabled on the + corresponding standbys, then it may result in indefinite waiting + on the primary for physical replication slots configured in + <varname>standby_slot_names</varname> + </para> Why the above leads to indefinite wait? I think we should just ignore standby_slot_names and probably LOG a message in the server for the same. 3. +++ b/src/backend/replication/logical/tablesync.c @@ -1412,7 +1412,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos) */ walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ , false /* two_phase */ , - CRS_USE_SNAPSHOT, origin_startpos); + false /* enable_failover */ , CRS_USE_SNAPSHOT, + origin_startpos); As per this code, we won't enable failover for tablesync slots. So, what happens if we need to failover to new node after the tablesync worker has reached SUBREL_STATE_FINISHEDCOPY or SUBREL_STATE_DATASYNC? I think we won't be able to continue replication from failed over node. If this theory is correct, we have two options (a) enable failover for sync slots as well, if it is enabled for main slot; but then after we drop the slot on primary once sync is complete, same needs to be taken care at standby. (b) enable failover even for the main slot after all tables are in ready state, something similar to what we do for two_phase. 4. + /* Verify syntax */ + if (!validate_slot_names(newval, &elemlist)) + return false; + + /* Now verify if these really exist and have correct type */ + if (!validate_standby_slots(elemlist)) These two functions serve quite similar functionality which makes their naming quite confusing. Can we directly move the functionality of validate_slot_names() into validate_standby_slots()? 5. +SlotSyncInitConfig(void) +{ + char *rawname; + + /* Free the old one */ + list_free(standby_slot_names_list); + standby_slot_names_list = NIL; + + if (strcmp(standby_slot_names, "") != 0) + { + rawname = pstrdup(standby_slot_names); + SplitIdentifierString(rawname, ',', &standby_slot_names_list); How does this handle the case where '*' is specified for standby_slot_names? -- With Regards, Amit Kapila.
On Wed, Oct 18, 2023 at 4:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Oct 17, 2023 at 2:01 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Tue, Oct 17, 2023 at 12:44 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > FYI - the latest patch failed to apply. > > > > > > [postgres@CentOS7-x64 oss_postgres_misc]$ git apply > > > ../patches_misc/v24-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch > > > error: patch failed: src/include/utils/guc_hooks.h:160 > > > error: src/include/utils/guc_hooks.h: patch does not apply > > > > Rebased v24. PFA. > > > > Few comments: > ============== > 1. > + List of physical replication slots that logical replication > with failover > + enabled waits for. > > /logical replication/logical replication slots > > 2. > If > + <varname>enable_syncslot</varname> is not enabled on the > + corresponding standbys, then it may result in indefinite waiting > + on the primary for physical replication slots configured in > + <varname>standby_slot_names</varname> > + </para> > > Why the above leads to indefinite wait? I think we should just ignore > standby_slot_names and probably LOG a message in the server for the > same. > > 3. > +++ b/src/backend/replication/logical/tablesync.c > @@ -1412,7 +1412,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos) > */ > walrcv_create_slot(LogRepWorkerWalRcvConn, > slotname, false /* permanent */ , false /* two_phase */ , > - CRS_USE_SNAPSHOT, origin_startpos); > + false /* enable_failover */ , CRS_USE_SNAPSHOT, > + origin_startpos); > > As per this code, we won't enable failover for tablesync slots. So, > what happens if we need to failover to new node after the tablesync > worker has reached SUBREL_STATE_FINISHEDCOPY or SUBREL_STATE_DATASYNC? > I think we won't be able to continue replication from failed over > node. If this theory is correct, we have two options (a) enable > failover for sync slots as well, if it is enabled for main slot; but > then after we drop the slot on primary once sync is complete, same > needs to be taken care at standby. (b) enable failover even for the > main slot after all tables are in ready state, something similar to > what we do for two_phase. > > 4. > + /* Verify syntax */ > + if (!validate_slot_names(newval, &elemlist)) > + return false; > + > + /* Now verify if these really exist and have correct type */ > + if (!validate_standby_slots(elemlist)) > > These two functions serve quite similar functionality which makes > their naming quite confusing. Can we directly move the functionality > of validate_slot_names() into validate_standby_slots()? > > 5. > +SlotSyncInitConfig(void) > +{ > + char *rawname; > + > + /* Free the old one */ > + list_free(standby_slot_names_list); > + standby_slot_names_list = NIL; > + > + if (strcmp(standby_slot_names, "") != 0) > + { > + rawname = pstrdup(standby_slot_names); > + SplitIdentifierString(rawname, ',', &standby_slot_names_list); > > How does this handle the case where '*' is specified for standby_slot_names? > > > -- > With Regards, > Amit Kapila. PFA v25 patch set. The changes are: 1) 'enable_failover' is changed to 'failover' 2) Alter subscription changes to support 'failover' 3) Fixes a bug in patch001 wherein any change in standby_slot_names was not considered in the flow where logical walsenders wait for standby's confirmation. Now during the wait, if standby_slot_names is changed, wait is restarted using new standby_slot_names. 
4) Addresses comments by Bertrand and Amit in [1],[2],[3] The changes are mostly in patch001 and a very few in patch002. Thank You Ajin for working on alter-subscription changes and adding more TAP-tests for 'failover' [1]: https://www.postgresql.org/message-id/2742485f-4118-4fb4-9f94-8150de9e7d7e%40gmail.com [2]: https://www.postgresql.org/message-id/CAA4eK1JcBG6TJ3o5iUd4z0BuTbciLV3dK4aKgb7OgrNGoLcfSQ%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CAA4eK1J6BqO5%3DueFAQO%2BaYyHLaU-oCHrrVFJqHS-i0Ce9aPY2w%40mail.gmail.com thanks Shveta
Attachment
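As a rough usage sketch of items 1 and 2 above (v25's ALTER SUBSCRIPTION support plus the renamed option; subscription and slot names are illustrative, and the exact preconditions for changing the option may differ in later versions):

    ALTER SUBSCRIPTION mysub SET (failover = true);

    -- on the publisher, the subscription's slot should now report the flag
    -- (failover column as added by patch 0001)
    SELECT slot_name, failover FROM pg_replication_slots WHERE slot_name = 'mysub';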
On Wed, Oct 18, 2023 at 4:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Oct 17, 2023 at 2:01 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Tue, Oct 17, 2023 at 12:44 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > FYI - the latest patch failed to apply. > > > > > > [postgres@CentOS7-x64 oss_postgres_misc]$ git apply > > > ../patches_misc/v24-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch > > > error: patch failed: src/include/utils/guc_hooks.h:160 > > > error: src/include/utils/guc_hooks.h: patch does not apply > > > > Rebased v24. PFA. > > > > Few comments: > ============== > 1. > + List of physical replication slots that logical replication > with failover > + enabled waits for. > > /logical replication/logical replication slots > > 2. > If > + <varname>enable_syncslot</varname> is not enabled on the > + corresponding standbys, then it may result in indefinite waiting > + on the primary for physical replication slots configured in > + <varname>standby_slot_names</varname> > + </para> > > Why the above leads to indefinite wait? I think we should just ignore > standby_slot_names and probably LOG a message in the server for the > same. > Sorry for confusion. This info was wrong, I have corrected it. > 3. > +++ b/src/backend/replication/logical/tablesync.c > @@ -1412,7 +1412,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos) > */ > walrcv_create_slot(LogRepWorkerWalRcvConn, > slotname, false /* permanent */ , false /* two_phase */ , > - CRS_USE_SNAPSHOT, origin_startpos); > + false /* enable_failover */ , CRS_USE_SNAPSHOT, > + origin_startpos); > > As per this code, we won't enable failover for tablesync slots. So, > what happens if we need to failover to new node after the tablesync > worker has reached SUBREL_STATE_FINISHEDCOPY or SUBREL_STATE_DATASYNC? > I think we won't be able to continue replication from failed over > node. If this theory is correct, we have two options (a) enable > failover for sync slots as well, if it is enabled for main slot; but > then after we drop the slot on primary once sync is complete, same > needs to be taken care at standby. (b) enable failover even for the > main slot after all tables are in ready state, something similar to > what we do for two_phase. I have adopted approach a) right now. Table sync slot is created with subscription's failover option and will be dropped by standby if it dropped on primary. > > 4. > + /* Verify syntax */ > + if (!validate_slot_names(newval, &elemlist)) > + return false; > + > + /* Now verify if these really exist and have correct type */ > + if (!validate_standby_slots(elemlist)) > > These two functions serve quite similar functionality which makes > their naming quite confusing. Can we directly move the functionality > of validate_slot_names() into validate_standby_slots()? > > 5. > +SlotSyncInitConfig(void) > +{ > + char *rawname; > + > + /* Free the old one */ > + list_free(standby_slot_names_list); > + standby_slot_names_list = NIL; > + > + if (strcmp(standby_slot_names, "") != 0) > + { > + rawname = pstrdup(standby_slot_names); > + SplitIdentifierString(rawname, ',', &standby_slot_names_list); > > How does this handle the case where '*' is specified for standby_slot_names? > I have removed '*' related doc info in this patch and has introduced error if '*' is given for this GUC. The reason being, I do not see a way to figure out all physical standbys slot names on a primary to make '*' work. 
We have information about all the physical slots created on the primary, but it is not known which of them are actually being used by standbys. Thoughts? thanks Shveta
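To make the above concrete: with '*' rejected, the GUC has to name the standbys' slots explicitly, and the closest the primary can get to "which physical slots are in use" is whether a walsender is currently attached. A sketch, assuming the patch's standby_slot_names GUC and purely illustrative slot names:

    ALTER SYSTEM SET standby_slot_names = 'standby1_slot, standby2_slot';
    SELECT pg_reload_conf();

    -- physical slots that currently have a walsender attached (i.e. in use right now);
    -- slots used only intermittently would still not show up here
    SELECT s.slot_name, s.active_pid, r.application_name, r.state
    FROM pg_replication_slots s
    LEFT JOIN pg_stat_replication r ON r.pid = s.active_pid
    WHERE s.slot_type = 'physical';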
On Tue, Oct 17, 2023 at 9:06 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 10/13/23 10:35 AM, shveta malik wrote: > > On Thu, Oct 12, 2023 at 9:18 AM shveta malik <shveta.malik@gmail.com> wrote: > >> > > > > PFA v24 patch set which has below changes: > > > > 1) 'enable_failover' displayed in pg_replication_slots. > > 2) Support for 'enable_failover' in > > pg_create_logical_replication_slot(). It is an optional argument with > > default value false. > > 3) Addressed pending comments (1-30) from Peter in [1]. > > 4) Fixed an issue in patch002 due to which even slots with > > enable_failover=false were getting synced. > > > > The changes for 1 and 2 are in patch001 while 3 and 4 are in patch0002 > > > > Thanks Ajin, for working on 1 and 3. > > Thanks for the hard work! > > + if (RecoveryInProgress()) > + wrconn = slotsync_remote_connect(NULL); > > does produce at compilation time: > > launcher.c:1916:40: warning: too many arguments in call to 'slotsync_remote_connect' > wrconn = slotsync_remote_connect(NULL); > > Looking at 0001: > > commit message: > > "is added at the slot level which > will be persistent information" > > what about "which is persistent information" ? > > Code: > > + True if this logical slot is enabled to be synced to the physical standbys > + so that logical replication is not blocked after failover. Always false > + for physical slots. > > Not sure "not blocked" is the right wording. "can be resumed from the new primary" maybe? > > +static void > +ProcessRepliesAndTimeOut(void) > +{ > + CHECK_FOR_INTERRUPTS(); > + > + /* Process any requests or signals received recently */ > + if (ConfigReloadPending) > + { > + ConfigReloadPending = false; > + ProcessConfigFile(PGC_SIGHUP); > + SyncRepInitConfig(); > + SlotSyncInitConfig(); > + } > > Do we want to do this at each place ProcessRepliesAndTimeOut() is being > called? I mean before this change it was not done in ProcessPendingWrites(). > > + * Wait for physical standby to confirm receiving give lsn. > > typo? s/give/given/ > > > diff --git a/src/test/recovery/t/050_verify_slot_order.pl b/src/test/recovery/t/050_verify_slot_order.pl > new file mode 100644 > index 0000000000..25b3d5aac2 > --- /dev/null > +++ b/src/test/recovery/t/050_verify_slot_order.pl > @@ -0,0 +1,145 @@ > + > +# Copyright (c) 2023, PostgreSQL Global Development Group > + > > Regarding the TAP tests, should we also add some testing related to enable_failover being set > in pg_create_logical_replication_slot() and pg_logical_slot_get_changes() behavior too? > We have added some basic tests in v25. More detailed tests to be added in coming versions. > Please note that current comments are coming while > "quickly" going through 0001. > > I'm planning to have a closer look at 0001 and 0002 too. > > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
Hi, On 10/18/23 6:43 AM, shveta malik wrote: > On Tue, Oct 17, 2023 at 9:06 PM Drouvot, Bertrand >> +static void >> +ProcessRepliesAndTimeOut(void) >> +{ >> + CHECK_FOR_INTERRUPTS(); >> + >> + /* Process any requests or signals received recently */ >> + if (ConfigReloadPending) >> + { >> + ConfigReloadPending = false; >> + ProcessConfigFile(PGC_SIGHUP); >> + SyncRepInitConfig(); >> + SlotSyncInitConfig(); >> + } >> >> Do we want to do this at each place ProcessRepliesAndTimeOut() is being >> called? I mean before this change it was not done in ProcessPendingWrites(). >> > > Are you referring to ConfigReload stuff ? I see that even in > ProcessPendingWrites(), we do it after WalSndWait(). Now only the > order is changed, it is before WalSndWait() now. Yeah, and the CFI. With the patch, the CFI and the check on ConfigReloadPending are done in all cases, as the break (if !pq_is_send_pending()) now comes after them. That seems OK; I just wanted to mention it. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 10/20/23 5:27 AM, shveta malik wrote: > On Wed, Oct 18, 2023 at 4:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > PFA v25 patch set. The changes are: > > 1) 'enable_failover' is changed to 'failover' > 2) Alter subscription changes to support 'failover' > 3) Fixes a bug in patch001 wherein any change in standby_slot_names > was not considered in the flow where logical walsenders wait for > standby's confirmation. Now during the wait, if standby_slot_names is > changed, wait is restarted using new standby_slot_names. > 4) Addresses comments by Bertrand and Amit in [1],[2],[3] > > The changes are mostly in patch001 and a very few in patch002. > > Thank You Ajin for working on alter-subscription changes and adding > more TAP-tests for 'failover' > Thanks for updating the patch! Looking at 0001 and doing some experiment: Creating a logical slot with failover = true and then launching pg_logical_slot_get_changes() or pg_recvlogical() on it results to setting failover back to false. It occurs while creating the decoding context here: @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn, SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn); } + /* set failover in the slot, as requested */ + slot->data.failover = ctx->failover; + I think we can get rid of this change in CreateDecodingContext(). Looking at 0002: /* Enter main loop */ for (;;) { int rc; long wait_time = DEFAULT_NAPTIME_PER_CYCLE; CHECK_FOR_INTERRUPTS(); /* * If it is Hot standby, then try to launch slot-sync workers else * launch apply workers. */ if (RecoveryInProgress()) { /* Launch only if we have succesfully made the connection */ if (wrconn) LaunchSlotSyncWorkers(&wait_time, wrconn); } We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there is new synced slot(s) to be created on the standby. Do we want to keep this behavior for V1? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
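The behaviour described here can be reproduced with plain SQL, assuming the v25 signature where failover is the last optional argument (slot name and plugin are illustrative):

    SELECT pg_create_logical_replication_slot('regression_slot', 'test_decoding', false, false, true);

    SELECT failover FROM pg_replication_slots WHERE slot_name = 'regression_slot';  -- reports true

    -- consuming changes goes through CreateDecodingContext() ...
    SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL);

    -- ... after which, with v25 applied, the flag has been reset
    SELECT failover FROM pg_replication_slots WHERE slot_name = 'regression_slot';  -- reports false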
On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 10/20/23 5:27 AM, shveta malik wrote: > > On Wed, Oct 18, 2023 at 4:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > PFA v25 patch set. The changes are: > > > > 1) 'enable_failover' is changed to 'failover' > > 2) Alter subscription changes to support 'failover' > > 3) Fixes a bug in patch001 wherein any change in standby_slot_names > > was not considered in the flow where logical walsenders wait for > > standby's confirmation. Now during the wait, if standby_slot_names is > > changed, wait is restarted using new standby_slot_names. > > 4) Addresses comments by Bertrand and Amit in [1],[2],[3] > > > > The changes are mostly in patch001 and a very few in patch002. > > > > Thank You Ajin for working on alter-subscription changes and adding > > more TAP-tests for 'failover' > > > > Thanks for updating the patch! > > Looking at 0001 and doing some experiment: > > Creating a logical slot with failover = true and then launching > pg_logical_slot_get_changes() or pg_recvlogical() on it results > to setting failover back to false. > > It occurs while creating the decoding context here: > > @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn, > SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn); > } > > + /* set failover in the slot, as requested */ > + slot->data.failover = ctx->failover; > + > > I think we can get rid of this change in CreateDecodingContext(). > Thanks for pointing it out. I will correct it in the next patch. > Looking at 0002: > > /* Enter main loop */ > for (;;) > { > int rc; > long wait_time = DEFAULT_NAPTIME_PER_CYCLE; > > CHECK_FOR_INTERRUPTS(); > > /* > * If it is Hot standby, then try to launch slot-sync workers else > * launch apply workers. > */ > if (RecoveryInProgress()) > { > /* Launch only if we have succesfully made the connection */ > if (wrconn) > LaunchSlotSyncWorkers(&wait_time, wrconn); > } > > We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there > is new synced slot(s) to be created on the standby. Do we want to keep this behavior > for V1? > I think for the slotsync workers case, we should reduce the naptime in the launcher to say 30sec and retain the default one of 3mins for subscription apply workers. Thoughts? > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
On Friday, October 20, 2023 11:27 AM shveta malik <shveta.malik@gmail.com> wrote: > > The changes are mostly in patch001 and a very few in patch002. > > Thank You Ajin for working on alter-subscription changes and adding more > TAP-tests for 'failover' Thanks for updating the patch. Here are a few things I noticed when testing the patch. 1) +++ b/src/backend/replication/logical/logical.c @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn, SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn); } + /* set failover in the slot, as requested */ + slot->data.failover = ctx->failover; + I noticed others have also commented on this change. I found that it overwrites the slot's failover value with a wrong one in the case where we have acquired the slot and call CreateDecodingContext() after that (e.g. it can be a problem if a user calls pg_logical_slot_get_changes() for a logical slot). 2) WalSndWaitForStandbyConfirmation ... +void +WalSndWaitForStandbyConfirmation(XLogRecPtr wait_for_lsn) +{ + List *standby_slot_cpy; + + if (!MyReplicationSlot->data.failover) + return; + + standby_slot_cpy = list_copy(standby_slot_names_list); + The standby list could be uninitialized when this is called from a non-walsender backend (e.g. via pg_logical_slot_get_changes()), so we need to call SlotSyncInitConfig() somewhere in this case. Best Regards, Hou zj
On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 10/20/23 5:27 AM, shveta malik wrote: > > On Wed, Oct 18, 2023 at 4:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > PFA v25 patch set. The changes are: > > > > 1) 'enable_failover' is changed to 'failover' > > 2) Alter subscription changes to support 'failover' > > 3) Fixes a bug in patch001 wherein any change in standby_slot_names > > was not considered in the flow where logical walsenders wait for > > standby's confirmation. Now during the wait, if standby_slot_names is > > changed, wait is restarted using new standby_slot_names. > > 4) Addresses comments by Bertrand and Amit in [1],[2],[3] > > > > The changes are mostly in patch001 and a very few in patch002. > > > > Thank You Ajin for working on alter-subscription changes and adding > > more TAP-tests for 'failover' > > > > Thanks for updating the patch! > > Looking at 0001 and doing some experiment: > > Creating a logical slot with failover = true and then launching > pg_logical_slot_get_changes() or pg_recvlogical() on it results > to setting failover back to false. > > It occurs while creating the decoding context here: > > @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn, > SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn); > } > > + /* set failover in the slot, as requested */ > + slot->data.failover = ctx->failover; > + > > I think we can get rid of this change in CreateDecodingContext(). > Yes, I too noticed this in my testing, however just removing this from CreateDecodingContext will not allow us to change the slot's failover flag using Alter subscription. Currently alter subscription re-establishes the connection using START REPLICATION and failover is one of the options passed in along with START REPLICATION. I am thinking of moving this change to StartLogicalReplication prior to calling CreateDecodingContext by parsing the command options in StartReplicationCmd without adding it to the LogicalDecodingContext. regards, Ajin Cherian Fujitsu Australia
Hi, On 10/23/23 2:56 PM, shveta malik wrote: > On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there >> is new synced slot(s) to be created on the standby. Do we want to keep this behavior >> for V1? >> > > I think for the slotsync workers case, we should reduce the naptime in > the launcher to say 30sec and retain the default one of 3mins for > subscription apply workers. Thoughts? > Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new API on the standby that would refresh the list of sync slot at wish, thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Here are some review comments for v24-0001 ====== 1. GENERAL - failover slots terminology There is inconsistent terminology, such as below. Try to use the same wording everywhere. - failover logical slots - failover slots - logical failover slots - logical replication failover slots - etc. These are in many places - comments, function names, constants etc. ~~~ 2. GENERAL - THE s/primary.../the primary.../ s/standby.../the standby.../ Missing "the" problems remain in multiple places in the patch. ~~~ 3. GENERAL - messages I searched all the ereports and elogs (the full list is below only for reference). There are many little quirks: 3a. Sometimes messages say "primary"; sometimes "primary server" etc. Be consistent. 3b. /primary/the primary/ 3c. Sometimes messages include errcode and sometimes they do not; Are they deliberate or are there missing errcodes? 3d. At least one message has unwanted trailing space 3e. Sometimes using errcode and/or errmsg enclosed in parentheses; sometimes not. AFAIK it is not necessary anymore. 3f. Inconsistent terminology "slot" V "failover slots" V "failover logical slots" etc mentioned in the previous review comment #1 3g. Sometimes messages "slot creation aborted"; Sometimes "aborting slot creation". Be consistent. 3h. s/lsn/LSN/ 3i. s/move it backward/move it backwards/ 3j. Sometimes LOG message starts uppercase; Sometimes lowercase. Be consistent. 3k. typo: s/and and/and/ 3l. "worker %d" V "worker%d" ~ Messages: ereport(ERROR, (errmsg("could not receive failover slots dbinfo from the primary server: %s", pchomp(PQerrorMessage(conn->streamConn))))); ereport(ERROR, (errmsg("invalid response from primary server"), errdetail("Could not get failover slots dbinfo: got %d fields, " "expected 1", nfields))); ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), errmsg("invalid connection string syntax: %s", errcopy))); ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("replication slot-sync worker slot %d is " "empty, cannot attach", slot))); ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("replication slot-sync worker slot %d is " "already used by another worker, cannot attach", slot))); ereport(ERROR, (errmsg("could not connect to the primary server: %s", err))); ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("cannot use replication slot \"%s\" for logical decoding", NameStr(slot->data.name)), errdetail("This slot is being synced from the primary."), errhint("Specify another replication slot."))); ereport(ERROR, (errmsg("could not fetch slot info for slot \"%s\" from" " the primary: %s", remote_slot->name, res->err))); ereport(ERROR, (errmsg("could not fetch slot info for slot \"%s\" from" " the primary: %s", remote_slot->name, res->err))); ereport(ERROR, (errmsg("could not fetch invalidation cause for slot \"%s\" from" " primary: %s", slot_name, res->err))); ereport(ERROR, (errmsg("slot \"%s\" disappeared from the primary", slot_name))); ereport(ERROR, (errmsg("could not fetch failover logical slots info from the primary: %s", res->err))); ereport(ERROR, (errmsg("could not connect to the primary server: %s", err))); ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("could not map dynamic shared memory " "segment for slot-sync worker"))); ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("cannot drop replication slot \"%s\"", name), errdetail("This slot is being synced from the primary."))); ereport(ERROR, (errmsg("could not receive failover 
slots dbinfo from the primary server: %s", pchomp(PQerrorMessage(conn->streamConn))))); ereport(ERROR, (errmsg("invalid response from primary server"), errdetail("Could not get failover slots dbinfo: got %d fields, " "expected 1", nfields))); ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), errmsg("invalid connection string syntax: %s", errcopy))); ereport(WARNING, (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED), errmsg("out of background worker slots"), errhint("You might need to increase %s.", "max_worker_processes"))); ereport(WARNING, (errmsg("replication slot-sync worker failed to attach to " "worker-pool slot %d", worker_slot))); ereport(WARNING, errmsg("skipping slots synchronization as primary_slot_name " "is not set.")); ereport(WARNING, errmsg("skipping slots synchronization as hot_standby_feedback " "is off.")); ereport(WARNING, errmsg("skipping slots synchronization as dbname is not " "specified in primary_conninfo.")); ereport(WARNING, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("slot-sync wait for slot %s interrupted by promotion, " "slot creation aborted", remote_slot->name))); ereport(WARNING, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("slot-sync wait for slot %s interrupted by promotion, " "slot creation aborted", remote_slot->name))); ereport(WARNING, (errmsg("slot \"%s\" disappeared from the primary, aborting" " slot creation", remote_slot->name))); ereport(WARNING, (errmsg("slot \"%s\" invalidated on primary, aborting" " slot creation", remote_slot->name))); ereport(WARNING, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("slot-sync for slot \"%s\" interrupted by promotion, " "sync not possible", remote_slot->name))); ereport(WARNING, errmsg("skipping sync of slot \"%s\" as the received slot-sync " "lsn %X/%X is ahead of the standby position %X/%X", remote_slot->name, LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), LSN_FORMAT_ARGS(WalRcv->latestWalEnd))); ereport(WARNING, errmsg("not synchronizing slot %s; synchronization would move" " it backward", remote_slot->name)); ereport(LOG, (errmsg("Dropped replication slot \"%s\" ", NameStr(local_slot->data.name)))); ereport(LOG, (errmsg("Added database %d to replication slot-sync " "worker %d; dbcount now: %d", dbid, worker_slot, worker->dbcount))); ereport(LOG, (errmsg("Added database %d to replication slot-sync " "worker %d; dbcount now: %d", dbid, worker_slot, worker->dbcount))); ereport(LOG, (errmsg("Stopping replication slot-sync worker %d", slot))); ereport(LOG, (errmsg("removed database %d from replication slot-sync " "worker %d; dbcount now: %d", wdbid, worker->slot, worker->dbcount))); ereport(LOG, errmsg("waiting for remote slot \"%s\" LSN (%X/%X) and catalog xmin" " (%u) to pass local slot LSN (%X/%X) and and catalog xmin (%u)", remote_slot->name, LSN_FORMAT_ARGS(remote_slot->restart_lsn), remote_slot->catalog_xmin, LSN_FORMAT_ARGS(MyReplicationSlot->data.restart_lsn), MyReplicationSlot->data.catalog_xmin)); ereport(LOG, errmsg("wait over for remote slot \"%s\" as its LSN (%X/%X)" " and catalog xmin (%u) has now passed local slot LSN" " (%X/%X) and catalog xmin (%u)", remote_slot->name, LSN_FORMAT_ARGS(new_restart_lsn), new_catalog_xmin, LSN_FORMAT_ARGS(MyReplicationSlot->data.restart_lsn), MyReplicationSlot->data.catalog_xmin)); ereport(LOG, errmsg("Replication slot-sync worker %d is shutting" " down on receiving SIGINT", MySlotSyncWorker->slot)); ereport(LOG, errmsg("Replication slot-sync worker %d started", worker_slot)); elog(DEBUG1, "allocated dsa for slot-sync worker for 
dbcount: %d", DB_PER_WORKER_ALLOC_INIT); elog(DEBUG1, "logical replication launcher started"); elog(DEBUG2, "slot-sync worker%d's query:%s \n", MySlotSyncWorker->slot, s.data); ~~~ 4. GENERAL - SlotSyncWorker loops When iterating slot-sync workers the code sometimes looks like + for (int i = 0; i < max_slotsync_workers; i++) + { + SlotSyncWorker *w = &LogicalRepCtx->ss_workers[i]; and other times it looks like + for (int widx = 0; widx < max_slotsync_workers; widx++) + { + SlotSyncWorker *worker = &LogicalRepCtx->ss_workers[widx]; etc. It would be better if such loops would use the same loop variable and SlotSyncWorker variable names; consistency will make the code easier to read. ====== Commit message 5. GUC 'enable_syncslot' enables a physical_satndby to synchronize logical replication failover slots from the primary server. s/physical_satndby/physical standby/ ## I think this one is already fixed in the latest v25. ~~~ 6. The logical slots created by slot-sync workers on physical standbys are not allowed to be consumed and dropped. Any attempt to perform logical decoding on such slots will result in an error. ~ SUGGESTION The logical slots created by slot-sync workers on physical standbys are not allowed to be dropped or consumed. Any attempt to perform logical decoding on such slots will result in an error. ====== doc/src/sgml/config.sgml 7. + <para> + Specify dbname in <varname>primary_conninfo</varname> string + to allow synchronization of slots from the primary to standby. + This will only be used for slot synchronization. It is ignored + for streaming. </para> Maybe better to use <literal> for dbname. ~~~ 8. + </varlistentry> + + </variablelist> Extra blank link not needed. ====== .../libpqwalreceiver/libpqwalreceiver.c 9. libpqrcv_get_dbname_from_conninfo + for (opt = opts; opt->keyword != NULL; ++opt) + { + /* If multiple dbnames are used, then the last one will be returned */ s/are used/are specified/ ====== src/backend/replication/logical/launcher.c 10. slotsync_worker_launch_or_reuse + MemoryContext oldcontext; + uint32 alloc_count = 0; + uint32 old_dbcnt = 0; + Oid *old_dbids = NULL; No need to assign these in the declaration, because they get unconditionally assigned before they are inspected anyhow. ~~~ 11. + /* Prepare the new worker. */ + worker->hdr.launch_time = GetCurrentTimestamp(); + worker->hdr.in_use = true; + + /* + * 'proc' and 'slot' will be assigned in ReplSlotSyncWorkerMain when we + * attach this worker to a particular worker-pool slot + */ + worker->hdr.proc = NULL; + worker->slot = -1; + + /* TODO: do we really need 'generation', analyse more here */ + worker->hdr.generation++; + + /* Initial DSA setup for dbids array to hold DB_PER_WORKER_ALLOC_INIT dbs */ + handle = slotsync_dsa_setup(worker); It is confusing for some of the worker members to be initialized here and other worker members (like `dbcount`) to be initialized within the function slotsync_dsa_setup(). It might be better if all the field initialization can be kept together -- e.g. combined in a new function 'slotsync_worker_setup()'. ~~~ 12. 
+ /* Check if current DB is still present in remote-db-list */ + foreach(lc, remote_dbs) + { + WalRcvFailoverSlotsData *failover_slot_data = lfirst(lc); + + if (failover_slot_data->dboid == wdbid) + { + found = true; + break; + } + } + + /* If not found, then delete this db from worker's db-list */ + if (!found) + { + if (dbidx < (worker->dbcount - 1)) + { + /* Shift the DBs and get rid of wdbid */ + memmove(&dbids[dbidx], &dbids[dbidx + 1], + (worker->dbcount - dbidx - 1) * sizeof(Oid)); + } + + worker->dbcount--; + + ereport(LOG, + (errmsg("removed database %d from replication slot-sync " + "worker %d; dbcount now: %d", + wdbid, worker->slot, worker->dbcount))); + } + + /* Else move to next db-position */ + else + { + dbidx++; + } This code might be simpler if you just remove the whole "Else move..." part and instead just increment the `dbidx` at the same time you set found = true;s/ For example, if (failover_slot_data->dboid == wdbid) { /* advance worker to next db-position */ found = true; dbidxid++; break; } ~~~ 13. slotsync_remote_connect +/* + * Connect to the primary server for slotsync purpose and return the connection + * info. + */ +static WalReceiverConn * +slotsync_remote_connect() +{ + WalReceiverConn *wrconn = NULL; + char *err; + char *dbname; No need to assign NULL there. It will be overwritten before it is used. ~~~ 14. Ajins's previous explanation ([1] #27) of why some of the checks have warnings and some do not was helpful; IMO this should be written as a comment in this function. + /* The primary_slot_name is not set */ + if (!WalRcv || WalRcv->slotname[0] == '\0') + { + ereport(WARNING, + errmsg("skipping slots synchronization as primary_slot_name " + "is not set.")); + return NULL; + } + + /* The hot_standby_feedback must be ON for slot-sync to work */ + if (!hot_standby_feedback) + { + ereport(WARNING, + errmsg("skipping slots synchronization as hot_standby_feedback " + "is off.")); + return NULL; + } + + /* The dbname must be specified in primary_conninfo for slot-sync to work */ + dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); + if (dbname == NULL) + { + ereport(WARNING, + errmsg("skipping slots synchronization as dbname is not " + "specified in primary_conninfo.")); + return NULL; + } Add a new comment above all those: SUGGESTION /* * Check that other GUC settings (primary_slot_name, hot_standby_feedback, primary_conninfo) * are compatible with slot synchronization. */ ~~~ 15. slotsync_configs_changed +static bool +slotsync_configs_changed() +{ + if ((EnableSyncSlotPreReload != enable_syncslot) || + (HotStandbyFeedbackPreReload != hot_standby_feedback) || + (strcmp(PrimaryConnInfoPreReload, PrimaryConnInfo) != 0) || + (strcmp(PrimarySlotNamePreReload, WalRcv->slotname) != 0)) + { + return true; + } + + return false; +} Might as well write this as a single return. Also, IMO it is more natural to write as "if the <now_value> is different to <prev_value>" instead of the other way around For example: return (enable_syncslot != EnableSyncSlotPreReload) || (hot_standby_feedback != HotStandbyFeedbackPreReload) || (strcmp(PrimaryConnInfo, PrimaryConnInfoPreReload) != 0) || (strcmp(WalRcv->slotname,PrimarySlotNamePreReload) != 0); ~~~ 16. 
slotsync_configs_changed + foreach(lc, slots_dbs) + { + WalRcvFailoverSlotsData *failover_slot_data = lfirst(lc); + SlotSyncWorker *w; + + Assert(OidIsValid(failover_slot_data->dboid)); + + LWLockAcquire(SlotSyncWorkerLock, LW_SHARED); + w = slotsync_worker_find(failover_slot_data->dboid); + LWLockRelease(SlotSyncWorkerLock); + + if (w != NULL) + continue; /* worker is running already */ + + /* + * If we failed to launch this slotsync worker, return and try + * launching the failed and remaining workers in next sync-cycle. But + * change launcher's wait time to minimum of + * wal_retrieve_retry_interval and default wait time to try next + * sync-cycle sooner. + */ + if (!slotsync_worker_launch_or_reuse(failover_slot_data->dboid)) + { + *wait_time = Min(*wait_time, wal_retrieve_retry_interval); + break; + } + } Nit: IMO when the variable scope is small (when you can easily see the declaration and every usage in a few lines) having such long descriptive makes the code *less* instead of more readable. SUGGESTION s/failover_slot_data/slot_data/ OR s/failover_slot_data/sdata/ ====== src/backend/replication/logical/slotsync.c 17. + * This file contains the code for slot-sync workers on physical standby + * to fetch logical failover slots information from the primary server, + * create the slots on the standby and synchronize them periodically. s/on physical standby/on the physical standby/ ~~~ 18. slot_exists_in_list + if (strcmp(remote_slot->name, NameStr(local_slot->data.name)) == 0) + { + /* + * if remote slot is marked as non-conflicting (i.e. not + * invalidated) but local slot is marked as invalidated, then set + * the bool. + */ + if (!remote_slot->conflicting && + local_slot->data.invalidated != RS_INVAL_NONE) + *locally_invalidated = true; + + return true; + } Isn't it better to *always* set that 'locally_invalidated' flag for a found slot? Otherwise, you are assuming that the flag value was initially false, but maybe it was not. SUGGESTION /* * Is the remote slot is marked as non-conflicting (i.e. not * invalidated) when the local slot is marked as invalidated? */ *locally_invalidated = !remote_slot->conflicting && (local_slot->data.invalidated != RS_INVAL_NONE); ~~ 19. get_remote_invalidation_cause + if (res->status != WALRCV_OK_TUPLES) + ereport(ERROR, + (errmsg("could not fetch invalidation cause for slot \"%s\" from" + " primary: %s", slot_name, res->err))); (already mentioned in general review comment) s/from primary/from the primary/ ~~~ 20. +/* + * Drop obsolete slots + * + * Drop the slots that no longer need to be synced i.e. these either + * do not exist on primary or are no longer enabled as failover slots. (??) s/enabled as failover slots/designated as failover slots/ OR s/enabled as failover slots/enabled for failover ~~~ 21. construct_slot_query +static void +construct_slot_query(StringInfo s, Oid *dbids) +{ + Assert(LWLockHeldByMeInMode(SlotSyncWorkerLock, LW_SHARED)); + + appendStringInfo(s, + "SELECT slot_name, plugin, confirmed_flush_lsn," + " restart_lsn, catalog_xmin, two_phase, conflicting, " + " database FROM pg_catalog.pg_replication_slots" + " WHERE enable_failover=true and database IN "); /WHERE enable_failover=true and database IN/WHERE enable_failover AND database IN/ ### I noticed the code is a tiny bit different in v25, but the review comment is still relevant. ~~~ 22. synchronize_slots +/* + * Synchronize slots. 
+ * + * It gets the failover logical slots info from the primary server for the dbids + * managed by this worker and then updates the slots locally as per the info + * received. It creates the slots if not present on the standby. + * + * It returns nap time for the next sync-cycle. + */ Comment can be re-worded to not say "it" everywhere. ====== src/backend/replication/walsender.c 23. + /* + * Check if the database OID is already in the list, and if so, skip + * this slot. + */ + if (list_member_oid(database_oids_list, dboid)) + continue; Simplify the comment SUGGESTION Skip this slot if the database OID is already in the list. ====== src/backend/utils/activity/wait_event_names.txt 24. +REPL_SLOTSYNC_MAIN "Waiting in main loop of slot-sync worker." +REPL_SLOTSYNC_PRIMARY_CATCHUP "Waiting for primary to catch-up, in slot-sync worker." (this was already mentioned in the general review comment) s/primary/the primary/ ====== src/include/postmaster/bgworker_internals.h 25. #define MAX_PARALLEL_WORKER_LIMIT 1024 +#define MAX_SLOTSYNC_WORKER_LIMIT 50 This constant seems to be not used anywhere except in guc_tables.c where the GUC is defined. IMO you should make use of this in some Assert or a message; Otherwise, might as well just remove it and hardwire the 50 in the guc_tables.c directly. ====== src/include/replication/walreceiver.h 26. WalRcvFailoverSlotsData +/* + * Failover logical slots dbids received from remote. + */ +typedef struct WalRcvFailoverSlotsData +{ + Oid dboid; +} WalRcvFailoverSlotsData; + For now, the only data is `dbids` but maybe one day there will be more stuff, so make the struct comment more generic. SUGGESTION Failover logical slots data received from remote. ====== src/include/replication/worker_internal.h 27. LogicalRepWorkerType + +typedef struct LogicalRepWorker +{ + LogicalWorkerHeader hdr; + + /* What type of worker is this? */ + LogicalRepWorkerType type; + Maybe add some struct-level comments for this. ====== [1] https://www.postgresql.org/message-id/CAFPTHDaqn%2Bm47_vkAToQD6Pe8diut0F0g0bSr8PdcuW6cbSSkQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
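For reference, the query built by construct_slot_query() (review comment #21 above), adjusted for the v25 column name and written in the suggested boolean style, would come out roughly as follows; the database list is whatever the worker's dbids expand to, shown here with illustrative names:

    SELECT slot_name, plugin, confirmed_flush_lsn,
           restart_lsn, catalog_xmin, two_phase, conflicting, database
    FROM pg_catalog.pg_replication_slots
    WHERE failover AND database IN ('postgres', 'appdb');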
Here are some review comments for patch v25-0002 (additional to v25-0002 review comments [1]) ====== src/backend/catalog/system_views.sql 1. @@ -1003,7 +1003,8 @@ CREATE VIEW pg_replication_slots AS L.safe_wal_size, L.two_phase, L.conflicting, - L.failover + L.failover, + L.synced_slot FROM pg_get_replication_slots() AS L LEFT JOIN pg_database D ON (L.datoid = D.oid); AFAICT the patch is missing PG DOCS descriptions for these new view attributes. ====== src/backend/replication/logical/launcher.c 2. slotsync_remove_obsolete_dbs + + /* + * TODO: Take care of of removal of old 'synced' slots for the dbs which + * are no longer eligible for slot-sync. + */ typo: "of of" ~~~ 3. + /* + * Make sure that concerned WAL is received before syncing slot to target + * lsn received from the primary. + * + * This check should never pass as on the primary, we have waited for + * standby's confirmation before updating the logical slot. But to take + * care of any bug in that flow, we should retain this check. + */ + if (remote_slot->confirmed_lsn > WalRcv->latestWalEnd) + { + ereport(LOG, + errmsg_internal("skipping sync of slot \"%s\" as the received slot-sync " + "lsn %X/%X is ahead of the standby position %X/%X", + remote_slot->name, + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), + LSN_FORMAT_ARGS(WalRcv->latestWalEnd))); + return; + } Would elog be better here than using ereport(LOG, errmsg_internal...); IIUC it does the same thing? ====== [1] https://www.postgresql.org/message-id/CAHut%2BPspseC03Fhsi%3DOqOtksagspE%2B0MVOhrhhUb64cc_4SE1w%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
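A small sketch of how the new view columns from 0001/0002 could be checked on the standby once the PG DOCS descriptions are added (column names as in the patches, not a released view definition):

    -- on the standby: slots that the slot-sync workers have created or updated
    SELECT slot_name, database, failover, synced_slot, confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE synced_slot;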
Hi, On 10/24/23 7:44 AM, Ajin Cherian wrote: > On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn, >> SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn); >> } >> >> + /* set failover in the slot, as requested */ >> + slot->data.failover = ctx->failover; >> + >> >> I think we can get rid of this change in CreateDecodingContext(). >> > Yes, I too noticed this in my testing, however just removing this from > CreateDecodingContext will not allow us to change the slot's failover flag > using Alter subscription. Oh right. > I am thinking of moving this change to > StartLogicalReplication prior to calling CreateDecodingContext by > parsing the command options in StartReplicationCmd > without adding it to the LogicalDecodingContext. > Yeah, that looks like a good place to update "failover". Doing more testing and I have a couple of remarks about he current behavior. 1) Let's imagine that: - there is no standby - standby_slot_names is set to a valid slot on the primary (but due to the above, not linked to any standby) - then a create subscription on a subscriber WITH (failover = true) would start the synchronisation but never finish (means leaving a "synchronisation" slot like "pg_32811_sync_24576_7293415241672430356" in place coming from ReplicationSlotNameForTablesync()). That's expected, but maybe we should emit a warning in WalSndWaitForStandbyConfirmation() on the primary when there is a slot part of standby_slot_names which is not active/does not have an active_pid attached to it? 2) When we create a subscription, another slot is created during the subscription synchronization, namely like "pg_16397_sync_16388_7293447291374081805" (coming from ReplicationSlotNameForTablesync()). This extra slot appears to have failover also set to true. So, If the standby refresh the list of slot to sync when the subscription is still synchronizing we'd see things like on the standby: LOG: waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C0034840) and andcatalog xmin (756) LOG: wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog xmin (756) has now passed local slot LSN (0/C0034840)and catalog xmin (756) LOG: waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN (0/C0034808) and catalog xmin (756) to pass localslot LSN (0/C00368E8) and and catalog xmin (756) WARNING: slot "pg_16397_sync_16388_7293447291374081805" disappeared from the primary, aborting slot creation I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have failover set to true. If there is a failover during the subscription creation, better to re-launch the subscription instead? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
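Regarding remark 1), until such a warning exists, the situation can at least be spotted from SQL on the primary; a sketch assuming the patch's standby_slot_names GUC:

    -- slots listed in standby_slot_names that currently have no walsender attached,
    -- i.e. the ones that would make failover-enabled walsenders wait indefinitely
    SELECT slot_name, active, restart_lsn
    FROM pg_replication_slots
    WHERE slot_name = ANY (string_to_array(
              replace(current_setting('standby_slot_names'), ' ', ''), ','))
      AND NOT active;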
On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 10/24/23 7:44 AM, Ajin Cherian wrote: > > On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn, > >> SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn); > >> } > >> > >> + /* set failover in the slot, as requested */ > >> + slot->data.failover = ctx->failover; > >> + > >> > >> I think we can get rid of this change in CreateDecodingContext(). > >> > > Yes, I too noticed this in my testing, however just removing this from > > CreateDecodingContext will not allow us to change the slot's failover flag > > using Alter subscription. > > Oh right. > > > I am thinking of moving this change to > > StartLogicalReplication prior to calling CreateDecodingContext by > > parsing the command options in StartReplicationCmd > > without adding it to the LogicalDecodingContext. > > > > Yeah, that looks like a good place to update "failover". > > Doing more testing and I have a couple of remarks about he current behavior. > > 1) Let's imagine that: > > - there is no standby > - standby_slot_names is set to a valid slot on the primary (but due to the above, not linked to any standby) > - then a create subscription on a subscriber WITH (failover = true) would start the > synchronisation but never finish (means leaving a "synchronisation" slot like "pg_32811_sync_24576_7293415241672430356" > in place coming from ReplicationSlotNameForTablesync()). > > That's expected, but maybe we should emit a warning in WalSndWaitForStandbyConfirmation() on the primary when there is > a slot part of standby_slot_names which is not active/does not have an active_pid attached to it? > Agreed, Will do that. > 2) When we create a subscription, another slot is created during the subscription synchronization, namely > like "pg_16397_sync_16388_7293447291374081805" (coming from ReplicationSlotNameForTablesync()). > > This extra slot appears to have failover also set to true. > > So, If the standby refresh the list of slot to sync when the subscription is still synchronizing we'd see things like > on the standby: > > LOG: waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C0034840) and andcatalog xmin (756) > LOG: wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog xmin (756) has now passed local slot LSN (0/C0034840)and catalog xmin (756) > LOG: waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN (0/C0034808) and catalog xmin (756) to passlocal slot LSN (0/C00368E8) and and catalog xmin (756) > WARNING: slot "pg_16397_sync_16388_7293447291374081805" disappeared from the primary, aborting slot creation > > I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have failover set to true. If there is a failover > during the subscription creation, better to re-launch the subscription instead? > 'Failover' property of subscription is carried to the tablesync-slot in recent versions only with the intent that if failover happens during create-sub during table-sync of large tables, then users should be able to start from that point onward on the new primary. But yes, the above scenario is highly probable where-in no activity is happening on primary and thus the table-sync slot is waiting for its creation during sync on standby. 
So I agree: to simplify things, we can skip table-sync slot syncing on the standby and document this behaviour. thanks Shveta
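For anyone checking this on a running setup: the tablesync slots in question follow the pg_<suboid>_sync_<reloid>_<sysid> naming visible in the log excerpts above, so their current failover setting on the primary can be inspected with something like this (failover column as added by 0001):

    SELECT slot_name, failover
    FROM pg_replication_slots
    WHERE slot_name LIKE 'pg\_%\_sync\_%';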
On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 10/23/23 2:56 PM, shveta malik wrote: > > On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > >> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there > >> is new synced slot(s) to be created on the standby. Do we want to keep this behavior > >> for V1? > >> > > > > I think for the slotsync workers case, we should reduce the naptime in > > the launcher to say 30sec and retain the default one of 3mins for > > subscription apply workers. Thoughts? > > > > Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new > API on the standby that would refresh the list of sync slot at wish, thoughts? > Do you mean API to refresh list of DBIDs rather than sync-slots? As per current design, launcher gets DBID lists for all the failover slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. These dbids are then distributed among max slot-sync workers and then they fetch slots for the concerned DBIDs at regular intervals of 10ms (WORKER_DEFAULT_NAPTIME_MS) and create/update those locally. thanks Shveta
On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 10/24/23 7:44 AM, Ajin Cherian wrote: > > On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > 2) When we create a subscription, another slot is created during the subscription synchronization, namely > like "pg_16397_sync_16388_7293447291374081805" (coming from ReplicationSlotNameForTablesync()). > > This extra slot appears to have failover also set to true. > > So, If the standby refresh the list of slot to sync when the subscription is still synchronizing we'd see things like > on the standby: > > LOG: waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C0034840) and andcatalog xmin (756) > LOG: wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog xmin (756) has now passed local slot LSN (0/C0034840)and catalog xmin (756) > LOG: waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN (0/C0034808) and catalog xmin (756) to passlocal slot LSN (0/C00368E8) and and catalog xmin (756) > WARNING: slot "pg_16397_sync_16388_7293447291374081805" disappeared from the primary, aborting slot creation > > I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have failover set to true. If there is a failover > during the subscription creation, better to re-launch the subscription instead? > But note that the subscription doesn't wait for the completion of tablesync. So, how will we deal with that? Also, this situation is the same for non-tablesync slots as well. I have given another option in the email [1] which is to enable failover even for the main slot after all tables are in ready state, something similar to what we do for two_phase. [1] - https://www.postgresql.org/message-id/CAA4eK1J6BqO5%3DueFAQO%2BaYyHLaU-oCHrrVFJqHS-i0Ce9aPY2w%40mail.gmail.com -- With Regards, Amit Kapila.
Hi, On 10/25/23 5:00 AM, shveta malik wrote: > On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> Hi, >> >> On 10/23/23 2:56 PM, shveta malik wrote: >>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand >>> <bertranddrouvot.pg@gmail.com> wrote: >> >>>> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there >>>> is new synced slot(s) to be created on the standby. Do we want to keep this behavior >>>> for V1? >>>> >>> >>> I think for the slotsync workers case, we should reduce the naptime in >>> the launcher to say 30sec and retain the default one of 3mins for >>> subscription apply workers. Thoughts? >>> >> >> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new >> API on the standby that would refresh the list of sync slot at wish, thoughts? >> > > Do you mean API to refresh list of DBIDs rather than sync-slots? > As per current design, launcher gets DBID lists for all the failover > slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. I mean an API to get a newly created slot on the primary being created/synced on the standby at wish. Also let's imagine this scenario: - create logical_slot1 on the primary (and don't start using it) Then on the standby we'll get things like: 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot "logical_slot1" LSN (0/C00316A0) and catalog xmin (752)to pass local slot LSN (0/C0049530) and and catalog xmin (754) That's expected and due to the fact that ReplicationSlotReserveWal() does set the slot restart_lsn to a value < at the corresponding restart_lsn slot on the primary. - create logical_slot2 on the primary (and start using it) Then logical_slot2 won't be created/synced on the standby until there is activity on logical_slot1 on the primary that would produce things like: 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot "logical_slot1" as its LSN (0/C005FFD8) and catalogxmin (756) has now passed local slot LSN (0/C0049530) and catalog xmin (754) With this new dedicated API, it will be: - clear that the API call is "hanging" until there is some activity on the newly created slot (currently there is "waiting for remote slot " message in the logfile as mentioned above but I'm not sure that's enough) - be possible to create/sync logical_slot2 in the example above without waiting for activity on logical_slot1. Maybe we should change our current algorithm during slot creation so that a newly created inactive slot on the primary does not block other newly created "active" slots on the primary to be created on the standby? Depending on how we implement that, the new API may not be needed at all. Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
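The scenario above is easy to stage on the primary (plugin and the position of the failover argument are illustrative, per the earlier sketches):

    -- logical_slot1: created with failover enabled but never consumed
    SELECT pg_create_logical_replication_slot('logical_slot1', 'test_decoding', false, false, true);

    -- logical_slot2: created the same way and then actively consumed
    SELECT pg_create_logical_replication_slot('logical_slot2', 'test_decoding', false, false, true);
    SELECT count(*) FROM pg_logical_slot_get_changes('logical_slot2', NULL, NULL);

    -- logical_slot1's restart_lsn and catalog_xmin stay behind; per the observation
    -- above, that is what the standby's sync worker ends up waiting on, which also
    -- holds up creating/syncing logical_slot2 on the standby
    SELECT slot_name, restart_lsn, catalog_xmin, confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE slot_name IN ('logical_slot1', 'logical_slot2');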
Hi, On 10/25/23 6:57 AM, Amit Kapila wrote: > On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 10/24/23 7:44 AM, Ajin Cherian wrote: >>> On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand >>> <bertranddrouvot.pg@gmail.com> wrote: >> >> 2) When we create a subscription, another slot is created during the subscription synchronization, namely >> like "pg_16397_sync_16388_7293447291374081805" (coming from ReplicationSlotNameForTablesync()). >> >> This extra slot appears to have failover also set to true. >> >> So, If the standby refresh the list of slot to sync when the subscription is still synchronizing we'd see things like >> on the standby: >> >> LOG: waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C0034840) and and catalog xmin (756) >> LOG: wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756) >> LOG: waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and and catalog xmin (756) >> WARNING: slot "pg_16397_sync_16388_7293447291374081805" disappeared from the primary, aborting slot creation >> >> I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have failover set to true. If there is a failover >> during the subscription creation, better to re-launch the subscription instead? >> > > But note that the subscription doesn't wait for the completion of > tablesync. Right. > So, how will we deal with that? Also, this situation is the > same for non-tablesync slots as well. I have given another option in > the email [1] which is to enable failover even for the main slot after > all tables are in ready state, something similar to what we do for > two_phase. Oh right, that looks like a better option (enable failover even for the main slot after all tables are in ready state). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
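The "enable failover for the main slot only after all tables are in ready state" approach discussed above mirrors the existing two_phase handling, where the apply worker relies on roughly the check done by AllTablesyncsReady() before flipping the pending state. As a rough illustration only (not the patch's code), the same condition can be inspected from SQL; 'mysub' is a placeholder subscription name:

    -- Returns true once every table of subscription 'mysub' has reached the
    -- 'r' (ready) state, i.e. table synchronization has finished.
    SELECT NOT EXISTS (
             SELECT 1
             FROM   pg_subscription_rel sr
             JOIN   pg_subscription s ON s.oid = sr.srsubid
             WHERE  s.subname = 'mysub'
             AND    sr.srsubstate <> 'r'
           ) AS all_tables_ready;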
Hi, On 10/9/23 12:30 PM, shveta malik wrote: > PFA v22 patch-set. It has below changes: > > patch 001: > 1) Now physical walsender wakes up logical walsender(s) by using a new > CV as suggested in [1] Thanks! I think that works fine as long as the standby is up and running and catching up. The problem I see with the current WalSndWaitForStandbyConfirmation() implementation is that if the standby is not running then: + for (;;) + { + ListCell *l; + long sleeptime = -1; will loop until we reach the "terminating walsender process due to replication timeout" if we explicitly want to end with SIGINT or friends. For example a scenario like: - standby down - pg_recvlogical running then CTRL-C on pg_recvlogical would not "respond" immediately but when we reach the replication timeout. So it seems that we should use something like WalSndWait() instead of ConditionVariableTimedSleep() here: + /* + * Sleep until other physical walsenders awaken us or until a timeout + * occurs. + */ + sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp()); + + ConditionVariableTimedSleep(&WalSndCtl->wal_confirm_rcv_cv, sleeptime, + WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION); In that case I think that WalSndWait() should take care of the new CV WalSndCtl->wal_confirm_rcv_cv too. The wait on the socket should allow us to stop waiting when, for example, CTRL-C on pg_recvlogical is triggered. Then we would need to deal with this scenario: Standby down or not catching up and exited WalSndWait() due to the socket to break the loop or shutdown the walsender. Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
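To make the suggestion above concrete, here is a minimal sketch (not the patch itself) of a wait loop built around WalSndWait(), so that the new condition variable, the client socket, and the timeout can all end the wait. WalSndCtl->wal_confirm_rcv_cv and WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION are the patch's additions; StandbyConfirmedFlush() and target_lsn are placeholders standing in for "all standbys listed in standby_slot_names have confirmed this LSN":

    /*
     * Sketch only: a ConditionVariableBroadcast() on the new CV sets our
     * latch, and WalSndWait() waits on the latch, the socket, and the timeout.
     */
    ConditionVariablePrepareToSleep(&WalSndCtl->wal_confirm_rcv_cv);

    for (;;)
    {
        long        sleeptime;

        /* Clear any already-pending wakeups, then honour interrupts. */
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();

        if (StandbyConfirmedFlush(target_lsn))      /* placeholder helper */
            break;

        /* Stop waiting if the client has already ended the copy stream. */
        if (streamingDoneReceiving && streamingDoneSending)
            break;

        sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());

        /*
         * Unlike ConditionVariableTimedSleep(), WalSndWait() also watches
         * the walsender socket, so e.g. a CTRL-C on pg_recvlogical wakes us
         * up promptly instead of only at wal_sender_timeout.
         */
        WalSndWait(WL_SOCKET_READABLE, sleeptime,
                   WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION);
    }

    ConditionVariableCancelSleep();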
Dear Shveta, > PFA v25 patch set. The changes are: Thanks for making the patch! It seems that there are lots of comments already, so let me put some high-level comments for 0001. Sorry if there are duplicated comments. 1. The patch does not seem to consider the case where the failover option of the replication slot and that of the subscription differ. Currently the slot's option will simply be overwritten by the subscription's one. Actually, I'm not sure which specification is better. Regarding two_phase, 2PC will be decoded only when both of the settings are true. Should we follow the same rule? 2. Currently ctx->failover is set only in pgoutput_startup(), and I'm not sure that is OK. Can we set the parameter in CreateDecodingContext() or a similar function instead? Because IIUC the current placement means that only slots which use pgoutput can wait. Other output plugins must understand the change and set the failover flag as well - I feel that is not good. E.g., one might forget to enable the parameter in test_decoding. Regarding the two_phase parameter, setting it at the plugin layer is good because it strongly affects the output. As for failover, it is not related to the content, so all slots should be able to enable it. I think CreateDecodingContext() or StartupDecodingContext() is the common path. Or is this out of scope for now? Best Regards, Hayato Kuroda FUJITSU LIMITED
On Wed, Oct 25, 2023 at 8:49 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 10/9/23 12:30 PM, shveta malik wrote: > > PFA v22 patch-set. It has below changes: > > > > patch 001: > > 1) Now physical walsender wakes up logical walsender(s) by using a new > > CV as suggested in [1] > > Thanks! > > I think that works fine as long as the standby is up and running and catching up. > > The problem I see with the current WalSndWaitForStandbyConfirmation() implementation > is that if the standby is not running then: > > + for (;;) > + { > + ListCell *l; > + long sleeptime = -1; > > will loop until we reach the "terminating walsender process due to replication timeout" if we > explicitly want to end with SIGINT or friends. > > For example a scenario like: > > - standby down > - pg_recvlogical running > > then CTRL-C on pg_recvlogical would not "respond" immediately but when we reach the replication timeout. > > So it seems that we should use something like WalSndWait() instead of ConditionVariableTimedSleep() here: > > + /* > + * Sleep until other physical walsenders awaken us or until a timeout > + * occurs. > + */ > + sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp()); > + > + ConditionVariableTimedSleep(&WalSndCtl->wal_confirm_rcv_cv, sleeptime, > + WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION); > > In that case I think that WalSndWait() should take care of the new CV WalSndCtl->wal_confirm_rcv_cv too. > The wait on the socket should allow us to stop waiting when, for example, CTRL-C on pg_recvlogical is triggered. > > Then we would need to deal with this scenario: Standby down or not catching up and exited WalSndWait() due to the socket > to break the loop or shutdown the walsender. > > Thoughts? > Good point, I think we should enhance the WalSndWait() logic to address this case. Additionally, I think we should ensure that WalSndWaitForWal() shouldn't wait twice once for wal_flush and a second time for wal to be replayed by physical standby. It should be okay to just wait for Wal to be replayed by physical standby when applicable, otherwise, just wait for Wal to flush as we are doing now. Does that make sense? -- With Regards, Amit Kapila.
Hi, On 10/26/23 10:40 AM, Amit Kapila wrote: > On Wed, Oct 25, 2023 at 8:49 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> > > Good point, I think we should enhance the WalSndWait() logic to > address this case. Agree. I think it would need to take care of the new CV and probably provide a way for the caller to detect it stopped waiting due to the socket (I don't think it can find out currently). > Additionally, I think we should ensure that > WalSndWaitForWal() shouldn't wait twice once for wal_flush and a > second time for wal to be replayed by physical standby. It should be > okay to just wait for Wal to be replayed by physical standby when > applicable, otherwise, just wait for Wal to flush as we are doing now. > Does that make sense? Yeah, I think so. What about moving WalSndWaitForStandbyConfirmation() outside of WalSndWaitForWal() and call one or the other in logical_read_xlog_page()? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Oct 26, 2023 at 5:38 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 10/26/23 10:40 AM, Amit Kapila wrote: > > On Wed, Oct 25, 2023 at 8:49 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > > > > Good point, I think we should enhance the WalSndWait() logic to > > address this case. > > Agree. I think it would need to take care of the new CV and probably > provide a way for the caller to detect it stopped waiting due to the socket > (I don't think it can find out currently). > > > Additionally, I think we should ensure that > > WalSndWaitForWal() shouldn't wait twice once for wal_flush and a > > second time for wal to be replayed by physical standby. It should be > > okay to just wait for Wal to be replayed by physical standby when > > applicable, otherwise, just wait for Wal to flush as we are doing now. > > Does that make sense? > > Yeah, I think so. What about moving WalSndWaitForStandbyConfirmation() > outside of WalSndWaitForWal() and call one or the other in logical_read_xlog_page()? > I think we need to somehow integrate the logic of both functions. Let us see what the patch author has to say about this. -- With Regards, Amit Kapila.
On Thu, Oct 26, 2023 at 12:38 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > PFA v25 patch set. The changes are: > > Thanks for making the patch! It seems that there are lots of comments, so > I can put some high-level comments for 0001. > Sorry if there are duplicated comments. > > 1. > The patch seemed not to consider the case that failover option between replication > slot and subscription were different. Currently slot option will be overwritten > by subscription one. > > Actually, I'm not sure what specification is better. Regarding the two_phase, > 2PC will be decoded only when the both of settings are true. Should we follow? > > 2. > Currently ctx->failover is set only in the pgoutput_startup(), but not sure it is OK. > Can we change the parameter in CreateDecodingContext() or similar functions? > > Because IIUC it means that only slots which have pgoutput can wait. Other > output plugins must understand the change and set faliover flag as well - > I felt it is not good. E.g., you might miss to enable the parameter in test_decoding. > > Regarding the two_phase parameter, setting on plugin layer is good because it > quite affects the output. As for the failover, it is not related with the > content so that all of slots should be enabled. > Both of your points seem valid to me. However, I think they should be addressed once we make option 'failover' behave similar to the '2PC' option as per discussion [1]. [1] - https://www.postgresql.org/message-id/b099ebc2-68fd-4c08-87ce-65fc4cb24121%40gmail.com -- With Regards, Amit Kapila.
On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 10/25/23 5:00 AM, shveta malik wrote: > > On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> Hi, > >> > >> On 10/23/23 2:56 PM, shveta malik wrote: > >>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand > >>> <bertranddrouvot.pg@gmail.com> wrote: > >> > >>>> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there > >>>> is new synced slot(s) to be created on the standby. Do we want to keep this behavior > >>>> for V1? > >>>> > >>> > >>> I think for the slotsync workers case, we should reduce the naptime in > >>> the launcher to say 30sec and retain the default one of 3mins for > >>> subscription apply workers. Thoughts? > >>> > >> > >> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new > >> API on the standby that would refresh the list of sync slot at wish, thoughts? > >> > > > > Do you mean API to refresh list of DBIDs rather than sync-slots? > > As per current design, launcher gets DBID lists for all the failover > > slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. > > I mean an API to get a newly created slot on the primary being created/synced on > the standby at wish. > > Also let's imagine this scenario: > > - create logical_slot1 on the primary (and don't start using it) > > Then on the standby we'll get things like: > > 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot "logical_slot1" LSN (0/C00316A0) and catalog xmin (752)to pass local slot LSN (0/C0049530) and and catalog xmin (754) > > That's expected and due to the fact that ReplicationSlotReserveWal() does set the slot > restart_lsn to a value < at the corresponding restart_lsn slot on the primary. > > - create logical_slot2 on the primary (and start using it) > > Then logical_slot2 won't be created/synced on the standby until there is activity on logical_slot1 on the primary > that would produce things like: > > 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot "logical_slot1" as its LSN (0/C005FFD8) and catalogxmin (756) has now passed local slot LSN (0/C0049530) and catalog xmin (754) > > With this new dedicated API, it will be: > > - clear that the API call is "hanging" until there is some activity on the newly created slot > (currently there is "waiting for remote slot " message in the logfile as mentioned above but > I'm not sure that's enough) > I think even if we provide such an API, we need to have logic to get the slots from the primary and create them. Say, even if the user used the APIs, there may still be some new slots that the sync worker needs to create. I think it might be better to provide a view for users to view the current state of sync. For example, in the above case, we can say "waiting for the primary to advance remote LSN" or something like that. -- With Regards, Amit Kapila.
On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 10/25/23 5:00 AM, shveta malik wrote: > > On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> Hi, > >> > >> On 10/23/23 2:56 PM, shveta malik wrote: > >>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand > >>> <bertranddrouvot.pg@gmail.com> wrote: > >> > >>>> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there > >>>> is new synced slot(s) to be created on the standby. Do we want to keep this behavior > >>>> for V1? > >>>> > >>> > >>> I think for the slotsync workers case, we should reduce the naptime in > >>> the launcher to say 30sec and retain the default one of 3mins for > >>> subscription apply workers. Thoughts? > >>> > >> > >> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new > >> API on the standby that would refresh the list of sync slot at wish, thoughts? > >> > > > > Do you mean API to refresh list of DBIDs rather than sync-slots? > > As per current design, launcher gets DBID lists for all the failover > > slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. > > I mean an API to get a newly created slot on the primary being created/synced on > the standby at wish. > > Also let's imagine this scenario: > > - create logical_slot1 on the primary (and don't start using it) > > Then on the standby we'll get things like: > > 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot "logical_slot1" LSN (0/C00316A0) and catalog xmin (752)to pass local slot LSN (0/C0049530) and and catalog xmin (754) > > That's expected and due to the fact that ReplicationSlotReserveWal() does set the slot > restart_lsn to a value < at the corresponding restart_lsn slot on the primary. > > - create logical_slot2 on the primary (and start using it) > > Then logical_slot2 won't be created/synced on the standby until there is activity on logical_slot1 on the primary > that would produce things like: > > 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot "logical_slot1" as its LSN (0/C005FFD8) and catalogxmin (756) has now passed local slot LSN (0/C0049530) and catalog xmin (754) > > With this new dedicated API, it will be: > > - clear that the API call is "hanging" until there is some activity on the newly created slot > (currently there is "waiting for remote slot " message in the logfile as mentioned above but > I'm not sure that's enough) > > - be possible to create/sync logical_slot2 in the example above without waiting for activity > on logical_slot1. > > Maybe we should change our current algorithm during slot creation so that a newly created inactive > slot on the primary does not block other newly created "active" slots on the primary to be created > on the standby? Depending on how we implement that, the new API may not be needed at all. > > Thoughts? > I discussed this with my colleague Hou-San and we think that one possibility could be to somehow accelerate the increment of restart_lsn on primary. This can be achieved by connecting to the remote and executing pg_log_standby_snapshot() at reasonable intervals while waiting on standby during slot creation. This may increase speed to a reasonable extent w/o having to wait for the user or bgwriter to do the same for us. The current logical decoding uses a similar approach to speed up the slot creation. I refer to usage of LogStandbySnapshot in SnapBuildWaitSnapshot() and ReplicationSlotReserveWal()). 
Thoughts? thanks Shveta
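The pg_log_standby_snapshot() idea above is easy to try by hand while the standby keeps logging "waiting for remote slot ...": running the function on the primary writes a running-xacts (standby snapshot) record, which is what lets an otherwise idle logical slot make progress. pg_log_standby_snapshot() is an existing function (PostgreSQL 16 and later); the shell loop is only a crude stand-in for the proposed automatic call at intervals:

    -- On the primary: emit a standby-snapshot WAL record so idle logical
    -- slots can advance without any user activity.
    SELECT pg_log_standby_snapshot();

    -- Stand-in for calling it "at reasonable intervals":
    --   while true; do psql -c 'SELECT pg_log_standby_snapshot();'; sleep 10; done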
On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 10/25/23 5:00 AM, shveta malik wrote: > > On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> Hi, > >> > >> On 10/23/23 2:56 PM, shveta malik wrote: > >>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand > >>> <bertranddrouvot.pg@gmail.com> wrote: > >> > >>>> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there > >>>> is new synced slot(s) to be created on the standby. Do we want to keep this behavior > >>>> for V1? > >>>> > >>> > >>> I think for the slotsync workers case, we should reduce the naptime in > >>> the launcher to say 30sec and retain the default one of 3mins for > >>> subscription apply workers. Thoughts? > >>> > >> > >> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new > >> API on the standby that would refresh the list of sync slot at wish, thoughts? > >> > > > > Do you mean API to refresh list of DBIDs rather than sync-slots? > > As per current design, launcher gets DBID lists for all the failover > > slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. > > I mean an API to get a newly created slot on the primary being created/synced on > the standby at wish. > > Also let's imagine this scenario: > > - create logical_slot1 on the primary (and don't start using it) > > Then on the standby we'll get things like: > > 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot "logical_slot1" LSN (0/C00316A0) and catalog xmin (752)to pass local slot LSN (0/C0049530) and and catalog xmin (754) > > That's expected and due to the fact that ReplicationSlotReserveWal() does set the slot > restart_lsn to a value < at the corresponding restart_lsn slot on the primary. > > - create logical_slot2 on the primary (and start using it) > > Then logical_slot2 won't be created/synced on the standby until there is activity on logical_slot1 on the primary > that would produce things like: > 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot "logical_slot1" as its LSN (0/C005FFD8) and catalogxmin (756) has now passed local slot LSN (0/C0049530) and catalog xmin (754) Slight correction to above. As soon as we start activity on logical_slot2, it will impact all the slots on primary, as the WALs are consumed by all the slots. So even if there is activity on logical_slot2, logical_slot1 creation on standby will be unblocked and it will then move to logical_slot2 creation. eg: --on standby: 2023-10-27 15:15:46.069 IST [696884] LOG: waiting for remote slot "mysubnew1_1" LSN (0/3C97970) and catalog xmin (756) to pass local slot LSN (0/3C979A8) and and catalog xmin (756) on primary: newdb1=# select now(); now ---------------------------------- 2023-10-27 15:15:51.504835+05:30 (1 row) --activity on mysubnew1_3 newdb1=# insert into tab1_3 values(1); INSERT 0 1 newdb1=# select now(); now ---------------------------------- 2023-10-27 15:15:54.651406+05:30 --on standby, mysubnew1_1 is unblocked. 2023-10-27 15:15:56.223 IST [696884] LOG: wait over for remote slot "mysubnew1_1" as its LSN (0/3C97A18) and catalog xmin (757) has now passed local slot LSN (0/3C979A8) and catalog xmin (756) My Setup: mysubnew1_1 -->mypubnew1_1 -->tab1_1 mysubnew1_3 -->mypubnew1_3-->tab1_3 thanks Shveta
On Fri, Oct 27, 2023 at 3:26 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On 10/25/23 5:00 AM, shveta malik wrote: > > > On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand > > > <bertranddrouvot.pg@gmail.com> wrote: > > >> > > >> Hi, > > >> > > >> On 10/23/23 2:56 PM, shveta malik wrote: > > >>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand > > >>> <bertranddrouvot.pg@gmail.com> wrote: > > >> > > >>>> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there > > >>>> is new synced slot(s) to be created on the standby. Do we want to keep this behavior > > >>>> for V1? > > >>>> > > >>> > > >>> I think for the slotsync workers case, we should reduce the naptime in > > >>> the launcher to say 30sec and retain the default one of 3mins for > > >>> subscription apply workers. Thoughts? > > >>> > > >> > > >> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new > > >> API on the standby that would refresh the list of sync slot at wish, thoughts? > > >> > > > > > > Do you mean API to refresh list of DBIDs rather than sync-slots? > > > As per current design, launcher gets DBID lists for all the failover > > > slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. > > > > I mean an API to get a newly created slot on the primary being created/synced on > > the standby at wish. > > > > Also let's imagine this scenario: > > > > - create logical_slot1 on the primary (and don't start using it) > > > > Then on the standby we'll get things like: > > > > 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot "logical_slot1" LSN (0/C00316A0) and catalog xmin(752) to pass local slot LSN (0/C0049530) and and catalog xmin (754) > > > > That's expected and due to the fact that ReplicationSlotReserveWal() does set the slot > > restart_lsn to a value < at the corresponding restart_lsn slot on the primary. > > > > - create logical_slot2 on the primary (and start using it) > > > > Then logical_slot2 won't be created/synced on the standby until there is activity on logical_slot1 on the primary > > that would produce things like: > > 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot "logical_slot1" as its LSN (0/C005FFD8) and catalogxmin (756) has now passed local slot LSN (0/C0049530) and catalog xmin (754) > > > Slight correction to above. As soon as we start activity on > logical_slot2, it will impact all the slots on primary, as the WALs > are consumed by all the slots. So even if there is activity on > logical_slot2, logical_slot1 creation on standby will be unblocked and > it will then move to logical_slot2 creation. eg: > > --on standby: > 2023-10-27 15:15:46.069 IST [696884] LOG: waiting for remote slot > "mysubnew1_1" LSN (0/3C97970) and catalog xmin (756) to pass local > slot LSN (0/3C979A8) and and catalog xmin (756) > > on primary: > newdb1=# select now(); > now > ---------------------------------- > 2023-10-27 15:15:51.504835+05:30 > (1 row) > > --activity on mysubnew1_3 > newdb1=# insert into tab1_3 values(1); > INSERT 0 1 > newdb1=# select now(); > now > ---------------------------------- > 2023-10-27 15:15:54.651406+05:30 > > > --on standby, mysubnew1_1 is unblocked. 
> 2023-10-27 15:15:56.223 IST [696884] LOG: wait over for remote slot > "mysubnew1_1" as its LSN (0/3C97A18) and catalog xmin (757) has now > passed local slot LSN (0/3C979A8) and catalog xmin (756) > > My Setup: > mysubnew1_1 -->mypubnew1_1 -->tab1_1 > mysubnew1_3 -->mypubnew1_3-->tab1_3 > > thanks > Shveta PFA v26 patches. The changes are: 1) 'Failover' in the main slot is now set when the table synchronization phase is finished. So even when failover is enabled for a subscription, the internal failover state remains temporarily “pending” until the initialization phase completes. 2) If the standby is down, but standby_slot_names has that slot name, we emit a warning now while waiting for that standby. 3) Fixed bug where pg_logical_slot_get_changes was resetting failover property of slot. Thanks Ajin for providing the fix. 4) Fixed bug where standby_slot_names_list was not initialized for non-walsender cases making pg_logical_slot_get_changes() to proceed w/o waiting for standbys. 5) Fixed a bug where standby_slot_names_list was freed (due to free of per_query context in non-walsender cases) but was not nullified and thus next call was using this freed pointer and was crashing. 6) Improved wait_for_primary_slot_catchup(), we now fetch remote-conflicting(invalidation) too and abort the wait and slot creation if the slot on primary is invalidated. 7) Slot-sync workers now wait for cascading standby's confirmation before updating logical synced slots on first standby. First 5 changes are in patch001, 6th one is in patch002. For 7th, I have created a new patch (003) to separate out the additional changes needed for cascading standbys. ========== Open questions regarding change for pt 1 above: a) I think we should restrict the 'alter-sub set failover' when failover-state is currently in 'p' (pending) state i.e. table-sync is going over. Once table-sync is over, then toggle of 'failover' should be allowed using alter-subscription. b) Currently I have restricted 'alter subscription.. refresh publication with copy=true' when failover=true (on a similar line of two-phase). The reason being, refresh with copy=true will go for table-sync again and since failover was set in main-slot after table-sync was done, it will need going through the same transition of 'p' to 'e' for main slot making it unsyncable for that time. Should it be allowed? Currently: newdb1=# ALTER SUBSCRIPTION mysubnew1_1 REFRESH PUBLICATION WITH (copy_data=true); ERROR: ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when failover is enabled HINT: Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false, or use DROP/CREATE SUBSCRIPTION. Thoughts on above queries? thanks Shveta
Attachment
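The 'p' (pending) to 'e' (enabled) failover-state transition described in the v26 notes above is modeled on the existing two_phase state machine. For reference, the two_phase analogue is already visible in the catalog; the failover counterpart's catalog representation is whatever the patch adds, so only the column shown below exists today:

    -- The two_phase precedent: 'd' = disabled, 'p' = pending (tablesync still
    -- running), 'e' = enabled once all tables have reached the ready state.
    SELECT subname, subtwophasestate
    FROM   pg_subscription;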
Hi, On 10/27/23 11:56 AM, shveta malik wrote: > On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> Hi, >> >> On 10/25/23 5:00 AM, shveta malik wrote: >>> On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand >>> <bertranddrouvot.pg@gmail.com> wrote: >>>> >>>> Hi, >>>> >>>> On 10/23/23 2:56 PM, shveta malik wrote: >>>>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand >>>>> <bertranddrouvot.pg@gmail.com> wrote: >>>> >>>>>> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there >>>>>> is new synced slot(s) to be created on the standby. Do we want to keep this behavior >>>>>> for V1? >>>>>> >>>>> >>>>> I think for the slotsync workers case, we should reduce the naptime in >>>>> the launcher to say 30sec and retain the default one of 3mins for >>>>> subscription apply workers. Thoughts? >>>>> >>>> >>>> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new >>>> API on the standby that would refresh the list of sync slot at wish, thoughts? >>>> >>> >>> Do you mean API to refresh list of DBIDs rather than sync-slots? >>> As per current design, launcher gets DBID lists for all the failover >>> slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. >> >> I mean an API to get a newly created slot on the primary being created/synced on >> the standby at wish. >> >> Also let's imagine this scenario: >> >> - create logical_slot1 on the primary (and don't start using it) >> >> Then on the standby we'll get things like: >> >> 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot "logical_slot1" LSN (0/C00316A0) and catalog xmin (752)to pass local slot LSN (0/C0049530) and and catalog xmin (754) >> >> That's expected and due to the fact that ReplicationSlotReserveWal() does set the slot >> restart_lsn to a value < at the corresponding restart_lsn slot on the primary. >> >> - create logical_slot2 on the primary (and start using it) >> >> Then logical_slot2 won't be created/synced on the standby until there is activity on logical_slot1 on the primary >> that would produce things like: >> 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot "logical_slot1" as its LSN (0/C005FFD8) and catalogxmin (756) has now passed local slot LSN (0/C0049530) and catalog xmin (754) > > > Slight correction to above. As soon as we start activity on > logical_slot2, it will impact all the slots on primary, as the WALs > are consumed by all the slots. So even if there is activity on > logical_slot2, logical_slot1 creation on standby will be unblocked and > it will then move to logical_slot2 creation. eg: > > --on standby: > 2023-10-27 15:15:46.069 IST [696884] LOG: waiting for remote slot > "mysubnew1_1" LSN (0/3C97970) and catalog xmin (756) to pass local > slot LSN (0/3C979A8) and and catalog xmin (756) > > on primary: > newdb1=# select now(); > now > ---------------------------------- > 2023-10-27 15:15:51.504835+05:30 > (1 row) > > --activity on mysubnew1_3 > newdb1=# insert into tab1_3 values(1); > INSERT 0 1 > newdb1=# select now(); > now > ---------------------------------- > 2023-10-27 15:15:54.651406+05:30 > > > --on standby, mysubnew1_1 is unblocked. 
> 2023-10-27 15:15:56.223 IST [696884] LOG: wait over for remote slot > "mysubnew1_1" as its LSN (0/3C97A18) and catalog xmin (757) has now > passed local slot LSN (0/3C979A8) and catalog xmin (756) > > My Setup: > mysubnew1_1 -->mypubnew1_1 -->tab1_1 > mysubnew1_3 -->mypubnew1_3-->tab1_3 > Agree with your test case, but in my case I was not using pub/sub. I was not clear, so when I said: >> - create logical_slot1 on the primary (and don't start using it) I meant don't start decoding from it (like using pg_recvlogical() or pg_logical_slot_get_changes()). By using pub/sub the "don't start using it" is not satisfied. My test case is: " SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 'test_decoding', false, true, true); SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 'test_decoding', false, true, true); pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f - " Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 10/27/23 10:51 AM, shveta malik wrote: > On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > I discussed this with my colleague Hou-San and we think that one > possibility could be to somehow accelerate the increment of > restart_lsn on primary. This can be achieved by connecting to the > remote and executing pg_log_standby_snapshot() at reasonable intervals > while waiting on standby during slot creation. This may increase speed > to a reasonable extent w/o having to wait for the user or bgwriter to > do the same for us. The current logical decoding uses a similar > approach to speed up the slot creation. I refer to usage of > LogStandbySnapshot in SnapBuildWaitSnapshot() and > ReplicationSlotReserveWal()). > Thoughts? > I think those are 2 distinct areas. My concern was more about the case when there is no activity at all on a newly created slot on the primary. The slot is created on the standby, but then we loop until there is activity on this slot on the primary. That's the test case I described in [1]. [1]: https://www.postgresql.org/message-id/afe4ab6c-dde3-48ea-acd8-6f6052c7b8fd%40gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 10/27/23 10:35 AM, Amit Kapila wrote: > On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand > > I think even if we provide such an API, we need to have logic to get > the slots from the primary and create them. Yeah, my idea was to add an API (in addition to what is already in place). > Say, even if the user used > the APIs, there may still be some new slots that the sync worker needs > to create. Right. > I think it might be better to provide a view for users to > view the current state of sync. For example, in the above case, we can > say "waiting for the primary to advance remote LSN" or something like > that. We are already displaying the wait event "ReplSlotsyncPrimaryCatchup" in pg_stat_activity so that might already be enough? My main idea was to be able to manually create/sync logical_slot2 in the test case described in [1] without waiting for activity on logical_slot1. But another (better?) option might be to change our current algorithm during slot creation on the standby? (to avoid an "active" slot having to wait on a "inactive" one, like described in [1]). [1]: https://www.postgresql.org/message-id/afe4ab6c-dde3-48ea-acd8-6f6052c7b8fd%40gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
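For reference, checking the wait event mentioned above only needs pg_stat_activity; the wait event name ReplSlotsyncPrimaryCatchup is defined by the patch, so the query only returns rows on a standby running with the patch applied:

    -- On the standby: show processes currently waiting for the primary's
    -- slot to catch up (wait event name as defined by the patch).
    SELECT pid, backend_type, wait_event_type, wait_event
    FROM   pg_stat_activity
    WHERE  wait_event = 'ReplSlotsyncPrimaryCatchup';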
On Fri, Oct 27, 2023 at 9:00 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 10/27/23 10:35 AM, Amit Kapila wrote: > > On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand > > > > I think even if we provide such an API, we need to have logic to get > > the slots from the primary and create them. > > Yeah, my idea was to add an API (in addition to what is already in place). > > > Say, even if the user used > > the APIs, there may still be some new slots that the sync worker needs > > to create. > > Right. > > > I think it might be better to provide a view for users to > > view the current state of sync. For example, in the above case, we can > > say "waiting for the primary to advance remote LSN" or something like > > that. > > We are already displaying the wait event "ReplSlotsyncPrimaryCatchup" in pg_stat_activity > so that might already be enough? > I am fine if the wait is already displayed in some form. > My main idea was to be able to manually create/sync logical_slot2 in the test case described in [1] > without waiting for activity on logical_slot1. > > But another (better?) option might be to change our current algorithm during slot creation on the > standby? (to avoid an "active" slot having to wait on a "inactive" one, like described in [1]). > Yeah, I guess it would be better to tweak the algorithm in this case such that the slots that can't be created immediately are noted in a separate list and we continue with the other remaining slots. Once we are finished with all the slots, this special list can be traversed and we can attempt to create the remaining ones. OTOH, the scenario you described doesn't sound like a frequent case to worry about, but if we can deal with it without adding much complexity then it would be good. -- With Regards, Amit Kapila.
On Thu, Oct 26, 2023 at 6:08 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Shveta, > > > PFA v25 patch set. The changes are: > > Thanks for making the patch! It seems that there are lots of comments, so > I can put some high-level comments for 0001. > Sorry if there are duplicated comments. > > 1. > The patch seemed not to consider the case that failover option between replication > slot and subscription were different. Currently slot option will be overwritten > by subscription one. > > Actually, I'm not sure what specification is better. Regarding the two_phase, > 2PC will be decoded only when the both of settings are true. Should we follow? > But this is the intention, we want the Alter subscription to be able to change the failover behaviour of the slot. > 2. > Currently ctx->failover is set only in the pgoutput_startup(), but not sure it is OK. > Can we change the parameter in CreateDecodingContext() or similar functions? > > Because IIUC it means that only slots which have pgoutput can wait. Other > output plugins must understand the change and set faliover flag as well - > I felt it is not good. E.g., you might miss to enable the parameter in test_decoding. > > Regarding the two_phase parameter, setting on plugin layer is good because it > quite affects the output. As for the failover, it is not related with the > content so that all of slots should be enabled. > > I think CreateDecodingContext or StartupDecodingContext() is the common path. > Or, is it the out-of-scope for now? Currently, the failover field is part of the options list in the StartReplicationCmd. This gives some level of flexibility such that only plugins that are interested in this need to handle it. The options list is only deparsed by plugins. If we move it to outside of the options list, this sort of changes the protocol for START_REPLICATION and will impact all plugins. But I agree to your larger point that, we need to do it in such a way that other plugins do not unintentionally change the 'failover' behaviour of the originally created slot. Maybe I can code it in such a way that, only if the failover option is specified in the list of options passed as part of START_REPLICATION will it change the original slot created 'failover' flag by adding another flag "failover_opt_given". Plugins that set this, will be able to change the failover flag of the slot, while plugins that do not support this will not set this and the failover flag of the created slot will remain. What do you think? regards, Ajin Cherian Fujitsu Australia
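The "failover_opt_given" idea above could look roughly like the following inside an output plugin's option parsing (a sketch only; parse_output_parameters() in pgoutput would be the obvious place, and the failover, failover_given and failover_option_given names are illustrative, not existing fields):

    else if (strcmp(defel->defname, "failover") == 0)
    {
        if (failover_option_given)
            ereport(ERROR,
                    (errcode(ERRCODE_SYNTAX_ERROR),
                     errmsg("conflicting or redundant options")));
        failover_option_given = true;

        /* Remember both the value and the fact that it was supplied. */
        data->failover = defGetBoolean(defel);
        data->failover_given = true;    /* sketch: patch-specific field */
    }

With that, the slot's failover flag would only be touched when the plugin actually received the option, so plugins that never pass it leave the flag as it was at slot creation.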
Dear Ajin, Thanks for your reply! > On Thu, Oct 26, 2023 at 6:08 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Shveta, > > > > > PFA v25 patch set. The changes are: > > > > Thanks for making the patch! It seems that there are lots of comments, so > > I can put some high-level comments for 0001. > > Sorry if there are duplicated comments. > > > > 1. > > The patch seemed not to consider the case that failover option between > replication > > slot and subscription were different. Currently slot option will be overwritten > > by subscription one. > > > > Actually, I'm not sure what specification is better. Regarding the two_phase, > > 2PC will be decoded only when the both of settings are true. Should we follow? > > > > But this is the intention, we want the Alter subscription to be able > to change the failover behaviour > of the slot. I had not understood how two_phase is enabled. I found that slot->data.two_phase is overwritten in CreateDecodingContext(), so the failover option now follows two_phase, right? (I think the overwritten of data.failover should be also done at CreateDecodingContext()). > > 2. > > Currently ctx->failover is set only in the pgoutput_startup(), but not sure it is > OK. > > Can we change the parameter in CreateDecodingContext() or similar functions? > > > > Because IIUC it means that only slots which have pgoutput can wait. Other > > output plugins must understand the change and set faliover flag as well - > > I felt it is not good. E.g., you might miss to enable the parameter in > test_decoding. > > > > Regarding the two_phase parameter, setting on plugin layer is good because it > > quite affects the output. As for the failover, it is not related with the > > content so that all of slots should be enabled. > > > > I think CreateDecodingContext or StartupDecodingContext() is the common > path. > > Or, is it the out-of-scope for now? > > Currently, the failover field is part of the options list in the > StartReplicationCmd. This gives some > level of flexibility such that only plugins that are interested in > this need to handle it. The options list > is only deparsed by plugins. If we move it to outside of the options list, > this sort of changes the protocol for START_REPLICATION and will > impact all plugins. > But I agree to your larger point that, we need to do it in such a way that > other plugins do not unintentionally change the 'failover' behaviour > of the originally created slot. > Maybe I can code it in such a way that, only if the failover option is > specified in the list of options > passed as part of START_REPLICATION will it change the original slot > created 'failover' flag by adding > another flag "failover_opt_given". Plugins that set this, will be able > to change the failover flag of the slot, > while plugins that do not support this will not set this and the > failover flag of the created slot will remain. > What do you think? May be OK, but I came up with a corner case that external plugins have a streaming option 'failover'. What should be? Has the option been reserved? Best Regards, Hayato Kuroda FUJITSU LIMITED
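To make the CreateDecodingContext() suggestion concrete: mirroring roughly what that function already does for two_phase, the slot's persistent failover flag could be updated in the common decoding path instead of in pgoutput, so every output plugin gets the same behaviour. A rough sketch (ctx->failover and slot->data.failover are the patch's fields; the exact condition is illustrative):

    /*
     * Sketch: in CreateDecodingContext(), after the output plugin options
     * have been parsed, persist a changed failover setting on the slot.
     */
    if (ctx->failover != slot->data.failover)
    {
        SpinLockAcquire(&slot->mutex);
        slot->data.failover = ctx->failover;
        SpinLockRelease(&slot->mutex);

        ReplicationSlotMarkDirty();
        ReplicationSlotSave();
    }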
On Tue, Oct 31, 2023 at 7:16 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > > 2. > > > Currently ctx->failover is set only in the pgoutput_startup(), but not sure it is > > OK. > > > Can we change the parameter in CreateDecodingContext() or similar functions? > > > > > > Because IIUC it means that only slots which have pgoutput can wait. Other > > > output plugins must understand the change and set faliover flag as well - > > > I felt it is not good. E.g., you might miss to enable the parameter in > > test_decoding. > > > > > > Regarding the two_phase parameter, setting on plugin layer is good because it > > > quite affects the output. As for the failover, it is not related with the > > > content so that all of slots should be enabled. > > > > > > I think CreateDecodingContext or StartupDecodingContext() is the common > > path. > > > Or, is it the out-of-scope for now? > > > > Currently, the failover field is part of the options list in the > > StartReplicationCmd. This gives some > > level of flexibility such that only plugins that are interested in > > this need to handle it. The options list > > is only deparsed by plugins. If we move it to outside of the options list, > > this sort of changes the protocol for START_REPLICATION and will > > impact all plugins. > > But I agree to your larger point that, we need to do it in such a way that > > other plugins do not unintentionally change the 'failover' behaviour > > of the originally created slot. > > Maybe I can code it in such a way that, only if the failover option is > > specified in the list of options > > passed as part of START_REPLICATION will it change the original slot > > created 'failover' flag by adding > > another flag "failover_opt_given". Plugins that set this, will be able > > to change the failover flag of the slot, > > while plugins that do not support this will not set this and the > > failover flag of the created slot will remain. > > What do you think? > > May be OK, but I came up with a corner case that external plugins have a streaming > option 'failover'. What should be? Has the option been reserved? > Sorry, your question is not clear to me. Did you intend to say that the value of the existing streaming option could be 'failover'? -- With Regards, Amit Kapila.
On Tue, Oct 31, 2023 at 11:21 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Oct 31, 2023 at 7:16 AM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > > > 2. > > > > Currently ctx->failover is set only in the pgoutput_startup(), but not sure it is > > > OK. > > > > Can we change the parameter in CreateDecodingContext() or similar functions? > > > > > > > > Because IIUC it means that only slots which have pgoutput can wait. Other > > > > output plugins must understand the change and set faliover flag as well - > > > > I felt it is not good. E.g., you might miss to enable the parameter in > > > test_decoding. > > > > > > > > Regarding the two_phase parameter, setting on plugin layer is good because it > > > > quite affects the output. As for the failover, it is not related with the > > > > content so that all of slots should be enabled. > > > > > > > > I think CreateDecodingContext or StartupDecodingContext() is the common > > > path. > > > > Or, is it the out-of-scope for now? > > > > > > Currently, the failover field is part of the options list in the > > > StartReplicationCmd. This gives some > > > level of flexibility such that only plugins that are interested in > > > this need to handle it. The options list > > > is only deparsed by plugins. If we move it to outside of the options list, > > > this sort of changes the protocol for START_REPLICATION and will > > > impact all plugins. > > > But I agree to your larger point that, we need to do it in such a way that > > > other plugins do not unintentionally change the 'failover' behaviour > > > of the originally created slot. > > > Maybe I can code it in such a way that, only if the failover option is > > > specified in the list of options > > > passed as part of START_REPLICATION will it change the original slot > > > created 'failover' flag by adding > > > another flag "failover_opt_given". Plugins that set this, will be able > > > to change the failover flag of the slot, > > > while plugins that do not support this will not set this and the > > > failover flag of the created slot will remain. > > > What do you think? > > > > May be OK, but I came up with a corner case that external plugins have a streaming > > option 'failover'. What should be? Has the option been reserved? > > > > Sorry, your question is not clear to me. Did you intend to say that > the value of the existing streaming option could be 'failover'? > > -- > With Regards, > Amit Kapila. PFA v27 patch-set which has below changes: 1) Enhanced WalSndWait to replace ConditionVariableSleep on WalSndCtl->wal_confirm_rcv_cv as per suggestion in [1]. 2) WalSndWaitForWal and WalSndWaitForStandbyConfirmation is now integrated as per suggestion in [2]. WalSndWait is invoked only once. 3) Optimized slot-creation algorithm on standby as per suggestion in [3]. Now, during the first attempt of slots-creation we create all active slots and add inactive ones to the pending list and then we wait on them in the second attempt. 4) Added basic tests for failover slots. Changes for 1 and 2 are in patch001 and for 3 and 4 are in patch002. Thanks Hou-San for implementing changes for 1 and 2. Thanks Ajin for implementing failover tests/4. 
[1]: https://www.postgresql.org/message-id/f3228cfb-7bf3-4bd8-8f37-c55fc4054759%40gmail.com [2]: https://www.postgresql.org/message-id/CAA4eK1J49j5ew-Tk4Ygv0nbjurJz12kZtqjHLALFuL03NBZdsg%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com thanks Shveta
Attachment
On Thu, Oct 26, 2023 at 5:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Oct 26, 2023 at 5:38 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > On 10/26/23 10:40 AM, Amit Kapila wrote: > > > On Wed, Oct 25, 2023 at 8:49 PM Drouvot, Bertrand > > > <bertranddrouvot.pg@gmail.com> wrote: > > >> > > > > > > Good point, I think we should enhance the WalSndWait() logic to > > > address this case. > > > > Agree. I think it would need to take care of the new CV and probably > > provide a way for the caller to detect it stopped waiting due to the socket > > (I don't think it can find out currently). > > > > > Additionally, I think we should ensure that > > > WalSndWaitForWal() shouldn't wait twice once for wal_flush and a > > > second time for wal to be replayed by physical standby. It should be > > > okay to just wait for Wal to be replayed by physical standby when > > > applicable, otherwise, just wait for Wal to flush as we are doing now. > > > Does that make sense? > > > > Yeah, I think so. What about moving WalSndWaitForStandbyConfirmation() > > outside of WalSndWaitForWal() and call one or the other in logical_read_xlog_page()? > > > > I think we need to somehow integrate the logic of both functions. Let > us see what the patch author has to say about this. Amit, this is attempted in v27. thanks Shveta
On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 10/27/23 11:56 AM, shveta malik wrote: > > On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> Hi, > >> > >> On 10/25/23 5:00 AM, shveta malik wrote: > >>> On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand > >>> <bertranddrouvot.pg@gmail.com> wrote: > >>>> > >>>> Hi, > >>>> > >>>> On 10/23/23 2:56 PM, shveta malik wrote: > >>>>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand > >>>>> <bertranddrouvot.pg@gmail.com> wrote: > >>>> > >>>>>> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if there > >>>>>> is new synced slot(s) to be created on the standby. Do we want to keep this behavior > >>>>>> for V1? > >>>>>> > >>>>> > >>>>> I think for the slotsync workers case, we should reduce the naptime in > >>>>> the launcher to say 30sec and retain the default one of 3mins for > >>>>> subscription apply workers. Thoughts? > >>>>> > >>>> > >>>> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new > >>>> API on the standby that would refresh the list of sync slot at wish, thoughts? > >>>> > >>> > >>> Do you mean API to refresh list of DBIDs rather than sync-slots? > >>> As per current design, launcher gets DBID lists for all the failover > >>> slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. > >> > >> I mean an API to get a newly created slot on the primary being created/synced on > >> the standby at wish. > >> > >> Also let's imagine this scenario: > >> > >> - create logical_slot1 on the primary (and don't start using it) > >> > >> Then on the standby we'll get things like: > >> > >> 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot "logical_slot1" LSN (0/C00316A0) and catalog xmin(752) to pass local slot LSN (0/C0049530) and and catalog xmin (754) > >> > >> That's expected and due to the fact that ReplicationSlotReserveWal() does set the slot > >> restart_lsn to a value < at the corresponding restart_lsn slot on the primary. > >> > >> - create logical_slot2 on the primary (and start using it) > >> > >> Then logical_slot2 won't be created/synced on the standby until there is activity on logical_slot1 on the primary > >> that would produce things like: > >> 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot "logical_slot1" as its LSN (0/C005FFD8) and catalogxmin (756) has now passed local slot LSN (0/C0049530) and catalog xmin (754) > > > > > > Slight correction to above. As soon as we start activity on > > logical_slot2, it will impact all the slots on primary, as the WALs > > are consumed by all the slots. So even if there is activity on > > logical_slot2, logical_slot1 creation on standby will be unblocked and > > it will then move to logical_slot2 creation. eg: > > > > --on standby: > > 2023-10-27 15:15:46.069 IST [696884] LOG: waiting for remote slot > > "mysubnew1_1" LSN (0/3C97970) and catalog xmin (756) to pass local > > slot LSN (0/3C979A8) and and catalog xmin (756) > > > > on primary: > > newdb1=# select now(); > > now > > ---------------------------------- > > 2023-10-27 15:15:51.504835+05:30 > > (1 row) > > > > --activity on mysubnew1_3 > > newdb1=# insert into tab1_3 values(1); > > INSERT 0 1 > > newdb1=# select now(); > > now > > ---------------------------------- > > 2023-10-27 15:15:54.651406+05:30 > > > > > > --on standby, mysubnew1_1 is unblocked. 
> > 2023-10-27 15:15:56.223 IST [696884] LOG: wait over for remote slot > > "mysubnew1_1" as its LSN (0/3C97A18) and catalog xmin (757) has now > > passed local slot LSN (0/3C979A8) and catalog xmin (756) > > > > My Setup: > > mysubnew1_1 -->mypubnew1_1 -->tab1_1 > > mysubnew1_3 -->mypubnew1_3-->tab1_3 > > > > Agree with your test case, but in my case I was not using pub/sub. > > I was not clear, so when I said: > > >> - create logical_slot1 on the primary (and don't start using it) > > I meant don't start decoding from it (like using pg_recvlogical() or > pg_logical_slot_get_changes()). > > By using pub/sub the "don't start using it" is not satisfied. > > My test case is: > > " > SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 'test_decoding', false, true, true); > SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 'test_decoding', false, true, true); > pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f - > " > Okay, I am able to reproduce it now. Thanks for the clarification. I have tried to change the algorithm as per Amit's suggestion in [1]. [1]: https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com This is not a foolproof solution but an optimization over the first one. Now, in any sync-cycle, we take 2 attempts at slot creation (if any slots are available to be created). In the first attempt, we do not wait indefinitely on inactive slots; we wait only for a fixed amount of time and, if the remote slot is still behind, we add it to a pending list and move on to the next slot. Once we are done with the first attempt, in the second attempt we go through the pending ones and wait on each of them until the primary catches up. > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
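The two-attempt scheme described above, compressed into pseudo-C for readability (remote_slot_list, RemoteSlot, SYNC_SLOT_WAIT_MS, sync_slot_with_timeout() and sync_slot_wait_forever() are illustrative placeholders, not the patch's actual symbols):

    List       *pending = NIL;
    ListCell   *lc;

    /* Attempt 1: bounded wait; defer any remote slot that is still behind. */
    foreach(lc, remote_slot_list)
    {
        RemoteSlot *rs = (RemoteSlot *) lfirst(lc);

        if (!sync_slot_with_timeout(rs, SYNC_SLOT_WAIT_MS))
            pending = lappend(pending, rs);
    }

    /* Attempt 2: wait on each deferred slot until the primary catches up. */
    foreach(lc, pending)
    {
        RemoteSlot *rs = (RemoteSlot *) lfirst(lc);

        sync_slot_wait_forever(rs);
    }

This way an inactive slot can no longer block the creation of the other, active slots within the same cycle.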
On Fri, Oct 27, 2023 at 4:04 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Oct 27, 2023 at 3:26 PM shveta malik <shveta.malik@gmail.com> wrote: > ========== > > Open questions regarding change for pt 1 above: > a) I think we should restrict the 'alter-sub set failover' when > failover-state is currently in 'p' (pending) state i.e. table-sync is > going over. Once table-sync is over, then toggle of 'failover' should > be allowed using alter-subscription. > Agreed. > b) Currently I have restricted 'alter subscription.. refresh > publication with copy=true' when failover=true (on a similar line of > two-phase). The reason being, refresh with copy=true will go for > table-sync again and since failover was set in main-slot after > table-sync was done, it will need going through the same transition of > 'p' to 'e' for main slot making it unsyncable for that time. Should it > be allowed? > Yeah, I also think we can't allow refresh with copy=true when 'failover' is enabled. I think the current implementation of this flag seems a bit clumsy because 'failover' is a slot property and we are trying to map it to plugin_options. It has to be considered similar to the opt_temporary option while creating the slot. We have create_replication_slot and drop_replication_slot in repl_gram.y. How about if introduce alter_replication_slot and handle the 'failover' flag with that? The idea is we will either enable 'failover' at the time create_replication_slot by providing an optional failover option or execute a separate command alter_replication_slot. I think we probably need to perform this command before the start of streaming. I think we will have the following options to allow alter of the 'failover' property: (a) we can allow altering 'failover' only for the 'disabled' subscription; to achieve that, we need to open a connection during alter subscription and change this property of slot; (b) apply worker detects the change in 'failover' option; run the alter_replication_slot command; this needs more analysis as apply_worker is already doing streaming and changing slot property in between could be tricky. -- With Regards, Amit Kapila.
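From the user's point of view, option (a) above would amount to something like the following ('failover' as a subscription option is what the patch set proposes, and requiring the subscription to be disabled first is exactly the point under discussion):

    -- Option (a): only a disabled subscription may have its failover
    -- property toggled; the server would then also alter the remote slot.
    ALTER SUBSCRIPTION mysub DISABLE;
    ALTER SUBSCRIPTION mysub SET (failover = true);
    ALTER SUBSCRIPTION mysub ENABLE;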
On Tuesday, October 31, 2023 6:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > b) Currently I have restricted 'alter subscription.. refresh > > publication with copy=true' when failover=true (on a similar line of > > two-phase). The reason being, refresh with copy=true will go for > > table-sync again and since failover was set in main-slot after > > table-sync was done, it will need going through the same transition of > > 'p' to 'e' for main slot making it unsyncable for that time. Should it > > be allowed? > > > > Yeah, I also think we can't allow refresh with copy=true when 'failover' is > enabled. > > I think the current implementation of this flag seems a bit clumsy because > 'failover' is a slot property and we are trying to map it to plugin_options. It has > to be considered similar to the opt_temporary option while creating the slot. > > We have create_replication_slot and drop_replication_slot in repl_gram.y. How > about if introduce alter_replication_slot and handle the 'failover' flag with that? > The idea is we will either enable 'failover' at the time create_replication_slot by > providing an optional failover option or execute a separate command > alter_replication_slot. I think we probably need to perform this command > before the start of streaming. > > I think we will have the following options to allow alter of the 'failover' > property: (a) we can allow altering 'failover' only for the 'disabled' subscription; > to achieve that, we need to open a connection during alter subscription and > change this property of slot; (b) apply worker detects the change in 'failover' > option; run the alter_replication_slot command; this needs more analysis as > apply_worker is already doing streaming and changing slot property in > between could be tricky. I think for approach b), one challenge is the handling of the error case. E.g. If the apply worker errored out when executing the alter_replication_slot command, it may not be able to retry that after restarting, because it won't know if the value has changed before. (Or we have to execute alter_replication_slot always at the beginning in apply worker which seems not great). Best Regards, Hou zj
On Tuesday, October 31, 2023 6:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Oct 27, 2023 at 4:04 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Fri, Oct 27, 2023 at 3:26 PM shveta malik <shveta.malik@gmail.com> > wrote: > > ========== > > > > Open questions regarding change for pt 1 above: > > a) I think we should restrict the 'alter-sub set failover' when > > failover-state is currently in 'p' (pending) state i.e. table-sync is > > going over. Once table-sync is over, then toggle of 'failover' should > > be allowed using alter-subscription. > > > > Agreed. > > > b) Currently I have restricted 'alter subscription.. refresh > > publication with copy=true' when failover=true (on a similar line of > > two-phase). The reason being, refresh with copy=true will go for > > table-sync again and since failover was set in main-slot after > > table-sync was done, it will need going through the same transition of > > 'p' to 'e' for main slot making it unsyncable for that time. Should it > > be allowed? > > > > Yeah, I also think we can't allow refresh with copy=true when 'failover' is > enabled. > > I think the current implementation of this flag seems a bit clumsy because > 'failover' is a slot property and we are trying to map it to plugin_options. It has > to be considered similar to the opt_temporary option while creating the slot. > > We have create_replication_slot and drop_replication_slot in repl_gram.y. How > about if introduce alter_replication_slot and handle the 'failover' flag with that? > The idea is we will either enable 'failover' at the time create_replication_slot by > providing an optional failover option or execute a separate command > alter_replication_slot. I think we probably need to perform this command > before the start of streaming. Here is an attempt to achieve the same. I added a new replication command alter_replication_slot and introduced a walreceiver api walrcv_alter_slot to execute the command. The subscription will call the api to enable/disable the failover of the slot on publisher. The patch disallows altering the failover option for the subscription. But we could release the restriction by using the following approaches in next version: > I think we will have the following options to allow alter of the 'failover' > property: (a) we can allow altering 'failover' only for the 'disabled' > subscription; to achieve that, we need to open a connection during alter > subscription and change this property of slot; (b) apply worker detects the > change in 'failover' option; run the alter_replication_slot command; this needs > more analysis as apply_worker is already doing streaming and changing slot > property in between could be tricky. Best Regards, Hou zj
Attachment
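For readers who want the subscriber-side picture of the walrcv_alter_slot approach above, a minimal sketch under the patch's proposed 'failover' subscription option looks something like this (the connection string, subscription, and publication names are made up; note that this version of the patch disallows altering the option afterwards, so only the create-time form is shown):

    CREATE SUBSCRIPTION sub1
        CONNECTION 'host=primary dbname=postgres user=repl'
        PUBLICATION pub1
        WITH (failover = true);

Under the covers the subscription would then either pass the option at slot creation or call the new alter_replication_slot command, as described above.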
On Thursday, November 2, 2023 8:27 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Tuesday, October 31, 2023 6:45 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Fri, Oct 27, 2023 at 4:04 PM shveta malik <shveta.malik@gmail.com> > wrote: > > > > > > On Fri, Oct 27, 2023 at 3:26 PM shveta malik > > > <shveta.malik@gmail.com> > > wrote: > > > ========== > > > > > > Open questions regarding change for pt 1 above: > > > a) I think we should restrict the 'alter-sub set failover' when > > > failover-state is currently in 'p' (pending) state i.e. table-sync > > > is going over. Once table-sync is over, then toggle of 'failover' > > > should be allowed using alter-subscription. > > > > > > > Agreed. > > > > > b) Currently I have restricted 'alter subscription.. refresh > > > publication with copy=true' when failover=true (on a similar line of > > > two-phase). The reason being, refresh with copy=true will go for > > > table-sync again and since failover was set in main-slot after > > > table-sync was done, it will need going through the same transition > > > of 'p' to 'e' for main slot making it unsyncable for that time. > > > Should it be allowed? > > > > > > > Yeah, I also think we can't allow refresh with copy=true when > > 'failover' is enabled. > > > > I think the current implementation of this flag seems a bit clumsy > > because 'failover' is a slot property and we are trying to map it to > > plugin_options. It has to be considered similar to the opt_temporary option > while creating the slot. > > > > We have create_replication_slot and drop_replication_slot in > > repl_gram.y. How about if introduce alter_replication_slot and handle the > 'failover' flag with that? > > The idea is we will either enable 'failover' at the time > > create_replication_slot by providing an optional failover option or > > execute a separate command alter_replication_slot. I think we probably > > need to perform this command before the start of streaming. > > Here is an attempt to achieve the same. I added a new replication command > alter_replication_slot and introduced a walreceiver api walrcv_alter_slot to > execute the command. The subscription will call the api to enable/disable the > failover of the slot on publisher. Here is the new version patch set(V29) which addressed Peter comments[1][2] and fixed one doc compile error. Thanks Ajin for helping address some of the comments. [1] https://www.postgresql.org/message-id/CAHut%2BPspseC03Fhsi%3DOqOtksagspE%2B0MVOhrhhUb64cc_4SE1w%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAHut%2BPubYbmLpGeOd2QTBPhHwtZa-Qm9Kg38Cu_EiG%2B1RbV47g%40mail.gmail.com Best Regards, Hou zj
Attachment
On Thu, Nov 2, 2023 at 2:35 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Here is the new version patch set(V29) which addressed Peter comments[1][2] and > fixed one doc compile error. > Few comments: ============== 1. + <varlistentry id="sql-createsubscription-params-with-failover"> + <term><literal>failover</literal> (<type>boolean</type>)</term> + <listitem> + <para> + Specifies whether the replication slot assocaited with the subscription + is enabled to be synced to the physical standbys so that logical + replication can be resumed from the new primary after failover. + The default is <literal>true</literal>. Why do you think it is a good idea to keep the default value as true? I think the user needs to enable standby for syncing slots which is not a default feature, so by default, the failover property should also be false. AFAICS, it is false for create_slot SQL API as per the below change; so that way also keeping default true for a subscription doesn't make sense. @@ -479,6 +479,7 @@ CREATE OR REPLACE FUNCTION pg_create_logical_replication_slot( IN slot_name name, IN plugin name, IN temporary boolean DEFAULT false, IN twophase boolean DEFAULT false, + IN failover boolean DEFAULT false, OUT slot_name name, OUT lsn pg_lsn) BTW, the below change indicates that the code treats default as false; so, it seems to be a documentation error. @@ -157,6 +158,8 @@ parse_subscription_options(ParseState *pstate, List *stmt_options, opts->runasowner = false; if (IsSet(supported_opts, SUBOPT_ORIGIN)) opts->origin = pstrdup(LOGICALREP_ORIGIN_ANY); + if (IsSet(supported_opts, SUBOPT_FAILOVER)) + opts->failover = false; 2. - /* * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands. * Spurious line removal. 3. + else if (opts.slot_name && failover_enabled) + { + walrcv_alter_slot(wrconn, opts.slot_name, opts.failover); + ereport(NOTICE, + (errmsg("altered replication slot \"%s\" on publisher", + opts.slot_name))); + } I think we can add a comment to describe why it makes sense to enable the failover property of the slot in this case. Can we change the notice message to: "enabled failover for replication slot \"%s\" on publisher" 4. libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname, - bool temporary, bool two_phase, CRSSnapshotAction snapshot_action, - XLogRecPtr *lsn) + bool temporary, bool two_phase, bool failover, + CRSSnapshotAction snapshot_action, XLogRecPtr *lsn) { PGresult *res; StringInfoData cmd; @@ -913,7 +917,14 @@ libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname, else appendStringInfoChar(&cmd, ' '); } - + if (failover) + { + appendStringInfoString(&cmd, "FAILOVER"); + if (use_new_options_syntax) + appendStringInfoString(&cmd, ", "); + else + appendStringInfoChar(&cmd, ' '); + } I don't see a corresponding change in repl_gram.y. I think the following part of the code needs to be changed: /* CREATE_REPLICATION_SLOT slot [TEMPORARY] LOGICAL plugin [options] */ | K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT create_slot_options You also need to update the docs for the same. See [1]. 5. @@ -228,6 +230,28 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin NameStr(MyReplicationSlot->data.plugin), format_procedure(fcinfo->flinfo->fn_oid)))); .. 
+ if (XLogRecPtrIsInvalid(upto_lsn)) + wal_to_wait = end_of_wal; + else + wal_to_wait = Min(upto_lsn, end_of_wal); + + /* Initialize standby_slot_names_list */ + SlotSyncInitConfig(); + + /* + * Wait for specified streaming replication standby servers (if any) + * to confirm receipt of WAL upto wal_to_wait. + */ + WalSndWaitForStandbyConfirmation(wal_to_wait); + + /* + * The memory context used to allocate standby_slot_names_list will be + * freed at the end of this call. So free and nullify the list in + * order to avoid usage of freed list in the next call to this + * function. + */ + SlotSyncFreeConfig(); What if there is an error in WalSndWaitForStandbyConfirmation() before calling SlotSyncFreeConfig()? I think the problem you are trying to avoid by freeing it here can occur. I think it is better to do this in a logical decoding context and free the list along with it as we are doing in commit c7256e6564(see PG15). Also, it is better to allocate this list somewhere in WalSndWaitForStandbyConfirmation(), probably in WalSndGetStandbySlots, that will make the code look neat and also avoid allocating this list when failover is not enabled for the slot. 6. +/* ALTER_REPLICATION_SLOT slot */ +alter_replication_slot: + K_ALTER_REPLICATION_SLOT IDENT '(' generic_option_list ')' I think you need to update the docs for this new command. See existing docs [1]. [1] - https://www.postgresql.org/docs/devel/protocol-replication.html -- With Regards, Amit Kapila.
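As a small illustration of the SQL-level API change quoted in point 1, the new optional argument would be used roughly like this (slot and plugin names are illustrative; the argument order follows the signature shown in the diff):

    SELECT * FROM pg_create_logical_replication_slot(
               'failover_slot',   -- slot_name
               'test_decoding',   -- plugin
               false,             -- temporary
               false,             -- twophase
               true);             -- failover (the patch's new parameter)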
On Friday, November 3, 2023 7:32 PM Amit Kapila <amit.kapila16@gmail.com> > > On Thu, Nov 2, 2023 at 2:35 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > Here is the new version patch set(V29) which addressed Peter > > comments[1][2] and fixed one doc compile error. > > > > Few comments: > ============== > 1. > + <varlistentry id="sql-createsubscription-params-with-failover"> > + <term><literal>failover</literal> (<type>boolean</type>)</term> > + <listitem> > + <para> > + Specifies whether the replication slot assocaited with the > subscription > + is enabled to be synced to the physical standbys so that logical > + replication can be resumed from the new primary after failover. > + The default is <literal>true</literal>. > > Why do you think it is a good idea to keep the default value as true? > I think the user needs to enable standby for syncing slots which is not a default > feature, so by default, the failover property should also be false. AFAICS, it is > false for create_slot SQL API as per the below change; so that way also keeping > default true for a subscription doesn't make sense. > @@ -479,6 +479,7 @@ CREATE OR REPLACE FUNCTION > pg_create_logical_replication_slot( > IN slot_name name, IN plugin name, > IN temporary boolean DEFAULT false, > IN twophase boolean DEFAULT false, > + IN failover boolean DEFAULT false, > OUT slot_name name, OUT lsn pg_lsn) > > BTW, the below change indicates that the code treats default as false; so, it > seems to be a documentation error. I think the document is wrong and fixed it. > > 2. > - > /* > * Common option parsing function for CREATE and ALTER SUBSCRIPTION > commands. > * > > Spurious line removal. > > 3. > + else if (opts.slot_name && failover_enabled) { > + walrcv_alter_slot(wrconn, opts.slot_name, opts.failover); > + ereport(NOTICE, (errmsg("altered replication slot \"%s\" on > + publisher", opts.slot_name))); } > > I think we can add a comment to describe why it makes sense to enable the > failover property of the slot in this case. Can we change the notice message to: > "enabled failover for replication slot \"%s\" on publisher" Added. > > 4. > libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname, > - bool temporary, bool two_phase, CRSSnapshotAction snapshot_action, > - XLogRecPtr *lsn) > + bool temporary, bool two_phase, bool failover, CRSSnapshotAction > + snapshot_action, XLogRecPtr *lsn) > { > PGresult *res; > StringInfoData cmd; > @@ -913,7 +917,14 @@ libpqrcv_create_slot(WalReceiverConn *conn, const > char *slotname, > else > appendStringInfoChar(&cmd, ' '); > } > - > + if (failover) > + { > + appendStringInfoString(&cmd, "FAILOVER"); if (use_new_options_syntax) > + appendStringInfoString(&cmd, ", "); else appendStringInfoChar(&cmd, ' > + '); } > > I don't see a corresponding change in repl_gram.y. I think the following part of > the code needs to be changed: > /* CREATE_REPLICATION_SLOT slot [TEMPORARY] LOGICAL plugin [options] */ > | K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT > create_slot_options > I think after 0266e98, we started to use the new syntax(see the generic_option_list rule) and we can avoid changing the repl_gram.y when adding new options. The new failover can be detected when parsing the generic option list(in parseCreateReplSlotOptions). > > 5. > @@ -228,6 +230,28 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo > fcinfo, bool confirm, bool bin > NameStr(MyReplicationSlot->data.plugin), > format_procedure(fcinfo->flinfo->fn_oid)))); > .. 
> + if (XLogRecPtrIsInvalid(upto_lsn)) > + wal_to_wait = end_of_wal; > + else > + wal_to_wait = Min(upto_lsn, end_of_wal); > + > + /* Initialize standby_slot_names_list */ SlotSyncInitConfig(); > + > + /* > + * Wait for specified streaming replication standby servers (if any) > + * to confirm receipt of WAL upto wal_to_wait. > + */ > + WalSndWaitForStandbyConfirmation(wal_to_wait); > + > + /* > + * The memory context used to allocate standby_slot_names_list will be > + * freed at the end of this call. So free and nullify the list in > + * order to avoid usage of freed list in the next call to this > + * function. > + */ > + SlotSyncFreeConfig(); > > What if there is an error in WalSndWaitForStandbyConfirmation() before calling > SlotSyncFreeConfig()? I think the problem you are trying to avoid by freeing it > here can occur. I think it is better to do this in a logical decoding context and > free the list along with it as we are doing in commit c7256e6564(see PG15). I will analyze more about this case and update in next version. > Also, > it is better to allocate this list somewhere in > WalSndWaitForStandbyConfirmation(), probably in WalSndGetStandbySlots, > that will make the code look neat and also avoid allocating this list when > failover is not enabled for the slot. Changed as suggested. > > 6. > +/* ALTER_REPLICATION_SLOT slot */ > +alter_replication_slot: > + K_ALTER_REPLICATION_SLOT IDENT '(' generic_option_list ')' > > I think you need to update the docs for this new command. See existing docs > [1]. > > [1] - https://www.postgresql.org/docs/devel/protocol-replication.html I think the doc for alter_replication_slot was added in V29. Attach the V30 patch set which addressed above comments and fixed CFbot failures. Best Regards, Hou zj
Attachment
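To restate the standby_slot_names piece of the patch in user-facing terms, the primary would be configured with something like the following so that failover-enabled slots wait for the listed physical standbys (the GUC name comes from the patch; the slot name and the assumption that a reload is enough to apply it are mine):

    -- on the primary
    ALTER SYSTEM SET standby_slot_names = 'standby1_phys_slot';
    SELECT pg_reload_conf();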
On Mon, Nov 6, 2023 at 7:01 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, November 3, 2023 7:32 PM Amit Kapila <amit.kapila16@gmail.com> > > > > On Thu, Nov 2, 2023 at 2:35 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > > wrote: > > > > > > Here is the new version patch set(V29) which addressed Peter > > > comments[1][2] and fixed one doc compile error. > > > > > > > Few comments: > > ============== > > 1. > > + <varlistentry id="sql-createsubscription-params-with-failover"> > > + <term><literal>failover</literal> (<type>boolean</type>)</term> > > + <listitem> > > + <para> > > + Specifies whether the replication slot assocaited with the > > subscription > > + is enabled to be synced to the physical standbys so that logical > > + replication can be resumed from the new primary after failover. > > + The default is <literal>true</literal>. > > > > Why do you think it is a good idea to keep the default value as true? > > I think the user needs to enable standby for syncing slots which is not a default > > feature, so by default, the failover property should also be false. AFAICS, it is > > false for create_slot SQL API as per the below change; so that way also keeping > > default true for a subscription doesn't make sense. > > @@ -479,6 +479,7 @@ CREATE OR REPLACE FUNCTION > > pg_create_logical_replication_slot( > > IN slot_name name, IN plugin name, > > IN temporary boolean DEFAULT false, > > IN twophase boolean DEFAULT false, > > + IN failover boolean DEFAULT false, > > OUT slot_name name, OUT lsn pg_lsn) > > > > BTW, the below change indicates that the code treats default as false; so, it > > seems to be a documentation error. > > I think the document is wrong and fixed it. > > > > > 2. > > - > > /* > > * Common option parsing function for CREATE and ALTER SUBSCRIPTION > > commands. > > * > > > > Spurious line removal. > > > > 3. > > + else if (opts.slot_name && failover_enabled) { > > + walrcv_alter_slot(wrconn, opts.slot_name, opts.failover); > > + ereport(NOTICE, (errmsg("altered replication slot \"%s\" on > > + publisher", opts.slot_name))); } > > > > I think we can add a comment to describe why it makes sense to enable the > > failover property of the slot in this case. Can we change the notice message to: > > "enabled failover for replication slot \"%s\" on publisher" > > Added. > > > > > 4. > > libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname, > > - bool temporary, bool two_phase, CRSSnapshotAction snapshot_action, > > - XLogRecPtr *lsn) > > + bool temporary, bool two_phase, bool failover, CRSSnapshotAction > > + snapshot_action, XLogRecPtr *lsn) > > { > > PGresult *res; > > StringInfoData cmd; > > @@ -913,7 +917,14 @@ libpqrcv_create_slot(WalReceiverConn *conn, const > > char *slotname, > > else > > appendStringInfoChar(&cmd, ' '); > > } > > - > > + if (failover) > > + { > > + appendStringInfoString(&cmd, "FAILOVER"); if (use_new_options_syntax) > > + appendStringInfoString(&cmd, ", "); else appendStringInfoChar(&cmd, ' > > + '); } > > > > I don't see a corresponding change in repl_gram.y. I think the following part of > > the code needs to be changed: > > /* CREATE_REPLICATION_SLOT slot [TEMPORARY] LOGICAL plugin [options] */ > > | K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT > > create_slot_options > > > > I think after 0266e98, we started to use the new syntax(see the > generic_option_list rule) and we can avoid changing the repl_gram.y when adding > new options. 
The new failover can be detected when parsing the generic option > list(in parseCreateReplSlotOptions). > > > > > > 5. > > @@ -228,6 +230,28 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo > > fcinfo, bool confirm, bool bin > > NameStr(MyReplicationSlot->data.plugin), > > format_procedure(fcinfo->flinfo->fn_oid)))); > > .. > > + if (XLogRecPtrIsInvalid(upto_lsn)) > > + wal_to_wait = end_of_wal; > > + else > > + wal_to_wait = Min(upto_lsn, end_of_wal); > > + > > + /* Initialize standby_slot_names_list */ SlotSyncInitConfig(); > > + > > + /* > > + * Wait for specified streaming replication standby servers (if any) > > + * to confirm receipt of WAL upto wal_to_wait. > > + */ > > + WalSndWaitForStandbyConfirmation(wal_to_wait); > > + > > + /* > > + * The memory context used to allocate standby_slot_names_list will be > > + * freed at the end of this call. So free and nullify the list in > > + * order to avoid usage of freed list in the next call to this > > + * function. > > + */ > > + SlotSyncFreeConfig(); > > > > What if there is an error in WalSndWaitForStandbyConfirmation() before calling > > SlotSyncFreeConfig()? I think the problem you are trying to avoid by freeing it > > here can occur. I think it is better to do this in a logical decoding context and > > free the list along with it as we are doing in commit c7256e6564(see PG15). > > I will analyze more about this case and update in next version. > > > Also, > > it is better to allocate this list somewhere in > > WalSndWaitForStandbyConfirmation(), probably in WalSndGetStandbySlots, > > that will make the code look neat and also avoid allocating this list when > > failover is not enabled for the slot. > > Changed as suggested. > > > > > > 6. > > +/* ALTER_REPLICATION_SLOT slot */ > > +alter_replication_slot: > > + K_ALTER_REPLICATION_SLOT IDENT '(' generic_option_list ')' > > > > I think you need to update the docs for this new command. See existing docs > > [1]. > > > > [1] - https://www.postgresql.org/docs/devel/protocol-replication.html > > I think the doc for alter_replication_slot was added in V29. > > Attach the V30 patch set which addressed above comments and fixed CFbot failures. > Thanks Hou-San for the patches. + /* The primary_slot_name is not set */ + if (!WalRcv || WalRcv->slotname[0] == '\0') + { + ereport(WARNING, + errmsg("skipping slots synchronization as primary_slot_name " + "is not set.")); + + /* + * It's possible that the Walreceiver has not been started yet, adjust + * the wait_time to retry sooner in the next synchronization cycle. + */ + *wait_time = wal_retrieve_retry_interval; + return NULL; + } + if (!RecoveryInProgress()) + LaunchSubscriptionApplyWorker(&wait_time); + else if (wrconn == NULL) + wrconn = slotsync_remote_connect(&wait_time); If primary_slot_name is genuinely missing, then the launcher will keep on attempting to reconnect and will keep on logging warnings which is not good. 2023-11-06 09:31:32.206 IST [1032781] WARNING: skipping slots synchronization as primary_slot_name is not set. 2023-11-06 09:31:37.212 IST [1032781] WARNING: skipping slots synchronization as primary_slot_name is not set. 2023-11-06 09:31:42.219 IST [1032781] WARNING: skipping slots synchronization as primary_slot_name is not set. Same is true for other parameters checked by slotsync_remote_connect, only the frequency of WARNING msgs will be lesser (after every 3 mins). Perhaps we should try connecting only once during the start of the launcher and then after each configReload? 
To take care of the cfbot failure, where the launcher may start before the WalReceiver and thus may not find WalRcv->slotname[0] set, should we instead check the GUC primary_slot_name directly in the launcher? Thoughts? thanks Shveta
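For context, the prerequisite being discussed boils down to something like the following on the two nodes (slot name illustrative; treat this as a sketch of the setup the sync worker expects rather than a complete recipe):

    -- on the primary: a physical slot for the standby
    SELECT pg_create_physical_replication_slot('standby1_phys_slot');

    -- on the standby: make the walreceiver use that slot
    ALTER SYSTEM SET primary_slot_name = 'standby1_phys_slot';
    SELECT pg_reload_conf();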
On Mon, Nov 6, 2023 at 7:01 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, November 3, 2023 7:32 PM Amit Kapila <amit.kapila16@gmail.com> > > > > 5. > > @@ -228,6 +230,28 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo > > fcinfo, bool confirm, bool bin > > NameStr(MyReplicationSlot->data.plugin), > > format_procedure(fcinfo->flinfo->fn_oid)))); > > .. > > + if (XLogRecPtrIsInvalid(upto_lsn)) > > + wal_to_wait = end_of_wal; > > + else > > + wal_to_wait = Min(upto_lsn, end_of_wal); > > + > > + /* Initialize standby_slot_names_list */ SlotSyncInitConfig(); > > + > > + /* > > + * Wait for specified streaming replication standby servers (if any) > > + * to confirm receipt of WAL upto wal_to_wait. > > + */ > > + WalSndWaitForStandbyConfirmation(wal_to_wait); > > + > > + /* > > + * The memory context used to allocate standby_slot_names_list will be > > + * freed at the end of this call. So free and nullify the list in > > + * order to avoid usage of freed list in the next call to this > > + * function. > > + */ > > + SlotSyncFreeConfig(); > > > > What if there is an error in WalSndWaitForStandbyConfirmation() before calling > > SlotSyncFreeConfig()? I think the problem you are trying to avoid by freeing it > > here can occur. I think it is better to do this in a logical decoding context and > > free the list along with it as we are doing in commit c7256e6564(see PG15). > > I will analyze more about this case and update in next version. > Okay, thanks for considering it. > > Also, > > it is better to allocate this list somewhere in > > WalSndWaitForStandbyConfirmation(), probably in WalSndGetStandbySlots, > > that will make the code look neat and also avoid allocating this list when > > failover is not enabled for the slot. > > Changed as suggested. > After doing this, do we need to call SlotSyncInitConfig() from other places as below? + SlotSyncInitConfig(); + WalSndGetStandbySlots(&standby_slot_cpy, false); Can we entirely get rid of calling SlotSyncInitConfig() from all places except WalSndGetStandbySlots()? Also, after that or otherwise, the comments atop also need modification. -- With Regards, Amit Kapila.
On Mon, Nov 6, 2023 at 1:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Nov 6, 2023 at 7:01 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > +static void +WalSndGetStandbySlots(List **standby_slots, bool force) +{ + if (!MyReplicationSlot->data.failover) + return; + + if (standby_slot_names_list == NIL && strcmp(standby_slot_names, "") != 0) + SlotSyncInitConfig(); + + if (force || StandbySlotNamesPreReload == NULL || + strcmp(StandbySlotNamesPreReload, standby_slot_names) != 0) + { + list_free(*standby_slots); + + if (StandbySlotNamesPreReload) + pfree(StandbySlotNamesPreReload); + + StandbySlotNamesPreReload = pstrdup(standby_slot_names); + *standby_slots = list_copy(standby_slot_names_list); + } +} I find this code a bit difficult to understand. I think we don't need to maintain a global variable like StandbySlotNamesPreReload. We can use a local variable for it along the lines of what we do in StartupRereadConfig(). Then, can we think of maintaining standby_slot_names_list in something related to decoding like LogicalDecodingContext as this will be used during decoding only? -- With Regards, Amit Kapila.
Hi, On 10/31/23 10:37 AM, shveta malik wrote: > On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand >> >> Agree with your test case, but in my case I was not using pub/sub. >> >> I was not clear, so when I said: >> >>>> - create logical_slot1 on the primary (and don't start using it) >> >> I meant don't start decoding from it (like using pg_recvlogical() or >> pg_logical_slot_get_changes()). >> >> By using pub/sub the "don't start using it" is not satisfied. >> >> My test case is: >> >> " >> SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 'test_decoding', false, true, true); >> SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 'test_decoding', false, true, true); >> pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f - >> " >> > > Okay, I am able to reproduce it now. Thanks for clarification. I have > tried to change the algorithm as per suggestion by Amit in [1] > > [1]: https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com Thanks! > > This is not full proof solution but optimization over first one. Now > in any sync-cycle, we take 2 attempts for slots-creation (if any slots > are available to be created). In first attempt, we do not wait > indefinitely on inactive slots, we wait only for a fixed amount of > time and if remote-slot is still behind, then we add that to the > pending list and move to the next slot. Once we are done with first > attempt, in second attempt, we go for the pending ones and now we wait > on each of them until the primary catches up. Aren't we "just" postponing the "issue"? I mean if there is really no activity on, say, the first created slot, then once we move to the second attempt then any newly created slot from that time would wait to be synced forever, no? Looking at V30: + /* Update lsns of slot to remote slot's current position */ + local_slot_update(remote_slot); + ReplicationSlotPersist(); + + ereport(LOG, errmsg("created slot \"%s\" locally", remote_slot->name)); I think this message is confusing as the slot has been created before it, here: + else + { + TransactionId xmin_horizon = InvalidTransactionId; + ReplicationSlot *slot; + + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, + remote_slot->two_phase, false); So that it shows up in pg_replication_slots before this message is emitted (and that specially true/worst for non active slots). Maybe something like "newly locally created slot XXX has been synced..."? While at it, would that make sense to move + slot->data.failover = true; once we stop waiting for this slot? I think that would avoid confusion if one query pg_replication_slots while we are still waiting for this slot to be synced, thoughts? (currently we can see pg_replication_slots.synced_slot set to true while we are still waiting). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Nov 7, 2023 at 3:51 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 10/31/23 10:37 AM, shveta malik wrote: > > On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand > >> > >> Agree with your test case, but in my case I was not using pub/sub. > >> > >> I was not clear, so when I said: > >> > >>>> - create logical_slot1 on the primary (and don't start using it) > >> > >> I meant don't start decoding from it (like using pg_recvlogical() or > >> pg_logical_slot_get_changes()). > >> > >> By using pub/sub the "don't start using it" is not satisfied. > >> > >> My test case is: > >> > >> " > >> SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 'test_decoding', false, true, true); > >> SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 'test_decoding', false, true, true); > >> pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f - > >> " > >> > > > > Okay, I am able to reproduce it now. Thanks for clarification. I have > > tried to change the algorithm as per suggestion by Amit in [1] > > > > [1]: https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com > > Thanks! > > > > > This is not full proof solution but optimization over first one. Now > > in any sync-cycle, we take 2 attempts for slots-creation (if any slots > > are available to be created). In first attempt, we do not wait > > indefinitely on inactive slots, we wait only for a fixed amount of > > time and if remote-slot is still behind, then we add that to the > > pending list and move to the next slot. Once we are done with first > > attempt, in second attempt, we go for the pending ones and now we wait > > on each of them until the primary catches up. > > Aren't we "just" postponing the "issue"? I mean if there is really no activity > on, say, the first created slot, then once we move to the second attempt then any newly > created slot from that time would wait to be synced forever, no? > We have to wait at some point in time for such inactive slots and the same is true even for manually created slots on standby. Do you have any better ideas to deal with it? > Looking at V30: > > + /* Update lsns of slot to remote slot's current position */ > + local_slot_update(remote_slot); > + ReplicationSlotPersist(); > + > + ereport(LOG, errmsg("created slot \"%s\" locally", remote_slot->name)); > > I think this message is confusing as the slot has been created before it, here: > > + else > + { > + TransactionId xmin_horizon = InvalidTransactionId; > + ReplicationSlot *slot; > + > + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, > + remote_slot->two_phase, false); > > So that it shows up in pg_replication_slots before this message is emitted (and that > specially true/worst for non active slots). > > Maybe something like "newly locally created slot XXX has been synced..."? > > While at it, would that make sense to move > > + slot->data.failover = true; > > once we stop waiting for this slot? I think that would avoid confusion if one > query pg_replication_slots while we are still waiting for this slot to be synced, > thoughts? (currently we can see pg_replication_slots.synced_slot set to true > while we are still waiting). > The failover property of the slot is different from whether the slot has been synced yet, so we can't change the location of marking it but we can try to improve when to show that slot has been synced. -- With Regards, Amit Kapila.
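As a side note for anyone following along from the standby, the distinction drawn here (the slot's failover property versus whether it has actually been synced) should be observable in the catalog view, roughly as below. The synced_slot column name is taken from the earlier mention of pg_replication_slots.synced_slot in this thread; the failover column is my assumption about how the patch exposes slot->data.failover:

    SELECT slot_name, slot_type, failover, synced_slot
      FROM pg_replication_slots
     WHERE slot_type = 'logical';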
On Mon, Nov 6, 2023 at 5:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Nov 6, 2023 at 1:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Nov 6, 2023 at 7:01 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > +static void > +WalSndGetStandbySlots(List **standby_slots, bool force) > +{ > + if (!MyReplicationSlot->data.failover) > + return; > + > + if (standby_slot_names_list == NIL && strcmp(standby_slot_names, "") != 0) > + SlotSyncInitConfig(); > + > + if (force || StandbySlotNamesPreReload == NULL || > + strcmp(StandbySlotNamesPreReload, standby_slot_names) != 0) > + { > + list_free(*standby_slots); > + > + if (StandbySlotNamesPreReload) > + pfree(StandbySlotNamesPreReload); > + > + StandbySlotNamesPreReload = pstrdup(standby_slot_names); > + *standby_slots = list_copy(standby_slot_names_list); > + } > +} > > I find this code bit difficult to understand. I think we don't need to > maintain a global variable like StandbySlotNamesPreReload. We can use > a local variable for it on the lines of what we do in > StartupRereadConfig(). Then, can we think of maintaining > standby_slot_names_list in something related to decoding like > LogicalDecodingContext as this will be used during decoding only? > Yes, agreed. This code part is now simplified in v31. PFA the patches. The overall changes are: 1) Caching of the standby_slots list in the logical-decoding context as suggested above. All the globals have been removed. 2) Dropping of local synced slots for obsolete dbs. Launcher now takes care of that. 3) There was a repeated warning in the log file due to missing GUCs as described in [1]. Fixed that. 4) Optimized code in slotsync.c and launcher.c to get rid of globals. 5) Adjusted patch003's wait-for-standby logic in slot-sync workers as per changes in pt. 1. There is still one optimization left here (in patch003) to avoid repeated parsing. I have mentioned the TODO comment. Will be targeted in the next version. The changes for 1 are in patch01. The changes for 2,3,4 are in patch02. Thanks Hou-san for implementing the changes for 1 and assisting in 5. [1]: https://www.postgresql.org/message-id/CAJpy0uDpV0suPbhCp%2B1aRLXEChD9uKp-ffBW_HfZro%3D53JKK5w%40mail.gmail.com thanks Shveta
Attachment
Hi, On 11/7/23 11:55 AM, Amit Kapila wrote: > On Tue, Nov 7, 2023 at 3:51 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 10/31/23 10:37 AM, shveta malik wrote: >>> On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand >>>> >>>> Agree with your test case, but in my case I was not using pub/sub. >>>> >>>> I was not clear, so when I said: >>>> >>>>>> - create logical_slot1 on the primary (and don't start using it) >>>> >>>> I meant don't start decoding from it (like using pg_recvlogical() or >>>> pg_logical_slot_get_changes()). >>>> >>>> By using pub/sub the "don't start using it" is not satisfied. >>>> >>>> My test case is: >>>> >>>> " >>>> SELECT * FROM pg_create_logical_replication_slot('logical_slot1', 'test_decoding', false, true, true); >>>> SELECT * FROM pg_create_logical_replication_slot('logical_slot2', 'test_decoding', false, true, true); >>>> pg_recvlogical -d postgres -S logical_slot2 --no-loop --start -f - >>>> " >>>> >>> >>> Okay, I am able to reproduce it now. Thanks for clarification. I have >>> tried to change the algorithm as per suggestion by Amit in [1] >>> >>> [1]: https://www.postgresql.org/message-id/CAA4eK1KBL0110gamQfc62X%3D5JV8-Qjd0dw0Mq0o07cq6kE%2Bq%3Dg%40mail.gmail.com >> >> Thanks! >> >>> >>> This is not full proof solution but optimization over first one. Now >>> in any sync-cycle, we take 2 attempts for slots-creation (if any slots >>> are available to be created). In first attempt, we do not wait >>> indefinitely on inactive slots, we wait only for a fixed amount of >>> time and if remote-slot is still behind, then we add that to the >>> pending list and move to the next slot. Once we are done with first >>> attempt, in second attempt, we go for the pending ones and now we wait >>> on each of them until the primary catches up. >> >> Aren't we "just" postponing the "issue"? I mean if there is really no activity >> on, say, the first created slot, then once we move to the second attempt then any newly >> created slot from that time would wait to be synced forever, no? >> > > We have to wait at some point in time for such inactive slots and the > same is true even for manually created slots on standby. Do you have > any better ideas to deal with it? > What about: - get rid of the second attempt and the pending_slot_list - keep the wait_count and PrimaryCatchupWaitAttempt logic so basically, get rid of: /* * Now sync the pending slots which were failed to be created in first * attempt. */ foreach(cell, pending_slot_list) { RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell); /* Wait until the primary server catches up */ PrimaryCatchupWaitAttempt = 0; synchronize_one_slot(wrconn, remote_slot, NULL); } and the pending_slot_list list. That way, for each slot that have not been created and synced yet: - it will be created on the standby - we will wait up to PrimaryCatchupWaitAttempt attempts - the slot will be synced or removed on/from the standby That way an inactive slot on the primary would not "block" any other slots on the standby. By "created" here I mean calling ReplicationSlotCreate() (not to be confused with emitting "ereport(LOG, errmsg("created slot \"%s\" locally", remote_slot->name)); " which is confusing as mentioned up-thread). The problem I can see with this proposal is that the "sync" window waiting for slot activity on the primary is "only" during the PrimaryCatchupWaitAttempt attempts (as the slot will be dropped/recreated). 
If we think this window is too short we could: - increase it, or - not drop the slot once created (even if there is no activity on the primary during PrimaryCatchupWaitAttempt attempts) so that the next loop of attempts will compare with "older" LSN/xmin (as compared to dropping and re-creating the slot). That way the window would extend back to the initial slot creation. Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Nov 7, 2023 at 7:58 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 11/7/23 11:55 AM, Amit Kapila wrote: > >>> > >>> This is not full proof solution but optimization over first one. Now > >>> in any sync-cycle, we take 2 attempts for slots-creation (if any slots > >>> are available to be created). In first attempt, we do not wait > >>> indefinitely on inactive slots, we wait only for a fixed amount of > >>> time and if remote-slot is still behind, then we add that to the > >>> pending list and move to the next slot. Once we are done with first > >>> attempt, in second attempt, we go for the pending ones and now we wait > >>> on each of them until the primary catches up. > >> > >> Aren't we "just" postponing the "issue"? I mean if there is really no activity > >> on, say, the first created slot, then once we move to the second attempt then any newly > >> created slot from that time would wait to be synced forever, no? > >> > > > > We have to wait at some point in time for such inactive slots and the > > same is true even for manually created slots on standby. Do you have > > any better ideas to deal with it? > > > > What about: > > - get rid of the second attempt and the pending_slot_list > - keep the wait_count and PrimaryCatchupWaitAttempt logic > > so basically, get rid of: > > /* > * Now sync the pending slots which were failed to be created in first > * attempt. > */ > foreach(cell, pending_slot_list) > { > RemoteSlot *remote_slot = (RemoteSlot *) lfirst(cell); > > /* Wait until the primary server catches up */ > PrimaryCatchupWaitAttempt = 0; > > synchronize_one_slot(wrconn, remote_slot, NULL); > } > > and the pending_slot_list list. > > That way, for each slot that have not been created and synced yet: > > - it will be created on the standby > - we will wait up to PrimaryCatchupWaitAttempt attempts > - the slot will be synced or removed on/from the standby > > That way an inactive slot on the primary would not "block" > any other slots on the standby. > > By "created" here I mean calling ReplicationSlotCreate() (not to be confused > with emitting "ereport(LOG, errmsg("created slot \"%s\" locally", remote_slot->name)); " > which is confusing as mentioned up-thread). > > The problem I can see with this proposal is that the "sync" window waiting > for slot activity on the primary is "only" during the PrimaryCatchupWaitAttempt > attempts (as the slot will be dropped/recreated). > > If we think this window is too short we could: > > - increase it > or > - don't drop the slot once created (even if there is no activity > on the primary during PrimaryCatchupWaitAttempt attempts) so that > the next loop of attempts will compare with "older" LSN/xmin (as compare to > dropping and re-creating the slot). That way the window would be since the > initial slot creation. > Yeah, this sounds reasonable but we can't mark such slots to be synced/available for use after failover. I think if we want to follow this approach then we need to also monitor these slots for any change in the consecutive cycles and if we are able to sync them then accordingly we enable them to use after failover. Another somewhat related point is that right now, we just wait for the change on the first slot (the patch refers to it as the monitoring slot) for computing nap_time before which we will recheck all the slots. I think we can improve that as well such that even if any slot's information is changed, we don't consider changing naptime. -- With Regards, Amit Kapila.
Hi, On 11/8/23 4:50 AM, Amit Kapila wrote: > On Tue, Nov 7, 2023 at 7:58 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> If we think this window is too short we could: >> >> - increase it >> or >> - don't drop the slot once created (even if there is no activity >> on the primary during PrimaryCatchupWaitAttempt attempts) so that >> the next loop of attempts will compare with "older" LSN/xmin (as compare to >> dropping and re-creating the slot). That way the window would be since the >> initial slot creation. >> > > Yeah, this sounds reasonable but we can't mark such slots to be > synced/available for use after failover. Yeah, currently we are fine as slots are dropped in wait_for_primary_slot_catchup() if we are not in recovery anymore. > I think if we want to follow > this approach then we need to also monitor these slots for any change > in the consecutive cycles and if we are able to sync them then > accordingly we enable them to use after failover. What about to add a new field in ReplicationSlotPersistentData indicating that we are waiting for "sync" and drop such slots during promotion and /or if not in recovery? > Another somewhat related point is that right now, we just wait for the > change on the first slot (the patch refers to it as the monitoring > slot) for computing nap_time before which we will recheck all the > slots. I think we can improve that as well such that even if any > slot's information is changed, we don't consider changing naptime. > Yeah, that sounds reasonable to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Nov 8, 2023 at 12:32 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 11/8/23 4:50 AM, Amit Kapila wrote: > > > I think if we want to follow > > this approach then we need to also monitor these slots for any change > > in the consecutive cycles and if we are able to sync them then > > accordingly we enable them to use after failover. > > What about to add a new field in ReplicationSlotPersistentData > indicating that we are waiting for "sync" and drop such slots during promotion and > /or if not in recovery? > This patch is already adding a 'synced' flag in ReplicationSlotPersistentData to distinguish synced slots so that we can disallow decoding on them on the standby and disallow dropping them. I suggest we change that field to have multiple states, where one of the states would indicate that the initial sync of the slot is done. -- With Regards, Amit Kapila.
Hi, On 11/8/23 9:57 AM, Amit Kapila wrote: > On Wed, Nov 8, 2023 at 12:32 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> Hi, >> >> On 11/8/23 4:50 AM, Amit Kapila wrote: >> >>> I think if we want to follow >>> this approach then we need to also monitor these slots for any change >>> in the consecutive cycles and if we are able to sync them then >>> accordingly we enable them to use after failover. >> >> What about to add a new field in ReplicationSlotPersistentData >> indicating that we are waiting for "sync" and drop such slots during promotion and >> /or if not in recovery? >> > > This patch is already adding 'synced' flag in > ReplicationSlotPersistentData to distinguish synced slots so that we > can disallow decoding on then in standby and disallow to drop those. I > suggest we change that field to have multiple states where one of the > states would indicate that the initial sync of the slot is done. > Yeah, agree. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Nov 8, 2023 at 3:19 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 11/8/23 9:57 AM, Amit Kapila wrote: > > On Wed, Nov 8, 2023 at 12:32 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> Hi, > >> > >> On 11/8/23 4:50 AM, Amit Kapila wrote: > >> > >>> I think if we want to follow > >>> this approach then we need to also monitor these slots for any change > >>> in the consecutive cycles and if we are able to sync them then > >>> accordingly we enable them to use after failover. > >> > >> What about to add a new field in ReplicationSlotPersistentData > >> indicating that we are waiting for "sync" and drop such slots during promotion and > >> /or if not in recovery? > >> > > > > This patch is already adding 'synced' flag in > > ReplicationSlotPersistentData to distinguish synced slots so that we > > can disallow decoding on then in standby and disallow to drop those. I > > suggest we change that field to have multiple states where one of the > > states would indicate that the initial sync of the slot is done. > > > > Yeah, agree. > I am working on this implementation. This sync-state is even needed for cascading standbys to know when to start syncing the slots from the first standby. It should start syncing only after the first standby has finished initialization of it (i.e. wait for primary is over) and not before that. Unrelated to above, if there is a user slot on standby with the same name which the slot-sync worker is trying to create, then shall it emit a warning and skip the sync of that slot or shall it throw an error? thanks Shveta
Hi, On 11/8/23 12:50 PM, shveta malik wrote: > On Wed, Nov 8, 2023 at 3:19 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> Hi, >> >> On 11/8/23 9:57 AM, Amit Kapila wrote: >>> On Wed, Nov 8, 2023 at 12:32 PM Drouvot, Bertrand >>> <bertranddrouvot.pg@gmail.com> wrote: >>>> >>>> Hi, >>>> >>>> On 11/8/23 4:50 AM, Amit Kapila wrote: >>>> >>>>> I think if we want to follow >>>>> this approach then we need to also monitor these slots for any change >>>>> in the consecutive cycles and if we are able to sync them then >>>>> accordingly we enable them to use after failover. >>>> >>>> What about to add a new field in ReplicationSlotPersistentData >>>> indicating that we are waiting for "sync" and drop such slots during promotion and >>>> /or if not in recovery? >>>> >>> >>> This patch is already adding 'synced' flag in >>> ReplicationSlotPersistentData to distinguish synced slots so that we >>> can disallow decoding on then in standby and disallow to drop those. I >>> suggest we change that field to have multiple states where one of the >>> states would indicate that the initial sync of the slot is done. >>> >> >> Yeah, agree. >> > > I am working on this implementation. Thanks! > This sync-state is even needed > for cascading standbys to know when to start syncing the slots from > the first standby. It should start syncing only after the first > standby has finished initialization of it (i.e. wait for primary is > over) and not before that. > Yeah, makes sense. > Unrelated to above, if there is a user slot on standby with the same > name which the slot-sync worker is trying to create, then shall it > emit a warning and skip the sync of that slot or shall it throw an > error? > I'd vote for emit a warning and move on to the next slot if any. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > > Unrelated to above, if there is a user slot on standby with the same > > name which the slot-sync worker is trying to create, then shall it > > emit a warning and skip the sync of that slot or shall it throw an > > error? > > > > I'd vote for emit a warning and move on to the next slot if any. > But then it could take time for users to know the actual problem and they probably notice it after failover. OTOH, if we throw an error then probably they will come to know earlier because the slot sync mechanism would be stopped. Do you have reasons to prefer giving a WARNING and skipping creating such slots? I expect this WARNING to keep getting repeated in LOGs because the consecutive sync tries will again generate a WARNING. -- With Regards, Amit Kapila.
On Thu, Nov 9, 2023 at 8:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > > Unrelated to above, if there is a user slot on standby with the same > > > name which the slot-sync worker is trying to create, then shall it > > > emit a warning and skip the sync of that slot or shall it throw an > > > error? > > > > > > > I'd vote for emit a warning and move on to the next slot if any. > > > > But then it could take time for users to know the actual problem and > they probably notice it after failover. OTOH, if we throw an error > then probably they will come to know earlier because the slot sync > mechanism would be stopped. Do you have reasons to prefer giving a > WARNING and skipping creating such slots? I expect this WARNING to > keep getting repeated in LOGs because the consecutive sync tries will > again generate a WARNING. > Apart from the above, I would like to discuss the slot sync work distribution strategy of this patch. The current implementation as explained in the commit message [1] works well if the slots belong to multiple databases. It is clear from the data in emails [2][3][4] that having more workers really helps if the slots belong to multiple databases. But I think if all the slots belong to one or very few databases then such a strategy won't be as good. Now, on one hand, we get very good numbers for a particular workload with the strategy used in the patch but OTOH it may not be adaptable to various different kinds of workloads. So, I have a question whether we should try to optimize this strategy for various kinds of workloads or for the first version let's use a single-slot sync-worker and then we can enhance the functionality in later patches either in PG17 itself or in PG18 or later versions. One thing to note is that a lot of the complexity of the patch is attributed to the multi-worker strategy which may still not be efficient, so there is an argument to go with a simpler single-slot sync-worker strategy and then enhance it in future versions as we learn more about various workloads. It will also help to develop this feature incrementally instead of doing all the things in one go and taking a much longer time than it should. Thoughts? [1] - "The replication launcher on the physical standby queries primary to get the list of dbids for failover logical slots. Once it gets the dbids, if dbids < max_slotsync_workers, it starts only that many workers, and if dbids > max_slotsync_workers, it starts max_slotsync_workers and divides the work equally among them. Each worker is then responsible to keep on syncing the logical slots belonging to the DBs assigned to it. Each slot-sync worker will have its own dbids list. Since the upper limit of this dbid-count is not known, it needs to be handled using dsa. We initially allocated memory to hold 100 dbids for each worker. If this limit is exhausted, we reallocate this memory with size incremented again by 100." [2] - https://www.postgresql.org/message-id/CAJpy0uD2F43avuXy_yQv7Wa3kpUwioY_Xn955xdmd6vX0ME6%3Dg%40mail.gmail.com [3] - https://www.postgresql.org/message-id/CAFPTHDZw2G3Pax0smymMjfPqdPcZhMWo36f9F%2BTwNTs0HFxK%2Bw%40mail.gmail.com [4] - https://www.postgresql.org/message-id/CAJpy0uD%3DDevMxTwFVsk_%3DxHqYNH8heptwgW6AimQ9fbRmx4ioQ%40mail.gmail.com -- With Regards, Amit Kapila.
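One rough way to judge whether the multi-worker strategy buys anything for a given installation is to look at how the failover-enabled slots are spread across databases on the primary, e.g. (database is an existing pg_replication_slots column; the failover column is the patch's proposed flag):

    SELECT database, count(*) AS failover_slots
      FROM pg_replication_slots
     WHERE slot_type = 'logical' AND failover
     GROUP BY database;

If almost everything lands in one database, the per-database split described in the commit message cannot help much, which is the concern raised above.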
On Thu, Nov 9, 2023 at 8:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 9, 2023 at 8:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > Unrelated to above, if there is a user slot on standby with the same > > > > name which the slot-sync worker is trying to create, then shall it > > > > emit a warning and skip the sync of that slot or shall it throw an > > > > error? > > > > > > > > > > I'd vote for emit a warning and move on to the next slot if any. > > > > > > > But then it could take time for users to know the actual problem and > > they probably notice it after failover. OTOH, if we throw an error > > then probably they will come to know earlier because the slot sync > > mechanism would be stopped. Do you have reasons to prefer giving a > > WARNING and skipping creating such slots? I expect this WARNING to > > keep getting repeated in LOGs because the consecutive sync tries will > > again generate a WARNING. > > > > Apart from the above, I would like to discuss the slot sync work > distribution strategy of this patch. The current implementation as > explained in the commit message [1] works well if the slots belong to > multiple databases. It is clear from the data in emails [2][3][4] that > having more workers really helps if the slots belong to multiple > databases. But I think if all the slots belong to one or very few > databases then such a strategy won't be as good. Now, on one hand, we > get very good numbers for a particular workload with the strategy used > in the patch but OTOH it may not be adaptable to various different > kinds of workloads. So, I have a question whether we should try to > optimize this strategy for various kinds of workloads or for the first > version let's use a single-slot sync-worker and then we can enhance > the functionality in later patches either in PG17 itself or in PG18 or > later versions. One thing to note is that a lot of the complexity of > the patch is attributed to the multi-worker strategy which may still > not be efficient, so there is an argument to go with a simpler > single-slot sync-worker strategy and then enhance it in future > versions as we learn more about various workloads. It will also help > to develop this feature incrementally instead of doing all the things > in one go and taking a much longer time than it should. > > Thoughts? > > [1] - "The replication launcher on the physical standby queries > primary to get the list of dbids for failover logical slots. Once it > gets the dbids, if dbids < max_slotsync_workers, it starts only that > many workers, and if dbids > max_slotsync_workers, it starts > max_slotsync_workers and divides the work equally among them. Each > worker is then responsible to keep on syncing the logical slots > belonging to the DBs assigned to it. > > Each slot-sync worker will have its own dbids list. Since the upper > limit of this dbid-count is not known, it needs to be handled using > dsa. We initially allocated memory to hold 100 dbids for each worker. > If this limit is exhausted, we reallocate this memory with size > incremented again by 100." 
> > [2] - https://www.postgresql.org/message-id/CAJpy0uD2F43avuXy_yQv7Wa3kpUwioY_Xn955xdmd6vX0ME6%3Dg%40mail.gmail.com > [3] - https://www.postgresql.org/message-id/CAFPTHDZw2G3Pax0smymMjfPqdPcZhMWo36f9F%2BTwNTs0HFxK%2Bw%40mail.gmail.com > [4] - https://www.postgresql.org/message-id/CAJpy0uD%3DDevMxTwFVsk_%3DxHqYNH8heptwgW6AimQ9fbRmx4ioQ%40mail.gmail.com > > -- > With Regards, > Amit Kapila. PFA v32 patches, which have the below changes: 1) Changed how standby_slot_names is handled. On reanalyzing, the logical decoding context might not be the best place to cache the standby slot list, because not all the callers (1. user backend, 2. walsender, 3. slotsync worker) can access the logical decoding ctx. To make access to the list consistent, cache the list in a global variable instead. Also, to avoid the trouble of allocating and freeing the list at various places, we [re]initialize the list in the GUC assign hook, which makes it easier for callers to use. 2) Changed 'bool synced' in ReplicationSlotPersistentData to 'char sync_state'. Values are: 'n': none, for user slots; 'i': sync initiated for the slot but waiting for the primary to catch up; 'r': ready for periodic syncs. 3) Improved the slot-creation logic in the slot sync worker. Now an active slot's sync is no longer blocked by an inactive slot's creation. The worker pings the primary server a fixed number of times and waits for it to catch up with the local slot's LSN, after which it moves to the next slot. The worker reattempts the wait for pending ones in the next sync-cycle. Meanwhile any such slot (waiting for the primary to catch up) is not dropped, but its sync_state is marked as 'i'. Once the worker finishes initialization for such a slot (in any of the sync-cycles), the slot's sync_state is changed to 'r'. 4) The slots with state 'i' are dropped by the slot-sync worker when it finds out that it is no longer in standby mode, and then it exits. 5) A cascading standby does not sync slots with 'sync_state' = 'i' from the first standby. 6) Changed the naptime computation logic. Now, during each sync-cycle, if any of the received slots was updated we retain the default naptime; otherwise we increase the naptime once the inactivity time reaches a threshold. 7) Added a warning for the case where a user slot with the same name as the one the slot-sync worker is trying to create is already present. Sync for such slots is skipped. Changes for 1 are in patch001 and patch003. Changes for 2-7 are in patch002. Thank you Hou-san for working on 1. Open Question: 1) Currently I have put the drop-slot logic for slots with 'sync_state=i' in the slot-sync worker. Do we need to put it somewhere in the promotion logic as well? Perhaps in WaitForWALToBecomeAvailable() where we call XLogShutdownWalRcv after checking 'CheckForStandbyTrigger'. Thoughts? thanks Shveta
Attachment
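For readers following the state names used throughout the rest of this thread, the three values described under change 2) presumably map onto a char field plus constants along these lines. This is a sketch only: SYNCSLOT_STATE_NONE and SYNCSLOT_STATE_INITIATED do appear in later code excerpts, the remaining names and the helper are assumptions.

    /* Possible values of a slot's sync_state (replacing the old 'bool synced'). */
    #define SYNCSLOT_STATE_NONE         'n' /* user slot, never synchronized */
    #define SYNCSLOT_STATE_INITIATED    'i' /* sync initiated, still waiting for
                                             * the primary to catch up */
    #define SYNCSLOT_STATE_READY        'r' /* ready for periodic syncs */

    /* Illustrative helper: only 'r' slots are trustworthy after a failover. */
    static inline bool
    SyncSlotIsReady(char sync_state)
    {
        return sync_state == SYNCSLOT_STATE_READY;
    }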
Hi, On 11/9/23 3:41 AM, Amit Kapila wrote: > On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >>> Unrelated to above, if there is a user slot on standby with the same >>> name which the slot-sync worker is trying to create, then shall it >>> emit a warning and skip the sync of that slot or shall it throw an >>> error? >>> >> >> I'd vote for emit a warning and move on to the next slot if any. >> > > But then it could take time for users to know the actual problem and > they probably notice it after failover. Right, that's not appealing.... OTOH the slot has already been created manually on the standby so there is probably already a "use case" for it (that is probably unrelated to the failover story then). In V32, the following states have been introduced: " 'n': none for user slots, 'i': sync initiated for the slot but waiting for primary to catch up. 'r': ready for periodic syncs. " Should we introduce a new state that indicates that a sync slot creation has failed because the slot already existed? That would probably be simple to monitor instead of looking at the log file. > OTOH, if we throw an error > then probably they will come to know earlier because the slot sync > mechanism would be stopped. Right. > Do you have reasons to prefer giving a > WARNING and skipping creating such slots? My idea was that with a WARNING it won't block others slot creation (if any). > I expect this WARNING to > keep getting repeated in LOGs because the consecutive sync tries will > again generate a WARNING. > Yes. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 11/9/23 11:54 AM, shveta malik wrote: > > PFA v32 patches which has below changes: Thanks! > 7) Added warning for cases where a user-slot with the same name is > already present which slot-sync worker is trying to create. Sync for > such slots is skipped. I'm seeing assertion and segfault in this case due to ReplicationSlotRelease() in synchronize_one_slot(). Adding this extra check prior to it: - ReplicationSlotRelease(); + if (!(found && s->data.sync_state == SYNCSLOT_STATE_NONE)) + ReplicationSlotRelease(); make them disappear. > > Open Question: > 1) Currently I have put drop slot logic for slots with 'sync_state=i' > in slot-sync worker. Do we need to put it somewhere in promotion-logic > as well? Yeah I think so, because there is a time window when one could "use" the slot after the promotion and before it is removed. Producing things like: " 2023-11-09 15:16:50.294 UTC [2580462] LOG: dropped replication slot "logical_slot2" of dbid 5 as it was not sync-ready 2023-11-09 15:16:50.295 UTC [2580462] LOG: dropped replication slot "logical_slot3" of dbid 5 as it was not sync-ready 2023-11-09 15:16:50.297 UTC [2580462] LOG: dropped replication slot "logical_slot4" of dbid 5 as it was not sync-ready 2023-11-09 15:16:50.297 UTC [2580462] ERROR: replication slot "logical_slot5" is active for PID 2594628 " After the promotion one was able to use logical_slot5 and now we can now drop it. > Perhaps in WaitForWALToBecomeAvailable() where we call > XLogShutdownWalRcv after checking 'CheckForStandbyTrigger'. Thoughts? > You mean here? /* * Check to see if promotion is requested. Note that we do * this only after failure, so when you promote, we still * finish replaying as much as we can from archive and * pg_wal before failover. */ if (StandbyMode && CheckForStandbyTrigger()) { XLogShutdownWalRcv(); return XLREAD_FAIL; } If so, that sounds like a good place to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Nov 9, 2023 at 9:15 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 11/9/23 11:54 AM, shveta malik wrote: > > > > PFA v32 patches which has below changes: > > Thanks! > > > 7) Added warning for cases where a user-slot with the same name is > > already present which slot-sync worker is trying to create. Sync for > > such slots is skipped. > > I'm seeing assertion and segfault in this case due to ReplicationSlotRelease() > in synchronize_one_slot(). > > Adding this extra check prior to it: > > - ReplicationSlotRelease(); > + if (!(found && s->data.sync_state == SYNCSLOT_STATE_NONE)) > + ReplicationSlotRelease(); > > make them disappear. > Oh, I see. Thanks for pointing it out. I will fix it in the next version. > > > > Open Question: > > 1) Currently I have put drop slot logic for slots with 'sync_state=i' > > in slot-sync worker. Do we need to put it somewhere in promotion-logic > > as well? > > Yeah I think so, because there is a time window when one could "use" the slot > after the promotion and before it is removed. Producing things like: > > " > 2023-11-09 15:16:50.294 UTC [2580462] LOG: dropped replication slot "logical_slot2" of dbid 5 as it was not sync-ready > 2023-11-09 15:16:50.295 UTC [2580462] LOG: dropped replication slot "logical_slot3" of dbid 5 as it was not sync-ready > 2023-11-09 15:16:50.297 UTC [2580462] LOG: dropped replication slot "logical_slot4" of dbid 5 as it was not sync-ready > 2023-11-09 15:16:50.297 UTC [2580462] ERROR: replication slot "logical_slot5" is active for PID 2594628 > " > > After the promotion one was able to use logical_slot5 and now we can now drop it. Yes, I was suspicious about this small window which may allow others to use this slot, that is why I was thinking of putting it in the promotion flow and thus asked that question earlier. But the slot-sync worker may end up creating it again in case it has not exited. So we need to carefully decide at which places we need to put 'not-in recovery' checks in slot-sync workers. In the previous version, synchronize_one_slot() had that check and it was skipping sync if '!RecoveryInProgress'. But I have removed that check in v32, thinking that the slots which the worker has already fetched from the primary should all get synced, and the worker should exit after that instead of syncing half and leaving the rest. But now, on rethinking, was the previous behaviour correct, i.e. should we skip the sync from the point onward where we see we are no longer in standby mode, even though a few of the slots have already been synced in that sync-cycle? Thoughts? > > > Perhaps in WaitForWALToBecomeAvailable() where we call > > XLogShutdownWalRcv after checking 'CheckForStandbyTrigger'. Thoughts? > > > > You mean here? > > /* > * Check to see if promotion is requested. Note that we do > * this only after failure, so when you promote, we still > * finish replaying as much as we can from archive and > * pg_wal before failover. > */ > if (StandbyMode && CheckForStandbyTrigger()) > { > XLogShutdownWalRcv(); > return XLREAD_FAIL; > } > yes, here. > If so, that sounds like a good place to me. okay. Will add it. > > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
On Thu, Nov 9, 2023 at 7:29 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 11/9/23 3:41 AM, Amit Kapila wrote: > > On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >>> Unrelated to above, if there is a user slot on standby with the same > >>> name which the slot-sync worker is trying to create, then shall it > >>> emit a warning and skip the sync of that slot or shall it throw an > >>> error? > >>> > >> > >> I'd vote for emit a warning and move on to the next slot if any. > >> > > > > But then it could take time for users to know the actual problem and > > they probably notice it after failover. > > Right, that's not appealing.... > > OTOH the slot has already been created manually on the standby so there is > probably already a "use case" for it (that is probably unrelated to the > failover story then). > > In V32, the following states have been introduced: > > " > 'n': none for user slots, > 'i': sync initiated for the slot but waiting for primary to catch up. > 'r': ready for periodic syncs. > " > > Should we introduce a new state that indicates that a sync slot creation > has failed because the slot already existed? That would probably > be simple to monitor instead of looking at the log file. > Are you saying that we change the state of the already existing slot on standby? And, such a state would indicate that we are trying to sync the slot with the same name from the primary. Is that what you have in mind? If so, it appears quite odd to me to have such a state and also set it in some unrelated slot that just has the same name. I understand your point that we can allow other slots to proceed but it is also important to not create any sort of inconsistency that can surprise user after failover. Also, the current coding doesn't ensure we will always give WARNING. If we see the below code that deals with this WARNING, + /* User created slot with the same name exists, emit WARNING. */ + else if (found && s->data.sync_state == SYNCSLOT_STATE_NONE) + { + ereport(WARNING, + errmsg("not synchronizing slot %s; it is a user created slot", + remote_slot->name)); + } + /* Otherwise create the slot first. */ + else + { + TransactionId xmin_horizon = InvalidTransactionId; + ReplicationSlot *slot; + + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, + remote_slot->two_phase, false); I think this is not a solid check to ensure that the slot existed before. Because it could be created as soon as the slot sync worker invokes ReplicationSlotCreate() here. So, depending on the timing, we can either get an ERROR or WARNING. I feel giving an ERROR in this case should be okay. -- With Regards, Amit Kapila.
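If the conclusion is to error out, the quoted branch would presumably shrink to something like the following. This is only a sketch: the errcode and wording are placeholders, and found, s and remote_slot are the variables from the excerpt above.

    /* Sketch: stop slot synchronization if a user-created slot already
     * occupies the name we are asked to synchronize. */
    if (found && s->data.sync_state == SYNCSLOT_STATE_NONE)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_IN_USE),
                 errmsg("cannot synchronize replication slot \"%s\"",
                        remote_slot->name),
                 errdetail("A user-created slot with the same name already exists on the standby.")));

Since ReplicationSlotCreate() already errors out when the name is taken, the race Amit mentions resolves to an ERROR either way; this check merely produces a more descriptive message in the common case.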
Hi, On 11/10/23 6:41 AM, Amit Kapila wrote: > On Thu, Nov 9, 2023 at 7:29 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > Are you saying that we change the state of the already existing slot > on standby? Yes. > And, such a state would indicate that we are trying to > sync the slot with the same name from the primary. Is that what you > have in mind? Yes. > If so, it appears quite odd to me to have such a state > and also set it in some unrelated slot that just has the same name. > > I understand your point that we can allow other slots to proceed but > it is also important to not create any sort of inconsistency that can > surprise user after failover. But even if we ERROR out instead of emitting a WARNING, the user would still need to be notified/monitor such errors. I agree that then probably they will come to know earlier because the slot sync mechanism would be stopped but still it is not "guaranteed" (specially if there is no others "working" synced slots around.) And if they do not, then there is still a risk to use this slot after a failover thinking this is a "synced" slot. Giving more thoughts, what about using a dedicated/reserved naming convention for synced slot like synced_<primary_slot_name> or such and then: - prevent user to create sync_<whatever> slots on standby - sync <slot> on primary to sync_<slot> on standby - during failover, rename sync_<slot> to <slot> and if <slot> exists then emit a WARNING and keep sync_<slot> in place. That way both slots are still in place (the manually created <slot> and the sync_<slot<) and one could decide what to do with them. I don't think we'd need to worry about the cases where sync_ slot could be already created before we "prevent" such slots creation. Indeed I think they would not survive pg_upgrade before 17 -> 18 upgrades. So it looks like we'd be good as long as we are able to prevent sync_ slots creation on 17. Thoughts? > Also, the current coding doesn't ensure > we will always give WARNING. If we see the below code that deals with > this WARNING, > > + /* User created slot with the same name exists, emit WARNING. */ > + else if (found && s->data.sync_state == SYNCSLOT_STATE_NONE) > + { > + ereport(WARNING, > + errmsg("not synchronizing slot %s; it is a user created slot", > + remote_slot->name)); > + } > + /* Otherwise create the slot first. */ > + else > + { > + TransactionId xmin_horizon = InvalidTransactionId; > + ReplicationSlot *slot; > + > + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, > + remote_slot->two_phase, false); > > I think this is not a solid check to ensure that the slot existed > before. Because it could be created as soon as the slot sync worker > invokes ReplicationSlotCreate() here. Agree. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Nov 10, 2023 at 12:50 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 11/10/23 6:41 AM, Amit Kapila wrote: > > On Thu, Nov 9, 2023 at 7:29 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > Are you saying that we change the state of the already existing slot > > on standby? > > Yes. > > > And, such a state would indicate that we are trying to > > sync the slot with the same name from the primary. Is that what you > > have in mind? > > Yes. > > > If so, it appears quite odd to me to have such a state > > and also set it in some unrelated slot that just has the same name. > > > > > I understand your point that we can allow other slots to proceed but > > it is also important to not create any sort of inconsistency that can > > surprise user after failover. > > But even if we ERROR out instead of emitting a WARNING, the user would still > need to be notified/monitor such errors. I agree that then probably they will > come to know earlier because the slot sync mechanism would be stopped but still > it is not "guaranteed" (specially if there is no others "working" synced slots > around.) > > And if they do not, then there is still a risk to use this slot after a > failover thinking this is a "synced" slot. > I think this is another reason that probably giving ERROR has better chances for the user to notice before failover. IF knowing such errors user still proceeds with the failover, the onus is on her. We can probably document this hazard along with the failover feature so that users are aware that they either need to be careful while creating slots on standby or consult ERROR logs. I guess we can even make it visible in the view also. > Giving more thoughts, what about using a dedicated/reserved naming convention for > synced slot like synced_<primary_slot_name> or such and then: > > - prevent user to create sync_<whatever> slots on standby > - sync <slot> on primary to sync_<slot> on standby > - during failover, rename sync_<slot> to <slot> and if <slot> exists then > emit a WARNING and keep sync_<slot> in place. > > That way both slots are still in place (the manually created <slot> and > the sync_<slot<) and one could decide what to do with them. > Hmm, I think after failover, users need to rename all slots or we need to provide a way to rename them so that they can be used by subscribers which sounds like much more work. > > Also, the current coding doesn't ensure > > we will always give WARNING. If we see the below code that deals with > > this WARNING, > > > > + /* User created slot with the same name exists, emit WARNING. */ > > + else if (found && s->data.sync_state == SYNCSLOT_STATE_NONE) > > + { > > + ereport(WARNING, > > + errmsg("not synchronizing slot %s; it is a user created slot", > > + remote_slot->name)); > > + } > > + /* Otherwise create the slot first. */ > > + else > > + { > > + TransactionId xmin_horizon = InvalidTransactionId; > > + ReplicationSlot *slot; > > + > > + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, > > + remote_slot->two_phase, false); > > > > I think this is not a solid check to ensure that the slot existed > > before. Because it could be created as soon as the slot sync worker > > invokes ReplicationSlotCreate() here. > > Agree. > So, having a concrete check to give WARNING would require some more logic which I don't think is a good idea to handle this boundary case. -- With Regards, Amit Kapila.
On Thu, Nov 9, 2023 at 9:15 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > > You mean here? > > /* > * Check to see if promotion is requested. Note that we do > * this only after failure, so when you promote, we still > * finish replaying as much as we can from archive and > * pg_wal before failover. > */ > if (StandbyMode && CheckForStandbyTrigger()) > { > XLogShutdownWalRcv(); > return XLREAD_FAIL; > } > > If so, that sounds like a good place to me. > One more thing to think about is whether we want to shut down syncslot workers as well on promotion similar to walreceiver? Because we don't want them to even attempt once to sync after promotion. -- With Regards, Amit Kapila.
Hi, On 11/10/23 8:55 AM, Amit Kapila wrote: > On Fri, Nov 10, 2023 at 12:50 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> But even if we ERROR out instead of emitting a WARNING, the user would still >> need to be notified/monitor such errors. I agree that then probably they will >> come to know earlier because the slot sync mechanism would be stopped but still >> it is not "guaranteed" (specially if there is no others "working" synced slots >> around.) > >> >> And if they do not, then there is still a risk to use this slot after a >> failover thinking this is a "synced" slot. >> > > I think this is another reason that probably giving ERROR has better > chances for the user to notice before failover. IF knowing such errors > user still proceeds with the failover, the onus is on her. Agree. My concern is more when they don't know about the error. > We can > probably document this hazard along with the failover feature so that > users are aware that they either need to be careful while creating > slots on standby or consult ERROR logs. I guess we can even make it > visible in the view also. Yeah. >> Giving more thoughts, what about using a dedicated/reserved naming convention for >> synced slot like synced_<primary_slot_name> or such and then: >> >> - prevent user to create sync_<whatever> slots on standby >> - sync <slot> on primary to sync_<slot> on standby >> - during failover, rename sync_<slot> to <slot> and if <slot> exists then >> emit a WARNING and keep sync_<slot> in place. >> >> That way both slots are still in place (the manually created <slot> and >> the sync_<slot<) and one could decide what to do with them. >> > > Hmm, I think after failover, users need to rename all slots or we need > to provide a way to rename them so that they can be used by > subscribers which sounds like much more work. Agree that's much more work for the subscriber case. Maybe that's not worth the extra work. >>> Also, the current coding doesn't ensure >>> we will always give WARNING. If we see the below code that deals with >>> this WARNING, >>> >>> + /* User created slot with the same name exists, emit WARNING. */ >>> + else if (found && s->data.sync_state == SYNCSLOT_STATE_NONE) >>> + { >>> + ereport(WARNING, >>> + errmsg("not synchronizing slot %s; it is a user created slot", >>> + remote_slot->name)); >>> + } >>> + /* Otherwise create the slot first. */ >>> + else >>> + { >>> + TransactionId xmin_horizon = InvalidTransactionId; >>> + ReplicationSlot *slot; >>> + >>> + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, >>> + remote_slot->two_phase, false); >>> >>> I think this is not a solid check to ensure that the slot existed >>> before. Because it could be created as soon as the slot sync worker >>> invokes ReplicationSlotCreate() here. >> >> Agree. >> > > So, having a concrete check to give WARNING would require some more > logic which I don't think is a good idea to handle this boundary case. > Yeah good point, agree to just error out in all the case then (if we discard the sync_ reserved wording proposal, which seems to be the case as probably not worth the extra work). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 11/10/23 4:31 AM, shveta malik wrote: > On Thu, Nov 9, 2023 at 9:15 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> Yeah I think so, because there is a time window when one could "use" the slot >> after the promotion and before it is removed. Producing things like: >> >> " >> 2023-11-09 15:16:50.294 UTC [2580462] LOG: dropped replication slot "logical_slot2" of dbid 5 as it was not sync-ready >> 2023-11-09 15:16:50.295 UTC [2580462] LOG: dropped replication slot "logical_slot3" of dbid 5 as it was not sync-ready >> 2023-11-09 15:16:50.297 UTC [2580462] LOG: dropped replication slot "logical_slot4" of dbid 5 as it was not sync-ready >> 2023-11-09 15:16:50.297 UTC [2580462] ERROR: replication slot "logical_slot5" is active for PID 2594628 >> " >> >> After the promotion one was able to use logical_slot5 and now we can now drop it. > > Yes, I was suspicious about this small window which may allow others > to use this slot, that is why I was thinking of putting it in the > promotion flow and thus asked that question earlier. But the slot-sync > worker may end up creating it again in case it has not exited. Sorry, there is a typo up-thread, I meant "After the promotion one was able to use logical_slot5 and now we can NOT drop it.". We can not drop it because it is in use. > So we > need to carefully decide at what all places we need to put 'not-in > recovery' checks in slot-sync workers. In the previous version, > synchronize_one_slot() had that check and it was skipping sync if > '!RecoveryInProgress'. But I have removed that check in v32 thinking > that the slots which the worker has already fetched from the primary, > let them all get synced and exit after that nstead of syncing half > and leaving rest. But now on rethinking, was the previous behaviour > correct i.e. skip sync at that point onward where we see it is no > longer in standby-mode while few of the slots have already been synced > in that sync-cycle. Thoughts? > I think we still need to think/discuss the promotion flow. I think we would need to have the slot sync worker shutdown during the promotion (as suggested by Amit in [1]) but before that let the sync slot worker knows it is now acting during promotion. Something like: - let the sync worker know it is now acting under promotion - do what needs to be done while acting under promotion - shutdown the sync worker That way we would avoid any "risk" of having the sync worker doing something we don't expect while not in recovery anymore. Regarding "do what needs to be done while acting under promotion": - Ensure all slots in 'r' state are synced - drop slots that are in 'i' state Thoughts? [1]: https://www.postgresql.org/message-id/CAA4eK1J2Pc%3D5TOgty5u4bp--y7ZHaQx3_2eWPL%3DVPJ7A_0JF2g%40mail.gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thursday, November 9, 2023 6:54 PM shveta malik <shveta.malik@gmail.com> wrote: > > > PFA v32 patches which has below changes: Thanks for updating the patch. Here are few comments: 1. Do we need to update the slot upgrade code in pg_upgrade to upgrade the slot's failover property as well ? 2. The check for wal_level < WAL_LEVEL_LOGICAL. It seems the existing codes disallow slot creation if wal_level is not sufficient, I am thinking we might need similar check in slot sync worker. Otherwise, the synced slot could not be used after standby promotion. Best Regards, Hou zj
On Mon, Nov 13, 2023 at 6:19 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thursday, November 9, 2023 6:54 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > PFA v32 patches which has below changes: > > Thanks for updating the patch. > > Here are few comments: > > > 2. > The check for wal_level < WAL_LEVEL_LOGICAL. > > It seems the existing codes disallow slot creation if wal_level is not > sufficient, I am thinking we might need similar check in slot sync worker. > Otherwise, the synced slot could not be used after standby promotion. > Yes, I agree. We should add this check. > Best Regards, > Hou zj
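For reference, the check being agreed on here would presumably mirror what logical slot creation already does, placed somewhere early in the slot sync worker. A sketch only; the exact placement and wording are open:

    /* Sketch: synced logical slots are unusable unless the standby itself
     * runs with a sufficient wal_level, so refuse to start syncing. */
    if (wal_level < WAL_LEVEL_LOGICAL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("replication slot synchronization requires wal_level >= \"logical\"")));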
On Thu, Nov 9, 2023 at 8:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 9, 2023 at 8:11 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > Unrelated to above, if there is a user slot on standby with the same > > > > name which the slot-sync worker is trying to create, then shall it > > > > emit a warning and skip the sync of that slot or shall it throw an > > > > error? > > > > > > > > > > I'd vote for emit a warning and move on to the next slot if any. > > > > > > > But then it could take time for users to know the actual problem and > > they probably notice it after failover. OTOH, if we throw an error > > then probably they will come to know earlier because the slot sync > > mechanism would be stopped. Do you have reasons to prefer giving a > > WARNING and skipping creating such slots? I expect this WARNING to > > keep getting repeated in LOGs because the consecutive sync tries will > > again generate a WARNING. > > > > Apart from the above, I would like to discuss the slot sync work > distribution strategy of this patch. The current implementation as > explained in the commit message [1] works well if the slots belong to > multiple databases. It is clear from the data in emails [2][3][4] that > having more workers really helps if the slots belong to multiple > databases. But I think if all the slots belong to one or very few > databases then such a strategy won't be as good. Now, on one hand, we > get very good numbers for a particular workload with the strategy used > in the patch but OTOH it may not be adaptable to various different > kinds of workloads. So, I have a question whether we should try to > optimize this strategy for various kinds of workloads or for the first > version let's use a single-slot sync-worker and then we can enhance > the functionality in later patches either in PG17 itself or in PG18 or > later versions. I can work on separating the patch. We can first focus on single worker design and then we can work on multi-worker design either immediately (if needed) or we can target it in the second draft of the patch. I would like to know the thoughts of others on this. One thing to note is that a lot of the complexity of > the patch is attributed to the multi-worker strategy which may still > not be efficient, so there is an argument to go with a simpler > single-slot sync-worker strategy and then enhance it in future > versions as we learn more about various workloads. It will also help > to develop this feature incrementally instead of doing all the things > in one go and taking a much longer time than it should. > Agreed. With multi-workers, a lot of complexity (dsa, locks etc) have come into play. We can decide better on our workload distribution strategy among workers once we have more clarity on different types of workloads. > > [1] - "The replication launcher on the physical standby queries > primary to get the list of dbids for failover logical slots. Once it > gets the dbids, if dbids < max_slotsync_workers, it starts only that > many workers, and if dbids > max_slotsync_workers, it starts > max_slotsync_workers and divides the work equally among them. Each > worker is then responsible to keep on syncing the logical slots > belonging to the DBs assigned to it. > > Each slot-sync worker will have its own dbids list. Since the upper > limit of this dbid-count is not known, it needs to be handled using > dsa. 
We initially allocated memory to hold 100 dbids for each worker. > If this limit is exhausted, we reallocate this memory with size > incremented again by 100." > > [2] - https://www.postgresql.org/message-id/CAJpy0uD2F43avuXy_yQv7Wa3kpUwioY_Xn955xdmd6vX0ME6%3Dg%40mail.gmail.com > [3] - https://www.postgresql.org/message-id/CAFPTHDZw2G3Pax0smymMjfPqdPcZhMWo36f9F%2BTwNTs0HFxK%2Bw%40mail.gmail.com > [4] - https://www.postgresql.org/message-id/CAJpy0uD%3DDevMxTwFVsk_%3DxHqYNH8heptwgW6AimQ9fbRmx4ioQ%40mail.gmail.com > > -- > With Regards, > Amit Kapila.
On Thu, Nov 9, 2023 at 9:54 PM shveta malik <shveta.malik@gmail.com> wrote: > > PFA v32 patches which has below changes: Testing with this patch, I see that if the failover enabled slot is invalidated on the primary, then the corresponding synced slot is not invalidated on the standby. Instead, I see that it continuously gets the below error: " WARNING: not synchronizing slot sub; synchronization would move it backwards" In the code, I see that: if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn) { ereport(WARNING, errmsg("not synchronizing slot %s; synchronization would move" " it backwards", remote_slot->name)); ReplicationSlotRelease(); CommitTransactionCommand(); return; } If the restart_lsn of the remote slot is behind, then the local_slot_update() function is never called to set the invalidation status on the local slot. And for invalidated slots, restart_lsn is always NULL. regards, Ajin Cherian Fujitsu Australia
On Mon, Nov 13, 2023 at 11:02 AM Ajin Cherian <itsajin@gmail.com> wrote: > > On Thu, Nov 9, 2023 at 9:54 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > PFA v32 patches which has below changes: > Testing with this patch, I see that if the failover enabled slot is > invalidated on the primary, then the corresponding synced slot is not > invalidated on the standby. Instead, I see that it continuously gets > the below error: > " WARNING: not synchronizing slot sub; synchronization would move it backwards" > > In the code, I see that: > if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn) > { > ereport(WARNING, > errmsg("not synchronizing slot %s; synchronization > would move" > " it backwards", remote_slot->name)); > > ReplicationSlotRelease(); > CommitTransactionCommand(); > return; > } > > If the restart_lsn of the remote slot is behind, then the > local_slot_update() function is never called to set the invalidation > status on the local slot. And for invalidated slots, restart_lsn is > always NULL. > Okay. Thanks for testing Ajin. I think it needs a fix wherein we set the local-slot's invalidation status (provided it is not invalidated already) from the remote slot before this check itself. And if the slot is invalidated locally (either by itself) or by primary_slot being invalidated, then we should skip the sync. I will fix this in the next version. thanks Shveta
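In code terms, the fix being described would presumably look roughly like the following, placed before the "would move it backwards" check quoted above. A sketch only: remote_slot->invalidated is assumed to carry the invalidation cause fetched from the primary, everything else uses the existing slot fields.

    /* Propagate the primary's invalidation to the synced slot first. */
    if (remote_slot->invalidated != RS_INVAL_NONE &&
        MyReplicationSlot->data.invalidated == RS_INVAL_NONE)
    {
        SpinLockAcquire(&MyReplicationSlot->mutex);
        MyReplicationSlot->data.invalidated = remote_slot->invalidated;
        SpinLockRelease(&MyReplicationSlot->mutex);

        ReplicationSlotMarkDirty();
        ReplicationSlotSave();
    }

    /* An invalidated slot (locally or on the primary) is not synced further. */
    if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
    {
        ReplicationSlotRelease();
        CommitTransactionCommand();
        return;
    }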
On Mon, Nov 13, 2023 at 5:38 PM shveta malik <shveta.malik@gmail.com> wrote: > Okay. Thanks for testing Ajin. I think it needs a fix wherein we set > the local-slot's invalidation status (provided it is not invalidated > already) from the remote slot before this check itself. And if the > slot is invalidated locally (either by itself) or by primary_slot > being invalidated, then we should skip the sync. I will fix this in > the next version. Yes, that works. Another bug I see in my testing is that pg_get_slot_invalidation_cause() does not release the LOCK if it finds the slot it is searching for. regards, Ajin Cherian Fujitsu Australia
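For context, the usual pattern when scanning the slot array is to take ReplicationSlotControlLock in shared mode and release it on every exit path, so the fix is presumably just making sure the "found" case falls through to the release as well. A sketch of the pattern rather than the patch's actual function body; slot_name is assumed to be the function's Name argument.

    ReplicationSlotInvalidationCause cause = RS_INVAL_NONE;

    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
    for (int i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];

        if (s->in_use &&
            strcmp(NameStr(s->data.name), NameStr(*slot_name)) == 0)
        {
            cause = s->data.invalidated;
            break;              /* fall through to the release below */
        }
    }
    LWLockRelease(ReplicationSlotControlLock);  /* reached on both paths */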
Hi, On 11/13/23 5:24 AM, shveta malik wrote: > On Thu, Nov 9, 2023 at 8:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> Apart from the above, I would like to discuss the slot sync work >> distribution strategy of this patch. The current implementation as >> explained in the commit message [1] works well if the slots belong to >> multiple databases. It is clear from the data in emails [2][3][4] that >> having more workers really helps if the slots belong to multiple >> databases. But I think if all the slots belong to one or very few >> databases then such a strategy won't be as good. Now, on one hand, we >> get very good numbers for a particular workload with the strategy used >> in the patch but OTOH it may not be adaptable to various different >> kinds of workloads. So, I have a question whether we should try to >> optimize this strategy for various kinds of workloads or for the first >> version let's use a single-slot sync-worker and then we can enhance >> the functionality in later patches either in PG17 itself or in PG18 or >> later versions. > > I can work on separating the patch. We can first focus on single > worker design and then we can work on multi-worker design either > immediately (if needed) or we can target it in the second draft of the > patch. I would like to know the thoughts of others on this. If we need to put more thoughts on the workers distribution strategy then I also think it's better to focus on a single worker and then improve/discuss a distribution design later on. > > One thing to note is that a lot of the complexity of >> the patch is attributed to the multi-worker strategy which may still >> not be efficient, so there is an argument to go with a simpler >> single-slot sync-worker strategy and then enhance it in future >> versions as we learn more about various workloads. It will also help >> to develop this feature incrementally instead of doing all the things >> in one go and taking a much longer time than it should. > > Agreed. With multi-workers, a lot of complexity (dsa, locks etc) have > come into play. We can decide better on our workload distribution > strategy among workers once we have more clarity on different types of > workloads. > Agreed. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Friday, November 10, 2023 4:16 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > > >>> Also, the current coding doesn't ensure we will always give WARNING. > >>> If we see the below code that deals with this WARNING, > >>> > >>> + /* User created slot with the same name exists, emit WARNING. */ > >>> + else if (found && s->data.sync_state == SYNCSLOT_STATE_NONE) > >>> + { > >>> + ereport(WARNING, > >>> + errmsg("not synchronizing slot %s; it is a user created slot", > >>> + remote_slot->name)); > >>> + } > >>> + /* Otherwise create the slot first. */ > >>> + else > >>> + { > >>> + TransactionId xmin_horizon = InvalidTransactionId; > >>> + ReplicationSlot *slot; > >>> + > >>> + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, > >>> + remote_slot->two_phase, false); > >>> > >>> I think this is not a solid check to ensure that the slot existed > >>> before. Because it could be created as soon as the slot sync worker > >>> invokes ReplicationSlotCreate() here. > >> > >> Agree. > >> > > > > So, having a concrete check to give WARNING would require some more > > logic which I don't think is a good idea to handle this boundary case. > > > > Yeah good point, agree to just error out in all the case then (if we discard the > sync_ reserved wording proposal, which seems to be the case as probably not > worth the extra work). Thanks for the discussion! Here is the V33 patch set which includes the following changes: 1) Drop slots with state 'i' in the promotion flow after we shut down the WalReceiver. 2) Raise an error if a user slot with the same name already exists on the standby. 3) Hold the spinlock when updating the properties of the replication slot, and in places that were reading the slot's info without acquiring it. 4) Fixed a bug: if a slot is invalidated on the standby but the standby is restarted immediately after that, then it fails to recreate that slot and instead ends up syncing it again. It is fixed by checking the invalidated flag after acquiring the slot and skipping the sync for invalidated slots. 5) Fixed the bugs reported by Ajin [1][2]. 6) Removed some unused variables. Thanks Shveta for working on 1) 2) 4) and 5). Thanks Ajin for testing 5). --- TODO There are a few pending comments and bugs that have not been addressed; I will work on them in the next version: 1) Comments posted by me [3]. 2) Shut down the slotsync worker on promotion and probably let slotsync do the necessary cleanups [4]. 3) Consider documenting the hazard that creating a slot on the standby may cause ERRORs in the log, and consider making it visible in the view. 4) One bug found internally: if we give a non-existing dbname in primary_conninfo on the standby, it should be handled gracefully; the launcher should skip slot synchronization. 5) One bug found internally: when wait_for_primary_slot_catchup is doing WaitLatch and I stop the standby using: ./pg_ctl -D ../../standbydb/ -l standby.log stop, it does not come out of the wait and the shutdown hangs. [1] https://www.postgresql.org/message-id/CAFPTHDZNV_HFAXULkaJOv_MMtLukCzDEgTaixxBwjEO_0Jg-kg%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAFPTHDa5C_vHQbeqemToyucWySB0kEFbdS2WOA0PB%2BGSei2v7A%40mail.gmail.com [3] https://www.postgresql.org/message-id/OS0PR01MB571652CCD42F1D08D5BD69D494B3A%40OS0PR01MB5716.jpnprd01.prod.outlook.com [4] https://www.postgresql.org/message-id/64056e35-1916-461c-a816-26e40ffde3a0%40gmail.com Best Regards, Hou zj
Attachment
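Regarding change 3), the convention elsewhere in slot.c is that the fields under ReplicationSlot.data are read and updated under the slot's spinlock once the slot is visible to other backends, so the sync worker's update of a synced slot would presumably take this shape. A sketch; the remote_slot field names are assumptions about the values fetched from the primary.

    /* Update the local slot from the values fetched off the primary. */
    SpinLockAcquire(&MyReplicationSlot->mutex);
    MyReplicationSlot->data.restart_lsn = remote_slot->restart_lsn;
    MyReplicationSlot->data.confirmed_flush = remote_slot->confirmed_lsn;
    MyReplicationSlot->data.catalog_xmin = remote_slot->catalog_xmin;
    SpinLockRelease(&MyReplicationSlot->mutex);

    /* Dirty the slot so the change is written out at the next checkpoint. */
    ReplicationSlotMarkDirty();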
Hi, On 11/13/23 2:57 PM, Zhijie Hou (Fujitsu) wrote: > On Friday, November 10, 2023 4:16 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: >> Yeah good point, agree to just error out in all the case then (if we discard the >> sync_ reserved wording proposal, which seems to be the case as probably not >> worth the extra work). > > Thanks for the discussion! > > Here is the V33 patch set which includes the following changes: Thanks for working on it! > > 1) Drop slots with state 'i' in promotion flow after we shut down WalReceiver. @@ -3557,10 +3558,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, * this only after failure, so when you promote, we still * finish replaying as much as we can from archive and * pg_wal before failover. + * + * Drop the slots for which sync is initiated but not yet + * completed i.e. they are still waiting for the primary + * server to catch up. */ if (StandbyMode && CheckForStandbyTrigger()) { XLogShutdownWalRcv(); + slotsync_drop_initiated_slots(); return XLREAD_FAIL; } I had a closer look and it seems this is not located at the right place. Indeed, it's added here: switch (currentSource) { case XLOG_FROM_ARCHIVE: case XLOG_FROM_PG_WAL: While in our case we are in case XLOG_FROM_STREAM: So I think we should move slotsync_drop_initiated_slots() into the XLOG_FROM_STREAM case. Maybe before shutting down the sync slot worker? (the TODO item number 2 you mentioned up-thread) BTW, in order to prevent any corner case, wouldn't it also be better to replace: + /* + * Do not allow consumption of a "synchronized" slot until the standby + * gets promoted. + */ + if (RecoveryInProgress() && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) with something like: if ((RecoveryInProgress() && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) || slot->data.sync_state == SYNCSLOT_STATE_INITIATED) to ensure slots in the 'i' state can never be used? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Nov 14, 2023 at 7:56 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 11/13/23 2:57 PM, Zhijie Hou (Fujitsu) wrote: > > On Friday, November 10, 2023 4:16 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > >> Yeah good point, agree to just error out in all the case then (if we discard the > >> sync_ reserved wording proposal, which seems to be the case as probably not > >> worth the extra work). > > > > Thanks for the discussion! > > > > Here is the V33 patch set which includes the following changes: > > Thanks for working on it! > > > > > 1) Drop slots with state 'i' in promotion flow after we shut down WalReceiver. > > @@ -3557,10 +3558,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, > * this only after failure, so when you promote, we still > * finish replaying as much as we can from archive and > * pg_wal before failover. > + * > + * Drop the slots for which sync is initiated but not yet > + * completed i.e. they are still waiting for the primary > + * server to catch up. > */ > if (StandbyMode && CheckForStandbyTrigger()) > { > XLogShutdownWalRcv(); > + slotsync_drop_initiated_slots(); > return XLREAD_FAIL; > } > > I had a closer look and it seems this is not located at the right place. > > Indeed, it's added here: > > switch (currentSource) > { > case XLOG_FROM_ARCHIVE: > case XLOG_FROM_PG_WAL: > > While in our case we are in > > case XLOG_FROM_STREAM: > > So I think we should move slotsync_drop_initiated_slots() in the > XLOG_FROM_STREAM case. Maybe before shutting down the sync slot worker? > (the TODO item number 2 you mentioned up-thread) > > BTW in order to prevent any corner case, would'nt also be better to > > replace: > > + /* > + * Do not allow consumption of a "synchronized" slot until the standby > + * gets promoted. > + */ > + if (RecoveryInProgress() && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) > > with something like: > > if ((RecoveryInProgress() && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) || slot->data.sync_state == SYNCSLOT_STATE_INITIATED) > > to ensure slots in 'i' case can never be used? > > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com PFA v34. It has changed patch002 from multi workers to single worker design as per the discussion in [1] and [2]. Please note that the TODO list mentioned in [3] is still pending and will be implemented in next version. [1]: https://www.postgresql.org/message-id/CAA4eK1JzYoHu2r%3D%2BKwn%2BN4ZgVcWKtdX_yLSNyTqjdWGkr-q0iA%40mail.gmail.com [2]: https://www.postgresql.org/message-id/e7b63103-2a8c-4ee9-866a-ddba45ead388%40gmail.com [3]: https://www.postgresql.org/message-id/OS0PR01MB5716CE0729CEB3B5994A954194B3A%40OS0PR01MB5716.jpnprd01.prod.outlook.com thanks Shveta
Attachment
On Tue, Nov 14, 2023 at 7:56 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 11/13/23 2:57 PM, Zhijie Hou (Fujitsu) wrote: > > On Friday, November 10, 2023 4:16 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > >> Yeah good point, agree to just error out in all the case then (if we discard the > >> sync_ reserved wording proposal, which seems to be the case as probably not > >> worth the extra work). > > > > Thanks for the discussion! > > > > Here is the V33 patch set which includes the following changes: > > Thanks for working on it! > > > > > 1) Drop slots with state 'i' in promotion flow after we shut down WalReceiver. > > @@ -3557,10 +3558,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, > * this only after failure, so when you promote, we still > * finish replaying as much as we can from archive and > * pg_wal before failover. > + * > + * Drop the slots for which sync is initiated but not yet > + * completed i.e. they are still waiting for the primary > + * server to catch up. > */ > if (StandbyMode && CheckForStandbyTrigger()) > { > XLogShutdownWalRcv(); > + slotsync_drop_initiated_slots(); > return XLREAD_FAIL; > } > > I had a closer look and it seems this is not located at the right place. > > Indeed, it's added here: > > switch (currentSource) > { > case XLOG_FROM_ARCHIVE: > case XLOG_FROM_PG_WAL: > > While in our case we are in > > case XLOG_FROM_STREAM: > > So I think we should move slotsync_drop_initiated_slots() in the > XLOG_FROM_STREAM case. Maybe before shutting down the sync slot worker? > (the TODO item number 2 you mentioned up-thread) > > BTW in order to prevent any corner case, would'nt also be better to > > replace: > > + /* > + * Do not allow consumption of a "synchronized" slot until the standby > + * gets promoted. > + */ > + if (RecoveryInProgress() && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) > > with something like: > > if ((RecoveryInProgress() && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) || slot->data.sync_state == SYNCSLOT_STATE_INITIATED) > > to ensure slots in 'i' case can never be used? > Yes, it makes sense. WIll do it. > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
On Wed, Nov 15, 2023 at 5:21 PM shveta malik <shveta.malik@gmail.com> wrote: > > PFA v34. > Few comments on v34-0001* ======================= 1. + char buf[100]; + + buf[0] = '\0'; + + if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING) + strcat(buf, "twophase"); + if (MySubscription->failoverstate == LOGICALREP_FAILOVER_STATE_PENDING) + { + if (buf[0] != '\0') + strcat(buf, " and "); + strcat(buf, "failover"); + } + ereport(LOG, - (errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled", - MySubscription->name))); + (errmsg("logical replication apply worker for subscription \"%s\" will restart so that %s can be enabled", + MySubscription->name, buf))); I feel it is better to separate elogs rather than construct the string. It would be easier for the translation. 2. - /* Initialize walsender process before entering the main command loop */ Spurious line removal 3. @@ -440,17 +448,8 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) if (startlsn < moveto) { - SpinLockAcquire(&MyReplicationSlot->mutex); - MyReplicationSlot->data.restart_lsn = moveto; - SpinLockRelease(&MyReplicationSlot->mutex); + PhysicalConfirmReceivedLocation(moveto); retlsn = moveto; - - /* - * Dirty the slot so as it is written out at the next checkpoint. Note - * that the LSN position advanced may still be lost in the event of a - * crash, but this makes the data consistent after a clean shutdown. - */ - ReplicationSlotMarkDirty(); } I think this change has been made so that we can wakeup logical walsenders from a central location. In general, this is a good idea but it seems calling PhysicalConfirmReceivedLocation() would make an additional call to ReplicationSlotsComputeRequiredLSN() which is already called in the caller of pg_physical_replication_slot_advance(), so not sure such unification is a good idea here. 4. + * Here logical walsender associated with failover logical slot waits + * for physical standbys corresponding to physical slots specified in + * standby_slot_names GUC. + */ +void +WalSndWaitForStandbyConfirmation(XLogRecPtr wait_for_lsn) In the above comments, we don't seem to follow the 80-col limit. Please check all other comments in the patch for similar problem. 5. +static void +WalSndRereadConfigAndSlots(List **standby_slots) +{ + char *pre_standby_slot_names = pstrdup(standby_slot_names); + + ProcessConfigFile(PGC_SIGHUP); + + if (strcmp(pre_standby_slot_names, standby_slot_names) != 0) + { + list_free(*standby_slots); + *standby_slots = GetStandbySlotList(true); + } + + pfree(pre_standby_slot_names); +} The function name is misleading w.r.t the functionality. Can we name it on the lines of WalSndRereadConfigAndReInitSlotList()? I know it is a bit longer but couldn't come up with anything better. 6. + /* + * Fast path to entering the loop in case we already know we have + * enough WAL available and all the standby servers has confirmed + * receipt of WAL upto RecentFlushPtr. I think this comment is a bit misleading because it is a fast path to avoid entering the loop. I think we can keep the existing comment here: "Fast path to avoid acquiring the spinlock in case we already know ..." 7. @@ -3381,7 +3673,9 @@ WalSndWait(uint32 socket_events, long timeout, uint32 wait_event) * And, we use separate shared memory CVs for physical and logical * walsenders for selective wake ups, see WalSndWakeup() for more details. 
*/ - if (MyWalSnd->kind == REPLICATION_KIND_PHYSICAL) + if (wait_for_standby) + ConditionVariablePrepareToSleep(&WalSndCtl->wal_confirm_rcv_cv); + else if (MyWalSnd->kind == REPLICATION_KIND_PHYSICAL) The comment above this change needs to be updated for the usage of this new CV. 8. +WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION "Waiting for physical standby confirmation in WAL sender process." I feel the above description is not clear. How about being more specific with something along the lines of: "Waiting for the WAL to be received by physical standby in WAL sender process." 9. + {"standby_slot_names", PGC_SIGHUP, REPLICATION_PRIMARY, + gettext_noop("List of streaming replication standby server slot " + "names that logical walsenders waits for."), I think we slightly simplify it by saying: "Lists streaming replication standby server slot names that logical WAL sender processes wait for.". It would be more consistent with a few other similar variables. 10. + gettext_noop("List of streaming replication standby server slot " + "names that logical walsenders waits for."), + gettext_noop("Decoded changes are sent out to plugins by logical " + "walsenders only after specified replication slots " + "confirm receiving WAL."), Instead of walsenders, let's use WAL sender processes. 11. @@ -6622,10 +6623,12 @@ describeSubscriptions(const char *pattern, bool verbose) appendPQExpBuffer(&buf, ", suborigin AS \"%s\"\n" ", subpasswordrequired AS \"%s\"\n" - ", subrunasowner AS \"%s\"\n", + ", subrunasowner AS \"%s\"\n" + ", subfailoverstate AS \"%s\"\n", gettext_noop("Origin"), gettext_noop("Password required"), - gettext_noop("Run as owner?")); + gettext_noop("Run as owner?"), + gettext_noop("Enable failover?")); Let's name the new column as "Failover" and also it should be displayed only when pset.sversion is >=17. 12. @@ -93,6 +97,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) BKI_SHARED_RELATION BKI_ROW bool subrunasowner; /* True if replication should execute as the * subscription owner */ + char subfailoverstate; /* Enable Failover State */ This should be listed in system_views.sql in the below GRANT statement: GRANT SELECT (oid, subdbid, subskiplsn, subname, subowner, subenabled, subbinary, substream, subtwophasestate, subdisableonerr, subpasswordrequired, subrunasowner, subslotname, subsynccommit, subpublications, suborigin) 13. + ConditionVariable wal_confirm_rcv_cv; + WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER]; } WalSndCtlData; It is better to add a comment for this new variable explaining its use. -- With Regards, Amit Kapila.
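On point 1, keeping each message whole for the benefit of translators would presumably look something like this. A sketch only; MySubscription->failoverstate and LOGICALREP_FAILOVER_STATE_PENDING are the patch's additions as quoted above, the rest is existing code.

    if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING &&
        MySubscription->failoverstate == LOGICALREP_FAILOVER_STATE_PENDING)
        ereport(LOG,
                (errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase and failover can be enabled",
                        MySubscription->name)));
    else if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING)
        ereport(LOG,
                (errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled",
                        MySubscription->name)));
    else
        ereport(LOG,
                (errmsg("logical replication apply worker for subscription \"%s\" will restart so that failover can be enabled",
                        MySubscription->name)));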
On Thu, Nov 16, 2023 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Nov 15, 2023 at 5:21 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > PFA v34. > > > > Few comments on v34-0001* > ======================= > 1. > + char buf[100]; > + > + buf[0] = '\0'; > + > + if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING) > + strcat(buf, "twophase"); > + if (MySubscription->failoverstate == LOGICALREP_FAILOVER_STATE_PENDING) > + { > + if (buf[0] != '\0') > + strcat(buf, " and "); > + strcat(buf, "failover"); > + } > + > ereport(LOG, > - (errmsg("logical replication apply worker for subscription \"%s\" > will restart so that two_phase can be enabled", > - MySubscription->name))); > + (errmsg("logical replication apply worker for subscription \"%s\" > will restart so that %s can be enabled", > + MySubscription->name, buf))); > > I feel it is better to separate elogs rather than construct the > string. It would be easier for the translation. > > 2. > - > /* Initialize walsender process before entering the main command loop */ > > Spurious line removal > > 3. > @@ -440,17 +448,8 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) > > if (startlsn < moveto) > { > - SpinLockAcquire(&MyReplicationSlot->mutex); > - MyReplicationSlot->data.restart_lsn = moveto; > - SpinLockRelease(&MyReplicationSlot->mutex); > + PhysicalConfirmReceivedLocation(moveto); > retlsn = moveto; > - > - /* > - * Dirty the slot so as it is written out at the next checkpoint. Note > - * that the LSN position advanced may still be lost in the event of a > - * crash, but this makes the data consistent after a clean shutdown. > - */ > - ReplicationSlotMarkDirty(); > } > > I think this change has been made so that we can wakeup logical > walsenders from a central location. In general, this is a good idea > but it seems calling PhysicalConfirmReceivedLocation() would make an > additional call to ReplicationSlotsComputeRequiredLSN() which is > already called in the caller of > pg_physical_replication_slot_advance(), so not sure such unification > is a good idea here. > > 4. > + * Here logical walsender associated with failover logical slot waits > + * for physical standbys corresponding to physical slots specified in > + * standby_slot_names GUC. > + */ > +void > +WalSndWaitForStandbyConfirmation(XLogRecPtr wait_for_lsn) > > In the above comments, we don't seem to follow the 80-col limit. > Please check all other comments in the patch for similar problem. > > 5. > +static void > +WalSndRereadConfigAndSlots(List **standby_slots) > +{ > + char *pre_standby_slot_names = pstrdup(standby_slot_names); > + > + ProcessConfigFile(PGC_SIGHUP); > + > + if (strcmp(pre_standby_slot_names, standby_slot_names) != 0) > + { > + list_free(*standby_slots); > + *standby_slots = GetStandbySlotList(true); > + } > + > + pfree(pre_standby_slot_names); > +} > > The function name is misleading w.r.t the functionality. Can we name > it on the lines of WalSndRereadConfigAndReInitSlotList()? I know it is > a bit longer but couldn't come up with anything better. > > 6. > + /* > + * Fast path to entering the loop in case we already know we have > + * enough WAL available and all the standby servers has confirmed > + * receipt of WAL upto RecentFlushPtr. > > I think this comment is a bit misleading because it is a fast path to > avoid entering the loop. I think we can keep the existing comment > here: "Fast path to avoid acquiring the spinlock in case we already > know ..." > > 7. 
> @@ -3381,7 +3673,9 @@ WalSndWait(uint32 socket_events, long timeout, > uint32 wait_event) > * And, we use separate shared memory CVs for physical and logical > * walsenders for selective wake ups, see WalSndWakeup() for more details. > */ > - if (MyWalSnd->kind == REPLICATION_KIND_PHYSICAL) > + if (wait_for_standby) > + ConditionVariablePrepareToSleep(&WalSndCtl->wal_confirm_rcv_cv); > + else if (MyWalSnd->kind == REPLICATION_KIND_PHYSICAL) > > The comment above this change needs to be updated for the usage of this new CV. > > 8. > +WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION "Waiting for physical > standby confirmation in WAL sender process." > > I feel the above description is not clear. How about being more > specific with something along the lines of: "Waiting for the WAL to be > received by physical standby in WAL sender process." > > 9. > + {"standby_slot_names", PGC_SIGHUP, REPLICATION_PRIMARY, > + gettext_noop("List of streaming replication standby server slot " > + "names that logical walsenders waits for."), > > I think we slightly simplify it by saying: "Lists streaming > replication standby server slot names that logical WAL sender > processes wait for.". It would be more consistent with a few other > similar variables. > > 10. > + gettext_noop("List of streaming replication standby server slot " > + "names that logical walsenders waits for."), > + gettext_noop("Decoded changes are sent out to plugins by logical " > + "walsenders only after specified replication slots " > + "confirm receiving WAL."), > > Instead of walsenders, let's use WAL sender processes. > > 11. > @@ -6622,10 +6623,12 @@ describeSubscriptions(const char *pattern, bool verbose) > appendPQExpBuffer(&buf, > ", suborigin AS \"%s\"\n" > ", subpasswordrequired AS \"%s\"\n" > - ", subrunasowner AS \"%s\"\n", > + ", subrunasowner AS \"%s\"\n" > + ", subfailoverstate AS \"%s\"\n", > gettext_noop("Origin"), > gettext_noop("Password required"), > - gettext_noop("Run as owner?")); > + gettext_noop("Run as owner?"), > + gettext_noop("Enable failover?")); > > Let's name the new column as "Failover" and also it should be > displayed only when pset.sversion is >=17. > > 12. > @@ -93,6 +97,8 @@ CATALOG(pg_subscription,6100,SubscriptionRelationId) > BKI_SHARED_RELATION BKI_ROW > bool subrunasowner; /* True if replication should execute as the > * subscription owner */ > > + char subfailoverstate; /* Enable Failover State */ > > This should be listed in system_views.sql in the below GRANT statement: > GRANT SELECT (oid, subdbid, subskiplsn, subname, subowner, subenabled, > subbinary, substream, subtwophasestate, subdisableonerr, > subpasswordrequired, subrunasowner, > subslotname, subsynccommit, subpublications, suborigin) > > 13. > + ConditionVariable wal_confirm_rcv_cv; > + > WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER]; > } WalSndCtlData; > > It is better to add a comment for this new variable explaining its use. > > -- > With Regards, > Amit Kapila. PFA v35. It has below changes: 1) change of default for 'enable_syncslot' to false. 2) validate the dbname provided in primary_conninfo before attempting slot-sync. 3) do not allow logical decoding on slots with 'i' sync_state. 4) support in pg_upgrade for the failover property of slot. 5) do not start slot-sync if wal_level < logical 6) shutdown the slotsync worker on promotion. Thanks Ajin for working on 4 and 5. Thanks Hou-San for working on 6. The changes are in patch001 and patch002. With above changes, comments in [1] and [2] are addressed TODO: 1) Comments in [3]. 
2) Analyze whether we need to support upgrading the slot's 'sync_state' property. [1]: https://www.postgresql.org/message-id/OS0PR01MB571652CCD42F1D08D5BD69D494B3A%40OS0PR01MB5716.jpnprd01.prod.outlook.com [2]: https://www.postgresql.org/message-id/46070646-9e09-4566-8a62-ae31a12a510c%40gmail.com [3]: https://www.postgresql.org/message-id/CAA4eK1J%3D-kPHS1eHNBtzOQHZ64j6WSgSYQZ3fH%3D2vfiwy_48AA%40mail.gmail.com thanks Shveta
Attachment
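To put the v35 changes above in context, the standby-side setup they imply could look roughly like this. This is only a sketch against the patch set (enable_syncslot is the GUC it introduces); host, user, and slot names are illustrative, and wal_level needs a restart while the other settings take effect on reload:

    -- on the standby
    ALTER SYSTEM SET wal_level = logical;
    ALTER SYSTEM SET hot_standby_feedback = on;
    ALTER SYSTEM SET primary_slot_name = 'physical_slot1';
    -- dbname here is used only for slot synchronization and is ignored for streaming
    ALTER SYSTEM SET primary_conninfo = 'host=primary_host user=repluser dbname=postgres';
    -- new GUC from this patch set; its default becomes false in v35, so enable it explicitly
    ALTER SYSTEM SET enable_syncslot = on;
    SELECT pg_reload_conf();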
On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > On 11/13/23 2:57 PM, Zhijie Hou (Fujitsu) wrote: > > On Friday, November 10, 2023 4:16 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > >> Yeah good point, agree to just error out in all the case then (if we > >> discard the sync_ reserved wording proposal, which seems to be the > >> case as probably not worth the extra work). > > > > Thanks for the discussion! > > > > Here is the V33 patch set which includes the following changes: > > Thanks for working on it! > > > > > 1) Drop slots with state 'i' in promotion flow after we shut down WalReceiver. > > @@ -3557,10 +3558,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr > RecPtr, bool randAccess, > * this only after failure, so when you promote, we still > * finish replaying as much as we can from archive and > * pg_wal before failover. > + * > + * Drop the slots for which sync is initiated but not yet > + * completed i.e. they are still waiting for the primary > + * server to catch up. > */ > if (StandbyMode && CheckForStandbyTrigger()) > { > XLogShutdownWalRcv(); > + slotsync_drop_initiated_slots(); > return XLREAD_FAIL; > } > > I had a closer look and it seems this is not located at the right place. > > Indeed, it's added here: > > switch (currentSource) > { > case XLOG_FROM_ARCHIVE: > case XLOG_FROM_PG_WAL: > > While in our case we are in > > case XLOG_FROM_STREAM: > > So I think we should move slotsync_drop_initiated_slots() in the > XLOG_FROM_STREAM case. Maybe before shutting down the sync slot worker? > (the TODO item number 2 you mentioned up-thread) Thanks for the comment. I feel the WaitForWALToBecomeAvailable may not be the best place to shutdown slotsync worker and drop slots. There could be other reasons(other than promotion) as mentioned in comments in case XLOG_FROM_STREAM to reach the code there. I thought if the intention is to stop slotsync workers on promotion, maybe FinishWalRecovery() is a better place to do it as it's indicating the end of recovery and XLogShutdownWalRcv is also called in it. And I feel we'd better drop the slots after shutting down the slotsync workers, because otherwise the slotsync workers could create the dropped slot again in rare cases. Best Regards, Hou zj
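For reference, the promotion event this discussion revolves around is the one a DBA triggers explicitly; a minimal sketch, with the behaviour proposed above noted as comments:

    -- on the standby, trigger promotion (equivalently: pg_ctl promote)
    SELECT pg_promote();
    -- per the proposal, the slot sync worker is shut down first, and any slots whose
    -- initial synchronization was still in progress are dropped while finishing recovery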
On Tue, Nov 14, 2023 at 12:57 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > 2) Raise error if user slot with same name already exists on standby. "ERROR: not synchronizing slot test; it is a user created slot" I just tested this using v35, and the error message produced in this case is not very good. It neither reads like an error nor makes clear what the underlying problem is or how to correct it. regards, Ajin Cherian Fujitsu Australia
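For context, a sketch of how this situation arises (slot name and plugin are illustrative; creating a logical slot on a standby requires PostgreSQL 16 or later):

    -- on the standby, a user manually creates a logical slot
    SELECT pg_create_logical_replication_slot('myslot', 'test_decoding');
    -- if a failover-enabled slot also named 'myslot' exists on the primary, the slot
    -- sync worker refuses to overwrite the user-created slot and raises the error
    -- quoted above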
Hi, On 11/16/23 1:03 PM, shveta malik wrote: > On Thu, Nov 16, 2023 at 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > PFA v35. It has below changes: Thanks for the update! > 6) shutdown the slotsync worker on promotion. + /* + * Shutdown the slot sync workers to prevent potential conflicts between + * user processes and slotsync workers after a promotion. Additionally, + * drop any slots that have initiated but not yet completed the sync + * process. + */ + ShutDownSlotSync(); + slotsync_drop_initiated_slots(); I think there is a corner case here. If there is promotion while slot creation is in progress (slot has just been created and is in 'i' state), then when we shutdown the sync slot worker in ShutDownSlotSync() we'll set slot->in_use = false in ReplicationSlotDropPtr(). Indeed, when we shut the sync worker down: (gdb) bt #0 ReplicationSlotDropPtr (slot=0x7f25af5c9bb0) at slot.c:734 #1 0x000056266c8106a7 in ReplicationSlotDropAcquired () at slot.c:725 #2 0x000056266c810170 in ReplicationSlotRelease () at slot.c:583 #3 0x000056266c80f420 in ReplicationSlotShmemExit (code=1, arg=0) at slot.c:189 #4 0x000056266c86213b in shmem_exit (code=1) at ipc.c:243 #5 0x000056266c861fdf in proc_exit_prepare (code=1) at ipc.c:198 #6 0x000056266c861f23 in proc_exit (code=1) at ipc.c:111 So later on, when we'll want to drop this slot in slotsync_drop_initiated_slots() we'll get things like: 2023-11-17 11:22:08.526 UTC [2195486] FATAL: replication slot "logical_slot4" does not exist Reason is that slotsync_drop_initiated_slots() does call SearchNamedReplicationSlot(): (gdb) bt #0 SearchNamedReplicationSlot (name=0x7f743f5c9ab8 "logical_slot4", need_lock=false) at slot.c:388 #1 0x0000556ef0974ec1 in ReplicationSlotAcquire (name=0x7f743f5c9ab8 "logical_slot4", nowait=true) at slot.c:484 #2 0x0000556ef09754e7 in ReplicationSlotDrop (name=0x7f743f5c9ab8 "logical_slot4", nowait=true, user_cmd=false) at slot.c:668 #3 0x0000556ef095f0a3 in slotsync_drop_initiated_slots () at slotsync.c:369 that returns a NULL slot if slot->in_use = false. One option could be to make sure slot->in_use = true before calling ReplicationSlotDrop() here? + foreach(lc, slots) + { + ReplicationSlot *s = (ReplicationSlot *) lfirst(lc); + + ReplicationSlotDrop(NameStr(s->data.name), true, false); Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Nov 16, 2023 at 5:34 PM shveta malik <shveta.malik@gmail.com> wrote: > > PFA v35. > Review v35-0002* ============== 1. As quoted in the commit message, > If a logical slot is invalidated on the primary, slot on the standby is also invalidated. If a logical slot on the primary is valid but is invalidated on the standby due to conflict (say required rows removed on the primary), then that slot is dropped and recreated on the standby in next sync-cycle. It is okay to recreate such slots as long as these are not consumable on the standby (which is the case currently). > I think this won't happen normally because of the physical slot and hot_standby_feedback but probably can occur in cases like if the user temporarily switches hot_standby_feedback from on to off. Are there any other reasons? I think we can mention the cases along with it as well at least for now. Additionally, I think this should be covered in code comments as well. 2. #include "postgres.h" - +#include "access/genam.h" Spurious line removal. 3. A password needs to be provided too, if the sender demands password authentication. It can be provided in the <varname>primary_conninfo</varname> string, or in a separate - <filename>~/.pgpass</filename> file on the standby server (use - <literal>replication</literal> as the database name). - Do not specify a database name in the - <varname>primary_conninfo</varname> string. + <filename>~/.pgpass</filename> file on the standby server. + </para> + <para> + Specify <literal>dbname</literal> in + <varname>primary_conninfo</varname> string to allow synchronization + of slots from the primary server to the standby server. + This will only be used for slot synchronization. It is ignored + for streaming. Is there a reason to remove part of the earlier sentence "use <literal>replication</literal> as the database name"? 4. + <primary><varname>enable_syncslot</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + It enables a physical standby to synchronize logical failover slots + from the primary server so that logical subscribers are not blocked + after failover. + </para> + <para> + It is enabled by default. This parameter can only be set in the + <filename>postgresql.conf</filename> file or on the server command line. + </para> I think you forgot to update the documentation for the default value of this variable. 5. + * a) start the logical replication workers for every enabled subscription + * when not in standby_mode + * b) start the slot-sync worker for logical failover slots synchronization + * from the primary server when in standby_mode. Either use a full stop after both lines or none of these. 6. +static void slotsync_worker_cleanup(SlotSyncWorkerInfo * worker); There shouldn't be space between * and the worker. 7. + if (!SlotSyncWorker->hdr.in_use) + { + LWLockRelease(SlotSyncWorkerLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("replication slot-sync worker not initialized, " + "cannot attach"))); + } + + if (SlotSyncWorker->hdr.proc) + { + LWLockRelease(SlotSyncWorkerLock); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("replication slot-sync worker is " + "already running, cannot attach"))); + } Using slot-sync in the error messages looks a bit odd to me. Can we use "replication slot sync worker ..." in both these and other similar messages? 
I think it would be better if we don't split the messages into multiple lines in these cases as messages don't appear too long to me. 8. +/* + * Detach the worker from DSM and update 'proc' and 'in_use'. + * Logical replication launcher will come to know using these + * that the worker has shutdown. + */ +void +slotsync_worker_detach(int code, Datum arg) +{ I think the reference to DSM is leftover from the previous version of the patch. Can we change the above comments as per the new code? 9. +static bool +slotsync_worker_launch() { ... + /* TODO: do we really need 'generation', analyse more here */ + worker->hdr.generation++; We should do something about this TODO. As per my understanding, we don't need a generation number for the slot sync worker as we have one such worker but I guess the patch requires it because we are using existing logical replication worker infrastructure. This brings the question of whether we really need a separate SlotSyncWorkerInfo or if we can use existing LogicalRepWorker and distinguish it with LogicalRepWorkerType? I guess you didn't use it because most of the fields in LogicalRepWorker will be unused for slot sync worker. 10. + * Can't use existing functions like 'get_database_oid' from dbcommands.c for + * validity purpose as they need db connection. + */ +static bool +validate_dbname(const char *dbname) I don't know how important it is to validate the dbname before launching the sync slot worker because anyway after launching, it will give an error while initializing the connection if the dbname is invalid. But, if we think it is really required, did you consider using GetDatabaseTuple()? -- With Regards, Amit Kapila.
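As an aside on comment 1, a conflict-invalidated but synchronized slot would be visible on the standby with a query along these lines (a sketch; failover and sync_state are columns added by this patch set, while the conflicting column already exists in PostgreSQL 16):

    SELECT slot_name, failover, sync_state, conflicting
    FROM pg_replication_slots
    WHERE slot_type = 'logical';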
Hi, On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote: > On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: >> On 11/13/23 2:57 PM, Zhijie Hou (Fujitsu) wrote: >>> On Friday, November 10, 2023 4:16 PM Drouvot, Bertrand >> <bertranddrouvot.pg@gmail.com> wrote: >>>> Yeah good point, agree to just error out in all the case then (if we >>>> discard the sync_ reserved wording proposal, which seems to be the >>>> case as probably not worth the extra work). >>> >>> Thanks for the discussion! >>> >>> Here is the V33 patch set which includes the following changes: >> >> Thanks for working on it! >> >>> >>> 1) Drop slots with state 'i' in promotion flow after we shut down WalReceiver. >> >> @@ -3557,10 +3558,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr >> RecPtr, bool randAccess, >> * this only after failure, so when you promote, we still >> * finish replaying as much as we can from archive and >> * pg_wal before failover. >> + * >> + * Drop the slots for which sync is initiated but not yet >> + * completed i.e. they are still waiting for the primary >> + * server to catch up. >> */ >> if (StandbyMode && CheckForStandbyTrigger()) >> { >> XLogShutdownWalRcv(); >> + slotsync_drop_initiated_slots(); >> return XLREAD_FAIL; >> } >> >> I had a closer look and it seems this is not located at the right place. >> >> Indeed, it's added here: >> >> switch (currentSource) >> { >> case XLOG_FROM_ARCHIVE: >> case XLOG_FROM_PG_WAL: >> >> While in our case we are in >> >> case XLOG_FROM_STREAM: >> >> So I think we should move slotsync_drop_initiated_slots() in the >> XLOG_FROM_STREAM case. Maybe before shutting down the sync slot worker? >> (the TODO item number 2 you mentioned up-thread) > > Thanks for the comment. > > I feel the WaitForWALToBecomeAvailable may not be the best place to shutdown > slotsync worker and drop slots. There could be other reasons(other than > promotion) as mentioned in comments in case XLOG_FROM_STREAM to reach the code > there. I thought if the intention is to stop slotsync workers on promotion, > maybe FinishWalRecovery() is a better place to do it as it's indicating the end > of recovery and XLogShutdownWalRcv is also called in it. I can see that slotsync_drop_initiated_slots() has been moved in FinishWalRecovery() in v35. That looks ok. > > And I feel we'd better drop the slots after shutting down the slotsync workers, > because otherwise the slotsync workers could create the dropped slot again in > rare cases. Yeah, agree and I can see that it's done that way in v35. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thursday, November 16, 2023 6:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Nov 15, 2023 at 5:21 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > PFA v34. > > > > Few comments on v34-0001* > ======================= > [...] Thanks for the comments. Here is the new version patch set which addressed all above comments and the comment in [1]. [1] https://www.postgresql.org/message-id/1e0b2eb4-c977-482d-b16e-c52711c34d6c%40gmail.com Best Regards, Hou zj
Attachment
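For readers following the 0001 part of the series, the primary- and subscriber-side pieces it covers would be used roughly as follows (a sketch; slot, subscription, and publication names are illustrative):

    -- on the primary: make logical walsenders for failover-enabled slots wait for this
    -- physical standby's slot before sending decoded changes
    ALTER SYSTEM SET standby_slot_names = 'physical_slot1';
    SELECT pg_reload_conf();

    -- on the subscriber: request the failover property for the subscription's slot
    CREATE SUBSCRIPTION mysub
      CONNECTION 'host=primary_host dbname=postgres user=repluser'
      PUBLICATION mypub
      WITH (failover = true);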
On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote: > > On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > > > I feel the WaitForWALToBecomeAvailable may not be the best place to shutdown > > slotsync worker and drop slots. There could be other reasons(other than > > promotion) as mentioned in comments in case XLOG_FROM_STREAM to reach the code > > there. I thought if the intention is to stop slotsync workers on promotion, > > maybe FinishWalRecovery() is a better place to do it as it's indicating the end > > of recovery and XLogShutdownWalRcv is also called in it. > > I can see that slotsync_drop_initiated_slots() has been moved in FinishWalRecovery() > in v35. That looks ok. > > I was thinking what if we just ignore creating such slots (which require init state) in the first place? I think that can be time-consuming in some cases but it will reduce the complexity and we can always improve such cases later if we really encounter them in the real world. I am not very sure that added complexity is worth addressing this particular case, so I would like to know your and others' opinions. More Review for v35-0002* ============================ 1. + ereport(WARNING, + errmsg("skipping slots synchronization as primary_slot_name " + "is not set.")); There is no need to use a full stop at the end for WARNING messages and as previously mentioned, let's not split message lines in such cases. There are other messages in the patch with similar problems, please fix those as well. 2. +slotsync_checks() { ... ... + /* The hot_standby_feedback must be ON for slot-sync to work */ + if (!hot_standby_feedback) + { + ereport(WARNING, + errmsg("skipping slots synchronization as hot_standby_feedback " + "is off.")); This message has the same problem as mentioned in the previous comment. Additionally, I think either atop slotsync_checks or along with GUC check we should write comments as to why we expect these values to be set for slot sync to work. 3. + /* The worker is running already */ + if (SlotSyncWorker &&SlotSyncWorker->hdr.in_use + && SlotSyncWorker->hdr.proc) The spacing for both the &&'s has problems. You need a space after the first && and the second && should be in the prior line. 4. + LauncherRereadConfig(&recheck_slotsync); + } An empty line after LauncherRereadConfig() is not required. 5. +static void +LauncherRereadConfig(bool *ss_recheck) +{ + char *conninfo = pstrdup(PrimaryConnInfo); + char *slotname = pstrdup(PrimarySlotName); + bool syncslot = enable_syncslot; + bool feedback = hot_standby_feedback; Can we change the variable name 'feedback' to 'standbyfeedback' to make it slightly more descriptive? 6. The logic to recheck the slot_sync related parameters in LauncherMain() is not very clear. IIUC, if after reload config any parameter is changed, we just seem to be checking the validity of the changed parameter but not restarting the slot sync worker, is that correct? If so, what if dbname is changed, don't we need to restart the slot-sync worker and re-initialize the connection; similarly slotname change also needs some thoughts. Also, if all the parameters are valid we seem to be re-launching the slot-sync worker without first stopping it which doesn't seem correct, am I missing something in this logic? 7. 
@@ -524,6 +525,25 @@ CreateDecodingContext(XLogRecPtr start_lsn, errmsg("replication slot \"%s\" was not created in this database", NameStr(slot->data.name)))); + in_recovery = RecoveryInProgress(); + + /* + * Do not allow consumption of a "synchronized" slot until the standby + * gets promoted. Also do not allow consumption of slots with sync_state + * as SYNCSLOT_STATE_INITIATED as they are not synced completely to be + * used. + */ + if ((in_recovery && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) || + slot->data.sync_state == SYNCSLOT_STATE_INITIATED) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot use replication slot \"%s\" for logical decoding", + NameStr(slot->data.name)), + in_recovery ? + errdetail("This slot is being synced from the primary server.") : + errdetail("This slot was not synced completely from the primary server."), + errhint("Specify another replication slot."))); + If we are planning to drop slots in state SYNCSLOT_STATE_INITIATED at the time of promotion, don't we need to just have an assert or elog(ERROR, .. for non-recovery cases as such cases won't be reachable? If so, I think we can separate out that case here. 8. wait_for_primary_slot_catchup() { ... + /* Check if this standby is promoted while we are waiting */ + if (!RecoveryInProgress()) + { + /* + * The remote slot didn't pass the locally reserved position at + * the time of local promotion, so it's not safe to use. + */ + ereport( + WARNING, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg( + "slot-sync wait for slot %s interrupted by promotion, " + "slot creation aborted", remote_slot->name))); + pfree(cmd.data); + return false; + } ... } Shouldn't this be an Assert because a slot-sync worker shouldn't exist for non-standby servers? 9. wait_for_primary_slot_catchup() { ... + slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); + if (!tuplestore_gettupleslot(res->tuplestore, true, false, slot)) + { + ereport(WARNING, + (errmsg("slot \"%s\" disappeared from the primary server," + " slot creation aborted", remote_slot->name))); + pfree(cmd.data); + walrcv_clear_result(res); + return false; If the slot on primary disappears, shouldn't this part of the code somehow ensure to remove the slot on standby as well? If it is taken at some other point in time then at least we should write a comment here to state how it is taken care of. I think this comment also applies to a few other checks following this check. 10. + /* + * It is possible to get null values for lsns and xmin if slot is + * invalidated on the primary server, so handle accordingly. + */ + new_invalidated = DatumGetBool(slot_getattr(slot, 1, &isnull)); We can say LSN and Xmin in the above comment to make it easier to read/understand. 11. /* + * Once we got valid restart_lsn, then confirmed_lsn and catalog_xmin + * are expected to be valid/non-null, so assert if found null. + */ No need to explicitly say about assert, it is clear from the code. We can slightly change this comment to: "Once we got valid restart_lsn, then confirmed_lsn and catalog_xmin are expected to be valid/non-null." 12. + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn || + TransactionIdPrecedes(remote_slot->catalog_xmin, + MyReplicationSlot->data.catalog_xmin)) + { + if (!wait_for_primary_slot_catchup(wrconn, remote_slot)) + { + /* + * The remote slot didn't catch up to locally reserved position. + * But still persist it and attempt the wait and sync in next + * sync-cycle. 
+ */ + if (MyReplicationSlot->data.persistency != RS_PERSISTENT) + { + ReplicationSlotPersist(); + *slot_updated = true; + } I think the reason to persist in this case is because next time local restart_lsn can be ahead than the current location and it can take more time to create such a slot. We can probably mention the same in the comments. -- With Regards, Amit Kapila.
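To make comment 7 above concrete, the user-visible effect on the standby would be along these lines (slot name hypothetical; the messages are the ones proposed in the patch):

    -- on the standby, trying to consume a slot that is being synchronized
    SELECT * FROM pg_logical_slot_get_changes('failover_slot', NULL, NULL);
    -- ERROR:  cannot use replication slot "failover_slot" for logical decoding
    -- DETAIL: This slot is being synced from the primary server.
    -- HINT:   Specify another replication slot.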
Hi, On 11/18/23 11:45 AM, Amit Kapila wrote: > On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote: >>> On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: >>> >>> I feel the WaitForWALToBecomeAvailable may not be the best place to shutdown >>> slotsync worker and drop slots. There could be other reasons(other than >>> promotion) as mentioned in comments in case XLOG_FROM_STREAM to reach the code >>> there. I thought if the intention is to stop slotsync workers on promotion, >>> maybe FinishWalRecovery() is a better place to do it as it's indicating the end >>> of recovery and XLogShutdownWalRcv is also called in it. >> >> I can see that slotsync_drop_initiated_slots() has been moved in FinishWalRecovery() >> in v35. That looks ok. >>> > > I was thinking what if we just ignore creating such slots (which > require init state) in the first place? I think that can be > time-consuming in some cases but it will reduce the complexity and we > can always improve such cases later if we really encounter them in the > real world. I am not very sure that added complexity is worth > addressing this particular case, so I would like to know your and > others' opinions. > I'm not sure I understand your point. Are you saying that we should not create slots on the standby that are "currently" reported in a 'i' state? (so just keep the 'r' and 'n' states?) Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Sat, Nov 18, 2023 at 4:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > More Review for v35-0002* > ============================ > More review of v35-0002* ==================== 1. +/* + * Helper function to check if local_slot is present in remote_slots list. + * + * It also checks if logical slot is locally invalidated i.e. invalidated on + * the standby but valid on the primary server. If found so, it sets + * locally_invalidated to true. + */ +static bool +slot_exists_in_list(ReplicationSlot *local_slot, List *remote_slots, + bool *locally_invalidated) The name of the function is a bit misleading because it checks the validity of the slot not only whether it exists in remote_list. Would it be better to name it as ValidateSyncSlot() or something along those lines? 2. +static long +synchronize_slots(WalReceiverConn *wrconn) { ... + /* Construct query to get slots info from the primary server */ + initStringInfo(&s); + construct_slot_query(&s); ... + if (remote_slot->conflicting) + remote_slot->invalidated = get_remote_invalidation_cause(wrconn, + remote_slot->name); ... +static ReplicationSlotInvalidationCause +get_remote_invalidation_cause(WalReceiverConn *wrconn, char *slot_name) { ... + appendStringInfo(&cmd, + "SELECT pg_get_slot_invalidation_cause(%s)", + quote_literal_cstr(slot_name)); + res = walrcv_exec(wrconn, cmd.data, 1, slotRow); Do we really need to query a second time to get the invalidation cause? Can we adjust the slot_query to get it in one round trip? I think this may not optimize much because the patch uses second round trip only for invalidated slots but still looks odd. So unless the query becomes too complicated, we should try to achive it one round trip. 3. +static long +synchronize_slots(WalReceiverConn *wrconn) +{ ... ... + /* The syscache access needs a transaction env. */ + StartTransactionCommand(); + + /* Make things live outside TX context */ + MemoryContextSwitchTo(oldctx); + + /* Construct query to get slots info from the primary server */ + initStringInfo(&s); + construct_slot_query(&s); + + elog(DEBUG2, "slot-sync worker's query:%s \n", s.data); + + /* Execute the query */ + res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow); It is okay to perform the above query execution outside the transaction context but I would like to know the reason for the same. Do we want to retain anything beyond the transaction context or is there some other reason to do this outside the transaction context? 4. +static void +construct_slot_query(StringInfo s) +{ + /* + * Fetch data for logical failover slots with sync_state either as + * SYNCSLOT_STATE_NONE or SYNCSLOT_STATE_READY. + */ + appendStringInfo(s, + "SELECT slot_name, plugin, confirmed_flush_lsn," + " restart_lsn, catalog_xmin, two_phase, conflicting, " + " database FROM pg_catalog.pg_replication_slots" + " WHERE failover and sync_state != 'i'"); +} Why would the sync_state on the primary server be any valid value? I thought it was set only on physical standby. I think it is better to mention the reason for using the sync state and or failover flag in the above comments. The current comment doesn't seem of much use as it just states what is evident from the query. 5. * This check should never pass as on the primary server, we have waited + * for the standby's confirmation before updating the logical slot. But to + * take care of any bug in that flow, we should retain this check. 
+ */ + if (remote_slot->confirmed_lsn > WalRcv->latestWalEnd) + { + elog(LOG, "skipping sync of slot \"%s\" as the received slot-sync " + "LSN %X/%X is ahead of the standby position %X/%X", + remote_slot->name, + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), + LSN_FORMAT_ARGS(WalRcv->latestWalEnd)); + This should be elog(ERROR, ..). Normally, we use elog(ERROR, ...) for such unexpected cases. And, you don't need to explicitly mention the last sentence in the comment: "But to take care of any bug in that flow, we should retain this check.". 6. +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, + bool *slot_updated) { ... + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn) + { + ereport(WARNING, + errmsg("not synchronizing slot %s; synchronization would" + " move it backwards", remote_slot->name)); I think here elevel should be LOG because user can't do much about this. Do we use ';' at other places in the message? But when can we hit this case? We can add some comments to state in which scenario this possible. OTOH, if this is sort of can't happen case and we have kept it to avoid any sort of inconsistency then we can probably use elog(ERROR, .. with approapriate LSN locations, so that later the problem could be debugged. 7. +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, + bool *slot_updated) { ... + + StartTransactionCommand(); + + /* Make things live outside TX context */ + MemoryContextSwitchTo(oldctx); + ... Similar to one of the previous comments, it is not clear to me why the patch is doing a memory context switch here. Can we add a comment? 8. + /* User created slot with the same name exists, raise ERROR. */ + else if (sync_state == SYNCSLOT_STATE_NONE) + { + ereport(ERROR, + errmsg("not synchronizing slot %s; it is a user created slot", + remote_slot->name)); + } Won't we need error_code in this error? Also, the message doesn't seem to follow the code's usual style. 9. +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, + bool *slot_updated) { ... + else + { + TransactionId xmin_horizon = InvalidTransactionId; + ReplicationSlot *slot; + + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, + remote_slot->two_phase, false); + slot = MyReplicationSlot; + + SpinLockAcquire(&slot->mutex); + slot->data.database = get_database_oid(remote_slot->database, false); + + /* Mark it as sync initiated by slot-sync worker */ + slot->data.sync_state = SYNCSLOT_STATE_INITIATED; + slot->data.failover = true; + + namestrcpy(&slot->data.plugin, remote_slot->plugin); + SpinLockRelease(&slot->mutex); + + ReplicationSlotReserveWal(); + How and when will this init state (SYNCSLOT_STATE_INITIATED) persist to disk? 10. + if (slot_updated) + SlotSyncWorker->last_update_time = now; + + else if (TimestampDifferenceExceeds(SlotSyncWorker->last_update_time, + now, WORKER_INACTIVITY_THRESHOLD_MS)) Empty line between if/else if is not required. 11. +static WalReceiverConn * +remote_connect() +{ + WalReceiverConn *wrconn = NULL; + char *err; + + wrconn = walrcv_connect(PrimaryConnInfo, true, false, "slot-sync", &err); + if (wrconn == NULL) + ereport(ERROR, + (errmsg("could not connect to the primary server: %s", err))); Let's use appname similar to what we do for "walreceiver" as shown below: /* Establish the connection to the primary for XLOG streaming */ wrconn = walrcv_connect(conninfo, false, false, cluster_name[0] ? 
cluster_name : "walreceiver", &err); if (!wrconn) ereport(ERROR, (errcode(ERRCODE_CONNECTION_FAILURE), errmsg("could not connect to the primary server: %s", err))); Some proposals for default appname "slotsynchronizer", "slotsync worker". Also, use the same error code as used by "walreceiver". 12. Do we need the handling of the slotsync worker in GetBackendTypeDesc()? Please check without that what value this patch displays for backend_type. 13. +/* + * Re-read the config file. + * + * If primary_conninfo has changed, reconnect to primary. + */ +static void +slotsync_reread_config(WalReceiverConn **wrconn) +{ + char *conninfo = pstrdup(PrimaryConnInfo); + + ConfigReloadPending = false; + ProcessConfigFile(PGC_SIGHUP); + + /* Reconnect if GUC primary_conninfo got changed */ + if (strcmp(conninfo, PrimaryConnInfo) != 0) + { + if (*wrconn) + walrcv_disconnect(*wrconn); + + *wrconn = remote_connect(); I think we should exit the worker in this case and allow it to reconnect. See the similar handling in maybe_reread_subscription(). One effect of not doing is that the dbname patch has used in ReplSlotSyncWorkerMain() will become inconsistent. 14. +void +ReplSlotSyncWorkerMain(Datum main_arg) +{ ... ... + /* + * If the standby has been promoted, skip the slot synchronization process. + * + * Although the startup process stops all the slot-sync workers on + * promotion, the launcher may not have realized the promotion and could + * start additional workers after that. Therefore, this check is still + * necessary to prevent these additional workers from running. + */ + if (PromoteIsTriggered()) + exit(0); ... ... + /* Check if got promoted */ + if (!RecoveryInProgress()) + { + /* + * Drop the slots for which sync is initiated but not yet + * completed i.e. they are still waiting for the primary server to + * catch up. + */ + slotsync_drop_initiated_slots(); + ereport(LOG, + errmsg("exiting slot-sync woker on promotion of standby")); I think we should never reach this code in non-standby mode. It should elog(ERROR,.. Can you please explain why promotion handling is required here? 15. @@ -190,6 +190,8 @@ static const char *const BuiltinTrancheNames[] = { "LogicalRepLauncherDSA", /* LWTRANCHE_LAUNCHER_HASH: */ "LogicalRepLauncherHash", + /* LWTRANCHE_SLOTSYNC_DSA: */ + "SlotSyncWorkerDSA", }; ... ... + LWTRANCHE_SLOTSYNC_DSA, LWTRANCHE_FIRST_USER_DEFINED, } BuiltinTrancheIds; These are not used in the patch. 16. +/* ------------------------------- + * LIST_DBID_FOR_FAILOVER_SLOTS command + * ------------------------------- + */ +typedef struct ListDBForFailoverSlotsCmd +{ + NodeTag type; + List *slot_names; +} ListDBForFailoverSlotsCmd; ... +/* + * Failover logical slots data received from remote. + */ +typedef struct WalRcvFailoverSlotsData +{ + Oid dboid; +} WalRcvFailoverSlotsData; These structures don't seem to be used in the current version of the patch. 17. --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -15,7 +15,6 @@ #include "storage/lwlock.h" #include "storage/shmem.h" #include "storage/spin.h" -#include "replication/walreceiver.h" ... ... -extern void WaitForStandbyLSN(XLogRecPtr wait_for_lsn); extern List *GetStandbySlotList(bool copy); Why the above two are removed as part of this patch? -- With Regards, Amit Kapila.
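Regarding comment 2, one possible single-round-trip shape of the query is sketched below; it simply folds the second query into the first, evaluating the cause only for conflicting slots (pg_get_slot_invalidation_cause() is the function the patch already relies on):

    SELECT slot_name, plugin, confirmed_flush_lsn, restart_lsn, catalog_xmin,
           two_phase, conflicting, database,
           CASE WHEN conflicting
                THEN pg_get_slot_invalidation_cause(slot_name)
           END AS invalidation_cause
    FROM pg_catalog.pg_replication_slots
    WHERE failover AND sync_state != 'i';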
On Mon, Nov 20, 2023 at 3:17 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 11/18/23 11:45 AM, Amit Kapila wrote: > > On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote: > >>> On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > >>> > >>> I feel the WaitForWALToBecomeAvailable may not be the best place to shutdown > >>> slotsync worker and drop slots. There could be other reasons(other than > >>> promotion) as mentioned in comments in case XLOG_FROM_STREAM to reach the code > >>> there. I thought if the intention is to stop slotsync workers on promotion, > >>> maybe FinishWalRecovery() is a better place to do it as it's indicating the end > >>> of recovery and XLogShutdownWalRcv is also called in it. > >> > >> I can see that slotsync_drop_initiated_slots() has been moved in FinishWalRecovery() > >> in v35. That looks ok. > >>> > > > > I was thinking what if we just ignore creating such slots (which > > require init state) in the first place? I think that can be > > time-consuming in some cases but it will reduce the complexity and we > > can always improve such cases later if we really encounter them in the > > real world. I am not very sure that added complexity is worth > > addressing this particular case, so I would like to know your and > > others' opinions. > > > > I'm not sure I understand your point. Are you saying that we should not create > slots on the standby that are "currently" reported in a 'i' state? (so just keep > the 'r' and 'n' states?) > Yes. -- With Regards, Amit Kapila.
On 11/20/23 11:59 AM, Amit Kapila wrote: > On Mon, Nov 20, 2023 at 3:17 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 11/18/23 11:45 AM, Amit Kapila wrote: >>> On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand >>> <bertranddrouvot.pg@gmail.com> wrote: >>>> >>>> On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote: >>>>> On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: >>>>> >>>>> I feel the WaitForWALToBecomeAvailable may not be the best place to shutdown >>>>> slotsync worker and drop slots. There could be other reasons(other than >>>>> promotion) as mentioned in comments in case XLOG_FROM_STREAM to reach the code >>>>> there. I thought if the intention is to stop slotsync workers on promotion, >>>>> maybe FinishWalRecovery() is a better place to do it as it's indicating the end >>>>> of recovery and XLogShutdownWalRcv is also called in it. >>>> >>>> I can see that slotsync_drop_initiated_slots() has been moved in FinishWalRecovery() >>>> in v35. That looks ok. >>>>> >>> >>> I was thinking what if we just ignore creating such slots (which >>> require init state) in the first place? I think that can be >>> time-consuming in some cases but it will reduce the complexity and we >>> can always improve such cases later if we really encounter them in the >>> real world. I am not very sure that added complexity is worth >>> addressing this particular case, so I would like to know your and >>> others' opinions. >>> >> >> I'm not sure I understand your point. Are you saying that we should not create >> slots on the standby that are "currently" reported in a 'i' state? (so just keep >> the 'r' and 'n' states?) >> > > Yes. > As far the 'i' state here, from what I see, it is currently useful for: 1. Cascading standby to not sync slots with state = 'i' from the first standby. 2. Easily report Slots that did not catch up on the primary yet. 3. Avoid inactive slots to block "active" ones creation. So not creating those slots should not be an issue for 1. (sync are not needed on cascading standby as not created on the first standby yet) but is an issue for 2. (unless we provide another way to keep track and report such slots) and 3. (as I think we should still need to reserve WAL). I've a question: we'd still need to reserve WAL for those slots, no? If that's the case and if we don't call ReplicationSlotCreate() then ReplicationSlotReserveWal() would not work as MyReplicationSlot would be NULL. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
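One way point 2 above could be surfaced to users is a simple query on the standby (a sketch, assuming the sync_state column added by the patch, where 'i' means sync initiated but not yet completed):

    SELECT slot_name, restart_lsn, confirmed_flush_lsn
    FROM pg_replication_slots
    WHERE sync_state = 'i';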
On Mon, Nov 20, 2023 at 4:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > 9. > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, > + bool *slot_updated) > { > ... > + else > + { > + TransactionId xmin_horizon = InvalidTransactionId; > + ReplicationSlot *slot; > + > + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, > + remote_slot->two_phase, false); > + slot = MyReplicationSlot; > + > + SpinLockAcquire(&slot->mutex); > + slot->data.database = get_database_oid(remote_slot->database, false); > + > + /* Mark it as sync initiated by slot-sync worker */ > + slot->data.sync_state = SYNCSLOT_STATE_INITIATED; > + slot->data.failover = true; > + > + namestrcpy(&slot->data.plugin, remote_slot->plugin); > + SpinLockRelease(&slot->mutex); > + > + ReplicationSlotReserveWal(); > + > > How and when will this init state (SYNCSLOT_STATE_INITIATED) persist to disk? > On closer inspection, I see that it is done inside wait_for_primary_and_sync() when it fails to sync. I think it is better to refactor the code a bit and persist it in synchronize_one_slot() to make the code flow easier to understand. -- With Regards, Amit Kapila.
On Saturday, November 18, 2023 6:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote: > > > On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > I feel the WaitForWALToBecomeAvailable may not be the best place to > > > shutdown slotsync worker and drop slots. There could be other > > > reasons(other than > > > promotion) as mentioned in comments in case XLOG_FROM_STREAM to > > > reach the code there. I thought if the intention is to stop slotsync > > > workers on promotion, maybe FinishWalRecovery() is a better place to > > > do it as it's indicating the end of recovery and XLogShutdownWalRcv is also > called in it. > > > > I can see that slotsync_drop_initiated_slots() has been moved in > > FinishWalRecovery() in v35. That looks ok. > > > > > > More Review for v35-0002* Thanks for the comments. > ============================ > 1. > + ereport(WARNING, > + errmsg("skipping slots synchronization as primary_slot_name " > + "is not set.")); > > There is no need to use a full stop at the end for WARNING messages and as > previously mentioned, let's not split message lines in such cases. There are > other messages in the patch with similar problems, please fix those as well. Adjusted. > > 2. > +slotsync_checks() > { > ... > ... > + /* The hot_standby_feedback must be ON for slot-sync to work */ if > + (!hot_standby_feedback) { ereport(WARNING, errmsg("skipping slots > + synchronization as hot_standby_feedback " > + "is off.")); > > This message has the same problem as mentioned in the previous comment. > Additionally, I think either atop slotsync_checks or along with GUC check we > should write comments as to why we expect these values to be set for slot sync > to work. Added comments for these cases. > > 3. > + /* The worker is running already */ > + if (SlotSyncWorker &&SlotSyncWorker->hdr.in_use && > + SlotSyncWorker->hdr.proc) > > The spacing for both the &&'s has problems. You need a space after the first > && and the second && should be in the prior line. Adjusted. > > 4. > + LauncherRereadConfig(&recheck_slotsync); > + > } > > An empty line after LauncherRereadConfig() is not required. > > 5. > +static void > +LauncherRereadConfig(bool *ss_recheck) > +{ > + char *conninfo = pstrdup(PrimaryConnInfo); > + char *slotname = pstrdup(PrimarySlotName); > + bool syncslot = enable_syncslot; > + bool feedback = hot_standby_feedback; > > Can we change the variable name 'feedback' to 'standbyfeedback' to make it > slightly more descriptive? Changed. > > 6. The logic to recheck the slot_sync related parameters in > LauncherMain() is not very clear. IIUC, if after reload config any parameter is > changed, we just seem to be checking the validity of the changed parameter > but not restarting the slot sync worker, is that correct? If so, what if dbname is > changed, don't we need to restart the slot-sync worker and re-initialize the > connection; similarly slotname change also needs some thoughts. Also, if all the > parameters are valid we seem to be re-launching the slot-sync worker without > first stopping it which doesn't seem correct, am I missing something in this > logic? I think the slot sync worker will be stopped in LauncherRereadConfig() if GUC changed and new slot sync worker will be started in next loop in LauncherMain(). > 7. 
> @@ -524,6 +525,25 @@ CreateDecodingContext(XLogRecPtr start_lsn, > errmsg("replication slot \"%s\" was not created in this database", > NameStr(slot->data.name)))); > > + in_recovery = RecoveryInProgress(); > + > + /* > + * Do not allow consumption of a "synchronized" slot until the standby > + * gets promoted. Also do not allow consumption of slots with > + sync_state > + * as SYNCSLOT_STATE_INITIATED as they are not synced completely to be > + * used. > + */ > + if ((in_recovery && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) || > + slot->data.sync_state == SYNCSLOT_STATE_INITIATED) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot use replication slot \"%s\" for logical decoding", > + NameStr(slot->data.name)), in_recovery ? > + errdetail("This slot is being synced from the primary server.") : > + errdetail("This slot was not synced completely from the primary > + server."), errhint("Specify another replication slot."))); > + > > If we are planning to drop slots in state SYNCSLOT_STATE_INITIATED at the > time of promotion, don't we need to just have an assert or elog(ERROR, .. for > non-recovery cases as such cases won't be reachable? If so, I think we can > separate out that case here. Adjusted the codes as suggested. > > 8. > wait_for_primary_slot_catchup() > { > ... > + /* Check if this standby is promoted while we are waiting */ if > + (!RecoveryInProgress()) { > + /* > + * The remote slot didn't pass the locally reserved position at > + * the time of local promotion, so it's not safe to use. > + */ > + ereport( > + WARNING, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg( > + "slot-sync wait for slot %s interrupted by promotion, " > + "slot creation aborted", remote_slot->name))); pfree(cmd.data); return > + false; } > ... > } > > Shouldn't this be an Assert because a slot-sync worker shouldn't exist for > non-standby servers? Changed to Assert. > > 9. > wait_for_primary_slot_catchup() > { > ... > + slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); > + if (!tuplestore_gettupleslot(res->tuplestore, true, false, slot)) { > + ereport(WARNING, (errmsg("slot \"%s\" disappeared from the primary > + server," > + " slot creation aborted", remote_slot->name))); pfree(cmd.data); > + walrcv_clear_result(res); return false; > > If the slot on primary disappears, shouldn't this part of the code somehow > ensure to remove the slot on standby as well? If it is taken at some other point > in time then at least we should write a comment here to state how it is taken > care of. I think this comment also applies to a few other checks following this > check. I adjusted the code here to not persist the slots if the slot disappeared or invalidated on primary, so that the local slot will get dropped when releasing. > > 10. > + /* > + * It is possible to get null values for lsns and xmin if slot is > + * invalidated on the primary server, so handle accordingly. > + */ > + new_invalidated = DatumGetBool(slot_getattr(slot, 1, &isnull)); > > We can say LSN and Xmin in the above comment to make it easier to > read/understand. Changed. > > 11. > /* > + * Once we got valid restart_lsn, then confirmed_lsn and catalog_xmin > + * are expected to be valid/non-null, so assert if found null. > + */ > > No need to explicitly say about assert, it is clear from the code. We can slightly > change this comment to: "Once we got valid restart_lsn, then confirmed_lsn > and catalog_xmin are expected to be valid/non-null." Changed. > > 12. 
> + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn || > + TransactionIdPrecedes(remote_slot->catalog_xmin, > + MyReplicationSlot->data.catalog_xmin)) > + { > + if (!wait_for_primary_slot_catchup(wrconn, remote_slot)) { > + /* > + * The remote slot didn't catch up to locally reserved position. > + * But still persist it and attempt the wait and sync in next > + * sync-cycle. > + */ > + if (MyReplicationSlot->data.persistency != RS_PERSISTENT) { > + ReplicationSlotPersist(); *slot_updated = true; } > > I think the reason to persist in this case is because next time local restart_lsn can > be ahead than the current location and it can take more time to create such a > slot. We can probably mention the same in the comments. Updated the comments. Here is the V37 patch set which addressed comments above and [1]. [1] https://www.postgresql.org/message-id/CAA4eK1%2BP9R3GO2rwGBg2EOh%3DuYjWUSEOHD8yvs4Je8WYa2RHag%40mail.gmail.com Best Regards, Hou zj
Attachment
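To illustrate the "slot disappeared" handling adjusted above for comment 9 (slot name hypothetical):

    -- on the primary, a failover slot is dropped while the standby is still building
    -- its local copy
    SELECT pg_drop_replication_slot('failover_slot');
    -- with the v37 change, the standby's not-yet-persisted copy is simply released,
    -- and therefore dropped, at the end of that sync attempt instead of being persisted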
On Friday, November 17, 2023 7:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 16, 2023 at 5:34 PM shveta malik <shveta.malik@gmail.com> > wrote: > > > > PFA v35. > > > > Review v35-0002* > ============== Thanks for the comments. > 1. > As quoted in the commit message, > > > If a logical slot is invalidated on the primary, slot on the standby is also > invalidated. If a logical slot on the primary is valid but is invalidated on the > standby due to conflict (say required rows removed on the primary), then that > slot is dropped and recreated on the standby in next sync-cycle. > It is okay to recreate such slots as long as these are not consumable on the > standby (which is the case currently). > > > > I think this won't happen normally because of the physical slot and > hot_standby_feedback but probably can occur in cases like if the user > temporarily switches hot_standby_feedback from on to off. Are there any other > reasons? I think we can mention the cases along with it as well at least for now. > Additionally, I think this should be covered in code comments as well. I will collect all these cases and update in next version. > > 2. > #include "postgres.h" > - > +#include "access/genam.h" > > Spurious line removal. Removed. > > 3. > A password needs to be provided too, if the sender demands > password > authentication. It can be provided in the > <varname>primary_conninfo</varname> string, or in a separate > - <filename>~/.pgpass</filename> file on the standby server (use > - <literal>replication</literal> as the database name). > - Do not specify a database name in the > - <varname>primary_conninfo</varname> string. > + <filename>~/.pgpass</filename> file on the standby server. > + </para> > + <para> > + Specify <literal>dbname</literal> in > + <varname>primary_conninfo</varname> string to allow > synchronization > + of slots from the primary server to the standby server. > + This will only be used for slot synchronization. It is ignored > + for streaming. > > Is there a reason to remove part of the earlier sentence "use > <literal>replication</literal> as the database name"? Added it back. > > 4. > + <primary><varname>enable_syncslot</varname> configuration > parameter</primary> > + </indexterm> > + </term> > + <listitem> > + <para> > + It enables a physical standby to synchronize logical failover slots > + from the primary server so that logical subscribers are not blocked > + after failover. > + </para> > + <para> > + It is enabled by default. This parameter can only be set in the > + <filename>postgresql.conf</filename> file or on the server > command line. > + </para> > > I think you forgot to update the documentation for the default value of this > variable. Updated. > > 5. > + * a) start the logical replication workers for every enabled subscription > + * when not in standby_mode > + * b) start the slot-sync worker for logical failover slots synchronization > + * from the primary server when in standby_mode. > > Either use a full stop after both lines or none of these. Added a full stop. > > 6. > +static void slotsync_worker_cleanup(SlotSyncWorkerInfo * worker); > > There shouldn't be space between * and the worker. Removed, and added the type to typedefs.list. > > 7. 
> + if (!SlotSyncWorker->hdr.in_use) > + { > + LWLockRelease(SlotSyncWorkerLock); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("replication slot-sync worker not initialized, " > + "cannot attach"))); > + } > + > + if (SlotSyncWorker->hdr.proc) > + { > + LWLockRelease(SlotSyncWorkerLock); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("replication slot-sync worker is " > + "already running, cannot attach"))); > + } > > Using slot-sync in the error messages looks a bit odd to me. Can we use > "replication slot sync worker ..." in both these and other similar messages? I > think it would be better if we don't split the messages into multiple lines in > these cases as messages don't appear too long to me. Changed as suggested. > > 8. > +/* > + * Detach the worker from DSM and update 'proc' and 'in_use'. > + * Logical replication launcher will come to know using these > + * that the worker has shutdown. > + */ > +void > +slotsync_worker_detach(int code, Datum arg) { > > I think the reference to DSM is leftover from the previous version of the patch. > Can we change the above comments as per the new code? Changed. > > 9. > +static bool > +slotsync_worker_launch() > { > ... > + /* TODO: do we really need 'generation', analyse more here */ > + worker->hdr.generation++; > > We should do something about this TODO. As per my understanding, we don't > need a generation number for the slot sync worker as we have one such worker > but I guess the patch requires it because we are using existing logical > replication worker infrastructure. This brings the question of whether we really > need a separate SlotSyncWorkerInfo or if we can use existing > LogicalRepWorker and distinguish it with LogicalRepWorkerType? I guess you > didn't use it because most of the fields in LogicalRepWorker will be unused for > slot sync worker. Will think about this one and update in next version. > > 10. > + * Can't use existing functions like 'get_database_oid' from > +dbcommands.c for > + * validity purpose as they need db connection. > + */ > +static bool > +validate_dbname(const char *dbname) > > I don't know how important it is to validate the dbname before launching the > sync slot worker because anyway after launching, it will give an error while > initializing the connection if the dbname is invalid. But, if we think it is really > required, did you consider using GetDatabaseTuple()? Yes, we could export GetDatabaseTuple. Apart from this, I am thinking is it possible to release the restriction for the dbname. For example, slot sync worker could always connect to the 'template1' as the worker doesn't update the database objects. Although I didn't find some examples on server side, but some client commands(e.g. pg_upgrade) will connect to template1 to check some global objects. (Just FYI, the previous version patch used a replication command which may avoid the dbname but was replaced with SELECT to improve the flexibility and avoid introducing new command.) Best Regards, Hou zj
On Mon, Nov 20, 2023 at 6:51 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 11/20/23 11:59 AM, Amit Kapila wrote: > > On Mon, Nov 20, 2023 at 3:17 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> On 11/18/23 11:45 AM, Amit Kapila wrote: > >>> On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand > >>> <bertranddrouvot.pg@gmail.com> wrote: > >>>> > >>>> On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote: > >>>>> On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > >>>>> > >>>>> I feel the WaitForWALToBecomeAvailable may not be the best place to shutdown > >>>>> slotsync worker and drop slots. There could be other reasons(other than > >>>>> promotion) as mentioned in comments in case XLOG_FROM_STREAM to reach the code > >>>>> there. I thought if the intention is to stop slotsync workers on promotion, > >>>>> maybe FinishWalRecovery() is a better place to do it as it's indicating the end > >>>>> of recovery and XLogShutdownWalRcv is also called in it. > >>>> > >>>> I can see that slotsync_drop_initiated_slots() has been moved in FinishWalRecovery() > >>>> in v35. That looks ok. > >>>>> > >>> > >>> I was thinking what if we just ignore creating such slots (which > >>> require init state) in the first place? I think that can be > >>> time-consuming in some cases but it will reduce the complexity and we > >>> can always improve such cases later if we really encounter them in the > >>> real world. I am not very sure that added complexity is worth > >>> addressing this particular case, so I would like to know your and > >>> others' opinions. > >>> > >> > >> I'm not sure I understand your point. Are you saying that we should not create > >> slots on the standby that are "currently" reported in a 'i' state? (so just keep > >> the 'r' and 'n' states?) > >> > > > > Yes. > > > > As far the 'i' state here, from what I see, it is currently useful for: > > 1. Cascading standby to not sync slots with state = 'i' from > the first standby. > 2. Easily report Slots that did not catch up on the primary yet. > 3. Avoid inactive slots to block "active" ones creation. > > So not creating those slots should not be an issue for 1. (sync are > not needed on cascading standby as not created on the first standby yet) > but is an issue for 2. (unless we provide another way to keep track and report > such slots) and 3. (as I think we should still need to reserve WAL). > > I've a question: we'd still need to reserve WAL for those slots, no? > > If that's the case and if we don't call ReplicationSlotCreate() then ReplicationSlotReserveWal() > would not work as MyReplicationSlot would be NULL. > Yes, we need to reserve WAL to see if we can sync the slot. We are currently creating an RS_EPHEMERAL slot and if we don't explicitly persist it when we can't sync, then it will be dropped when we do ReplicationSlotRelease() at the end of synchronize_one_slot(). So, the loss is probably, the next time we again try to sync the slot, we need to again create it and may need to wait for newer restart_lsn on standby which could be avoided if we have the slot in 'i' state from the previous run. I don't deny the importance of having 'i' (initialized) state but was just trying to say that it has additional code complexity. OTOH, having it may give better visibility to even users about slots that are not active (say manually created slots on the primary). -- With Regards, Amit Kapila.
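To make the trade-off concrete, here is a rough sketch of the create-ephemeral-then-persist flow described above. This is not the patch's synchronize_one_slot(); the RemoteSlotSketch type and its fields are assumptions for illustration, and ReplicationSlotCreate() is shown with its pre-patch argument list.

#include "postgres.h"

#include "replication/slot.h"

/* Illustrative stand-in for the patch's remote-slot info; fields assumed. */
typedef struct RemoteSlotSketch
{
	char	   *name;
	bool		two_phase;
	XLogRecPtr	restart_lsn;
} RemoteSlotSketch;

static void
sync_one_slot_sketch(RemoteSlotSketch *remote_slot)
{
	/* The local copy starts out ephemeral. */
	ReplicationSlotCreate(remote_slot->name, true /* db_specific */ ,
						  RS_EPHEMERAL, remote_slot->two_phase);

	/* Reserve WAL so the local slot gets a valid restart_lsn. */
	ReplicationSlotReserveWal();

	/*
	 * Persist only once the remote slot has passed the locally reserved
	 * position; otherwise releasing the ephemeral slot below drops it, and
	 * the next sync cycle has to start over, as described above.
	 */
	if (remote_slot->restart_lsn >= MyReplicationSlot->data.restart_lsn)
		ReplicationSlotPersist();

	ReplicationSlotRelease();
}

The alternative being discussed (the 'i' state) persists the slot even when it has not caught up yet, so the next sync cycle can pick up where it left off instead of starting from scratch.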
Here are some review comments for the patch v35-0001. ====== 0. GENERAL documentation I felt that the documentation gave details of the individual changes (e.g. GUC 'standby_slot_names' and API, CREATE SUBSCRIPTION option, and pg_replication_slots 'failover' attribute etc.) but there is nothing that seemed to bring all these parts together to give examples for user "when" and "how" to make all these parts work. I'm not sure if there is some overview missing from this patch 00001 or if you are planning that extra documentation for subsequent patches. ====== Commit message 1. A new property 'failover' is added at the slot level which is persistent information which specifies that this logical slot is enabled to be synced to the physical standbys so that logical replication can be resumed after failover. It is always false for physical slots. ~ SUGGESTION A new property 'failover' is added at the slot level. This is persistent information to indicate that this logical slot... ~~~ 2. Users can set it during the create subscription or during pg_create_logical_replication_slot. Examples: create subscription mysub connection '..' publication mypub WITH (failover = true); --last arg SELECT * FROM pg_create_logical_replication_slot('myslot', 'pgoutput', false, true, true); ~ 2a. Add a blank line before this ~ 2b. Use uppercase for the SQL ~ 2c. SUGGESTION Users can set this flag during CREATE SUBSCRIPTION or during pg_create_logical_replication_slot API. Ex1. CREATE SUBSCRIPTION mysub CONNECTION '...' PUBLICATION mypub WITH (failover = true); Ex2. (failover is the last arg) SELECT * FROM pg_create_logical_replication_slot('myslot', 'pgoutput', false, true, true); ~~~ 3. This 'failover' is displayed as part of pg_replication_slots view. ~ SUGGESTION The value of the 'failover' flag is displayed as part of pg_replication_slots view. ~~~ 4. A new GUC standby_slot_names has been added. It is the list of physical replication slots that logical replication with failover enabled waits for. The intent of this wait is that no logical replication subscribers (with failover=true) should go ahead of physical replication standbys (corresponding to the physical slots in standby_slot_names). ~ 4a. SUGGESTION A new GUC standby_slot_names has been added. This is a list of physical replication slots that logical replication with failover enabled will wait for. ~ 4b. /no logical replication subscribers/no logical replication subscriptions/ ~ 4c /should go ahead of physical/should get ahead of physical/ ====== contrib/test_decoding/sql/slot.sql 5. + +-- Test logical slots creation with 'failover'=true (last arg) +SELECT 'init' FROM pg_create_logical_replication_slot('failover_slot', 'test_decoding', false, false, true); +SELECT slot_name, slot_type, failover FROM pg_replication_slots; + +SELECT pg_drop_replication_slot('failover_slot'); How about a couple more simple tests: a) pass false arg to confirm it is false in the view. b) according to the docs this failover is optional, so try API without passing it c) create a physical slot to confirm it is false in the view. ====== doc/src/sgml/catalogs.sgml 6. 
+ <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>subfailoverstate</structfield> <type>char</type> + </para> + <para> + State codes for failover mode: + <literal>d</literal> = disabled, + <literal>p</literal> = pending enablement, + <literal>e</literal> = enabled + </para></entry> + </row> + This attribute is very similar to the 'subtwophasestate' so IMO it would be better to be adjacent to that one in the docs. (probably this means putting it in the same order in the catalog also, assuming that is allowed) ====== doc/src/sgml/config.sgml 7. + <para> + List of physical replication slots that logical replication slots with + failover enabled waits for. If a logical replication connection is + meant to switch to a physical standby after the standby is promoted, + the physical replication slot for the standby should be listed here. + </para> + <para> + The standbys corresponding to the physical replication slots in + <varname>standby_slot_names</varname> must enable + <varname>enable_syncslot</varname> for the standbys to receive + failover logical slots changes from the primary. + </para> That sentence mentioning 'enable_syncslot' seems premature because AFAIK that GUC is not introduced until patch 0002. So this part should be moved into the 0002 patch. ====== doc/src/sgml/ref/alter_subscription.sgml 8. These commands also cannot be executed when the subscription has <link linkend="sql-createsubscription-params-with-two-phase"><literal>two_phase</literal></link> - commit enabled, unless + commit enabled or + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> + enabled, unless <link linkend="sql-createsubscription-params-with-copy-data"><literal>copy_data</literal></link> is <literal>false</literal>. See column <structfield>subtwophasestate</structfield> - of <link linkend="catalog-pg-subscription"><structname>pg_subscription</structname></link> + and <structfield>subfailoverstate</structfield> of + <link linkend="catalog-pg-subscription"><structname>pg_subscription</structname></link> to know the actual two-phase state. I think the last sentence doesn't make sense anymore because it is no longer talking about only two-phase state. BEFORE See column subtwophasestate and subfailoverstate of pg_subscription to know the actual two-phase state. SUGGESTION See column subtwophasestate and subfailoverstate of pg_subscription to know the actual states. ====== doc/src/sgml/ref/create_subscription.sgml 9. + + <varlistentry id="sql-createsubscription-params-with-failover"> + <term><literal>failover</literal> (<type>boolean</type>)</term> + <listitem> + <para> + Specifies whether the replication slot assocaited with the subscription + is enabled to be synced to the physical standbys so that logical + replication can be resumed from the new primary after failover. + The default is <literal>false</literal>. + </para> + + <para> + The implementation of failover requires that replication + has successfully finished the initial table synchronization + phase. So even when <literal>failover</literal> is enabled for a + subscription, the internal failover state remains + temporarily <quote>pending</quote> until the initialization phase + completes. See column <structfield>subfailoverstate</structfield> + of <link linkend="catalog-pg-subscription"><structname>pg_subscription</structname></link> + to know the actual failover state. + </para> + + </listitem> + </varlistentry> 9a. /assocaited/associated/ ~ 9b. 
Unnecessary blank line before </listitem> ====== src/backend/commands/subscriptioncmds.c 10. #define SUBOPT_ORIGIN 0x00004000 +#define SUBOPT_FAILOVER 0x00008000 Bad indentation ~~~ 11. CreateSubscription + /* + * If only the slot_name is specified, it is possible that the user intends to + * use an existing slot on the publisher, so here we enable failover for the + * slot if requested. + */ + else if (opts.slot_name && failover_enabled) + { + walrcv_alter_slot(wrconn, opts.slot_name, opts.failover); + ereport(NOTICE, + (errmsg("enabled failover for replication slot \"%s\" on publisher", + opts.slot_name))); + } 11a. How does this code ensure that *only* slot_name was set (e.g the comment says "only the slot_name is specified") ~ 11b. Should 3rd arg to walrcv_alter_slot be 'failover_enabled', or maybe just 'true'? ~~~ 12. AlterSubscription + if (sub->failoverstate == LOGICALREP_FAILOVER_STATE_ENABLED && opts.copy_data) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when failover is enabled"), + errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false, or use DROP/CREATE SUBSCRIPTION."))); 12a. This should have a comment like what precedes the sub->twophasestate error. Or maybe group them both and use the same common comment. ~ 12b. AFAIK when there are messages like this that differ only by non-translatable things ("failover" option) then that non-translatable thing should be extracted as a parameter so the messages are common. And, don't forget to add a /* translator: %s is a subscription option like 'failover' */ comment. SUGGESTION like: errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when %s is enabled", "two_phase") errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when %s is enabled", "failover") ~~~ 13. + if (sub->failoverstate == LOGICALREP_FAILOVER_STATE_ENABLED && opts.copy_data) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when failover is enabled"), + /* translator: %s is an SQL ALTER command */ + errhint("Use %s with refresh = false, or with copy_data = false, or use DROP/CREATE SUBSCRIPTION.", + isadd ? + "ALTER SUBSCRIPTION ... ADD PUBLICATION" : + "ALTER SUBSCRIPTION ... DROP PUBLICATION"))); Same comment as above #12b. SUGGESTION like: errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when %s is enabled", "two_phase") errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when %s is enabled", "failover") ~~~ 14. + /* + * See comments above for twophasestate, same holds true for + * 'failover' + */ + if (sub->failoverstate == LOGICALREP_FAILOVER_STATE_ENABLED && opts.copy_data) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when failover is enabled"), + errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false, or use DROP/CREATE SUBSCRIPTION."))); IMO this is another message where the option should be extracted to make a common message for the translators. And don't forget to add a /* translator: %s is a subscription option like 'failover' */ comment. SUGGESTION like: errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when %s is enabled", "two_phase"), errmsg("ALTER SUBSCRIPTION ... 
REFRESH with copy_data is not allowed when %s is enabled", "failover"), ====== .../libpqwalreceiver/libpqwalreceiver.c 15. libpqrcv_create_slot + if (failover) + { + appendStringInfoString(&cmd, "FAILOVER"); + if (use_new_options_syntax) + appendStringInfoString(&cmd, ", "); + else + appendStringInfoChar(&cmd, ' '); + } 15a. Isn't failover a new option that is unsupported pre-PG17? Why is it necessary to support an old-style syntax for something that was not supported on old servers? (I'm confused). ~ 15b. Also IIRC, this FAILOVER wasn't listed in the old-style syntax of doc/src/sgml/protocol.sgml. Was that deliberate? ====== .../replication/logical/logicalfuncs.c 16. pg_logical_slot_get_changes_guts + if (XLogRecPtrIsInvalid(upto_lsn)) + wal_to_wait = end_of_wal; + else + wal_to_wait = Min(upto_lsn, end_of_wal); + + /* + * Wait for specified streaming replication standby servers (if any) + * to confirm receipt of WAL upto wal_to_wait. + */ + WalSndWaitForStandbyConfirmation(wal_to_wait); + 16a. /WAL upto wal_to_wait./WAL up to wal_to_wait./ ~ 16b. Is there another name for this variable (wal_to_wait) that conveys more meaning? Maybe 'wal_received_pos' or 'wait_for_wal_lsn' or something better. ====== src/backend/replication/logical/tablesync.c 17. process_syncing_tables_for_apply CommandCounterIncrement(); /* make updates visible */ if (AllTablesyncsReady()) { + char buf[100]; + + buf[0] = '\0'; + + if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING) + strcat(buf, "twophase"); + if (MySubscription->failoverstate == LOGICALREP_FAILOVER_STATE_PENDING) + { + if (buf[0] != '\0') + strcat(buf, " and "); + strcat(buf, "failover"); + } + ereport(LOG, - (errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled", - MySubscription->name))); + (errmsg("logical replication apply worker for subscription \"%s\" will restart so that %s can be enabled", + MySubscription->name, buf))); should_exit = true; } ~ IMO you cannot build up a log buffer using " and " like this because the translation would be a mess. IIUC, you might have to do it the long way with multiple errmsg. SUGGESTION twophase_pending = MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING; failover_pending = MySubscription->failoverstate == LOGICALREP_FAILOVER_STATE_PENDING; if (twophase_pending || failover_pending) ereport(LOG, twophase_pending && failover_pending /* translator: 'two_phase' or 'failover' are subscription options */ ? errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase and failover can be enabled", MySubscription->name) : errmsg("logical replication apply worker for subscription \"%s\" will restart so that %s can be enabled", MySubscription->name, twophase_pending ? "two_phase" : "failover")); ~~~ 18. UpdateTwoPhaseFailoverStates -UpdateTwoPhaseState(Oid suboid, char new_state) +UpdateTwoPhaseFailoverStates(Oid suboid, + bool update_twophase, char new_state_twophase, + bool update_failover, char new_state_failover) Although this function is written to update to *any* specified state, in practice it only ever seems called to update from PENDING to ENABLE state and nothing else. Therefore it can be simplified by not even passing those states, and by changing the function name like 'EnableTwoPhaseFailoverTriState' ====== src/backend/replication/logical/worker.c 19. File header comment There is a lot of talk here about two_phase tri-state and the special ALTER REFRESH considerations for the two-phase transactions.
IIUC, there should be lots of similar commentary for the failover tri-sate and ALTER REFRESH. ~~~ 20. * PENDING, which allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to * work. */ - if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING && - AllTablesyncsReady()) + twophase_pending = (MySubscription->twophasestate + == LOGICALREP_TWOPHASE_STATE_PENDING) ? true : false; + failover_pending = (MySubscription->failoverstate + == LOGICALREP_FAILOVER_STATE_PENDING) ? true : false; + The comment preceding this is only talking about 'two_phase', so should be expanded to mention also 'failover' ~~~ 21. run_apply_worker + twophase_pending = (MySubscription->twophasestate + == LOGICALREP_TWOPHASE_STATE_PENDING) ? true : false; + failover_pending = (MySubscription->failoverstate + == LOGICALREP_FAILOVER_STATE_PENDING) ? true : false; These ternaries are not necessary. SUGGESTION (has the same meaning) twophase_pending = (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING); failover_pending = (MySubscription->failoverstate == LOGICALREP_FAILOVER_STATE_PENDING); ~~~ 22. - UpdateTwoPhaseState(MySubscription->oid, LOGICALREP_TWOPHASE_STATE_ENABLED); - MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED; + + /* Update twophase and/or failover */ + if (twophase_pending || failover_pending) + UpdateTwoPhaseFailoverStates(MySubscription->oid, + twophase_pending, + LOGICALREP_TWOPHASE_STATE_ENABLED, + failover_pending, + LOGICALREP_FAILOVER_STATE_ENABLED); + if (twophase_pending) + MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED; + + if (failover_pending) + MySubscription->failoverstate = LOGICALREP_FAILOVER_STATE_ENABLED; There seem rather too many checks for 'twophase_pending' and 'failover_pending'. With some refactoring this could be done with less code I think. For example, 1. Unconditionally call UpdateTwoPhaseFailoverStates() but just quick return if nothing to do 2. Pass address of MySubscription->twophasestate/failoverstate, and let function UpdateTwoPhaseFailoverStates() set those ====== src/backend/replication/slot.c 23. +char *standby_slot_names; +static List *standby_slot_names_list = NIL; Should there be a comment for the new GUC? ~~~ 24. ReplicationSlotAlter +/* + * Change the definition of the slot identified by the passed in name. + */ +void +ReplicationSlotAlter(const char *name, bool failover) /passed in/specified/ /the definition/the failover state/ ~~~ 25. validate_standby_slots + +/* + * A helper function to validate slots specified in standby_slot_names GUCs. + */ +static bool +validate_standby_slots(char **newval) /in standby_slot_names GUCs./in GUC standby_slot_names./ ~ 26. validate_standby_slots + /* + * Verify 'type' of slot now. + * + * Skip check if replication slots' data is not initialized yet i.e. we + * are in startup process. + */ + if (!ReplicationSlotCtl) + return true; 26a. This code seems to neglect doing memory cleanup. + pfree(rawname); + list_free(elemlist); ~ 26b. Indeed, most of this function's return points seem to be neglecting some memory cleanup, so IMO it would be better to write this function with some common goto labels that do all this common cleanup: SUGGESTION ret_standby_slot_names_ok: pfree(rawname); list_free(elemlist); return true; ret_standby_slot_names_ng: pfree(rawname); list_free(elemlist); return false; ~ 27. 
validate_standby_slots + if (SlotIsLogical(slot)) + { + GUC_check_errdetail("cannot have logical replication slot \"%s\" " + "in this parameter", name); + list_free(elemlist); + return false; + } IIUC, the GUC is for physical replication slots only, so somehow I felt it was better to keep everything from that (physical) perspective. YMMV. SUGGESTION if (!SlotIsPhysical(slot)) { GUC_check_errdetail("\"%s\" is not a physical replication slot", name); list_free(elemlist); return false; } ~~~ 28. check_standby_slot_names +bool +check_standby_slot_names(char **newval, void **extra, GucSource source) +{ + if (strcmp(*newval, "") == 0) + return true; + + /* + * "*" is not accepted as in that case primary will not be able to know + * for which all standbys to wait for. Even if we have physical-slots + * info, there is no way to confirm whether there is any standby + * configured for the known physical slots. + */ + if (strcmp(*newval, "*") == 0) + { + GUC_check_errdetail("\"%s\" is not accepted for standby_slot_names", + *newval); + return false; + } + + /* Now verify if the specified slots really exist and have correct type */ + if (!validate_standby_slots(newval)) + return false; + + *extra = guc_strdup(ERROR, *newval); + + return true; +} Is it really necessary to have a special test for the special value "*" which you are going to reject? I don't see why this should be any different from checking for other values like "." or "$" or "?" etc. Why not just let validate_standby_slots() handle all of these? ~~~ 29. assign_standby_slot_names + /* No value is specified for standby_slot_names. */ + if (standby_slot_names_cpy == NULL) + return; Is this possible? IIUC the check_standby_slot_names() did: *extra = guc_strdup(ERROR, *newval); Maybe this code also needs a similar elog and comment like already in this function: /* This should not happen if GUC checked check_standby_slot_names. */ ~ 30. assign_standby_slot_names + char *standby_slot_names_cpy = extra; IIUC, the 'extra' was unconditionally guc_strdup()'ed in the check hook, so should we also free it here before leaving this function? ~~~ 31. GetStandbySlotList +/* + * Return a copy of standby_slot_names_list if the copy flag is set to true, + * otherwise return the original list. + */ +List * +GetStandbySlotList(bool copy) +{ + if (copy) + return list_copy(standby_slot_names_list); + else + return standby_slot_names_list; +} Why is this better than just exposing the standby_slot_names_list. The caller can make a copy or not. e.g. why is calling GetStandbySlotList(true) better than just doing list_copy(standby_slot_names_list)? ====== src/backend/replication/walsender.c 32. parseCreateReplSlotOptions static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo, TimeLineID *tli_p); - /* Initialize walsender process before entering the main command loop */ void ~ Unnecessary changing of whitespace unrelated to this patch. ~~~ 33. WalSndWakeupNeeded +/* + * Does this Wal Sender need to wake up logical walsender. + * + * Check if the physical slot of this walsender is specified in + * standby_slot_names GUC. + */ +static bool +WalSndWakeupNeeded() /Wal Sender/physical walsender process/ (maybe??) ~~~ 34. WalSndFilterStandbySlots + /* Log warning if no active_pid for this physical slot */ + if (slot->active_pid == 0) + ereport(WARNING, Other nearby code is guarding the slot in case it was NULL, so why not here? Is it a potential NPE? ~~~ 35. + /* + * If logical slot name is given in standby_slot_names, give WARNING + * and skip it. 
Since it is harmless, so WARNING should be enough, no + * need to error-out. + */ + else if (SlotIsLogical(slot)) + warningfmt = _("cannot have logical replication slot \"%s\" in parameter \"%s\", ignoring"); Is this possible? Doesn't the function 'validate_standby_slots' called by the GUC hook prevent specifying logical slots in the GUC? Maybe this warning should be changed to Assert? ~~~ 36. + /* + * Reaching here indicates that either the slot has passed the + * wait_for_lsn or there is an issue with the slot that requires a + * warning to be reported. + */ + if (warningfmt) + ereport(WARNING, errmsg(warningfmt, name, "standby_slot_names")); + + standby_slots_cpy = foreach_delete_current(standby_slots_cpy, lc); If something was wrong with the slot that required a warning, is it really OK to remove this slot from the list? This seems contrary to the function comment which only talks about removing slots that have caught up. ~~~ 37. WalSndWaitForStandbyConfirmation +/* + * Wait for physical standby to confirm receiving given lsn. + * + * Here logical walsender associated with failover logical slot waits + * for physical standbys corresponding to physical slots specified in + * standby_slot_names GUC. + */ /given/the given/ ~~~ 38. WalSndWaitForStandbyConfirmation + ConditionVariablePrepareToSleep(&WalSndCtl->wal_confirm_rcv_cv); + + for (;;) This ConditionVariablePrepareToSleep was already called in the WalSndWait() function. Did it need to be called 2 times? ~~~ 39. + WalSndFilterStandbySlots(wait_for_lsn, &standby_slots); + + /* Exit if done waiting for every slot. */ + if (standby_slots == NIL) + break; + + CHECK_FOR_INTERRUPTS(); + + if (ConfigReloadPending) + { + ConfigReloadPending = false; + WalSndRereadConfigAndSlots(&standby_slots); + } Shouldn't all the config reload stuff come first before the filter and NIL check, just in case after the reload there is nothing to do? Otherwise, it might cause unnecessary sleep. ~~~ 40. WalSndWaitForWal /* * Wait till WAL < loc is flushed to disk so it can be safely sent to client. * - * Returns end LSN of flushed WAL. Normally this will be >= loc, but - * if we detect a shutdown request (either from postmaster or client) - * we will return early, so caller must always check. + * If the walsender holds a logical slot that has enabled failover, the + * function also waits for all the specified streaming replication standby + * servers to confirm receipt of WAL upto RecentFlushPtr. + * + * Returns end LSN of flushed WAL. Normally this will be >= loc, but if we + * detect a shutdown request (either from postmaster or client) we will return + * early, so caller must always check. */ static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc) ~ /upto/up to/ ~~~ 41. /* - * Fast path to avoid acquiring the spinlock in case we already know we - * have enough WAL available. This is particularly interesting if we're - * far behind. + * Check if all the standby servers have confirmed receipt of WAL upto + * RecentFlushPtr if we already know we have enough WAL available. + * + * Note that we cannot directly return without checking the status of + * standby servers because the standby_slot_names may have changed, which + * means there could be new standby slots in the list that have not yet + * caught up to the RecentFlushPtr. */ if (RecentFlushPtr != InvalidXLogRecPtr && loc <= RecentFlushPtr) - return RecentFlushPtr; + { + WalSndFilterStandbySlots(RecentFlushPtr, &standby_slots); 41a. /upto/up to/ ~ 41b. 
IMO there is some missing information in this comment because it wasn't clear to me that calling WalSndFilterStandbySlots was going to side-efect that list to give it a different meaning. e.g. it seems it no longer means "standby slots" but instead means something like "standby slots that are not caught up". Perhaps that local variable can have a name that helps to convey that better? ~~~ 42. + /* + * Fast path to entering the loop in case we already know we have + * enough WAL available and all the standby servers has confirmed + * receipt of WAL upto RecentFlushPtr. This is particularly + * interesting if we're far behind. + */ + if (standby_slots == NIL) + return RecentFlushPtr; 42a. /has/have/ ~ 42b. For entering what loop? There's no context for this comment. I assume it means the loop that comes later in this function, but then isn't this a typo? /Fast path to entering the loop/Fast path to avoid entering the loop/. Alternatively, just don't even mention the loop - just say "Quick return" etc. ~~~ 43. WalSndWait -WalSndWait(uint32 socket_events, long timeout, uint32 wait_event) +WalSndWait(uint32 socket_events, long timeout, uint32 wait_event, + bool wait_for_standby) Does this need the 'wait_for_standby' parameter? AFAICT this was only set true when the event enum was WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION, so why do we need an extra boolean to be passed when there is already enough information in the event to know when it is waiting for standby? ~~~ 44. + if (wait_for_standby) + ConditionVariablePrepareToSleep(&WalSndCtl->wal_confirm_rcv_cv); + else if (MyWalSnd->kind == REPLICATION_KIND_PHYSICAL) ConditionVariablePrepareToSleep(&WalSndCtl->wal_flush_cv); else if (MyWalSnd->kind == REPLICATION_KIND_LOGICAL) ConditionVariablePrepareToSleep(&WalSndCtl->wal_replay_cv); ~ A walsender is either physical or logical, but here the 'wait_for_standby' flag overrides everything. Is it OK for this to be if/else/else or should this code call for wal_confirm_rcv_cv AND the other one? e.g. The function comment for WalSndWaitForWal said "the function also waits..." ====== src/backend/utils/misc/guc_tables.c 45. + {"standby_slot_names", PGC_SIGHUP, REPLICATION_PRIMARY, + gettext_noop("List of streaming replication standby server slot " + "names that logical walsenders waits for."), /walsenders waits for./walsender processes will wait for./ ====== src/backend/utils/misc/postgresql.conf.sample 46. +#standby_slot_names = '' # streaming replication standby server slot names that + # logical walsenders waits for (same as the msg in guc_tables) /walsenders waits for/walsender processes will wait for/ ====== src/bin/pg_upgrade/info.c 47. @@ -681,6 +681,7 @@ get_old_cluster_logical_slot_infos(DbInfo *dbinfo, bool live_check) int i_twophase; int i_caught_up; int i_invalid; + int i_failover; ~ IMO it would be better if all these were coded to use the same order as the SQL -- so put each of the "failover" code immediately after the 'two_phase" code. ~~~ 48. @@ -689,6 +690,7 @@ get_old_cluster_logical_slot_infos(DbInfo *dbinfo, bool live_check) i_twophase = PQfnumber(res, "two_phase"); i_caught_up = PQfnumber(res, "caught_up"); i_invalid = PQfnumber(res, "invalid"); + i_failover = PQfnumber(res, "failover"); ~ ditto #47. ~~~ 49. 
@@ -699,6 +701,7 @@ get_old_cluster_logical_slot_infos(DbInfo *dbinfo, bool live_check) curr->two_phase = (strcmp(PQgetvalue(res, slotnum, i_twophase), "t") == 0); curr->caught_up = (strcmp(PQgetvalue(res, slotnum, i_caught_up), "t") == 0); curr->invalid = (strcmp(PQgetvalue(res, slotnum, i_invalid), "t") == 0); + curr->failover = (strcmp(PQgetvalue(res, slotnum, i_failover), "t") == 0); ~ ditto #47. ====== src/bin/pg_upgrade/pg_upgrade.c 50. + if (GET_MAJOR_VERSION(new_cluster.major_version) >= 1700) + appendPQExpBuffer(query, ", false, %s, %s);", + slot_info->two_phase ? "true" : "false", + slot_info->failover ? "true" : "false"); + else + appendPQExpBuffer(query, ", false, %s);", + slot_info->two_phase ? "true" : "false"); IMO this would be easier to read if it was written the other way around like if (GET_MAJOR_VERSION(new_cluster.major_version) < 1700) ... old args else ... new args ====== src/bin/pg_upgrade/pg_upgrade.h 51. + bool failover; /* is the slot designated to be synced + * to the physical standby */ } LogicalSlotInfo; The comment is missing a question mark (?) which the others have. ====== src/bin/psql/describe.c 52. ", suborigin AS \"%s\"\n" ", subpasswordrequired AS \"%s\"\n" - ", subrunasowner AS \"%s\"\n", + ", subrunasowner AS \"%s\"\n" + ", subfailoverstate AS \"%s\"\n", gettext_noop("Origin"), gettext_noop("Password required"), - gettext_noop("Run as owner?")); + gettext_noop("Run as owner?"), + gettext_noop("Enable failover?")); I don't think "Enable failover?" should have a question mark. IMO "run as owner?" is the odd one out so should not have been copied. Anyway, the subfailoverstate is a 'state' rather than a simple boolean, so it should be more like subtwophasestate than anything else. ====== src/bin/psql/tab-complete.c 53. COMPLETE_WITH("binary", "connect", "copy_data", "create_slot", "disable_on_error", "enabled", "origin", "password_required", "run_as_owner", "slot_name", - "streaming", "synchronous_commit", "two_phase"); + "streaming", "synchronous_commit", "two_phase", + "failover"); All these tab completion options are supposed to be in alphabetical order, so this 'failover' has been added in the wrong position. ====== src/include/catalog/pg_subscription.h 54. /* * two_phase tri-state values. See comments atop worker.c to know more about * these states. */ #define LOGICALREP_TWOPHASE_STATE_DISABLED 'd' #define LOGICALREP_TWOPHASE_STATE_PENDING 'p' #define LOGICALREP_TWOPHASE_STATE_ENABLED 'e' #define LOGICALREP_FAILOVER_STATE_DISABLED 'd' #define LOGICALREP_FAILOVER_STATE_PENDING 'p' #define LOGICALREP_FAILOVER_STATE_ENABLED 'e' ~ 54a. There should either be another comment (like the 'two_phase tri-state' one) added for the FAILOVER states or that existing comment should be expanded so that it also mentions the 'failover' tri-states. ~ 54b. Idea: If you are willing to change the constant names (not the values) of the current tri-states then now both the 'two_phase' and 'failover' could share them -- I also think this might give the ability to create macros (if wanted) or to share more code instead of always handling failover and two_phase separately. SUGGESTION #define LOGICALREP_TRISTATE_DISABLED 'd' #define LOGICALREP_TRISTATE_PENDING 'p' #define LOGICALREP_TRISTATE_ENABLED 'e' ~ 54c. The header comment at the top of worker.c should give more details about the 'failover' tri-state. (also mentioned in another review comment) ~~~ 55.
FormData_pg_subscription + char subfailoverstate; /* Enable Failover State */ + /Enable Failover State/Failover state/ ====== src/include/replication/slot.h 56. + + /* + * Is this a failover slot (sync candidate for physical standbys)? + * Relevant for logical slots on the primary server. + */ + bool failover; } ReplicationSlotPersistentData; ~ /Relevant/Only relevant/ ====== src/include/replication/walreceiver.h 57. +#define walrcv_create_slot(conn, slotname, temporary, two_phase, failover, snapshot_action, lsn) \ + WalReceiverFunctions->walrcv_create_slot(conn, slotname, temporary, two_phase, failover, snapshot_action, lsn) double whitespace after the 'failover' parameter? ====== src/include/replication/walsender_private.h 58. ConditionVariable wal_flush_cv; ConditionVariable wal_replay_cv; + ConditionVariable wal_confirm_rcv_cv; Should this new field have a comment? Or should it be grouped with the 2 preceding fields (if that same group comment is valid for all of them)? ====== Kind Regards, Peter Smith. Fujitsu Australia
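Regarding comment 0 at the top of this review, the kind of end-to-end example the documentation could give might look like the sketch below; connection strings and object names are illustrative, and the option/GUC names are the ones used by this patch version, so they may still change.

-- On the primary: make failover-enabled logical walsenders wait for the
-- physical standby's slot.
ALTER SYSTEM SET standby_slot_names = 'physical_standby_slot';
SELECT pg_reload_conf();

-- On the subscriber: create the subscription with failover enabled.
CREATE SUBSCRIPTION mysub CONNECTION 'host=primary dbname=postgres'
    PUBLICATION mypub WITH (failover = true);

-- Back on the primary: the subscription's slot is now marked as a failover
-- candidate.
SELECT slot_name, slot_type, failover FROM pg_replication_slots;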
On Tue, Nov 21, 2023 at 10:01 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Saturday, November 18, 2023 6:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote: > > > > On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > > I feel the WaitForWALToBecomeAvailable may not be the best place to > > > > shutdown slotsync worker and drop slots. There could be other > > > > reasons(other than > > > > promotion) as mentioned in comments in case XLOG_FROM_STREAM to > > > > reach the code there. I thought if the intention is to stop slotsync > > > > workers on promotion, maybe FinishWalRecovery() is a better place to > > > > do it as it's indicating the end of recovery and XLogShutdownWalRcv is also > > called in it. > > > > > > I can see that slotsync_drop_initiated_slots() has been moved in > > > FinishWalRecovery() in v35. That looks ok. > > > > > > > > > > More Review for v35-0002* > > Thanks for the comments. > > > ============================ > > 1. > > + ereport(WARNING, > > + errmsg("skipping slots synchronization as primary_slot_name " > > + "is not set.")); > > > > There is no need to use a full stop at the end for WARNING messages and as > > previously mentioned, let's not split message lines in such cases. There are > > other messages in the patch with similar problems, please fix those as well. > > Adjusted. > > > > > 2. > > +slotsync_checks() > > { > > ... > > ... > > + /* The hot_standby_feedback must be ON for slot-sync to work */ if > > + (!hot_standby_feedback) { ereport(WARNING, errmsg("skipping slots > > + synchronization as hot_standby_feedback " > > + "is off.")); > > > > This message has the same problem as mentioned in the previous comment. > > Additionally, I think either atop slotsync_checks or along with GUC check we > > should write comments as to why we expect these values to be set for slot sync > > to work. > > Added comments for these cases. > > > > > 3. > > + /* The worker is running already */ > > + if (SlotSyncWorker &&SlotSyncWorker->hdr.in_use && > > + SlotSyncWorker->hdr.proc) > > > > The spacing for both the &&'s has problems. You need a space after the first > > && and the second && should be in the prior line. > > Adjusted. > > > > > 4. > > + LauncherRereadConfig(&recheck_slotsync); > > + > > } > > > > An empty line after LauncherRereadConfig() is not required. > > > > 5. > > +static void > > +LauncherRereadConfig(bool *ss_recheck) > > +{ > > + char *conninfo = pstrdup(PrimaryConnInfo); > > + char *slotname = pstrdup(PrimarySlotName); > > + bool syncslot = enable_syncslot; > > + bool feedback = hot_standby_feedback; > > > > Can we change the variable name 'feedback' to 'standbyfeedback' to make it > > slightly more descriptive? > > Changed. > > > > > 6. The logic to recheck the slot_sync related parameters in > > LauncherMain() is not very clear. IIUC, if after reload config any parameter is > > changed, we just seem to be checking the validity of the changed parameter > > but not restarting the slot sync worker, is that correct? If so, what if dbname is > > changed, don't we need to restart the slot-sync worker and re-initialize the > > connection; similarly slotname change also needs some thoughts. 
Also, if all the > > parameters are valid we seem to be re-launching the slot-sync worker without > > first stopping it which doesn't seem correct, am I missing something in this > > logic? > > I think the slot sync worker will be stopped in LauncherRereadConfig() if GUC changed > and new slot sync worker will be started in next loop in LauncherMain(). Yes, LauncherRereadConfig will stop the worker on any parameter change and will set recheck_slotsync(). On finding this flag set, LauncherMain will redo all the validations and restart the slot-sync worker if needed. Yes, we do need to stop and relaunch the slot-sync worker on a dbname change as well; this is currently missing in LauncherRereadConfig(). Regarding a slot name change, we already handle it: PrimarySlotName is checked in LauncherRereadConfig() > > > > 7. > > @@ -524,6 +525,25 @@ CreateDecodingContext(XLogRecPtr start_lsn, > > errmsg("replication slot \"%s\" was not created in this database", > > NameStr(slot->data.name)))); > > > > + in_recovery = RecoveryInProgress(); > > + > > + /* > > + * Do not allow consumption of a "synchronized" slot until the standby > > + * gets promoted. Also do not allow consumption of slots with > > + sync_state > > + * as SYNCSLOT_STATE_INITIATED as they are not synced completely to be > > + * used. > > + */ > > + if ((in_recovery && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) || > > + slot->data.sync_state == SYNCSLOT_STATE_INITIATED) > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("cannot use replication slot \"%s\" for logical decoding", > > + NameStr(slot->data.name)), in_recovery ? > > + errdetail("This slot is being synced from the primary server.") : > > + errdetail("This slot was not synced completely from the primary > > + server."), errhint("Specify another replication slot."))); > > + > > > > If we are planning to drop slots in state SYNCSLOT_STATE_INITIATED at the > > time of promotion, don't we need to just have an assert or elog(ERROR, .. for > > non-recovery cases as such cases won't be reachable? If so, I think we can > > separate out that case here. > > Adjusted the codes as suggested. > > > > > 8. > > wait_for_primary_slot_catchup() > > { > > ... > > + /* Check if this standby is promoted while we are waiting */ if > > + (!RecoveryInProgress()) { > > + /* > > + * The remote slot didn't pass the locally reserved position at > > + * the time of local promotion, so it's not safe to use. > > + */ > > + ereport( > > + WARNING, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg( > > + "slot-sync wait for slot %s interrupted by promotion, " > > + "slot creation aborted", remote_slot->name))); pfree(cmd.data); return > > + false; } > > ... > > } > > > > Shouldn't this be an Assert because a slot-sync worker shouldn't exist for > > non-standby servers? > > Changed to Assert. > > > > > 9. > > wait_for_primary_slot_catchup() > > { > > ... > > + slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); > > + if (!tuplestore_gettupleslot(res->tuplestore, true, false, slot)) { > > + ereport(WARNING, (errmsg("slot \"%s\" disappeared from the primary > > + server," > > + " slot creation aborted", remote_slot->name))); pfree(cmd.data); > > + walrcv_clear_result(res); return false; > > > > If the slot on primary disappears, shouldn't this part of the code somehow > > ensure to remove the slot on standby as well?
If it is taken at some other point > > in time then at least we should write a comment here to state how it is taken > > care of. I think this comment also applies to a few other checks following this > > check. > > I adjusted the code here to not persist the slots if the slot disappeared or invalidated > on primary, so that the local slot will get dropped when releasing. > > > > > 10. > > + /* > > + * It is possible to get null values for lsns and xmin if slot is > > + * invalidated on the primary server, so handle accordingly. > > + */ > > + new_invalidated = DatumGetBool(slot_getattr(slot, 1, &isnull)); > > > > We can say LSN and Xmin in the above comment to make it easier to > > read/understand. > > Changed. > > > > > 11. > > /* > > + * Once we got valid restart_lsn, then confirmed_lsn and catalog_xmin > > + * are expected to be valid/non-null, so assert if found null. > > + */ > > > > No need to explicitly say about assert, it is clear from the code. We can slightly > > change this comment to: "Once we got valid restart_lsn, then confirmed_lsn > > and catalog_xmin are expected to be valid/non-null." > > Changed. > > > > > 12. > > + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn || > > + TransactionIdPrecedes(remote_slot->catalog_xmin, > > + MyReplicationSlot->data.catalog_xmin)) > > + { > > + if (!wait_for_primary_slot_catchup(wrconn, remote_slot)) { > > + /* > > + * The remote slot didn't catch up to locally reserved position. > > + * But still persist it and attempt the wait and sync in next > > + * sync-cycle. > > + */ > > + if (MyReplicationSlot->data.persistency != RS_PERSISTENT) { > > + ReplicationSlotPersist(); *slot_updated = true; } > > > > I think the reason to persist in this case is because next time local restart_lsn can > > be ahead than the current location and it can take more time to create such a > > slot. We can probably mention the same in the comments. > > Updated the comments. > > Here is the V37 patch set which addressed comments above and [1]. > > [1] https://www.postgresql.org/message-id/CAA4eK1%2BP9R3GO2rwGBg2EOh%3DuYjWUSEOHD8yvs4Je8WYa2RHag%40mail.gmail.com > > Best Regards, > Hou zj
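As a concrete illustration of the restriction discussed for comment 7 above, trying to consume a synced slot on the standby before promotion would fail roughly as follows; the slot name is illustrative and the messages are taken from the quoted code, so the wording may still change.

-- On the standby, before promotion:
SELECT * FROM pg_logical_slot_get_changes('failover_slot', NULL, NULL);
-- ERROR:  cannot use replication slot "failover_slot" for logical decoding
-- DETAIL:  This slot is being synced from the primary server.
-- HINT:  Specify another replication slot.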
On Tue, Nov 21, 2023 at 10:02 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, November 17, 2023 7:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Nov 16, 2023 at 5:34 PM shveta malik <shveta.malik@gmail.com> > > wrote: > > > > > > PFA v35. > > > > > > > Review v35-0002* > > ============== > > Thanks for the comments. > > > 1. > > As quoted in the commit message, > > > > > If a logical slot is invalidated on the primary, slot on the standby is also > > invalidated. If a logical slot on the primary is valid but is invalidated on the > > standby due to conflict (say required rows removed on the primary), then that > > slot is dropped and recreated on the standby in next sync-cycle. > > It is okay to recreate such slots as long as these are not consumable on the > > standby (which is the case currently). > > > > > > > I think this won't happen normally because of the physical slot and > > hot_standby_feedback but probably can occur in cases like if the user > > temporarily switches hot_standby_feedback from on to off. Are there any other > > reasons? I think we can mention the cases along with it as well at least for now. > > Additionally, I think this should be covered in code comments as well. > > I will collect all these cases and update in next version. > > > > > 2. > > #include "postgres.h" > > - > > +#include "access/genam.h" > > > > Spurious line removal. > > Removed. > > > > > 3. > > A password needs to be provided too, if the sender demands > > password > > authentication. It can be provided in the > > <varname>primary_conninfo</varname> string, or in a separate > > - <filename>~/.pgpass</filename> file on the standby server (use > > - <literal>replication</literal> as the database name). > > - Do not specify a database name in the > > - <varname>primary_conninfo</varname> string. > > + <filename>~/.pgpass</filename> file on the standby server. > > + </para> > > + <para> > > + Specify <literal>dbname</literal> in > > + <varname>primary_conninfo</varname> string to allow > > synchronization > > + of slots from the primary server to the standby server. > > + This will only be used for slot synchronization. It is ignored > > + for streaming. > > > > Is there a reason to remove part of the earlier sentence "use > > <literal>replication</literal> as the database name"? > > Added it back. > > > > > 4. > > + <primary><varname>enable_syncslot</varname> configuration > > parameter</primary> > > + </indexterm> > > + </term> > > + <listitem> > > + <para> > > + It enables a physical standby to synchronize logical failover slots > > + from the primary server so that logical subscribers are not blocked > > + after failover. > > + </para> > > + <para> > > + It is enabled by default. This parameter can only be set in the > > + <filename>postgresql.conf</filename> file or on the server > > command line. > > + </para> > > > > I think you forgot to update the documentation for the default value of this > > variable. > > Updated. > > > > > 5. > > + * a) start the logical replication workers for every enabled subscription > > + * when not in standby_mode > > + * b) start the slot-sync worker for logical failover slots synchronization > > + * from the primary server when in standby_mode. > > > > Either use a full stop after both lines or none of these. > > Added a full stop. > > > > > 6. > > +static void slotsync_worker_cleanup(SlotSyncWorkerInfo * worker); > > > > There shouldn't be space between * and the worker. > > Removed, and added the type to typedefs.list. 
> > > > > 7. > > + if (!SlotSyncWorker->hdr.in_use) > > + { > > + LWLockRelease(SlotSyncWorkerLock); > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("replication slot-sync worker not initialized, " > > + "cannot attach"))); > > + } > > + > > + if (SlotSyncWorker->hdr.proc) > > + { > > + LWLockRelease(SlotSyncWorkerLock); > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("replication slot-sync worker is " > > + "already running, cannot attach"))); > > + } > > > > Using slot-sync in the error messages looks a bit odd to me. Can we use > > "replication slot sync worker ..." in both these and other similar messages? I > > think it would be better if we don't split the messages into multiple lines in > > these cases as messages don't appear too long to me. > > Changed as suggested. > > > > > 8. > > +/* > > + * Detach the worker from DSM and update 'proc' and 'in_use'. > > + * Logical replication launcher will come to know using these > > + * that the worker has shutdown. > > + */ > > +void > > +slotsync_worker_detach(int code, Datum arg) { > > > > I think the reference to DSM is leftover from the previous version of the patch. > > Can we change the above comments as per the new code? > > Changed. > > > > > 9. > > +static bool > > +slotsync_worker_launch() > > { > > ... > > + /* TODO: do we really need 'generation', analyse more here */ > > + worker->hdr.generation++; > > > > We should do something about this TODO. As per my understanding, we don't > > need a generation number for the slot sync worker as we have one such worker > > but I guess the patch requires it because we are using existing logical > > replication worker infrastructure. This brings the question of whether we really > > need a separate SlotSyncWorkerInfo or if we can use existing > > LogicalRepWorker and distinguish it with LogicalRepWorkerType? I guess you > > didn't use it because most of the fields in LogicalRepWorker will be unused for > > slot sync worker. > > Will think about this one and update in next version. > > > > > 10. > > + * Can't use existing functions like 'get_database_oid' from > > +dbcommands.c for > > + * validity purpose as they need db connection. > > + */ > > +static bool > > +validate_dbname(const char *dbname) > > > > I don't know how important it is to validate the dbname before launching the > > sync slot worker because anyway after launching, it will give an error while > > initializing the connection if the dbname is invalid. But, if we think it is really > > required, did you consider using GetDatabaseTuple()? > > Yes, we could export GetDatabaseTuple. Apart from this, I am thinking is it possible to > release the restriction for the dbname. For example, slot sync worker could > always connect to the 'template1' as the worker doesn't update the > database objects. Although I didn't find some examples on server side, but some > client commands(e.g. pg_upgrade) will connect to template1 to check some global > objects. We use this dbname for 2 purposes: a) which you pointed out i.e. to have db connection in sync worker, b) to make connection to primary server's db so that we can run SELECT queries there. Thought of adding this point, so that we have complete info before deciding the next step. (Just FYI, the previous version patch used a replication command which > may avoid the dbname but was replaced with SELECT to improve the flexibility and > avoid introducing new command.) Would like to add more info here. 
We had a LIST command in launcher.c for the multi-worker design, to fetch dbids from the primary. For the single-worker case we do not need that info, so we got rid of that command. In slotsync.c we always used 'SELECT' queries and never had any replication command implemented there, so we never replaced a replication command with 'SELECT' in any of the patches. The reason we had a command in launcher.c is that the launcher originally did not have any db connection and we did not want to change that, while running 'SELECT' queries through the exposed libpq APIs needs a db connection; hence we retained the replication command in the launcher. In slotsync.c, on the other hand, we had to run multiple queries to get different information, so to retain flexibility and ease of extension over replication commands we decided to go with SELECT and thus opted for the db connection needed by the libpq APIs. thanks Shveta
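To illustrate the flexibility argument, with a database connection the slot sync worker can fetch whatever it needs from the primary with ordinary queries along these lines; the exact column list is the patch's choice, this one is only an example, and failover is the new column added by this patch set.

SELECT slot_name, plugin, two_phase, failover, database,
       restart_lsn, confirmed_flush_lsn, catalog_xmin
  FROM pg_catalog.pg_replication_slots
 WHERE failover AND NOT temporary;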
Hi, On 11/21/23 6:16 AM, Amit Kapila wrote: > On Mon, Nov 20, 2023 at 6:51 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> As far the 'i' state here, from what I see, it is currently useful for: >> >> 1. Cascading standby to not sync slots with state = 'i' from >> the first standby. >> 2. Easily report Slots that did not catch up on the primary yet. >> 3. Avoid inactive slots to block "active" ones creation. >> >> So not creating those slots should not be an issue for 1. (sync are >> not needed on cascading standby as not created on the first standby yet) >> but is an issue for 2. (unless we provide another way to keep track and report >> such slots) and 3. (as I think we should still need to reserve WAL). >> >> I've a question: we'd still need to reserve WAL for those slots, no? >> >> If that's the case and if we don't call ReplicationSlotCreate() then ReplicationSlotReserveWal() >> would not work as MyReplicationSlot would be NULL. >> > > Yes, we need to reserve WAL to see if we can sync the slot. We are > currently creating an RS_EPHEMERAL slot and if we don't explicitly > persist it when we can't sync, then it will be dropped when we do > ReplicationSlotRelease() at the end of synchronize_one_slot(). So, the > loss is probably, the next time we again try to sync the slot, we need > to again create it and may need to wait for newer restart_lsn on > standby Yeah, and doing so we'd reduce the time window to give the slot a chance to catch up (as opposed to create it a single time and maintain an 'i' state). > which could be avoided if we have the slot in 'i' state from > the previous run. Right. > I don't deny the importance of having 'i' > (initialized) state but was just trying to say that it has additional > code complexity. Right, and I think it's worth it. > OTOH, having it may give better visibility to even > users about slots that are not active (say manually created slots on > the primary). Agree. All that being said, on my side I'm +1 on keeping the 'i' state behavior as it is implemented currently (would be happy to hear others' opinions too). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Nov 21, 2023 at 10:02 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, November 17, 2023 7:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Nov 16, 2023 at 5:34 PM shveta malik <shveta.malik@gmail.com> > > wrote: > > > > > > PFA v35. > > > > > > > Review v35-0002* > > ============== > > Thanks for the comments. > > > 1. > > As quoted in the commit message, > > > > > If a logical slot is invalidated on the primary, slot on the standby is also > > invalidated. If a logical slot on the primary is valid but is invalidated on the > > standby due to conflict (say required rows removed on the primary), then that > > slot is dropped and recreated on the standby in next sync-cycle. > > It is okay to recreate such slots as long as these are not consumable on the > > standby (which is the case currently). > > > > > > > I think this won't happen normally because of the physical slot and > > hot_standby_feedback but probably can occur in cases like if the user > > temporarily switches hot_standby_feedback from on to off. Are there any other > > reasons? I think we can mention the cases along with it as well at least for now. > > Additionally, I think this should be covered in code comments as well. > > I will collect all these cases and update in next version. > > > > > 2. > > #include "postgres.h" > > - > > +#include "access/genam.h" > > > > Spurious line removal. > > Removed. > > > > > 3. > > A password needs to be provided too, if the sender demands > > password > > authentication. It can be provided in the > > <varname>primary_conninfo</varname> string, or in a separate > > - <filename>~/.pgpass</filename> file on the standby server (use > > - <literal>replication</literal> as the database name). > > - Do not specify a database name in the > > - <varname>primary_conninfo</varname> string. > > + <filename>~/.pgpass</filename> file on the standby server. > > + </para> > > + <para> > > + Specify <literal>dbname</literal> in > > + <varname>primary_conninfo</varname> string to allow > > synchronization > > + of slots from the primary server to the standby server. > > + This will only be used for slot synchronization. It is ignored > > + for streaming. > > > > Is there a reason to remove part of the earlier sentence "use > > <literal>replication</literal> as the database name"? > > Added it back. > > > > > 4. > > + <primary><varname>enable_syncslot</varname> configuration > > parameter</primary> > > + </indexterm> > > + </term> > > + <listitem> > > + <para> > > + It enables a physical standby to synchronize logical failover slots > > + from the primary server so that logical subscribers are not blocked > > + after failover. > > + </para> > > + <para> > > + It is enabled by default. This parameter can only be set in the > > + <filename>postgresql.conf</filename> file or on the server > > command line. > > + </para> > > > > I think you forgot to update the documentation for the default value of this > > variable. > > Updated. > > > > > 5. > > + * a) start the logical replication workers for every enabled subscription > > + * when not in standby_mode > > + * b) start the slot-sync worker for logical failover slots synchronization > > + * from the primary server when in standby_mode. > > > > Either use a full stop after both lines or none of these. > > Added a full stop. > > > > > 6. > > +static void slotsync_worker_cleanup(SlotSyncWorkerInfo * worker); > > > > There shouldn't be space between * and the worker. > > Removed, and added the type to typedefs.list. 
> > > > > 7. > > + if (!SlotSyncWorker->hdr.in_use) > > + { > > + LWLockRelease(SlotSyncWorkerLock); > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("replication slot-sync worker not initialized, " > > + "cannot attach"))); > > + } > > + > > + if (SlotSyncWorker->hdr.proc) > > + { > > + LWLockRelease(SlotSyncWorkerLock); > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("replication slot-sync worker is " > > + "already running, cannot attach"))); > > + } > > > > Using slot-sync in the error messages looks a bit odd to me. Can we use > > "replication slot sync worker ..." in both these and other similar messages? I > > think it would be better if we don't split the messages into multiple lines in > > these cases as messages don't appear too long to me. > > Changed as suggested. > > > > > 8. > > +/* > > + * Detach the worker from DSM and update 'proc' and 'in_use'. > > + * Logical replication launcher will come to know using these > > + * that the worker has shutdown. > > + */ > > +void > > +slotsync_worker_detach(int code, Datum arg) { > > > > I think the reference to DSM is leftover from the previous version of the patch. > > Can we change the above comments as per the new code? > > Changed. > > > > > 9. > > +static bool > > +slotsync_worker_launch() > > { > > ... > > + /* TODO: do we really need 'generation', analyse more here */ > > + worker->hdr.generation++; > > > > We should do something about this TODO. As per my understanding, we don't > > need a generation number for the slot sync worker as we have one such worker > > but I guess the patch requires it because we are using existing logical > > replication worker infrastructure. Yes, we do not need generation, but since we want to use existing logical-rep worker infrastructure, we can retain generation but can keep it as zero always for the slot-sync worker case. > > This brings the question of whether we really > > need a separate SlotSyncWorkerInfo or if we can use existing > > LogicalRepWorker and distinguish it with LogicalRepWorkerType? I guess you > > didn't use it because most of the fields in LogicalRepWorker will be unused for > > slot sync worker. Yes, right. If we use LogicalRepWorker in the slot-sync worker, then it will be a task to keep a check (even in future) that no-one should end up using uninitialized fields in slot-sync code. That is why shifting common fields to LogicalWorkerHeader and using that in SlotSyncWorkerInfo and LogicalRepWorker seems a better approach to me. > > Will think about this one and update in next version. > > > > > 10. > > + * Can't use existing functions like 'get_database_oid' from > > +dbcommands.c for > > + * validity purpose as they need db connection. > > + */ > > +static bool > > +validate_dbname(const char *dbname) > > > > I don't know how important it is to validate the dbname before launching the > > sync slot worker because anyway after launching, it will give an error while > > initializing the connection if the dbname is invalid. But, if we think it is really > > required, did you consider using GetDatabaseTuple()? > > Yes, we could export GetDatabaseTuple. Apart from this, I am thinking is it possible to > release the restriction for the dbname. For example, slot sync worker could > always connect to the 'template1' as the worker doesn't update the > database objects. Although I didn't find some examples on server side, but some > client commands(e.g. 
pg_upgrade) will connect to template1 to check some global > objects. (Just FYI, the previous version patch used a replication command which > may avoid the dbname but was replaced with SELECT to improve the flexibility and > avoid introducing new command.) > > Best Regards, > Hou zj > >
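To make the shared-header layout discussed in item 9 above a bit more concrete, here is a minimal sketch, assuming the usual PostgreSQL types; the field names follow the hdr.in_use / hdr.proc / hdr.generation accesses quoted earlier and the last_update_time field mentioned later in the thread, while everything else is illustrative rather than taken from the posted patch:

/* Fields common to the slot-sync worker and logical replication workers. */
typedef struct LogicalWorkerHeader
{
	bool		in_use;			/* is this worker array entry allocated? */
	uint16		generation;		/* bumped on reuse; kept at 0 for slot sync */
	PGPROC	   *proc;			/* process attached to the entry, or NULL */
} LogicalWorkerHeader;

/* The slot-sync worker then carries only the shared header plus its own fields. */
typedef struct SlotSyncWorkerInfo
{
	LogicalWorkerHeader hdr;
	TimestampTz	last_update_time;	/* slot-sync specific bookkeeping */
} SlotSyncWorkerInfo;

LogicalRepWorker would embed the same header, so code that only touches in_use, proc or generation can be shared without exposing the unrelated apply-worker fields to the slot-sync code.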
On Tue, Nov 21, 2023 at 1:13 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 11/21/23 6:16 AM, Amit Kapila wrote: > > On Mon, Nov 20, 2023 at 6:51 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> As far the 'i' state here, from what I see, it is currently useful for: > >> > >> 1. Cascading standby to not sync slots with state = 'i' from > >> the first standby. > >> 2. Easily report Slots that did not catch up on the primary yet. > >> 3. Avoid inactive slots to block "active" ones creation. > >> > >> So not creating those slots should not be an issue for 1. (sync are > >> not needed on cascading standby as not created on the first standby yet) > >> but is an issue for 2. (unless we provide another way to keep track and report > >> such slots) and 3. (as I think we should still need to reserve WAL). > >> > >> I've a question: we'd still need to reserve WAL for those slots, no? > >> > >> If that's the case and if we don't call ReplicationSlotCreate() then ReplicationSlotReserveWal() > >> would not work as MyReplicationSlot would be NULL. > >> > > > > Yes, we need to reserve WAL to see if we can sync the slot. We are > > currently creating an RS_EPHEMERAL slot and if we don't explicitly > > persist it when we can't sync, then it will be dropped when we do > > ReplicationSlotRelease() at the end of synchronize_one_slot(). So, the > > loss is probably, the next time we again try to sync the slot, we need > > to again create it and may need to wait for newer restart_lsn on > > standby > > Yeah, and doing so we'd reduce the time window to give the slot a chance > to catch up (as opposed to create it a single time and maintain an 'i' state). > > > which could be avoided if we have the slot in 'i' state from > > the previous run. > > Right. > > > I don't deny the importance of having 'i' > > (initialized) state but was just trying to say that it has additional > > code complexity. > > Right, and I think it's worth it. > > > OTOH, having it may give better visibility to even > > users about slots that are not active (say manually created slots on > > the primary). > > Agree. > > All that being said, on my side I'm +1 on keeping the 'i' state behavior > as it is implemented currently (would be happy to hear others' opinions too). > +1 for 'i' state. I feel it gives a better slot-sync functionality (optimizing redo-effort for inactive slots, inactive not blocking active ones) along with its usage for monitoring purposes.
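For readers following the state discussion above, a small sketch of the sync_state values being referred to; the constant names and the 'i'/'r' characters appear in the patch code quoted later in the thread, while the 'n' character for the NONE state is an assumption here:

/* Per-slot sync_state on the standby, as discussed above. */
#define SYNCSLOT_STATE_NONE		'n'	/* slot was not created by the slot-sync worker */
#define SYNCSLOT_STATE_INITIATED	'i'	/* created by the worker, still waiting to catch up */
#define SYNCSLOT_STATE_READY		'r'	/* caught up; can be used after failover */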
On Tue, Nov 21, 2023 at 2:02 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Nov 21, 2023 at 1:13 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On 11/21/23 6:16 AM, Amit Kapila wrote: > > > On Mon, Nov 20, 2023 at 6:51 PM Drouvot, Bertrand > > > <bertranddrouvot.pg@gmail.com> wrote: > > >> As far the 'i' state here, from what I see, it is currently useful for: > > >> > > >> 1. Cascading standby to not sync slots with state = 'i' from > > >> the first standby. > > >> 2. Easily report Slots that did not catch up on the primary yet. > > >> 3. Avoid inactive slots to block "active" ones creation. > > >> > > >> So not creating those slots should not be an issue for 1. (sync are > > >> not needed on cascading standby as not created on the first standby yet) > > >> but is an issue for 2. (unless we provide another way to keep track and report > > >> such slots) and 3. (as I think we should still need to reserve WAL). > > >> > > >> I've a question: we'd still need to reserve WAL for those slots, no? > > >> > > >> If that's the case and if we don't call ReplicationSlotCreate() then ReplicationSlotReserveWal() > > >> would not work as MyReplicationSlot would be NULL. > > >> > > > > > > Yes, we need to reserve WAL to see if we can sync the slot. We are > > > currently creating an RS_EPHEMERAL slot and if we don't explicitly > > > persist it when we can't sync, then it will be dropped when we do > > > ReplicationSlotRelease() at the end of synchronize_one_slot(). So, the > > > loss is probably, the next time we again try to sync the slot, we need > > > to again create it and may need to wait for newer restart_lsn on > > > standby > > > > Yeah, and doing so we'd reduce the time window to give the slot a chance > > to catch up (as opposed to create it a single time and maintain an 'i' state). > > > > > which could be avoided if we have the slot in 'i' state from > > > the previous run. > > > > Right. > > > > > I don't deny the importance of having 'i' > > > (initialized) state but was just trying to say that it has additional > > > code complexity. > > > > Right, and I think it's worth it. > > > > > OTOH, having it may give better visibility to even > > > users about slots that are not active (say manually created slots on > > > the primary). > > > > Agree. > > > > All that being said, on my side I'm +1 on keeping the 'i' state behavior > > as it is implemented currently (would be happy to hear others' opinions too). > > > > +1 for 'i' state. I feel it gives a better slot-sync functionality > (optimizing redo-effort for inactive slots, inactive not blocking > active ones) along with its usage for monitoring purposes. v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd, rebased the patches. PFA v37_2 patches. thanks Shveta
Attachment
Hi, On 11/21/23 10:32 AM, shveta malik wrote: > On Tue, Nov 21, 2023 at 2:02 PM shveta malik <shveta.malik@gmail.com> wrote: >> > v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd, > rebased the patches. PFA v37_2 patches. Thanks! Regarding the promotion flow: If the primary is available and reachable I don't think we currently try to ensure that slots are in sync. I think we'd miss the activity since the last sync and the promotion request or am I missing something? If the primary is available and reachable shouldn't we launch a last round of synchronization (skipping all the slots that are not in 'r' state)? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tuesday, November 21, 2023 5:33 PM shveta malik <shveta.malik@gmail.com> wrote: > > > v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd, rebased the > patches. PFA v37_2 patches. Thanks for updating the patches. I'd like to discuss one issue related to the correct handling of failover flag when executing ALTER SUBSCRIPTION SET (slot_name = 'new_slot')". Since the command intends to use a new slot on the primary, the new slot needs to reflect the "failover" state that the subscription currently has. If the failoverstate of the Subscription is LOGICALREP_FAILOVER_STATE_ENABLED, then I can reset it to LOGICALREP_FAILOVER_STATE_PENDING and allow the apply worker to handle it the way it is handled today (just like two_phase handling). But if the failoverstate is LOGICALREP_FAILOVER_STATE_DISABLED, the original idea is to call walrcv_alter_slot and alter the slot from the "ALTER SUBSCRIPTION" handling backend itself. This works if the slot is currently disabled. But the " ALTER SUBSCRIPTION SET (slot_name = 'new_slot')" command is supported even if the subscription is enabled. If the subscription is enabled, then calling walrcv_alter_slot() fails because the slot is still acquired by apply worker. So, I am thinking do we need a new mechanism to change the failover flag to false on an enabled subscription ? For example, we could call walrcv_alter_slot on startup of apply worker if AllTablesyncsReady(), for both true and false values of failover flag. This way, every time apply worker is started, it calls walrcv_alter_slot to set the failover flag on the primary. Or we could just document that it is user's responsibility to match the failover property in case it changes the slot_name. Thoughts ? Best Regards, Hou zj
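For illustration only, the "re-assert the failover flag when the apply worker starts" idea floated above might look roughly like the sketch below. It is not part of any posted patch; the wrapper name is hypothetical, the LogRepWorkerWalRcvConn connection and the failoverstate field on MySubscription are assumptions, and only the walrcv_alter_slot() call shape and the state constant come from this thread.

/* Hypothetical sketch of calling walrcv_alter_slot() at apply worker start. */
static void
reassert_failover_on_worker_start(void)
{
	if (AllTablesyncsReady())
		walrcv_alter_slot(LogRepWorkerWalRcvConn,
						  MySubscription->slotname,
						  MySubscription->failoverstate == LOGICALREP_FAILOVER_STATE_ENABLED);
}

As discussed in the follow-up messages, simply documenting the user's responsibility avoids this extra round trip on every worker start.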
In addition to my recent v35-0001 comment not yet addressed [1], here are some review comments for patch v37-0001. ====== src/backend/replication/walsender.c 1. PhysicalWakeupLogicalWalSnd +/* + * Wake up logical walsenders with failover-enabled slots if the physical slot + * of the current walsender is specified in standby_slot_names GUC. + */ +void +PhysicalWakeupLogicalWalSnd(void) +{ + ListCell *lc; + List *standby_slots; + bool slot_in_list = false; + + Assert(MyReplicationSlot != NULL); + Assert(SlotIsPhysical(MyReplicationSlot)); + + standby_slots = GetStandbySlotList(false); + + foreach(lc, standby_slots) + { + char *name = lfirst(lc); + + if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0) + { + slot_in_list = true; + break; + } + } + + if (slot_in_list) + ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv); +} 1a. Easier to have single assertion -- Assert(MyReplicationSlot && SlotIsPhysical(MyReplicationSlot)); ~ 1b. Why bother with the 'slot_in_list' and break, when you can just call the ConditionVariableBroadcast() and return without having the extra variable? ====== src/test/recovery/t/050_verify_slot_order.pl ~~~ 2. Should you name the global objects with a 'regress_' prefix which seems to be the standard for other new TAP tests? ~~~ 3. +# +# | ----> standby1 (connected via streaming replication) +# | ----> standby2 (connected via streaming replication) +# primary ----- | +# | ----> subscriber1 (connected via logical replication) +# | ----> subscriber2 (connected via logical replication) +# +# +# Set up is configured in such a way that primary never lets subscriber1 ahead +# of standby1. 3a. Misaligned "|" in comment? ~ 3b. IMO it would be better to give an overview of how this all works instead of just saying "configured in such a way". ~~~ 4. +# Configure primary to disallow specified logical replication slot (lsub1_slot) +# getting ahead of specified physical replication slot (sb1_slot). +$primary->append_conf( It is confusing because there is no "lsub1_slot" specified anywhere until much later. Would you be able to provide some more details? ~~~ 5. +# Create another subscriber node, wait for sync to complete +my $subscriber2 = PostgreSQL::Test::Cluster->new('subscriber2'); +$subscriber2->init(allows_streaming => 'logical'); +$subscriber2->start; +$subscriber2->safe_psql('postgres', "CREATE TABLE tab_int (a int PRIMARY KEY);"); +$subscriber2->safe_psql('postgres', + "CREATE SUBSCRIPTION mysub2 CONNECTION '$publisher_connstr' " + . "PUBLICATION mypub WITH (slot_name = lsub2_slot);"); +$subscriber2->wait_for_subscription_sync; Maybe this comment should explicitly say there is no failover enabled here. Maybe the SUBSCRIPTION should explicitly set failover=false? ~~~ 6. +# The subscription that's up and running and is enabled for failover +# doesn't get the data from primary and keeps waiting for the +# standby specified in standby_slot_names. +$result = $subscriber1->safe_psql('postgres', + "SELECT count(*) = 0 FROM tab_int;"); +is($result, 't', "subscriber1 doesn't get data from primary until standby1 acknowledges changes"); Might it be better to write as "SELECT count(*) = $primary_row_count FROM tab_int;" and expect it to return false? ====== src/test/regress/expected/subscription.out 7. Everything here displays the "Failover" state 'd' (disabled). How about tests for different state values? ====== [1] https://www.postgresql.org/message-id/CAHut%2BPv-yu71ogj_hRi6cCtmD55bsyw7XTxj1Nq8yVFKpY3NDQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
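Putting review comments 1a and 1b together, the simplified function being suggested would look roughly like this; it is based only on the code quoted above, not on the actual patch:

/*
 * Wake up logical walsenders with failover-enabled slots if the physical slot
 * of the current walsender is listed in standby_slot_names.
 */
void
PhysicalWakeupLogicalWalSnd(void)
{
	ListCell   *lc;
	List	   *standby_slots;

	Assert(MyReplicationSlot && SlotIsPhysical(MyReplicationSlot));

	standby_slots = GetStandbySlotList(false);

	foreach(lc, standby_slots)
	{
		char	   *name = lfirst(lc);

		if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0)
		{
			/* Found our slot in the list; wake the waiters and return. */
			ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv);
			return;
		}
	}
}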
On Tue, Nov 21, 2023 at 4:35 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 11/21/23 10:32 AM, shveta malik wrote: > > On Tue, Nov 21, 2023 at 2:02 PM shveta malik <shveta.malik@gmail.com> wrote: > >> > > > v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd, > > rebased the patches. PFA v37_2 patches. > > Thanks! > > Regarding the promotion flow: If the primary is available and reachable I don't > think we currently try to ensure that slots are in sync. I think we'd miss the > activity since the last sync and the promotion request or am I missing something? > > If the primary is available and reachable shouldn't we launch a last round of > synchronization (skipping all the slots that are not in 'r' state)? > We may miss the last round but there is no guarantee that we can ensure to sync of everything if the primary is available. Because after our last sync, there could probably be some more activity. I think it is the user's responsibility to promote a new primary when the old one is not required for some reason. It is not only slots that can be out of sync but even we can miss fetching some of the data. I think this is quite similar to what we do for WAL where on finding the promotion signal, we shut down Walreceiver and just replay any WAL that was already received by walreceiver. Also, the promotion shouldn't create any problem w.r.t subscribers connecting to the new primary because the slot's position is slightly behind what could be requested by subscribers which means the corresponding data will be available on the new primary. Do you have something in mind that can create any problem if we don't attempt additional fetching round after the promotion signal is received? -- With Regards, Amit Kapila.
On Mon, Nov 20, 2023 at 4:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Nov 18, 2023 at 4:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > More Review for v35-0002* > > ============================ > > > Thanks for the feedback. Please find the patch attached and my comments inline. > More review of v35-0002* > ==================== > 1. > +/* > + * Helper function to check if local_slot is present in remote_slots list. > + * > + * It also checks if logical slot is locally invalidated i.e. invalidated on > + * the standby but valid on the primary server. If found so, it sets > + * locally_invalidated to true. > + */ > +static bool > +slot_exists_in_list(ReplicationSlot *local_slot, List *remote_slots, > + bool *locally_invalidated) > > The name of the function is a bit misleading because it checks the > validity of the slot not only whether it exists in remote_list. Would > it be better to name it as ValidateSyncSlot() or something along those > lines? > Sure, updated the name. > 2. > +static long > +synchronize_slots(WalReceiverConn *wrconn) > { > ... > + /* Construct query to get slots info from the primary server */ > + initStringInfo(&s); > + construct_slot_query(&s); > ... > + if (remote_slot->conflicting) > + remote_slot->invalidated = get_remote_invalidation_cause(wrconn, > + remote_slot->name); > ... > > +static ReplicationSlotInvalidationCause > +get_remote_invalidation_cause(WalReceiverConn *wrconn, char *slot_name) > { > ... > + appendStringInfo(&cmd, > + "SELECT pg_get_slot_invalidation_cause(%s)", > + quote_literal_cstr(slot_name)); > + res = walrcv_exec(wrconn, cmd.data, 1, slotRow); > > Do we really need to query a second time to get the invalidation > cause? Can we adjust the slot_query to get it in one round trip? I > think this may not optimize much because the patch uses second round > trip only for invalidated slots but still looks odd. So unless the > query becomes too complicated, we should try to achive it one round > trip. > Modified the query to fetch all the info at once. > 3. > +static long > +synchronize_slots(WalReceiverConn *wrconn) > +{ > ... > ... > + /* The syscache access needs a transaction env. */ > + StartTransactionCommand(); > + > + /* Make things live outside TX context */ > + MemoryContextSwitchTo(oldctx); > + > + /* Construct query to get slots info from the primary server */ > + initStringInfo(&s); > + construct_slot_query(&s); > + > + elog(DEBUG2, "slot-sync worker's query:%s \n", s.data); > + > + /* Execute the query */ > + res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow); > > It is okay to perform the above query execution outside the > transaction context but I would like to know the reason for the same. > Do we want to retain anything beyond the transaction context or is > there some other reason to do this outside the transaction context? > Modified the comment with the reason. We need to start a transaction for syscache access. We can end it as soon as walrcv_exec() is over, but we need the tuple-results to be accessed even after that, thus those should not be allocated in TopTransactionContext. > 4. > +static void > +construct_slot_query(StringInfo s) > +{ > + /* > + * Fetch data for logical failover slots with sync_state either as > + * SYNCSLOT_STATE_NONE or SYNCSLOT_STATE_READY. 
> + */ > + appendStringInfo(s, > + "SELECT slot_name, plugin, confirmed_flush_lsn," > + " restart_lsn, catalog_xmin, two_phase, conflicting, " > + " database FROM pg_catalog.pg_replication_slots" > + " WHERE failover and sync_state != 'i'"); > +} > > Why would the sync_state on the primary server be any valid value? I > thought it was set only on physical standby. I think it is better to > mention the reason for using the sync state and or failover flag in > the above comments. The current comment doesn't seem of much use as it > just states what is evident from the query. Updated the reason in comment. It is mainly for cascading standby to fetch correct slots. > > 5. > * This check should never pass as on the primary server, we have waited > + * for the standby's confirmation before updating the logical slot. But to > + * take care of any bug in that flow, we should retain this check. > + */ > + if (remote_slot->confirmed_lsn > WalRcv->latestWalEnd) > + { > + elog(LOG, "skipping sync of slot \"%s\" as the received slot-sync " > + "LSN %X/%X is ahead of the standby position %X/%X", > + remote_slot->name, > + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), > + LSN_FORMAT_ARGS(WalRcv->latestWalEnd)); > + > > This should be elog(ERROR, ..). Normally, we use elog(ERROR, ...) for > such unexpected cases. And, you don't need to explicitly mention the > last sentence in the comment: "But to take care of any bug in that > flow, we should retain this check.". > Sure, modified. > 6. > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, > + bool *slot_updated) > { > ... > + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn) > + { > + ereport(WARNING, > + errmsg("not synchronizing slot %s; synchronization would" > + " move it backwards", remote_slot->name)); > > I think here elevel should be LOG because user can't do much about > this. Do we use ';' at other places in the message? But when can we > hit this case? We can add some comments to state in which scenario > this possible. OTOH, if this is sort of can't happen case and we have > kept it to avoid any sort of inconsistency then we can probably use > elog(ERROR, .. with approapriate LSN locations, so that later the > problem could be debugged. > Converted to ERROR and updated comment > 7. > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, > + bool *slot_updated) > { > ... > + > + StartTransactionCommand(); > + > + /* Make things live outside TX context */ > + MemoryContextSwitchTo(oldctx); > + > ... > > Similar to one of the previous comments, it is not clear to me why the > patch is doing a memory context switch here. Can we add a comment? > I have removed the memory-context-switch here as the results are all consumed within the span of transaction, so we do not need to retain those even after commit of txn for this particular case. > 8. > + /* User created slot with the same name exists, raise ERROR. */ > + else if (sync_state == SYNCSLOT_STATE_NONE) > + { > + ereport(ERROR, > + errmsg("not synchronizing slot %s; it is a user created slot", > + remote_slot->name)); > + } > > Won't we need error_code in this error? Also, the message doesn't seem > to follow the code's usual style. Modified. I have added errdetail as well, but not sure what we can add as error-hint, Shall we add something like: Try renaming existing slot. > > 9. > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, > + bool *slot_updated) > { > ... 
> + else > + { > + TransactionId xmin_horizon = InvalidTransactionId; > + ReplicationSlot *slot; > + > + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, > + remote_slot->two_phase, false); > + slot = MyReplicationSlot; > + > + SpinLockAcquire(&slot->mutex); > + slot->data.database = get_database_oid(remote_slot->database, false); > + > + /* Mark it as sync initiated by slot-sync worker */ > + slot->data.sync_state = SYNCSLOT_STATE_INITIATED; > + slot->data.failover = true; > + > + namestrcpy(&slot->data.plugin, remote_slot->plugin); > + SpinLockRelease(&slot->mutex); > + > + ReplicationSlotReserveWal(); > + > > How and when will this init state (SYNCSLOT_STATE_INITIATED) persist to disk? This will be inside wait_for_primary_and_sync. I have reorganized code here (removed wait_for_primary_and_sync) to make it more readable. > > 10. > + if (slot_updated) > + SlotSyncWorker->last_update_time = now; > + > + else if (TimestampDifferenceExceeds(SlotSyncWorker->last_update_time, > + now, WORKER_INACTIVITY_THRESHOLD_MS)) > > Empty line between if/else if is not required. > This is added by pg_indent. Not sure how we can correct it. > 11. > +static WalReceiverConn * > +remote_connect() > +{ > + WalReceiverConn *wrconn = NULL; > + char *err; > + > + wrconn = walrcv_connect(PrimaryConnInfo, true, false, "slot-sync", &err); > + if (wrconn == NULL) > + ereport(ERROR, > + (errmsg("could not connect to the primary server: %s", err))); > > Let's use appname similar to what we do for "walreceiver" as shown below: > /* Establish the connection to the primary for XLOG streaming */ > wrconn = walrcv_connect(conninfo, false, false, > cluster_name[0] ? cluster_name : "walreceiver", > &err); > if (!wrconn) > ereport(ERROR, > (errcode(ERRCODE_CONNECTION_FAILURE), > errmsg("could not connect to the primary server: %s", err))); > > Some proposals for default appname "slotsynchronizer", "slotsync > worker". Also, use the same error code as used by "walreceiver". Modified. > > 12. Do we need the handling of the slotsync worker in > GetBackendTypeDesc()? Please check without that what value this patch > displays for backend_type. It currently displays "slot sync worker'. It is the same desc which launcher has launched this worker with (snprintf(bgw.bgw_type, BGW_MAXLEN, "slot sync worker")). postgres=# select backend_type from pg_stat_activity; backend_type ------------------------------ logical replication launcher slot sync worker ....... For slot sync and logical launcher, BackendType is B_BG_WORKER and thus pg_stat_get_activity() for this type displays backend_type as the one given during background process registration and thus we get these correctly. But pg_stat_get_io() does not have the same implementation, it displays 'background worker' as the description. I think slot-sync and logical launcher are one of these entries postgres=# select backend_type from pg_stat_io; backend_type --------------------- autovacuum launcher .. background worker background worker background worker background worker background worker background writer ..... > > 13. > +/* > + * Re-read the config file. > + * > + * If primary_conninfo has changed, reconnect to primary. 
> + */ > +static void > +slotsync_reread_config(WalReceiverConn **wrconn) > +{ > + char *conninfo = pstrdup(PrimaryConnInfo); > + > + ConfigReloadPending = false; > + ProcessConfigFile(PGC_SIGHUP); > + > + /* Reconnect if GUC primary_conninfo got changed */ > + if (strcmp(conninfo, PrimaryConnInfo) != 0) > + { > + if (*wrconn) > + walrcv_disconnect(*wrconn); > + > + *wrconn = remote_connect(); > > I think we should exit the worker in this case and allow it to > reconnect. See the similar handling in maybe_reread_subscription(). > One effect of not doing is that the dbname patch has used in > ReplSlotSyncWorkerMain() will become inconsistent. > Modified as suggested. > 14. > +void > +ReplSlotSyncWorkerMain(Datum main_arg) > +{ > ... > ... > + /* > + * If the standby has been promoted, skip the slot synchronization process. > + * > + * Although the startup process stops all the slot-sync workers on > + * promotion, the launcher may not have realized the promotion and could > + * start additional workers after that. Therefore, this check is still > + * necessary to prevent these additional workers from running. > + */ > + if (PromoteIsTriggered()) > + exit(0); > ... > ... > + /* Check if got promoted */ > + if (!RecoveryInProgress()) > + { > + /* > + * Drop the slots for which sync is initiated but not yet > + * completed i.e. they are still waiting for the primary server to > + * catch up. > + */ > + slotsync_drop_initiated_slots(); > + ereport(LOG, > + errmsg("exiting slot-sync woker on promotion of standby")); > > I think we should never reach this code in non-standby mode. It should > elog(ERROR,.. Can you please explain why promotion handling is > required here? I will handle this in the next version. It needs some more thoughts, especially on how 'PromoteIsTriggered' can be removed. > > 15. > @@ -190,6 +190,8 @@ static const char *const BuiltinTrancheNames[] = { > "LogicalRepLauncherDSA", > /* LWTRANCHE_LAUNCHER_HASH: */ > "LogicalRepLauncherHash", > + /* LWTRANCHE_SLOTSYNC_DSA: */ > + "SlotSyncWorkerDSA", > }; > ... > ... > + LWTRANCHE_SLOTSYNC_DSA, > LWTRANCHE_FIRST_USER_DEFINED, > } BuiltinTrancheIds; > > These are not used in the patch. > Removed. > 16. > +/* ------------------------------- > + * LIST_DBID_FOR_FAILOVER_SLOTS command > + * ------------------------------- > + */ > +typedef struct ListDBForFailoverSlotsCmd > +{ > + NodeTag type; > + List *slot_names; > +} ListDBForFailoverSlotsCmd; > > ... > > +/* > + * Failover logical slots data received from remote. > + */ > +typedef struct WalRcvFailoverSlotsData > +{ > + Oid dboid; > +} WalRcvFailoverSlotsData; > > These structures don't seem to be used in the current version of the patch. Removed. > > 17. > --- a/src/include/replication/slot.h > +++ b/src/include/replication/slot.h > @@ -15,7 +15,6 @@ > #include "storage/lwlock.h" > #include "storage/shmem.h" > #include "storage/spin.h" > -#include "replication/walreceiver.h" > ... > ... > -extern void WaitForStandbyLSN(XLogRecPtr wait_for_lsn); > extern List *GetStandbySlotList(bool copy); > > Why the above two are removed as part of this patch? WaitForStandbyLSN() is no longer there, so that is why it was removed. I think it should have been removed from patch0001. WIll make this change in the next version where we have pacth0001 changes coming. Regarding header inclusion and 'ReplicationSlotDropAtPubNode' removal, not sure when those were removed. But my best guess is that the header inclusion chain has changed a little bit in patch. 
The tablesync.c uses ReplicationSlotDropAtPubNode which is part of subscriptioncmds.h. Now in our patch since tablesync.c includes subscriptioncmds.h and thus slot.h need not to extern it for tablesync.c. And if we can get rid of ReplicationSlotDropAtPubNode in slot.h, then walreceiver.h inclusion can also be removed as that was needed for 'WalReceiverConn' argument of ReplicationSlotDropAtPubNode. There could be other 'header inclusions' involved as well but this seems the primary reason. > -- > With Regards, > Amit Kapila.
Attachment
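Regarding item 11 above (the appname and error code used by remote_connect), the agreed direction presumably follows the quoted walreceiver example, roughly as below; the "slotsyncworker" string is a placeholder, since the thread only lists proposals for the final name:

/* Connect to the primary server for slot synchronization (sketch). */
static WalReceiverConn *
remote_connect(void)
{
	WalReceiverConn *wrconn;
	char	   *err;

	wrconn = walrcv_connect(PrimaryConnInfo, true, false,
							cluster_name[0] ? cluster_name : "slotsyncworker",
							&err);
	if (wrconn == NULL)
		ereport(ERROR,
				(errcode(ERRCODE_CONNECTION_FAILURE),
				 errmsg("could not connect to the primary server: %s", err)));

	return wrconn;
}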
On Tue, Nov 21, 2023 at 8:32 PM shveta malik <shveta.malik@gmail.com> wrote: > > v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd, > rebased the patches. PFA v37_2 patches. Thanks for the patch. Some comments: subscriptioncmds.c: CreateSubscription() and tablesync.c: process_syncing_tables_for_apply() walrcv_create_slot(wrconn, opts.slot_name, false, twophase_enabled, - CRS_NOEXPORT_SNAPSHOT, NULL); - - if (twophase_enabled) - UpdateTwoPhaseState(subid, LOGICALREP_TWOPHASE_STATE_ENABLED); - + failover_enabled, CRS_NOEXPORT_SNAPSHOT, NULL); either here or in libpqrcv_create_slot(), shouldn't you check the remote server version if it supports the failover flag? + + /* + * If only the slot_name is specified, it is possible that the user intends to + * use an existing slot on the publisher, so here we enable failover for the + * slot if requested. + */ + else if (opts.slot_name && failover_enabled) + { + walrcv_alter_slot(wrconn, opts.slot_name, opts.failover); + ereport(NOTICE, + (errmsg("enabled failover for replication slot \"%s\" on publisher", + opts.slot_name))); + } Here, the code only alters the slot if failover = true. You could use "else if (opts.slot_name && IsSet(opts.specified_opts, SUBOPT_FAILOVER)" to check if the failover flag is specified and alter for failover=false as well. Also, shouldn't you check for the server version if the command ALTER_REPLICATION_SLOT is supported? slot.c: ReplicationSlotAlter() +void +ReplicationSlotAlter(const char *name, bool failover) +{ + Assert(MyReplicationSlot == NULL); + + ReplicationSlotAcquire(name, true); + + if (SlotIsPhysical(MyReplicationSlot)) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot use %s with a physical replication slot", + "ALTER_REPLICATION_SLOT")); shouldn't you release the slot by calling ReplicationSlotRelease before erroring out? slot.c: +/* + * A helper function to validate slots specified in standby_slot_names GUCs. + */ +static bool +validate_standby_slots(char **newval) +{ + char *rawname; + List *elemlist; + ListCell *lc; + + /* Need a modifiable copy of string */ + rawname = pstrdup(*newval); rawname is not always freed. launcher.c: + SlotSyncWorker->hdr.proc = MyProc; + + before_shmem_exit(slotsync_worker_detach, (Datum) 0); + + LWLockRelease(SlotSyncWorkerLock); +} before_shmem_exit() can error out leaving the lock acquired. Maybe you should release the lock prior to calling before_shmem_exit() because you don't need the lock there. regards, Ajin Cherian Fujitsu Australia
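On the ReplicationSlotAlter() point above, the change being asked about would look something like the sketch below; whether the explicit release is actually needed depends on the error cleanup paths, which is exactly the reviewer's question, and the rest of the function is elided.

void
ReplicationSlotAlter(const char *name, bool failover)
{
	Assert(MyReplicationSlot == NULL);

	ReplicationSlotAcquire(name, true);

	if (SlotIsPhysical(MyReplicationSlot))
	{
		/* Release the acquired slot before erroring out, as suggested. */
		ReplicationSlotRelease();
		ereport(ERROR,
				errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				errmsg("cannot use %s with a physical replication slot",
					   "ALTER_REPLICATION_SLOT"));
	}

	/* ... the actual update of the slot's failover flag is elided ... */
}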
On Wed, Nov 22, 2023 at 10:02 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Tuesday, November 21, 2023 5:33 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd, rebased the > > patches. PFA v37_2 patches. > > Thanks for updating the patches. > > I'd like to discuss one issue related to the correct handling of failover flag > when executing ALTER SUBSCRIPTION SET (slot_name = 'new_slot')". > > Since the command intends to use a new slot on the primary, the new slot needs > to reflect the "failover" state that the subscription currently has. If the > failoverstate of the Subscription is LOGICALREP_FAILOVER_STATE_ENABLED, then I > can reset it to LOGICALREP_FAILOVER_STATE_PENDING and allow the apply worker to > handle it the way it is handled today (just like two_phase handling). > > But if the failoverstate is LOGICALREP_FAILOVER_STATE_DISABLED, the original > idea is to call walrcv_alter_slot and alter the slot from the "ALTER > SUBSCRIPTION" handling backend itself. This works if the slot is currently > disabled. But the " ALTER SUBSCRIPTION SET (slot_name = 'new_slot')" command is > supported even if the subscription is enabled. If the subscription is enabled, > then calling walrcv_alter_slot() fails because the slot is still acquired by > apply worker. > > So, I am thinking do we need a new mechanism to change the failover flag to > false on an enabled subscription ? For example, we could call walrcv_alter_slot > on startup of apply worker if AllTablesyncsReady(), for both true and false > values of failover flag. This way, every time apply worker is started, it calls > walrcv_alter_slot to set the failover flag on the primary. > I think for the false case, we need to execute walrcv_alter_slot() every time at the start of apply worker and it doesn't sound like an ideal way to achieve it. > Or we could just document that it is user's responsibility to match the failover > property in case it changes the slot_name. > Personally, I think we should document this behavior instead of complicating the patch and the user anyway has a way to achieve it. -- With Regards, Amit Kapila.
Hi, On 11/23/23 6:13 AM, Amit Kapila wrote: > On Tue, Nov 21, 2023 at 4:35 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 11/21/23 10:32 AM, shveta malik wrote: >>> On Tue, Nov 21, 2023 at 2:02 PM shveta malik <shveta.malik@gmail.com> wrote: >>>> >> >>> v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd, >>> rebased the patches. PFA v37_2 patches. >> >> Thanks! >> >> Regarding the promotion flow: If the primary is available and reachable I don't >> think we currently try to ensure that slots are in sync. I think we'd miss the >> activity since the last sync and the promotion request or am I missing something? >> >> If the primary is available and reachable shouldn't we launch a last round of >> synchronization (skipping all the slots that are not in 'r' state)? >> > > We may miss the last round but there is no guarantee that we can > ensure to sync of everything if the primary is available. Because > after our last sync, there could probably be some more activity. I don't think so thanks to the fact that we ensure that logical walsenders on the primary wait for the physical standby. Indeed that should prevent any decoding activity on the primary while the promotion is in progress on the standby (at least as soon as the walreceiver is shutdown). So that I think that a promotion flow like: - walreceiver shutdown - last round of sync - sync-worker shutdown Should ensure that slots are in sync (as logical slots on the primary should not be able to advance as soon as the walreceiver is shutdown during the promotion). > I think it is the user's responsibility to promote a new primary when > the old one is not required for some reason. Do you mean they should ensure something like? 1. no more activity on the primary 2. check that the slots are in sync with the primary 3. promote but then they could also (without the new feature we're building): 1. create and advance slots manually (pg_replication_slot_advance) on the standby to sync them up at regular interval and then before promotion: 2. ensure no more activity on the primary 3. last round of advance slots manually 3. promote I think that ensuring the slots are in sync during promotion (should the primary be available) would provide added value as compared to the above scenarios. > It is not only slots that > can be out of sync but even we can miss fetching some of the data. I > think this is quite similar to what we do for WAL where on finding the > promotion signal, we shut down Walreceiver and just replay any WAL > that was already received by walreceiver. > Also, the promotion > shouldn't create any problem w.r.t subscribers connecting to the new > primary because the slot's position is slightly behind what could be > requested by subscribers which means the corresponding data will be > available on the new primary. > Right. > Do you have something in mind that can create any problem if we don't > attempt additional fetching round after the promotion signal is > received? It's not a "real" problem per say, but in case of non synced slot, I can see 2 cases: - publisher/subscriber case: I don't see any problem here, since after an "alter subscription XXX connection '<new_primary>'" logical replication should start from the right place thanks to the replication origin associated to the subscription. 
- non publisher/subscriber case (say pg_recvlogical that does not make use of replication origin) then: a) data since the last sync and promotion would be decoded again unless b) or c) b) user manually advances the slot on the standby after promotion c) user restarts the decoding with an appropriate --startpos option That's for this non publisher/subscriber case that I think it would be beneficial to try to ensure that the slots are in sync during the promotion. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thursday, November 23, 2023 11:45 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: Hi, > > On 11/23/23 6:13 AM, Amit Kapila wrote: > > On Tue, Nov 21, 2023 at 4:35 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> On 11/21/23 10:32 AM, shveta malik wrote: > >>> On Tue, Nov 21, 2023 at 2:02 PM shveta malik <shveta.malik@gmail.com> > wrote: > >>>> > >> > >>> v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd, > >>> rebased the patches. PFA v37_2 patches. > >> > >> Thanks! > >> > >> Regarding the promotion flow: If the primary is available and > >> reachable I don't think we currently try to ensure that slots are in > >> sync. I think we'd miss the activity since the last sync and the promotion > request or am I missing something? > >> > >> If the primary is available and reachable shouldn't we launch a last > >> round of synchronization (skipping all the slots that are not in 'r' state)? > >> > > > > We may miss the last round but there is no guarantee that we can > > ensure to sync of everything if the primary is available. Because > > after our last sync, there could probably be some more activity. > > I don't think so thanks to the fact that we ensure that logical walsenders on the > primary wait for the physical standby. > > Indeed that should prevent any decoding activity on the primary while the > promotion is in progress on the standby (at least as soon as the walreceiver is > shutdown). > > So that I think that a promotion flow like: > > - walreceiver shutdown > - last round of sync > - sync-worker shutdown > > Should ensure that slots are in sync (as logical slots on the primary should not > be able to advance as soon as the walreceiver is shutdown during the > promotion). > I think it could not ensure the slots are in sync, because there is no guarantee that the logical slot has caught up to the physical standby on promotion and logical publisher and subscriber both could still be active during promotion. IOW, the logical slot's LSN can still be advanced after the walreceiver shutdown if it was far bebind the physical slot's LSN. Best Regards, Hou zj
Hi, On 11/24/23 4:35 AM, Zhijie Hou (Fujitsu) wrote: > On Thursday, November 23, 2023 11:45 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > IOW, the logical slot's LSN can still be advanced after the > walreceiver shutdown if it was far bebind the physical slot's LSN. > oh yeah right, it would need much more work/discussion to handle this case. As mentioned up-thread for publisher/subscriber I think it's fine (thanks to the replication origin linked to the subscriber) but for anything else that don't make use of replication origin (or similar approach to re-start the decoding from the right place after promotion) I feel like the user experience is not as good. It may not be worth it to work on it for V1 but maybe something to keep in mind as improvement for later? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 11/23/23 11:45 AM, Amit Kapila wrote: > On Wed, Nov 22, 2023 at 10:02 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > >> Or we could just document that it is user's responsibility to match the failover >> property in case it changes the slot_name. >> > > Personally, I think we should document this behavior instead of > complicating the patch and the user anyway has a way to achieve it. Same point of view. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Nov 24, 2023 at 1:53 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 11/24/23 4:35 AM, Zhijie Hou (Fujitsu) wrote: > > On Thursday, November 23, 2023 11:45 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > > > IOW, the logical slot's LSN can still be advanced after the > > walreceiver shutdown if it was far bebind the physical slot's LSN. > > > > oh yeah right, it would need much more work/discussion to handle this case. > > As mentioned up-thread for publisher/subscriber I think it's fine > (thanks to the replication origin linked to the subscriber) but for > anything else that don't make use of replication origin (or similar > approach to re-start the decoding from the right place after promotion) > I feel like the user experience is not as good. > > It may not be worth it to work on it for V1 but maybe something to keep > in mind as improvement for later? > Agreed, we can think of improving it in the future but there is no correctness issue with the current implementation (not trying to do the last fetch after the promotion signal). -- With Regards, Amit Kapila.
Hi, On 11/24/23 10:45 AM, Amit Kapila wrote: > On Fri, Nov 24, 2023 at 1:53 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> On 11/24/23 4:35 AM, Zhijie Hou (Fujitsu) wrote: >>> On Thursday, November 23, 2023 11:45 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: >>> >>> IOW, the logical slot's LSN can still be advanced after the >>> walreceiver shutdown if it was far bebind the physical slot's LSN. >>> >> >> oh yeah right, it would need much more work/discussion to handle this case. >> >> As mentioned up-thread for publisher/subscriber I think it's fine >> (thanks to the replication origin linked to the subscriber) but for >> anything else that don't make use of replication origin (or similar >> approach to re-start the decoding from the right place after promotion) >> I feel like the user experience is not as good. >> >> It may not be worth it to work on it for V1 but maybe something to keep >> in mind as improvement for later? >> > > Agreed, we can think of improving it in the future but there is no > correctness issue with the current implementation (not trying to do > the last fetch after the promotion signal). > Yeah agree, no correctness issue, my remark was all about trying to improve the user experience in some cases. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tuesday, November 21, 2023 1:39 PM Peter Smith <smithpb2250@gmail.com> wrote: Hi, Thanks for the comments. > > ====== > doc/src/sgml/catalogs.sgml > > 6. > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>subfailoverstate</structfield> <type>char</type> > + </para> > + <para> > + State codes for failover mode: > + <literal>d</literal> = disabled, > + <literal>p</literal> = pending enablement, > + <literal>e</literal> = enabled > + </para></entry> > + </row> > + > > This attribute is very similar to the 'subtwophasestate' so IMO it would be > better to be adjacent to that one in the docs. > > (probably this means putting it in the same order in the catalog also, assuming > that is allowed) It's allowed, but I think the functionality of two fields are different and I didn’t find the correlation between two fields except for the type of value. So I didn't change the order. > > ~~~ > > 12. AlterSubscription > > + if (sub->failoverstate == LOGICALREP_FAILOVER_STATE_ENABLED && > + opts.copy_data) ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed > when failover is enabled"), > + errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = > false, or with copy_data = false, or use DROP/CREATE SUBSCRIPTION."))); > > ~ > > 12b. > AFAIK when there are messages like this that differ only by non-translatable > things ("failover" option) then that non-translatable thing should be extracted > as a parameter so the messages are common. > And, don't forget to add a /* translator: %s is a subscription option like > 'failover' */ comment. > > SUGGESTION like: > errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed > when %s is enabled", "two_phase") errmsg("ALTER SUBSCRIPTION with refresh > and copy_data is not allowed when %s is enabled", "failover") I am not sure about changing the existing message here, I feel you can start a separate thread to change the twophase related messages, and we can change accordingly if it's accepted. > > ====== > .../libpqwalreceiver/libpqwalreceiver.c > > 15. libpqrcv_create_slot > > + if (failover) > + { > + appendStringInfoString(&cmd, "FAILOVER"); if (use_new_options_syntax) > + appendStringInfoString(&cmd, ", "); else appendStringInfoChar(&cmd, ' > + '); } > > 15a. > Isn't failover a new option that is unsupported pre-PG17? Why is it necessary to > support an old-style syntax for something that was not supported on old > servers? (I'm confused). > > ~ > > 15b. > Also IIRC, this FAILOVER wasn't not listed in the old-style syntax of > doc/src/sgml/protocol.sgml. Was that deliberate? We don't support FAILOVER for old-style syntax and pre-PG17, libpqrcv_create_slot is only building the replication command string and we will add failover in the string so that the publisher will report errors if it doesn't support these options ,the same is true for two_phase. > ~~~ > > 24. ReplicationSlotAlter > +/* > + * Change the definition of the slot identified by the passed in name. > + */ > +void > +ReplicationSlotAlter(const char *name, bool failover) > > /the definition/the failover state/ I kept this as it's a general function but we only support changing failover state for now. > ~~~ > > 28. 
check_standby_slot_names > > +bool > +check_standby_slot_names(char **newval, void **extra, GucSource source) > +{ if (strcmp(*newval, "") == 0) return true; > + > + /* > + * "*" is not accepted as in that case primary will not be able to know > + * for which all standbys to wait for. Even if we have physical-slots > + * info, there is no way to confirm whether there is any standby > + * configured for the known physical slots. > + */ > + if (strcmp(*newval, "*") == 0) > + { > + GUC_check_errdetail("\"%s\" is not accepted for standby_slot_names", > + *newval); return false; } > + > + /* Now verify if the specified slots really exist and have correct > + type */ if (!validate_standby_slots(newval)) return false; > + > + *extra = guc_strdup(ERROR, *newval); > + > + return true; > +} > > Is it really necessary to have a special test for the special value "*" which you are > going to reject? I don't see why this should be any different from checking for > other values like "." or "$" or "?" etc. > Why not just let validate_standby_slots() handle all of these? SplitIdentifierString() does not give error for '*' and '*' can be considered as valid value which if accepted can mislead user that all the standbys's slots are now considered, which is not the case here. So we want to explicitly call out this case i.e. '*' is not accepted as valid value for standby_slot_names. > > ~~~ > > 29. assign_standby_slot_names > > + /* No value is specified for standby_slot_names. */ if > + (standby_slot_names_cpy == NULL) return; > > Is this possible? IIUC the check_standby_slot_names() did: > *extra = guc_strdup(ERROR, *newval); > > Maybe this code also needs a similar elog and comment like already in this > function: > /* This should not happen if GUC checked check_standby_slot_names. */ This case is possible, standby_slot_names_cpy(e.g. extra pointer) is NULL if no value("") is specified for the GUC.(see the code in check_standby_slot_names). > ~ > > 30. assign_standby_slot_names > > + char *standby_slot_names_cpy = extra; > > IIUC, the 'extra' was unconditionally guc_strdup()'ed in the check hook, so > should we also free it here before leaving this function? No, as mentioned in src/backend/utils/misc/README, the space of extra will be automatically freed when the associated GUC setting is no longer of interest. > > ~~~ > > 31. GetStandbySlotList > > +/* > + * Return a copy of standby_slot_names_list if the copy flag is set to > +true, > + * otherwise return the original list. > + */ > +List * > +GetStandbySlotList(bool copy) > +{ > + if (copy) > + return list_copy(standby_slot_names_list); > + else > + return standby_slot_names_list; > +} > > Why is this better than just exposing the standby_slot_names_list. The caller > can make a copy or not. > e.g. why is calling GetStandbySlotList(true) better than just doing > list_copy(standby_slot_names_list)? I think either way is fine, but I prefer not to add one global variable if possible. > > ~~~ > > 34. WalSndFilterStandbySlots > > + /* Log warning if no active_pid for this physical slot */ if > + (slot->active_pid == 0) ereport(WARNING, > > Other nearby code is guarding the slot in case it was NULL, so why not here? Is > it a potential NPE? I think it will not pass the check for restart_lsn before the active_pid if slot is NULL. > > ~~~ > > 35. > + /* > + * If logical slot name is given in standby_slot_names, give WARNING > + * and skip it. Since it is harmless, so WARNING should be enough, no > + * need to error-out. 
> + */ > + else if (SlotIsLogical(slot)) > + warningfmt = _("cannot have logical replication slot \"%s\" in > parameter \"%s\", ignoring"); > > Is this possible? Doesn't the function 'validate_standby_slots' called by the GUC > hook prevent specifying logical slots in the GUC? Maybe this warning should be > changed to Assert? I think user could drop the logical slot and recreate a physical slot with the same name without changing the GUC. > > ~~~ > > 36. > + /* > + * Reaching here indicates that either the slot has passed the > + * wait_for_lsn or there is an issue with the slot that requires a > + * warning to be reported. > + */ > + if (warningfmt) > + ereport(WARNING, errmsg(warningfmt, name, "standby_slot_names")); > + > + standby_slots_cpy = foreach_delete_current(standby_slots_cpy, lc); > > If something was wrong with the slot that required a warning, is it really OK to > remove this slot from the list? This seems contrary to the function comment > which only talks about removing slots that have caught up. I think it's OK to remove slots if it's invalidated, dropped, or was changed to logical one as we don't need to wait for these slots to catch up anymore. > ~~~ > > 41. > /* > - * Fast path to avoid acquiring the spinlock in case we already know we > - * have enough WAL available. This is particularly interesting if we're > - * far behind. > + * Check if all the standby servers have confirmed receipt of WAL upto > + * RecentFlushPtr if we already know we have enough WAL available. > + * > + * Note that we cannot directly return without checking the status of > + * standby servers because the standby_slot_names may have changed, > + which > + * means there could be new standby slots in the list that have not yet > + * caught up to the RecentFlushPtr. > */ > if (RecentFlushPtr != InvalidXLogRecPtr && > loc <= RecentFlushPtr) > - return RecentFlushPtr; > + { > + WalSndFilterStandbySlots(RecentFlushPtr, &standby_slots); > > 41b. > IMO there is some missing information in this comment because it wasn't clear > to me that calling WalSndFilterStandbySlots was going to side-efect that list to > give it a different meaning. e.g. it seems it no longer means "standby slots" but > instead means something like "standby slots that are not caught up". Perhaps > that local variable can have a name that helps to convey that better? I am not sure about this, WalSndFilterStandbySlots already indicates it will filter the slot list which seems clear to me. But if you have better ideas, we can adjust in next version. > > ~~~ > > 44. > + if (wait_for_standby) > + ConditionVariablePrepareToSleep(&WalSndCtl->wal_confirm_rcv_cv); > + else if (MyWalSnd->kind == REPLICATION_KIND_PHYSICAL) > ConditionVariablePrepareToSleep(&WalSndCtl->wal_flush_cv); > else if (MyWalSnd->kind == REPLICATION_KIND_LOGICAL) > ConditionVariablePrepareToSleep(&WalSndCtl->wal_replay_cv); > ~ > > A walsender is either physical or logical, but here the 'wait_for_standby' flag > overrides everything. Is it OK for this to be if/else/else or should this code call > for wal_confirm_rcv_cv AND the other one? No, we cannot prepare to sleep twice(see the comment in ConditionVariablePrepareToSleep()). > ====== > src/include/catalog/pg_subscription.h > > 54. > /* > * two_phase tri-state values. See comments atop worker.c to know more > about > * these states. 
> */ > #define LOGICALREP_TWOPHASE_STATE_DISABLED 'd' > #define LOGICALREP_TWOPHASE_STATE_PENDING 'p' > #define LOGICALREP_TWOPHASE_STATE_ENABLED 'e' > > #define LOGICALREP_FAILOVER_STATE_DISABLED 'd' > #define LOGICALREP_FAILOVER_STATE_PENDING 'p' > #define LOGICALREP_FAILOVER_STATE_ENABLED 'e' > > ~ > > 54a. > There should either be another comment (like the 'two_phase tri-state' > one) added for the FAILOVER states or that existing comment should be > expanded so that it also mentions the 'failover' tri-states. > > ~ > > 54b. > Idea: If you are willing to change the constant names (not the values) of the > current tri-states then now both the 'two_phase' and 'failover' > could share them -- I also think this might give the ability to create macros (if > wanted) or to share more code instead of always handling failover and > two_phase separately. > > SUGGESTION > #define LOGICALREP_TRISTATE_DISABLED 'd' > #define LOGICALREP_TRISTATE_PENDING 'p' > #define LOGICALREP_TRISTATE_ENABLED 'e' I am not sure about the idea, but if others also prefer this then we can adjust the code. ~~~ On Wednesday, November 22, 2023 3:42 PM Peter Smith <smithpb2250@gmail.com> wrote: > 6. > +# The subscription that's up and running and is enabled for failover # > +doesn't get the data from primary and keeps waiting for the # standby > +specified in standby_slot_names. > +$result = $subscriber1->safe_psql('postgres', > + "SELECT count(*) = 0 FROM tab_int;"); > +is($result, 't', "subscriber1 doesn't get data from primary until > standby1 acknowledges changes"); > > Might it be better to write as "SELECT count(*) = $primary_row_count FROM > tab_int;" and expect it to return false? Ensuring the number is 0 looks better to me. Attach the V38 patch set which addressed all comments in [1][2] except for the ones that mentioned above. [1] https://www.postgresql.org/message-id/CAHut%2BPv-yu71ogj_hRi6cCtmD55bsyw7XTxj1Nq8yVFKpY3NDQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAHut%2BPuEGX5kr0xh06yv8ndoAQvDNedoec1OqOq3GMxDN6p%3D9A%40mail.gmail.com Best Regards, Hou zj
Attachment
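For readers following the standby_slot_names discussion above, here is a minimal configuration sketch of the behaviour the patch set describes; the slot name is illustrative and not taken from the patch:

    # postgresql.conf on the primary
    standby_slot_names = 'standby1_phys'    # physical slot(s) used by the standby(s)

With such a setting, a walsender streaming from a failover-enabled logical slot is expected to hold back changes until every physical slot listed here has confirmed the corresponding WAL position, while slots in the list that turn out to be invalidated, dropped, or logical are skipped with a WARNING, as discussed above.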
On Monday, November 27, 2023 12:03 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Attach the V38 patch set which addressed all comments in [1][2] except for the > ones that mentioned above. > > [1] > https://www.postgresql.org/message-id/CAHut%2BPv-yu71ogj_hRi6cCtmD5 > 5bsyw7XTxj1Nq8yVFKpY3NDQ%40mail.gmail.com > [2] > https://www.postgresql.org/message-id/CAHut%2BPuEGX5kr0xh06yv8ndoA > QvDNedoec1OqOq3GMxDN6p%3D9A%40mail.gmail.com I didn't increment the patch version, sorry for that. Attach the same patch set but increment the patch version to V39. Best Regards, Hou zj
Attachment
Hi, On 11/27/23 7:02 AM, Zhijie Hou (Fujitsu) wrote: > On Monday, November 27, 2023 12:03 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: >> >> Attach the V38 patch set which addressed all comments in [1][2] except for the >> ones that mentioned above. >> >> [1] >> https://www.postgresql.org/message-id/CAHut%2BPv-yu71ogj_hRi6cCtmD5 >> 5bsyw7XTxj1Nq8yVFKpY3NDQ%40mail.gmail.com >> [2] >> https://www.postgresql.org/message-id/CAHut%2BPuEGX5kr0xh06yv8ndoA >> QvDNedoec1OqOq3GMxDN6p%3D9A%40mail.gmail.com > > I didn't increment the patch version, sorry for that. Attach the same patch set > but increment the patch version to V39. Thanks! It looks like v39 does not contain (some / all?) the changes that have been done in v38 [1]. For example, slot_exists_in_list() still exists in v39 while it was renamed to validate_sync_slot() in v38. [1]: https://www.postgresql.org/message-id/CAJpy0uD6dWUvBgy8MGdugf_Am4pLXTL_vqcwSeHO13v%2BMzc9KA%40mail.gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Nov 27, 2023 at 2:15 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 11/27/23 7:02 AM, Zhijie Hou (Fujitsu) wrote: > > On Monday, November 27, 2023 12:03 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > >> > >> Attach the V38 patch set which addressed all comments in [1][2] except for the > >> ones that mentioned above. > >> > >> [1] > >> https://www.postgresql.org/message-id/CAHut%2BPv-yu71ogj_hRi6cCtmD5 > >> 5bsyw7XTxj1Nq8yVFKpY3NDQ%40mail.gmail.com > >> [2] > >> https://www.postgresql.org/message-id/CAHut%2BPuEGX5kr0xh06yv8ndoA > >> QvDNedoec1OqOq3GMxDN6p%3D9A%40mail.gmail.com > > > > I didn't increment the patch version, sorry for that. Attach the same patch set > > but increment the patch version to V39. > > Thanks! > > It looks like v39 does not contain (some / all?) the changes that have been > done in v38 [1]. > > For example, slot_exists_in_list() still exists in v39 while it was renamed to > validate_sync_slot() in v38. > Yes, I noticed that and informed Hou-san about this. New patches will be posted soon with the correction. Meanwhile, please review v38 instead if you intend to review patch002 right now. v39 is supposed to have changes in patch001 alone. thanks Shveta
On Monday, November 27, 2023 4:51 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Nov 27, 2023 at 2:15 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On 11/27/23 7:02 AM, Zhijie Hou (Fujitsu) wrote: > > > On Monday, November 27, 2023 12:03 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > >> > > >> Attach the V38 patch set which addressed all comments in [1][2] > > >> except for the ones that mentioned above. > > >> > > >> [1] > > >> > https://www.postgresql.org/message-id/CAHut%2BPv-yu71ogj_hRi6cCtmD5 > > >> 5bsyw7XTxj1Nq8yVFKpY3NDQ%40mail.gmail.com > > >> [2] > > >> > https://www.postgresql.org/message-id/CAHut%2BPuEGX5kr0xh06yv8ndoA > > >> QvDNedoec1OqOq3GMxDN6p%3D9A%40mail.gmail.com > > > > > > I didn't increment the patch version, sorry for that. Attach the > > > same patch set but increment the patch version to V39. > > > > Thanks! > > > > It looks like v39 does not contain (some / all?) the changes that have > > been done in v38 [1]. > > > > For example, slot_exists_in_list() still exists in v39 while it was > > renamed to > > validate_sync_slot() in v38. > > > > Yes, I noticed that and informed Hou-san about this. New patches will be > posted soon with the correction. Meanwhile, please review v38 instead if you > intend to review patch002 right now. v39 is supposed to have changes in > patch001 alone. Here is the updated version(v39_2) which include all the changes made in 0002. Please use for review, and sorry for the confusion. Best Regards, Hou zj
Attachment
On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Here is the updated version(v39_2) which include all the changes made in 0002. > Please use for review, and sorry for the confusion. > --- a/src/backend/replication/logical/launcher.c +++ b/src/backend/replication/logical/launcher.c @@ -8,20 +8,27 @@ * src/backend/replication/logical/launcher.c * * NOTES - * This module contains the logical replication worker launcher which - * uses the background worker infrastructure to start the logical - * replication workers for every enabled subscription. + * This module contains the replication worker launcher which + * uses the background worker infrastructure to: + * a) start the logical replication workers for every enabled subscription + * when not in standby_mode. + * b) start the slot sync worker for logical failover slots synchronization + * from the primary server when in standby_mode. I was wondering do we really need a launcher on standby to invoke sync-slot worker. If so, why? I guess it may be required for previous versions where we were managing work for multiple slot-sync workers which is also questionable in the sense of whether launcher is the right candidate for the same but now with the single slot-sync worker, it doesn't seem worth having it. What do you think? -- With Regards, Amit Kapila.
Hi, On 11/6/23 2:30 AM, Zhijie Hou (Fujitsu) wrote: > On Friday, November 3, 2023 7:32 PM Amit Kapila <amit.kapila16@gmail.com> >> >> I don't see a corresponding change in repl_gram.y. I think the following part of >> the code needs to be changed: >> /* CREATE_REPLICATION_SLOT slot [TEMPORARY] LOGICAL plugin [options] */ >> | K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT >> create_slot_options >> > > I think after 0266e98, we started to use the new syntax(see the > generic_option_list rule) and we can avoid changing the repl_gram.y when adding > new options. The new failover can be detected when parsing the generic option > list(in parseCreateReplSlotOptions). Did not look in details but it looks like there is more to do here as this is failing (with v39_2): " postgres@primary: psql replication=database psql (17devel) Type "help" for help. postgres=# CREATE_REPLICATION_SLOT test_logical20 LOGICAL pgoutput FAILOVER; ERROR: syntax error " Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Monday, November 27, 2023 8:05 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: Hi, > On 11/6/23 2:30 AM, Zhijie Hou (Fujitsu) wrote: > > On Friday, November 3, 2023 7:32 PM Amit Kapila > <amit.kapila16@gmail.com> > >> > >> I don't see a corresponding change in repl_gram.y. I think the following part > of > >> the code needs to be changed: > >> /* CREATE_REPLICATION_SLOT slot [TEMPORARY] LOGICAL plugin [options] > */ > >> | K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT > >> create_slot_options > >> > > > > I think after 0266e98, we started to use the new syntax(see the > > generic_option_list rule) and we can avoid changing the repl_gram.y when > adding > > new options. The new failover can be detected when parsing the generic > option > > list(in parseCreateReplSlotOptions). > > Did not look in details but it looks like there is more to do here as > this is failing (with v39_2): > > " > postgres@primary: psql replication=database > psql (17devel) > Type "help" for help. > > postgres=# CREATE_REPLICATION_SLOT test_logical20 LOGICAL pgoutput > FAILOVER; > ERROR: syntax error I think the command you executed is of old syntax style, which was kept for compatibility with older releases. And I think we can avoid supporting new option for the old syntax as described in the original thread[1] of commit 0266e98. So, the "syntax error" is as expected IIUC. The new style command is like: CREATE_REPLICATION_SLOT test_logical20 LOGICAL pgoutput (FAILOVER); [1] https://www.postgresql.org/message-id/CA%2BTgmobAczXDRO_Gr2euo_TxgzaH1JxbNxvFx%3DHYvBinefNH8Q%40mail.gmail.com Best Regards, Hou zj
Hi, On 11/27/23 1:23 PM, Zhijie Hou (Fujitsu) wrote: > On Monday, November 27, 2023 8:05 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: >> Did not look in details but it looks like there is more to do here as >> this is failing (with v39_2): >> >> " >> postgres@primary: psql replication=database >> psql (17devel) >> Type "help" for help. >> >> postgres=# CREATE_REPLICATION_SLOT test_logical20 LOGICAL pgoutput >> FAILOVER; >> ERROR: syntax error > > I think the command you executed is of old syntax style, which was kept for > compatibility with older releases. And I think we can avoid supporting new > option for the old syntax as described in the original thread[1] of commit > 0266e98. So, the "syntax error" is as expected IIUC. > > The new style command is like: > CREATE_REPLICATION_SLOT test_logical20 LOGICAL pgoutput (FAILOVER); > > [1] https://www.postgresql.org/message-id/CA%2BTgmobAczXDRO_Gr2euo_TxgzaH1JxbNxvFx%3DHYvBinefNH8Q%40mail.gmail.com > Oh, I see, thanks for pointing out. Well, not related to that thread but it seems weird to me that the backward compatibility is done at the "option" level then. I think it would make more sense to support all the options if the old syntax is still supported. For example, having postgres=# CREATE_REPLICATION_SLOT test_logical2 LOGICAL pgoutput TWO_PHASE; working fine but CREATE_REPLICATION_SLOT test_logical3 LOGICAL pgoutput FAILOVER; failing looks weird to me. But that's probably out of this thread's context anyway. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
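Putting the two observations above side by side as a quick session sketch (the slot names are the ones used in the preceding mails): the old keyword-style syntax still works for options that predate the generic option list introduced by commit 0266e98, while new options such as FAILOVER are only accepted through the parenthesized list:

    $ psql "dbname=postgres replication=database"
    postgres=# CREATE_REPLICATION_SLOT test_logical2 LOGICAL pgoutput TWO_PHASE;
    postgres=# CREATE_REPLICATION_SLOT test_logical20 LOGICAL pgoutput (FAILOVER);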
Hi, On 11/27/23 1:23 PM, Zhijie Hou (Fujitsu) wrote: > On Monday, November 27, 2023 8:05 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > >> On 11/6/23 2:30 AM, Zhijie Hou (Fujitsu) wrote: >>> On Friday, November 3, 2023 7:32 PM Amit Kapila >> <amit.kapila16@gmail.com> >>>> >>>> I don't see a corresponding change in repl_gram.y. I think the following part >> of >>>> the code needs to be changed: >>>> /* CREATE_REPLICATION_SLOT slot [TEMPORARY] LOGICAL plugin [options] >> */ >>>> | K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT >>>> create_slot_options >>>> >>> >>> I think after 0266e98, we started to use the new syntax(see the >>> generic_option_list rule) and we can avoid changing the repl_gram.y when >> adding >>> new options. The new failover can be detected when parsing the generic >> option >>> list(in parseCreateReplSlotOptions). >> >> Did not look in details but it looks like there is more to do here as >> this is failing (with v39_2): >> >> " >> postgres@primary: psql replication=database >> psql (17devel) >> Type "help" for help. >> >> postgres=# CREATE_REPLICATION_SLOT test_logical20 LOGICAL pgoutput >> FAILOVER; >> ERROR: syntax error > > I think the command you executed is of old syntax style, which was kept for > compatibility with older releases. And I think we can avoid supporting new > option for the old syntax as described in the original thread[1] of commit > 0266e98. So, the "syntax error" is as expected IIUC. > > The new style command is like: > CREATE_REPLICATION_SLOT test_logical20 LOGICAL pgoutput (FAILOVER); > If / As we are not going to support the old syntax for the FAILOVER option so I think we can get rid of the check on "use_new_options_syntax" here: - + if (failover) + { + appendStringInfoString(&cmd, "FAILOVER"); + if (use_new_options_syntax) + appendStringInfoString(&cmd, ", "); + else + appendStringInfoChar(&cmd, ' '); + } as we'd error out before if using the old syntax. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > Here is the updated version(v39_2) which include all the changes made in 0002. > > Please use for review, and sorry for the confusion. > > > > --- a/src/backend/replication/logical/launcher.c > +++ b/src/backend/replication/logical/launcher.c > @@ -8,20 +8,27 @@ > * src/backend/replication/logical/launcher.c > * > * NOTES > - * This module contains the logical replication worker launcher which > - * uses the background worker infrastructure to start the logical > - * replication workers for every enabled subscription. > + * This module contains the replication worker launcher which > + * uses the background worker infrastructure to: > + * a) start the logical replication workers for every enabled subscription > + * when not in standby_mode. > + * b) start the slot sync worker for logical failover slots synchronization > + * from the primary server when in standby_mode. > > I was wondering do we really need a launcher on standby to invoke > sync-slot worker. If so, why? I guess it may be required for previous > versions where we were managing work for multiple slot-sync workers > which is also questionable in the sense of whether launcher is the > right candidate for the same but now with the single slot-sync worker, > it doesn't seem worth having it. What do you think? > > -- Yes, earlier a manager process was needed to manage multiple slot-sync workers and distribute load among them, but now that does not seem necessary. I gave it a try (PoC) and it seems to work well. If there are no objections to this approach, I can share the patch soon. thanks Shveta
Hi, On 11/28/23 4:13 AM, shveta malik wrote: > On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu) >> <houzj.fnst@fujitsu.com> wrote: >>> >>> Here is the updated version(v39_2) which include all the changes made in 0002. >>> Please use for review, and sorry for the confusion. >>> >> >> --- a/src/backend/replication/logical/launcher.c >> +++ b/src/backend/replication/logical/launcher.c >> @@ -8,20 +8,27 @@ >> * src/backend/replication/logical/launcher.c >> * >> * NOTES >> - * This module contains the logical replication worker launcher which >> - * uses the background worker infrastructure to start the logical >> - * replication workers for every enabled subscription. >> + * This module contains the replication worker launcher which >> + * uses the background worker infrastructure to: >> + * a) start the logical replication workers for every enabled subscription >> + * when not in standby_mode. >> + * b) start the slot sync worker for logical failover slots synchronization >> + * from the primary server when in standby_mode. >> >> I was wondering do we really need a launcher on standby to invoke >> sync-slot worker. If so, why? I guess it may be required for previous >> versions where we were managing work for multiple slot-sync workers >> which is also questionable in the sense of whether launcher is the >> right candidate for the same but now with the single slot-sync worker, >> it doesn't seem worth having it. What do you think? >> >> -- > > Yes, earlier a manager process was needed to manage multiple slot-sync > workers and distribute load among them, but now that does not seem > necessary. I gave it a try (PoC) and it seems to work well. If there > are no objections to this approach, I can share the patch soon. > +1 on this new approach, thanks! Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 11/2/23 1:27 AM, Zhijie Hou (Fujitsu) wrote: > On Tuesday, October 31, 2023 6:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >> We have create_replication_slot and drop_replication_slot in repl_gram.y. How >> about if introduce alter_replication_slot and handle the 'failover' flag with that? >> The idea is we will either enable 'failover' at the time create_replication_slot by >> providing an optional failover option or execute a separate command >> alter_replication_slot. I think we probably need to perform this command >> before the start of streaming. > > Here is an attempt to achieve the same. I added a new replication command > alter_replication_slot and introduced a walreceiver api walrcv_alter_slot to > execute the command. The subscription will call the api to enable/disable > the failover of the slot on publisher. > > The patch disallows altering the failover option for the subscription. But we > could release the restriction by using the following approaches in next version: > >> I think we will have the following options to allow alter of the 'failover' >> property: (a) we can allow altering 'failover' only for the 'disabled' >> subscription; to achieve that, we need to open a connection during alter >> subscription and change this property of slot; (b) apply worker detects the >> change in 'failover' option; run the alter_replication_slot command; this needs >> more analysis as apply_worker is already doing streaming and changing slot >> property in between could be tricky. > What do you think about also adding a pg_alter_logical_replication_slot() or such function? That would allow users to alter manually created logical replication slots without the need to make a replication connection. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
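To make the suggestion above concrete, a purely hypothetical invocation; neither this function nor its signature exists anywhere yet, and the named parameter is only an assumption about what such an interface could look like:

    -- hypothetical SQL-level alternative to altering the slot over a
    -- replication connection
    SELECT pg_alter_logical_replication_slot('my_manual_slot', failover => true);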
On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 11/28/23 4:13 AM, shveta malik wrote: > > On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu) > >> <houzj.fnst@fujitsu.com> wrote: > >>> > >>> Here is the updated version(v39_2) which include all the changes made in 0002. > >>> Please use for review, and sorry for the confusion. > >>> > >> > >> --- a/src/backend/replication/logical/launcher.c > >> +++ b/src/backend/replication/logical/launcher.c > >> @@ -8,20 +8,27 @@ > >> * src/backend/replication/logical/launcher.c > >> * > >> * NOTES > >> - * This module contains the logical replication worker launcher which > >> - * uses the background worker infrastructure to start the logical > >> - * replication workers for every enabled subscription. > >> + * This module contains the replication worker launcher which > >> + * uses the background worker infrastructure to: > >> + * a) start the logical replication workers for every enabled subscription > >> + * when not in standby_mode. > >> + * b) start the slot sync worker for logical failover slots synchronization > >> + * from the primary server when in standby_mode. > >> > >> I was wondering do we really need a launcher on standby to invoke > >> sync-slot worker. If so, why? I guess it may be required for previous > >> versions where we were managing work for multiple slot-sync workers > >> which is also questionable in the sense of whether launcher is the > >> right candidate for the same but now with the single slot-sync worker, > >> it doesn't seem worth having it. What do you think? > >> > >> -- > > > > Yes, earlier a manager process was needed to manage multiple slot-sync > > workers and distribute load among them, but now that does not seem > > necessary. I gave it a try (PoC) and it seems to work well. If there > > are no objections to this approach, I can share the patch soon. > > > > +1 on this new approach, thanks! PFA v40. This patch has removed Logical Replication Launcher support to launch slotsync worker. The slot-sync worker is now registered as bgworker with postmaster, with bgw_start_time=BgWorkerStart_ConsistentState and bgw_restart_time=60sec. On removal of launcher, now all the validity checks have been shifted to slot-sync worker itself. This brings us to some point of concerns: a) We still need to maintain RecoveryInProgress() check in slotsync worker. Since worker has the start time of BgWorkerStart_ConsistentState, it will be started on non-standby as well. So to ensure that it exists on non-standby, "RecoveryInProgress" has been introduced at the beginning of the worker. But once it exits, postmaster will not restart it since it will be clean-exist i.e. proc_exit(0) (the restart logic of postmaster comes into play only when there is an abnormal exit). But to exit for the first time on non-standby, we need that Recovery related check in worker. b) "enable_syncslot" check is moved to slotsync worker now. Since enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to exit the worker if 'enable_syncslot' is found to be disabled. 'proc_exit(1)' has been used in order to ensure that the worker is restarted and GUCs are checked again after restart_time. Downside of this approach is, if someone has kept "enable_syncslot" as disabled permanently even on standby, slotsync worker will keep on restarting and exiting. 
So to overcome the above pain-points, I think a potential approach will be to start the slotsync worker only if 'enable_syncslot' is on and the system is non-standby. Potential ways (each with some issues) are: 1) Use the current way i.e. register the slot-sync worker as a bgworker with the postmaster, but introduce extra checks in 'maybe_start_bgworkers'. But this seems more like a hack. This will need extra changes as currently, once 'maybe_start_bgworkers' is attempted by the postmaster, it will attempt again to start any worker only if the worker had an abnormal exit and restart_time != 0. The current postmaster will not attempt to start a worker on any GUC change. 2) Another way may be to treat the slotsync worker as a special case and separate out the start/restart of the slotsync worker from the bgworker machinery, following what we do for the autovacuum launcher (StartAutoVacLauncher) to keep starting it in the postmaster loop (ServerLoop). In this way, we may be able to add more checks before starting the worker. But by opting for this approach, we will have to manage the slotsync worker completely by ourselves as it will no longer be part of the existing bgworker-registration infra. If this seems okay and there are no other better options, it can be analyzed further in detail. 3) Another approach could be, in order to solve issue (a), to introduce a new start_time 'BgWorkerStart_ConsistentState_HotStandby', which means start a bgworker only if a consistent state is reached and the system is standby. And for issue (b), let's retain the check of enable_syncslot in the worker itself but make it 'PGC_POSTMASTER'. This will ensure we can safely exit the worker (proc_exit(0)) if enable_syncslot is disabled, and the postmaster will not restart it. But I'm not sure if making it "PGC_POSTMASTER" is acceptable from the user's perspective. thanks Shveta
Attachment
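For context, here is a rough sketch of what registering such a worker with the postmaster looks like, using the parameters mentioned above (BgWorkerStart_ConsistentState, 60 second restart time); the function name, entry point and display names below are illustrative assumptions, not the ones used in the patch:

    #include "postgres.h"
    #include "postmaster/bgworker.h"

    void
    SlotSyncWorkerRegister(void)
    {
        BackgroundWorker bgw;

        memset(&bgw, 0, sizeof(bgw));
        /* needs shared memory access and a database connection */
        bgw.bgw_flags = BGWORKER_SHMEM_ACCESS |
            BGWORKER_BACKEND_DATABASE_CONNECTION;
        /* start once a consistent state has been reached */
        bgw.bgw_start_time = BgWorkerStart_ConsistentState;
        /* restart 60 seconds after an abnormal exit */
        bgw.bgw_restart_time = 60;
        snprintf(bgw.bgw_library_name, sizeof(bgw.bgw_library_name), "postgres");
        snprintf(bgw.bgw_function_name, sizeof(bgw.bgw_function_name),
                 "ReplSlotSyncWorkerMain");
        snprintf(bgw.bgw_name, sizeof(bgw.bgw_name),
                 "replication slot sync worker");
        snprintf(bgw.bgw_type, sizeof(bgw.bgw_type),
                 "replication slot sync worker");
        bgw.bgw_main_arg = (Datum) 0;
        bgw.bgw_notify_pid = 0;

        RegisterBackgroundWorker(&bgw);
    }

Point of concern (a) above follows directly from the bgw_start_time line: with BgWorkerStart_ConsistentState the postmaster starts the worker on a primary too, which is why the worker itself has to check RecoveryInProgress() and exit cleanly; the alternatives (1)-(3) discussed just above would move that decision to before the worker is started at all.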
On Tue, Nov 28, 2023 at 3:10 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On 11/28/23 4:13 AM, shveta malik wrote: > > > On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > >> > > >> On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu) > > >> <houzj.fnst@fujitsu.com> wrote: > > >>> > > >>> Here is the updated version(v39_2) which include all the changes made in 0002. > > >>> Please use for review, and sorry for the confusion. > > >>> > > >> > > >> --- a/src/backend/replication/logical/launcher.c > > >> +++ b/src/backend/replication/logical/launcher.c > > >> @@ -8,20 +8,27 @@ > > >> * src/backend/replication/logical/launcher.c > > >> * > > >> * NOTES > > >> - * This module contains the logical replication worker launcher which > > >> - * uses the background worker infrastructure to start the logical > > >> - * replication workers for every enabled subscription. > > >> + * This module contains the replication worker launcher which > > >> + * uses the background worker infrastructure to: > > >> + * a) start the logical replication workers for every enabled subscription > > >> + * when not in standby_mode. > > >> + * b) start the slot sync worker for logical failover slots synchronization > > >> + * from the primary server when in standby_mode. > > >> > > >> I was wondering do we really need a launcher on standby to invoke > > >> sync-slot worker. If so, why? I guess it may be required for previous > > >> versions where we were managing work for multiple slot-sync workers > > >> which is also questionable in the sense of whether launcher is the > > >> right candidate for the same but now with the single slot-sync worker, > > >> it doesn't seem worth having it. What do you think? > > >> > > >> -- > > > > > > Yes, earlier a manager process was needed to manage multiple slot-sync > > > workers and distribute load among them, but now that does not seem > > > necessary. I gave it a try (PoC) and it seems to work well. If there > > > are no objections to this approach, I can share the patch soon. > > > > > > > +1 on this new approach, thanks! > > PFA v40. This patch has removed Logical Replication Launcher support > to launch slotsync worker. The slot-sync worker is now registered as > bgworker with postmaster, with > bgw_start_time=BgWorkerStart_ConsistentState and > bgw_restart_time=60sec. > > On removal of launcher, now all the validity checks have been shifted > to slot-sync worker itself. This brings us to some point of concerns: > > a) We still need to maintain RecoveryInProgress() check in slotsync > worker. Since worker has the start time of > BgWorkerStart_ConsistentState, it will be started on non-standby as > well. So to ensure that it exists on non-standby, "RecoveryInProgress" > has been introduced at the beginning of the worker. But once it exits, > postmaster will not restart it since it will be clean-exist i.e. > proc_exit(0) (the restart logic of postmaster comes into play only > when there is an abnormal exit). But to exit for the first time on > non-standby, we need that Recovery related check in worker. > > b) "enable_syncslot" check is moved to slotsync worker now. Since > enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to > exit the worker if 'enable_syncslot' is found to be disabled. > 'proc_exit(1)' has been used in order to ensure that the worker is > restarted and GUCs are checked again after restart_time. 
Downside of > this approach is, if someone has kept "enable_syncslot" as disabled > permanently even on standby, slotsync worker will keep on restarting > and exiting. > > So to overcome the above pain-points, I think a potential approach > will be to start slotsync worker only if 'enable_syncslot' is on and > the system is non-standby. Potential ways (each with some issues) are: > Correction here: start slotsync worker only if 'enable_syncslot' is on and the system is standby. > 1) Use the current way i.e. register slot-sync worker as bgworker with > postmaster, but introduce extra checks in 'maybe_start_bgworkers'. But > this seems more like a hack. This will need extra changes as currently > once 'maybe_start_bgworkers' is attempted by postmaster, it will > attempt again to start any worker only if the worker had abnormal exit > and restart_time !=0. The current postmatser will not attempt to start > worker on any GUC change. > > 2) Another way maybe to treat slotsync worker as special case and > separate out the start/restart of slotsync worker from bgworker, and > follow what we do for autovacuum launcher(StartAutoVacLauncher) to > keep starting it in the postmaster loop(ServerLoop). In this way, we > may be able to add more checks before starting worker. But by opting > this approach, we will have to manage slotsync worker completely by > ourself as it will be no longer be part of existing > bgworker-registration infra. If this seems okay and there are no other > better options, it can be analyzed further in detail. > > 3) Another approach could be, in order to solve issue (a), introduce a > new start_time 'BgWorkerStart_ConsistentState_HotStandby' which means > start a bgworker only if consistent state is reached and the system is > standby. And for issue (b), lets retain check of enable_syncslot in > the worker itself but make it 'PGC_POSTMASTER'. This will ensure we > can safely exit the worker(proc_exit(0) if enable_syncslot is disabled > and postmaster will not restart it. But I'm not sure if making it > "PGC_POSTMASTER" is acceptable from the user's perspective. > > thanks > Shveta
On Fri, Nov 17, 2023 at 5:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 16, 2023 at 5:34 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > PFA v35. > > > > Review v35-0002* > ============== > 1. > As quoted in the commit message, > > > If a logical slot is invalidated on the primary, slot on the standby is also > invalidated. If a logical slot on the primary is valid but is invalidated > on the standby due to conflict (say required rows removed on the primary), > then that slot is dropped and recreated on the standby in next sync-cycle. > It is okay to recreate such slots as long as these are not consumable on the > standby (which is the case currently). > > > > I think this won't happen normally because of the physical slot and > hot_standby_feedback but probably can occur in cases like if the user > temporarily switches hot_standby_feedback from on to off. Are there > any other reasons? I think we can mention the cases along with it as > well at least for now. Additionally, I think this should be covered in > code comments as well. > > 2. > #include "postgres.h" > - > +#include "access/genam.h" > > Spurious line removal. > > 3. > A password needs to be provided too, if the sender demands password > authentication. It can be provided in the > <varname>primary_conninfo</varname> string, or in a separate > - <filename>~/.pgpass</filename> file on the standby server (use > - <literal>replication</literal> as the database name). > - Do not specify a database name in the > - <varname>primary_conninfo</varname> string. > + <filename>~/.pgpass</filename> file on the standby server. > + </para> > + <para> > + Specify <literal>dbname</literal> in > + <varname>primary_conninfo</varname> string to allow synchronization > + of slots from the primary server to the standby server. > + This will only be used for slot synchronization. It is ignored > + for streaming. > > Is there a reason to remove part of the earlier sentence "use > <literal>replication</literal> as the database name"? > > 4. > + <primary><varname>enable_syncslot</varname> configuration > parameter</primary> > + </indexterm> > + </term> > + <listitem> > + <para> > + It enables a physical standby to synchronize logical failover slots > + from the primary server so that logical subscribers are not blocked > + after failover. > + </para> > + <para> > + It is enabled by default. This parameter can only be set in the > + <filename>postgresql.conf</filename> file or on the server > command line. > + </para> > > I think you forgot to update the documentation for the default value > of this variable. > > 5. > + * a) start the logical replication workers for every enabled subscription > + * when not in standby_mode > + * b) start the slot-sync worker for logical failover slots synchronization > + * from the primary server when in standby_mode. > > Either use a full stop after both lines or none of these. > > 6. > +static void slotsync_worker_cleanup(SlotSyncWorkerInfo * worker); > > There shouldn't be space between * and the worker. > > 7. 
> + if (!SlotSyncWorker->hdr.in_use) > + { > + LWLockRelease(SlotSyncWorkerLock); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("replication slot-sync worker not initialized, " > + "cannot attach"))); > + } > + > + if (SlotSyncWorker->hdr.proc) > + { > + LWLockRelease(SlotSyncWorkerLock); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("replication slot-sync worker is " > + "already running, cannot attach"))); > + } > > Using slot-sync in the error messages looks a bit odd to me. Can we > use "replication slot sync worker ..." in both these and other > similar messages? I think it would be better if we don't split the > messages into multiple lines in these cases as messages don't appear > too long to me. > > 8. > +/* > + * Detach the worker from DSM and update 'proc' and 'in_use'. > + * Logical replication launcher will come to know using these > + * that the worker has shutdown. > + */ > +void > +slotsync_worker_detach(int code, Datum arg) > +{ > > I think the reference to DSM is leftover from the previous version of > the patch. Can we change the above comments as per the new code? > > 9. > +static bool > +slotsync_worker_launch() > { > ... > + /* TODO: do we really need 'generation', analyse more here */ > + worker->hdr.generation++; > > We should do something about this TODO. As per my understanding, we > don't need a generation number for the slot sync worker as we have one > such worker but I guess the patch requires it because we are using > existing logical replication worker infrastructure. This brings the > question of whether we really need a separate SlotSyncWorkerInfo or if > we can use existing LogicalRepWorker and distinguish it with > LogicalRepWorkerType? I guess you didn't use it because most of the > fields in LogicalRepWorker will be unused for slot sync worker. > > 10. > + * Can't use existing functions like 'get_database_oid' from dbcommands.c for > + * validity purpose as they need db connection. > + */ > +static bool > +validate_dbname(const char *dbname) > > I don't know how important it is to validate the dbname before > launching the sync slot worker because anyway after launching, it will > give an error while initializing the connection if the dbname is > invalid. But, if we think it is really required, did you consider > using GetDatabaseTuple()? I have removed 'validate_dbname' in v40. We let dbname go through BackgroundWorkerInitializeConnection() which internally does dbname validation. Later if 'primary_conninfo' is changed and the db name specified in it is different, we exit the worker and let it get restarted which will do the validation again when it does BackgroundWorkerInitializeConnection(). > > -- > With Regards, > Amit Kapila.
Hi, On 11/27/23 9:57 AM, Zhijie Hou (Fujitsu) wrote: > On Monday, November 27, 2023 4:51 PM shveta malik <shveta.malik@gmail.com> wrote: > > Here is the updated version(v39_2) which include all the changes made in 0002. > Please use for review, and sorry for the confusion. > Thanks! As far v39_2-0001: " Altering the failover option of the subscription is currently not permitted. However, this restriction may be lifted in future versions. " Should we mention that we can alter the related replication slot? + <para> + The implementation of failover requires that replication + has successfully finished the initial table synchronization + phase. So even when <literal>failover</literal> is enabled for a + subscription, the internal failover state remains + temporarily <quote>pending</quote> until the initialization phase + completes. See column <structfield>subfailoverstate</structfield> + of <link linkend="catalog-pg-subscription"><structname>pg_subscription</structname></link> + to know the actual failover state. + </para> I think we have a corner case here. If one alter the replication slot on the primary then "subfailoverstate" is not updated accordingly on the subscriber. Given the 2 remarks above would that make sense to prevent altering a replication slot associated to a subscription? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 11/28/23 10:40 AM, shveta malik wrote: > On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> Hi, >> >> On 11/28/23 4:13 AM, shveta malik wrote: >>> On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> >>>> On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu) >>>> <houzj.fnst@fujitsu.com> wrote: >>>>> >>>>> Here is the updated version(v39_2) which include all the changes made in 0002. >>>>> Please use for review, and sorry for the confusion. >>>>> >>>> >>>> --- a/src/backend/replication/logical/launcher.c >>>> +++ b/src/backend/replication/logical/launcher.c >>>> @@ -8,20 +8,27 @@ >>>> * src/backend/replication/logical/launcher.c >>>> * >>>> * NOTES >>>> - * This module contains the logical replication worker launcher which >>>> - * uses the background worker infrastructure to start the logical >>>> - * replication workers for every enabled subscription. >>>> + * This module contains the replication worker launcher which >>>> + * uses the background worker infrastructure to: >>>> + * a) start the logical replication workers for every enabled subscription >>>> + * when not in standby_mode. >>>> + * b) start the slot sync worker for logical failover slots synchronization >>>> + * from the primary server when in standby_mode. >>>> >>>> I was wondering do we really need a launcher on standby to invoke >>>> sync-slot worker. If so, why? I guess it may be required for previous >>>> versions where we were managing work for multiple slot-sync workers >>>> which is also questionable in the sense of whether launcher is the >>>> right candidate for the same but now with the single slot-sync worker, >>>> it doesn't seem worth having it. What do you think? >>>> >>>> -- >>> >>> Yes, earlier a manager process was needed to manage multiple slot-sync >>> workers and distribute load among them, but now that does not seem >>> necessary. I gave it a try (PoC) and it seems to work well. If there >>> are no objections to this approach, I can share the patch soon. >>> >> >> +1 on this new approach, thanks! > > PFA v40. This patch has removed Logical Replication Launcher support > to launch slotsync worker. Thanks! > The slot-sync worker is now registered as > bgworker with postmaster, with > bgw_start_time=BgWorkerStart_ConsistentState and > bgw_restart_time=60sec. > > On removal of launcher, now all the validity checks have been shifted > to slot-sync worker itself. This brings us to some point of concerns: > > a) We still need to maintain RecoveryInProgress() check in slotsync > worker. Since worker has the start time of > BgWorkerStart_ConsistentState, it will be started on non-standby as > well. So to ensure that it exists on non-standby, "RecoveryInProgress" > has been introduced at the beginning of the worker. But once it exits, > postmaster will not restart it since it will be clean-exist i.e. > proc_exit(0) (the restart logic of postmaster comes into play only > when there is an abnormal exit). But to exit for the first time on > non-standby, we need that Recovery related check in worker. > > b) "enable_syncslot" check is moved to slotsync worker now. Since > enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to > exit the worker if 'enable_syncslot' is found to be disabled. > 'proc_exit(1)' has been used in order to ensure that the worker is > restarted and GUCs are checked again after restart_time. 
Downside of > this approach is, if someone has kept "enable_syncslot" as disabled > permanently even on standby, slotsync worker will keep on restarting > and exiting. > > So to overcome the above pain-points, I think a potential approach > will be to start slotsync worker only if 'enable_syncslot' is on and > the system is non-standby. That makes sense to me. > Potential ways (each with some issues) are: > > 1) Use the current way i.e. register slot-sync worker as bgworker with > postmaster, but introduce extra checks in 'maybe_start_bgworkers'. But > this seems more like a hack. This will need extra changes as currently > once 'maybe_start_bgworkers' is attempted by postmaster, it will > attempt again to start any worker only if the worker had abnormal exit > and restart_time !=0. The current postmatser will not attempt to start > worker on any GUC change. > > 2) Another way maybe to treat slotsync worker as special case and > separate out the start/restart of slotsync worker from bgworker, and > follow what we do for autovacuum launcher(StartAutoVacLauncher) to > keep starting it in the postmaster loop(ServerLoop). In this way, we > may be able to add more checks before starting worker. But by opting > this approach, we will have to manage slotsync worker completely by > ourself as it will be no longer be part of existing > bgworker-registration infra. If this seems okay and there are no other > better options, it can be analyzed further in detail. > > 3) Another approach could be, in order to solve issue (a), introduce a > new start_time 'BgWorkerStart_ConsistentState_HotStandby' which means > start a bgworker only if consistent state is reached and the system is > standby. And for issue (b), lets retain check of enable_syncslot in > the worker itself but make it 'PGC_POSTMASTER'. This will ensure we > can safely exit the worker(proc_exit(0) if enable_syncslot is disabled > and postmaster will not restart it. But I'm not sure if making it > "PGC_POSTMASTER" is acceptable from the user's perspective. I had the same idea (means make enable_syncslot as 'PGC_POSTMASTER') when reading b). I'm +1 on it (at least for V1) as I don't think that this parameter value would change frequently. Curious to know what others think too. Then as far a) is concerned, I'd vote for introducing a new BgWorkerStart_ConsistentState_HotStandby. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Nov 28, 2023 at 2:17 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 11/2/23 1:27 AM, Zhijie Hou (Fujitsu) wrote: > > On Tuesday, October 31, 2023 6:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> We have create_replication_slot and drop_replication_slot in repl_gram.y. How > >> about if introduce alter_replication_slot and handle the 'failover' flag with that? > >> The idea is we will either enable 'failover' at the time create_replication_slot by > >> providing an optional failover option or execute a separate command > >> alter_replication_slot. I think we probably need to perform this command > >> before the start of streaming. > > > > Here is an attempt to achieve the same. I added a new replication command > > alter_replication_slot and introduced a walreceiver api walrcv_alter_slot to > > execute the command. The subscription will call the api to enable/disable > > the failover of the slot on publisher. > > > > The patch disallows altering the failover option for the subscription. But we > > could release the restriction by using the following approaches in next version: > > > >> I think we will have the following options to allow alter of the 'failover' > >> property: (a) we can allow altering 'failover' only for the 'disabled' > >> subscription; to achieve that, we need to open a connection during alter > >> subscription and change this property of slot; (b) apply worker detects the > >> change in 'failover' option; run the alter_replication_slot command; this needs > >> more analysis as apply_worker is already doing streaming and changing slot > >> property in between could be tricky. > > > > What do you think about also adding a pg_alter_logical_replication_slot() or such > function? > > That would allow users to alter manually created logical replication slots without > the need to make a replication connection. > But then won't that make it inconsistent with the subscription failover state? I think if we don't have a simple solution for this, we can always do it as an enhancement to the main feature once we have good ideas to solve it. -- With Regards, Amit Kapila.
Hi. Here are some review comments for the patch v39_2-0001. Multiple items from my previous review [1] seemed unanswered, so it wasn't clear if they were discarded because they were wrong or maybe accidentally missed. I've repeated all those again here, as well as some new comments. ====== 1. General. Previously (see [1] #0) I asked a question about whether there is some documentation missing. Seems not yet answered. ====== Commit message 2. Users can set this flag during CREATE SUBSCRIPTION or during pg_create_logical_replication_slot API. Examples: CREATE SUBSCRIPTION mysub CONNECTION '..' PUBLICATION mypub WITH (failover = true); (failover is the last arg) SELECT * FROM pg_create_logical_replication_slot('myslot', 'pgoutput', false, true, true); ~ I felt it is better to say "Ex1" / "Ex2" (or "1" / "2" or something similar) to better indicate where these examples start and finish, otherwise they just sort of get lost among the text. ====== doc/src/sgml/catalogs.sgml 3. From previous review ([1] #6) I suggested reordering fields. Hou-san wrote: "but I think the functionality of two fields are different and I didn’t find the correlation between two fields except for the type of value." Yes, that is true. OTOH, I felt grouping the attributes by the same types made the docs easier to read. ====== src/backend/commands/subscriptioncmds.c 4. CreateSubscription + /* + * If only the slot_name is specified (without create_slot option), + * it is possible that the user intends to use an existing slot on + * the publisher, so here we enable failover for the slot if + * requested. + */ + else if (opts.slot_name && failover_enabled) + { Unanswered question from previous review (see [1] #11a). i.e. How does this condition ensure that *only* the slot name was specified (like the comment is saying)? ~~~ 5. AlterSubscription errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when two_phase is enabled"), errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false, or use DROP/CREATE SUBSCRIPTION."))); + if (sub->failoverstate == LOGICALREP_FAILOVER_STATE_ENABLED && opts.copy_data) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when failover is enabled"), + errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, or with copy_data = false, or use DROP/CREATE SUBSCRIPTION."))); + There are translation issues, the same as reported in my previous review (see [1] #12b and also several other places as noted in [1]). Hou-san replied that I "can start a separate thread to change the twophase related messages, and we can change accordingly if it's accepted.", but that's not right IMO because it is only the fact that this syncslot patch is reusing a similar message that warrants the need to extract a "common" message part in the first place. So I think it is the responsibility of this syncslot patch to make this change. ====== src/backend/replication/logical/tablesync.c 6. 
process_syncing_tables_for_apply + if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING) + ereport(LOG, + (errmsg("logical replication apply worker for subscription \"%s\" will restart so that two_phase can be enabled", + MySubscription->name))); + + if (MySubscription->failoverstate == LOGICALREP_FAILOVER_STATE_PENDING) + ereport(LOG, + (errmsg("logical replication apply worker for subscription \"%s\" will restart so that failover can be enabled", + MySubscription->name))); 6a. You may end up logging 2 restart messages for the same restart. Is that OK? ~ 6b. This is another example where you should share the same common message (for fewer translations) ====== src/backend/replication/logical/worker.c 7. + * The logical slot on the primary can be synced to the standby by specifying + * the failover = true when creating the subscription. Enabling failover allows + * us to smoothly transition to the standby in case the primary gets promoted, + * ensuring that we can subscribe to the new primary without losing any data. /the failover = true/the failover = true option/ or /the failover = true/failover = true/ ~~~ 8. + #include "postgres.h" Unnecessary extra blank line ====== src/backend/replication/slot.c 9. validate_standby_slots There was no reply to the comment in my previous review (see [1] #27). Maybe you disagree or maybe it was accidentally overlooked? ~~~ 10. check_standby_slot_names In previous review I asked ([1] #28) why a special check was needed for "*". Hou-san replied that "SplitIdentifierString() does not give error for '*' and '*' can be considered as valid value which if accepted can mislead user". Sure, but won't the code then just try to find if there is a replication slot called "*" and that will fail? That was my point, if the slot name lookup is going to fail anyway then why have the extra code for the special "*" case up-front? Note -- I haven't tried it, so maybe code doesn't work like I think it does. ====== src/backend/replication/walsender.c 11. PhysicalWakeupLogicalWalSnd No reply to my previous review comment ([1] #33). Not done? Disagreed, or accidentally missed? ~~~ 12. WalSndFilterStandbySlots + /* + * If logical slot name is given in standby_slot_names, give WARNING + * and skip it. Since it is harmless, so WARNING should be enough, no + * need to error-out. + */ + else if (SlotIsLogical(slot)) + warningfmt = _("cannot have logical replication slot \"%s\" in parameter \"%s\", ignoring"); I previously raised an issue (see [1] #35) thinking this could not happen. Hou-san explained how it might happen ("user could drop the physical slot and recreate a logical slot with the same name without changing the GUC.") so this code was necessary. That is OK, but I think that same explanation should also go into the code comment. ~~~ 13. WalSndFilterStandbySlots + standby_slots_cpy = foreach_delete_current(standby_slots_cpy, lc); I previously raised an issue (see [1] #36). Hou-san replied "I think it's OK to remove slots if it's invalidated, dropped, or was changed to logical one as we don't need to wait for these slots to catch up anymore." Sure, maybe the code is fine, but my point was that the code is removing elements in *more* scenarios than are mentioned by the function comment, so maybe update that function comment to cover all the removal scenarios. ~~~ 14. WalSndWaitForStandbyConfirmation The comment change from my previous review ([1] #37) not done. Disagreed, or accidentally missed? ~~~ 15. 
WalSndWaitForStandbyConfirmation The question about calling ConditionVariablePrepareToSleep in my previous review ([1] #39) not answered. Accidentally missed? ~~~ 16. WalSndWaitForWal if (RecentFlushPtr != InvalidXLogRecPtr && loc <= RecentFlushPtr) - return RecentFlushPtr; + { + WalSndFilterStandbySlots(RecentFlushPtr, &standby_slots); It is better to use XLogRecPtrIsInvalid macro here. I know it was not strictly added by your patch, but so much else changed nearby so I thought this should be fixed at the same time. ====== src/bin/pg_upgrade/info.c 17. get_old_cluster_logical_slot_infos + slotinfos = (LogicalSlotInfo *) pg_malloc(sizeof(LogicalSlotInfo) * num_slots); Excessive whitespace. ====== [1] My previous review of v35-0001. https://www.postgresql.org/message-id/CAHut%2BPv-yu71ogj_hRi6cCtmD55bsyw7XTxj1Nq8yVFKpY3NDQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Nov 28, 2023 at 9:28 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 11/28/23 10:40 AM, shveta malik wrote: > > On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> Hi, > >> > >> On 11/28/23 4:13 AM, shveta malik wrote: > >>> On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > >>>> > >>>> On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu) > >>>> <houzj.fnst@fujitsu.com> wrote: > >>>>> > >>>>> Here is the updated version(v39_2) which include all the changes made in 0002. > >>>>> Please use for review, and sorry for the confusion. > >>>>> > >>>> > >>>> --- a/src/backend/replication/logical/launcher.c > >>>> +++ b/src/backend/replication/logical/launcher.c > >>>> @@ -8,20 +8,27 @@ > >>>> * src/backend/replication/logical/launcher.c > >>>> * > >>>> * NOTES > >>>> - * This module contains the logical replication worker launcher which > >>>> - * uses the background worker infrastructure to start the logical > >>>> - * replication workers for every enabled subscription. > >>>> + * This module contains the replication worker launcher which > >>>> + * uses the background worker infrastructure to: > >>>> + * a) start the logical replication workers for every enabled subscription > >>>> + * when not in standby_mode. > >>>> + * b) start the slot sync worker for logical failover slots synchronization > >>>> + * from the primary server when in standby_mode. > >>>> > >>>> I was wondering do we really need a launcher on standby to invoke > >>>> sync-slot worker. If so, why? I guess it may be required for previous > >>>> versions where we were managing work for multiple slot-sync workers > >>>> which is also questionable in the sense of whether launcher is the > >>>> right candidate for the same but now with the single slot-sync worker, > >>>> it doesn't seem worth having it. What do you think? > >>>> > >>>> -- > >>> > >>> Yes, earlier a manager process was needed to manage multiple slot-sync > >>> workers and distribute load among them, but now that does not seem > >>> necessary. I gave it a try (PoC) and it seems to work well. If there > >>> are no objections to this approach, I can share the patch soon. > >>> > >> > >> +1 on this new approach, thanks! > > > > PFA v40. This patch has removed Logical Replication Launcher support > > to launch slotsync worker. > > Thanks! > > > The slot-sync worker is now registered as > > bgworker with postmaster, with > > bgw_start_time=BgWorkerStart_ConsistentState and > > bgw_restart_time=60sec. > > > > On removal of launcher, now all the validity checks have been shifted > > to slot-sync worker itself. This brings us to some point of concerns: > > > > a) We still need to maintain RecoveryInProgress() check in slotsync > > worker. Since worker has the start time of > > BgWorkerStart_ConsistentState, it will be started on non-standby as > > well. So to ensure that it exists on non-standby, "RecoveryInProgress" > > has been introduced at the beginning of the worker. But once it exits, > > postmaster will not restart it since it will be clean-exist i.e. > > proc_exit(0) (the restart logic of postmaster comes into play only > > when there is an abnormal exit). But to exit for the first time on > > non-standby, we need that Recovery related check in worker. > > > > b) "enable_syncslot" check is moved to slotsync worker now. 
Since > > enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to > > exit the worker if 'enable_syncslot' is found to be disabled. > > 'proc_exit(1)' has been used in order to ensure that the worker is > > restarted and GUCs are checked again after restart_time. Downside of > > this approach is, if someone has kept "enable_syncslot" as disabled > > permanently even on standby, slotsync worker will keep on restarting > > and exiting. > > > > So to overcome the above pain-points, I think a potential approach > > will be to start slotsync worker only if 'enable_syncslot' is on and > > the system is non-standby. > > That makes sense to me. > > > Potential ways (each with some issues) are: > > > > 1) Use the current way i.e. register slot-sync worker as bgworker with > > postmaster, but introduce extra checks in 'maybe_start_bgworkers'. But > > this seems more like a hack. This will need extra changes as currently > > once 'maybe_start_bgworkers' is attempted by postmaster, it will > > attempt again to start any worker only if the worker had abnormal exit > > and restart_time !=0. The current postmatser will not attempt to start > > worker on any GUC change. > > > > 2) Another way maybe to treat slotsync worker as special case and > > separate out the start/restart of slotsync worker from bgworker, and > > follow what we do for autovacuum launcher(StartAutoVacLauncher) to > > keep starting it in the postmaster loop(ServerLoop). In this way, we > > may be able to add more checks before starting worker. But by opting > > this approach, we will have to manage slotsync worker completely by > > ourself as it will be no longer be part of existing > > bgworker-registration infra. If this seems okay and there are no other > > better options, it can be analyzed further in detail. > > > > 3) Another approach could be, in order to solve issue (a), introduce a > > new start_time 'BgWorkerStart_ConsistentState_HotStandby' which means > > start a bgworker only if consistent state is reached and the system is > > standby. And for issue (b), lets retain check of enable_syncslot in > > the worker itself but make it 'PGC_POSTMASTER'. This will ensure we > > can safely exit the worker(proc_exit(0) if enable_syncslot is disabled > > and postmaster will not restart it. But I'm not sure if making it > > "PGC_POSTMASTER" is acceptable from the user's perspective. > > I had the same idea (means make enable_syncslot as 'PGC_POSTMASTER') > when reading b). I'm +1 on it (at least for V1) as I don't think that > this parameter value would change frequently. Curious to know what others > think too. > > Then as far a) is concerned, I'd vote for introducing a new > BgWorkerStart_ConsistentState_HotStandby. > +1 on PGC_POSTMASTER and BgWorkerStart_ConsistentState_HotStandby. A clean solution as compared to the rest of the approaches. Will implement it. thanks Shveta
On Tuesday, November 28, 2023 8:07 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: Hi, > On 11/27/23 9:57 AM, Zhijie Hou (Fujitsu) wrote: > > On Monday, November 27, 2023 4:51 PM shveta malik > <shveta.malik@gmail.com> wrote: > > > > Here is the updated version(v39_2) which include all the changes made in > 0002. > > Please use for review, and sorry for the confusion. > > > > Thanks! > > As far v39_2-0001: > > " > Altering the failover option of the subscription is currently not > permitted. However, this restriction may be lifted in future versions. > " > > Should we mention that we can alter the related replication slot? Will add. > > + <para> > + The implementation of failover requires that replication > + has successfully finished the initial table synchronization > + phase. So even when <literal>failover</literal> is enabled for a > + subscription, the internal failover state remains > + temporarily <quote>pending</quote> until the initialization > phase > + completes. See column > <structfield>subfailoverstate</structfield> > + of <link > linkend="catalog-pg-subscription"><structname>pg_subscription</structna > me></link> > + to know the actual failover state. > + </para> > > I think we have a corner case here. If one alter the replication slot on the > primary then "subfailoverstate" is not updated accordingly on the subscriber. > Given the 2 remarks above would that make sense to prevent altering a > replication slot associated to a subscription? Thanks for the review! I think we could not distinguish the user created logical slot or subscriber created slot as there is no related info in slot's data. And user could change the slot on subscription by "alter sub set (slot_name)", so maintaining this info would need some efforts. Besides, I think this case overlaps the previous discussed "alter sub set (slot_name)" issue[1]. Both the cases are because the slot's failover is different from the subscription's failover setting. I think we could handle them similarly that user need to take care of not changing the failover to wrong value. Or do you prefer another approach that mentioned in that thread[1] ? (always alter the slot at the startup of apply worker). [1] https://www.postgresql.org/message-id/564b195a-180c-42e9-902b-b1a8b50218ee%40gmail.com Best Regards, Hou zj
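For reference, a small sketch of how a subscriber could inspect the failover state being discussed, assuming the subfailoverstate column proposed by this patch; the single-letter codes in the comment are an assumption modeled on subtwophasestate, not confirmed by the patch text:

    -- On the subscriber: check whether failover is still pending ('p' assumed)
    -- or enabled ('e' assumed) for each subscription.
    SELECT subname, subfailoverstate FROM pg_subscription;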
On Tuesday, November 28, 2023 11:58 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 11/28/23 10:40 AM, shveta malik wrote: > > On Tue, Nov 28, 2023 at 12:19 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> Hi, > >> > >> On 11/28/23 4:13 AM, shveta malik wrote: > >>> On Mon, Nov 27, 2023 at 4:08 PM Amit Kapila > <amit.kapila16@gmail.com> wrote: > >>>> > >>>> On Mon, Nov 27, 2023 at 2:27 PM Zhijie Hou (Fujitsu) > >>>> <houzj.fnst@fujitsu.com> wrote: > >>>>> > >>>>> Here is the updated version(v39_2) which include all the changes made > in 0002. > >>>>> Please use for review, and sorry for the confusion. > >>>>> > >>>> > >>>> --- a/src/backend/replication/logical/launcher.c > >>>> +++ b/src/backend/replication/logical/launcher.c > >>>> @@ -8,20 +8,27 @@ > >>>> * src/backend/replication/logical/launcher.c > >>>> * > >>>> * NOTES > >>>> - * This module contains the logical replication worker launcher which > >>>> - * uses the background worker infrastructure to start the logical > >>>> - * replication workers for every enabled subscription. > >>>> + * This module contains the replication worker launcher which > >>>> + * uses the background worker infrastructure to: > >>>> + * a) start the logical replication workers for every enabled > subscription > >>>> + * when not in standby_mode. > >>>> + * b) start the slot sync worker for logical failover slots > synchronization > >>>> + * from the primary server when in standby_mode. > >>>> > >>>> I was wondering do we really need a launcher on standby to invoke > >>>> sync-slot worker. If so, why? I guess it may be required for > >>>> previous versions where we were managing work for multiple > >>>> slot-sync workers which is also questionable in the sense of > >>>> whether launcher is the right candidate for the same but now with > >>>> the single slot-sync worker, it doesn't seem worth having it. What do you > think? > >>>> > >>>> -- > >>> > >>> Yes, earlier a manager process was needed to manage multiple > >>> slot-sync workers and distribute load among them, but now that does > >>> not seem necessary. I gave it a try (PoC) and it seems to work well. > >>> If there are no objections to this approach, I can share the patch soon. > >>> > >> > >> +1 on this new approach, thanks! > > > > PFA v40. This patch has removed Logical Replication Launcher support > > to launch slotsync worker. > > Thanks! > > > The slot-sync worker is now registered as bgworker with postmaster, > > with bgw_start_time=BgWorkerStart_ConsistentState and > > bgw_restart_time=60sec. > > > > On removal of launcher, now all the validity checks have been shifted > > to slot-sync worker itself. This brings us to some point of concerns: > > > > a) We still need to maintain RecoveryInProgress() check in slotsync > > worker. Since worker has the start time of > > BgWorkerStart_ConsistentState, it will be started on non-standby as > > well. So to ensure that it exists on non-standby, "RecoveryInProgress" > > has been introduced at the beginning of the worker. But once it exits, > > postmaster will not restart it since it will be clean-exist i.e. > > proc_exit(0) (the restart logic of postmaster comes into play only > > when there is an abnormal exit). But to exit for the first time on > > non-standby, we need that Recovery related check in worker. > > > > b) "enable_syncslot" check is moved to slotsync worker now. 
Since > > enable_syncslot is PGC_SIGHUP, so proc_exit(1) is currently used to > > exit the worker if 'enable_syncslot' is found to be disabled. > > 'proc_exit(1)' has been used in order to ensure that the worker is > > restarted and GUCs are checked again after restart_time. Downside of > > this approach is, if someone has kept "enable_syncslot" as disabled > > permanently even on standby, slotsync worker will keep on restarting > > and exiting. > > > > So to overcome the above pain-points, I think a potential approach > > will be to start slotsync worker only if 'enable_syncslot' is on and > > the system is non-standby. > > That makes sense to me. > > > Potential ways (each with some issues) are: > > > > 1) Use the current way i.e. register slot-sync worker as bgworker with > > postmaster, but introduce extra checks in 'maybe_start_bgworkers'. But > > this seems more like a hack. This will need extra changes as currently > > once 'maybe_start_bgworkers' is attempted by postmaster, it will > > attempt again to start any worker only if the worker had abnormal exit > > and restart_time !=0. The current postmatser will not attempt to start > > worker on any GUC change. > > > > 2) Another way maybe to treat slotsync worker as special case and > > separate out the start/restart of slotsync worker from bgworker, and > > follow what we do for autovacuum launcher(StartAutoVacLauncher) to > > keep starting it in the postmaster loop(ServerLoop). In this way, we > > may be able to add more checks before starting worker. But by opting > > this approach, we will have to manage slotsync worker completely by > > ourself as it will be no longer be part of existing > > bgworker-registration infra. If this seems okay and there are no other > > better options, it can be analyzed further in detail. > > > > 3) Another approach could be, in order to solve issue (a), introduce a > > new start_time 'BgWorkerStart_ConsistentState_HotStandby' which means > > start a bgworker only if consistent state is reached and the system is > > standby. And for issue (b), lets retain check of enable_syncslot in > > the worker itself but make it 'PGC_POSTMASTER'. This will ensure we > > can safely exit the worker(proc_exit(0) if enable_syncslot is disabled > > and postmaster will not restart it. But I'm not sure if making it > > "PGC_POSTMASTER" is acceptable from the user's perspective. > > I had the same idea (means make enable_syncslot as 'PGC_POSTMASTER') > when reading b). I'm +1 on it (at least for V1) as I don't think that this parameter > value would change frequently. Curious to know what others think too. > > Then as far a) is concerned, I'd vote for introducing a new > BgWorkerStart_ConsistentState_HotStandby. Here is the V41 patch set which includes the following changes. V41-0001: 1) Based on the discussion[1], I update the document to remind user to change the slot's failover option when ALTER SUBSCRIPTION ... SET (slot_name = xx). 2) Address comments in [2][3][4]. V41-0002: 1) 'enable_syncslot' is changed from PGC_SIGHUP to PGC_POSTMASTER, slot-sync worker will now clean exit (proc_exit(0)) if enable_syncslot is found disabled. 2) BgWorkerStart_ConsistentState_HotStandby is introduced as new start-time for bgworker. This will start worker only if it is standby_mode and consistent state is reached. 3) 'SYNCSLOT_STATE_INITIATED' is now set in 'ReplicationSlotCreate' itself in slot-sync worker case. 
Earlier it was set at a later point in time, leaving a window during which even a synced slot stayed in the 'n' state for quite some time; that was not correct. Thanks Shveta for working on the V41-0002. [1] https://www.postgresql.org/message-id/CAA4eK1Jd9dk%3D5POTKM9p4EyYqYzLXe-AnLzHrUELjzZScLz7mw%40mail.gmail.com [2] https://www.postgresql.org/message-id/eb09f682-db82-41cd-93bc-5d44e10e1d6d%40gmail.com [3] https://www.postgresql.org/message-id/CAHut%2BPsuSWjm7U_sVnL8FXZ7ZQcfCcT44kAK7i6qMG35Cwjy3A%40mail.gmail.com [4] https://www.postgresql.org/message-id/CAFPTHDbFqLgXS6Et%2BshNGPDjCKK66C%2BZSarqFHmQvfnAah3Qsw%40mail.gmail.com Best Regards, Hou zj
Attachment
On Wednesday, November 29, 2023 11:04 AM Peter Smith <smithpb2250@gmail.com> wrote: Thanks for the comments. > ====== > 1. General. > > Previously (see [1] #0) I asked a question about if there is some documentation > missing. Seems not yet answered. The document was add in V39-0002 in logicaldecoding.sgml because some necessary GUCs for slotsync are not in 0001. > > ====== > Commit message > > 2. > Users can set this flag during CREATE SUBSCRIPTION or during > pg_create_logical_replication_slot API. Examples: > > CREATE SUBSCRIPTION mysub CONNECTION '..' PUBLICATION mypub WITH > (failover = true); > > (failover is the last arg) > SELECT * FROM pg_create_logical_replication_slot('myslot', > 'pgoutput', false, true, true); > > ~ > > I felt it is better to say "Ex1" / "Ex2" (or "1" / "2" or something > similar) to indicate better where these examples start and finish, otherwise they > just sort of get lost among the text. Changed. > > ====== > doc/src/sgml/catalogs.sgml > > 3. > From previous review ([1] #6) I suggested reordering fields. Hous-san > wrote: "but I think the functionality of two fields are different and I didn’t find > the correlation between two fields except for the type of value." > > Yes, that is true. OTOH, I felt grouping the attributes by the same types made > the docs easier to read. The document's order should be same as the pg_subscription catalog, and I prefer not to move the new subfailoverstate in the middle of catalog as the functionality of them is different. > > ====== > src/backend/commands/subscriptioncmds.c > > 4. CreateSubscription > > + /* > + * If only the slot_name is specified (without create_slot option), > + * it is possible that the user intends to use an existing slot on > + * the publisher, so here we enable failover for the slot if > + * requested. > + */ > + else if (opts.slot_name && failover_enabled) { > > Unanswered question from previous review (see [1] #11a). i.e. How does this > condition ensure that *only* the slot name was specified (like the comment is > saying)? It is the else part of 'if (opts.create_slot)', so it means create_slot is not specified while only slot_name is specified. I have improved the comment. > > ~~~ > > 5. AlterSubscription > > errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed > when two_phase is enabled"), > errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = false, > or with copy_data = false, or use DROP/CREATE SUBSCRIPTION."))); > > + if (sub->failoverstate == LOGICALREP_FAILOVER_STATE_ENABLED && > + opts.copy_data) ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed > when failover is enabled"), > + errhint("Use ALTER SUBSCRIPTION ... SET PUBLICATION with refresh = > false, or with copy_data = false, or use DROP/CREATE SUBSCRIPTION."))); > + > > There are translations issues same as reported in my previous review (see [1] > #12b and also several other places as noted in [1]). Hou-san replied that I "can > start a separate thread to change the twophase related messages, and we can > change accordingly if it's accepted.", but that's not right IMO because it is only > the fact that this sysncslot patch is reusing a similar message that warrants the > need to extract a "common" message part in the first place. So I think it is > responsibility if this sycslot patch to make this change. OK, changed. > > ====== > src/backend/replication/logical/tablesync.c > > 6. 
process_syncing_tables_for_apply > > + if (MySubscription->twophasestate == > + LOGICALREP_TWOPHASE_STATE_PENDING) > + ereport(LOG, > + (errmsg("logical replication apply worker for subscription \"%s\" > will restart so that two_phase can be enabled", > + MySubscription->name))); > + > + if (MySubscription->failoverstate == > + LOGICALREP_FAILOVER_STATE_PENDING) > + ereport(LOG, > + (errmsg("logical replication apply worker for subscription \"%s\" > will restart so that failover can be enabled", > + MySubscription->name))); > > 6a. > You may end up log 2 restart messages for the same restart. Is it OK? I think it's OK as it can provide complete information. > > ~ > > 6b. > This is another example where you should share the same common message > (for less translations) I adjusted the message there. > > ====== > src/backend/replication/logical/worker.c > > 7. > + * The logical slot on the primary can be synced to the standby by > + specifying > + * the failover = true when creating the subscription. Enabling > + failover allows > + * us to smoothly transition to the standby in case the primary gets > + promoted, > + * ensuring that we can subscribe to the new primary without losing any data. > > /the failover = true/the failover = true option/ > > or > > /the failover = true/failover = true/ > Changed. > ~~~ > > 8. > + > #include "postgres.h" > > Unnecessary extra blank line Removed. > > ====== > src/backend/replication/slot.c > > 9. validate_standby_slots > > There was no reply to the comment in my previous review (see [1] #27). > Maybe you disagree or maybe accidentally overlooked? The error message has already been adjusted in V39. I adjusted the check in this version as well to be consistent. > > ~~~ > > 10. check_standby_slot_names > > In previous review I asked ([1] #28) why a special check was needed for "*". > Hou-san replied that "SplitIdentifierString() does not give error for '*' and '*' > can be considered as valid value which if accepted can mislead user". > > Sure, but won't the code then just try to find if there is a replication slot called > "*" and that will fail. That was my point, if the slot name lookup is going to fail > anyway then why have the extra code for the special "*" case up-front? Note -- > I haven't tried it, so maybe code doesn't work like I think it does. I think allowing "*" can mislead user because it normally means every slot, but we don't want to support the "every slot" option as mentioned in the comment. So I think reject it here is fine. Reporting ERROR because the slot named '*' was not there may look confusing. > > ====== > src/backend/replication/walsender.c > > 11. PhysicalWakeupLogicalWalSnd > > No reply to my previous review comment ([1] #33). Not done? Disagreed, or > accidentally missed? The function mentioned in your previous comment has been removed in previous version, so I am not sure are you pointing to some other codes that has similar issues ? > > ~~~ > > 12. WalSndFilterStandbySlots > > + /* > + * If logical slot name is given in standby_slot_names, give WARNING > + * and skip it. Since it is harmless, so WARNING should be enough, no > + * need to error-out. > + */ > + else if (SlotIsLogical(slot)) > + warningfmt = _("cannot have logical replication slot \"%s\" in > parameter \"%s\", ignoring"); > > I previously raised an issue (see [1] #35) thinking this could not happen. 
> Hou-san explained how it might happen ("user could drop the logical slot and > recreate a physical slot with the same name without changing the GUC.") so this > code was necessary. That is OK, but I think your same explanation in the code > commen. OK, I have added comments here. > > ~~~ > > 13. WalSndFilterStandbySlots > > + standby_slots_cpy = foreach_delete_current(standby_slots_cpy, lc); > > I previously raised issue (see [1] #36). Hou-san replied "I think it's OK to remove > slots if it's invalidated, dropped, or was changed to logical one as we don't > need to wait for these slots to catch up anymore." > > Sure, maybe code is fine, but my point was that the code is removing elements > *more* scenarios than are mentioned by the function comment, so maybe > update that function comment for all the removal scenarios. Updated the comments. > > ~~~ > > 14. WalSndWaitForStandbyConfirmation > > The comment change from my previous review ([1] #37) not done. > Disagreed, or accidentally missed? Thanks for pointing, this was missed. > > ~~~ > > 15. WalSndWaitForStandbyConfirmation > > The question about calling ConditionVariablePrepareToSleep in my previous > review ([1] #39) not answered. Accidentally missed? I think V39 has already adjusted the order of reload and NIL check in this function. > > ~~~ > > 16. WalSndWaitForWal > > if (RecentFlushPtr != InvalidXLogRecPtr && > loc <= RecentFlushPtr) > - return RecentFlushPtr; > + { > + WalSndFilterStandbySlots(RecentFlushPtr, &standby_slots); > > It is better to use XLogRecPtrIsInvalid macro here. I know it was not strictly > added by your patch, but so much else changed nearby so I thought this should > be fixed at the same time. Changed. > > ====== > src/bin/pg_upgrade/info.c > > 17. get_old_cluster_logical_slot_infos > > + > slotinfos = (LogicalSlotInfo *) pg_malloc(sizeof(LogicalSlotInfo) * num_slots); > > Excessive whitespace. Removed. Best Regards, Hou zj
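Pulling together the user-facing pieces touched on in this review, here is a hedged sketch using the syntax quoted from the commit message and the standby_slot_names GUC proposed by the patch (none of this exists in released PostgreSQL; the position of the failover flag follows the commit message example, and whether standby_slot_names is reloadable is an assumption based on the reload handling discussed for WalSndWaitForStandbyConfirmation):

    -- Ex1: enable failover when creating the subscription.
    CREATE SUBSCRIPTION mysub CONNECTION 'host=primary dbname=postgres'
        PUBLICATION mypub WITH (failover = true);

    -- Ex2: enable failover when creating the slot directly; the last argument
    -- is the proposed failover flag.
    SELECT * FROM pg_create_logical_replication_slot('myslot', 'pgoutput', false, true, true);

    -- On the primary: make failover-enabled walsenders wait for these physical
    -- slots; a logical slot name listed here only draws a WARNING and is ignored.
    ALTER SYSTEM SET standby_slot_names = 'standby1_phys';
    SELECT pg_reload_conf();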
On Thursday, November 23, 2023 6:06 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Tue, Nov 21, 2023 at 8:32 PM shveta malik <shveta.malik@gmail.com> > wrote: > > > > v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd, > > rebased the patches. PFA v37_2 patches. > > Thanks for the patch. Some comments: Thanks for the comments. > subscriptioncmds.c: > CreateSubscription() > and tablesync.c: > process_syncing_tables_for_apply() > walrcv_create_slot(wrconn, opts.slot_name, false, > twophase_enabled, > - CRS_NOEXPORT_SNAPSHOT, NULL); > - > - if (twophase_enabled) > - UpdateTwoPhaseState(subid, > LOGICALREP_TWOPHASE_STATE_ENABLED); > - > + failover_enabled, > CRS_NOEXPORT_SNAPSHOT, NULL); > > either here or in libpqrcv_create_slot(), shouldn't you check the remote server > version if it supports the failover flag? I think we expect the create slot to fail if the server doesn't support failover. The same is true for two_phase option. > > > + > + /* > + * If only the slot_name is specified, it is possible > that the user intends to > + * use an existing slot on the publisher, so here we > enable failover for the > + * slot if requested. > + */ > + else if (opts.slot_name && failover_enabled) > + { > + walrcv_alter_slot(wrconn, opts.slot_name, opts.failover); > + ereport(NOTICE, > + (errmsg("enabled failover for replication > slot \"%s\" on publisher", > + opts.slot_name))); > + } > > Here, the code only alters the slot if failover = true. You could use "else if > (opts.slot_name && IsSet(opts.specified_opts, SUBOPT_FAILOVER)" to check if > the failover flag is specified and alter for failover=false as well. Adjusted. > Also, shouldn't > you check for the server version if the command ALTER_REPLICATION_SLOT is > supported? Similar to create_slot, we expect the command fail if the target server doesn't support failover the user specified failover = true. > > slot.c: > ReplicationSlotAlter() > > +void > +ReplicationSlotAlter(const char *name, bool failover) { > + Assert(MyReplicationSlot == NULL); > + > + ReplicationSlotAcquire(name, true); > + > + if (SlotIsPhysical(MyReplicationSlot)) > + ereport(ERROR, > + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot use %s with a physical replication slot", > + "ALTER_REPLICATION_SLOT")); > > shouldn't you release the slot by calling ReplicationSlotRelease before > erroring out? No, I think the release of the replication slot will be handled by either WalSndErrorCleanup, ReplicationSlotShmemExit, or the ReplicationSlotRelease call in PostgresMain(). > > slot.c: > +/* > + * A helper function to validate slots specified in standby_slot_names GUCs. > + */ > +static bool > +validate_standby_slots(char **newval) > +{ > + char *rawname; > + List *elemlist; > + ListCell *lc; > + > + /* Need a modifiable copy of string */ > + rawname = pstrdup(*newval); > > rawname is not always freed. The code has been changed due to other comments. > > launcher.c: > > + SlotSyncWorker->hdr.proc = MyProc; > + > + before_shmem_exit(slotsync_worker_detach, (Datum) 0); > + > + LWLockRelease(SlotSyncWorkerLock); > +} > > before_shmem_exit() can error out leaving the lock acquired. Maybe you > should release the lock prior to calling before_shmem_exit() because you don't > need the lock there. This has been fixed. Best Regards, Hou zj
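Since this review leans on the new ALTER_REPLICATION_SLOT replication command, here is a small sketch of exercising it directly over a replication connection, in the same style as the replication-connection example shown later in the thread (the command and its FAILOVER option come from this patch series; the slot name is made up):

    $ psql "dbname=postgres replication=database"
    postgres=# ALTER_REPLICATION_SLOT myslot (FAILOVER true);
    ALTER_REPLICATION_SLOT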
Hi, On 11/29/23 6:58 AM, Zhijie Hou (Fujitsu) wrote: > On Tuesday, November 28, 2023 8:07 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > >> On 11/27/23 9:57 AM, Zhijie Hou (Fujitsu) wrote: >>> On Monday, November 27, 2023 4:51 PM shveta malik >> <shveta.malik@gmail.com> wrote: >>> >>> Here is the updated version(v39_2) which include all the changes made in >> 0002. >>> Please use for review, and sorry for the confusion. >>> >> >> Thanks! >> >> As far v39_2-0001: >> >> " >> Altering the failover option of the subscription is currently not >> permitted. However, this restriction may be lifted in future versions. >> " >> >> Should we mention that we can alter the related replication slot? > > Will add. > >> >> + <para> >> + The implementation of failover requires that replication >> + has successfully finished the initial table synchronization >> + phase. So even when <literal>failover</literal> is enabled for a >> + subscription, the internal failover state remains >> + temporarily <quote>pending</quote> until the initialization >> phase >> + completes. See column >> <structfield>subfailoverstate</structfield> >> + of <link >> linkend="catalog-pg-subscription"><structname>pg_subscription</structna >> me></link> >> + to know the actual failover state. >> + </para> >> >> I think we have a corner case here. If one alter the replication slot on the >> primary then "subfailoverstate" is not updated accordingly on the subscriber. >> Given the 2 remarks above would that make sense to prevent altering a >> replication slot associated to a subscription? > > Thanks for the review! > > I think we could not distinguish the user created logical slot or subscriber > created slot as there is no related info in slot's data. Yeah that would need extra work. > And user could change > the slot on subscription by "alter sub set (slot_name)", so maintaining this info > would need some efforts. > Yes. > Besides, I think this case overlaps the previous discussed "alter sub set > (slot_name)" issue[1]. Both the cases are because the slot's failover is > different from the subscription's failover setting. Yeah agree. > I think we could handle > them similarly that user need to take care of not changing the failover to > wrong value. Or do you prefer another approach that mentioned in that thread[1] > ? (always alter the slot at the startup of apply worker). > I think I'm fine with documenting the fact that the user should not change the failover value. But if he does change it (because at the end nothing prevents it to do so) then I think the meaning of subfailoverstate should still make sense. One way to achieve this could be to change its meaning? Say rename it to say subfailovercreationstate (to reflect the fact that it was the state at the creation time) and change messages like: " ALTER SUBSCRIPTION with refresh and copy_data is not allowed when failover is enabled " to something like " ALTER SUBSCRIPTION with refresh and copy_data is not allowed for subscription created with failover enabled" " and change the doc accordingly. What do you think? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 11/29/23 3:58 AM, Amit Kapila wrote: > On Tue, Nov 28, 2023 at 2:17 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> What do you think about also adding a pg_alter_logical_replication_slot() or such >> function? >> >> That would allow users to alter manually created logical replication slots without >> the need to make a replication connection. >> > > But then won't that make it inconsistent with the subscription > failover state? Do you mean allowing one to use pg_alter_logical_replication_slot() on a slot linked to a subscription? (while we're saying / documenting to not alter such slots?) Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Here are some review comments for v41-0001. ====== doc/src/sgml/ref/alter_subscription.sgml 1. + <para> + When altering the + <link linkend="sql-createsubscription-params-with-slot-name"><literal>slot_name</literal></link>, + the <literal>failover</literal> property of the new slot may differ from the + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> + parameter specified in the subscription, you need to adjust the + <literal>failover</literal> property when creating the slot so that it + matches the value specified in subscription. + </para> For the second part a) it should be a separate sentence, and b) IMO you are not really "adjusting" something if you are "creating" it. SUGGESTION When altering the slot_name, the failover property of the new slot may differ from the failover parameter specified in the subscription. When creating the slot ensure the slot failover property matches the failover parameter value of the subscription. ====== src/backend/catalog/pg_subscription.c 2. AlterSubscription + if (sub->failoverstate == LOGICALREP_FAILOVER_STATE_ENABLED && opts.copy_data) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("ALTER SUBSCRIPTION with refresh and copy_data is not allowed when failover is enabled"), This is another example where the "two_phase" and "failover" should be extracted to make a common message for the translator. (Already posted this comment before -- see [1] #13) ~~~ 3. + /* + * See comments above for twophasestate, same holds true for + * 'failover' + */ + if (sub->failoverstate == LOGICALREP_FAILOVER_STATE_ENABLED && opts.copy_data) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("ALTER SUBSCRIPTION ... REFRESH with copy_data is not allowed when failover is enabled"), + errhint("Use ALTER SUBSCRIPTION ... REFRESH with copy_data = false, or use DROP/CREATE SUBSCRIPTION."))); This is another example where the "two_phase" and "failover" should be extracted to make a common message for the translator. (Already posted this comment before -- see [1] #14) ====== src/backend/replication/walsender.c 4. +/* + * Wake up logical walsenders with failover-enabled slots if the physical slot + * of the current walsender is specified in standby_slot_names GUC. + */ +void +PhysicalWakeupLogicalWalSnd(void) Is it better to refer to "walsender processes" being woken instead of just walsenders? e.g. "Wake up logical walsenders..." --> "Wake up the logical walsender processes..." ====== [1] v35-0001 review. https://www.postgresql.org/message-id/CAHut%2BPv-yu71ogj_hRi6cCtmD55bsyw7XTxj1Nq8yVFKpY3NDQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
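To illustrate the documentation wording suggested above, a hedged example of switching a subscription to a new slot while keeping the failover property consistent (slot and subscription names are made up; the fifth argument is the failover flag as proposed by the patch):

    -- On the publisher: create the replacement slot with failover enabled so it
    -- matches the subscription's failover setting.
    SELECT pg_create_logical_replication_slot('myslot_new', 'pgoutput', false, false, true);

    -- On the subscriber: point the subscription at the new slot.
    ALTER SUBSCRIPTION mysub SET (slot_name = 'myslot_new');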
On Wed, Nov 29, 2023 at 8:17 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > This has been fixed. > > Best Regards, > Hou zj Thanks for addressing my comments. Some comments from my testing of patch v41 1. In my opinion, the second message "aborting the wait...moving to the next slot" does not hold much value. There might not even be a "next slot", there might be just one slot. I think the first LOG is enough to indicate that the sync-slot is waiting as it repeats this log till the slot catches up. I know these messages hold great value for debugging but in production, "waiting..", "aborting the wait.." might not be as helpful, maybe change it to debug? 2023-11-30 05:13:49.811 EST [6115] LOG: waiting for remote slot "sub1" LSN (0/3047A90) and catalog xmin (745) to pass local slot LSN (0/3047AC8) and catalog xmin (745) 2023-11-30 05:13:57.909 EST [6115] LOG: aborting the wait for remote slot "sub1" and moving to the next slot, will attempt creating it again 2023-11-30 05:14:07.921 EST [6115] LOG: waiting for remote slot "sub1" LSN (0/3047A90) and catalog xmin (745) to pass local slot LSN (0/3047AC8) and catalog xmin (745) 2. If a slot on the standby is in the "i" state as it hasn't been synced and it was invalidated on the primary, should you continuously retry creating this invalidated slot on the standby? 2023-11-30 06:21:41.844 EST [10563] LOG: waiting for remote slot "sub1" LSN (0/0) and catalog xmin (785) to pass local slot LSN (0/EED9330) and catalog xmin (785) 2023-11-30 06:21:41.845 EST [10563] WARNING: slot "sub1" invalidated on the primary server, slot creation aborted 2023-11-30 06:21:51.892 EST [10563] LOG: waiting for remote slot "sub1" LSN (0/0) and catalog xmin (785) to pass local slot LSN (0/EED9330) and catalog xmin (785) 2023-11-30 06:21:51.893 EST [10563] WARNING: slot "sub1" invalidated on the primary server, slot creation aborted 3. If creation of a slot on the standby fails for one slot because a slot of the same name exists, then thereafter no new sync slots are created on standby. Is this expected? I do see that previously created slots are kept up to date, just that no new slots are created after that. regards, Ajin Cherian Fujitsu australia
On Thu, Nov 30, 2023 at 5:37 PM Ajin Cherian <itsajin@gmail.com> wrote: > > On Wed, Nov 29, 2023 at 8:17 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > This has been fixed. > > > > Best Regards, > > Hou zj > > Thanks for addressing my comments. Some comments from my testing of patch v41 > > 1. In my opinion, the second message "aborting the wait...moving to > the next slot" does not hold much value. There might not even be a > "next slot", there might be just one slot. I think the first LOG is > enough to indicate that the sync-slot is waiting as it repeats this > log till the slot catches up. I know these messages hold great value > for debugging but in production, "waiting..", "aborting the wait.." > might not be as helpful, maybe change it to debug? > > 2023-11-30 05:13:49.811 EST [6115] LOG: waiting for remote slot > "sub1" LSN (0/3047A90) and catalog xmin (745) to pass local slot LSN > (0/3047AC8) and catalog xmin (745) > 2023-11-30 05:13:57.909 EST [6115] LOG: aborting the wait for remote > slot "sub1" and moving to the next slot, will attempt creating it > again > 2023-11-30 05:14:07.921 EST [6115] LOG: waiting for remote slot > "sub1" LSN (0/3047A90) and catalog xmin (745) to pass local slot LSN > (0/3047AC8) and catalog xmin (745) > Sure, the message can be trimmed down. But I am not very sure if we should convert it to DEBUG. It might be useful to know what exactly is happening with this slot through the log file.Curious to know what others think here? > > 2. If a slot on the standby is in the "i" state as it hasn't been > synced and it was invalidated on the primary, should you continuously > retry creating this invalidated slot on the standby? > > 2023-11-30 06:21:41.844 EST [10563] LOG: waiting for remote slot > "sub1" LSN (0/0) and catalog xmin (785) to pass local slot LSN > (0/EED9330) and catalog xmin (785) > 2023-11-30 06:21:41.845 EST [10563] WARNING: slot "sub1" invalidated > on the primary server, slot creation aborted > 2023-11-30 06:21:51.892 EST [10563] LOG: waiting for remote slot > "sub1" LSN (0/0) and catalog xmin (785) to pass local slot LSN > (0/EED9330) and catalog xmin (785) > 2023-11-30 06:21:51.893 EST [10563] WARNING: slot "sub1" invalidated > on the primary server, slot creation aborted > No, it should not be synced after that. It should be marked as invalidated and skipped. And perhaps the state should also be moved to 'r' as we are done with it; but since it is invalidated, it will not be used further even if 'r'. > 3. If creation of a slot on the standby fails for one slot because a > slot of the same name exists, then thereafter no new sync slots are > created on standby. Is this expected? I do see that previously created > slots are kept up to date, just that no new slots are created after > that. > yes, it is done so as per the suggestion/discussion in [1]. It is done so that users can catch this issue at the earliest. [1]: https://www.postgresql.org/message-id/CAA4eK1J5D-Z7dFa89acf7O%2BCa6Y9bygTpi52KAKVCg%2BPE%2BZfog%40mail.gmail.com thanks Shveta
On Wednesday, November 29, 2023 5:12 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: I was reviewing the slotsync worker design and here are a few comments on the 0002 patch: 1. + if (!WalRcv || + (WalRcv->slotname[0] == '\0') || + XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) I think we'd better take the spinlock when accessing these shared memory fields. 2. /* * The slot sync feature itself is disabled, exit. */ if (!enable_syncslot) { ereport(LOG, errmsg("exiting slot sync worker as enable_syncslot is disabled.")); Can we check the GUC when registering the worker (SlotSyncWorkerRegister), so that the worker won't be started if enable_syncslot is false? 3. In synchronize_one_slot, do we need to skip the slot sync and drop if the local slot is a physical one? 4. *locally_invalidated = (remote_slot->invalidated == RS_INVAL_NONE) && (local_slot->data.invalidated != RS_INVAL_NONE); When reading the invalidated flag of the local slot, I think we'd better take the spinlock. Best Regards, Hou zj
On Fri, Dec 1, 2023 at 9:40 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Wednesday, November 29, 2023 5:12 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > I was reviewing slotsync worker design and here > are few comments on 0002 patch: Thanks for reviewing the patch. > > 1. > > + if (!WalRcv || > + (WalRcv->slotname[0] == '\0') || > + XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) > > I think we'd better take spinlock when accessing these shared memory fields. > > 2. > > /* > * The slot sync feature itself is disabled, exit. > */ > if (!enable_syncslot) > { > ereport(LOG, > errmsg("exiting slot sync worker as enable_syncslot is disabled.")); > > Can we check the GUC when registering the worker(SlotSyncWorkerRegister), > so that the worker won't be started if enable_syncslot is false. > > 3. In synchronize_one_slot, do we need to skip the slot sync and drop if the > local slot is a physical one ? > IMO, if a local slot exists which is a physical one, it will be a user created slot and in that case worker will error out on finding existing slot with same name. And the case where local slot is physical one but not user-created is not possible on standby (assuming we have correct check on primary disallowing setting 'failover' property for physical slot). Do you have some other scenario in mind, which I am missing here? > 4. > > *locally_invalidated = > (remote_slot->invalidated == RS_INVAL_NONE) && > (local_slot->data.invalidated != RS_INVAL_NONE); > > When reading the invalidated flag of local slot, I think we'd better take > spinlock. > > Best Regards, > Hou zj
On Friday, December 1, 2023 12:51 PM shveta malik <shveta.malik@gmail.com> wrote: Hi, > > On Fri, Dec 1, 2023 at 9:40 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Wednesday, November 29, 2023 5:12 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > I was reviewing slotsync worker design and here > > are few comments on 0002 patch: > > Thanks for reviewing the patch. > > > > > > > > 3. In synchronize_one_slot, do we need to skip the slot sync and drop if the > > local slot is a physical one ? > > > > IMO, if a local slot exists which is a physical one, it will be a user > created slot and in that case worker will error out on finding > existing slot with same name. And the case where local slot is > physical one but not user-created is not possible on standby (assuming > we have correct check on primary disallowing setting 'failover' > property for physical slot). Do you have some other scenario in mind, > which I am missing here? I was thinking about the race condition where the worker has confirmed that the slot is not a user-created one and entered the "sync_state == SYNCSLOT_STATE_READY" branch, but at this moment someone uses "DROP_REPLICATION_SLOT" to drop this slot and recreate another one (e.g. a physical one); the slotsync worker would then overwrite the fields of this physical slot. Although this affects user-created logical slots in similar cases as well. And the same is true for slotsync_drop_initiated_slots() and drop_obsolete_slots(): as we don't lock the slots in the list, if a user tries to drop and re-create an old slot concurrently, then we could drop a user-created slot here. Best Regards, Hou zj
On Fri, Dec 1, 2023 at 11:17 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, December 1, 2023 12:51 PM shveta malik <shveta.malik@gmail.com> wrote: > > Hi, > > > > > On Fri, Dec 1, 2023 at 9:40 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Wednesday, November 29, 2023 5:12 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > I was reviewing slotsync worker design and here > > > are few comments on 0002 patch: > > > > Thanks for reviewing the patch. > > > > > > > > > > > 3. In synchronize_one_slot, do we need to skip the slot sync and drop if the > > > local slot is a physical one ? > > > > > > > IMO, if a local slot exists which is a physical one, it will be a user > > created slot and in that case worker will error out on finding > > existing slot with same name. And the case where local slot is > > physical one but not user-created is not possible on standby (assuming > > we have correct check on primary disallowing setting 'failover' > > property for physical slot). Do you have some other scenario in mind, > > which I am missing here? > > I was thinking about the race condition when it has confirmed that the slot is > not a user created one and enter "sync_state == SYNCSLOT_STATE_READY" branch, > but at this moment, if someone uses "DROP_REPLICATION_SLOT" to drop this slot and > recreate another one(e.g. a physical one), then the slotsync worker will > overwrite the fields of this physical slot. Although this affects user created > logical slots in similar cases as well. > User can not drop the synced slots on standby. It should result in ERROR. Currently we emit this error in pg_drop_replication_slot(), same is needed in "DROP_REPLICATION_SLOT" replication cmd. I will change it. Thanks for raising this point. I think, after this ERROR, there is no need to worry about physical slots handling in synchronize_one_slot(). > And the same is true for slotsync_drop_initiated_slots() and > drop_obsolete_slots(), as we don't lock the slots in the list, if user tri to > drop and re-create old slot concurrently, then we could drop user created slot > here. >
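A quick sketch of the behavior described above, i.e. that slots synchronized from the primary cannot be dropped on the standby; the SQL-level function is said to already error out, while the same protection is to be added to the DROP_REPLICATION_SLOT replication command (the exact error wording is an assumption):

    -- On the standby, against a slot that was synchronized from the primary:
    SELECT pg_drop_replication_slot('sub1');
    -- expected to fail with an ERROR rather than dropping the slot.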
On Fri, Dec 1, 2023 at 12:47 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Dec 1, 2023 at 11:17 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Friday, December 1, 2023 12:51 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > Hi, > > > > > > > > On Fri, Dec 1, 2023 at 9:40 AM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > On Wednesday, November 29, 2023 5:12 PM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > I was reviewing slotsync worker design and here > > > > are few comments on 0002 patch: > > > > > > Thanks for reviewing the patch. > > > > > > > > > > > > > > > 3. In synchronize_one_slot, do we need to skip the slot sync and drop if the > > > > local slot is a physical one ? > > > > > > > > > > IMO, if a local slot exists which is a physical one, it will be a user > > > created slot and in that case worker will error out on finding > > > existing slot with same name. And the case where local slot is > > > physical one but not user-created is not possible on standby (assuming > > > we have correct check on primary disallowing setting 'failover' > > > property for physical slot). Do you have some other scenario in mind, > > > which I am missing here? > > > > I was thinking about the race condition when it has confirmed that the slot is > > not a user created one and enter "sync_state == SYNCSLOT_STATE_READY" branch, > > but at this moment, if someone uses "DROP_REPLICATION_SLOT" to drop this slot and > > recreate another one(e.g. a physical one), then the slotsync worker will > > overwrite the fields of this physical slot. Although this affects user created > > logical slots in similar cases as well. > > > > User can not drop the synced slots on standby. It should result in > ERROR. Currently we emit this error in pg_drop_replication_slot(), > same is needed in "DROP_REPLICATION_SLOT" replication cmd. I will > change it. Thanks for raising this point. I think, after this ERROR, > there is no need to worry about physical slots handling in > synchronize_one_slot(). > > > And the same is true for slotsync_drop_initiated_slots() and > > drop_obsolete_slots(), as we don't lock the slots in the list, if user tri to > > drop and re-create old slot concurrently, then we could drop user created slot > > here. > > PFA v42. Changes: v42-0001: addressed comments in [1]. Thanks Hou-San for working on this. v42-0002: addressed comments in [2] and [3] [1]: https://www.postgresql.org/message-id/CAHut%2BPsMTvrwUBtcHff0CG_j-ALSuEta8xC1R_k0kjR%2B9A6ehg%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAFPTHDb8LW4i9-nyvz%2BXVkJmmciZwYGivpH%3DaDOrDkBfHR_q9w%40mail.gmail.com [3]: https://www.postgresql.org/message-id/OS0PR01MB571678BABEDBE830062CAB119481A%40OS0PR01MB5716.jpnprd01.prod.outlook.com thanks Shveta
Attachment
Hi, On 11/30/23 1:06 PM, Ajin Cherian wrote: > On Wed, Nov 29, 2023 at 8:17 PM Zhijie Hou (Fujitsu) > > 3. If creation of a slot on the standby fails for one slot because a > slot of the same name exists, then thereafter no new sync slots are > created on standby. Is this expected? I do see that previously created > slots are kept up to date, just that no new slots are created after > that. Yes this is the expected behavior as per discussion in [1]. Does this behavior make sense to you? [1]: https://www.postgresql.org/message-id/dd9dbbaf-ca77-423a-8d62-bfc814626b47%40gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Dec 1, 2023 at 3:43 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 11/30/23 1:06 PM, Ajin Cherian wrote: > > On Wed, Nov 29, 2023 at 8:17 PM Zhijie Hou (Fujitsu) > > > > 3. If creation of a slot on the standby fails for one slot because a > > slot of the same name exists, then thereafter no new sync slots are > > created on standby. Is this expected? I do see that previously created > > slots are kept up to date, just that no new slots are created after > > that. > > Yes this is the expected behavior as per discussion in [1]. > Does this behavior make sense to you? > Not completely. Whether slots get synced in this case seems to depend on their order. Every time the worker restarts after the error (assuming the user has not taken corrective action yet), it will successfully sync the slots prior to the problematic one, while leaving the ones after it un-synced. I need a little more clarity on how the user is supposed to learn that the slot-sync worker (or any background worker, for that matter) has errored out. Is it only from the log file, or are there other mechanisms for this? I mean, does an ERROR have a better chance of catching the user's attention than a WARNING in the context of a background worker? I feel we can give this a second thought and see whether it is more appropriate to keep syncing the rest of the slots and skip the duplicate-name one. thanks Shveta
Hi, On 12/1/23 4:19 AM, shveta malik wrote: > On Thu, Nov 30, 2023 at 5:37 PM Ajin Cherian <itsajin@gmail.com> wrote: >> >> >> 1. In my opinion, the second message "aborting the wait...moving to >> the next slot" does not hold much value. There might not even be a >> "next slot", there might be just one slot. I think the first LOG is >> enough to indicate that the sync-slot is waiting as it repeats this >> log till the slot catches up. I know these messages hold great value >> for debugging but in production, "waiting..", "aborting the wait.." >> might not be as helpful, maybe change it to debug? >> >> 2023-11-30 05:13:49.811 EST [6115] LOG: waiting for remote slot >> "sub1" LSN (0/3047A90) and catalog xmin (745) to pass local slot LSN >> (0/3047AC8) and catalog xmin (745) >> 2023-11-30 05:13:57.909 EST [6115] LOG: aborting the wait for remote >> slot "sub1" and moving to the next slot, will attempt creating it >> again >> 2023-11-30 05:14:07.921 EST [6115] LOG: waiting for remote slot >> "sub1" LSN (0/3047A90) and catalog xmin (745) to pass local slot LSN >> (0/3047AC8) and catalog xmin (745) >> > > Sure, the message can be trimmed down. But I am not very sure if we > should convert it to DEBUG. It might be useful to know what exactly is > happening with this slot through the log file.Curious to know what > others think here? > I think LOG is fine for the "waiting" one but I'd be tempted to put part of the message in errdetail(). I think we could get rid of the "aborting" message (or move it to DEBUG). I mean if one does not see the "newly locally created slot" message then I think it's enough to guess the wait has been aborted or that it is still waiting. But that's probably just a matter of taste. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Nov 29, 2023 at 3:24 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 11/29/23 6:58 AM, Zhijie Hou (Fujitsu) wrote: > > On Tuesday, November 28, 2023 8:07 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > >> On 11/27/23 9:57 AM, Zhijie Hou (Fujitsu) wrote: > >>> On Monday, November 27, 2023 4:51 PM shveta malik > >> <shveta.malik@gmail.com> wrote: > >>> > >>> Here is the updated version(v39_2) which include all the changes made in > >> 0002. > >>> Please use for review, and sorry for the confusion. > >>> > >> > >> Thanks! > >> > >> As far v39_2-0001: > >> > >> " > >> Altering the failover option of the subscription is currently not > >> permitted. However, this restriction may be lifted in future versions. > >> " > >> > >> Should we mention that we can alter the related replication slot? > > > > Will add. > > > >> > >> + <para> > >> + The implementation of failover requires that replication > >> + has successfully finished the initial table synchronization > >> + phase. So even when <literal>failover</literal> is enabled for a > >> + subscription, the internal failover state remains > >> + temporarily <quote>pending</quote> until the initialization > >> phase > >> + completes. See column > >> <structfield>subfailoverstate</structfield> > >> + of <link > >> linkend="catalog-pg-subscription"><structname>pg_subscription</structna > >> me></link> > >> + to know the actual failover state. > >> + </para> > >> > >> I think we have a corner case here. If one alter the replication slot on the > >> primary then "subfailoverstate" is not updated accordingly on the subscriber. > >> Given the 2 remarks above would that make sense to prevent altering a > >> replication slot associated to a subscription? > > > > Thanks for the review! > > > > I think we could not distinguish the user created logical slot or subscriber > > created slot as there is no related info in slot's data. > > Yeah that would need extra work. > > > And user could change > > the slot on subscription by "alter sub set (slot_name)", so maintaining this info > > would need some efforts. > > > > Yes. > > > Besides, I think this case overlaps the previous discussed "alter sub set > > (slot_name)" issue[1]. Both the cases are because the slot's failover is > > different from the subscription's failover setting. > > Yeah agree. > > > I think we could handle > > them similarly that user need to take care of not changing the failover to > > wrong value. Or do you prefer another approach that mentioned in that thread[1] > > ? (always alter the slot at the startup of apply worker). > > > > I think I'm fine with documenting the fact that the user should not change the failover > value. But if he does change it (because at the end nothing prevents it to do so) then > I think the meaning of subfailoverstate should still make sense. > How user can change the slot's failover property? Do we provide any command for it? -- With Regards, Amit Kapila.
Hi, On 12/1/23 12:06 PM, Amit Kapila wrote: > On Wed, Nov 29, 2023 at 3:24 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> I think I'm fine with documenting the fact that the user should not change the failover >> value. But if he does change it (because at the end nothing prevents it to do so) then >> I think the meaning of subfailoverstate should still make sense. >> > > How user can change the slot's failover property? Do we provide any > command for it? It's doable, using a replication connection: " $ psql replication=database psql (17devel) Type "help" for help. postgres=# ALTER_REPLICATION_SLOT logical_slot6 (FAILOVER false); ALTER_REPLICATION_SLOT " Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Review for v41 patch. 1. ====== src/backend/utils/misc/postgresql.conf.sample +#enable_syncslot = on # enables slot synchronization on the physical standby from the primary enable_syncslot is disabled by default, so, it should be 'off' here. ~~~ 2. IIUC, the slotsyncworker's connection to the primary is to execute a query. Its aim is not walsender type connection, but at primary when queried, the 'backend_type' is set to 'walsender'. Snippet from primary db- datname | usename | application_name | wait_event_type | backend_type ---------+-------------+------------------+-----------------+-------------- postgres | replication | slotsyncworker | Client | walsender Is it okay? ~~~ 3. As per current logic, If there are slots on primary with disabled subscriptions, then, when standby is created it replicates these slots but can't make them sync-ready until any activity happens on the slots. So, such slots stay in 'i' sync-state and get dropped when failover happens. Now, if the subscriber tries to enable their existing subscription after failover, it gives an error that the slot does not exist. ~~~ 4. primary_slot_name GUC value test: When standby is started with a non-existing primary_slot_name, the wal-receiver gives an error but the slot-sync worker does not raise any error/warning. It is no-op though as it has a check 'if (XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) do nothing'. Is this okay or shall the slot-sync worker too raise an error and exit? In another case, when standby is started with valid primary_slot_name, but it is changed to some invalid value in runtime, then walreceiver starts giving error but the slot-sync worker keeps on running. In this case, unlike the previous case, it even did not go to no-op mode (as it sees valid WalRcv->latestWalEnd from the earlier run) and keep pinging primary repeatedly for slots. Shall here it should error out or at least be no-op until we give a valid primary_slot_name? -- Thanks, Nisha
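For completeness, the query behind the snippet shown in point 2; it can be run on the primary to spot the slot-sync worker's connection (the application_name value is taken from the output above):

    SELECT datname, usename, application_name, wait_event_type, backend_type
    FROM pg_stat_activity
    WHERE application_name = 'slotsyncworker';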
On Fri, Dec 1, 2023 at 5:40 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > Review for v41 patch. > > 1. > ====== > src/backend/utils/misc/postgresql.conf.sample > > +#enable_syncslot = on # enables slot synchronization on the physical > standby from the primary > > enable_syncslot is disabled by default, so, it should be 'off' here. > > ~~~ > 2. > IIUC, the slotsyncworker's connection to the primary is to execute a > query. Its aim is not walsender type connection, but at primary when > queried, the 'backend_type' is set to 'walsender'. > Snippet from primary db- > > datname | usename | application_name | wait_event_type | backend_type > ---------+-------------+------------------+-----------------+-------------- > postgres | replication | slotsyncworker | Client | walsender > > Is it okay? > > ~~~ > 3. > As per current logic, If there are slots on primary with disabled > subscriptions, then, when standby is created it replicates these slots > but can't make them sync-ready until any activity happens on the > slots. > So, such slots stay in 'i' sync-state and get dropped when failover > happens. Now, if the subscriber tries to enable their existing > subscription after failover, it gives an error that the slot does not > exist. Is this behavior expected? If yes, then is it worth documenting about disabled subscription slots not being synced? -- Thanks, Nisha
On Fri, Dec 1, 2023 at 9:31 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > On Fri, Dec 1, 2023 at 5:40 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > > > Review for v41 patch. > > > > 1. > > ====== > > src/backend/utils/misc/postgresql.conf.sample > > > > +#enable_syncslot = on # enables slot synchronization on the physical > > standby from the primary > > > > enable_syncslot is disabled by default, so, it should be 'off' here. > > > > ~~~ > > 2. > > IIUC, the slotsyncworker's connection to the primary is to execute a > > query. Its aim is not walsender type connection, but at primary when > > queried, the 'backend_type' is set to 'walsender'. > > Snippet from primary db- > > > > datname | usename | application_name | wait_event_type | backend_type > > ---------+-------------+------------------+-----------------+-------------- > > postgres | replication | slotsyncworker | Client | walsender > > > > Is it okay? > > > > ~~~ > > 3. > > As per current logic, If there are slots on primary with disabled > > subscriptions, then, when standby is created it replicates these slots > > but can't make them sync-ready until any activity happens on the > > slots. > > So, such slots stay in 'i' sync-state and get dropped when failover > > happens. Now, if the subscriber tries to enable their existing > > subscription after failover, it gives an error that the slot does not > > exist. > > Is this behavior expected? If yes, then is it worth documenting about > disabled subscription slots not being synced? > This is expected behavior because even if we would retain such slots (state 'i'), we won't be able to make them in the 'ready' state after failover because we can't get the required WAL from the primary. -- With Regards, Amit Kapila.
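As a concrete illustration of the scenario in point 3, a hedged sketch of what the subscriber side would see after failover when re-enabling a subscription whose slot never left the 'i' state and was therefore dropped on promotion (names are made up; the failure surfaces in the apply worker's log rather than in the ALTER command itself):

    -- On the subscriber, after the standby has been promoted:
    ALTER SUBSCRIPTION mysub ENABLE;
    -- the apply worker is then expected to report that the replication slot
    -- for mysub does not exist on the new primary.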
On Wednesday, November 29, 2023 5:55 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: Hi, > On 11/29/23 6:58 AM, Zhijie Hou (Fujitsu) wrote: > > On Tuesday, November 28, 2023 8:07 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > >> On 11/27/23 9:57 AM, Zhijie Hou (Fujitsu) wrote: > >>> On Monday, November 27, 2023 4:51 PM shveta malik > >> <shveta.malik@gmail.com> wrote: > >>> > >>> Here is the updated version(v39_2) which include all the changes > >>> made in > >> 0002. > >>> Please use for review, and sorry for the confusion. > >>> > >> > >> Thanks! > >> > >> As far v39_2-0001: > >> > >> " > >> Altering the failover option of the subscription is currently not > >> permitted. However, this restriction may be lifted in future versions. > >> " > >> > >> Should we mention that we can alter the related replication slot? > > > > Will add. > > > >> > >> + <para> > >> + The implementation of failover requires that replication > >> + has successfully finished the initial table synchronization > >> + phase. So even when <literal>failover</literal> is enabled for a > >> + subscription, the internal failover state remains > >> + temporarily <quote>pending</quote> until the > >> + initialization > >> phase > >> + completes. See column > >> <structfield>subfailoverstate</structfield> > >> + of <link > >> > linkend="catalog-pg-subscription"><structname>pg_subscription</struct > >> na > >> me></link> > >> + to know the actual failover state. > >> + </para> > >> > >> I think we have a corner case here. If one alter the replication slot > >> on the primary then "subfailoverstate" is not updated accordingly on the > subscriber. > >> Given the 2 remarks above would that make sense to prevent altering a > >> replication slot associated to a subscription? > > > > Thanks for the review! > > > > I think we could not distinguish the user created logical slot or > > subscriber created slot as there is no related info in slot's data. > > Yeah that would need extra work. > > > And user could change > > the slot on subscription by "alter sub set (slot_name)", so > > maintaining this info would need some efforts. > > > > Yes. > > > Besides, I think this case overlaps the previous discussed "alter sub > > set (slot_name)" issue[1]. Both the cases are because the slot's > > failover is different from the subscription's failover setting. > > Yeah agree. > > > I think we could handle > > them similarly that user need to take care of not changing the > > failover to wrong value. Or do you prefer another approach that > > mentioned in that thread[1] ? (always alter the slot at the startup of apply > worker). > > > > I think I'm fine with documenting the fact that the user should not change the > failover value. But if he does change it (because at the end nothing prevents it > to do so) then I think the meaning of subfailoverstate should still make sense. > > One way to achieve this could be to change its meaning? Say rename it to say > subfailovercreationstate (to reflect the fact that it was the state at the creation > time) and change messages like: > > " > ALTER SUBSCRIPTION with refresh and copy_data is not allowed when failover > is enabled " > > to something like > > " > ALTER SUBSCRIPTION with refresh and copy_data is not allowed for > subscription created with failover enabled" > " > > and change the doc accordingly. > > What do you think? 
This idea may work for now, but I think we planned to support ALTER SUBSCRIPTION (failover) in a later patch, which means the meaning of subfailovercreationstate may become invalid after that because we will be able to change this value using ALTER SUBSCRIPTION as well. I think documenting the case is OK because: Currently, users can already create similar inconsistency cases as we don't restrict users from changing the slot on the publisher. E.g., a user could drop and recreate the slot used by a subscription but with different settings, or use ALTER SUBSCRIPTION ... SET (slot_name) to switch to a new slot with different settings. For example, regarding the two_phase option, a user can create a subscription with two_phase disabled and later point the subscription's slot_name to a new slot with two_phase enabled, which is a case similar to the failover one. Best Regards, Hou zj
On Fri, Dec 1, 2023 at 5:40 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > Review for v41 patch. Thanks for the feedback. > > 1. > ====== > src/backend/utils/misc/postgresql.conf.sample > > +#enable_syncslot = on # enables slot synchronization on the physical > standby from the primary > > enable_syncslot is disabled by default, so, it should be 'off' here. > Sure, I will change it. > ~~~ > 2. > IIUC, the slotsyncworker's connection to the primary is to execute a > query. Its aim is not walsender type connection, but at primary when > queried, the 'backend_type' is set to 'walsender'. > Snippet from primary db- > > datname | usename | application_name | wait_event_type | backend_type > ---------+-------------+------------------+-----------------+-------------- > postgres | replication | slotsyncworker | Client | walsender > > Is it okay? > Slot sync worker uses 'libpqrcv_connect' for connection which sends 'replication'-'database' key-value pair as one of the connection options. And on the primary side, 'ProcessStartupPacket' on the basis of this key-value pair sets the process as walsender one (am_walsender = true). And thus this reflects as backend_type='walsender' in pg_stat_activity. I do not see any harm in this backend_type for slot-sync worker currently. This is on a similar line of connections used for logical-replications. And since a slot-sync worker also deals with wals-positions (lsns), it is okay to maintain backend_type as walsender unless you (or others) see any potential issue in doing that. So let me know. > ~~~ > 3. > As per current logic, If there are slots on primary with disabled > subscriptions, then, when standby is created it replicates these slots > but can't make them sync-ready until any activity happens on the > slots. > So, such slots stay in 'i' sync-state and get dropped when failover > happens. Now, if the subscriber tries to enable their existing > subscription after failover, it gives an error that the slot does not > exist. > yes, this is expected as Amit explained in [1]. But let me review if we need to document this case for disabled subscriptions. i.e. disabled subscription if enabled after promotion might not work. > ~~~ > 4. primary_slot_name GUC value test: > > When standby is started with a non-existing primary_slot_name, the > wal-receiver gives an error but the slot-sync worker does not raise > any error/warning. It is no-op though as it has a check 'if > (XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) do nothing'. Is this > okay or shall the slot-sync worker too raise an error and exit? > > In another case, when standby is started with valid primary_slot_name, > but it is changed to some invalid value in runtime, then walreceiver > starts giving error but the slot-sync worker keeps on running. In this > case, unlike the previous case, it even did not go to no-op mode (as > it sees valid WalRcv->latestWalEnd from the earlier run) and keep > pinging primary repeatedly for slots. Shall here it should error out > or at least be no-op until we give a valid primary_slot_name? > I reviewed it. There is no way to test the existence/validity of 'primary_slot_name' on standby without making a connection to primary. If primary_slot_name is invalid from the start, slot-sync worker will be no-op (as you tested) as WalRecv->latestWalENd will be invalid, and if 'primary_slot_name' is changed to invalid on runtime, slot-sync worker will still keep on pinging primary. 
But that should be okay (in fact needed) as it needs to sync at least the slots' previous positions (in case it was delayed in doing so for some reason earlier). And once the slots are up-to-date on the standby, even if the worker pings the primary, it will not see any change in the slot LSNs and will thus go for a longer nap. I think it is not worth the effort to introduce the complexity of checking the validity of 'primary_slot_name' on the primary from the standby for this rare scenario. It will be good to know the thoughts of others on the above 3 points. thanks Shveta
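As a concrete illustration of the naptime behaviour described above, a back-off loop of roughly the following shape would do. This is a sketch only: the adjust_naptime() name and the constants are illustrative (the 10ms/10s bounds match the values mentioned for the worker elsewhere in this thread), not the patch's actual code.

/* Sketch only: naptime back-off for the slot sync worker. */
#include "postgres.h"

#define MIN_WORKER_NAPTIME_MS	10		/* poll quickly while slots move */
#define MAX_WORKER_NAPTIME_MS	10000	/* back off to 10s when idle */

static long naptime_ms = MIN_WORKER_NAPTIME_MS;

static void
adjust_naptime(bool some_slot_updated)
{
	if (some_slot_updated)
	{
		/* activity seen on the primary: go back to fast polling */
		naptime_ms = MIN_WORKER_NAPTIME_MS;
	}
	else
	{
		/* nothing changed since the last cycle: nap longer, up to the cap */
		naptime_ms = Min(naptime_ms * 10, MAX_WORKER_NAPTIME_MS);
	}
}

With such a scheme, repeated pings of an idle primary only cost one query per (increasingly long) nap, which is why keeping the worker pinging in this rare case is cheap.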
On Mon, Dec 4, 2023 at 10:40 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Dec 1, 2023 at 5:40 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > > > Review for v41 patch. > > Thanks for the feedback. > > > > > 1. > > ====== > > src/backend/utils/misc/postgresql.conf.sample > > > > +#enable_syncslot = on # enables slot synchronization on the physical > > standby from the primary > > > > enable_syncslot is disabled by default, so, it should be 'off' here. > > > > Sure, I will change it. > > > ~~~ > > 2. > > IIUC, the slotsyncworker's connection to the primary is to execute a > > query. Its aim is not walsender type connection, but at primary when > > queried, the 'backend_type' is set to 'walsender'. > > Snippet from primary db- > > > > datname | usename | application_name | wait_event_type | backend_type > > ---------+-------------+------------------+-----------------+-------------- > > postgres | replication | slotsyncworker | Client | walsender > > > > Is it okay? > > > > Slot sync worker uses 'libpqrcv_connect' for connection which sends > 'replication'-'database' key-value pair as one of the connection > options. And on the primary side, 'ProcessStartupPacket' on the basis > of this key-value pair sets the process as walsender one (am_walsender > = true). > And thus this reflects as backend_type='walsender' in > pg_stat_activity. I do not see any harm in this backend_type for > slot-sync worker currently. This is on a similar line of connections > used for logical-replications. And since a slot-sync worker also deals > with wals-positions (lsns), it is okay to maintain backend_type as > walsender unless you (or others) see any potential issue in doing > that. So let me know. > > > ~~~ > > 3. > > As per current logic, If there are slots on primary with disabled > > subscriptions, then, when standby is created it replicates these slots > > but can't make them sync-ready until any activity happens on the > > slots. > > So, such slots stay in 'i' sync-state and get dropped when failover > > happens. Now, if the subscriber tries to enable their existing > > subscription after failover, it gives an error that the slot does not > > exist. > > > > yes, this is expected as Amit explained in [1]. But let me review if > we need to document this case for disabled subscriptions. i.e. > disabled subscription if enabled after promotion might not work. Sorry, missed to mention the link earlier: [1]: https://www.postgresql.org/message-id/CAA4eK1J5Hxp%2BzhvptyyjqQ4JSQzwnkFRXtQn8v9opxtZmmY_Ug%40mail.gmail.com
Hi, On 12/4/23 4:33 AM, Zhijie Hou (Fujitsu) wrote: > On Wednesday, November 29, 2023 5:55 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > >> I think I'm fine with documenting the fact that the user should not change the >> failover value. But if he does change it (because at the end nothing prevents it >> to do so) then I think the meaning of subfailoverstate should still make sense. >> >> One way to achieve this could be to change its meaning? Say rename it to say >> subfailovercreationstate (to reflect the fact that it was the state at the creation >> time) and change messages like: >> >> " >> ALTER SUBSCRIPTION with refresh and copy_data is not allowed when failover >> is enabled " >> >> to something like >> >> " >> ALTER SUBSCRIPTION with refresh and copy_data is not allowed for >> subscription created with failover enabled" >> " >> >> and change the doc accordingly. >> >> What do you think? > I think document the case is OK because: > > Currently, user already can create similar inconsistency cases as we don't restrict > user to change the slot on publisher. E.g., User could drop and recreate the > slot used by subscription but with different setting. Or user ALTER > SUBSCRIPTION set (slot_name) to switch to a new slot with different setting. > > For example, about two_phase option, user can create a subscription with > two_phase disabled, then later it can set subscription slot_name to a new slot > with two_phase enabled which is the similar case as the failover. > Yeah, right, did not think that such "inconsistency" can already happen. So agree to keep "subfailoverstate" and "just" document the case then. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 12/4/23 6:10 AM, shveta malik wrote: > On Fri, Dec 1, 2023 at 5:40 PM Nisha Moond <nisha.moond412@gmail.com> wrote: >> >> Review for v41 patch. > > Thanks for the feedback. > >> ~~~ >> 2. >> IIUC, the slotsyncworker's connection to the primary is to execute a >> query. Its aim is not walsender type connection, but at primary when >> queried, the 'backend_type' is set to 'walsender'. >> Snippet from primary db- >> >> datname | usename | application_name | wait_event_type | backend_type >> ---------+-------------+------------------+-----------------+-------------- >> postgres | replication | slotsyncworker | Client | walsender >> >> Is it okay? >> > > Slot sync worker uses 'libpqrcv_connect' for connection which sends > 'replication'-'database' key-value pair as one of the connection > options. And on the primary side, 'ProcessStartupPacket' on the basis > of this key-value pair sets the process as walsender one (am_walsender > = true). > And thus this reflects as backend_type='walsender' in > pg_stat_activity. I do not see any harm in this backend_type for > slot-sync worker currently. This is on a similar line of connections > used for logical-replications. And since a slot-sync worker also deals > with wals-positions (lsns), it is okay to maintain backend_type as > walsender unless you (or others) see any potential issue in doing > that. So let me know. I don't see any issue as well (though I understand it might seems weird to see a walsender process being spawned doing non replication stuff) > >> ~~~ >> 4. primary_slot_name GUC value test: >> >> When standby is started with a non-existing primary_slot_name, the >> wal-receiver gives an error but the slot-sync worker does not raise >> any error/warning. It is no-op though as it has a check 'if >> (XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) do nothing'. Is this >> okay or shall the slot-sync worker too raise an error and exit? >> >> In another case, when standby is started with valid primary_slot_name, >> but it is changed to some invalid value in runtime, then walreceiver >> starts giving error but the slot-sync worker keeps on running. In this >> case, unlike the previous case, it even did not go to no-op mode (as >> it sees valid WalRcv->latestWalEnd from the earlier run) and keep >> pinging primary repeatedly for slots. Shall here it should error out >> or at least be no-op until we give a valid primary_slot_name? >> > Nice catch, thanks! > I reviewed it. There is no way to test the existence/validity of > 'primary_slot_name' on standby without making a connection to primary. > If primary_slot_name is invalid from the start, slot-sync worker will > be no-op (as you tested) as WalRecv->latestWalENd will be invalid, and > if 'primary_slot_name' is changed to invalid on runtime, slot-sync > worker will still keep on pinging primary. But that should be okay (in > fact needed) as it needs to sync at-least the previous slot's > positions (in case it is delayed in doing so for some reason earlier). > And once the slots are up-to-date on standby, even if worker pings > primary, it will not see any change in slots lsns and thus go for > longer nap. I think, it is not worth the effort to introduce the > complexity of checking validity of 'primary_slot_name' on primary from > standby for this rare scenario. > Maybe another option could be to have the walreceiver a way to let the slot sync worker knows that it (the walreceiver) was not able to start due to non existing replication slot on the primary? 
(that way we'd avoid the slot sync worker having to talk to the primary). Not sure about the extra effort to make it work, though. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 12/4/23 6:10 AM, shveta malik wrote: > > On Fri, Dec 1, 2023 at 5:40 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > >> > >> Review for v41 patch. > > > > Thanks for the feedback. > > > >> ~~~ > >> 2. > >> IIUC, the slotsyncworker's connection to the primary is to execute a > >> query. Its aim is not walsender type connection, but at primary when > >> queried, the 'backend_type' is set to 'walsender'. > >> Snippet from primary db- > >> > >> datname | usename | application_name | wait_event_type | backend_type > >> ---------+-------------+------------------+-----------------+-------------- > >> postgres | replication | slotsyncworker | Client | walsender > >> > >> Is it okay? > >> > > > > Slot sync worker uses 'libpqrcv_connect' for connection which sends > > 'replication'-'database' key-value pair as one of the connection > > options. And on the primary side, 'ProcessStartupPacket' on the basis > > of this key-value pair sets the process as walsender one (am_walsender > > = true). > > And thus this reflects as backend_type='walsender' in > > pg_stat_activity. I do not see any harm in this backend_type for > > slot-sync worker currently. This is on a similar line of connections > > used for logical-replications. And since a slot-sync worker also deals > > with wals-positions (lsns), it is okay to maintain backend_type as > > walsender unless you (or others) see any potential issue in doing > > that. So let me know. > > I don't see any issue as well (though I understand it might > seems weird to see a walsender process being spawned doing non > replication stuff) > > > > >> ~~~ > >> 4. primary_slot_name GUC value test: > >> > >> When standby is started with a non-existing primary_slot_name, the > >> wal-receiver gives an error but the slot-sync worker does not raise > >> any error/warning. It is no-op though as it has a check 'if > >> (XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) do nothing'. Is this > >> okay or shall the slot-sync worker too raise an error and exit? > >> > >> In another case, when standby is started with valid primary_slot_name, > >> but it is changed to some invalid value in runtime, then walreceiver > >> starts giving error but the slot-sync worker keeps on running. In this > >> case, unlike the previous case, it even did not go to no-op mode (as > >> it sees valid WalRcv->latestWalEnd from the earlier run) and keep > >> pinging primary repeatedly for slots. Shall here it should error out > >> or at least be no-op until we give a valid primary_slot_name? > >> > > > > Nice catch, thanks! > > > I reviewed it. There is no way to test the existence/validity of > > 'primary_slot_name' on standby without making a connection to primary. > > If primary_slot_name is invalid from the start, slot-sync worker will > > be no-op (as you tested) as WalRecv->latestWalENd will be invalid, and > > if 'primary_slot_name' is changed to invalid on runtime, slot-sync > > worker will still keep on pinging primary. But that should be okay (in > > fact needed) as it needs to sync at-least the previous slot's > > positions (in case it is delayed in doing so for some reason earlier). > > And once the slots are up-to-date on standby, even if worker pings > > primary, it will not see any change in slots lsns and thus go for > > longer nap. 
I think, it is not worth the effort to introduce the > > complexity of checking validity of 'primary_slot_name' on primary from > > standby for this rare scenario. > > > > Maybe another option could be to have the walreceiver a way to let the slot sync > worker knows that it (the walreceiver) was not able to start due to non existing > replication slot on the primary? (that way we'd avoid the slot sync worker having > to talk to the primary). Few points: 1) I think if we do it, we should do it in generic way i.e. slotsync worker should go to no-op if walreceiver is not able to start due to any reason and not only due to invalid primary_slot_name. 2) Secondly, slotsync worker needs to make sure it has synced the slots so far i.e. worker should not go to no-op immediately on seeing missing WalRcv process if there are pending slots to be synced. So the generic way I see to have this optimization is: 1) Slotsync worker can use 'WalRcv->pid' to figure out if WalReceiver is running or not. 2) Slotsync worker should check null 'WalRcv->pid' only when no-activity is observed for threshold time i.e. it can do it during existing logic of increasing naptime. 3) On finding null 'WalRcv->pid', worker can mark a flag to go to no-op unless WalRcv->pid becomes valid again. Marking this flag during increasing naptime will guarantee that the worker has taken all the changes so far i.e. standby is not lagging in terms of slots. Thoughts? thanks Shveta
Hi, On 12/5/23 6:08 AM, shveta malik wrote: > On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> Maybe another option could be to have the walreceiver a way to let the slot sync >> worker knows that it (the walreceiver) was not able to start due to non existing >> replication slot on the primary? (that way we'd avoid the slot sync worker having >> to talk to the primary). > > Few points: > 1) I think if we do it, we should do it in generic way i.e. slotsync > worker should go to no-op if walreceiver is not able to start due to > any reason and not only due to invalid primary_slot_name. Agree. > 2) Secondly, slotsync worker needs to make sure it has synced the > slots so far i.e. worker should not go to no-op immediately on seeing > missing WalRcv process if there are pending slots to be synced. Agree. > So the generic way I see to have this optimization is: > 1) Slotsync worker can use 'WalRcv->pid' to figure out if WalReceiver > is running or not. Not sure that would work because the walreceiver keeps try re-starting and so get a pid before reaching the "could not start WAL streaming: ERROR: replication slot "XXXX" does not exist" error. We may want to add an extra check on walrcv->walRcvState (or should/could be enough by its own). But walrcv->walRcvState is set to WALRCV_STREAMING way before walrcv_startstreaming(). Wouldn't that make sense to move it once we are sure that walrcv_startstreaming() returns true and first_stream is true, here? " if (first_stream) + { ereport(LOG, (errmsg("started streaming WAL from primary at %X/%X on timeline %u", LSN_FORMAT_ARGS(startpoint), startpointTLI))); + SpinLockAcquire(&walrcv->mutex); + walrcv->walRcvState = WALRCV_STREAMING; + SpinLockRelease(&walrcv->mutex); + } " > 2) Slotsync worker should check null 'WalRcv->pid' only when > no-activity is observed for threshold time i.e. it can do it during > existing logic of increasing naptime. > 3) On finding null 'WalRcv->pid', worker can mark a flag to go to > no-op unless WalRcv->pid becomes valid again. Marking this flag during > increasing naptime will guarantee that the worker has taken all the > changes so far i.e. standby is not lagging in terms of slots. > 2) and 3) looks good to me but with a check on walrcv->walRcvState looking for WALRCV_STREAMING state instead of looking for a non null WalRcv->pid. And only if it makes sense to move the walrcv->walRcvState = WALRCV_STREAMING as mentioned above (I think it does). Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Dec 5, 2023 at 2:18 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 12/5/23 6:08 AM, shveta malik wrote: > > On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> Maybe another option could be to have the walreceiver a way to let the slot sync > >> worker knows that it (the walreceiver) was not able to start due to non existing > >> replication slot on the primary? (that way we'd avoid the slot sync worker having > >> to talk to the primary). > > > > Few points: > > 1) I think if we do it, we should do it in generic way i.e. slotsync > > worker should go to no-op if walreceiver is not able to start due to > > any reason and not only due to invalid primary_slot_name. > > Agree. > > > 2) Secondly, slotsync worker needs to make sure it has synced the > > slots so far i.e. worker should not go to no-op immediately on seeing > > missing WalRcv process if there are pending slots to be synced. > > Agree. > > > So the generic way I see to have this optimization is: > > 1) Slotsync worker can use 'WalRcv->pid' to figure out if WalReceiver > > is running or not. > > Not sure that would work because the walreceiver keeps try re-starting > and so get a pid before reaching the "could not start WAL streaming: ERROR: replication slot "XXXX" does not exist" > error. > yes, right. pid will keep on toggling. > We may want to add an extra check on walrcv->walRcvState (or should/could be enough by its own). > But walrcv->walRcvState is set to WALRCV_STREAMING way before walrcv_startstreaming(). > Agree. Check on 'walrcv->walRcvState' alone should suffice. > Wouldn't that make sense to move it once we are sure that > walrcv_startstreaming() returns true and first_stream is true, here? > > " > if (first_stream) > + { > ereport(LOG, > (errmsg("started streaming WAL from primary at %X/%X on timeline %u", > LSN_FORMAT_ARGS(startpoint), startpointTLI))); > + SpinLockAcquire(&walrcv->mutex); > + walrcv->walRcvState = WALRCV_STREAMING; > + SpinLockRelease(&walrcv->mutex); > + } > " > Yes, it makes sense and is the basis for current slot-sync worker changes being discussed. > > 2) Slotsync worker should check null 'WalRcv->pid' only when > > no-activity is observed for threshold time i.e. it can do it during > > existing logic of increasing naptime. > > 3) On finding null 'WalRcv->pid', worker can mark a flag to go to > > no-op unless WalRcv->pid becomes valid again. Marking this flag during > > increasing naptime will guarantee that the worker has taken all the > > changes so far i.e. standby is not lagging in terms of slots. > > > > 2) and 3) looks good to me but with a check on walrcv->walRcvState > looking for WALRCV_STREAMING state instead of looking for a non null > WalRcv->pid. yes. But I think, the worker should enter no-op, when walRcvState is WALRCV_STOPPED and not when walRcvState != WALRCV_STREAMING as it is okay to have WALRCV_WAITING/STARTING/RESTARTING. But the worker should exit no-op only when it finds walRcvState switched back to WALRCV_STREAMING. > > And only if it makes sense to move the walrcv->walRcvState = WALRCV_STREAMING as > mentioned above (I think it does). > yes, I agree. thanks Shveta
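To make the state checks being agreed on here concrete, a minimal sketch is shown below. The in_noop flag and the update_noop_state() name are hypothetical; only WalRcv, its mutex, and the WalRcvState values are existing symbols from replication/walreceiver.h, and the sketch also assumes WALRCV_STREAMING is only set once streaming has really started, as proposed above.

/* Sketch only: enter/leave no-op mode based on the walreceiver state. */
#include "postgres.h"
#include "replication/walreceiver.h"
#include "storage/spin.h"

static bool in_noop = false;	/* hypothetical worker-local flag */

static void
update_noop_state(void)
{
	WalRcvState state;

	SpinLockAcquire(&WalRcv->mutex);
	state = WalRcv->walRcvState;
	SpinLockRelease(&WalRcv->mutex);

	if (!in_noop)
	{
		/*
		 * Enter no-op only on WALRCV_STOPPED, and only from the code path
		 * that increases the naptime, i.e. after the worker has already been
		 * idle long enough to have synced any pending slot positions.
		 */
		if (state == WALRCV_STOPPED)
			in_noop = true;
	}
	else
	{
		/* Resume syncing only once streaming has actually (re)started. */
		if (state == WALRCV_STREAMING)
			in_noop = false;
	}
}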
On Tue, Dec 5, 2023 at 10:38 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > >> ~~~ > > >> 4. primary_slot_name GUC value test: > > >> > > >> When standby is started with a non-existing primary_slot_name, the > > >> wal-receiver gives an error but the slot-sync worker does not raise > > >> any error/warning. It is no-op though as it has a check 'if > > >> (XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) do nothing'. Is this > > >> okay or shall the slot-sync worker too raise an error and exit? > > >> > > >> In another case, when standby is started with valid primary_slot_name, > > >> but it is changed to some invalid value in runtime, then walreceiver > > >> starts giving error but the slot-sync worker keeps on running. In this > > >> case, unlike the previous case, it even did not go to no-op mode (as > > >> it sees valid WalRcv->latestWalEnd from the earlier run) and keep > > >> pinging primary repeatedly for slots. Shall here it should error out > > >> or at least be no-op until we give a valid primary_slot_name? > > >> > > > > > > > Nice catch, thanks! > > > > > I reviewed it. There is no way to test the existence/validity of > > > 'primary_slot_name' on standby without making a connection to primary. > > > If primary_slot_name is invalid from the start, slot-sync worker will > > > be no-op (as you tested) as WalRecv->latestWalENd will be invalid, and > > > if 'primary_slot_name' is changed to invalid on runtime, slot-sync > > > worker will still keep on pinging primary. But that should be okay (in > > > fact needed) as it needs to sync at-least the previous slot's > > > positions (in case it is delayed in doing so for some reason earlier). > > > And once the slots are up-to-date on standby, even if worker pings > > > primary, it will not see any change in slots lsns and thus go for > > > longer nap. I think, it is not worth the effort to introduce the > > > complexity of checking validity of 'primary_slot_name' on primary from > > > standby for this rare scenario. > > > > > > > Maybe another option could be to have the walreceiver a way to let the slot sync > > worker knows that it (the walreceiver) was not able to start due to non existing > > replication slot on the primary? (that way we'd avoid the slot sync worker having > > to talk to the primary). > > Few points: > 1) I think if we do it, we should do it in generic way i.e. slotsync > worker should go to no-op if walreceiver is not able to start due to > any reason and not only due to invalid primary_slot_name. > 2) Secondly, slotsync worker needs to make sure it has synced the > slots so far i.e. worker should not go to no-op immediately on seeing > missing WalRcv process if there are pending slots to be synced. > Won't it be better to just ping and check the validity of 'primary_slot_name' at the start of slot-sync and if it is changed anytime? I think it would be better to avoid adding dependency on walreciever state as that sounds like needless complexity. -- With Regards, Amit Kapila.
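For reference, the ping suggested here could be a single query over the walreceiver connection machinery the worker already uses. The sketch below is illustrative only (the function name and the error wording are made up), and assumes a connection already obtained via libpqrcv_connect(); it uses only the existing walrcv_exec()/tuplestore APIs.

/* Sketch only: check that primary_slot_name exists on the primary. */
#include "postgres.h"
#include "catalog/pg_type.h"
#include "executor/tuptable.h"
#include "lib/stringinfo.h"
#include "replication/walreceiver.h"
#include "utils/builtins.h"
#include "utils/tuplestore.h"

static bool
primary_slot_name_is_valid(WalReceiverConn *wrconn, const char *slot_name)
{
	StringInfoData cmd;
	Oid			retType[1] = {BOOLOID};
	WalRcvExecResult *res;
	TupleTableSlot *tupslot;
	bool		exists = false;

	initStringInfo(&cmd);
	appendStringInfo(&cmd,
					 "SELECT count(*) > 0"
					 " FROM pg_catalog.pg_replication_slots"
					 " WHERE slot_type = 'physical' AND slot_name = %s",
					 quote_literal_cstr(slot_name));

	res = walrcv_exec(wrconn, cmd.data, 1, retType);
	pfree(cmd.data);

	if (res->status != WALRCV_OK_TUPLES)
		ereport(ERROR,
				(errmsg("could not check primary_slot_name \"%s\" on the primary: %s",
						slot_name, res->err)));

	tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple);
	if (tuplestore_gettupleslot(res->tuplestore, true, false, tupslot))
	{
		bool		isnull;

		exists = DatumGetBool(slot_getattr(tupslot, 1, &isnull));
	}
	ExecDropSingleTupleTableSlot(tupslot);
	walrcv_clear_result(res);

	return exists;
}

Calling something like this once at worker start and again whenever the GUC changes matches the "only once in the beginning" cost argued for above.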
Hi, On 12/5/23 11:29 AM, shveta malik wrote: > On Tue, Dec 5, 2023 at 2:18 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> Wouldn't that make sense to move it once we are sure that >> walrcv_startstreaming() returns true and first_stream is true, here? >> >> " >> if (first_stream) >> + { >> ereport(LOG, >> (errmsg("started streaming WAL from primary at %X/%X on timeline %u", >> LSN_FORMAT_ARGS(startpoint), startpointTLI))); >> + SpinLockAcquire(&walrcv->mutex); >> + walrcv->walRcvState = WALRCV_STREAMING; >> + SpinLockRelease(&walrcv->mutex); >> + } >> " >> > > Yes, it makes sense and is the basis for current slot-sync worker > changes being discussed. I think this change deserves its own dedicated thread and patch, does that make sense? If so, I'll submit one. >> >> 2) and 3) looks good to me but with a check on walrcv->walRcvState >> looking for WALRCV_STREAMING state instead of looking for a non null >> WalRcv->pid. > > yes. But I think, the worker should enter no-op, when walRcvState is > WALRCV_STOPPED and not when walRcvState != WALRCV_STREAMING as it is > okay to have WALRCV_WAITING/STARTING/RESTARTING. But the worker should > exit no-op only when it finds walRcvState switched back to > WALRCV_STREAMING. > Yeah, fully agree. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 12/5/23 12:32 PM, Amit Kapila wrote: > On Tue, Dec 5, 2023 at 10:38 AM shveta malik <shveta.malik@gmail.com> wrote: >> >> On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand >> <bertranddrouvot.pg@gmail.com> wrote: >>>> >>> >>> Maybe another option could be to have the walreceiver a way to let the slot sync >>> worker knows that it (the walreceiver) was not able to start due to non existing >>> replication slot on the primary? (that way we'd avoid the slot sync worker having >>> to talk to the primary). >> >> Few points: >> 1) I think if we do it, we should do it in generic way i.e. slotsync >> worker should go to no-op if walreceiver is not able to start due to >> any reason and not only due to invalid primary_slot_name. >> 2) Secondly, slotsync worker needs to make sure it has synced the >> slots so far i.e. worker should not go to no-op immediately on seeing >> missing WalRcv process if there are pending slots to be synced. >> > > Won't it be better to just ping and check the validity of > 'primary_slot_name' at the start of slot-sync and if it is changed > anytime? I think it would be better to avoid adding dependency on > walreciever state as that sounds like needless complexity. I think the overall extra complexity is linked to the fact that we first want to ensure that the slots are in sync before shutting down the sync slot worker. I think than talking to the primary or relying on the walreceiver state is "just" what would trigger the decision to shutdown the sync slot worker. Relying on the walreceiver state looks better to me (as it avoids possibly useless round trips with the primary). Also the walreceiver could be down for multiple reasons, and I think there is no point of having a sync slot worker running if the slots are in sync and there is no walreceiver running (even if primary_slot_name is a valid one). That said, I'm also ok with the "ping primary" approach if others have another point of view and find it better. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Dec 5, 2023 at 7:38 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > On 12/5/23 12:32 PM, Amit Kapila wrote: > > On Tue, Dec 5, 2023 at 10:38 AM shveta malik <shveta.malik@gmail.com> wrote: > >> > >> On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand > >> <bertranddrouvot.pg@gmail.com> wrote: > >>>> > >>> > >>> Maybe another option could be to have the walreceiver a way to let the slot sync > >>> worker knows that it (the walreceiver) was not able to start due to non existing > >>> replication slot on the primary? (that way we'd avoid the slot sync worker having > >>> to talk to the primary). > >> > >> Few points: > >> 1) I think if we do it, we should do it in generic way i.e. slotsync > >> worker should go to no-op if walreceiver is not able to start due to > >> any reason and not only due to invalid primary_slot_name. > >> 2) Secondly, slotsync worker needs to make sure it has synced the > >> slots so far i.e. worker should not go to no-op immediately on seeing > >> missing WalRcv process if there are pending slots to be synced. > >> > > > > Won't it be better to just ping and check the validity of > > 'primary_slot_name' at the start of slot-sync and if it is changed > > anytime? I think it would be better to avoid adding dependency on > > walreciever state as that sounds like needless complexity. > > I think the overall extra complexity is linked to the fact that we first > want to ensure that the slots are in sync before shutting down the > sync slot worker. > > I think than talking to the primary or relying on the walreceiver state > is "just" what would trigger the decision to shutdown the sync slot worker. > > Relying on the walreceiver state looks better to me (as it avoids possibly > useless round trips with the primary). > But the round trip will only be once in the beginning and if the user changes the GUC primary-slot_name which shouldn't be that often. > Also the walreceiver could be down for multiple reasons, and I think there > is no point of having a sync slot worker running if the slots are in sync and > there is no walreceiver running (even if primary_slot_name is a valid one). > I feel that is indirectly relying on the fact that the primary won't advance logical slots unless physical standby has consumed data. Now, it is possible that slot-sync worker lags behind and still needs to sync more data for slots in which it makes sense for slot-sync worker to be alive. I think we can try to avoid checking walreceiver status till we can get more data to avoid the problem I mentioned but it doesn't sound like a clean way to achieve our purpose. -- With Regards, Amit Kapila.
On Wed, Dec 6, 2023 at 10:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Dec 5, 2023 at 7:38 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > On 12/5/23 12:32 PM, Amit Kapila wrote: > > > On Tue, Dec 5, 2023 at 10:38 AM shveta malik <shveta.malik@gmail.com> wrote: > > >> > > >> On Mon, Dec 4, 2023 at 10:07 PM Drouvot, Bertrand > > >> <bertranddrouvot.pg@gmail.com> wrote: > > >>>> > > >>> > > >>> Maybe another option could be to have the walreceiver a way to let the slot sync > > >>> worker knows that it (the walreceiver) was not able to start due to non existing > > >>> replication slot on the primary? (that way we'd avoid the slot sync worker having > > >>> to talk to the primary). > > >> > > >> Few points: > > >> 1) I think if we do it, we should do it in generic way i.e. slotsync > > >> worker should go to no-op if walreceiver is not able to start due to > > >> any reason and not only due to invalid primary_slot_name. > > >> 2) Secondly, slotsync worker needs to make sure it has synced the > > >> slots so far i.e. worker should not go to no-op immediately on seeing > > >> missing WalRcv process if there are pending slots to be synced. > > >> > > > > > > Won't it be better to just ping and check the validity of > > > 'primary_slot_name' at the start of slot-sync and if it is changed > > > anytime? I think it would be better to avoid adding dependency on > > > walreciever state as that sounds like needless complexity. > > > > I think the overall extra complexity is linked to the fact that we first > > want to ensure that the slots are in sync before shutting down the > > sync slot worker. > > > > I think than talking to the primary or relying on the walreceiver state > > is "just" what would trigger the decision to shutdown the sync slot worker. > > > > Relying on the walreceiver state looks better to me (as it avoids possibly > > useless round trips with the primary). > > > > But the round trip will only be once in the beginning and if the user > changes the GUC primary-slot_name which shouldn't be that often. > > > Also the walreceiver could be down for multiple reasons, and I think there > > is no point of having a sync slot worker running if the slots are in sync and > > there is no walreceiver running (even if primary_slot_name is a valid one). > > > > I feel that is indirectly relying on the fact that the primary won't > advance logical slots unless physical standby has consumed data. Yes, that is the basis of this discussion. But now on rethinking, if the user has not set 'standby_slot_names' on primary at first pace, then even if walreceiver on standby is down, slots on primary will keep on advancing and thus we need to sync. We have no check currently that mandates users to set standby_slot_names. > Now, > it is possible that slot-sync worker lags behind and still needs to > sync more data for slots in which it makes sense for slot-sync worker > to be alive. I think we can try to avoid checking walreceiver status > till we can get more data to avoid the problem I mentioned but it > doesn't sound like a clean way to achieve our purpose. >
Hi, On 12/6/23 7:18 AM, shveta malik wrote: > On Wed, Dec 6, 2023 at 10:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> I feel that is indirectly relying on the fact that the primary won't >> advance logical slots unless physical standby has consumed data. > > Yes, that is the basis of this discussion. Yes. > But now on rethinking, if > the user has not set 'standby_slot_names' on primary at first pace, > then even if walreceiver on standby is down, slots on primary will > keep on advancing Oh right, good point. > and thus we need to sync. Yes and I think our current check "XLogRecPtrIsInvalid(WalRcv->latestWalEnd)" in synchronize_slots() prevents us to do so (as I think WalRcv->latestWalEnd would be invalid for a non started walreceiver). > We have no check currently > that mandates users to set standby_slot_names. > Yeah and OTOH unset standby_slot_names is currently the only way for users to "force" advance failover slots if they want to (in case say the standby is down for a long time and they don't want to block logical decoding on the primary) as we don't provide a way to alter the failover property (unless connecting with replication which sounds more like a hack). >> Now, >> it is possible that slot-sync worker lags behind and still needs to >> sync more data for slots in which it makes sense for slot-sync worker >> to be alive. Right. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
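For readers following the thread, the guard being referred to amounts to roughly the following; only WalRcv and its fields are existing symbols, the helper name is made up and this is a paraphrase of the check discussed, not the patch's exact code.

/* Sketch only: skip a sync cycle until the walreceiver has reported any WAL. */
#include "postgres.h"
#include "access/xlogdefs.h"
#include "replication/walreceiver.h"
#include "storage/spin.h"

static bool
walrcv_has_reported_wal(void)
{
	XLogRecPtr	latest;

	SpinLockAcquire(&WalRcv->mutex);
	latest = WalRcv->latestWalEnd;
	SpinLockRelease(&WalRcv->mutex);

	/* invalid until the walreceiver has connected and received something */
	return !XLogRecPtrIsInvalid(latest);
}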
On Wed, Dec 6, 2023 at 3:00 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 12/6/23 7:18 AM, shveta malik wrote: > > On Wed, Dec 6, 2023 at 10:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> I feel that is indirectly relying on the fact that the primary won't > >> advance logical slots unless physical standby has consumed data. > > > > Yes, that is the basis of this discussion. > > Yes. > > > But now on rethinking, if > > the user has not set 'standby_slot_names' on primary at first pace, > > then even if walreceiver on standby is down, slots on primary will > > keep on advancing > > Oh right, good point. > > > and thus we need to sync. > > Yes and I think our current check "XLogRecPtrIsInvalid(WalRcv->latestWalEnd)" > in synchronize_slots() prevents us to do so (as I think WalRcv->latestWalEnd > would be invalid for a non started walreceiver). > But I think we do not need to deal with the case that walreceiver is not started at all on standby. It is always started. Walreceiver not getting started or down for long is a rare scenario. We have other checks too for 'latestWalEnd' in slotsync worker and I think we should retain those as is. > > We have no check currently > > that mandates users to set standby_slot_names. > > > > Yeah and OTOH unset standby_slot_names is currently the only > way for users to "force" advance failover slots if they want to (in case > say the standby is down for a long time and they don't want to block logical decoding > on the primary) as we don't provide a way to alter the failover property > (unless connecting with replication which sounds more like a hack). > yes, right. > >> Now, > >> it is possible that slot-sync worker lags behind and still needs to > >> sync more data for slots in which it makes sense for slot-sync worker > >> to be alive. > > Right. > > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
On Wed, Dec 6, 2023 at 4:28 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Wed, Dec 6, 2023 at 3:00 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On 12/6/23 7:18 AM, shveta malik wrote: > > > On Wed, Dec 6, 2023 at 10:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > >> > > >> I feel that is indirectly relying on the fact that the primary won't > > >> advance logical slots unless physical standby has consumed data. > > > > > > Yes, that is the basis of this discussion. > > > > Yes. > > > > > But now on rethinking, if > > > the user has not set 'standby_slot_names' on primary at first pace, > > > then even if walreceiver on standby is down, slots on primary will > > > keep on advancing > > > > Oh right, good point. > > > > > and thus we need to sync. > > > > Yes and I think our current check "XLogRecPtrIsInvalid(WalRcv->latestWalEnd)" > > in synchronize_slots() prevents us to do so (as I think WalRcv->latestWalEnd > > would be invalid for a non started walreceiver). > > > > But I think we do not need to deal with the case that walreceiver is > not started at all on standby. It is always started. Walreceiver not > getting started or down for long is a rare scenario. We have other > checks too for 'latestWalEnd' in slotsync worker and I think we should > retain those as is. > > > > We have no check currently > > > that mandates users to set standby_slot_names. > > > > > > > Yeah and OTOH unset standby_slot_names is currently the only > > way for users to "force" advance failover slots if they want to (in case > > say the standby is down for a long time and they don't want to block logical decoding > > on the primary) as we don't provide a way to alter the failover property > > (unless connecting with replication which sounds more like a hack). > > > > yes, right. > > > >> Now, > > >> it is possible that slot-sync worker lags behind and still needs to > > >> sync more data for slots in which it makes sense for slot-sync worker > > >> to be alive. > > > > Right. > > > > Regards, > > > > -- > > Bertrand Drouvot > > PostgreSQL Contributors Team > > RDS Open Source Databases > > Amazon Web Services: https://aws.amazon.com PFA v43, changes are: v43-001: 1) Support of 'failover' dump in pg_dump. It was missing earlier. v43-002: 1) Slot-sync worker now checks validity of primary_slot_name by connecting to primary, once during its start and later if primary_slot_name GUC is changed. 2) Doc improvement (see logicaldecoding.sgml). More details on overall slot-sync feature is added along with Nisha's comment of documenting disabled-subscription behaviour wrt to synced slots. thanks Shveta
Attachment
Hi, On 12/6/23 11:58 AM, shveta malik wrote: > On Wed, Dec 6, 2023 at 3:00 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> >> Hi, >> >> On 12/6/23 7:18 AM, shveta malik wrote: >>> On Wed, Dec 6, 2023 at 10:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >>>> >>>> I feel that is indirectly relying on the fact that the primary won't >>>> advance logical slots unless physical standby has consumed data. >>> >>> Yes, that is the basis of this discussion. >> >> Yes. >> >>> But now on rethinking, if >>> the user has not set 'standby_slot_names' on primary at first pace, >>> then even if walreceiver on standby is down, slots on primary will >>> keep on advancing >> >> Oh right, good point. >> >>> and thus we need to sync. >> >> Yes and I think our current check "XLogRecPtrIsInvalid(WalRcv->latestWalEnd)" >> in synchronize_slots() prevents us to do so (as I think WalRcv->latestWalEnd >> would be invalid for a non started walreceiver). >> > > But I think we do not need to deal with the case that walreceiver is > not started at all on standby. It is always started. Walreceiver not > getting started or down for long is a rare scenario. We have other > checks too for 'latestWalEnd' in slotsync worker and I think we should > retain those as is. > Agree to not deal with the walreceiver being down for now (we can still improve that part later if we encounter the case in the real world). Might be worth adding comments in the code (around the WalRcv->latestWalEnd checks) that no "lagging" syncs are possible if the walreceiver is not started, though? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 12/6/23 12:23 PM, shveta malik wrote: > On Wed, Dec 6, 2023 at 4:28 PM shveta malik <shveta.malik@gmail.com> wrote: > > > PFA v43, changes are: Thanks! > > v43-001: > 1) Support of 'failover' dump in pg_dump. It was missing earlier. > > v43-002: > 1) Slot-sync worker now checks validity of primary_slot_name by > connecting to primary, once during its start and later if > primary_slot_name GUC is changed. I gave it a second thought and yeah, that sounds like the best option (as compared to relying on the walreceiver being up or down). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi. Here are my review comments for patch v43-0002. ====== Commit message 1. The nap time of worker is tuned according to the activity on the primary. The worker starts with nap time of 10ms and if no activity is observed on the primary for some time, then nap time is increased to 10sec. And if activity is observed again, nap time is reduced back to 10ms. ~ /nap time of worker/nap time of the worker/ /And if/If/ ~~~ 2. Slots synced on the standby can be identified using 'sync_state' column of pg_replication_slots view. The values are: 'n': none for user slots, 'i': sync initiated for the slot but waiting for the remote slot on the primary server to catch up. 'r': ready for periodic syncs. ~ /identified using/identified using the/ The meaning of "identified by" is unclear to me. It also seems to clash with later descriptions in system-views.sgml. Please see my later review comment about it (in the sgml file) ====== doc/src/sgml/bgworker.sgml 3. bgw_start_time is the server state during which postgres should start the process; it can be one of BgWorkerStart_PostmasterStart (start as soon as postgres itself has finished its own initialization; processes requesting this are not eligible for database connections), BgWorkerStart_ConsistentState (start as soon as a consistent state has been reached in a hot standby, allowing processes to connect to databases and run read-only queries), and BgWorkerStart_RecoveryFinished (start as soon as the system has entered normal read-write state. Note that the BgWorkerStart_ConsistentState and BgWorkerStart_RecoveryFinished are equivalent in a server that's not a hot standby), and BgWorkerStart_ConsistentState_HotStandby (same meaning as BgWorkerStart_ConsistentState but it is more strict in terms of the server i.e. start the worker only if it is hot-standby; if it is consistent state in non-standby, worker will not be started). Note that this setting only indicates when the processes are to be started; they do not stop when a different state is reached. ~ 3a. This seems to have grown to become just one enormous sentence that is too hard to read. IMO this should be changed to be a <variablelist> of possible values instead of a big slab of text. I suspect it could also be simplified quite a lot -- something like below SUGGESTION bgw_start_time is the server state during which postgres should start the process. Note that this setting only indicates when the processes are to be started; they do not stop when a different state is reached. Possible values are: - BgWorkerStart_PostmasterStart (start as soon as postgres itself has finished its own initialization; processes requesting this are not eligible for database connections) - BgWorkerStart_ConsistentState (start as soon as a consistent state has been reached in a hot-standby, allowing processes to connect to databases and run read-only queries) - BgWorkerStart_RecoveryFinished (start as soon as the system has entered normal read-write state. Note that the BgWorkerStart_ConsistentState and BgWorkerStart_RecoveryFinished are equivalent in a server that's not a hot standby) - BgWorkerStart_ConsistentState_HotStandby (same meaning as BgWorkerStart_ConsistentState but it is more strict in terms of the server i.e. start the worker only if it is hot-standby; if it is a consistent state in non-standby, the worker will not be started). ~~~ 3b. "i.e. 
start the worker only if it is hot-standby; if it is consistent state in non-standby, worker will not be started" ~ Why is it even necessary to say the 2nd part "if it is consistent state in non-standby, worker will not be started". It seems redundant given 1st part says the same, right? ====== doc/src/sgml/config.sgml 4. + <para> + The standbys corresponding to the physical replication slots in + <varname>standby_slot_names</varname> must enable + <varname>enable_syncslot</varname> for the standbys to receive + failover logical slots changes from the primary. + </para> 4a. Somehow "must enable enable_syncslot" seemed strange. Maybe re-word like: "must enable slot synchronization (see enable_syncslot)" OR "must configure enable_syncslot = true" ~~~ 4b. (seems like repetitive use of "the standbys") /for the standbys to/to/ OR /for the standbys to/so they can/ ~~~ 5. <varname>primary_conninfo</varname> string, or in a separate - <filename>~/.pgpass</filename> file on the standby server (use + <filename>~/.pgpass</filename> file on the standby server. (use This rearranged period seems unrelated to the current patch. Maybe don't touch this. ~~~ 6. + <para> + Specify <literal>dbname</literal> in + <varname>primary_conninfo</varname> string to allow synchronization + of slots from the primary server to the standby server. + This will only be used for slot synchronization. It is ignored + for streaming. </para> The wording "to allow synchronization of slots" seemed misleading to me. Isn't that more the purpose of the 'enable_syncslot' GUC? I think the intended wording is more like below: SUGGESTION If slot synchronization is enabled then it is also necessary to specify <literal>dbname</literal> in the <varname>primary_conninfo</varname> string. This will only be used for slot synchronization. It is ignored for streaming. ====== doc/src/sgml/logicaldecoding.sgml 7. + <para> + A logical replication slot on the primary can be synchronized to the hot + standby by enabling the failover option during slot creation and set + <varname>enable_syncslot</varname> on the standby. For the synchronization + to work, it is mandatory to have physical replication slot between the + primary and the standby. This physical replication slot for the standby + should be listed in <varname>standby_slot_names</varname> on the primary + to prevent the subscriber from consuming changes faster than the hot + standby. Additionally, similar to creating a logical replication slot + on the hot standby, <varname>hot_standby_feedback</varname> should be + set on the standby and a physical slot between the primary and the standby + should be used. + </para> 7a. /creation and set/creation and setting/ /to have physical replication/to have a physical replication/ ~ 7b. It's unclear why this is saying "should be listed in standby_slot_names" and "hot_standby_feedback should be set on the standby". Why is it saying "should" instead of MUST -- are these optional? I thought the GUC validation function mandates these (???). ~ 7c. Why does the paragraph say "and a physical slot between the primary and the standby should be used."; isn't that exactly what was already written earlier ("For the synchronization to work, it is mandatory to have physical replication slot between the primary and the standby" ~~~ 8. 
+ <para> + By enabling synchronization of slots, logical replication can be resumed + after failover depending upon the + <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>sync_state</structfield> + for the synchronized slots on the standby at the time of failover. + The slots which were in ready sync_state ('r') on the standby before + failover can be used for logical replication after failover. However, + the slots which were in initiated sync_state ('i) and were not + sync-ready ('r') at the time of failover will be dropped and logical + replication for such slots can not be resumed after failover. This applies + to the case where a logical subscription is disabled before failover and is + enabled after failover. If the synchronized slot due to disabled + subscription could not be made sync-ready ('r') on standby, then the + subscription can not be resumed after failover even when enabled. 8a. This feels overcomplicated -- too much information? SUGGESTION depending upon the ... sync_state for the synchronized slots on the standby at the time of failover. Only slots that were in ready sync_state ('r') on the standby before failover can be used for logical replication after failover ~~~ 8b. + the slots which were in initiated sync_state ('i) and were not + sync-ready ('r') at the time of failover will be dropped and logical + replication for such slots can not be resumed after failover. This applies + to the case where a logical subscription is disabled before failover and is + enabled after failover. If the synchronized slot due to disabled + subscription could not be made sync-ready ('r') on standby, then the + subscription can not be resumed after failover even when enabled. But isn't ALL that part pretty much redundant information for the user? I thought these are not ready state, so they are not usable... End-Of-Story. Isn't everything else just more like implementation details, which the user does not need to know about? ~~~ 9. + If the primary is idle, making the synchronized slot on the standby + as sync-ready ('r') for enabled subscription may take noticeable time. + This can be sped up by calling the + <function>pg_log_standby_snapshot</function> function on the primary. + </para> SUGGESTION If the primary is idle, then the synchronized slots on the standby may take a noticeable time to reach the ready ('r') sync_state. This can be sped up by calling the <function>pg_log_standby_snapshot</function> function on the primary. ====== doc/src/sgml/system-views.sgml 10. + + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>sync_state</structfield> <type>char</type> + </para> + <para> + Defines slot synchronization state. This is meaningful on the physical + standby which has enabled slots synchronization. + </para> I felt that this part "which has enabled slots synchronization" should cross-reference to the 'sync_enabled' GUC. ~~~ 11. + <para> + State code: + <literal>n</literal> = none for user created slots, + <literal>i</literal> = sync initiated for the slot but slot is not ready + yet for periodic syncs, + <literal>r</literal> = ready for periodic syncs. + </para> I'm wondering why don't we just reuse 'd' (disabled), 'p' (pending), 'e' (enabled) like the other tri-state attributes are using. ~~~ 12. + <para> + The hot standby can have any of these sync_state for the slots but on a + hot standby, the slots with state 'r' and 'i' can neither be used for logical + decoded nor dropped by the user. 
The primary server will have sync_state + as 'n' for all the slots. But if the standby is promoted to become the + new primary server, sync_state can be seen 'r' as well. On this new + primary server, slots with sync_state as 'r' and 'n' will behave the same. + </para></entry> + </row> 12a. /logical decoded/logical decoding/ ~ 12b. "sync_state as 'r' and 'n' will behave the same" sounds kind of hacky. Is there no alternative? Anyway, IMO mentioning about primary server states seems overkill, because you already said "This is meaningful on the physical standby" which I took as implying that it is *not* meaningful from the POV of the primary server. In light of this, I'm wondering if a better name for this attribute would be: 'standby_sync_state' ====== src/backend/access/transam/xlogrecovery.c 13. + /* + * Shutdown the slot sync workers to prevent potential conflicts between + * user processes and slotsync workers after a promotion. Additionally, + * drop any slots that have initiated but not yet completed the sync + * process. + */ + ShutDownSlotSync(); + slotsync_drop_initiated_slots(); + Is this where maybe the 'sync_state' should also be updated for everything so you are not left with confusion about different states on a node that is no longer a standby node? ====== src/backend/postmaster/postmaster.c 14. PostmasterMain ApplyLauncherRegister(); + SlotSyncWorkerRegister(); + Every other function call here is heavily commented but there is a conspicuous absence of a comment here. ~~~ 15. bgworker_should_start_now if (start_time == BgWorkerStart_ConsistentState) return true; + else if (start_time == BgWorkerStart_ConsistentState_HotStandby && + pmState != PM_RUN) + return true; /* fall through */ Change "else if" to "if" would be simpler. ====== .../libpqwalreceiver/libpqwalreceiver.c 16. + for (opt = opts; opt->keyword != NULL; ++opt) + { + /* + * If multiple dbnames are specified, then the last one will be + * returned + */ + if (strcmp(opt->keyword, "dbname") == 0 && opt->val && + opt->val[0] != '\0') + dbname = pstrdup(opt->val); + } This can use a tidier C99 style to declare 'opt' as the loop variable. ~~~ 17. static void libpqrcv_alter_slot(WalReceiverConn *conn, const char *slotname, - bool failover) + bool failover) What is this change for? Or, if something is wrong with the indent then anyway it should be fixed in patch 0001. ====== src/backend/replication/logical/logical.c 18. + /* + * Slots in state SYNCSLOT_STATE_INITIATED should have been dropped on + * promotion. + */ + if (!RecoveryInProgress() && slot->data.sync_state == SYNCSLOT_STATE_INITIATED) + elog(ERROR, "replication slot \"%s\" was not synced completely from the primary server", + NameStr(slot->data.name)); + + /* + * Do not allow consumption of a "synchronized" slot until the standby + * gets promoted. + */ + if (RecoveryInProgress() && slot->data.sync_state != SYNCSLOT_STATE_NONE) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot use replication slot \"%s\" for logical decoding", + NameStr(slot->data.name)), + errdetail("This slot is being synced from the primary server."), + errhint("Specify another replication slot."))); + 18a. Instead of having !RecoveryInProgress() and RecoveryInProgress() in separate conditions is the code simpler like: SUGGESTION if (RecoveryInProgress()) { /* Do not allow ... */ if (slot->data.sync_state != SYNCSLOT_STATE_NONE) ... } else { /* Slots in state... */ if (slot->data.sync_state == SYNCSLOT_STATE_INITIATED) ... } ~ 18b. 
Should the errdetail give the current state? ====== src/backend/replication/logical/slotsync.c 19. +/* + * Number of attempts for wait_for_primary_slot_catchup() after + * which it aborts the wait and the slot sync worker then moves + * to the next slot creation/sync. + */ +#define WORKER_PRIMARY_CATCHUP_WAIT_ATTEMPTS 5 Given this is only used within one static function, I'm wondering if it would be tidier to also move this macro to within that function. ~~~ 20. wait_for_primary_slot_catchup +/* + * Wait for remote slot to pass locally reserved position. + * + * Ping and wait for the primary server for + * WORKER_PRIMARY_CATCHUP_WAIT_ATTEMPTS during a slot creation, if it still + * does not catch up, abort the wait. The ones for which wait is aborted will + * attempt the wait and sync in the next sync-cycle. + * + * *persist will be set to false if the slot has disappeared or was invalidated + * on the primary; otherwise, it will be set to true. + */ 20a. The comment doesn't say the meaning of the boolean returned. ~ 20b. /*persist will be set/If passed, *persist will be set/ ~~~ 21. + appendStringInfo(&cmd, + "SELECT conflicting, restart_lsn, confirmed_flush_lsn," + " catalog_xmin FROM pg_catalog.pg_replication_slots" + " WHERE slot_name = %s", + quote_literal_cstr(remote_slot->name)); Somehow, I felt it is more readable if the " FROM" starts on a new line. e.g. "SELECT conflicting, restart_lsn, confirmed_flush_lsn, catalog_xmin" " FROM pg_catalog.pg_replication_slots" " WHERE slot_name = %s" ~~~ 22. + ereport(ERROR, + (errmsg("could not fetch slot info for slot \"%s\" from the" + " primary server: %s", + remote_slot->name, res->err))); Perhaps the message can be shortened like: "could not fetch slot \"%s\" info from the primary server: %s" ~~~ 23. + ereport(WARNING, + (errmsg("slot \"%s\" disappeared from the primary server," + " slot creation aborted", remote_slot->name))); Would this be better split into parts? SUGGESTION errmsg "slot \"%s\" creation aborted" errdetail "slot was not found on the primary server" ~~~ 24. + ereport(WARNING, + (errmsg("slot \"%s\" invalidated on the primary server," + " slot creation aborted", remote_slot->name))); (similar to previous) SUGGESTION errmsg "slot \"%s\" creation aborted" errdetail "slot was invalidated on the primary server" ~~~ 25. + /* + * Once we got valid restart_lsn, then confirmed_lsn and catalog_xmin + * are expected to be valid/non-null. + */ SUGGESTION Having got a valid restart_lsn, the confirmed_lsn and catalog_xmin are expected to be valid/non-null. ~~~ 26. slotsync_drop_initiated_slots +/* + * Drop the slots for which sync is initiated but not yet completed + * i.e. they are still waiting for the primary server to catch up. + */ I found "waiting for the primary server to catch up" to be difficult to understand without knowing the full details, but it is not really described properly until a much larger comment that is buried in the synchronize_one_slot(). So I think all this needs explanation up-front in the file, which you can refer to. I have repeated this same review comment in a couple of places. ~~~ 27. get_local_synced_slot_names +static List * +get_local_synced_slot_names(void) +{ + List *localSyncedSlots = NIL; 27a. It's not returning a list of "names" though, so is this an appropriate function name? ~~~ 27b. Suggest just call that ('localSyncedSlots') differently. 
- In slotsync_drop_initiated_slots() function they are just called 'slots' - In drop_obsolete_slots() function it is called 'local_slot_list' IMO it is better if all these are consistently named -- just all lists 'slots' or all 'local_slots' or whatever. ~~~ 28. check_sync_slot_validity +static bool +check_sync_slot_validity(ReplicationSlot *local_slot, List *remote_slots, + bool *locally_invalidated) Somehow this wording "validity" seems like a misleading function name, because the return value has nothing to do with the slot field invalidated. The validity/locally_invalidated stuff is a secondary return as a side effect for the "true" case. A more accurate function name would be more like check_sync_slot_on_remote(). ~~~ 29. check_sync_slot_validity +static bool +check_sync_slot_validity(ReplicationSlot *local_slot, List *remote_slots, + bool *locally_invalidated) +{ + ListCell *cell; There is inconsistent naming -- ListCell lc; ListCell cell; ListCell lc_slot; etc.. IMO the more complicated names aren't of much value -- probably everything can be changed to 'lc' for consistency. ~~~ 30. drop_obsolete_slots + /* + * Get the list of local 'synced' slot so that those not on remote could + * be dropped. + */ /slot/slots/ Also, I don't think it is necessary to say "so that those not on remote could be dropped." -- That is already described in the function comment and again in a comment later in the loop. That seems enough. If the function name get_local_synced_slot_names() is improved a bit the comment seems redundant because it is obvious from the function name. ~~~ 31. + foreach(lc_slot, local_slot_list) + { + ReplicationSlot *local_slot = (ReplicationSlot *) lfirst(lc_slot); + bool local_exists = false; + bool locally_invalidated = false; + + local_exists = check_sync_slot_validity(local_slot, remote_slot_list, + &locally_invalidated); Shouldn't that 'local_exists' variable be called 'remote_exists'? That's what the other comments seem to be saying. ~~~ 32. construct_slot_query + appendStringInfo(s, + "SELECT slot_name, plugin, confirmed_flush_lsn," + " restart_lsn, catalog_xmin, two_phase, failover," + " database, pg_get_slot_invalidation_cause(slot_name)" + " FROM pg_catalog.pg_replication_slots" + " WHERE failover and sync_state != 'i'"); Just wondering if substituting the SYNCSLOT_STATE_INITIATED constant here might be more appropriate than hardwiring 'i'. Why have a constant but not use it? ~~~ 33. synchronize_one_slot +static void +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, + bool *slot_updated) +{ + ReplicationSlot *s; + char sync_state = 0; 33a. It seems strange that the sync_state is initially assigned something other than the 3 legal values. Should this be defaulting to SYNCSLOT_STATE_NONE instead? ~ 33b. I think it is safer to default the *slot_updated = false; because the code appears to assume it was false already which may or may not be true. ~~~ 34. + /* + * Make sure that concerned WAL is received before syncing slot to target + * lsn received from the primary server. + * + * This check should never pass as on the primary server, we have waited + * for the standby's confirmation before updating the logical slot. + */ Maybe this comment should mention up-front that it is just a "Sanity check:" ~~~ 35. + /* + * With hot_standby_feedback enabled and invalidations handled + * apropriately as above, this should never happen. 
+ */ + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn) + { + ereport(ERROR, + errmsg("not synchronizing local slot \"%s\" LSN(%X/%X)" + " to remote slot's LSN(%X/%X) as synchronization " + " would move it backwards", remote_slot->name, + LSN_FORMAT_ARGS(MyReplicationSlot->data.restart_lsn), + LSN_FORMAT_ARGS(remote_slot->restart_lsn))); + + goto cleanup; + } 35a. IIUC then this another comment that should say it is just a "Sanity-check:". ~ 35b. I was wondering if there should be Assert(hot_standby_feedback) here also. The comment "With hot_standby_feedback enabled" is a bit vague whereas including an Assert will clarify that it must be set. ~ 35c. Since it says "this should never happen" then it appears elog is more appropriate than ereport because translations are not needed, right? ~ 35d. The ERROR will make that goto cleanup unreachable, won't it? ~~~ 36. + /* + * Already existing slot but not ready (i.e. waiting for the primary + * server to catch-up), lets attempt to make it sync-ready now. + */ /lets/let's/ ~~~ 37. + /* + * Refer the slot creation part (last 'else' block) for more details + * on this wait. + */ + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn || + TransactionIdPrecedes(remote_slot->catalog_xmin, + MyReplicationSlot->data.catalog_xmin)) + { + if (!wait_for_primary_slot_catchup(wrconn, remote_slot, NULL)) + { + goto cleanup; + } + } 37a. Having to jump forward to understand earlier code seems backward. IMO there should be a big comment atop this module about this subject which the comment here can just refer to. I will write more about this topic later (below). ~ 37b. The extra code curly braces are not needed. ~~~ 38. + ereport(LOG, errmsg("newly locally created slot \"%s\" is sync-ready " + "now", remote_slot->name)); Better to put the whole errmsg() on a newline instead of splitting the string like that. ~~~ 39. + /* User created slot with the same name exists, raise ERROR. */ + else if (sync_state == SYNCSLOT_STATE_NONE) + { + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("skipping sync of slot \"%s\" as it is a user created" + " slot", remote_slot->name), + errdetail("This slot has failover enabled on the primary and" + " thus is sync candidate but user created slot with" + " the same name already exists on the standby"))); + } I felt it would be better to eliminate this case immediately up-front when you first searched for the slot names. e.g. code like below. IIUC this refactor also means the default sync_state can be assigned a normal value (as I suggested above) instead of the strange assignment to 0. + /* Search for the named slot */ + if ((s = SearchNamedReplicationSlot(remote_slot->name, true))) + { + SpinLockAcquire(&s->mutex); + sync_state = s->data.sync_state; + SpinLockRelease(&s->mutex); INSERT HERE + /* User-created slot with the same name exists, raise ERROR. */ + if (sync_state == SYNCSLOT_STATE_NONE) + ereport(ERROR, ... + } ~~~ 40. + /* Otherwise create the slot first. */ + else + { Insert a blank line above that comment for better readability (same as done for earlier 'else' in this same function) ~~~ 41. + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, + remote_slot->two_phase, + remote_slot->failover, + SYNCSLOT_STATE_INITIATED); + + slot = MyReplicationSlot; In hindsight, the prior if/else code blocks in this function also could have done "slot = MyReplicationSlot;" same as this -- then the code would be much less verbose. ~~~ 42. 
+ SpinLockAcquire(&slot->mutex); + slot->data.database = get_database_oid(remote_slot->database, false); + + namestrcpy(&slot->data.plugin, remote_slot->plugin); + SpinLockRelease(&slot->mutex); IMO the code would be more readable *without* a blank line here because the mutexed block is more obvious. ~~~ 43. + /* + * If the local restart_lsn and/or local catalog_xmin is ahead of + * those on the remote then we cannot create the local slot in sync + * with the primary server because that would mean moving the local + * slot backwards and we might not have WALs retained for old LSN. In + * this case we will wait for the primary server's restart_lsn and + * catalog_xmin to catch up with the local one before attempting the + * sync. + */ 43a. This comment describes some fundamental concepts about how this logic works. I felt this and other comments like this should be at the top of this slotsync.c file. Then anything that needs to mention about it can refer to the top comment. For example, I also found other comments like "... they are still waiting for the primary server to catch up." to be difficult to understand without knowing these details, but I think describing core design stuff up-front and saying "refer to the comment atop the fil" probably would help a lot. ~ 43b. Should "wait for the primary server's restart_lsn and..." be "wait for the primary server slot's restart_lsn and..." ? ~~~ 44. + { + bool persist; + + if (!wait_for_primary_slot_catchup(wrconn, remote_slot, &persist)) + { + /* + * The remote slot didn't catch up to locally reserved + * position. + * + * We do not drop the slot because the restart_lsn can be + * ahead of the current location when recreating the slot in + * the next cycle. It may take more time to create such a + * slot. Therefore, we persist it (provided remote-slot is + * still valid) and attempt the wait and synchronization in + * the next cycle. + */ + if (persist) + { + ReplicationSlotPersist(); + *slot_updated = true; + } + + goto cleanup; + } + } Looking at the way this 'persist' parameter is used I felt is it too complicated. IIUC the wait_for_primary_slot_catchup can only return *persist = true (for a false return) when it has reached/exceeded the number of retries and still not yet caught up. Why should wait_for_primary_slot_catchup() pretend to know about persistence? In other words, I thought a more meaningful parameter/variable name (instead of 'persist') is something like 'wait_attempts_exceeded'. IMO that will make wait_for_primary_slot_catchup() code easier, and here you can just say like below, where the code matches the comment better. Thoughts? + if (wait_attempts_exceeded) + { + ReplicationSlotPersist(); + *slot_updated = true; + } ~~~ 45. + + + /* + * Wait for primary is either not needed or is over. Update the lsns + * and mark the slot as READY for further syncs. + */ Double blank lines? ~~~ 46. + ereport(LOG, errmsg("newly locally created slot \"%s\" is sync-ready " + "now", remote_slot->name)); + } + +cleanup: Better to put the whole errmsg() on a newline instead of splitting the string like that. ~~~ 47. synchronize_slots +/* + * Synchronize slots. + * + * Gets the failover logical slots info from the primary server and update + * the slots locally. Creates the slots if not present on the standby. + * + * Returns nap time for the next sync-cycle. + */ +static long +synchronize_slots(WalReceiverConn *wrconn) /update/updates/ ~~~ 48. 
+ /* The primary_slot_name is not set yet or WALs not received yet */ + SpinLockAcquire(&WalRcv->mutex); + if (!WalRcv || + (WalRcv->slotname[0] == '\0') || + XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) + { + SpinLockRelease(&WalRcv->mutex); + return naptime; + } + SpinLockRelease(&WalRcv->mutex); Just wondering if the scenario of "WALS not received" is a bit more like "no activity" so perhaps the naptime returned should be WORKER_INACTIVITY_NAPTIME_MS here? ~~~ 49. + /* Construct query to get slots info from the primary server */ + initStringInfo(&s); + construct_slot_query(&s); I did not like the construct_slot_query() to be separated from this function because it makes it too difficult to see if the slot_attr numbers and column types in this function are correct w.r.t. that query. IMO better when everything is in the same place where you can see it all together. e.g. Less risk of breaking something if changes are made. ~~~ 50. + /* Construct the remote_slot tuple and synchronize each slot locally */ + slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); Normally in all the other functions the variable 'slot' was the local ReplicationSlot but IIUC here represents a remote tuple. Making a different name would be better like 'remote_slottup' or something else. ~~~ 51. + /* + * If any of the slots get updated in this sync-cycle, retain default + * naptime and update 'last_update_time' in slot sync worker. But if no + * activity is observed in this sync-cycle, then increase naptime provided + * inactivity time reaches threshold. + */ I think "retain" is a slightly wrong word here because it might have been WORKER_INACTIVITY_NAPTIME_MS in the previous cycle. Maybe just /retain/use/ ~~~ 52. +/* + * Connects primary to validate the slot specified in primary_slot_name. + * + * Exits the worker if physical slot with the specified name does not exist. + */ +static void +validate_primary_slot(WalReceiverConn *wrconn) There is already a connection, so not sure if this connect should be saying "connects to"; Maybe is should be saying more like below: SUGGESTION Using the specified primary server connection, validate if the physical slot identified by GUC primary_slot_name exists. Exit the worker if the slot is not found. ~~~ 53. + initStringInfo(&cmd); + appendStringInfo(&cmd, + "select count(*) = 1 from pg_replication_slots where " + "slot_type='physical' and slot_name=%s", + quote_literal_cstr(PrimarySlotName)); Write the SQL keywords in uppercase. ~~~ 54. + if (res->status != WALRCV_OK_TUPLES) + ereport(ERROR, + (errmsg("could not fetch primary_slot_name info from the " + "primary: %s", res->err))); Shouldn't the name of the unfound slot be shown in the ereport, or will that already appear in the res->err? ~~~ 55. + ereport(ERROR, + errmsg("exiting slots synchronization as slot specified in " + "primary_slot_name is not valid")); + IMO the format should be the same as I suggested (later) for all the validate_slotsync_parameters() errors. Also, I think the name of the unfound slot needs to be in this message. So maybe result is like this: SUGGESTION ereport(ERROR, errmsg("exiting from slot synchronization due to bad configuration") /* translator: second %s is a GUC variable name */ errhint("The primary slot \"%s\" specified by %s is not valid.", slot_name, "primary_slot_name") ); ~~~ 56. 
+/* + * Checks if GUCs are set appropriately before starting slot sync worker + */ +static void +validate_slotsync_parameters(char **dbname) +{ + /* + * Since 'enable_syncslot' is ON, check that other GUC settings + * (primary_slot_name, hot_standby_feedback, wal_level, primary_conninfo) + * are compatible with slot synchronization. If not, raise ERROR. + */ + 56a. I thought that 2nd comment sort of belonged in the function comment. ~ 56b. It says "Since 'enable_syncslot' is ON", but I IIUC that is wrong because the other function slotsync_reread_config() might detect a change in this GUC and cause this validate_slotsync_parameters() to be called when enable_syncslot was changed to false. In other words, I think you also need to check 'enable_syncslot' and exit with appropriate ERROR same as all the other config problems. OTOH if this is not possible, then the slotsync_reread_config() might need fixing instead. ~~~ 57. + /* + * A physical replication slot(primary_slot_name) is required on the + * primary to ensure that the rows needed by the standby are not removed + * after restarting, so that the synchronized slot on the standby will not + * be invalidated. + */ + if (PrimarySlotName == NULL || strcmp(PrimarySlotName, "") == 0) + ereport(ERROR, + errmsg("exiting slots synchronization as primary_slot_name is " + "not set")); + + /* + * Hot_standby_feedback must be enabled to cooperate with the physical + * replication slot, which allows informing the primary about the xmin and + * catalog_xmin values on the standby. + */ + if (!hot_standby_feedback) + ereport(ERROR, + errmsg("exiting slots synchronization as hot_standby_feedback " + "is off")); + + /* + * Logical decoding requires wal_level >= logical and we currently only + * synchronize logical slots. + */ + if (wal_level < WAL_LEVEL_LOGICAL) + ereport(ERROR, + errmsg("exiting slots synchronisation as it requires " + "wal_level >= logical")); + + /* + * The primary_conninfo is required to make connection to primary for + * getting slots information. + */ + if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0) + ereport(ERROR, + errmsg("exiting slots synchronization as primary_conninfo " + "is not set")); + + /* + * The slot sync worker needs a database connection for walrcv_exec to + * work. 
+ */ + *dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); + if (*dbname == NULL) + ereport(ERROR, + errmsg("exiting slots synchronization as dbname is not " + "specified in primary_conninfo")); + +} IMO all these errors can be improved by: - using a common format - including errhint for the reason - using the same tone for instructions on what to do (e.g saying must be set, rather than what was not set) SUGGESTION (something like this) ereport(ERROR, errmsg("exiting from slot synchronization due to bad configuration") /* translator: %s is a GUC variable name */ errhint("%s must be defined.", "primary_slot_name") ); ereport(ERROR, errmsg("exiting from slot synchronization due to bad configuration") /* translator: %s is a GUC variable name */ errhint("%s must be enabled.", "hot_standby_feedback") ); ereport(ERROR, errmsg("exiting from slot synchronization due to bad configuration") /* translator: wal_level is a GUC variable name, 'logical' is a value */ errhint("wal_level must be >= logical.") ); ereport(ERROR, errmsg("exiting from slot synchronization due to bad configuration") /* translator: %s is a GUC variable name */ errhint("%s must be defined.", "primary_conninfo") ); ereport(ERROR, errmsg("exiting from slot synchronization due to bad configuration") /* translator: 'dbname' is a specific option; %s is a GUC variable name */ errhint("'dbname' must be specified in %s.", "primary_conninfo") ); ~~~ 58. + *dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); + if (*dbname == NULL) + ereport(ERROR, + errmsg("exiting slots synchronization as dbname is not specified in primary_conninfo")); + +} Unnecessary blank line at the end of the function ~~~ 59. +/* + * Re-read the config file. + * + * If any of the slot sync GUCs changed, validate the values again + * through validate_slotsync_parameters() which will exit the worker + * if validaity fails. + */ SUGGESTION If any of the slot sync GUCs have changed, re-validate them. The worker will exit if the check fails. ~~~ 60. + char *conninfo = pstrdup(PrimaryConnInfo); + char *slotname = pstrdup(PrimarySlotName); + bool syncslot = enable_syncslot; + bool standbyfeedback = hot_standby_feedback; For clarity, I would have used var names to match the old GUCs. e.g. /conninfo/old_primary_conninfo/ /slotname/old_primary_slot_name/ /syncslot/old_enable_syncslot/ /standbyfeedback/old_hot_standby_feedback/ ~~~ 61. + dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); + Assert(dbname); This code seems premature. IIUC this is only needed to detect that the dbname was changed. But I think the prerequisite is first that the conninfoChanged is true. So really this code should be guarded by if (conninfoChanged) so it can be done later in the function. ~~~ 62. + if (conninfoChanged || slotnameChanged || + (syncslot != enable_syncslot) || + (standbyfeedback != hot_standby_feedback)) + { + revalidate = true; + } SUGGESTION revalidate = conninfoChanged || slotnameChanged || (syncslot != enable_syncslot) || (standbyfeedback != hot_standby_feedback); ~~~ 63. + /* + * Since we have initialized this worker with old dbname, thus exit if + * dbname changed. Let it get restarted and connect to new dbname + * specified. + */ + if (conninfoChanged && strcmp(dbname, new_dbname) != 0) + { + ereport(ERROR, + errmsg("exiting slot sync woker as dbname in " + "primary_conninfo changed")); + } 63a. /old dbname/the old dbname/ /new dbname/the new dbname/ /woker/worker/ ~ 63b. This code feels awkward. 
Can't this dbname check and accompanying ERROR message be moved down into validate_slotsync_parameters(), so it lives along with all the other GUC validation logic? Maybe you'll need to change the validate_slotsync_parameters() parameters slightly but I think it is much better to keep all the validation together. ~~~ 64. + + +/* + * Interrupt handler for main loop of slot sync worker. + */ +static void +ProcessSlotSyncInterrupts(WalReceiverConn **wrconn) Double blank lines. ~~~ 65. + + + if (ConfigReloadPending) + slotsync_reread_config(); +} Double blank lines ~~~ 66. slotsync_worker_onexit +static void +slotsync_worker_onexit(int code, Datum arg) +{ + SpinLockAcquire(&SlotSyncWorker->mutex); + SlotSyncWorker->pid = 0; + SpinLockRelease(&SlotSyncWorker->mutex); +} Should assignment use InvalidPid (-1) instead of 0? ~~~ 67. ReplSlotSyncWorkerMain + SpinLockAcquire(&SlotSyncWorker->mutex); + + Assert(SlotSyncWorker->pid == 0); + + /* Advertise our PID so that the startup process can kill us on promotion */ + SlotSyncWorker->pid = MyProcPid; + + SpinLockRelease(&SlotSyncWorker->mutex); Shouldn't pid start as InvalidPid (-1) instead of Assert 0? ~~~ 68. + /* Connect to the primary server */ + wrconn = remote_connect(); + + /* + * Connect to primary and validate the slot specified in + * primary_slot_name. + */ + validate_primary_slot(wrconn); Maybe needs some slight rewording in the 2nd comment. "Connect to primary server" is already said and done in the 1st part. ~~~ 69. IsSlotSyncWorker +/* + * Is current process the slot sync worker? + */ +bool +IsSlotSyncWorker(void) +{ + return SlotSyncWorker->pid == MyProcPid; +} 69a. For consistency with others like it, I thought this be called IsLogicalSlotSyncWorker(). ~ 69b. For consistency with the others like this, I think the extern should be declared in logicalworker.h ~~~ 70. ShutDownSlotSync + SpinLockAcquire(&SlotSyncWorker->mutex); + if (!SlotSyncWorker->pid) + { + SpinLockRelease(&SlotSyncWorker->mutex); + return; + } IMO should be comparing with InvalidPid (-1) here; not 0. ~~~ 71. + SpinLockAcquire(&SlotSyncWorker->mutex); + + /* Is it gone? */ + if (!SlotSyncWorker->pid) + break; + + SpinLockRelease(&SlotSyncWorker->mutex); Ditto. bad pids should be InvalidPid (-1), not 0. ~~~ 72. SlotSyncWorkerShmemInit + if (!found) + { + memset(SlotSyncWorker, 0, size); + SpinLockInit(&SlotSyncWorker->mutex); + } Probably here the unassigned pid should be set to InvalidPid (-1), not 0. ~~~ 73. SlotSyncWorkerRegister + if (!enable_syncslot) + { + ereport(LOG, + errmsg("skipping slots synchronization as enable_syncslot is " + "disabled.")); + return; + } /as/because/ ====== src/backend/replication/logical/tablesync.c 74. #include "commands/copy.h" +#include "commands/subscriptioncmds.h" #include "miscadmin.h" There were only #include changes but no code changes. Is the #include needed? ====== src/backend/replication/slot.c 75. ReplicationSlotCreate void ReplicationSlotCreate(const char *name, bool db_specific, ReplicationSlotPersistency persistency, - bool two_phase, bool failover) + bool two_phase, bool failover, char sync_state) The function comment goes to trouble to describe all the parameters except for 'failover' and 'sync_slate'. I think a failover comment should be added in patch 0001 and then the sync_state comment should be added in patch 0002. ~~~ 76. + /* + * Do not allow users to drop the slots which are currently being synced + * from the primary to the standby. 
+ */ + if (user_cmd && RecoveryInProgress() && + MyReplicationSlot->data.sync_state != SYNCSLOT_STATE_NONE) + { + ReplicationSlotRelease(); + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot drop replication slot \"%s\"", name), + errdetail("This slot is being synced from the primary."))); + } Should the errdetail give the current state? ====== src/backend/tcop/postgres.c 77. + else if (IsSlotSyncWorker()) + { + ereport(DEBUG1, + (errmsg_internal("replication slot sync worker is shutting down due to administrator command"))); + + /* + * Slot sync worker can be stopped at any time. + * Use exit status 1 so the background worker is restarted. + */ + proc_exit(1); + } Explicitly saying "ereport(DEBUG1, errmsg_internal(..." is a bit overkill; it is simpler to write this as "elog(DEBUG1, ....); ====== src/include/replication/slot.h 78. +/* The possible values for 'sync_state' in ReplicationSlotPersistentData */ +#define SYNCSLOT_STATE_NONE 'n' /* None for user created slots */ +#define SYNCSLOT_STATE_INITIATED 'i' /* Sync initiated for the slot but + * not completed yet, waiting for + * the primary server to catch-up */ +#define SYNCSLOT_STATE_READY 'r' /* Initialization complete, ready + * to be synced further */ Already questioned the same elsewhere. IIUC the same tri-state values of other attributes might be used here too without needing to introduce 3 new values. e.g. #define SYNCSLOT_STATE_DISABLED 'd' /* No syncing for this slot */ #define SYNCSLOT_STATE_PENDING 'p' /* Sync is enabled but we must wait for the primary server to catch up */ #define SYNCSLOT_STATE_ENABLED 'e' /* Sync is enabled and the slot is ready to be synced */ ~~~ 79. + /* + * Is this a slot created by a sync-slot worker? + * + * Relevant for logical slots on the physical standby. + */ + char sync_state; + I assumed that "Relevant for" means "Only relevant for". It should say that. If correct, IMO a better field name might be 'standby_sync_state' ====== src/test/recovery/t/050_verify_slot_order.pl 80. +$backup_name = 'backup2'; +$primary->backup($backup_name); + +# Create standby3 +my $standby3 = PostgreSQL::Test::Cluster->new('standby3'); +$standby3->init_from_backup( + $primary, $backup_name, + has_streaming => 1, + has_restoring => 1); The mixture of 'backup2' for 'standby3' seems confusing. Is there a reason to call it backup2? ~~~ 81. +# Verify slot properties on the standby +is( $standby3->safe_psql('postgres', + q{SELECT failover, sync_state FROM pg_replication_slots WHERE slot_name = 'lsub1_slot';} + ), + "t|r", + 'logical slot has sync_state as ready and failover as true on standby'); It might be better if the message has the same order as the SQL. Eg. "failover as true and sync_state as ready". ~~~ 82. +# Verify slot properties on the primary +is( $primary->safe_psql('postgres', + q{SELECT failover, sync_state FROM pg_replication_slots WHERE slot_name = 'lsub1_slot';} + ), + "t|n", + 'logical slot has sync_state as none and failover as true on primary'); + It might be better if the message has the same order as the SQL. Eg. "failover as true and sync_state as none". ~~~ 83. +# Test to confirm that restart_lsn of the logical slot on the primary is synced to the standby IMO the major test parts (like this one) may need more highlighting "# ---------------------" so those comments don't get lost among all the other comments. ~~~ 84. +# let the slots get synced on the standby +sleep 2; Won't this make the test prone to failure on slow machines? 
Is there not a more deterministic way to wait for the sync, e.g. polling the slot's sync_state in pg_replication_slots with poll_query_until() instead of a fixed sleep? ====== Kind Regards, Peter Smith. Fujitsu Australia
On Thu, Dec 7, 2023 at 1:19 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 12/6/23 11:58 AM, shveta malik wrote: > > On Wed, Dec 6, 2023 at 3:00 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> > >> Hi, > >> > >> On 12/6/23 7:18 AM, shveta malik wrote: > >>> On Wed, Dec 6, 2023 at 10:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >>>> > >>>> I feel that is indirectly relying on the fact that the primary won't > >>>> advance logical slots unless physical standby has consumed data. > >>> > >>> Yes, that is the basis of this discussion. > >> > >> Yes. > >> > >>> But now on rethinking, if > >>> the user has not set 'standby_slot_names' on primary at first pace, > >>> then even if walreceiver on standby is down, slots on primary will > >>> keep on advancing > >> > >> Oh right, good point. > >> > >>> and thus we need to sync. > >> > >> Yes and I think our current check "XLogRecPtrIsInvalid(WalRcv->latestWalEnd)" > >> in synchronize_slots() prevents us to do so (as I think WalRcv->latestWalEnd > >> would be invalid for a non started walreceiver). > >> > > > > But I think we do not need to deal with the case that walreceiver is > > not started at all on standby. It is always started. Walreceiver not > > getting started or down for long is a rare scenario. We have other > > checks too for 'latestWalEnd' in slotsync worker and I think we should > > retain those as is. > > > > Agree to not deal with the walreceiver being down for now (we can > still improve that part later if we encounter the case in the real > world). > yes, agreed. > Might be worth to add comments in the code (around the WalRcv->latestWalEnd > checks) that no "lagging" sync are possible if the walreceiver is not started > though? > I am a bit confused. Do you mean as a TODO item? Otherwise the comment will be opposite of the code we are writing. > > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
Hi, On 12/7/23 10:07 AM, shveta malik wrote: > On Thu, Dec 7, 2023 at 1:19 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: >> Might be worth to add comments in the code (around the WalRcv->latestWalEnd >> checks) that no "lagging" sync are possible if the walreceiver is not started >> though? >> > > I am a bit confused. Do you mean as a TODO item? Otherwise the comment > will be opposite of the code we are writing. Sorry for the confusion: what I meant to say is that synchronization (should it be lagging) is not possible if the walreceiver is not started (as XLogRecPtrIsInvalid(WalRcv->latestWalEnd) would be true). More precisely here (in synchronize_slots()): /* The primary_slot_name is not set yet or WALs not received yet */ SpinLockAcquire(&WalRcv->mutex); if (!WalRcv || (WalRcv->slotname[0] == '\0') || XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) { SpinLockRelease(&WalRcv->mutex); return naptime; } Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
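PS - concretely, I was thinking of something along those lines; the code is the existing check, only the comment is new and the wording is just a suggestion:

	/*
	 * The primary_slot_name is not set yet or WALs not received yet.
	 * Note that this also means that no synchronization can happen while
	 * the walreceiver is not running, even if the locally synced slots
	 * are lagging behind the primary.
	 */
	SpinLockAcquire(&WalRcv->mutex);
	if (!WalRcv ||
		(WalRcv->slotname[0] == '\0') ||
		XLogRecPtrIsInvalid(WalRcv->latestWalEnd))
	{
		SpinLockRelease(&WalRcv->mutex);
		return naptime;
	}
	SpinLockRelease(&WalRcv->mutex);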
On Wed, Dec 6, 2023 at 4:53 PM shveta malik <shveta.malik@gmail.com> wrote: > > v43-001: > 1) Support of 'failover' dump in pg_dump. It was missing earlier. > Review v43-0001 ================ 1. + * However, we do not enable failover for slots created by the table sync + * worker. This is because the table sync slot might not be fully synced on the + * standby. The reason for not enabling failover for table sync slots is not clearly mentioned. 2. During syncing, the local restart_lsn and/or local catalog_xmin of + * the newly created slot on the standby are typically ahead of those on the + * primary. Therefore, the standby needs to wait for the primary server's + * restart_lsn and catalog_xmin to catch up, which takes time. I think this part of the comment should be moved to 0002 patch. We can probably describe a bit more about why slot on standby will be ahead and about waiting time. 3. validate_standby_slots() { ... + slot = SearchNamedReplicationSlot(name, true); + + if (!slot) + goto ret_standby_slot_names_ng; + + if (!SlotIsPhysical(slot)) + { + GUC_check_errdetail("\"%s\" is not a physical replication slot", + name); + goto ret_standby_slot_names_ng; + } Why the first check (slot not found) doesn't have errdetail? The goto's in this function look a bit odd, can we try to avoid those? 4. + /* Verify syntax and parse string into list of identifiers */ + if (!SplitIdentifierString(rawname, ',', &elemlist)) + { + /* syntax error in name list */ + GUC_check_errdetail("List syntax is invalid."); ... ... + if (!SplitIdentifierString(standby_slot_names_cpy, ',', &standby_slots)) + { + /* This should not happen if GUC checked check_standby_slot_names. */ + elog(ERROR, "invalid list syntax"); Both are checking the same string but giving different error messages. I think the error message should be the same in both cases. The first one seems better. 5. In WalSndFilterStandbySlots(), the comments around else if checks should move inside the checks. It is hard to read the code in the current format. I have tried to change the same in the attached. Apart from the above, I have changed the comments and made some minor cosmetic changes in the attached. Kindly include in next version if you are fine with it. -- With Regards, Amit Kapila.
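PS - For point 3, a rough (untested) sketch of one way to avoid the gotos and also give the slot-not-found case its own errdetail. The enclosing foreach and the 'ok', 'rawname' and 'elemlist' variables are assumed to be the ones already present in the check_hook, and the message wording is only a placeholder:

	foreach(lc, elemlist)
	{
		char	   *name = lfirst(lc);
		ReplicationSlot *slot = SearchNamedReplicationSlot(name, true);

		if (!slot)
		{
			GUC_check_errdetail("replication slot \"%s\" does not exist",
								name);
			ok = false;
		}
		else if (!SlotIsPhysical(slot))
		{
			GUC_check_errdetail("\"%s\" is not a physical replication slot",
								name);
			ok = false;
		}

		/* Stop at the first bad entry. */
		if (!ok)
			break;
	}

	pfree(rawname);
	list_free(elemlist);
	return ok;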
Attachment
Hi. Here is another review comment for the patch v44-0001. ====== src/bin/pg_dump/pg_dump.c 1. getSubscriptions + if (fout->remoteVersion >= 170000) + appendPQExpBufferStr(query, + " subfailoverstate\n"); + else + appendPQExpBuffer(query, + " '%c' AS subfailoverstate\n", + LOGICALREP_FAILOVER_STATE_DISABLED); + That first appended string should include the table alias, the same as all the nearby code does. e.g. " subfailoverstate\n" should be " s.subfailoverstate\n" ====== Kind Regards, Peter Smith. Fujitsu Australia
On Thu, Dec 7, 2023 at 2:57 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 12/7/23 10:07 AM, shveta malik wrote: > > On Thu, Dec 7, 2023 at 1:19 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > >> Might be worth to add comments in the code (around the WalRcv->latestWalEnd > >> checks) that no "lagging" sync are possible if the walreceiver is not started > >> though? > >> > > > > I am a bit confused. Do you mean as a TODO item? Otherwise the comment > > will be opposite of the code we are writing. > > Sorry for the confusion: what I meant to say is that > synchronization (should it be lagging) is not possible if the walreceiver is not started > (as XLogRecPtrIsInvalid(WalRcv->latestWalEnd) would be true). > Sure, I will add it. Thanks for the clarification. > More precisely here (in synchronize_slots()): > > /* The primary_slot_name is not set yet or WALs not received yet */ > SpinLockAcquire(&WalRcv->mutex); > if (!WalRcv || > (WalRcv->slotname[0] == '\0') || > XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) > { > SpinLockRelease(&WalRcv->mutex); > return naptime; > } > > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
On Wed, Dec 6, 2023 at 4:53 PM shveta malik <shveta.malik@gmail.com> wrote: > > PFA v43, changes are: > I wanted to discuss 0003 patch about cascading standby's. It is not clear to me whether we want to allow physical standbys to further wait for cascading standby to sync their slots. If we allow such a feature one may expect even primary to wait for all the cascading standby's because otherwise still logical subscriber can be ahead of one of the cascading standby. I feel even if we want to allow such a behaviour we can do it later once the main feature is committed. I think it would be good to just allow logical walsenders on primary to wait for physical standbys represented by GUC 'standby_slot_names'. If we agree on that then it would be good to prohibit setting this GUC on standby or at least it should be a no-op even if this GUC should be set on physical standby. Thoughts? -- With Regards, Amit Kapila.
On Thursday, December 7, 2023 7:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 6, 2023 at 4:53 PM shveta malik <shveta.malik@gmail.com> > wrote: > > > > v43-001: > > 1) Support of 'failover' dump in pg_dump. It was missing earlier. > > > > Review v43-0001 > ================ > 1. > + * However, we do not enable failover for slots created by the table > + sync > + * worker. This is because the table sync slot might not be fully > + synced on the > + * standby. > > The reason for not enabling failover for table sync slots is not clearly > mentioned. > > 2. > During syncing, the local restart_lsn and/or local catalog_xmin of > + * the newly created slot on the standby are typically ahead of those > + on the > + * primary. Therefore, the standby needs to wait for the primary > + server's > + * restart_lsn and catalog_xmin to catch up, which takes time. > > I think this part of the comment should be moved to 0002 patch. We can > probably describe a bit more about why slot on standby will be ahead and > about waiting time. > > 3. > validate_standby_slots() > { > ... > + slot = SearchNamedReplicationSlot(name, true); > + > + if (!slot) > + goto ret_standby_slot_names_ng; > + > + if (!SlotIsPhysical(slot)) > + { > + GUC_check_errdetail("\"%s\" is not a physical replication slot", > + name); goto ret_standby_slot_names_ng; } > > Why the first check (slot not found) doesn't have errdetail? The goto's in this > function look a bit odd, can we try to avoid those? > > 4. > + /* Verify syntax and parse string into list of identifiers */ if > + (!SplitIdentifierString(rawname, ',', &elemlist)) { > + /* syntax error in name list */ > + GUC_check_errdetail("List syntax is invalid."); > ... > ... > + if (!SplitIdentifierString(standby_slot_names_cpy, ',', > + &standby_slots)) { > + /* This should not happen if GUC checked check_standby_slot_names. */ > + elog(ERROR, "invalid list syntax"); > > Both are checking the same string but giving different error messages. > I think the error message should be the same in both cases. The first one seems > better. > > 5. In WalSndFilterStandbySlots(), the comments around else if checks should > move inside the checks. It is hard to read the code in the current format. I have > tried to change the same in the attached. > > Apart from the above, I have changed the comments and made some minor > cosmetic changes in the attached. Kindly include in next version if you are fine > with it. Thanks for the comments and changes, I have addressed them. Here is the V44 patch set which addressed comments above and [1]. The new version patches also include the follow changes: V44-0001 * Let the pg_replication_slot_advance also wait for the slots specified in standby_slot_names to catch up. * added few test cases to cover the wait/wakeup logic in walsender related to standby_slot_names. * ran pgindent. V44-0002 * added few comments to explain the case when the slot is valid on primary while is invalidated on standby. Thanks Ajin for analyzing and making the tests. The pending comments on 0002 will be addressed in next version. [1] https://www.postgresql.org/message-id/CAHut%2BPvRD5V-zzTvffDdcnqB1T4JNATKGgw%2BwdQCKAgeCYr0xQ%40mail.gmail.com Best Regards, Hou zj
Attachment
On Wed, Dec 6, 2023 at 4:53 PM shveta malik <shveta.malik@gmail.com> wrote: > > v43-002: > Review comments on v43-0002: ========================= 1. synchronize_one_slot() { ... + /* + * With hot_standby_feedback enabled and invalidations handled + * apropriately as above, this should never happen. + */ + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn) + { + ereport(ERROR, + errmsg("not synchronizing local slot \"%s\" LSN(%X/%X)" + " to remote slot's LSN(%X/%X) as synchronization " + " would move it backwards", remote_slot->name, + LSN_FORMAT_ARGS(MyReplicationSlot->data.restart_lsn), + LSN_FORMAT_ARGS(remote_slot->restart_lsn))); + + goto cleanup; ... } After the error, the control won't return, so the above goto doesn't make any sense. 2. synchronize_one_slot() { ... + /* Search for the named slot */ + if ((s = SearchNamedReplicationSlot(remote_slot->name, true))) + { + SpinLockAcquire(&s->mutex); + sync_state = s->data.sync_state; + SpinLockRelease(&s->mutex); + } ... ... + ReplicationSlotAcquire(remote_slot->name, true); + + /* + * Copy the invalidation cause from remote only if local slot is not + * invalidated locally, we don't want to overwrite existing one. + */ + if (MyReplicationSlot->data.invalidated == RS_INVAL_NONE) + { + SpinLockAcquire(&MyReplicationSlot->mutex); + MyReplicationSlot->data.invalidated = remote_slot->invalidated; + SpinLockRelease(&MyReplicationSlot->mutex); + } + + /* Skip the sync if slot has been invalidated locally. */ + if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE) + goto cleanup; ... It seems useless to acquire the slot if it is locally invalidated in the first place. Won't it be better if after the search we first check whether the slot is locally invalidated and take appropriate action? 3. After doing the above two, I think it doesn't make sense to have goto at the remaining places in synchronize_one_slot(). We can simply release the slot and commit the transaction at other places. 4. + * Returns nap time for the next sync-cycle. + */ +static long +synchronize_slots(WalReceiverConn *wrconn) Returning nap time from here appears a bit awkward. I think it is better if this function returns any_slot_updated and then the caller decides the adjustment of naptime. 5. +synchronize_slots(WalReceiverConn *wrconn) { ... ... + /* The syscache access needs a transaction env. */ + StartTransactionCommand(); + + /* + * Make result tuples live outside TopTransactionContext to make them + * accessible even after transaction is committed. + */ + MemoryContextSwitchTo(oldctx); + + /* Construct query to get slots info from the primary server */ + initStringInfo(&s); + construct_slot_query(&s); + + elog(DEBUG2, "slot sync worker's query:%s \n", s.data); + + /* Execute the query */ + res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow); + pfree(s.data); + + if (res->status != WALRCV_OK_TUPLES) + ereport(ERROR, + (errmsg("could not fetch failover logical slots info " + "from the primary server: %s", res->err))); + + CommitTransactionCommand(); ... ... } Where exactly in the above code, there is a syscache access as mentioned above StartTransactionCommand()? 6. - <filename>~/.pgpass</filename> file on the standby server (use + <filename>~/.pgpass</filename> file on the standby server. (use <literal>replication</literal> as the database name). Why do we need this change? 7. + standby. 
Additionally, similar to creating a logical replication slot + on the hot standby, <varname>hot_standby_feedback</varname> should be + set on the standby and a physical slot between the primary and the standby + should be used. In this, I don't understand the relation between the first part of the line: "Additionally, similar to creating a logical replication slot on the hot standby ..." with the rest. 8. However, + the slots which were in initiated sync_state ('i) and were not A single quote after 'i' is missing. 9. the slots with state 'r' and 'i' can neither be used for logical + decoded nor dropped by the user. /decoded/decoding 10. +/* + * Allocate and initialize slow sync worker shared memory + */ /slow/slot -- With Regards, Amit Kapila.
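PS - For point 4, roughly what I had in mind on the caller side, just as an (untested) sketch; 'some_slot_updated' and the naptime/threshold constants are placeholder names:

	some_slot_updated = synchronize_slots(wrconn);

	if (some_slot_updated)
	{
		/* Some slot was updated, stick to the default naptime. */
		naptime = WORKER_DEFAULT_NAPTIME_MS;
		last_update_time = GetCurrentTimestamp();
	}
	else if (TimestampDifferenceExceeds(last_update_time,
										GetCurrentTimestamp(),
										WORKER_INACTIVITY_THRESHOLD_MS))
	{
		/* No activity for a while, so back off. */
		naptime = WORKER_INACTIVITY_NAPTIME_MS;
	}

with synchronize_slots() itself just returning a bool that says whether any slot was updated in this sync-cycle.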
FYI -- the patch 0002 did not apply cleanly for me on top of the 050 test file created by patch 0001. [postgres@CentOS7-x64 oss_postgres_misc]$ git apply ../patches_misc/v44-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch [postgres@CentOS7-x64 oss_postgres_misc]$ git apply ../patches_misc/v44-0002-Add-logical-slot-sync-capability-to-the-physical.patch error: patch failed: src/test/recovery/t/050_standby_failover_slots_sync.pl:289 error: src/test/recovery/t/050_standby_failover_slots_sync.pl: patch does not apply ====== Kind Regards, Peter Smith. Fujitsu Australia
On Monday, December 11, 2023 8:17 AM Peter Smith <smithpb2250@gmail.com> wrote: > > FYI -- the patch 0002 did not apply cleanly for me on top of the 050 test file > created by patch 0001. > > [postgres@CentOS7-x64 oss_postgres_misc]$ git > apply ../patches_misc/v44-0001-Allow-logical-walsenders-to-wait-for-the-ph > ysica.patch > > [postgres@CentOS7-x64 oss_postgres_misc]$ git > apply ../patches_misc/v44-0002-Add-logical-slot-sync-capability-to-the-phy > sical.patch > error: patch failed: src/test/recovery/t/050_standby_failover_slots_sync.pl:289 > error: src/test/recovery/t/050_standby_failover_slots_sync.pl: patch does not > apply Thanks for reporting. Here is the rebased patch set V44_2. (There are no code changes in this version.) Best Regards, Hou zj
Attachment
Here are some review comments for v44-0001 ====== src/backend/replication/slot.c 1. ReplicationSlotCreate * during getting changes, if the two_phase option is enabled it can skip * prepare because by that time start decoding point has been moved. So the * user will only get commit prepared. + * failover: Allows the slot to be synced to physical standbys so that logical + * replication can be resumed after failover. */ void ReplicationSlotCreate(const char *name, bool db_specific, ~ /Allows the slot.../If enabled, allows the slot.../ ====== 2. validate_standby_slots +validate_standby_slots(char **newval) +{ + char *rawname; + List *elemlist; + ListCell *lc; + bool ok = true; + + /* Need a modifiable copy of string */ + rawname = pstrdup(*newval); + + /* Verify syntax and parse string into list of identifiers */ + if (!(ok = SplitIdentifierString(rawname, ',', &elemlist))) + GUC_check_errdetail("List syntax is invalid."); + + /* + * If there is a syntax error in the name or if the replication slots' + * data is not initialized yet (i.e., we are in the startup process), skip + * the slot verification. + */ + if (!ok || !ReplicationSlotCtl) + { + pfree(rawname); + list_free(elemlist); + return ok; + } 2a. You don't need to initialize 'ok' during declaration because it is assigned immediately anyway. ~ 2b. AFAIK assignment within a conditional like this is not a normal PG coding style unless there is no other way to do it. ~ 2c. /into list/into a list/ SUGGESTION /* Verify syntax and parse string into a list of identifiers */ ok = SplitIdentifierString(rawname, ',', &elemlist); if (!ok) GUC_check_errdetail("List syntax is invalid."); ~~~ 3. assign_standby_slot_names + if (!SplitIdentifierString(standby_slot_names_cpy, ',', &standby_slots)) + { + /* This should not happen if GUC checked check_standby_slot_names. */ + elog(ERROR, "list syntax is invalid"); + } This error here and in validate_standby_slots() are different -- "list" versus "List". ====== src/backend/replication/walsender.c 4. WalSndFilterStandbySlots + foreach(lc, standby_slots_cpy) + { + char *name = lfirst(lc); + XLogRecPtr restart_lsn = InvalidXLogRecPtr; + bool invalidated = false; + char *warningfmt = NULL; + ReplicationSlot *slot; + + slot = SearchNamedReplicationSlot(name, true); + + if (slot && SlotIsPhysical(slot)) + { + SpinLockAcquire(&slot->mutex); + restart_lsn = slot->data.restart_lsn; + invalidated = slot->data.invalidated != RS_INVAL_NONE; + SpinLockRelease(&slot->mutex); + } + + /* Continue if the current slot hasn't caught up. */ + if (!invalidated && !XLogRecPtrIsInvalid(restart_lsn) && + restart_lsn < wait_for_lsn) + { + /* Log warning if no active_pid for this physical slot */ + if (slot->active_pid == 0) + ereport(WARNING, + errmsg("replication slot \"%s\" specified in parameter \"%s\" does not have active_pid", + name, "standby_slot_names"), + errdetail("Logical replication is waiting on the " + "standby associated with \"%s\"", name), + errhint("Consider starting standby associated with " + "\"%s\" or amend standby_slot_names", name)); + + continue; + } + else if (!slot) + { + /* + * It may happen that the slot specified in standby_slot_names GUC + * value is dropped, so let's skip over it. + */ + warningfmt = _("replication slot \"%s\" specified in parameter \"%s\" does not exist, ignoring"); + } + else if (SlotIsLogical(slot)) + { + /* + * If a logical slot name is provided in standby_slot_names, issue + * a WARNING and skip it. 
Although logical slots are disallowed in + * the GUC check_hook(validate_standby_slots), it is still + * possible for a user to drop an existing physical slot and + * recreate a logical slot with the same name. Since it is + * harmless, a WARNING should be enough, no need to error-out. + */ + warningfmt = _("cannot have logical replication slot \"%s\" in parameter \"%s\", ignoring"); + } + else if (XLogRecPtrIsInvalid(restart_lsn) || invalidated) + { + /* + * Specified physical slot may have been invalidated, so there is no point + * in waiting for it. + */ + warningfmt = _("physical slot \"%s\" specified in parameter \"%s\" has been invalidated, ignoring"); + } + else + { + Assert(restart_lsn >= wait_for_lsn); + } This if/else chain seems structured awkwardly. IMO it would be tidier to eliminate the NULL slot and IsLogicalSlot up-front, which would also simplify some of the subsequent conditions SUGGESTION slot = SearchNamedReplicationSlot(name, true); if (!slot) { ... } else if (SlotIsLogical(slot)) { ... } else { Assert(SlotIsPhysical(slot)) SpinLockAcquire(&slot->mutex); restart_lsn = slot->data.restart_lsn; invalidated = slot->data.invalidated != RS_INVAL_NONE; SpinLockRelease(&slot->mutex); if (XLogRecPtrIsInvalid(restart_lsn) || invalidated) { ... } else if (!invalidated && !XLogRecPtrIsInvalid(restart_lsn) && restart_lsn < wait_for_lsn) { ... } else { Assert(restart_lsn >= wait_for_lsn); } } ~~~~ 5. WalSndWaitForWal + else + { + /* already caught up and doesn't need to wait for standby_slots */ break; + } /Already/already/ ====== src/test/recovery/t/050_standby_failover_slots_sync.pl 6. +$subscriber1->safe_psql('postgres', + "CREATE TABLE tab_int (a int PRIMARY KEY);"); + +# Create a subscription with failover = true +$subscriber1->safe_psql('postgres', + "CREATE SUBSCRIPTION regress_mysub1 CONNECTION '$publisher_connstr' " + . "PUBLICATION regress_mypub WITH (slot_name = lsub1_slot, failover = true);" +); Consider combining these DDL statements. ~~~ 7. +$subscriber2->safe_psql('postgres', + "CREATE TABLE tab_int (a int PRIMARY KEY);"); +$subscriber2->safe_psql('postgres', + "CREATE SUBSCRIPTION regress_mysub2 CONNECTION '$publisher_connstr' " + . "PUBLICATION regress_mypub WITH (slot_name = lsub2_slot);"); Consider combining these DDL statements ~~~ 8. +# Stop the standby associated with specified physical replication slot so that +# the logical replication slot won't receive changes until the standby slot's +# restart_lsn is advanced or the slots is removed from the standby_slot_names +# list +$publisher->safe_psql('postgres', "TRUNCATE tab_int;"); +$publisher->wait_for_catchup('regress_mysub1'); +$standby1->stop; /with specified/with the specified/ /or the slots is/or the slot is/ ~~~ 9. +# Create some data on primary /on primary/on the primary/ ~~~ 10. +$result = + $subscriber1->safe_psql('postgres', "SELECT count(*) = 10 FROM tab_int;"); +is($result, 't', + "subscriber1 doesn't get data as the sb1_slot doesn't catch up"); I felt instead of checking for 10 maybe it's more consistent with the previous code to assign again that $primary_row_count variable to 20; Then check that those primary rows are not all yet received like: SELECT count(*) < $primary_row_count FROM tab_int; ~~~ 11. +# Now that the standby lsn has advanced, primary must send the decoded +# changes to the subscription. 
+$publisher->wait_for_catchup('regress_mysub1'); +$result = + $subscriber1->safe_psql('postgres', "SELECT count(*) = 20 FROM tab_int;"); +is($result, 't', + "subscriber1 gets data from primary after standby1 is removed from the standby_slot_names list" +); /primary must/the primary must/ (continuing the suggestion from the previous review comment) Now this SQL can use the variable too: subscriber1->safe_psql('postgres', "SELECT count(*) = $primary_row_count FROM tab_int;"); ~~~ 12. + +# Create another subscription enabling failover +$subscriber1->safe_psql('postgres', + "CREATE SUBSCRIPTION regress_mysub3 CONNECTION '$publisher_connstr' " + . "PUBLICATION regress_mypub WITH (slot_name = lsub3_slot, copy_data=false, failover = true, create_slot = false);" +); Maybe give some more information in that comment: SUGGESTION Create another subscription (using the same slot created above) that enables failover. ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi, On 12/8/23 10:06 AM, Amit Kapila wrote: > On Wed, Dec 6, 2023 at 4:53 PM shveta malik <shveta.malik@gmail.com> wrote: >> >> PFA v43, changes are: >> > > I wanted to discuss 0003 patch about cascading standby's. It is not > clear to me whether we want to allow physical standbys to further wait > for cascading standby to sync their slots. If we allow such a feature > one may expect even primary to wait for all the cascading standby's > because otherwise still logical subscriber can be ahead of one of the > cascading standby. I've the same feeling here. I think it would probably be expected that the primary also wait for all the cascading standby. > I feel even if we want to allow such a behaviour we > can do it later once the main feature is committed. Agree. > I think it would > be good to just allow logical walsenders on primary to wait for > physical standbys represented by GUC 'standby_slot_names'. That makes sense for me for v1. > If we agree > on that then it would be good to prohibit setting this GUC on standby > or at least it should be a no-op even if this GUC should be set on > physical standby. I'd prefer to completely prohibit it on standby (to make it very clear it's not working at all) as long as one can enable it without downtime once the standby is promoted (which is the case currently). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Dec 8, 2023 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 6, 2023 at 4:53 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > PFA v43, changes are: > > > > I wanted to discuss 0003 patch about cascading standby's. It is not > clear to me whether we want to allow physical standbys to further wait > for cascading standby to sync their slots. If we allow such a feature > one may expect even primary to wait for all the cascading standby's > because otherwise still logical subscriber can be ahead of one of the > cascading standby. I feel even if we want to allow such a behaviour we > can do it later once the main feature is committed. I think it would > be good to just allow logical walsenders on primary to wait for > physical standbys represented by GUC 'standby_slot_names'. If we agree > on that then it would be good to prohibit setting this GUC on standby > or at least it should be a no-op even if this GUC should be set on > physical standby. > > Thoughts? IMHO, why not keep the behavior consistent across primary and standby? I mean if it doesn't require a lot of new code/design addition then it should be the user's responsibility. I mean if the user has set 'standby_slot_names' on standby then let standby also wait for cascading standby to sync their slots? Is there any issue with that behavior? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 11, 2023 at 1:02 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v44-0001 > > ~~~ > > 3. assign_standby_slot_names > > + if (!SplitIdentifierString(standby_slot_names_cpy, ',', &standby_slots)) > + { > + /* This should not happen if GUC checked check_standby_slot_names. */ > + elog(ERROR, "list syntax is invalid"); > + } > > This error here and in validate_standby_slots() are different -- > "list" versus "List". > Note here elog(ERROR,.. is used and in the other place it is part of the detail message. I have suggested in my previous review to make them the same but I overlooked the difference, so I think we should change the message to "invalid list syntax" as it was there previously. > ====== > src/backend/replication/walsender.c > > > 4. WalSndFilterStandbySlots > > > + foreach(lc, standby_slots_cpy) > + { > + char *name = lfirst(lc); > + XLogRecPtr restart_lsn = InvalidXLogRecPtr; > + bool invalidated = false; > + char *warningfmt = NULL; > + ReplicationSlot *slot; > + > + slot = SearchNamedReplicationSlot(name, true); > + > + if (slot && SlotIsPhysical(slot)) > + { > + SpinLockAcquire(&slot->mutex); > + restart_lsn = slot->data.restart_lsn; > + invalidated = slot->data.invalidated != RS_INVAL_NONE; > + SpinLockRelease(&slot->mutex); > + } > + > + /* Continue if the current slot hasn't caught up. */ > + if (!invalidated && !XLogRecPtrIsInvalid(restart_lsn) && > + restart_lsn < wait_for_lsn) > + { > + /* Log warning if no active_pid for this physical slot */ > + if (slot->active_pid == 0) > + ereport(WARNING, > + errmsg("replication slot \"%s\" specified in parameter \"%s\" does > not have active_pid", > + name, "standby_slot_names"), > + errdetail("Logical replication is waiting on the " > + "standby associated with \"%s\"", name), > + errhint("Consider starting standby associated with " > + "\"%s\" or amend standby_slot_names", name)); > + > + continue; > + } > + else if (!slot) > + { > + /* > + * It may happen that the slot specified in standby_slot_names GUC > + * value is dropped, so let's skip over it. > + */ > + warningfmt = _("replication slot \"%s\" specified in parameter > \"%s\" does not exist, ignoring"); > + } > + else if (SlotIsLogical(slot)) > + { > + /* > + * If a logical slot name is provided in standby_slot_names, issue > + * a WARNING and skip it. Although logical slots are disallowed in > + * the GUC check_hook(validate_standby_slots), it is still > + * possible for a user to drop an existing physical slot and > + * recreate a logical slot with the same name. Since it is > + * harmless, a WARNING should be enough, no need to error-out. > + */ > + warningfmt = _("cannot have logical replication slot \"%s\" in > parameter \"%s\", ignoring"); > + } > + else if (XLogRecPtrIsInvalid(restart_lsn) || invalidated) > + { > + /* > + * Specified physical slot may have been invalidated, so there is no point > + * in waiting for it. > + */ > + warningfmt = _("physical slot \"%s\" specified in parameter \"%s\" > has been invalidated, ignoring"); > + } > + else > + { > + Assert(restart_lsn >= wait_for_lsn); > + } > > This if/else chain seems structured awkwardly. IMO it would be tidier > to eliminate the NULL slot and IsLogicalSlot up-front, which would > also simplify some of the subsequent conditions > > SUGGESTION > > slot = SearchNamedReplicationSlot(name, true); > > if (!slot) > { > ... > } > else if (SlotIsLogical(slot)) > { > ... 
> } > else > { > Assert(SlotIsPhysical(slot)) > > SpinLockAcquire(&slot->mutex); > restart_lsn = slot->data.restart_lsn; > invalidated = slot->data.invalidated != RS_INVAL_NONE; > SpinLockRelease(&slot->mutex); > > if (XLogRecPtrIsInvalid(restart_lsn) || invalidated) > { > ... > } > else if (!invalidated && !XLogRecPtrIsInvalid(restart_lsn) && > restart_lsn < wait_for_lsn) > { > ... > } > else > { > Assert(restart_lsn >= wait_for_lsn); > } > } > +1. -- With Regards, Amit Kapila.
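As an illustration of the suggested shape, here is a sketch with the elided branches filled in from the quoted v44 code (this is only an illustration of the restructuring, not the actual patch; the errdetail/errhint arguments are omitted for brevity, and the variables are the ones already declared in the loop shown above). A side benefit: once NULL and logical slots are handled up front, the catch-up branch no longer needs the !invalidated and !XLogRecPtrIsInvalid() tests.

/* Sketch of one iteration of the foreach loop over standby_slots_cpy. */
slot = SearchNamedReplicationSlot(name, true);

if (!slot)
{
	/* Slot listed in standby_slot_names may have been dropped; skip it. */
	warningfmt = _("replication slot \"%s\" specified in parameter \"%s\" does not exist, ignoring");
}
else if (SlotIsLogical(slot))
{
	/* A same-named logical slot may have replaced the physical one; skip it. */
	warningfmt = _("cannot have logical replication slot \"%s\" in parameter \"%s\", ignoring");
}
else
{
	Assert(SlotIsPhysical(slot));

	SpinLockAcquire(&slot->mutex);
	restart_lsn = slot->data.restart_lsn;
	invalidated = slot->data.invalidated != RS_INVAL_NONE;
	SpinLockRelease(&slot->mutex);

	if (XLogRecPtrIsInvalid(restart_lsn) || invalidated)
	{
		/* The slot has been invalidated; no point in waiting for it. */
		warningfmt = _("physical slot \"%s\" specified in parameter \"%s\" has been invalidated, ignoring");
	}
	else if (restart_lsn < wait_for_lsn)
	{
		/* This standby has not caught up yet; keep waiting for it. */
		if (slot->active_pid == 0)
			ereport(WARNING,
					errmsg("replication slot \"%s\" specified in parameter \"%s\" does not have active_pid",
						   name, "standby_slot_names"));
		continue;
	}
	else
		Assert(restart_lsn >= wait_for_lsn);
}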
On Mon, Dec 11, 2023 at 1:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Dec 8, 2023 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Dec 6, 2023 at 4:53 PM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > > PFA v43, changes are:
> > >
> >
> > I wanted to discuss 0003 patch about cascading standby's. It is not
> > clear to me whether we want to allow physical standbys to further wait
> > for cascading standby to sync their slots. If we allow such a feature
> > one may expect even primary to wait for all the cascading standby's
> > because otherwise still logical subscriber can be ahead of one of the
> > cascading standby. I feel even if we want to allow such a behaviour we
> > can do it later once the main feature is committed. I think it would
> > be good to just allow logical walsenders on primary to wait for
> > physical standbys represented by GUC 'standby_slot_names'. If we agree
> > on that then it would be good to prohibit setting this GUC on standby
> > or at least it should be a no-op even if this GUC should be set on
> > physical standby.
> >
> > Thoughts?
>
> IMHO, why not keep the behavior consistent across primary and standby?
> I mean if it doesn't require a lot of new code/design addition then
> it should be the user's responsibility. I mean if the user has set
> 'standby_slot_names' on standby then let standby also wait for
> cascading standby to sync their slots? Is there any issue with that
> behavior?
>

Without the primary also waiting for the cascading standbys, it is not helpful for just the standby to wait.

Currently, logical walsenders on the primary wait for the physical standbys to take the changes before they update their own logical slots. But they wait only for their immediate standbys, not for the cascading standbys. Although on the first standby we do have logic where the slot-sync workers wait for cascading standbys before they update their own (synced) slots (see patch 0003), this does not guarantee that logical subscribers of the primary will never be ahead of the cascading standbys. Consider this timeline:

t1: logical walsender on the primary is waiting for standby1 (first standby).
t2: physical walsender on standby1 is stuck, so there is a delay in sending these changes to standby2 (cascading standby).
t3: standby1 has taken the changes and sends confirmation to the primary.
t4: logical walsender on the primary receives the confirmation from standby1 and updates the slot; logical subscribers of the primary also receive the changes.
t5: standby2 has not received the changes yet because the physical walsender on standby1 is still stuck; the slot-sync worker is still waiting for standby2 (cascading) before it updates its own (synced) slots.
t6: standby2 is promoted to become the new primary.

Now we are in a state wherein the primary, the logical subscriber, and the first standby have some changes, but the cascading standby does not. And the logical slots on the primary were updated without confirming whether the cascading standby had taken the changes. This is a problem, and we do not have a simple solution for it yet.

thanks
Shveta
On Mon, Dec 11, 2023 at 2:20 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Dec 11, 2023 at 1:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, Dec 8, 2023 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Dec 6, 2023 at 4:53 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > PFA v43, changes are: > > > > > > > > > > I wanted to discuss 0003 patch about cascading standby's. It is not > > > clear to me whether we want to allow physical standbys to further wait > > > for cascading standby to sync their slots. If we allow such a feature > > > one may expect even primary to wait for all the cascading standby's > > > because otherwise still logical subscriber can be ahead of one of the > > > cascading standby. I feel even if we want to allow such a behaviour we > > > can do it later once the main feature is committed. I think it would > > > be good to just allow logical walsenders on primary to wait for > > > physical standbys represented by GUC 'standby_slot_names'. If we agree > > > on that then it would be good to prohibit setting this GUC on standby > > > or at least it should be a no-op even if this GUC should be set on > > > physical standby. > > > > > > Thoughts? > > > > IMHO, why not keep the behavior consistent across primary and standby? > > I mean if it doesn't require a lot of new code/design addition then > > it should be the user's responsibility. I mean if the user has set > > 'standby_slot_names' on standby then let standby also wait for > > cascading standby to sync their slots? Is there any issue with that > > behavior? > > > > Without waiting for cascading standby on primary, it won't be helpful > to just wait on standby. > > Currently logical walsenders on primary waits for physical standbys to > take changes before they update their own logical slots. But they wait > only for their immediate standbys and not for cascading standbys. > Although, on first standby, we do have logic where slot-sync workers > wait for cascading standbys before they update their own slots (synced > ones, see patch3). But this does not guarantee that logical > subscribers on primary will never be ahead of the cascading standbys. > Let us consider this timeline: > > t1: logical walsender on primary waiting for standby1 (first standby). > t2: physical walsender on standby1 is stuck and thus there is delay in > sending these changes to standby2 (cascading standby). > t3: standby1 has taken changes and sends confirmation to primary. > t4: logical walsender on primary receives confirmation from standby1 > and updates slot, logical subscribers of primary also receives the > changes. > t5: standby2 has not received changes yet as physical walsender on > standby1 is still stuck, slotsync worker still waiting for standby2 > (cascading) before it updates its own slots (synced ones). > t6: standby2 is promoted to become primary. > > Now we are in a state wherein primary, logical subscriber and first > standby has some changes but cascading standby does not. And logical > slots on primary were updated w/o confirming if cascading standby has > taken changes or not. This is a problem and we do not have a simple > solution for this yet. > > thanks > Shveta PFA v45, changes in patch002: --Addressed comments in [1] and [2] --Added holistic test case for patch02. Thanks Nisha for the test implementation. 
[1]: https://www.postgresql.org/message-id/CAHut%2BPuuqEpDse5msENsVuK3rjTRN-QGS67rRCGVv%2BzcT-f0GA%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAA4eK1KbhdjKqui%3Dfr4Ny2TwGAFU9WLWTdypN%2BWG0WEfnBR%3D4w%40mail.gmail.com thanks Shveta
Attachment
On Sun, Dec 10, 2023 at 4:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 6, 2023 at 4:53 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > v43-002: > > > > Review comments on v43-0002: > ========================= Thanks for the feedback Amit. Addressed these in v45. Please find my response on a few of these. > 1. > synchronize_one_slot() > { > ... > + /* > + * With hot_standby_feedback enabled and invalidations handled > + * apropriately as above, this should never happen. > + */ > + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn) > + { > + ereport(ERROR, > + errmsg("not synchronizing local slot \"%s\" LSN(%X/%X)" > + " to remote slot's LSN(%X/%X) as synchronization " > + " would move it backwards", remote_slot->name, > + LSN_FORMAT_ARGS(MyReplicationSlot->data.restart_lsn), > + LSN_FORMAT_ARGS(remote_slot->restart_lsn))); > + > + goto cleanup; > ... > } > > After the error, the control won't return, so the above goto doesn't > make any sense. > > 2. > synchronize_one_slot() > { > ... > + /* Search for the named slot */ > + if ((s = SearchNamedReplicationSlot(remote_slot->name, true))) > + { > + SpinLockAcquire(&s->mutex); > + sync_state = s->data.sync_state; > + SpinLockRelease(&s->mutex); > + } > ... > ... > + ReplicationSlotAcquire(remote_slot->name, true); > + > + /* > + * Copy the invalidation cause from remote only if local slot is not > + * invalidated locally, we don't want to overwrite existing one. > + */ > + if (MyReplicationSlot->data.invalidated == RS_INVAL_NONE) > + { > + SpinLockAcquire(&MyReplicationSlot->mutex); > + MyReplicationSlot->data.invalidated = remote_slot->invalidated; > + SpinLockRelease(&MyReplicationSlot->mutex); > + } > + > + /* Skip the sync if slot has been invalidated locally. */ > + if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE) > + goto cleanup; > ... > > It seems useless to acquire the slot if it is locally invalidated in > the first place. Won't it be better if after the search we first check > whether the slot is locally invalidated and take appropriate action? > If we don't acquire the slot first, there could be a race condition that the local slot could be invalidated just after checking the invalidated flag. See InvalidatePossiblyObsoleteSlot() where it invalidates slot directly if the slot is not acquired by other processes. Thus, I have not removed 'ReplicationSlotAcquire' but I have re-structured the code a little bit to get rid of duplicate code in 'if' and 'else' part for invalidation logic. > 3. After doing the above two, I think it doesn't make sense to have > goto at the remaining places in synchronize_one_slot(). We can simply > release the slot and commit the transaction at other places. > > 4. > + * Returns nap time for the next sync-cycle. > + */ > +static long > +synchronize_slots(WalReceiverConn *wrconn) > > Returning nap time from here appears a bit awkward. I think it is > better if this function returns any_slot_updated and then the caller > decides the adjustment of naptime. > > 5. > +synchronize_slots(WalReceiverConn *wrconn) > { > ... > ... > + /* The syscache access needs a transaction env. */ > + StartTransactionCommand(); > + > + /* > + * Make result tuples live outside TopTransactionContext to make them > + * accessible even after transaction is committed. 
> + */ > + MemoryContextSwitchTo(oldctx); > + > + /* Construct query to get slots info from the primary server */ > + initStringInfo(&s); > + construct_slot_query(&s); > + > + elog(DEBUG2, "slot sync worker's query:%s \n", s.data); > + > + /* Execute the query */ > + res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow); > + pfree(s.data); > + > + if (res->status != WALRCV_OK_TUPLES) > + ereport(ERROR, > + (errmsg("could not fetch failover logical slots info " > + "from the primary server: %s", res->err))); > + > + CommitTransactionCommand(); > ... > ... > } > > Where exactly in the above code, there is a syscache access as > mentioned above StartTransactionCommand()? > It is in walrcv_exec (libpqrcv_processTuples). I have changed the comments to add this info. > 6. > - <filename>~/.pgpass</filename> file on the standby server (use > + <filename>~/.pgpass</filename> file on the standby server. (use > <literal>replication</literal> as the database name). > > Why do we need this change? We don't, removed it. > > 7. > + standby. Additionally, similar to creating a logical replication slot > + on the hot standby, <varname>hot_standby_feedback</varname> should be > + set on the standby and a physical slot between the primary and the standby > + should be used. > > In this, I don't understand the relation between the first part of the > line: "Additionally, similar to creating a logical replication slot on > the hot standby ..." with the rest. > > 8. > However, > + the slots which were in initiated sync_state ('i) and were not > > A single quote after 'i' is missing. > > 9. > the slots with state 'r' and 'i' can neither be used for logical > + decoded nor dropped by the user. > > /decoded/decoding > > 10. > +/* > + * Allocate and initialize slow sync worker shared memory > + */ > > /slow/slot > > -- > With Regards, > Amit Kapila.
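Coming back to the ordering question in comment #2 above (why synchronize_one_slot() acquires the slot before acting on the invalidation flag), here is a minimal sketch of the reasoning; it is an illustration only, not the v45 code:

/*
 * Sketch only: acquire the local synced slot first, then inspect and
 * update data.invalidated.  Checking the flag before acquiring would
 * leave a window in which InvalidatePossiblyObsoleteSlot() invalidates
 * the slot, since it invalidates a slot directly when no process has it
 * acquired.
 */
ReplicationSlotAcquire(remote_slot->name, true);

SpinLockAcquire(&MyReplicationSlot->mutex);
/* Copy the remote invalidation cause only if not already invalidated locally. */
if (MyReplicationSlot->data.invalidated == RS_INVAL_NONE)
	MyReplicationSlot->data.invalidated = remote_slot->invalidated;
SpinLockRelease(&MyReplicationSlot->mutex);

/* Holding the slot, this check can no longer race with invalidation. */
if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
	goto cleanup;			/* skip the sync; release at the patch's cleanup label */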
On Thu, Dec 7, 2023 at 1:33 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi. > > Here are my review comments for patch v43-0002. > Thanks for the feedback. I have addressed most of these in v45. Please find my response on a few which are pending or are not needed. > ====== > Commit message > > 1. > The nap time of worker is tuned according to the activity on the primary. > The worker starts with nap time of 10ms and if no activity is observed on > the primary for some time, then nap time is increased to 10sec. And if > activity is observed again, nap time is reduced back to 10ms. > > ~ > /nap time of worker/nap time of the worker/ > /And if/If/ > > ~~~ > > 2. > Slots synced on the standby can be identified using 'sync_state' column of > pg_replication_slots view. The values are: > 'n': none for user slots, > 'i': sync initiated for the slot but waiting for the remote slot on the > primary server to catch up. > 'r': ready for periodic syncs. > > ~ > > /identified using/identified using the/ > > The meaning of "identified by" is unclear to me. It also seems to > clash with later descriptions in system-views.sgml. Please see my > later review comment about it (in the sgml file) > I have rephrased it, please check now and let me know. > ====== > doc/src/sgml/bgworker.sgml > > 3. > bgw_start_time is the server state during which postgres should start > the process; it can be one of BgWorkerStart_PostmasterStart (start as > soon as postgres itself has finished its own initialization; processes > requesting this are not eligible for database connections), > BgWorkerStart_ConsistentState (start as soon as a consistent state has > been reached in a hot standby, allowing processes to connect to > databases and run read-only queries), and > BgWorkerStart_RecoveryFinished (start as soon as the system has > entered normal read-write state. Note that the > BgWorkerStart_ConsistentState and BgWorkerStart_RecoveryFinished are > equivalent in a server that's not a hot standby), and > BgWorkerStart_ConsistentState_HotStandby (same meaning as > BgWorkerStart_ConsistentState but it is more strict in terms of the > server i.e. start the worker only if it is hot-standby; if it is > consistent state in non-standby, worker will not be started). Note > that this setting only indicates when the processes are to be started; > they do not stop when a different state is reached. > > ~ > > 3a. > This seems to have grown to become just one enormous sentence that is > too hard to read. IMO this should be changed to be a <variablelist> of > possible values instead of a big slab of text. I suspect it could also > be simplified quite a lot -- something like below > > SUGGESTION > bgw_start_time is the server state during which postgres should start > the process. Note that this setting only indicates when the processes > are to be started; they do not stop when a different state is reached. > Possible values are: > > - BgWorkerStart_PostmasterStart (start as soon as postgres itself has > finished its own initialization; processes requesting this are not > eligible for database connections) > > - BgWorkerStart_ConsistentState (start as soon as a consistent state > has been reached in a hot-standby, allowing processes to connect to > databases and run read-only queries) > > - BgWorkerStart_RecoveryFinished (start as soon as the system has > entered normal read-write state. 
Note that the > BgWorkerStart_ConsistentState and BgWorkerStart_RecoveryFinished are > equivalent in a server that's not a hot standby) > > - BgWorkerStart_ConsistentState_HotStandby (same meaning as > BgWorkerStart_ConsistentState but it is more strict in terms of the > server i.e. start the worker only if it is hot-standby; if it is a > consistent state in non-standby, the worker will not be started). > > ~~~ > > 3b. > "i.e. start the worker only if it is hot-standby; if it is consistent > state in non-standby, worker will not be started" > > ~ > > Why is it even necessary to say the 2nd part "if it is consistent > state in non-standby, worker will not be started". It seems redundant > given 1st part says the same, right? > > > ====== > doc/src/sgml/config.sgml > > 4. > + <para> > + The standbys corresponding to the physical replication slots in > + <varname>standby_slot_names</varname> must enable > + <varname>enable_syncslot</varname> for the standbys to receive > + failover logical slots changes from the primary. > + </para> > > 4a. > Somehow "must enable enable_syncslot" seemed strange. Maybe re-word like: > > "must enable slot synchronization (see enable_syncslot)" > > OR > > "must configure enable_syncslot = true" > > ~~~ > > 4b. > (seems like repetitive use of "the standbys") > > /for the standbys to/to/ > > OR > > /for the standbys to/so they can/ > > ~~~ > > 5. > <varname>primary_conninfo</varname> string, or in a separate > - <filename>~/.pgpass</filename> file on the standby server (use > + <filename>~/.pgpass</filename> file on the standby server. (use > > This rearranged period seems unrelated to the current patch. Maybe > don't touch this. > > ~~~ > > 6. > + <para> > + Specify <literal>dbname</literal> in > + <varname>primary_conninfo</varname> string to allow synchronization > + of slots from the primary server to the standby server. > + This will only be used for slot synchronization. It is ignored > + for streaming. > </para> > > The wording "to allow synchronization of slots" seemed misleading to > me. Isn't that more the purpose of the 'enable_syncslot' GUC? I think > the intended wording is more like below: > > SUGGESTION > If slot synchronization is enabled then it is also necessary to > specify <literal>dbname</literal> in the > <varname>primary_conninfo</varname> string. This will only be used for > slot synchronization. It is ignored for streaming. > > ====== > doc/src/sgml/logicaldecoding.sgml > > 7. > + <para> > + A logical replication slot on the primary can be synchronized to the hot > + standby by enabling the failover option during slot creation and set > + <varname>enable_syncslot</varname> on the standby. For the synchronization > + to work, it is mandatory to have physical replication slot between the > + primary and the standby. This physical replication slot for the standby > + should be listed in <varname>standby_slot_names</varname> on the primary > + to prevent the subscriber from consuming changes faster than the hot > + standby. Additionally, similar to creating a logical replication slot > + on the hot standby, <varname>hot_standby_feedback</varname> should be > + set on the standby and a physical slot between the primary and the standby > + should be used. > + </para> > > > 7a. > /creation and set/creation and setting/ > /to have physical replication/to have a physical replication/ > > ~ > > 7b. > It's unclear why this is saying "should be listed in > standby_slot_names" and "hot_standby_feedback should be set on the > standby". 
Why is it saying "should" instead of MUST -- are these > optional? I thought the GUC validation function mandates these (???). > standby_slot_names setting is not mandatory, it is recommended though. OTOH hot_standby_feedback setting is mandatory. So I have changed accordingly. > ~ > > 7c. > Why does the paragraph say "and a physical slot between the primary > and the standby should be used."; isn't that exactly what was already > written earlier ("For the synchronization to work, it is mandatory to > have physical replication slot between the primary and the standby" > Removed the duplicate line. > ~~~ > > 8. > + <para> > + By enabling synchronization of slots, logical replication can be resumed > + after failover depending upon the > + <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>sync_state</structfield> > + for the synchronized slots on the standby at the time of failover. > + The slots which were in ready sync_state ('r') on the standby before > + failover can be used for logical replication after failover. However, > + the slots which were in initiated sync_state ('i) and were not > + sync-ready ('r') at the time of failover will be dropped and logical > + replication for such slots can not be resumed after failover. This applies > + to the case where a logical subscription is disabled before > failover and is > + enabled after failover. If the synchronized slot due to disabled > + subscription could not be made sync-ready ('r') on standby, then the > + subscription can not be resumed after failover even when enabled. > > > 8a. > This feels overcomplicated -- too much information? > > SUGGESTION > depending upon the ... sync_state for the synchronized slots on the > standby at the time of failover. Only slots that were in ready > sync_state ('r') on the standby before failover can be used for > logical replication after failover > > ~~~ > > 8b. > + the slots which were in initiated sync_state ('i) and were not > + sync-ready ('r') at the time of failover will be dropped and logical > + replication for such slots can not be resumed after failover. This applies > + to the case where a logical subscription is disabled before > failover and is > + enabled after failover. If the synchronized slot due to disabled > + subscription could not be made sync-ready ('r') on standby, then the > + subscription can not be resumed after failover even when enabled. > > But isn't ALL that part pretty much redundant information for the > user? I thought these are not ready state, so they are not usable... > End-Of-Story. Isn't everything else just more like implementation > details, which the user does not need to know about? > 'sync_state' is a way to monitor the state of synchronization and I feel it is important to tell what happens with 'i' state slots. Also there was a comment to add this info in doc that disabled subscriptions are not guaranteed to be usable if enabled after failover. Thus it was added and rest of the info forms a base for that. We can trim down or rephrase if needed. > ~~~ > > 9. > + If the primary is idle, making the synchronized slot on the standby > + as sync-ready ('r') for enabled subscription may take noticeable time. > + This can be sped up by calling the > + <function>pg_log_standby_snapshot</function> function on the primary. > + </para> > > SUGGESTION > If the primary is idle, then the synchronized slots on the standby may > take a noticeable time to reach the ready ('r') sync_state. 
This can > be sped up by calling the > <function>pg_log_standby_snapshot</function> function on the primary. > > ====== > doc/src/sgml/system-views.sgml > > 10. > + > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>sync_state</structfield> <type>char</type> > + </para> > + <para> > + Defines slot synchronization state. This is meaningful on the physical > + standby which has enabled slots synchronization. > + </para> > > I felt that this part "which has enabled slots synchronization" should > cross-reference to the 'sync_enabled' GUC. > > ~~~ > > 11. > + <para> > + State code: > + <literal>n</literal> = none for user created slots, > + <literal>i</literal> = sync initiated for the slot but slot is not ready > + yet for periodic syncs, > + <literal>r</literal> = ready for periodic syncs. > + </para> > > I'm wondering why don't we just reuse 'd' (disabled), 'p' (pending), > 'e' (enabled) like the other tri-state attributes are using. > I think it is not a property of a slot where we say enabled/disabled. It is more like an operation and thus initiated, ready etc sounds better. These states are similar to the ones maintained for table-sync operation (SUBREL_STATE_INIT, SUBREL_STATE_READY etc) > > 12. > + <para> > + The hot standby can have any of these sync_state for the slots but on a > + hot standby, the slots with state 'r' and 'i' can neither be > used for logical > + decoded nor dropped by the user. The primary server will have sync_state > + as 'n' for all the slots. But if the standby is promoted to become the > + new primary server, sync_state can be seen 'r' as well. On this new > + primary server, slots with sync_state as 'r' and 'n' will > behave the same. > + </para></entry> > + </row> > > 12a. > /logical decoded/logical decoding/ > > ~ > > 12b. > "sync_state as 'r' and 'n' will behave the same" sounds kind of hacky. > Is there no alternative? > I am reviewing your suggestion on 'r' to 'n' conversion on promotion given later in this email. So give me some more time. > Anyway, IMO mentioning about primary server states seems overkill, > because you already said "This is meaningful on the physical standby" > which I took as implying that it is *not* meaningful from the POV of > the primary server. > In case we planned to retain 'r', it then makes sense to document that the sync_state on primary can also be 'r' if the primary was promoted from a standby, because this is a special case which the user may not be aware of. > In light of this, I'm wondering if a better name for this attribute > would be: 'standby_sync_state' > sync_state has some value for primary too. It is not null on primary. Thus the current name seems a better choice. > ====== > src/backend/access/transam/xlogrecovery.c > > 13. > + /* > + * Shutdown the slot sync workers to prevent potential conflicts between > + * user processes and slotsync workers after a promotion. Additionally, > + * drop any slots that have initiated but not yet completed the sync > + * process. > + */ > + ShutDownSlotSync(); > + slotsync_drop_initiated_slots(); > + > > Is this where maybe the 'sync_state' should also be updated for > everything so you are not left with confusion about different states > on a node that is no longer a standby node? > yes, this is the place. But this needs more thought as it may cause too much disk activity during promotion. so let me analyze and come back. > ====== > src/backend/postmaster/postmaster.c > > 14. 
PostmasterMain > > ApplyLauncherRegister(); > > + SlotSyncWorkerRegister(); > + > > Every other function call here is heavily commented but there is a > conspicuous absence of a comment here. > Added some comments, but not very confident on those, so let me know. > ~~~ > > 15. bgworker_should_start_now > > if (start_time == BgWorkerStart_ConsistentState) > return true; > + else if (start_time == BgWorkerStart_ConsistentState_HotStandby && > + pmState != PM_RUN) > + return true; > /* fall through */ > Change "else if" to "if" would be simpler. > > ====== > .../libpqwalreceiver/libpqwalreceiver.c > > 16. > + for (opt = opts; opt->keyword != NULL; ++opt) > + { > + /* > + * If multiple dbnames are specified, then the last one will be > + * returned > + */ > + if (strcmp(opt->keyword, "dbname") == 0 && opt->val && > + opt->val[0] != '\0') > + dbname = pstrdup(opt->val); > + } > > This can use a tidier C99 style to declare 'opt' as the loop variable. > > ~~~ > > 17. > static void > libpqrcv_alter_slot(WalReceiverConn *conn, const char *slotname, > - bool failover) > + bool failover) > > What is this change for? Or, if something is wrong with the indent > then anyway it should be fixed in patch 0001. > yes, it should go to patch01. Done. > ====== > src/backend/replication/logical/logical.c > > 18. > > + /* > + * Slots in state SYNCSLOT_STATE_INITIATED should have been dropped on > + * promotion. > + */ > + if (!RecoveryInProgress() && slot->data.sync_state == > SYNCSLOT_STATE_INITIATED) > + elog(ERROR, "replication slot \"%s\" was not synced completely from > the primary server", > + NameStr(slot->data.name)); > + > + /* > + * Do not allow consumption of a "synchronized" slot until the standby > + * gets promoted. > + */ > + if (RecoveryInProgress() && slot->data.sync_state != SYNCSLOT_STATE_NONE) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot use replication slot \"%s\" for logical decoding", > + NameStr(slot->data.name)), > + errdetail("This slot is being synced from the primary server."), > + errhint("Specify another replication slot."))); > + > > 18a. > > Instead of having !RecoveryInProgress() and RecoveryInProgress() in > separate conditions is the code simpler like: > > SUGGESTION > > if (RecoveryInProgress()) > { > /* Do not allow ... */ > if (slot->data.sync_state != SYNCSLOT_STATE_NONE) ... > } > else > { > /* Slots in state... */ > if (slot->data.sync_state == SYNCSLOT_STATE_INITIATED) ... > } > > ~ > > 18b. > Should the errdetail give the current state? > I think it is not needed, current info looks good enough. User can always use pg_replication_slots to monitor sync_state info. > ====== > src/backend/replication/logical/slotsync.c > > 19. > +/* > + * Number of attempts for wait_for_primary_slot_catchup() after > + * which it aborts the wait and the slot sync worker then moves > + * to the next slot creation/sync. > + */ > +#define WORKER_PRIMARY_CATCHUP_WAIT_ATTEMPTS 5 > > Given this is only used within one static function, I'm wondering if > it would be tidier to also move this macro to within that function. > > ~~~ > > 20. wait_for_primary_slot_catchup > > +/* > + * Wait for remote slot to pass locally reserved position. > + * > + * Ping and wait for the primary server for > + * WORKER_PRIMARY_CATCHUP_WAIT_ATTEMPTS during a slot creation, if it still > + * does not catch up, abort the wait. The ones for which wait is aborted will > + * attempt the wait and sync in the next sync-cycle. 
> + * > + * *persist will be set to false if the slot has disappeared or was invalidated > + * on the primary; otherwise, it will be set to true. > + */ > > 20a. > The comment doesn't say the meaning of the boolean returned. > > ~ > > 20b. > /*persist will be set/If passed, *persist will be set/ > > ~~~ > > 21. > + appendStringInfo(&cmd, > + "SELECT conflicting, restart_lsn, confirmed_flush_lsn," > + " catalog_xmin FROM pg_catalog.pg_replication_slots" > + " WHERE slot_name = %s", > + quote_literal_cstr(remote_slot->name)); > > Somehow, I felt it is more readable if the " FROM" starts on a new line. > > e.g. > "SELECT conflicting, restart_lsn, confirmed_flush_lsn, catalog_xmin" > " FROM pg_catalog.pg_replication_slots" > " WHERE slot_name = %s" > > ~~~ > > 22. > + ereport(ERROR, > + (errmsg("could not fetch slot info for slot \"%s\" from the" > + " primary server: %s", > + remote_slot->name, res->err))); > > Perhaps the message can be shortened like: > "could not fetch slot \"%s\" info from the primary server: %s" > > ~~~ > > 23. > + ereport(WARNING, > + (errmsg("slot \"%s\" disappeared from the primary server," > + " slot creation aborted", remote_slot->name))); > > Would this be better split into parts? > > SUGGESTION > errmsg "slot \"%s\" creation aborted" > errdetail "slot was not found on the primary server" > > ~~~ > > 24. > + ereport(WARNING, > + (errmsg("slot \"%s\" invalidated on the primary server," > + " slot creation aborted", remote_slot->name))); > > (similar to previous) > > SUGGESTION > errmsg "slot \"%s\" creation aborted" > errdetail "slot was invalidated on the primary server" > > ~~~ > > 25. > + /* > + * Once we got valid restart_lsn, then confirmed_lsn and catalog_xmin > + * are expected to be valid/non-null. > + */ > > SUGGESTION > Having got a valid restart_lsn, the confirmed_lsn and catalog_xmin are > expected to be valid/non-null. > > ~~~ > > 26. slotsync_drop_initiated_slots > > +/* > + * Drop the slots for which sync is initiated but not yet completed > + * i.e. they are still waiting for the primary server to catch up. > + */ > > I found "waiting for the primary server to catch up" to be difficult > to understand without knowing the full details, but it is not really > described properly until a much larger comment that is buried in the > synchronize_one_slot(). So I think all this needs explanation up-front > in the file, which you can refer to. I have repeated this same review > comment in a couple of places. > I have updated header of file with details and gave reference here and all such similar places. > ~~~ > > 27. get_local_synced_slot_names > > +static List * > +get_local_synced_slot_names(void) > +{ > + List *localSyncedSlots = NIL; > > 27a. > It's not returning a list of "names" though, so is this an appropriate > function name? > > ~~~ > > 27b. > Suggest just call that ('localSyncedSlots') differently. > - In slotsync_drop_initiated_slots() function they are just called 'slots' > - In drop_obsolete_slots() function it is called 'local_slot_list' > > IMO it is better if all these are consistently named -- just all lists > 'slots' or all 'local_slots' or whatever. > > ~~~ > > 28. check_sync_slot_validity > > +static bool > +check_sync_slot_validity(ReplicationSlot *local_slot, List *remote_slots, > + bool *locally_invalidated) > > Somehow this wording "validity" seems like a misleading function name, > because the return value has nothing to do with the slot field > invalidated. 
> > The validity/locally_invalidated stuff is a secondary return as a side > effect for the "true" case. > > A more accurate function name would be more like check_sync_slot_on_remote(). > > ~~~ > > 29. check_sync_slot_validity > > +static bool > +check_sync_slot_validity(ReplicationSlot *local_slot, List *remote_slots, > + bool *locally_invalidated) > +{ > + ListCell *cell; > > There is inconsistent naming -- > > ListCell lc; ListCell cell; ListCell lc_slot; etc.. > > IMO the more complicated names aren't of much value -- probably > everything can be changed to 'lc' for consistency. > > ~~~ > > 30. drop_obsolete_slots > > + /* > + * Get the list of local 'synced' slot so that those not on remote could > + * be dropped. > + */ > > /slot/slots/ > > Also, I don't think it is necessary to say "so that those not on > remote could be dropped." -- That is already described in the function > comment and again in a comment later in the loop. That seems enough. > If the function name get_local_synced_slot_names() is improved a bit > the comment seems redundant because it is obvious from the function > name. > > ~~~ > > 31. > + foreach(lc_slot, local_slot_list) > + { > + ReplicationSlot *local_slot = (ReplicationSlot *) lfirst(lc_slot); > + bool local_exists = false; > + bool locally_invalidated = false; > + > + local_exists = check_sync_slot_validity(local_slot, remote_slot_list, > + &locally_invalidated); > > Shouldn't that 'local_exists' variable be called 'remote_exists'? > That's what the other comments seem to be saying. > > ~~~ > > 32. construct_slot_query > > + appendStringInfo(s, > + "SELECT slot_name, plugin, confirmed_flush_lsn," > + " restart_lsn, catalog_xmin, two_phase, failover," > + " database, pg_get_slot_invalidation_cause(slot_name)" > + " FROM pg_catalog.pg_replication_slots" > + " WHERE failover and sync_state != 'i'"); > > Just wondering if substituting the SYNCSLOT_STATE_INITIATED constant > here might be more appropriate than hardwiring 'i'. Why have a > constant but not use it? > On hold. I could not find quote_* function for a character just like we have 'quote_literal_cstr' for string. Will review. Let me know if you know. > ~~~ > > 33. synchronize_one_slot > > +static void > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, > + bool *slot_updated) > +{ > + ReplicationSlot *s; > + char sync_state = 0; > > 33a. > It seems strange that the sync_state is initially assigned something > other than the 3 legal values. Should this be defaulting to > SYNCSLOT_STATE_NONE instead? > No, that will change the flow. It should stay uninitialized if the slot is not found. I have changed assignment to '\0' for better clarity. > ~ > > 33b. > I think it is safer to default the *slot_updated = false; because the > code appears to assume it was false already which may or may not be > true. > It is initialized to false in the caller, so we are good here. > ~~~ > > 34. > + /* > + * Make sure that concerned WAL is received before syncing slot to target > + * lsn received from the primary server. > + * > + * This check should never pass as on the primary server, we have waited > + * for the standby's confirmation before updating the logical slot. > + */ > > Maybe this comment should mention up-front that it is just a "Sanity check:" > > ~~~ > > 35. > + /* > + * With hot_standby_feedback enabled and invalidations handled > + * apropriately as above, this should never happen. 
> + */ > + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn) > + { > + ereport(ERROR, > + errmsg("not synchronizing local slot \"%s\" LSN(%X/%X)" > + " to remote slot's LSN(%X/%X) as synchronization " > + " would move it backwards", remote_slot->name, > + LSN_FORMAT_ARGS(MyReplicationSlot->data.restart_lsn), > + LSN_FORMAT_ARGS(remote_slot->restart_lsn))); > + > + goto cleanup; > + } > > 35a. > IIUC then this another comment that should say it is just a "Sanity-check:". > > ~ > > 35b. > I was wondering if there should be Assert(hot_standby_feedback) here > also. The comment "With hot_standby_feedback enabled" is a bit vague > whereas including an Assert will clarify that it must be set. > I think assert is not needed. Slot-sync worker will never start if hot_standby_feedback is disabled. If we put assert here, we need to assert at all other places too where we use other related GUCs like primary_slot_name, conn_info etc. > ~ > > 35c. > Since it says "this should never happen" then it appears elog is more > appropriate than ereport because translations are not needed, right? > > ~ > > 35d. > The ERROR will make that goto cleanup unreachable, won't it? > > ~~~ > > 36. > + /* > + * Already existing slot but not ready (i.e. waiting for the primary > + * server to catch-up), lets attempt to make it sync-ready now. > + */ > > /lets/let's/ > > ~~~ > > 37. > + /* > + * Refer the slot creation part (last 'else' block) for more details > + * on this wait. > + */ > + if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn || > + TransactionIdPrecedes(remote_slot->catalog_xmin, > + MyReplicationSlot->data.catalog_xmin)) > + { > + if (!wait_for_primary_slot_catchup(wrconn, remote_slot, NULL)) > + { > + goto cleanup; > + } > + } > > 37a. > Having to jump forward to understand earlier code seems backward. IMO > there should be a big comment atop this module about this subject > which the comment here can just refer to. I will write more about this > topic later (below). > > ~ > > 37b. > The extra code curly braces are not needed. > > ~~~ > > 38. > + ereport(LOG, errmsg("newly locally created slot \"%s\" is sync-ready " > + "now", remote_slot->name)); > > Better to put the whole errmsg() on a newline instead of splitting the > string like that. > > ~~~ > > 39. > + /* User created slot with the same name exists, raise ERROR. */ > + else if (sync_state == SYNCSLOT_STATE_NONE) > + { > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("skipping sync of slot \"%s\" as it is a user created" > + " slot", remote_slot->name), > + errdetail("This slot has failover enabled on the primary and" > + " thus is sync candidate but user created slot with" > + " the same name already exists on the standby"))); > + } > > I felt it would be better to eliminate this case immediately up-front > when you first searched for the slot names. e.g. code like below. IIUC > this refactor also means the default sync_state can be assigned a > normal value (as I suggested above) instead of the strange assignment > to 0. I feel NULL character ('\0') is better default for local variable sync_slot as we specifically wanted it to be NULL if not assigned. Assigning it to 'SYNCSLOT_STATE_NONE' will be misleading. But moved the 'SYNCSLOT_STATE_NONE' related error though. 
> > + /* Search for the named slot */ > + if ((s = SearchNamedReplicationSlot(remote_slot->name, true))) > + { > + SpinLockAcquire(&s->mutex); > + sync_state = s->data.sync_state; > + SpinLockRelease(&s->mutex); > > INSERT HERE > + /* User-created slot with the same name exists, raise ERROR. */ > + if (sync_state == SYNCSLOT_STATE_NONE) > + ereport(ERROR, ... > + } > > ~~~ > > 40. > + /* Otherwise create the slot first. */ > + else > + { > > Insert a blank line above that comment for better readability (same as > done for earlier 'else' in this same function) > > ~~~ > > 41. > + ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL, > + remote_slot->two_phase, > + remote_slot->failover, > + SYNCSLOT_STATE_INITIATED); > + > + slot = MyReplicationSlot; > > In hindsight, the prior if/else code blocks in this function also > could have done "slot = MyReplicationSlot;" same as this -- then the > code would be much less verbose. > > ~~~ > > 42. > + SpinLockAcquire(&slot->mutex); > + slot->data.database = get_database_oid(remote_slot->database, false); > + > + namestrcpy(&slot->data.plugin, remote_slot->plugin); > + SpinLockRelease(&slot->mutex); > > IMO the code would be more readable *without* a blank line here > because the mutexed block is more obvious. > > ~~~ > > 43. > + /* > + * If the local restart_lsn and/or local catalog_xmin is ahead of > + * those on the remote then we cannot create the local slot in sync > + * with the primary server because that would mean moving the local > + * slot backwards and we might not have WALs retained for old LSN. In > + * this case we will wait for the primary server's restart_lsn and > + * catalog_xmin to catch up with the local one before attempting the > + * sync. > + */ > > 43a. > This comment describes some fundamental concepts about how this logic > works. I felt this and other comments like this should be at the top > of this slotsync.c file. Then anything that needs to mention about it > can refer to the top comment. For example, I also found other comments > like "... they are still waiting for the primary server to catch up." > to be difficult to understand without knowing these details, but I > think describing core design stuff up-front and saying "refer to the > comment atop the fil" probably would help a lot. > > ~ > > 43b. > Should "wait for the primary server's restart_lsn and..." be "wait for > the primary server slot's restart_lsn and..." ? > > ~~~ > > 44. > + { > + bool persist; > + > + if (!wait_for_primary_slot_catchup(wrconn, remote_slot, &persist)) > + { > + /* > + * The remote slot didn't catch up to locally reserved > + * position. > + * > + * We do not drop the slot because the restart_lsn can be > + * ahead of the current location when recreating the slot in > + * the next cycle. It may take more time to create such a > + * slot. Therefore, we persist it (provided remote-slot is > + * still valid) and attempt the wait and synchronization in > + * the next cycle. > + */ > + if (persist) > + { > + ReplicationSlotPersist(); > + *slot_updated = true; > + } > + > + goto cleanup; > + } > + } > > Looking at the way this 'persist' parameter is used I felt is it too > complicated. IIUC the wait_for_primary_slot_catchup can only return > *persist = true (for a false return) when it has reached/exceeded the > number of retries and still not yet caught up. Why should > wait_for_primary_slot_catchup() pretend to know about persistence? 
> > In other words, I thought a more meaningful parameter/variable name > (instead of 'persist') is something like 'wait_attempts_exceeded'. IMO > that will make wait_for_primary_slot_catchup() code easier, and here > you can just say like below, where the code matches the comment > better. Thoughts? > > + if (wait_attempts_exceeded) > + { > + ReplicationSlotPersist(); > + *slot_updated = true; > + } > yes, it will make code simpler. Changed it. > ~~~ > > 45. > + > + > + /* > + * Wait for primary is either not needed or is over. Update the lsns > + * and mark the slot as READY for further syncs. > + */ > > Double blank lines? > > ~~~ > > 46. > + ereport(LOG, errmsg("newly locally created slot \"%s\" is sync-ready " > + "now", remote_slot->name)); > + } > + > +cleanup: > > Better to put the whole errmsg() on a newline instead of splitting the > string like that. > > ~~~ > > 47. synchronize_slots > > +/* > + * Synchronize slots. > + * > + * Gets the failover logical slots info from the primary server and update > + * the slots locally. Creates the slots if not present on the standby. > + * > + * Returns nap time for the next sync-cycle. > + */ > +static long > +synchronize_slots(WalReceiverConn *wrconn) > > /update/updates/ > > ~~~ > > 48. > + /* The primary_slot_name is not set yet or WALs not received yet */ > + SpinLockAcquire(&WalRcv->mutex); > + if (!WalRcv || > + (WalRcv->slotname[0] == '\0') || > + XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) > + { > + SpinLockRelease(&WalRcv->mutex); > + return naptime; > + } > + SpinLockRelease(&WalRcv->mutex); > > Just wondering if the scenario of "WALS not received" is a bit more > like "no activity" so perhaps the naptime returned should be > WORKER_INACTIVITY_NAPTIME_MS here? > This may happen if walreceiver is temporarily having some issue. Longer nap is not recommended here. We should check the state again after a short nap. > > ~~~ > > 49. > + /* Construct query to get slots info from the primary server */ > + initStringInfo(&s); > + construct_slot_query(&s); > > I did not like the construct_slot_query() to be separated from this > function because it makes it too difficult to see if the slot_attr > numbers and column types in this function are correct w.r.t. that > query. IMO better when everything is in the same place where you can > see it all together. e.g. Less risk of breaking something if changes > are made. > > ~~~ > > 50. > + /* Construct the remote_slot tuple and synchronize each slot locally */ > + slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); > > Normally in all the other functions the variable 'slot' was the local > ReplicationSlot but IIUC here represents a remote tuple. Making a > different name would be better like 'remote_slottup' or something > else. > Have changed it to 'tupslot' to keep it short but different from slot. > ~~~ > > 51. > + /* > + * If any of the slots get updated in this sync-cycle, retain default > + * naptime and update 'last_update_time' in slot sync worker. But if no > + * activity is observed in this sync-cycle, then increase naptime provided > + * inactivity time reaches threshold. > + */ > > I think "retain" is a slightly wrong word here because it might have > been WORKER_INACTIVITY_NAPTIME_MS in the previous cycle. > > Maybe just /retain/use/ > > ~~~ > > 52. > +/* > + * Connects primary to validate the slot specified in primary_slot_name. > + * > + * Exits the worker if physical slot with the specified name does not exist. 
> + */ > +static void > +validate_primary_slot(WalReceiverConn *wrconn) > > There is already a connection, so not sure if this connect should be > saying "connects to"; Maybe is should be saying more like below: > > SUGGESTION > Using the specified primary server connection, validate if the > physical slot identified by GUC primary_slot_name exists. > > Exit the worker if the slot is not found. > > ~~~ > > 53. > + initStringInfo(&cmd); > + appendStringInfo(&cmd, > + "select count(*) = 1 from pg_replication_slots where " > + "slot_type='physical' and slot_name=%s", > + quote_literal_cstr(PrimarySlotName)); > > Write the SQL keywords in uppercase. > > ~~~ > > 54. > + if (res->status != WALRCV_OK_TUPLES) > + ereport(ERROR, > + (errmsg("could not fetch primary_slot_name info from the " > + "primary: %s", res->err))); > > Shouldn't the name of the unfound slot be shown in the ereport, or > will that already appear in the res->err? > > ~~~ > > 55. > + ereport(ERROR, > + errmsg("exiting slots synchronization as slot specified in " > + "primary_slot_name is not valid")); > + > > IMO the format should be the same as I suggested (later) for all the > validate_slotsync_parameters() errors. > > Also, I think the name of the unfound slot needs to be in this message. > > So maybe result is like this: > > SUGGESTION > > ereport(ERROR, > errmsg("exiting from slot synchronization due to bad configuration") > /* translator: second %s is a GUC variable name */ > errhint("The primary slot \"%s\" specified by %s is not valid.", > slot_name, "primary_slot_name") > ); > > ~~~ > > 56. > +/* > + * Checks if GUCs are set appropriately before starting slot sync worker > + */ > +static void > +validate_slotsync_parameters(char **dbname) > +{ > + /* > + * Since 'enable_syncslot' is ON, check that other GUC settings > + * (primary_slot_name, hot_standby_feedback, wal_level, primary_conninfo) > + * are compatible with slot synchronization. If not, raise ERROR. > + */ > + > > 56a. > I thought that 2nd comment sort of belonged in the function comment. > > ~ > > 56b. > It says "Since 'enable_syncslot' is ON", but I IIUC that is wrong > because the other function slotsync_reread_config() might detect a > change in this GUC and cause this validate_slotsync_parameters() to be > called when enable_syncslot was changed to false. > > In other words, I think you also need to check 'enable_syncslot' and > exit with appropriate ERROR same as all the other config problems. > > OTOH if this is not possible, then the slotsync_reread_config() might > need fixing instead. > 'enable_syncslot' is recently changed to PGC_POSTMASTER from PGC_SIGHUP. Thus 'slotsync_reread_config' also needs to get rid of 'enable_syncslot'. I have changed that now. Slightly changed the comment as well. > ~~~ > > 57. > + /* > + * A physical replication slot(primary_slot_name) is required on the > + * primary to ensure that the rows needed by the standby are not removed > + * after restarting, so that the synchronized slot on the standby will not > + * be invalidated. > + */ > + if (PrimarySlotName == NULL || strcmp(PrimarySlotName, "") == 0) > + ereport(ERROR, > + errmsg("exiting slots synchronization as primary_slot_name is " > + "not set")); > + > + /* > + * Hot_standby_feedback must be enabled to cooperate with the physical > + * replication slot, which allows informing the primary about the xmin and > + * catalog_xmin values on the standby. 
> + */ > + if (!hot_standby_feedback) > + ereport(ERROR, > + errmsg("exiting slots synchronization as hot_standby_feedback " > + "is off")); > + > + /* > + * Logical decoding requires wal_level >= logical and we currently only > + * synchronize logical slots. > + */ > + if (wal_level < WAL_LEVEL_LOGICAL) > + ereport(ERROR, > + errmsg("exiting slots synchronisation as it requires " > + "wal_level >= logical")); > + > + /* > + * The primary_conninfo is required to make connection to primary for > + * getting slots information. > + */ > + if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0) > + ereport(ERROR, > + errmsg("exiting slots synchronization as primary_conninfo " > + "is not set")); > + > + /* > + * The slot sync worker needs a database connection for walrcv_exec to > + * work. > + */ > + *dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); > + if (*dbname == NULL) > + ereport(ERROR, > + errmsg("exiting slots synchronization as dbname is not " > + "specified in primary_conninfo")); > + > +} > > IMO all these errors can be improved by: > - using a common format > - including errhint for the reason > - using the same tone for instructions on what to do (e.g saying must > be set, rather than what was not set) > > SUGGESTION (something like this) > > ereport(ERROR, > errmsg("exiting from slot synchronization due to bad configuration") > /* translator: %s is a GUC variable name */ > errhint("%s must be defined.", "primary_slot_name") > ); > > ereport(ERROR, > errmsg("exiting from slot synchronization due to bad configuration") > /* translator: %s is a GUC variable name */ > errhint("%s must be enabled.", "hot_standby_feedback") > ); > > ereport(ERROR, > errmsg("exiting from slot synchronization due to bad configuration") > /* translator: wal_level is a GUC variable name, 'logical' is a value */ > errhint("wal_level must be >= logical.") > ); > > ereport(ERROR, > errmsg("exiting from slot synchronization due to bad configuration") > /* translator: %s is a GUC variable name */ > errhint("%s must be defined.", "primary_conninfo") > ); > > ereport(ERROR, > errmsg("exiting from slot synchronization due to bad configuration") > /* translator: 'dbname' is a specific option; %s is a GUC variable name */ > errhint("'dbname' must be specified in %s.", "primary_conninfo") > ); > > ~~~ > > 58. > + *dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); > + if (*dbname == NULL) > + ereport(ERROR, > + errmsg("exiting slots synchronization as dbname is not specified in > primary_conninfo")); > + > +} > > Unnecessary blank line at the end of the function > > ~~~ > > 59. > +/* > + * Re-read the config file. > + * > + * If any of the slot sync GUCs changed, validate the values again > + * through validate_slotsync_parameters() which will exit the worker > + * if validaity fails. > + */ > > SUGGESTION > If any of the slot sync GUCs have changed, re-validate them. The > worker will exit if the check fails. > > ~~~ > > 60. > + char *conninfo = pstrdup(PrimaryConnInfo); > + char *slotname = pstrdup(PrimarySlotName); > + bool syncslot = enable_syncslot; > + bool standbyfeedback = hot_standby_feedback; > > For clarity, I would have used var names to match the old GUCs. > > e.g. > /conninfo/old_primary_conninfo/ > /slotname/old_primary_slot_name/ > /syncslot/old_enable_syncslot/ > /standbyfeedback/old_hot_standby_feedback/ > > ~~~ > > 61. > + dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); > + Assert(dbname); > > This code seems premature. 
IIUC this is only needed to detect that the > dbname was changed. But I think the prerequisite is first that the > conninfoChanged is true. So really this code should be guarded by if > (conninfoChanged) so it can be done later in the function. > Once PrimaryConnInfo is changed, we can not get old-dbname. So it is required to be done before we reach 'conninfoChanged' > ~~~ > > 62. > + if (conninfoChanged || slotnameChanged || > + (syncslot != enable_syncslot) || > + (standbyfeedback != hot_standby_feedback)) > + { > + revalidate = true; > + } > > SUGGESTION > > revalidate = conninfoChanged || slotnameChanged || > (syncslot != enable_syncslot) || > (standbyfeedback != hot_standby_feedback); > > ~~~ > > 63. > + /* > + * Since we have initialized this worker with old dbname, thus exit if > + * dbname changed. Let it get restarted and connect to new dbname > + * specified. > + */ > + if (conninfoChanged && strcmp(dbname, new_dbname) != 0) > + { > + ereport(ERROR, > + errmsg("exiting slot sync woker as dbname in " > + "primary_conninfo changed")); > + } > > 63a. > /old dbname/the old dbname/ > /new dbname/the new dbname/ > /woker/worker/ > > ~ > > 63b. > This code feels awkward. Can't this dbname check and accompanying > ERROR message be moved down into validate_slotsync_parameters(), so it > lives along with all the other GUC validation logic? Maybe you'll need > to change the validate_slotsync_parameters() parameters slightly but I > think it is much better to keep all the validation together. > > ~~~ > > 64. > + > + > +/* > + * Interrupt handler for main loop of slot sync worker. > + */ > +static void > +ProcessSlotSyncInterrupts(WalReceiverConn **wrconn) > > Double blank lines. > > ~~~ > > 65. > + > + > + if (ConfigReloadPending) > + slotsync_reread_config(); > +} > > Double blank lines > > ~~~ > > 66. slotsync_worker_onexit > > +static void > +slotsync_worker_onexit(int code, Datum arg) > +{ > + SpinLockAcquire(&SlotSyncWorker->mutex); > + SlotSyncWorker->pid = 0; > + SpinLockRelease(&SlotSyncWorker->mutex); > +} > > Should assignment use InvalidPid (-1) instead of 0? > > ~~~ > > 67. ReplSlotSyncWorkerMain > > + SpinLockAcquire(&SlotSyncWorker->mutex); > + > + Assert(SlotSyncWorker->pid == 0); > + > + /* Advertise our PID so that the startup process can kill us on promotion */ > + SlotSyncWorker->pid = MyProcPid; > + > + SpinLockRelease(&SlotSyncWorker->mutex); > > Shouldn't pid start as InvalidPid (-1) instead of Assert 0? > > ~~~ > > 68. > + /* Connect to the primary server */ > + wrconn = remote_connect(); > + > + /* > + * Connect to primary and validate the slot specified in > + * primary_slot_name. > + */ > + validate_primary_slot(wrconn); > > Maybe needs some slight rewording in the 2nd comment. "Connect to > primary server" is already said and done in the 1st part. > > ~~~ > > 69. IsSlotSyncWorker > > +/* > + * Is current process the slot sync worker? > + */ > +bool > +IsSlotSyncWorker(void) > +{ > + return SlotSyncWorker->pid == MyProcPid; > +} > > 69a. > For consistency with others like it, I thought this be called > IsLogicalSlotSyncWorker(). > > ~ > > 69b. > For consistency with the others like this, I think the extern should > be declared in logicalworker.h > > ~~~ > > 70. ShutDownSlotSync > > + SpinLockAcquire(&SlotSyncWorker->mutex); > + if (!SlotSyncWorker->pid) > + { > + SpinLockRelease(&SlotSyncWorker->mutex); > + return; > + } > > IMO should be comparing with InvalidPid (-1) here; not 0. > > ~~~ > > 71. > + SpinLockAcquire(&SlotSyncWorker->mutex); > + > + /* Is it gone? 
*/ > + if (!SlotSyncWorker->pid) > + break; > + > + SpinLockRelease(&SlotSyncWorker->mutex); > > Ditto. bad pids should be InvalidPid (-1), not 0. > > ~~~ > > 72. SlotSyncWorkerShmemInit > > + if (!found) > + { > + memset(SlotSyncWorker, 0, size); > + SpinLockInit(&SlotSyncWorker->mutex); > + } > > Probably here the unassigned pid should be set to InvalidPid (-1), not 0. > > ~~~ > > 73. SlotSyncWorkerRegister > > + if (!enable_syncslot) > + { > + ereport(LOG, > + errmsg("skipping slots synchronization as enable_syncslot is " > + "disabled.")); > + return; > + } > > /as/because/ > > ====== > src/backend/replication/logical/tablesync.c > > 74. > #include "commands/copy.h" > +#include "commands/subscriptioncmds.h" > #include "miscadmin.h" > > There were only #include changes but no code changes. Is the #include needed? > > ====== > src/backend/replication/slot.c There is some change in way headers are included. I need to review it in detail. Keeping it on hold. I tried to explain few points on this in [1] (see last comment) [1]: https://www.postgresql.org/message-id/CAJpy0uD6dWUvBgy8MGdugf_Am4pLXTL_vqcwSeHO13v%2BMzc9KA%40mail.gmail.com > > 75. ReplicationSlotCreate > > void > ReplicationSlotCreate(const char *name, bool db_specific, > ReplicationSlotPersistency persistency, > - bool two_phase, bool failover) > + bool two_phase, bool failover, char sync_state) > > The function comment goes to trouble to describe all the parameters > except for 'failover' and 'sync_slate'. I think a failover comment > should be added in patch 0001 and then the sync_state comment should > be added in patch 0002. > > ~~~ > > 76. > + /* > + * Do not allow users to drop the slots which are currently being synced > + * from the primary to the standby. > + */ > + if (user_cmd && RecoveryInProgress() && > + MyReplicationSlot->data.sync_state != SYNCSLOT_STATE_NONE) > + { > + ReplicationSlotRelease(); > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot drop replication slot \"%s\"", name), > + errdetail("This slot is being synced from the primary."))); > + } > > Should the errdetail give the current state? > I feel current info looks good. User can always use pg_replication_slots to monitor sync_state info. > > ====== > src/backend/tcop/postgres.c > > 77. > + else if (IsSlotSyncWorker()) > + { > + ereport(DEBUG1, > + (errmsg_internal("replication slot sync worker is shutting down due > to administrator command"))); > + > + /* > + * Slot sync worker can be stopped at any time. > + * Use exit status 1 so the background worker is restarted. > + */ > + proc_exit(1); > + } > > Explicitly saying "ereport(DEBUG1, errmsg_internal(..." is a bit > overkill; it is simpler to write this as "elog(DEBUG1, ....); > > ====== > src/include/replication/slot.h > > 78. > +/* The possible values for 'sync_state' in ReplicationSlotPersistentData */ > +#define SYNCSLOT_STATE_NONE 'n' /* None for user created slots */ > +#define SYNCSLOT_STATE_INITIATED 'i' /* Sync initiated for the slot but > + * not completed yet, waiting for > + * the primary server to catch-up */ > +#define SYNCSLOT_STATE_READY 'r' /* Initialization complete, ready > + * to be synced further */ > > Already questioned the same elsewhere. IIUC the same tri-state values > of other attributes might be used here too without needing to > introduce 3 new values. > > e.g. 
> > #define SYNCSLOT_STATE_DISABLED 'd' /* No syncing for this slot */ > #define SYNCSLOT_STATE_PENDING 'p' /* Sync is enabled but we must > wait for the primary server to catch up */ > #define SYNCSLOT_STATE_ENABLED 'e' /* Sync is enabled and the slot is > ready to be synced */ > responded in comment 11. > ~~~ > > 79. > + /* > + * Is this a slot created by a sync-slot worker? > + * > + * Relevant for logical slots on the physical standby. > + */ > + char sync_state; > + > > I assumed that "Relevant for" means "Only relevant for". It should say that. > > If correct, IMO a better field name might be 'standby_sync_state' > sync_state has some value for primary too. It is not null on primary. Thus current name seems a better choice. > ====== > src/test/recovery/t/050_verify_slot_order.pl > > 80. > +$backup_name = 'backup2'; > +$primary->backup($backup_name); > + > +# Create standby3 > +my $standby3 = PostgreSQL::Test::Cluster->new('standby3'); > +$standby3->init_from_backup( > + $primary, $backup_name, > + has_streaming => 1, > + has_restoring => 1); > > The mixture of 'backup2' for 'standby3' seems confusing. Is there a > reason to call it backup2? > > ~~~ > > 81. > +# Verify slot properties on the standby > +is( $standby3->safe_psql('postgres', > + q{SELECT failover, sync_state FROM pg_replication_slots WHERE > slot_name = 'lsub1_slot';} > + ), > + "t|r", > + 'logical slot has sync_state as ready and failover as true on standby'); > > It might be better if the message has the same order as the SQL. Eg. > "failover as true and sync_state as ready". > > ~~~ > > 82. > +# Verify slot properties on the primary > +is( $primary->safe_psql('postgres', > + q{SELECT failover, sync_state FROM pg_replication_slots WHERE > slot_name = 'lsub1_slot';} > + ), > + "t|n", > + 'logical slot has sync_state as none and failover as true on primary'); > + > > It might be better if the message has the same order as the SQL. Eg. > "failover as true and sync_state as none". > > ~~~ > > 83. > +# Test to confirm that restart_lsn of the logical slot on the primary > is synced to the standby > > IMO the major test parts (like this one) may need more highlighting "# > ---------------------" so those comments don't get lost among all the > other comments. > > ~~~ > > 84. > +# let the slots get synced on the standby > +sleep 2; > > Won't this make the test prone to failure on slow machines? Is there > not a more deterministic way to wait for the sync? > > ====== > Kind Regards, > Peter Smith. > Fujitsu Australia
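For reference, the uppercase form asked for in comment 53 above would read roughly as follows; the slot name literal is only a stand-in for whatever primary_slot_name happens to be set to:

    SELECT count(*) = 1 FROM pg_replication_slots
        WHERE slot_type = 'physical' AND slot_name = 'my_primary_slot';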
On Mon, Dec 11, 2023 at 2:21 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Dec 11, 2023 at 1:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, Dec 8, 2023 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Dec 6, 2023 at 4:53 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > PFA v43, changes are: > > > > > > > > > > I wanted to discuss 0003 patch about cascading standby's. It is not > > > clear to me whether we want to allow physical standbys to further wait > > > for cascading standby to sync their slots. If we allow such a feature > > > one may expect even primary to wait for all the cascading standby's > > > because otherwise still logical subscriber can be ahead of one of the > > > cascading standby. I feel even if we want to allow such a behaviour we > > > can do it later once the main feature is committed. I think it would > > > be good to just allow logical walsenders on primary to wait for > > > physical standbys represented by GUC 'standby_slot_names'. If we agree > > > on that then it would be good to prohibit setting this GUC on standby > > > or at least it should be a no-op even if this GUC should be set on > > > physical standby. > > > > > > Thoughts? > > > > IMHO, why not keep the behavior consistent across primary and standby? > > I mean if it doesn't require a lot of new code/design addition then > > it should be the user's responsibility. I mean if the user has set > > 'standby_slot_names' on standby then let standby also wait for > > cascading standby to sync their slots? Is there any issue with that > > behavior? > > > > Without waiting for cascading standby on primary, it won't be helpful > to just wait on standby. > > Currently logical walsenders on primary waits for physical standbys to > take changes before they update their own logical slots. But they wait > only for their immediate standbys and not for cascading standbys. > Although, on first standby, we do have logic where slot-sync workers > wait for cascading standbys before they update their own slots (synced > ones, see patch3). But this does not guarantee that logical > subscribers on primary will never be ahead of the cascading standbys. > Let us consider this timeline: > > t1: logical walsender on primary waiting for standby1 (first standby). > t2: physical walsender on standby1 is stuck and thus there is delay in > sending these changes to standby2 (cascading standby). > t3: standby1 has taken changes and sends confirmation to primary. > t4: logical walsender on primary receives confirmation from standby1 > and updates slot, logical subscribers of primary also receives the > changes. > t5: standby2 has not received changes yet as physical walsender on > standby1 is still stuck, slotsync worker still waiting for standby2 > (cascading) before it updates its own slots (synced ones). > t6: standby2 is promoted to become primary. > > Now we are in a state wherein primary, logical subscriber and first > standby has some changes but cascading standby does not. And logical > slots on primary were updated w/o confirming if cascading standby has > taken changes or not. This is a problem and we do not have a simple > solution for this yet. Okay, I think that makes sense. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 11, 2023 at 1:22 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On 12/8/23 10:06 AM, Amit Kapila wrote: > > On Wed, Dec 6, 2023 at 4:53 PM shveta malik <shveta.malik@gmail.com> wrote: > >> > >> PFA v43, changes are: > >> > > > > I wanted to discuss 0003 patch about cascading standby's. It is not > > clear to me whether we want to allow physical standbys to further wait > > for cascading standby to sync their slots. If we allow such a feature > > one may expect even primary to wait for all the cascading standby's > > because otherwise still logical subscriber can be ahead of one of the > > cascading standby. > > I've the same feeling here. I think it would probably be expected that > the primary also wait for all the cascading standby. > > > I feel even if we want to allow such a behaviour we > > can do it later once the main feature is committed. > > Agree. > > > I think it would > > be good to just allow logical walsenders on primary to wait for > > physical standbys represented by GUC 'standby_slot_names'. > > That makes sense for me for v1. > > > If we agree > > on that then it would be good to prohibit setting this GUC on standby > > or at least it should be a no-op even if this GUC should be set on > > physical standby. > > I'd prefer to completely prohibit it on standby (to make it very clear it's not > working at all) as long as one can enable it without downtime once the standby > is promoted (which is the case currently). And I think slot-sync worker should exit as well on cascading standby. Thoughts? If we agree on the above, then we need to look for a way to distinguish between first and cascading standby. I could not find any existing way to do so. One possible approach is to connect to the remote using PrimaryConninfo and run 'pg_is_in_recovery()' there, if it returns true, then it means we are cascading standby. Any simpler way to achieve this? thanks Shveta
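A minimal sketch of that check, assuming the slot sync worker simply issues it over the connection built from primary_conninfo (pg_is_in_recovery() is an existing function; reading a true result as "we are a cascading standby" is the interpretation proposed above):

    -- Run against the node that primary_conninfo points at; true means that
    -- node is itself in recovery, i.e. we are a cascading standby.
    SELECT pg_is_in_recovery();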
On Mon, Dec 11, 2023 at 2:41 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > 5. > > +synchronize_slots(WalReceiverConn *wrconn) > > { > > ... > > ... > > + /* The syscache access needs a transaction env. */ > > + StartTransactionCommand(); > > + > > + /* > > + * Make result tuples live outside TopTransactionContext to make them > > + * accessible even after transaction is committed. > > + */ > > + MemoryContextSwitchTo(oldctx); > > + > > + /* Construct query to get slots info from the primary server */ > > + initStringInfo(&s); > > + construct_slot_query(&s); > > + > > + elog(DEBUG2, "slot sync worker's query:%s \n", s.data); > > + > > + /* Execute the query */ > > + res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow); > > + pfree(s.data); > > + > > + if (res->status != WALRCV_OK_TUPLES) > > + ereport(ERROR, > > + (errmsg("could not fetch failover logical slots info " > > + "from the primary server: %s", res->err))); > > + > > + CommitTransactionCommand(); > > ... > > ... > > } > > > > Where exactly in the above code, there is a syscache access as > > mentioned above StartTransactionCommand()? > > > > It is in walrcv_exec (libpqrcv_processTuples). I have changed the > comments to add this info. > Okay, I see that the patch switches context twice once after starting the transaction and the second time after committing the transaction, why is that required? Also, can't we extend the duration of the transaction till the remote_slot information is constructed? I am asking this because the context used is TopMemoryContext which should be used only if we need something specific to be retained at the process level which doesn't seem to be the case here. I have noticed a few other minor things: 1. postgres=# select * from pg_logical_slot_get_changes('log_slot_2', NULL, NULL); ERROR: cannot use replication slot "log_slot_2" for logical decoding DETAIL: This slot is being synced from the primary server. ... ... postgres=# select * from pg_drop_replication_slot('log_slot_2'); ERROR: cannot drop replication slot "log_slot_2" DETAIL: This slot is being synced from the primary. I think the DETAIL message should be the same in the above two cases. 2. +void +WalSndWaitForStandbyConfirmation(XLogRecPtr wait_for_lsn) +{ + List *standby_slots; + + Assert(!am_walsender); + + if (!MyReplicationSlot->data.failover) + return; + + standby_slots = GetStandbySlotList(true); + + ConditionVariablePrepareToSleep(&WalSndCtl->wal_confirm_rcv_cv); ... ... Shouldn't we return if the standby slot names list is NIL unless there is a reason to do ConditionVariablePrepareToSleep() or any of the code following it? -- With Regards, Amit Kapila.
On Mon, Dec 11, 2023 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Dec 11, 2023 at 2:41 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > 5. > > > +synchronize_slots(WalReceiverConn *wrconn) > > > { > > > ... > > > ... > > > + /* The syscache access needs a transaction env. */ > > > + StartTransactionCommand(); > > > + > > > + /* > > > + * Make result tuples live outside TopTransactionContext to make them > > > + * accessible even after transaction is committed. > > > + */ > > > + MemoryContextSwitchTo(oldctx); > > > + > > > + /* Construct query to get slots info from the primary server */ > > > + initStringInfo(&s); > > > + construct_slot_query(&s); > > > + > > > + elog(DEBUG2, "slot sync worker's query:%s \n", s.data); > > > + > > > + /* Execute the query */ > > > + res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow); > > > + pfree(s.data); > > > + > > > + if (res->status != WALRCV_OK_TUPLES) > > > + ereport(ERROR, > > > + (errmsg("could not fetch failover logical slots info " > > > + "from the primary server: %s", res->err))); > > > + > > > + CommitTransactionCommand(); > > > ... > > > ... > > > } > > > > > > Where exactly in the above code, there is a syscache access as > > > mentioned above StartTransactionCommand()? > > > > > > > It is in walrcv_exec (libpqrcv_processTuples). I have changed the > > comments to add this info. > > > > Okay, I see that the patch switches context twice once after starting > the transaction and the second time after committing the transaction, > why is that required? Also, can't we extend the duration of the > transaction till the remote_slot information is constructed? If we extend duration, we have to extend till remote_slot information is consumed and not only till it is constructed. > I am > asking this because the context used is TopMemoryContext which should > be used only if we need something specific to be retained at the > process level which doesn't seem to be the case here. > Okay, I understand your concern. But this needs more thoughts on shall we have all the slots synchronized in one txn or is it better to have it existing way i.e. each slot being synchronized in its own txn started in synchronize_one_slot. If we go by the former, can it have any implications? I need to review this bit more before concluding. . > I have noticed a few other minor things: > 1. > postgres=# select * from pg_logical_slot_get_changes('log_slot_2', NULL, NULL); > ERROR: cannot use replication slot "log_slot_2" for logical decoding > DETAIL: This slot is being synced from the primary server. > ... > ... > postgres=# select * from pg_drop_replication_slot('log_slot_2'); > ERROR: cannot drop replication slot "log_slot_2" > DETAIL: This slot is being synced from the primary. > > I think the DETAIL message should be the same in the above two cases. > > 2. > +void > +WalSndWaitForStandbyConfirmation(XLogRecPtr wait_for_lsn) > +{ > + List *standby_slots; > + > + Assert(!am_walsender); > + > + if (!MyReplicationSlot->data.failover) > + return; > + > + standby_slots = GetStandbySlotList(true); > + > + ConditionVariablePrepareToSleep(&WalSndCtl->wal_confirm_rcv_cv); > ... > ... > > Shouldn't we return if the standby slot names list is NIL unless there > is a reason to do ConditionVariablePrepareToSleep() or any of the code > following it? > > -- > With Regards, > Amit Kapila.
On Monday, December 11, 2023 3:52 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote: Hi, > On 12/8/23 10:06 AM, Amit Kapila wrote: > > On Wed, Dec 6, 2023 at 4:53 PM shveta malik <shveta.malik@gmail.com> > wrote: > >> > >> PFA v43, changes are: > >> > > > > I wanted to discuss 0003 patch about cascading standby's. It is not > > clear to me whether we want to allow physical standbys to further wait > > for cascading standby to sync their slots. If we allow such a feature > > one may expect even primary to wait for all the cascading standby's > > because otherwise still logical subscriber can be ahead of one of the > > cascading standby. > > I've the same feeling here. I think it would probably be expected that the > primary also wait for all the cascading standby. > > > I feel even if we want to allow such a behaviour we can do it later > > once the main feature is committed. > > Agree. > > > I think it would > > be good to just allow logical walsenders on primary to wait for > > physical standbys represented by GUC 'standby_slot_names'. > > That makes sense for me for v1. > > > If we agree > > on that then it would be good to prohibit setting this GUC on standby > > or at least it should be a no-op even if this GUC should be set on > > physical standby. > > I'd prefer to completely prohibit it on standby (to make it very clear it's not > working at all) as long as one can enable it without downtime once the standby > is promoted (which is the case currently). I think we could not check if we are in a standby server in the GUC check_hook, because the XLogCtl(which is checked in RecoveryInProgress) may have not been initialized yet. Besides, other GUCs like synchronous_standby_names also don't work on standby but it will be no-op. So I feel we can also ignore standby_slot_names on standby. What do you think ? Here is the V46 patch set which changed the following things: V46-0001: * Address Peter[1] and Amit's[2] comments. * Fix one CFbot failure in meson build. * Ignore the standby_slot_names on a standby server since we don't support syncing slots to cascade standby. V46-0002: 1) Fix for CFBot make warning. 2) Cascading support removal. Now we do not need to check 'sync_state != 'i'' in the query while fetching failover slots. This check was needed on the cascading standby to fetch failover slots from the first standby. 3) Test correction and optimization. 0003 patch is removed since we agreed not to support syncing slots to cascading standby. Thanks Shveta for working on the changes in V46-0002 and thanks Ajin for working on the test optimization. -- TODO There are few pending comments that mentioned in [3][4][5] which are still in progress. [1] https://www.postgresql.org/message-id/CAHut%2BPsf9z132WNgy0Gr10ZTnonpNjvTBj74wG8kSxXU4rOD7g%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAA4eK1%2BCXpfiTLbYRaOoUBP9Z1-xJZdX6QOp14rCdaF5E2gsgQ%40mail.gmail.com [3] https://www.postgresql.org/message-id/CAJpy0uDaGMNpgmdxie-MgHmMhnD4ET_LDjQNEe76xJ%2BMLqRQ8Q%40mail.gmail.com [4] https://www.postgresql.org/message-id/CAJpy0uDcOf5Hvk_CdCCAbfx9SY%2Bog%3D%3D%3DtgiuhWKzkYyqebui9g%40mail.gmail.com [5] https://www.postgresql.org/message-id/CAJpy0uC-8mrn6jakcFjSVmbJiHZs-Okq8YKxGfrMLPD-2%3DwOqQ%40mail.gmail.com Best Regards, Hou zj
On Monday, December 11, 2023 3:32 PM Peter Smith <smithpb2250@gmail.com> > > Here are some review comments for v44-0001 > > ====== > src/backend/replication/slot.c > > > 1. ReplicationSlotCreate > > * during getting changes, if the two_phase option is enabled it can skip > * prepare because by that time start decoding point has been moved. So > the > * user will only get commit prepared. > + * failover: Allows the slot to be synced to physical standbys so that logical > + * replication can be resumed after failover. > */ > void > ReplicationSlotCreate(const char *name, bool db_specific, > > ~ > > /Allows the slot.../If enabled, allows the slot.../ Changed. > > ====== > > 2. validate_standby_slots > > +validate_standby_slots(char **newval) > +{ > + char *rawname; > + List *elemlist; > + ListCell *lc; > + bool ok = true; > + > + /* Need a modifiable copy of string */ rawname = pstrdup(*newval); > + > + /* Verify syntax and parse string into list of identifiers */ if (!(ok > + = SplitIdentifierString(rawname, ',', &elemlist))) > + GUC_check_errdetail("List syntax is invalid."); > + > + /* > + * If there is a syntax error in the name or if the replication slots' > + * data is not initialized yet (i.e., we are in the startup process), > + skip > + * the slot verification. > + */ > + if (!ok || !ReplicationSlotCtl) > + { > + pfree(rawname); > + list_free(elemlist); > + return ok; > + } > > > 2a. > You don't need to initialize 'ok' during declaration because it is assigned > immediately anyway. > > ~ > > 2b. > AFAIK assignment within a conditional like this is not a normal PG coding style > unless there is no other way to do it. > Changed. > ~ > > 2c. > /into list/into a list/ > > SUGGESTION > /* Verify syntax and parse string into a list of identifiers */ ok = > SplitIdentifierString(rawname, ',', &elemlist); if (!ok) > GUC_check_errdetail("List syntax is invalid."); > > Changed. > ~~~ > > 3. assign_standby_slot_names > > + if (!SplitIdentifierString(standby_slot_names_cpy, ',', > + &standby_slots)) { > + /* This should not happen if GUC checked check_standby_slot_names. */ > + elog(ERROR, "list syntax is invalid"); } > > This error here and in validate_standby_slots() are different -- "list" versus > "List". > The message has been changed to "invalid list syntax" to be consistent with other elog. > ====== > src/backend/replication/walsender.c > > > 4. WalSndFilterStandbySlots > > > + foreach(lc, standby_slots_cpy) > + { > + char *name = lfirst(lc); > + XLogRecPtr restart_lsn = InvalidXLogRecPtr; bool invalidated = false; > + char *warningfmt = NULL; > + ReplicationSlot *slot; > + > + slot = SearchNamedReplicationSlot(name, true); > + > + if (slot && SlotIsPhysical(slot)) > + { > + SpinLockAcquire(&slot->mutex); > + restart_lsn = slot->data.restart_lsn; > + invalidated = slot->data.invalidated != RS_INVAL_NONE; > + SpinLockRelease(&slot->mutex); } > + > + /* Continue if the current slot hasn't caught up. 
*/ if (!invalidated > + && !XLogRecPtrIsInvalid(restart_lsn) && restart_lsn < wait_for_lsn) { > + /* Log warning if no active_pid for this physical slot */ if > + (slot->active_pid == 0) ereport(WARNING, errmsg("replication slot > + \"%s\" specified in parameter \"%s\" does > not have active_pid", > + name, "standby_slot_names"), > + errdetail("Logical replication is waiting on the " > + "standby associated with \"%s\"", name), errhint("Consider starting > + standby associated with " > + "\"%s\" or amend standby_slot_names", name)); > + > + continue; > + } > + else if (!slot) > + { > + /* > + * It may happen that the slot specified in standby_slot_names GUC > + * value is dropped, so let's skip over it. > + */ > + warningfmt = _("replication slot \"%s\" specified in parameter > \"%s\" does not exist, ignoring"); > + } > + else if (SlotIsLogical(slot)) > + { > + /* > + * If a logical slot name is provided in standby_slot_names, issue > + * a WARNING and skip it. Although logical slots are disallowed in > + * the GUC check_hook(validate_standby_slots), it is still > + * possible for a user to drop an existing physical slot and > + * recreate a logical slot with the same name. Since it is > + * harmless, a WARNING should be enough, no need to error-out. > + */ > + warningfmt = _("cannot have logical replication slot \"%s\" in > parameter \"%s\", ignoring"); > + } > + else if (XLogRecPtrIsInvalid(restart_lsn) || invalidated) { > + /* > + * Specified physical slot may have been invalidated, so there is no > + point > + * in waiting for it. > + */ > + warningfmt = _("physical slot \"%s\" specified in parameter \"%s\" > has been invalidated, ignoring"); > + } > + else > + { > + Assert(restart_lsn >= wait_for_lsn); > + } > > This if/else chain seems structured awkwardly. IMO it would be tidier to > eliminate the NULL slot and IsLogicalSlot up-front, which would also simplify > some of the subsequent conditions > > SUGGESTION > > slot = SearchNamedReplicationSlot(name, true); > > if (!slot) > { > ... > } > else if (SlotIsLogical(slot)) > { > ... > } > else > { > Assert(SlotIsPhysical(slot)) > > SpinLockAcquire(&slot->mutex); > restart_lsn = slot->data.restart_lsn; > invalidated = slot->data.invalidated != RS_INVAL_NONE; > SpinLockRelease(&slot->mutex); > > if (XLogRecPtrIsInvalid(restart_lsn) || invalidated) > { > ... > } > else if (!invalidated && !XLogRecPtrIsInvalid(restart_lsn) && restart_lsn < > wait_for_lsn) > { > ... > } > else > { > Assert(restart_lsn >= wait_for_lsn); > } > } > Changed. > ~~~~ > > 5. WalSndWaitForWal > > + else > + { > + /* already caught up and doesn't need to wait for standby_slots */ > break; > + } > > /Already/already/ > Changed. > ====== > src/test/recovery/t/050_standby_failover_slots_sync.pl > > > 6. > +$subscriber1->safe_psql('postgres', > + "CREATE TABLE tab_int (a int PRIMARY KEY);"); > + > +# Create a subscription with failover = true > +$subscriber1->safe_psql('postgres', > + "CREATE SUBSCRIPTION regress_mysub1 CONNECTION > '$publisher_connstr' " > + . "PUBLICATION regress_mypub WITH (slot_name = lsub1_slot, > failover = true);" > +); > > > Consider combining these DDL statements. > Changed. > ~~~ > > 7. > +$subscriber2->safe_psql('postgres', > + "CREATE TABLE tab_int (a int PRIMARY KEY);"); > +$subscriber2->safe_psql('postgres', > + "CREATE SUBSCRIPTION regress_mysub2 CONNECTION > '$publisher_connstr' " > + . "PUBLICATION regress_mypub WITH (slot_name = lsub2_slot);"); > > Consider combining these DDL statements > Changed. > ~~~ > > 8. 
> +# Stop the standby associated with specified physical replication slot > +so that # the logical replication slot won't receive changes until the > +standby slot's # restart_lsn is advanced or the slots is removed from > +the standby_slot_names # list $publisher->safe_psql('postgres', > +"TRUNCATE tab_int;"); $publisher->wait_for_catchup('regress_mysub1'); > +$standby1->stop; > > /with specified/with the specified/ > > /or the slots is/or the slot is/ > Changed. > ~~~ > > 9. > +# Create some data on primary > > /on primary/on the primary/ > Changed. > ~~~ > > 10. > +$result = > + $subscriber1->safe_psql('postgres', "SELECT count(*) = 10 FROM > +tab_int;"); is($result, 't', > + "subscriber1 doesn't get data as the sb1_slot doesn't catch up"); > > > I felt instead of checking for 10 maybe it's more consistent with the previous > code to assign again that $primary_row_count variable to 20; > > Then check that those primary rows are not all yet received like: > > SELECT count(*) < $primary_row_count FROM tab_int; > I think we'd better check the accurate number here to make sure the number is what we expect. > ~~~ > > 11. > +# Now that the standby lsn has advanced, primary must send the decoded > +# changes to the subscription. > +$publisher->wait_for_catchup('regress_mysub1'); > +$result = > + $subscriber1->safe_psql('postgres', "SELECT count(*) = 20 FROM > +tab_int;"); is($result, 't', > + "subscriber1 gets data from primary after standby1 is removed from > the standby_slot_names list" > +); > > /primary must/the primary must/ > > (continuing the suggestion from the previous review comment) > > Now this SQL can use the variable too: > > subscriber1->safe_psql('postgres', "SELECT count(*) = > $primary_row_count FROM tab_int;"); > Changed. > ~~~ > > 12. > + > +# Create another subscription enabling failover > +$subscriber1->safe_psql('postgres', > + "CREATE SUBSCRIPTION regress_mysub3 CONNECTION > '$publisher_connstr' " > + . "PUBLICATION regress_mypub WITH (slot_name = lsub3_slot, > copy_data=false, failover = true, create_slot = false);" > +); > > > Maybe give some more information in that comment: > > SUGGESTION > Create another subscription (using the same slot created above) that enables > failover. > Added. Best Regards, Hou zj
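For what it is worth, the "combine these DDL statements" suggestions (comments 6 and 7 above) simply amount to sending both statements in one safe_psql() call, something along these lines, with '$publisher_connstr' standing for the test's publisher connection string:

    CREATE TABLE tab_int (a int PRIMARY KEY);
    CREATE SUBSCRIPTION regress_mysub1 CONNECTION '$publisher_connstr'
        PUBLICATION regress_mypub
        WITH (slot_name = lsub1_slot, failover = true);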
A review on v45 patch: If one creates a logical slot with failover=true as - select pg_create_logical_replication_slot('logical_slot','pgoutput', false, true, true); Then, uses the existing logical slot while creating a subscription - postgres=# create subscription sub4 connection 'dbname=postgres host=localhost port=5433' publication pub1t4 WITH (slot_name=logical_slot, create_slot=false, failover=true); NOTICE: changed the failover state of replication slot "logical_slot" on publisher to false CREATE SUBSCRIPTION Despite configuring logical_slot's failover to true and specifying failover=true during subscription creation, the NOTICE indicates a change in the failover state to 'false', without providing any explanation for this transition. It can be confusing for users, so IMO, the notice should include the reason for switching failover to 'false' or should give a hint to use either refresh=false or copy_data=false to enable failover=true for the slot as we do in other similar 'alter subscription...' scenarios. -- Thanks & Regards, Nisha
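A minimal sketch of the workaround such a hint would point at, reusing the names from the report above and assuming copy_data = false is what lets the existing slot keep failover = true, as the similar ALTER SUBSCRIPTION hints imply:

    CREATE SUBSCRIPTION sub4
        CONNECTION 'dbname=postgres host=localhost port=5433'
        PUBLICATION pub1t4
        WITH (slot_name = logical_slot, create_slot = false,
              failover = true, copy_data = false);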
On Tue, Dec 12, 2023 at 2:44 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Dec 11, 2023 at 7:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > I am > > asking this because the context used is TopMemoryContext which should > > be used only if we need something specific to be retained at the > > process level which doesn't seem to be the case here. > > > > Okay, I understand your concern. But this needs more thoughts on shall > we have all the slots synchronized in one txn or is it better to have > it existing way i.e. each slot being synchronized in its own txn > started in synchronize_one_slot. If we go by the former, can it have > any implications? > I think the one advantage of syncing each slot in a different transaction could have been if that helps with the visibility of updated slot information but that is not the case here as we always persist it to file. As per my understanding, here we need a transaction as we may access catalogs while creating/updating slots, so, a single transaction should be okay unless there are any other reasons. -- With Regards, Amit Kapila.
On Mon, Dec 11, 2023 at 5:13 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Dec 11, 2023 at 1:22 PM Drouvot, Bertrand > <bertranddrouvot.pg@gmail.com> wrote: > > > > > If we agree > > > on that then it would be good to prohibit setting this GUC on standby > > > or at least it should be a no-op even if this GUC should be set on > > > physical standby. > > > > I'd prefer to completely prohibit it on standby (to make it very clear it's not > > working at all) as long as one can enable it without downtime once the standby > > is promoted (which is the case currently). > > And I think slot-sync worker should exit as well on cascading standby. Thoughts? > I think one has set all the valid parameters for the slot-sync worker on standby, we should not exit, rather it should be no-op which means it should not try to sync slots from another standby. One scenario where this may help is when users promote the standby which has already synced slots from the primary. In this case, cascading standby will become non-cascading and should sync slots. -- With Regards, Amit Kapila.
On Tue, Dec 12, 2023 at 5:56 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > A review on v45 patch: > > If one creates a logical slot with failover=true as - > select pg_create_logical_replication_slot('logical_slot','pgoutput', > false, true, true); > > Then, uses the existing logical slot while creating a subscription - > postgres=# create subscription sub4 connection 'dbname=postgres > host=localhost port=5433' publication pub1t4 WITH > (slot_name=logical_slot, create_slot=false, failover=true); > NOTICE: changed the failover state of replication slot "logical_slot" > on publisher to false > CREATE SUBSCRIPTION > > Despite configuring logical_slot's failover to true and specifying > failover=true during subscription creation, the NOTICE indicates a > change in the failover state to 'false', without providing any > explanation for this transition. > It can be confusing for users, so IMO, the notice should include the > reason for switching failover to 'false' or should give a hint to use > either refresh=false or copy_data=false to enable failover=true for > the slot as we do in other similar 'alter subscription...' scenarios. > Agree. The NOTICE should be more informative. thanks SHveta
On Wed, Dec 13, 2023 at 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Dec 11, 2023 at 5:13 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Mon, Dec 11, 2023 at 1:22 PM Drouvot, Bertrand > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > If we agree > > > > on that then it would be good to prohibit setting this GUC on standby > > > > or at least it should be a no-op even if this GUC should be set on > > > > physical standby. > > > > > > I'd prefer to completely prohibit it on standby (to make it very clear it's not > > > working at all) as long as one can enable it without downtime once the standby > > > is promoted (which is the case currently). > > > > And I think slot-sync worker should exit as well on cascading standby. Thoughts? > > > > I think one has set all the valid parameters for the slot-sync worker > on standby, we should not exit, rather it should be no-op which means > it should not try to sync slots from another standby. One scenario > where this may help is when users promote the standby which has > already synced slots from the primary. In this case, cascading standby > will become non-cascading and should sync slots. > Right, then perhaps we should increase naptime in this no-op case. It could be even more then current inactivity naptime which is just 10sec. Shall it be say 5min in this case? thanks Shveta
Hi Shveta, here are some review comments for v45-0002. ====== doc/src/sgml/bgworker.sgml 1. + <variablelist> + <varlistentry> + <term><literal>BgWorkerStart_PostmasterStart</literal></term> + <listitem> + <para> + <indexterm><primary>BgWorkerStart_PostmasterStart</primary></indexterm> + Start as soon as postgres itself has finished its own initialization; + processes requesting this are not eligible for database connections. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term><literal>BgWorkerStart_ConsistentState</literal></term> + <listitem> + <para> + <indexterm><primary>BgWorkerStart_ConsistentState</primary></indexterm> + Start as soon as a consistent state has been reached in a hot-standby, + allowing processes to connect to databases and run read-only queries. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term><literal>BgWorkerStart_RecoveryFinished</literal></term> + <listitem> + <para> + <indexterm><primary>BgWorkerStart_RecoveryFinished</primary></indexterm> + Start as soon as the system has entered normal read-write state. Note + that the <literal>BgWorkerStart_ConsistentState</literal> and + <literal>BgWorkerStart_RecoveryFinished</literal> are equivalent + in a server that's not a hot standby. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term><literal>BgWorkerStart_ConsistentState_HotStandby</literal></term> + <listitem> + <para> + <indexterm><primary>BgWorkerStart_ConsistentState_HotStandby</primary></indexterm> + Same meaning as <literal>BgWorkerStart_ConsistentState</literal> but + it is more strict in terms of the server i.e. start the worker only + if it is hot-standby. + </para> + </listitem> + </varlistentry> + </variablelist> Maybe reorder these slightly, because I felt it is better if the BgWorkerStart_ConsistentState_HotStandby comes next after BgWorkerStart_ConsistentState, which it refers to For example:: 1st.BgWorkerStart_PostmasterStart 2nd.BgWorkerStart_ConsistentState 3rd.BgWorkerStart_ConsistentState_HotStandby 4th.BgWorkerStart_RecoveryFinished ====== doc/src/sgml/config.sgml 2. <varname>enable_syncslot</varname> = true Not sure, but I thought the "= true" part should be formatted too. SUGGESTION <literal>enable_syncslot = true</literal> ====== doc/src/sgml/logicaldecoding.sgml 3. + <para> + A logical replication slot on the primary can be synchronized to the hot + standby by enabling the failover option during slot creation and setting + <varname>enable_syncslot</varname> on the standby. For the synchronization + to work, it is mandatory to have a physical replication slot between the + primary and the standby. It's highly recommended that the said physical + replication slot is listed in <varname>standby_slot_names</varname> on + the primary to prevent the subscriber from consuming changes faster than + the hot standby. Additionally, <varname>hot_standby_feedback</varname> + must be enabled on the standby for the slots synchronization to work. + </para> I felt those parts that describe the mandatory GUCs should be kept together. SUGGESTION For the synchronization to work, it is mandatory to have a physical replication slot between the primary and the standby, and <varname>hot_standby_feedback</varname> must be enabled on the standby. It's also highly recommended that the said physical replication slot is named in <varname>standby_slot_names</varname> list on the primary, to prevent the subscriber from consuming changes faster than the hot standby. ~~~ 4. 
(Chapter 49) By enabling synchronization of slots, logical replication can be resumed after failover depending upon the pg_replication_slots.sync_state for the synchronized slots on the standby at the time of failover. Only slots that were in ready sync_state ('r') on the standby before failover can be used for logical replication after failover. However, the slots which were in initiated sync_state ('i') and not sync-ready ('r') at the time of failover will be dropped and logical replication for such slots can not be resumed after failover. This applies to the case where a logical subscription is disabled before failover and is enabled after failover. If the synchronized slot due to disabled subscription could not be made sync-ready ('r') on standby, then the subscription can not be resumed after failover even when enabled. If the primary is idle, then the synchronized slots on the standby may take a noticeable time to reach the ready ('r') sync_state. This can be sped up by calling the pg_log_standby_snapshot function on the primary. ~ Somehow, I still felt all that was too wordy/repetitive. Below is my attempt to make it more concise. Thoughts? SUGGESTION The ability to resume logical replication after failover depends upon the pg_replication_slots.sync_state value for the synchronized slots on the standby at the time of failover. Only slots that have attained a "ready" sync_state ('r') on the standby before failover can be used for logical replication after failover. Slots that have not yet reached 'r' state (they are still 'i') will be dropped, therefore logical replication for those slots cannot be resumed. For example, if the synchronized slot could not become sync-ready on standby due to a disabled subscription, then the subscription cannot be resumed after failover even when it is enabled. If the primary is idle, the synchronized slots on the standby may take a noticeable time to reach the ready ('r') sync_state. This can be sped up by calling the pg_log_standby_snapshot function on the primary. ====== doc/src/sgml/system-views.sgml 5. + <para> + Defines slot synchronization state. This is meaningful on the physical + standby which has configured <varname>enable_syncslot</varname> = true + </para> As mentioned in the previous review comment ([1]#10) I thought it might be good to include a hyperlink cross-reference to the 'enable_syncslot' GUC. ~~~ 6. + <para> + The hot standby can have any of these sync_state for the slots but on a + hot standby, the slots with state 'r' and 'i' can neither be used for + logical decoding nor dropped by the user. The primary server will have + sync_state as 'n' for all the slots. But if the standby is promoted to + become the new primary server, sync_state can be seen 'r' as well. On + this new primary server, slots with sync_state as 'r' and 'n' will + behave the same. + </para></entry> 6a. /these sync_state for the slots/these sync_state values for the slots/ ~ 6b Hm. I still felt (same as previous review [1]#12b) that there seems too much information here. IIUC the sync_state is only meaningful on the standby. Sure, it might have some values line 'n' or 'r' on the primary also, but those either mean nothing ('n') or are leftover states from a previous failover from a standby ('r'), which also means nothing. So can't we just say it more succinctly like that? SUGGESTION The sync_state has no meaning on the primary server; the primary sync_state value is default 'n' for all slots but may (if leftover from a promoted standby) also be 'r'. 
====== .../libpqwalreceiver/libpqwalreceiver.c 7. static void libpqrcv_alter_slot(WalReceiverConn *conn, const char *slotname, - bool failover) + bool failover) Still seems to be tampering with indentation that should only be in patch 0001. ====== src/backend/replication/logical/slotsync.c 8. wait_for_primary_slot_catchup The meaning of the boolean return of this function is still not described by the function comment. ~~~ 9. + * If passed, *wait_attempts_exceeded will be set to true only if this + * function exits after exhausting its wait attempts. It will be false + * in all the other cases like failure, remote-slot invalidation, primary + * could catch up. The above already says when a return false happens, so it seems overkill to give more information. SUGGESTION If passed, *wait_attempts_exceeded will be set to true only if this function exits due to exhausting its wait attempts. It will be false in all the other cases. ~~~ 10. +static bool +wait_for_primary_slot_catchup(WalReceiverConn *wrconn, RemoteSlot *remote_slot, + bool *wait_attempts_exceeded) +{ +#define WAIT_OUTPUT_COLUMN_COUNT 4 +#define WORKER_PRIMARY_CATCHUP_WAIT_ATTEMPTS 5 + 10a Maybe the long constant name is too long. How about WAIT_PRIMARY_CATCHUP_ATTEMPTS? ~~~ 10b. IMO it is better to Assert the input value of this kind of side-effect return parameter, to give a better understanding and to prevent future accidents. SUGGESTION Assert(wait_attempts_exceeded == NULL |} *wait_attempts_exceeded == false); ~~~ 11. synchronize_one_slot + ReplicationSlot *s; + ReplicationSlot *slot; + char sync_state = '\0'; 11a. I don't think you need both 's' and 'slot' ReplicationSlot -- it looks a bit odd. Can't you just reuse the one 'slot' variable? ~ 11b. Also, maybe those assignment like + slot = MyReplicationSlot; can have an explanatory comment like: /* For convenience, we assign MyReplicationSlot to a shorter variable name. */ ~~~ 12. +static void +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, + bool *slot_updated) +{ + ReplicationSlot *s; + ReplicationSlot *slot; + char sync_state = '\0'; In my previous review [1]#33a I thought it was strange to assign the sync_state (which is essentially an enum) to some meaningless value, so I suggested it should be set to SYNCSLOT_STATE_NONE in the declaration. The reply [2] was "No, that will change the flow. It should stay uninitialized if the slot is not found." But I am not convinced there is any flow problem. Also, SYNCSLOT_STATE_NONE seems the naturally correct default for something with no state. It cannot be found and be SYNCSLOT_STATE_NONE at the same time (that is reported as an ERROR "skipping sync of slot") so I see no problem. The CURRENT code is like this: /* Slot created by the slot sync worker exists, sync it */ if (sync_state) { Assert(sync_state == SYNCSLOT_STATE_READY || sync_state == SYNCSLOT_STATE_INITIATED); ... } /* Otherwise create the slot first. */ else { ... } AFAICT that could easily be changed to like below, with no change to the logic, and it avoids setting strange values. SUGGESTION. if (sync_state == SYNCSLOT_STATE_NONE) { /* Slot not found. Create it. */ .. } else { Assert(sync_state == SYNCSLOT_STATE_READY || sync_state == SYNCSLOT_STATE_INITIATED); ... } ~~~ 13. synchronize_one_slot +static void +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, + bool *slot_updated) This *slot_updated parameter looks dubious. It is used in a loop from the caller to mean that ANY slot was updated -- e.g. 
maybe it is true or false on entry to this function. But, Instead of having some dependency between this function and the caller, IMO it makes more sense if we would make this just a boolean function in the first place (e.g. was updated? T/F) Then the caller can also be written more easily like: some_slot_updated |= synchronize_one_slot(wrconn, remote_slot); ~~~ 14. + /* Search for the named slot */ + if ((s = SearchNamedReplicationSlot(remote_slot->name, true))) + { + SpinLockAcquire(&s->mutex); + sync_state = s->data.sync_state; + SpinLockRelease(&s->mutex); + + /* User created slot with the same name exists, raise ERROR. */ + if (sync_state == SYNCSLOT_STATE_NONE) + { + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("skipping sync of slot \"%s\" as it is a user created" + " slot", remote_slot->name), + errdetail("This slot has failover enabled on the primary and" + " thus is sync candidate but user created slot with" + " the same name already exists on the standby"))); + } + } Extra curly brackets around the ereport are not needed. ~~~ 15. + /* + * Sanity check: With hot_standby_feedback enabled and + * invalidations handled apropriately as above, this should never + * happen. + */ + if (remote_slot->restart_lsn < slot->data.restart_lsn) + { + elog(ERROR, + "not synchronizing local slot \"%s\" LSN(%X/%X)" + " to remote slot's LSN(%X/%X) as synchronization " + " would move it backwards", remote_slot->name, + LSN_FORMAT_ARGS(slot->data.restart_lsn), + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); + } 15a. /apropriately/appropriately/ ~ 15b. Extra curly brackets around the elog are not needed. ~~~ 16. synchronize_slots +static bool +synchronize_slots(WalReceiverConn *wrconn) +{ +#define SLOTSYNC_COLUMN_COUNT 9 + Oid slotRow[SLOTSYNC_COLUMN_COUNT] = {TEXTOID, TEXTOID, LSNOID, + LSNOID, XIDOID, BOOLOID, BOOLOID, TEXTOID, INT2OID}; + + WalRcvExecResult *res; + TupleTableSlot *tupslot; + StringInfoData s; + List *remote_slot_list = NIL; + MemoryContext oldctx = CurrentMemoryContext; + ListCell *lc; + bool slot_updated = false; Suggest renaming 'slot_updated' to 'some_slot_updated' or 'update_occurred' etc because the current name makes it look like it applies to a single slot, but it doesn't. ~~~ 17. + SpinLockAcquire(&WalRcv->mutex); + if (!WalRcv || + (WalRcv->slotname[0] == '\0') || + XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) + { + SpinLockRelease(&WalRcv->mutex); + return slot_updated; + } + SpinLockRelease(&WalRcv->mutex); IMO "return false;" here is more clear than saying "return slot_updated;" ~~~ 18. + appendStringInfo(&s, + "SELECT slot_name, plugin, confirmed_flush_lsn," + " restart_lsn, catalog_xmin, two_phase, failover," + " database, pg_get_slot_invalidation_cause(slot_name)" + " FROM pg_catalog.pg_replication_slots" + " WHERE failover and sync_state != 'i'"); 18a. /and/AND/ ~ 18b. In the reply post (see [2]#32) Shveta said "I could not find quote_* function for a character just like we have 'quote_literal_cstr' for string". If you still want to use constant substitution instead of just hardwired 'i' then why do even you need a quote_* function? I thought the appendStringInfo uses a printf style format-string internally, so I assumed it is possible to substitute the state char directly using '%c'. ~~~ 19. + + + + /* We are done, free remote_slot_list elements */ + list_free_deep(remote_slot_list); + + walrcv_clear_result(res); + + return slot_updated; +} Excessive blank lines. ~~~ 20. 
validate_primary_slot + appendStringInfo(&cmd, + "SELECT count(*) = 1 from pg_replication_slots " + "WHERE slot_type='physical' and slot_name=%s", + quote_literal_cstr(PrimarySlotName)); /and/AND/ ~~~ 21. + slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); + tuple_ok = tuplestore_gettupleslot(res->tuplestore, true, false, slot); + Assert(tuple_ok); /* It must return one tuple */ IMO it's better to use all the var names the same across all functions? So call this 'tupslot' like the other MakeSingleTupleTableSlot result. ~~~ 22. validate_slotsync_parameters +/* + * Checks if GUCs are set appropriately before starting slot sync worker + * + * The slot sync worker can not start if 'enable_syncslot' is off and + * since 'enable_syncslot' is ON, check that the other GUC settings + * (primary_slot_name, hot_standby_feedback, wal_level, primary_conninfo) + * are compatible with slot synchronization. If not, raise ERROR. + */ +static void +validate_slotsync_parameters(char **dbname) +{ 22a. The comment is quite verbose. IMO the 2nd para seems just unnecessary detail of the 1st para. SUGGESTION Check that all necessary GUCs for slot synchronization are set appropriately. If not, raise an ERROR. ~~~ 22b. IMO (and given what was said in the comment about enable_syncslot must be on)the first statement of this function should be: /* Sanity check. */ Assert(enable_syncslot); ~~~ 23. slotsync_reread_config + old_dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); + Assert(old_dbname); (This is same comment as old review [1]#61) Hmm. I still don't see why this extraction of the dbname cannot be deferred until later when you know the PrimaryConnInfo has changed, otherwise, it might be redundant to do this. Shveta replied [2] that "Once PrimaryConnInfo is changed, we can not get old-dbname.", but I'm not so sure. Isn't this walrcv_get_dbname_from_conninfo just doing a string search -- Why can't you defer this until you know conninfoChanged is true, and then to get the old_dbname, you can just pass the old_primary_conninfo. E.g. call like walrcv_get_dbname_from_conninfo(old_primary_conninfo); Maybe I am mistaken. ~~ 24. + /* + * Since we have initialized this worker with the old dbname, thus + * exit if dbname changed. Let it get restarted and connect to the new + * dbname specified. + */ + if (conninfoChanged && strcmp(old_dbname, new_dbname) != 0) + ereport(ERROR, + errmsg("exiting slot sync worker as dbname in " + "primary_conninfo changed")); IIUC when the tablesync has to restart, it emits a LOG message before it exits; but it's not an ERROR. So, shouldn't this be similar -- IMO it is not an "error" for the user to wish to change the dbname. Maybe this should be LOG followed by an explicit exit. If you agree, then it might be better to encapsulate such logic in some little function: // pseudo-code void slotsync_worker_restart(const char *msg) { ereport(LOG, msg... exit(0); } ~~~ 25. ReplSlotSyncWorkerMain + for (;;) + { + int rc; + long naptime = WORKER_DEFAULT_NAPTIME_MS; + TimestampTz now; + bool slot_updated; + + ProcessSlotSyncInterrupts(wrconn); + + slot_updated = synchronize_slots(wrconn); Here I think the 'slot_updated' should be renamed to the same name as in #16 above (e.g. 'some_slot_updated' or 'any_slot_updated' or 'update_occurred' etc). ~~~ 26. SlotSyncWorkerRegister + if (!enable_syncslot) + { + ereport(LOG, + errmsg("skipping slots synchronization because enable_syncslot is " + "disabled.")); + return; + } Instead of saying "because..." 
in the error message maybe keep the message more terse and describe the "because" part in the errdetail SUGGESTION errmsg("skipping slot synchronization") errdetail("enable_syncslot is disabled.") ====== src/backend/replication/slot.c 27. + * sync_state: Defines slot synchronization state. For user created slots, it + * is SYNCSLOT_STATE_NONE and for the slots being synchronized on the physical + * standby, it is either SYNCSLOT_STATE_INITIATED or SYNCSLOT_STATE_READY */ void ReplicationSlotCreate(const char *name, bool db_specific, ReplicationSlotPersistency persistency, - bool two_phase, bool failover) + bool two_phase, bool failover, char sync_state) 27a. Why is this comment even mentioning SYNCSLOT_STATE_READY? IIUC it doesn't make sense to ever call ReplicationSlotCreate directly setting the 'r' state (e.g., bypassing 'i' ???) ~ 27b. Indeed, IMO there should be Assert(sync_state == SYNCSLOT_STATE_NONE || syncstate == SYNCSLOT_STATE_INITIATED); to guarantee this. ====== src/include/replication/slot.h 28. + /* + * Is this a slot created by a sync-slot worker? + * + * Only relevant for logical slots on the physical standby. + */ + char sync_state; + (probably I am repeating a previous thought here) The comment says the field is only relevant for standby, and that's how I've been visualizing it, and why I had previously suggested even renaming it to 'standby_sync_state'. However, replies are saying that after failover these sync_states also have "some meaning for the primary server". That's the part I have trouble understanding. IIUC the server states are just either all 'n' (means nothing) or 'r' because they are just leftover from the old standby state. So, does it *truly* have meaning for the server? Or should those states somehow be removed/ignored on the new primary? Anyway, the point is that if this field does have meaning also on the primary (I doubt) then those details should be in this comment. Otherwise "Only relevant ... on the standby" is too misleading. ====== .../t/050_standby_failover_slots_sync.pl 29. +# Create table and publication on primary +$primary->safe_psql('postgres', "CREATE TABLE tab_mypub3 (a int PRIMARY KEY);"); +$primary->safe_psql('postgres', "CREATE PUBLICATION mypub3 FOR TABLE tab_mypub3;"); + 29a. /on primary/on the primary/ ~ 29b. Consider to combine those DDL ~ 29c. Perhaps for consistency, you should be calling this 'regress_mypub3'. ~~~ 30. +# Create a subscriber node +my $subscriber3 = PostgreSQL::Test::Cluster->new('subscriber3'); +$subscriber3->init(allows_streaming => 'logical'); +$subscriber3->start; +$subscriber3->safe_psql('postgres', "CREATE TABLE tab_mypub3 (a int PRIMARY KEY);"); + +# Create a subscription with failover = true & wait for sync to complete. +$subscriber3->safe_psql('postgres', + "CREATE SUBSCRIPTION mysub3 CONNECTION '$publisher_connstr' " + . "PUBLICATION mypub3 WITH (slot_name = lsub3_slot, failover = true);"); +$subscriber3->wait_for_subscription_sync; 30a Consider combining those DDLs. ~ 30b. Probably for consistency, you should be calling this 'regress_mysub3'. ~~~ 31. +# Advance lsn on the primary +$primary->safe_psql('postgres', + "SELECT pg_log_standby_snapshot();"); +$primary->safe_psql('postgres', + "SELECT pg_log_standby_snapshot();"); +$primary->safe_psql('postgres', + "SELECT pg_log_standby_snapshot();"); + Consider combining all those DDLs. ~~~ 32. 
+# Truncate table on primary +$primary->safe_psql('postgres', + "TRUNCATE TABLE tab_mypub3;"); + +# Insert data on the primary +$primary->safe_psql('postgres', + "INSERT INTO tab_mypub3 SELECT generate_series(1, 10);"); + Consider combining those DDLs. ~~~ 33. +# Confirm that restart_lsn of lsub3_slot slot is synced to the standby +$result = $standby3->safe_psql('postgres', + qq[SELECT '$primary_lsn' >= restart_lsn from pg_replication_slots WHERE slot_name = 'lsub3_slot';]); +is($result, 't', 'restart_lsn of slot lsub3_slot synced to standby'); Does "'$primary_lsn' >= restart_lsn" make sense here? NOTE, the sign was '<=' in v43-0002 ~~~ 34. +# Confirm that confirmed_flush_lsn of lsub3_slot slot is synced to the standby +$result = $standby3->safe_psql('postgres', + qq[SELECT '$primary_lsn' >= confirmed_flush_lsn from pg_replication_slots WHERE slot_name = 'lsub3_slot';]); +is($result, 't', 'confirmed_flush_lsn of slot lsub3_slot synced to the standby'); Does "'$primary_lsn' >= confirmed_flush_lsn" make sense here? NOTE, the sign was '<=' in v43-0002 ~~~ 35. +################################################## +# Test that synchronized slot can neither be docoded nor dropped by the user +################################################## 35a. /docoded/decoded/ ~ 35b. Please give explanation in the comment *why* those ops are not allowed (e.g. because the hot_standby_feedback GUC does not have an accepted value) ~~~ 36. +################################################## +# Create another slot which stays in sync_state as initiated ('i') +################################################## + Please explain the comment as to *why* it gets stuck in the initiated state. ~~~ 37. +################################################## +# Promote the standby3 to primary. Confirm that: +# a) the sync-ready('r') slot 'lsub3_slot' is retained on new primary +# b) the initiated('i') slot 'logical_slot'is dropped on promotion +# c) logical replication for mysub3 is resumed succesfully after failover +################################################## /'logical_slot'is/'logical_slot' is/ (missing space) /succesfully/successfully/ ~~~ 38. +# Update subscription with new primary's connection info +$subscriber3->safe_psql('postgres', "ALTER SUBSCRIPTION mysub3 DISABLE;"); +$subscriber3->safe_psql('postgres', "ALTER SUBSCRIPTION mysub3 CONNECTION '$standby3_conninfo';"); +$subscriber3->safe_psql('postgres', "ALTER SUBSCRIPTION mysub3 ENABLE;"); Consider combining all those DDLs. ~~~ 39. + +# Insert data on the new primary +$standby3->safe_psql('postgres', + "INSERT INTO tab_mypub3 SELECT generate_series(11, 20);"); + +# Confirm that data in tab_mypub3 replicated on subscriber +is( $subscriber3->safe_psql('postgres', q{SELECT count(*) FROM tab_mypub3;}), + "20", + 'data replicated from new primary'); Shouldn't there be some wait_for_subscription_sync logic (or similar) here just to ensure the subscriber3 had time to receive that data before you immediately check that it had arrived? ====== [1] My v43-0002 review. https://www.postgresql.org/message-id/CAHut%2BPuuqEpDse5msENsVuK3rjTRN-QGS67rRCGVv%2BzcT-f0GA%40mail.gmail.com [2] Replies to v43-0002 review. https://www.postgresql.org/message-id/CAJpy0uDcOf5Hvk_CdCCAbfx9SY%2Bog%3D%3D%3DtgiuhWKzkYyqebui9g%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
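To make comments 22b and 26 concrete, a minimal sketch of the suggested shape is below. The function and GUC names (SlotSyncWorkerRegister, validate_slotsync_parameters, enable_syncslot) are taken from the quoted patch, but the bodies here are illustrative only, not the patch's actual code.

/* Sketch only: the Assert from comment 22b and the errmsg/errdetail split from comment 26. */
static void
SlotSyncWorkerRegister(void)
{
	if (!enable_syncslot)
	{
		ereport(LOG,
				errmsg("skipping slot synchronization"),
				errdetail("enable_syncslot is disabled."));
		return;
	}

	/* ... register the background worker, as the patch already does ... */
}

static void
validate_slotsync_parameters(char **dbname)
{
	/* Sanity check: the worker is only registered when enable_syncslot is on. */
	Assert(enable_syncslot);

	/* ... validate primary_conninfo, primary_slot_name, hot_standby_feedback, wal_level ... */
}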
On Wed, Dec 13, 2023 at 11:42 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Wed, Dec 13, 2023 at 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Dec 11, 2023 at 5:13 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > On Mon, Dec 11, 2023 at 1:22 PM Drouvot, Bertrand > > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > > > If we agree > > > > > on that then it would be good to prohibit setting this GUC on standby > > > > > or at least it should be a no-op even if this GUC should be set on > > > > > physical standby. > > > > > > > > I'd prefer to completely prohibit it on standby (to make it very clear it's not > > > > working at all) as long as one can enable it without downtime once the standby > > > > is promoted (which is the case currently). > > > > > > And I think slot-sync worker should exit as well on cascading standby. Thoughts? > > > > > > > I think one has set all the valid parameters for the slot-sync worker > > on standby, we should not exit, rather it should be no-op which means > > it should not try to sync slots from another standby. One scenario > > where this may help is when users promote the standby which has > > already synced slots from the primary. In this case, cascading standby > > will become non-cascading and should sync slots. > > > > Right, then perhaps we should increase naptime in this no-op case. It > could be even more then current inactivity naptime which is just > 10sec. Shall it be say 5min in this case? > PFA v47 attached, changes are: patch 001: 1) Addressed comment in [1]. Thanks Hou-san for this change. patch 002 2) Slot sync worker will be no-op if it is on cascading standby as suggested in [2] 3) StartTransaction related optimization as suggested in [3] 4) Few other comments' improvement and code-cleanup. TODO: --Few pending comments as I stated in [4] (mainly header inclusion in tablesync.c, and 'r' to 'n' conversion on promotion) --The comments given today in [5] [1]: https://www.postgresql.org/message-id/CABdArM4Cow6aOLjGG9qnp6mhg%2B%2BgjK%3DHDO%3DKSU%3D6%3DyT7hLkknQ%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAA4eK1Ki1O65SyA6ijh-Mq4zpzeh644fCmkrZXMJcQXHNrAw0Q%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CAA4eK1L3DiKL_Wq-VdU%2B9wmjmO5%2Bfrf%3DZHK9Lzq-7zOezPP%2BWg%40mail.gmail.com [4]: https://www.postgresql.org/message-id/CAJpy0uDcOf5Hvk_CdCCAbfx9SY%2Bog%3D%3D%3DtgiuhWKzkYyqebui9g%40mail.gmail.com [5]: https://www.postgresql.org/message-id/CAHut%2BPtOc7J_n24HJ6f_dFWTuD3X2ApOByQzZf6jZz%2B0wb-ebQ%40mail.gmail.com thanks Shveta
Attachment
Hi, here are a few more review comments for the patch v47-0002 (plus my review comments of v45-0002 [1] are yet to be addressed) ====== 1. General For consistency and readability, try to use variables of the same names whenever they have the same purpose, even when they declared are in different functions. A few like this were already mentioned in the previous review but there are more I keep noticing. For example, 'slotnameChanged' in function, VERSUS 'primary_slot_changed' in the caller. ====== src/backend/replication/logical/slotsync.c 2. +/* + * + * Validates the primary server info. + * + * Using the specified primary server connection, it verifies whether the master + * is a standby itself and returns true in that case to convey the caller that + * we are on the cascading standby. + * But if master is the primary server, it goes ahead and validates + * primary_slot_name. It emits error if the physical slot in primary_slot_name + * does not exist on the primary server. + */ +static bool +validate_primary_info(WalReceiverConn *wrconn) 2a. Extra line top of that comment? ~ 2b. IMO it is too tricky to have a function called "validate_xxx", when actually you gave that return value some special unintuitive meaning other than just validation. IMO it is always better for the returned value to properly match the function name so the expectations are very obvious. So, In this case, I think a better function signature would be like this: SUGGESTION static void validate_primary_info(WalReceiverConn *wrconn, bool *master_is_standby) or static void validate_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) ~~~ 3. + if (res->status != WALRCV_OK_TUPLES) + ereport(ERROR, + (errmsg("could not fetch recovery and primary_slot_name \"%s\" info from the " + "primary: %s", PrimarySlotName, res->err))); I'm not sure that including "recovery and" in the error message is meaningful to the user, is it? ~~~ 4. slotsync_reread_config +/* + * Re-read the config file. + * + * If any of the slot sync GUCs have changed, re-validate them. The + * worker will exit if the check fails. + * + * Returns TRUE if primary_slot_name is changed, let the caller re-verify it. + */ +static bool +slotsync_reread_config(WalReceiverConn *wrconn) Hm. This is another function where the return value has been butchered to have a special meaning unrelated the the function name. IMO it makes the code unnecessarily confusing. IMO a better function signature here would be: static void slotsync_reread_config(WalReceiverConn *wrconn, bool *primary_slot_name_changed) ~~~ 5. ProcessSlotSyncInterrupts +/* + * Interrupt handler for main loop of slot sync worker. + */ +static bool +ProcessSlotSyncInterrupts(WalReceiverConn *wrconn, bool check_cascading_standby) +{ There is no function comment describing the meaning of the return value. But actually, IMO this is an example of how conflating the meanings of validation VERSUS are_we_cascading_standby in the lower-down function has propagated up to become a big muddle. The code + if (primary_slot_changed || check_cascading_standby) + return validate_primary_info(wrconn); seems unnecessarily hard to understand because, false -- doesn't mean invalid true -- doesn't mean valid Please, consider changing this signature also so the functions return what you would intuitively expect them to return without surprisingly different meanings. SUGGESTION static void ProcessSlotSyncInterrupts(WalReceiverConn *wrconn, bool check_cascading_standby, bool *am_cascading_standby) ~~~ 6. 
ReplSlotSyncWorkerMain + int rc; + long naptime = WORKER_DEFAULT_NAPTIME_MS; + TimestampTz now; + bool slot_updated; + + /* + * The transaction env is needed by walrcv_exec() in both the slot + * sync and primary info validation flow. + */ + StartTransactionCommand(); + + if (!am_cascading_standby) + { + slot_updated = synchronize_slots(wrconn); + + /* + * If any of the slots get updated in this sync-cycle, use default + * naptime and update 'last_update_time'. But if no activity is + * observed in this sync-cycle, then increase naptime provided + * inactivity time reaches threshold. + */ + now = GetCurrentTimestamp(); + if (slot_updated) + last_update_time = now; + else if (TimestampDifferenceExceeds(last_update_time, + now, WORKER_INACTIVITY_THRESHOLD_MS)) + naptime = WORKER_INACTIVITY_NAPTIME_MS; + } + else + naptime = 6 * WORKER_INACTIVITY_NAPTIME_MS; /* 60 sec */ + + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, + naptime, + WAIT_EVENT_REPL_SLOTSYNC_MAIN); + + if (rc & WL_LATCH_SET) + ResetLatch(MyLatch); + + am_cascading_standby = + ProcessSlotSyncInterrupts(wrconn, am_cascading_standby); + + CommitTransactionCommand(); IMO it is more natural to avoid negative conditions, so just reverse these. Also, some comment is needed to explain why the longer naptime is needed in this special case. SUGGESTION if (am_cascading_standby) { /* comment the reason .... */ naptime = 6 * WORKER_INACTIVITY_NAPTIME_MS; /* 60 sec */ } else { /* Normal standby */ ... } ====== [1] review of v45-0002. https://www.postgresql.org/message-id/CAHut%2BPtOc7J_n24HJ6f_dFWTuD3X2ApOByQzZf6jZz%2B0wb-ebQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
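As a reference for comments 2b and 5, here is a rough sketch of the out-parameter style being suggested; the names follow the suggestions above, and fetch_remote_in_recovery() is a hypothetical placeholder, not a function from the patch.

/*
 * Sketch only: the out parameter carries the single cascading-standby
 * answer, and validation failures are reported via ERROR, so the return
 * channel has one obvious meaning.
 */
static bool fetch_remote_in_recovery(WalReceiverConn *wrconn);	/* hypothetical */

static void
check_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby)
{
	/*
	 * Query the primary for pg_is_in_recovery() and for the existence of
	 * primary_slot_name; ereport(ERROR, ...) on validation failure.
	 */
	*am_cascading_standby = fetch_remote_in_recovery(wrconn);
}

The caller then reads naturally as check_primary_info(wrconn, &am_cascading_standby); with no special meaning hidden in a return value.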
A review comment for v47-0001 ====== src/backend/replication/slot.c 1. GetStandbySlotList +static void +WalSndRereadConfigAndReInitSlotList(List **standby_slots) +{ + char *pre_standby_slot_names; + + ProcessConfigFile(PGC_SIGHUP); + + /* + * If we are running on a standby, there is no need to reload + * standby_slot_names since we do not support syncing slots to cascading + * standbys. + */ + if (RecoveryInProgress()) + return; Should the RecoveryInProgress() check be first -- even before the ProcessConfigFile call? ====== Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Dec 13, 2023 at 3:53 PM Peter Smith <smithpb2250@gmail.com> wrote: > > 12. > +static void > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, > + bool *slot_updated) > +{ > + ReplicationSlot *s; > + ReplicationSlot *slot; > + char sync_state = '\0'; > > In my previous review [1]#33a I thought it was strange to assign the > sync_state (which is essentially an enum) to some meaningless value, > so I suggested it should be set to SYNCSLOT_STATE_NONE in the > declaration. The reply [2] was "No, that will change the flow. It > should stay uninitialized if the slot is not found." > > But I am not convinced there is any flow problem. Also, > SYNCSLOT_STATE_NONE seems the naturally correct default for something > with no state. It cannot be found and be SYNCSLOT_STATE_NONE at the > same time (that is reported as an ERROR "skipping sync of slot") so I > see no problem. > > The CURRENT code is like this: > > /* Slot created by the slot sync worker exists, sync it */ > if (sync_state) > { > Assert(sync_state == SYNCSLOT_STATE_READY || sync_state == > SYNCSLOT_STATE_INITIATED); > ... > } > /* Otherwise create the slot first. */ > else > { > ... > } > > AFAICT that could easily be changed to like below, with no change to > the logic, and it avoids setting strange values. > > SUGGESTION. > > if (sync_state == SYNCSLOT_STATE_NONE) > { > /* Slot not found. Create it. */ > .. > } > else > { > Assert(sync_state == SYNCSLOT_STATE_READY || sync_state == > SYNCSLOT_STATE_INITIATED); > ... > } > I think instead of creating syncslot based on syncstate, it would be better to create it when we don't find it via SearchNamedReplicationSlot(). That will avoid the need to initialize the syncstate and I think it would make code in this area look better. > ~~~ > > 13. synchronize_one_slot > > +static void > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, > + bool *slot_updated) > > This *slot_updated parameter looks dubious. It is used in a loop from > the caller to mean that ANY slot was updated -- e.g. maybe it is true > or false on entry to this function. > > But, Instead of having some dependency between this function and the > caller, IMO it makes more sense if we would make this just a boolean > function in the first place (e.g. was updated? T/F) > > Then the caller can also be written more easily like: > > some_slot_updated |= synchronize_one_slot(wrconn, remote_slot); > +1. > > 23. slotsync_reread_config > > + old_dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); > + Assert(old_dbname); > > (This is same comment as old review [1]#61) > > Hmm. I still don't see why this extraction of the dbname cannot be > deferred until later when you know the PrimaryConnInfo has changed, > otherwise, it might be redundant to do this. Shveta replied [2] that > "Once PrimaryConnInfo is changed, we can not get old-dbname.", but I'm > not so sure. Isn't this walrcv_get_dbname_from_conninfo just doing a > string search -- Why can't you defer this until you know > conninfoChanged is true, and then to get the old_dbname, you can just > pass the old_primary_conninfo. E.g. call like > walrcv_get_dbname_from_conninfo(old_primary_conninfo); Maybe I am > mistaken. > I think we should just restart if any one of the information is changed with a message like: "slotsync worker will restart because of a parameter change". This would be similar to what we do apply worker in maybe_reread_subscription(). > > 28. > + /* > + * Is this a slot created by a sync-slot worker? 
> + * > + * Only relevant for logical slots on the physical standby. > + */ > + char sync_state; > + > > (probably I am repeating a previous thought here) > > The comment says the field is only relevant for standby, and that's > how I've been visualizing it, and why I had previously suggested even > renaming it to 'standby_sync_state'. However, replies are saying that > after failover these sync_states also have "some meaning for the > primary server". > > That's the part I have trouble understanding. IIUC the server states > are just either all 'n' (means nothing) or 'r' because they are just > leftover from the old standby state. So, does it *truly* have meaning > for the server? Or should those states somehow be removed/ignored on > the new primary? Anyway, the point is that if this field does have > meaning also on the primary (I doubt) then those details should be in > this comment. Otherwise "Only relevant ... on the standby" is too > misleading. > I think this deserves more comments. -- With Regards, Amit Kapila.
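A minimal sketch of the boolean-return shape for synchronize_one_slot that comment 13 suggests and the +1 above agrees with; the list handling is simplified (remote_slot_list is assumed to be filled from the primary, as in the patch), so this is illustrative rather than the patch's code.

/* Returns true if the local slot was created or modified in this cycle. */
static bool synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot);

static bool
synchronize_slots(WalReceiverConn *wrconn)
{
	List	   *remote_slot_list = NIL;	/* assumed: fetched from the primary */
	bool		some_slot_updated = false;
	ListCell   *lc;

	foreach(lc, remote_slot_list)
	{
		RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc);

		some_slot_updated |= synchronize_one_slot(wrconn, remote_slot);
	}

	return some_slot_updated;
}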
On Thursday, December 14, 2023 12:45 PM Peter Smith <smithpb2250@gmail.com> wrote: > A review comment for v47-0001 Thanks for the comment. > > ====== > src/backend/replication/slot.c > > 1. GetStandbySlotList > > +static void > +WalSndRereadConfigAndReInitSlotList(List **standby_slots) { > + char *pre_standby_slot_names; > + > + ProcessConfigFile(PGC_SIGHUP); > + > + /* > + * If we are running on a standby, there is no need to reload > + * standby_slot_names since we do not support syncing slots to > + cascading > + * standbys. > + */ > + if (RecoveryInProgress()) > + return; > > Should the RecoveryInProgress() check be first -- even before the > ProcessConfigFile call? ProcessConfigFile is necessary here; it is needed not only for standby_slot_names but also for all the other GUCs that could be used in the caller. Best Regards, Hou zj
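In other words, the early return is only meant to skip rebuilding the standby_slot_names list, not the config reload itself. Annotating the quoted function with that reasoning (the comments and the elided tail are illustrative, not the patch's code):

static void
WalSndRereadConfigAndReInitSlotList(List **standby_slots)
{
	/*
	 * Reload the configuration first: callers rely on other GUCs being
	 * refreshed as well, not just standby_slot_names, so this must run
	 * even on a standby.
	 */
	ProcessConfigFile(PGC_SIGHUP);

	/*
	 * On a standby there is no need to rebuild the standby_slot_names
	 * list, since syncing slots to cascading standbys is not supported.
	 */
	if (RecoveryInProgress())
		return;

	/* ... re-initialize *standby_slots from standby_slot_names ... */
}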
On Thu, Dec 14, 2023 at 7:00 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi, here are a few more review comments for the patch v47-0002 > > (plus my review comments of v45-0002 [1] are yet to be addressed) > > ====== > 1. General > > For consistency and readability, try to use variables of the same > names whenever they have the same purpose, even when they declared are > in different functions. A few like this were already mentioned in the > previous review but there are more I keep noticing. > > For example, > 'slotnameChanged' in function, VERSUS 'primary_slot_changed' in the caller. > > > ====== > src/backend/replication/logical/slotsync.c > > 2. > +/* > + * > + * Validates the primary server info. > + * > + * Using the specified primary server connection, it verifies whether > the master > + * is a standby itself and returns true in that case to convey the caller that > + * we are on the cascading standby. > + * But if master is the primary server, it goes ahead and validates > + * primary_slot_name. It emits error if the physical slot in primary_slot_name > + * does not exist on the primary server. > + */ > +static bool > +validate_primary_info(WalReceiverConn *wrconn) > > 2b. > IMO it is too tricky to have a function called "validate_xxx", when > actually you gave that return value some special unintuitive meaning > other than just validation. IMO it is always better for the returned > value to properly match the function name so the expectations are very > obvious. So, In this case, I think a better function signature would > be like this: > > SUGGESTION > > static void > validate_primary_info(WalReceiverConn *wrconn, bool *master_is_standby) > > or > > static void > validate_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) > The terminology master_is_standby is a bit indirect for this usage, so I would prefer the second one. Shall we name this function as check_primary_info()? Additionally, can we rewrite the following comment: "Using the specified primary server connection, check whether we are cascading standby. It also validates primary_slot_info for non-cascading-standbys.". -- With Regards, Amit Kapila.
On Thu, Dec 14, 2023 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 7:00 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Hi, here are a few more review comments for the patch v47-0002 > > > > (plus my review comments of v45-0002 [1] are yet to be addressed) > > > > ====== > > 1. General > > > > For consistency and readability, try to use variables of the same > > names whenever they have the same purpose, even when they declared are > > in different functions. A few like this were already mentioned in the > > previous review but there are more I keep noticing. > > > > For example, > > 'slotnameChanged' in function, VERSUS 'primary_slot_changed' in the caller. > > > > > > ====== > > src/backend/replication/logical/slotsync.c > > > > 2. > > +/* > > + * > > + * Validates the primary server info. > > + * > > + * Using the specified primary server connection, it verifies whether > > the master > > + * is a standby itself and returns true in that case to convey the caller that > > + * we are on the cascading standby. > > + * But if master is the primary server, it goes ahead and validates > > + * primary_slot_name. It emits error if the physical slot in primary_slot_name > > + * does not exist on the primary server. > > + */ > > +static bool > > +validate_primary_info(WalReceiverConn *wrconn) > > > > 2b. > > IMO it is too tricky to have a function called "validate_xxx", when > > actually you gave that return value some special unintuitive meaning > > other than just validation. IMO it is always better for the returned > > value to properly match the function name so the expectations are very > > obvious. So, In this case, I think a better function signature would > > be like this: > > > > SUGGESTION > > > > static void > > validate_primary_info(WalReceiverConn *wrconn, bool *master_is_standby) > > > > or > > > > static void > > validate_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) > > > > The terminology master_is_standby is a bit indirect for this usage, so > I would prefer the second one. Shall we name this function as > check_primary_info()? Additionally, can we rewrite the following > comment: "Using the specified primary server connection, check whether > we are cascading standby. It also validates primary_slot_info for > non-cascading-standbys.". > > -- > With Regards, > Amit Kapila. PFA v48. Changes are: 1) Addressed comments by Peter for v45-002 and v47-002 given in [1] and [2] respectively 2) Addressed comments by Amit for v47-002 given in [3], [4] 3) Addressed an old comment (#74 in [5]) of getting rid of header inclusion from tablesync.c when there was no code change in that file. Thanks Hou-san for working on this change. TODO: --Address the test comments in [1] for 050_standby_failover_slots_sync.pl --Review the feasibility of addressing one pending comment (comment 13 in [5]) of 'r'->'n' conversion. [1]: https://www.postgresql.org/message-id/CAHut%2BPtOc7J_n24HJ6f_dFWTuD3X2ApOByQzZf6jZz%2B0wb-ebQ%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAHut%2BPsvxs-%3Dj3aCpPVs3e4w78HndCdO-F4bLPzAX70%2BdgWUuQ%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CAA4eK1L2ts%3DgfiF4aw7-DH8HWj29s08hVRq-Ff8%3DmjfdUXx8CA%40mail.gmail.com [4]: https://www.postgresql.org/message-id/CAA4eK1%2Bw9yv%2B4UZXhiDHZpGDfbeRHYDBu23FwsniS8sYUZeu1w%40mail.gmail.com [5]: https://www.postgresql.org/message-id/CAJpy0uDcOf5Hvk_CdCCAbfx9SY%2Bog%3D%3D%3DtgiuhWKzkYyqebui9g%40mail.gmail.com thanks Shveta
On Fri, Dec 15, 2023 at 11:02 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Dec 14, 2023 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Dec 14, 2023 at 7:00 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > Hi, here are a few more review comments for the patch v47-0002 > > > > > > (plus my review comments of v45-0002 [1] are yet to be addressed) > > > > > > ====== > > > 1. General > > > > > > For consistency and readability, try to use variables of the same > > > names whenever they have the same purpose, even when they declared are > > > in different functions. A few like this were already mentioned in the > > > previous review but there are more I keep noticing. > > > > > > For example, > > > 'slotnameChanged' in function, VERSUS 'primary_slot_changed' in the caller. > > > > > > > > > ====== > > > src/backend/replication/logical/slotsync.c > > > > > > 2. > > > +/* > > > + * > > > + * Validates the primary server info. > > > + * > > > + * Using the specified primary server connection, it verifies whether > > > the master > > > + * is a standby itself and returns true in that case to convey the caller that > > > + * we are on the cascading standby. > > > + * But if master is the primary server, it goes ahead and validates > > > + * primary_slot_name. It emits error if the physical slot in primary_slot_name > > > + * does not exist on the primary server. > > > + */ > > > +static bool > > > +validate_primary_info(WalReceiverConn *wrconn) > > > > > > 2b. > > > IMO it is too tricky to have a function called "validate_xxx", when > > > actually you gave that return value some special unintuitive meaning > > > other than just validation. IMO it is always better for the returned > > > value to properly match the function name so the expectations are very > > > obvious. So, In this case, I think a better function signature would > > > be like this: > > > > > > SUGGESTION > > > > > > static void > > > validate_primary_info(WalReceiverConn *wrconn, bool *master_is_standby) > > > > > > or > > > > > > static void > > > validate_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) > > > > > > > The terminology master_is_standby is a bit indirect for this usage, so > > I would prefer the second one. Shall we name this function as > > check_primary_info()? Additionally, can we rewrite the following > > comment: "Using the specified primary server connection, check whether > > we are cascading standby. It also validates primary_slot_info for > > non-cascading-standbys.". > > > > -- > > With Regards, > > Amit Kapila. > > > PFA v48. Changes are: > Sorry, I missed attaching the patch. PFA v48. > 1) Addressed comments by Peter for v45-002 and v47-002 given in [1] > and [2] respectively > 2) Addressed comments by Amit for v47-002 given in [3], [4] > 3) Addressed an old comment (#74 in [5]) of getting rid of header > inclusion from tablesync.c when there was no code change in that file. > Thanks Hou-san for working on this change. > > > TODO: > --Address the test comments in [1] for 050_standby_failover_slots_sync.pl > --Review the feasibility of addressing one pending comment (comment 13 > in [5]) of 'r'->'n' conversion. 
> > [1]: https://www.postgresql.org/message-id/CAHut%2BPtOc7J_n24HJ6f_dFWTuD3X2ApOByQzZf6jZz%2B0wb-ebQ%40mail.gmail.com > [2]: https://www.postgresql.org/message-id/CAHut%2BPsvxs-%3Dj3aCpPVs3e4w78HndCdO-F4bLPzAX70%2BdgWUuQ%40mail.gmail.com > [3]: https://www.postgresql.org/message-id/CAA4eK1L2ts%3DgfiF4aw7-DH8HWj29s08hVRq-Ff8%3DmjfdUXx8CA%40mail.gmail.com > [4]: https://www.postgresql.org/message-id/CAA4eK1%2Bw9yv%2B4UZXhiDHZpGDfbeRHYDBu23FwsniS8sYUZeu1w%40mail.gmail.com > [5]: https://www.postgresql.org/message-id/CAJpy0uDcOf5Hvk_CdCCAbfx9SY%2Bog%3D%3D%3DtgiuhWKzkYyqebui9g%40mail.gmail.com > > thanks > Shveta
Attachment
On Wed, Dec 13, 2023 at 3:53 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi Shveta, here are some review comments for v45-0002. > Thanks for the feedback. Addressed these in v48. Please find my comments on some. > ====== > doc/src/sgml/bgworker.sgml > > 1. > + <variablelist> > + <varlistentry> > + <term><literal>BgWorkerStart_PostmasterStart</literal></term> > + <listitem> > + <para> > + <indexterm><primary>BgWorkerStart_PostmasterStart</primary></indexterm> > + Start as soon as postgres itself has finished its own initialization; > + processes requesting this are not eligible for database connections. > + </para> > + </listitem> > + </varlistentry> > + > + <varlistentry> > + <term><literal>BgWorkerStart_ConsistentState</literal></term> > + <listitem> > + <para> > + <indexterm><primary>BgWorkerStart_ConsistentState</primary></indexterm> > + Start as soon as a consistent state has been reached in a hot-standby, > + allowing processes to connect to databases and run read-only queries. > + </para> > + </listitem> > + </varlistentry> > + > + <varlistentry> > + <term><literal>BgWorkerStart_RecoveryFinished</literal></term> > + <listitem> > + <para> > + <indexterm><primary>BgWorkerStart_RecoveryFinished</primary></indexterm> > + Start as soon as the system has entered normal read-write state. Note > + that the <literal>BgWorkerStart_ConsistentState</literal> and > + <literal>BgWorkerStart_RecoveryFinished</literal> are equivalent > + in a server that's not a hot standby. > + </para> > + </listitem> > + </varlistentry> > + > + <varlistentry> > + <term><literal>BgWorkerStart_ConsistentState_HotStandby</literal></term> > + <listitem> > + <para> > + <indexterm><primary>BgWorkerStart_ConsistentState_HotStandby</primary></indexterm> > + Same meaning as <literal>BgWorkerStart_ConsistentState</literal> but > + it is more strict in terms of the server i.e. start the worker only > + if it is hot-standby. > + </para> > + </listitem> > + </varlistentry> > + </variablelist> > > Maybe reorder these slightly, because I felt it is better if the > BgWorkerStart_ConsistentState_HotStandby comes next after > BgWorkerStart_ConsistentState, which it refers to > > For example:: > 1st.BgWorkerStart_PostmasterStart > 2nd.BgWorkerStart_ConsistentState > 3rd.BgWorkerStart_ConsistentState_HotStandby > 4th.BgWorkerStart_RecoveryFinished > > ====== > doc/src/sgml/config.sgml > > 2. > <varname>enable_syncslot</varname> = true > > Not sure, but I thought the "= true" part should be formatted too. > > SUGGESTION > <literal>enable_syncslot = true</literal> > > ====== > doc/src/sgml/logicaldecoding.sgml > > 3. > + <para> > + A logical replication slot on the primary can be synchronized to the hot > + standby by enabling the failover option during slot creation and setting > + <varname>enable_syncslot</varname> on the standby. For the synchronization > + to work, it is mandatory to have a physical replication slot between the > + primary and the standby. It's highly recommended that the said physical > + replication slot is listed in <varname>standby_slot_names</varname> on > + the primary to prevent the subscriber from consuming changes faster than > + the hot standby. Additionally, <varname>hot_standby_feedback</varname> > + must be enabled on the standby for the slots synchronization to work. > + </para> > > I felt those parts that describe the mandatory GUCs should be kept together. 
> > SUGGESTION > For the synchronization to work, it is mandatory to have a physical > replication slot between the primary and the standby, and > <varname>hot_standby_feedback</varname> must be enabled on the > standby. > > It's also highly recommended that the said physical replication slot > is named in <varname>standby_slot_names</varname> list on the primary, > to prevent the subscriber from consuming changes faster than the hot > standby. > > ~~~ > > 4. (Chapter 49) > > By enabling synchronization of slots, logical replication can be > resumed after failover depending upon the > pg_replication_slots.sync_state for the synchronized slots on the > standby at the time of failover. Only slots that were in ready > sync_state ('r') on the standby before failover can be used for > logical replication after failover. However, the slots which were in > initiated sync_state ('i') and not sync-ready ('r') at the time of > failover will be dropped and logical replication for such slots can > not be resumed after failover. This applies to the case where a > logical subscription is disabled before failover and is enabled after > failover. If the synchronized slot due to disabled subscription could > not be made sync-ready ('r') on standby, then the subscription can not > be resumed after failover even when enabled. If the primary is idle, > then the synchronized slots on the standby may take a noticeable time > to reach the ready ('r') sync_state. This can be sped up by calling > the pg_log_standby_snapshot function on the primary. > > ~ > > Somehow, I still felt all that was too wordy/repetitive. Below is my > attempt to make it more concise. Thoughts? > > SUGGESTION > The ability to resume logical replication after failover depends upon > the pg_replication_slots.sync_state value for the synchronized slots > on the standby at the time of failover. Only slots that have attained > a "ready" sync_state ('r') on the standby before failover can be used > for logical replication after failover. Slots that have not yet > reached 'r' state (they are still 'i') will be dropped, therefore > logical replication for those slots cannot be resumed. For example, if > the synchronized slot could not become sync-ready on standby due to a > disabled subscription, then the subscription cannot be resumed after > failover even when it is enabled. > > If the primary is idle, the synchronized slots on the standby may take > a noticeable time to reach the ready ('r') sync_state. This can be > sped up by calling the pg_log_standby_snapshot function on the > primary. > > ====== > doc/src/sgml/system-views.sgml > > 5. > + <para> > + Defines slot synchronization state. This is meaningful on the physical > + standby which has configured <varname>enable_syncslot</varname> = true > + </para> > > As mentioned in the previous review comment ([1]#10) I thought it > might be good to include a hyperlink cross-reference to the > 'enable_syncslot' GUC. > > ~~~ > > 6. > + <para> > + The hot standby can have any of these sync_state for the slots but on a > + hot standby, the slots with state 'r' and 'i' can neither be used for > + logical decoding nor dropped by the user. The primary server will have > + sync_state as 'n' for all the slots. But if the standby is promoted to > + become the new primary server, sync_state can be seen 'r' as well. On > + this new primary server, slots with sync_state as 'r' and 'n' will > + behave the same. > + </para></entry> > > 6a. 
> /these sync_state for the slots/these sync_state values for the slots/ > > ~ > > 6b > Hm. I still felt (same as previous review [1]#12b) that there seems > too much information here. > > IIUC the sync_state is only meaningful on the standby. Sure, it might > have some values line 'n' or 'r' on the primary also, but those either > mean nothing ('n') or are leftover states from a previous failover > from a standby ('r'), which also means nothing. So can't we just say > it more succinctly like that? > > SUGGESTION > The sync_state has no meaning on the primary server; the primary > sync_state value is default 'n' for all slots but may (if leftover > from a promoted standby) also be 'r'. > > ====== > .../libpqwalreceiver/libpqwalreceiver.c > > 7. > static void > libpqrcv_alter_slot(WalReceiverConn *conn, const char *slotname, > - bool failover) > + bool failover) > > Still seems to be tampering with indentation that should only be in patch 0001. > > ====== > src/backend/replication/logical/slotsync.c > > 8. wait_for_primary_slot_catchup > > The meaning of the boolean return of this function is still not > described by the function comment. > > ~~~ > > 9. > + * If passed, *wait_attempts_exceeded will be set to true only if this > + * function exits after exhausting its wait attempts. It will be false > + * in all the other cases like failure, remote-slot invalidation, primary > + * could catch up. > > The above already says when a return false happens, so it seems > overkill to give more information. > > SUGGESTION > If passed, *wait_attempts_exceeded will be set to true only if this > function exits due to exhausting its wait attempts. It will be false > in all the other cases. > > ~~~ > > 10. > > +static bool > +wait_for_primary_slot_catchup(WalReceiverConn *wrconn, RemoteSlot *remote_slot, > + bool *wait_attempts_exceeded) > +{ > +#define WAIT_OUTPUT_COLUMN_COUNT 4 > +#define WORKER_PRIMARY_CATCHUP_WAIT_ATTEMPTS 5 > + > > 10a > Maybe the long constant name is too long. How about > WAIT_PRIMARY_CATCHUP_ATTEMPTS? > > ~~~ > > 10b. > IMO it is better to Assert the input value of this kind of side-effect > return parameter, to give a better understanding and to prevent future > accidents. > > SUGGESTION > Assert(wait_attempts_exceeded == NULL |} *wait_attempts_exceeded == false); > > ~~~ > > 11. synchronize_one_slot > > + ReplicationSlot *s; > + ReplicationSlot *slot; > + char sync_state = '\0'; > > 11a. > I don't think you need both 's' and 'slot' ReplicationSlot -- it looks > a bit odd. Can't you just reuse the one 'slot' variable? > > ~ > > 11b. > Also, maybe those assignment like > + slot = MyReplicationSlot; > > can have an explanatory comment like: > /* For convenience, we assign MyReplicationSlot to a shorter variable name. */ > I have changed it to slightly simpler one, if that is okay? > ~~~ > > 12. > +static void > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, > + bool *slot_updated) > +{ > + ReplicationSlot *s; > + ReplicationSlot *slot; > + char sync_state = '\0'; > > In my previous review [1]#33a I thought it was strange to assign the > sync_state (which is essentially an enum) to some meaningless value, > so I suggested it should be set to SYNCSLOT_STATE_NONE in the > declaration. The reply [2] was "No, that will change the flow. It > should stay uninitialized if the slot is not found." > > But I am not convinced there is any flow problem. Also, > SYNCSLOT_STATE_NONE seems the naturally correct default for something > with no state. 
It cannot be found and be SYNCSLOT_STATE_NONE at the > same time (that is reported as an ERROR "skipping sync of slot") so I > see no problem. > > The CURRENT code is like this: > > /* Slot created by the slot sync worker exists, sync it */ > if (sync_state) > { > Assert(sync_state == SYNCSLOT_STATE_READY || sync_state == > SYNCSLOT_STATE_INITIATED); > ... > } > /* Otherwise create the slot first. */ > else > { > ... > } > > AFAICT that could easily be changed to like below, with no change to > the logic, and it avoids setting strange values. > > SUGGESTION. > > if (sync_state == SYNCSLOT_STATE_NONE) > { > /* Slot not found. Create it. */ > .. > } > else > { > Assert(sync_state == SYNCSLOT_STATE_READY || sync_state == > SYNCSLOT_STATE_INITIATED); > ... > } > I have restructured the entire code here and thus initialization of sync_state is no longer needed. Please review now and let me know. > ~~~ > > 13. synchronize_one_slot > > +static void > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot, > + bool *slot_updated) > > This *slot_updated parameter looks dubious. It is used in a loop from > the caller to mean that ANY slot was updated -- e.g. maybe it is true > or false on entry to this function. > > But, Instead of having some dependency between this function and the > caller, IMO it makes more sense if we would make this just a boolean > function in the first place (e.g. was updated? T/F) > > Then the caller can also be written more easily like: > > some_slot_updated |= synchronize_one_slot(wrconn, remote_slot); > > ~~~ > > 14. > + /* Search for the named slot */ > + if ((s = SearchNamedReplicationSlot(remote_slot->name, true))) > + { > + SpinLockAcquire(&s->mutex); > + sync_state = s->data.sync_state; > + SpinLockRelease(&s->mutex); > + > + /* User created slot with the same name exists, raise ERROR. */ > + if (sync_state == SYNCSLOT_STATE_NONE) > + { > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("skipping sync of slot \"%s\" as it is a user created" > + " slot", remote_slot->name), > + errdetail("This slot has failover enabled on the primary and" > + " thus is sync candidate but user created slot with" > + " the same name already exists on the standby"))); > + } > + } > > > Extra curly brackets around the ereport are not needed. > > ~~~ > > 15. > + /* > + * Sanity check: With hot_standby_feedback enabled and > + * invalidations handled apropriately as above, this should never > + * happen. > + */ > + if (remote_slot->restart_lsn < slot->data.restart_lsn) > + { > + elog(ERROR, > + "not synchronizing local slot \"%s\" LSN(%X/%X)" > + " to remote slot's LSN(%X/%X) as synchronization " > + " would move it backwards", remote_slot->name, > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > + } > > 15a. > /apropriately/appropriately/ > > ~ > > 15b. > Extra curly brackets around the elog are not needed. > > ~~~ > > 16. 
synchronize_slots > > +static bool > +synchronize_slots(WalReceiverConn *wrconn) > +{ > +#define SLOTSYNC_COLUMN_COUNT 9 > + Oid slotRow[SLOTSYNC_COLUMN_COUNT] = {TEXTOID, TEXTOID, LSNOID, > + LSNOID, XIDOID, BOOLOID, BOOLOID, TEXTOID, INT2OID}; > + > + WalRcvExecResult *res; > + TupleTableSlot *tupslot; > + StringInfoData s; > + List *remote_slot_list = NIL; > + MemoryContext oldctx = CurrentMemoryContext; > + ListCell *lc; > + bool slot_updated = false; > > Suggest renaming 'slot_updated' to 'some_slot_updated' or > 'update_occurred' etc because the current name makes it look like it > applies to a single slot, but it doesn't. > > ~~~ > > 17. > + SpinLockAcquire(&WalRcv->mutex); > + if (!WalRcv || > + (WalRcv->slotname[0] == '\0') || > + XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) > + { > + SpinLockRelease(&WalRcv->mutex); > + return slot_updated; > + } > + SpinLockRelease(&WalRcv->mutex); > > IMO "return false;" here is more clear than saying "return slot_updated;" > > ~~~ > > 18. > + appendStringInfo(&s, > + "SELECT slot_name, plugin, confirmed_flush_lsn," > + " restart_lsn, catalog_xmin, two_phase, failover," > + " database, pg_get_slot_invalidation_cause(slot_name)" > + " FROM pg_catalog.pg_replication_slots" > + " WHERE failover and sync_state != 'i'"); > > 18a. > /and/AND/ > > ~ > > 18b. > In the reply post (see [2]#32) Shveta said "I could not find quote_* > function for a character just like we have 'quote_literal_cstr' for > string". If you still want to use constant substitution instead of > just hardwired 'i' then why do even you need a quote_* function? I > thought the appendStringInfo uses a printf style format-string > internally, so I assumed it is possible to substitute the state char > directly using '%c'. > Since we have removed cascading standby support, this condition (sync_state != 'i') is no longer needed in the query. > ~~~ > > 19. > + > + > + > + /* We are done, free remote_slot_list elements */ > + list_free_deep(remote_slot_list); > + > + walrcv_clear_result(res); > + > + return slot_updated; > +} > > Excessive blank lines. > > ~~~ > > 20. validate_primary_slot > > + appendStringInfo(&cmd, > + "SELECT count(*) = 1 from pg_replication_slots " > + "WHERE slot_type='physical' and slot_name=%s", > + quote_literal_cstr(PrimarySlotName)); > > > /and/AND/ > > ~~~ > > 21. > + slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); > + tuple_ok = tuplestore_gettupleslot(res->tuplestore, true, false, slot); > + Assert(tuple_ok); /* It must return one tuple */ > > IMO it's better to use all the var names the same across all > functions? So call this 'tupslot' like the other > MakeSingleTupleTableSlot result. > > ~~~ > > 22. validate_slotsync_parameters > > +/* > + * Checks if GUCs are set appropriately before starting slot sync worker > + * > + * The slot sync worker can not start if 'enable_syncslot' is off and > + * since 'enable_syncslot' is ON, check that the other GUC settings > + * (primary_slot_name, hot_standby_feedback, wal_level, primary_conninfo) > + * are compatible with slot synchronization. If not, raise ERROR. > + */ > +static void > +validate_slotsync_parameters(char **dbname) > +{ > > 22a. > The comment is quite verbose. IMO the 2nd para seems just unnecessary > detail of the 1st para. > > SUGGESTION > Check that all necessary GUCs for slot synchronization are set > appropriately. If not, raise an ERROR. > > ~~~ > > 22b. 
> IMO (and given what was said in the comment about enable_syncslot must > be on)the first statement of this function should be: > > /* Sanity check. */ > Assert(enable_syncslot); > > ~~~ > > 23. slotsync_reread_config > > + old_dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); > + Assert(old_dbname); > > (This is same comment as old review [1]#61) > > Hmm. I still don't see why this extraction of the dbname cannot be > deferred until later when you know the PrimaryConnInfo has changed, > otherwise, it might be redundant to do this. Shveta replied [2] that > "Once PrimaryConnInfo is changed, we can not get old-dbname.", but I'm > not so sure. Isn't this walrcv_get_dbname_from_conninfo just doing a > string search -- Why can't you defer this until you know > conninfoChanged is true, and then to get the old_dbname, you can just > pass the old_primary_conninfo. E.g. call like > walrcv_get_dbname_from_conninfo(old_primary_conninfo); Maybe I am > mistaken. > Sorry missed your point earlier that we can use old_primary_conninfo to extract dbname later. I have removed this re-validation now as we will restart the worker in case of a parameter change similar to the case of logical apply worker. So these changes are no longer needed. > ~~ > > 24. > + /* > + * Since we have initialized this worker with the old dbname, thus > + * exit if dbname changed. Let it get restarted and connect to the new > + * dbname specified. > + */ > + if (conninfoChanged && strcmp(old_dbname, new_dbname) != 0) > + ereport(ERROR, > + errmsg("exiting slot sync worker as dbname in " > + "primary_conninfo changed")); > > IIUC when the tablesync has to restart, it emits a LOG message before > it exits; but it's not an ERROR. So, shouldn't this be similar -- IMO > it is not an "error" for the user to wish to change the dbname. Maybe > this should be LOG followed by an explicit exit. If you agree, then it > might be better to encapsulate such logic in some little function: > > // pseudo-code > void slotsync_worker_restart(const char *msg) > { > ereport(LOG, msg... > exit(0); > } > we can not do proc_exit(0), as then postmaster will not restart it on clean-exit. I agree with your logic, but will have another argument in this function to accept 'exit code' from the caller. > ~~~ > > 25. ReplSlotSyncWorkerMain > > + for (;;) > + { > + int rc; > + long naptime = WORKER_DEFAULT_NAPTIME_MS; > + TimestampTz now; > + bool slot_updated; > + > + ProcessSlotSyncInterrupts(wrconn); > + > + slot_updated = synchronize_slots(wrconn); > > Here I think the 'slot_updated' should be renamed to the same name as > in #16 above (e.g. 'some_slot_updated' or 'any_slot_updated' or > 'update_occurred' etc). > > ~~~ > > 26. SlotSyncWorkerRegister > > + if (!enable_syncslot) > + { > + ereport(LOG, > + errmsg("skipping slots synchronization because enable_syncslot is " > + "disabled.")); > + return; > + } > > Instead of saying "because..." in the error message maybe keep the > message more terse and describe the "because" part in the errdetail > > SUGGESTION > errmsg("skipping slot synchronization") > errdetail("enable_syncslot is disabled.") > > > ====== > src/backend/replication/slot.c > > 27. > + * sync_state: Defines slot synchronization state. 
For user created slots, it > + * is SYNCSLOT_STATE_NONE and for the slots being synchronized on the physical > + * standby, it is either SYNCSLOT_STATE_INITIATED or SYNCSLOT_STATE_READY > */ > void > ReplicationSlotCreate(const char *name, bool db_specific, > ReplicationSlotPersistency persistency, > - bool two_phase, bool failover) > + bool two_phase, bool failover, char sync_state) > > > 27a. > Why is this comment even mentioning SYNCSLOT_STATE_READY? IIUC it > doesn't make sense to ever call ReplicationSlotCreate directly setting > the 'r' state (e.g., bypassing 'i' ???) > > ~ > > 27b. > Indeed, IMO there should be Assert(sync_state == SYNCSLOT_STATE_NONE > || syncstate == SYNCSLOT_STATE_INITIATED); to guarantee this. > > ====== > src/include/replication/slot.h > > 28. > + /* > + * Is this a slot created by a sync-slot worker? > + * > + * Only relevant for logical slots on the physical standby. > + */ > + char sync_state; > + > > (probably I am repeating a previous thought here) > > The comment says the field is only relevant for standby, and that's > how I've been visualizing it, and why I had previously suggested even > renaming it to 'standby_sync_state'. However, replies are saying that > after failover these sync_states also have "some meaning for the > primary server". > > That's the part I have trouble understanding. IIUC the server states > are just either all 'n' (means nothing) or 'r' because they are just > leftover from the old standby state. So, does it *truly* have meaning > for the server? Or should those states somehow be removed/ignored on > the new primary? Anyway, the point is that if this field does have > meaning also on the primary (I doubt) then those details should be in > this comment. Otherwise "Only relevant ... on the standby" is too > misleading. > I have modified it currently, but I will give another thought on your suggestions here (and in earlier emails) and will let you know. > ====== > .../t/050_standby_failover_slots_sync.pl > We are working on CFbot failure fixes in this file and restructing the tests here. Thus I am keeping these test comments on hold and will address in next version. > 29. > +# Create table and publication on primary > +$primary->safe_psql('postgres', "CREATE TABLE tab_mypub3 (a int > PRIMARY KEY);"); > +$primary->safe_psql('postgres', "CREATE PUBLICATION mypub3 FOR TABLE > tab_mypub3;"); > + > > 29a. > /on primary/on the primary/ > > ~ > > 29b. > Consider to combine those DDL > > ~ > > 29c. > Perhaps for consistency, you should be calling this 'regress_mypub3'. > > ~~~ > > 30. > +# Create a subscriber node > +my $subscriber3 = PostgreSQL::Test::Cluster->new('subscriber3'); > +$subscriber3->init(allows_streaming => 'logical'); > +$subscriber3->start; > +$subscriber3->safe_psql('postgres', "CREATE TABLE tab_mypub3 (a int > PRIMARY KEY);"); > + > +# Create a subscription with failover = true & wait for sync to complete. > +$subscriber3->safe_psql('postgres', > + "CREATE SUBSCRIPTION mysub3 CONNECTION '$publisher_connstr' " > + . "PUBLICATION mypub3 WITH (slot_name = lsub3_slot, failover = true);"); > +$subscriber3->wait_for_subscription_sync; > > 30a > Consider combining those DDLs. > > ~ > > 30b. > Probably for consistency, you should be calling this 'regress_mysub3'. > > ~~~ > > 31. 
> +# Advance lsn on the primary > +$primary->safe_psql('postgres', > + "SELECT pg_log_standby_snapshot();"); > +$primary->safe_psql('postgres', > + "SELECT pg_log_standby_snapshot();"); > +$primary->safe_psql('postgres', > + "SELECT pg_log_standby_snapshot();"); > + > > Consider combining all those DDLs. > > ~~~ > > 32. > +# Truncate table on primary > +$primary->safe_psql('postgres', > + "TRUNCATE TABLE tab_mypub3;"); > + > +# Insert data on the primary > +$primary->safe_psql('postgres', > + "INSERT INTO tab_mypub3 SELECT generate_series(1, 10);"); > + > > Consider combining those DDLs. > > ~~~ > > 33. > +# Confirm that restart_lsn of lsub3_slot slot is synced to the standby > +$result = $standby3->safe_psql('postgres', > + qq[SELECT '$primary_lsn' >= restart_lsn from pg_replication_slots > WHERE slot_name = 'lsub3_slot';]); > +is($result, 't', 'restart_lsn of slot lsub3_slot synced to standby'); > > > Does "'$primary_lsn' >= restart_lsn" make sense here? NOTE, the sign > was '<=' in v43-0002 > > ~~~ > > 34. > +# Confirm that confirmed_flush_lsn of lsub3_slot slot is synced to the standby > +$result = $standby3->safe_psql('postgres', > + qq[SELECT '$primary_lsn' >= confirmed_flush_lsn from > pg_replication_slots WHERE slot_name = 'lsub3_slot';]); > +is($result, 't', 'confirmed_flush_lsn of slot lsub3_slot synced to > the standby'); > > Does "'$primary_lsn' >= confirmed_flush_lsn" make sense here? NOTE, > the sign was '<=' in v43-0002 > > ~~~ > > 35. > +################################################## > +# Test that synchronized slot can neither be docoded nor dropped by the user > +################################################## > > 35a. > /docoded/decoded/ > > ~ > > 35b. > Please give explanation in the comment *why* those ops are not allowed > (e.g. because the hot_standby_feedback GUC does not have an accepted > value) > > ~~~ > > 36. > +################################################## > +# Create another slot which stays in sync_state as initiated ('i') > +################################################## > + > > Please explain the comment as to *why* it gets stuck in the initiated state. > > > ~~~ > > 37. > +################################################## > +# Promote the standby3 to primary. Confirm that: > +# a) the sync-ready('r') slot 'lsub3_slot' is retained on new primary > +# b) the initiated('i') slot 'logical_slot'is dropped on promotion > +# c) logical replication for mysub3 is resumed succesfully after failover > +################################################## > > > /'logical_slot'is/'logical_slot' is/ (missing space) > > /succesfully/successfully/ > > ~~~ > > 38. > +# Update subscription with new primary's connection info > +$subscriber3->safe_psql('postgres', "ALTER SUBSCRIPTION mysub3 DISABLE;"); > +$subscriber3->safe_psql('postgres', "ALTER SUBSCRIPTION mysub3 > CONNECTION '$standby3_conninfo';"); > +$subscriber3->safe_psql('postgres', "ALTER SUBSCRIPTION mysub3 ENABLE;"); > > > Consider combining all those DDLs. > > ~~~ > > 39. > + > +# Insert data on the new primary > +$standby3->safe_psql('postgres', > + "INSERT INTO tab_mypub3 SELECT generate_series(11, 20);"); > + > +# Confirm that data in tab_mypub3 replicated on subscriber > +is( $subscriber3->safe_psql('postgres', q{SELECT count(*) FROM tab_mypub3;}), > + "20", > + 'data replicated from new primary'); > > Shouldn't there be some wait_for_subscription_sync logic (or similar) > here just to ensure the subscriber3 had time to receive that data > before you immediately check that it had arrived? 
> > ====== > [1] My v43-0002 review. > https://www.postgresql.org/message-id/CAHut%2BPuuqEpDse5msENsVuK3rjTRN-QGS67rRCGVv%2BzcT-f0GA%40mail.gmail.com > [2] Replies to v43-0002 review. > https://www.postgresql.org/message-id/CAJpy0uDcOf5Hvk_CdCCAbfx9SY%2Bog%3D%3D%3DtgiuhWKzkYyqebui9g%40mail.gmail.com > > Kind Regards, > Peter Smith. > Fujitsu Australia
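Following the reply to comment 24 above, a hedged sketch of what the log-and-exit helper could look like once it takes the exit code from the caller; the patch's eventual slotsync_worker_exit() may differ in detail.

/*
 * Sketch only: log why the slot sync worker is exiting and exit with the
 * caller-supplied code. A non-zero code lets the postmaster restart the
 * worker, whereas proc_exit(0) is treated as a clean shutdown and the
 * worker would not be restarted.
 */
static void
slotsync_worker_exit(const char *msg, int code)
{
	ereport(LOG, errmsg("%s", msg));
	proc_exit(code);
}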
On Thu, Dec 14, 2023 at 10:15 AM Peter Smith <smithpb2250@gmail.com> wrote: > > A review comment for v47-0001 > Thanks for reviewing. I have addressed these in v48. There is a design change around the code that checked for a cascading standby and revalidated new GUC values on config reload, so the code has changed entirely in the area where some of these comments applied. Please review now and let me know. > ====== > src/backend/replication/slot.c > > 1. GetStandbySlotList > > +static void > +WalSndRereadConfigAndReInitSlotList(List **standby_slots) > +{ > + char *pre_standby_slot_names; > + > + ProcessConfigFile(PGC_SIGHUP); > + > + /* > + * If we are running on a standby, there is no need to reload > + * standby_slot_names since we do not support syncing slots to cascading > + * standbys. > + */ > + if (RecoveryInProgress()) > + return; > > Should the RecoveryInProgress() check be first -- even before the > ProcessConfigFile call? > > ====== > Kind Regards, > Peter Smith. > Fujitsu Australia
Review for v47 patch - (1) When we try to create a subscription on the standby using a synced slot that is in 'r' sync_state, the subscription will be created at the subscriber, and on the standby, two actions will take place - (i) As copy_data is true by default, it will switch the failover state of the synced slot to 'false'. (ii) As we don't allow synced slots to be used, it will start giving the expected error in the log file - ERROR: cannot use replication slot "logical_slot" for logical decoding DETAIL: This slot is being synced from the primary server. HINT: Specify another replication slot. The first one seems to be an issue: it toggles failover to false, and it then remains false after that. I think it should be fixed. (2) With the patch, the 'CREATE SUBSCRIPTION' command with a 'slot_name' of an 'active' logical slot fails and errors out - ERROR: could not alter replication slot "logical_slot" on publisher: ERROR: replication slot "logical_slot1" is active for PID xxxx Without the patch, the create subscription with an 'active' slot_name succeeds and the log file shows the error "could not start WAL streaming: ERROR: replication slot "logical_slot" is active for PID xxxx". Given that the specified active slot_name has failover set to false and the create subscription command also specifies failover=false, the behavior of the "with-patch" case is expected to be the same as that of the "without-patch" scenario.
On Fri, Dec 15, 2023 at 5:55 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > (1) > When we try to create a subscription on standby using a synced slot > that is in 'r' sync_state, the subscription will be created at the > subscriber, and on standby, two actions will take place - > (i) As copy_data is true by default, it will switch the failover > state of the synced-slot to 'false'. > (ii) As we don't allow to use synced-slots, it will start giving > the expected error in the log file - > ERROR: cannot use replication slot "logical_slot" for logical decoding > DETAIL: This slot is being synced from the primary server. > HINT: Specify another replication slot. > > The first one seems an issue, it toggles the failover to false and > then it remains false after that. I think it should be fixed. > +1. If we don't allow the slot to be used, we shouldn't allow its state to be changed as well. > (2) > With the patch, the 'CREATE SUBSCRIPTION' command with a 'slot_name' > of an 'active' logical slot fails and errors out - > ERROR: could not alter replication slot "logical_slot" on > publisher: ERROR: replication slot "logical_slot1" is active for PID > xxxx > > Without the patch, the create subscription with an 'active' slot_name > succeeds and the log file shows the error "could not start WAL > streaming: ERROR: replication slot "logical_slot" is active for PID > xxxx". > > Given that the specified active slot_name has failover set to false > and the create subscription command also specifies failover=false, the > expected behavior of the "with-patch" case is anticipated to be the > same as that of the "without-patch" scenario. > Currently, we first acquire the slot to change its state but I guess if we want the behavior as you mentioned we first need to check the slot's 'failover' state without acquiring the slot. I am not sure if that is any better because anyway we are going to fail in the very next step as the slot is busy. -- With Regards, Amit Kapila.
On Fri, Dec 15, 2023 at 11:03 AM shveta malik <shveta.malik@gmail.com> wrote: > > Sorry, I missed attaching the patch. PFA v48. > Few comments on v48_0002 ======================== 1. +static void +slotsync_reread_config(WalReceiverConn *wrconn) { ... + pfree(old_primary_conninfo); + pfree(old_primary_slotname); + + if (restart) + { + char *msg = "slot sync worker will restart because of a parameter change"; + + /* + * The exit code 1 will make postmaster restart the slot sync worker. + */ + slotsync_worker_exit(msg, 1 /* proc_exit code */ ); + } ... I don't see the need to explicitly pfree in case we are already exiting the process because anyway the memory will be released. We can avoid using the 'restart' variable for this. Also, probably, directly exiting here makes sense and at another place where this function is used. I see that in maybe_reread_subscription(), we exit with a 0 code and still apply worker restarts, so why use a different exit code here? 2. +static void +check_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) { ... + remote_in_recovery = DatumGetBool(slot_getattr(tupslot, 1, &isnull)); + Assert(!isnull); + + /* No need to check further, return that we are cascading standby */ + if (remote_in_recovery) + { + *am_cascading_standby = true; + CommitTransactionCommand(); + return; ... } Don't we need to clear the result and tuple in case of early return? 3. It would be a good idea to mention about requirements like a physical slot on primary, hot_standby_feedback, etc. in the commit message. 4. +static bool +wait_for_primary_slot_catchup(WalReceiverConn *wrconn, RemoteSlot *remote_slot, + bool *wait_attempts_exceeded) { ... + tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); + if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot)) + { + ereport(WARNING, + (errmsg("slot \"%s\" creation aborted", remote_slot->name), + errdetail("This slot was not found on the primary server"))); ... + /* + * It is possible to get null values for LSN and Xmin if slot is + * invalidated on the primary server, so handle accordingly. + */ + new_invalidated = DatumGetBool(slot_getattr(tupslot, 1, &isnull)); + Assert(!isnull); + + new_restart_lsn = DatumGetLSN(slot_getattr(tupslot, 2, &isnull)); + if (new_invalidated || isnull) + { + ereport(WARNING, + (errmsg("slot \"%s\" creation aborted", remote_slot->name), + errdetail("This slot was invalidated on the primary server"))); ... } a. The errdetail message should end with a full stop. Please check all other errdetail messages in the patch to follow the same guideline. b. I think saying slot creation aborted is not completely correct because we would have created the slot especially when it is in 'i' state. Can we change it to something like: "aborting initial sync for slot \"%s\""? c. Also, if the remote_slot is invalidated, ideally, we can even drop the local slot but it seems that the patch will drop the same before the next-sync cycle with any other slot that needs to be dropped. If so, can we add the comment to indicate the same? 5. 
+static void +local_slot_update(RemoteSlot *remote_slot) +{ + Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE); + + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn); + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn, + remote_slot->catalog_xmin); + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn, + remote_slot->restart_lsn); + + SpinLockAcquire(&MyReplicationSlot->mutex); + MyReplicationSlot->data.invalidated = remote_slot->invalidated; + SpinLockRelease(&MyReplicationSlot->mutex); ... ... If required, the invalidated flag is updated in the caller as well, so why do we need to update it here as well? -- With Regards, Amit Kapila.
On Monday, December 11, 2023 5:31 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Dec 7, 2023 at 1:33 PM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > Hi. > > > > Here are my review comments for patch v43-0002. > > > > > ====== > > src/backend/access/transam/xlogrecovery.c > > > > 13. > > + /* > > + * Shutdown the slot sync workers to prevent potential conflicts between > > + * user processes and slotsync workers after a promotion. Additionally, > > + * drop any slots that have initiated but not yet completed the sync > > + * process. > > + */ > > + ShutDownSlotSync(); > > + slotsync_drop_initiated_slots(); > > + > > > > Is this where maybe the 'sync_state' should also be updated for > > everything so you are not left with confusion about different states > > on a node that is no longer a standby node? > > > > yes, this is the place. But this needs more thought as it may cause > too much disk activity during promotion. so let me analyze and come > back. Per off-list discussion with Amit, I think it's fine to keep both READY and NONE on a primary, because even if we update the sync_state from READY to NONE on promotion, it doesn't reduce the complexity of handling the READY and NONE states. It's also not straightforward to choose the right place to update sync_state; here is the analysis: (related steps on promotion) 1 (patch) shut down the slotsync worker 2 (patch) drop 'i' state slots 3 remove standby.signal and recovery.signal 4 switch to a new timeline and write the timeline history file 5 set SharedRecoveryState = RECOVERY_STATE_DONE, which means RecoveryInProgress() will return false. We could not update the sync_state before step 3, because if the update fails after updating only some of the slots' states, the server will be shut down leaving some 'NONE' state slots. After restarting, the server is still a standby, so the slot sync worker will fail to sync these 'NONE' state slots. We also could not update it after step 3 and before step 4, because if any ERROR occurs during the update, then after a restart the server will become a primary (as standby.signal has been removed), but it can still be made an active standby again by creating a standby.signal file, because the timeline has not been switched. In that case, the slot sync worker will also fail to sync these 'NONE' state slots. Updating the sync_state after step 4 and before step 5 is OK, but it still doesn't simplify the handling of both READY and NONE state slots. Therefore, I think we can retain the READY state slots after promotion, as they can provide information about the slot's origin. I added some comments around the slotsync cleanup code (in FinishWalRecovery) to mention the reason. Best Regards, Hou zj
On Friday, December 15, 2023 1:32 PM shveta malik <shveta.malik@gmail.com> wrote: > > TODO: > --Address the test comments in [1] for 050_standby_failover_slots_sync.pl > --Review the feasibility of addressing one pending comment (comment 13 in > [5]) of 'r'->'n' conversion. Here is the V49 patch set, which addresses the above TODO items. The patch set also includes the following changes: V49-0001 1) Added some documentation to mention that it is the user's responsibility to ensure that table sync has completed before switching the subscriber to the new primary. 2) Fixed one CFbot failure in 050_standby_failover_slots_sync.pl. V49-0002 1) Added a few comments to mention why we retain the READY state after promotion. 2) Prevented the user from altering slots that are being synced. 3) Fixed one CFbot failure in 050_standby_failover_slots_sync.pl. 4) Improved 050_standby_failover_slots_sync.pl to remove some unnecessary operations. V49-0003 There is one unstable test in V48-0002 that validates the restart_lsn of a synced slot. We test it by checking "'$primary_restart_lsn' <= restart_lsn", which would wrongly allow the standby to be ahead of the primary. It may also fail randomly, as the standby may still be lagging behind the primary if the slot-sync worker has gone for a longer nap (10 sec) and has not picked up the slot changes yet, and we cannot put a 10 sec sleep here. We may consider removing this test, as it may be enough to verify that logical replication proceeds correctly from the synced slots on the new primary. So I have temporarily moved it into a separate patch for review. Thanks to Ajin for working on the test-case improvements. > > [1]: > https://www.postgresql.org/message-id/CAHut%2BPtOc7J_n24HJ6f_dFWTu > D3X2ApOByQzZf6jZz%2B0wb-ebQ%40mail.gmail.com > [5]: > https://www.postgresql.org/message-id/CAJpy0uDcOf5Hvk_CdCCAbfx9SY% > 2Bog%3D%3D%3DtgiuhWKzkYyqebui9g%40mail.gmail.com Best Regards, Hou zj
Attachment
Here are some review comments for v48-0002 ====== doc/src/sgml/config.sgml 1. + If slot synchronization is enabled then it is also necessary to + specify <literal>dbname</literal> in the + <varname>primary_conninfo</varname> string. This will only be used for + slot synchronization. It is ignored for streaming. I felt the "If slot synchronization is enabled" part should also include an xref to the enable_slotsync GUC, otherwise there is no information here about how to enable it. SUGGESTION If slot synchronization is enabled (see XXX) .... ====== doc/src/sgml/logicaldecoding.sgml 2. + <para> + The ability to resume logical replication after failover depends upon the + <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>sync_state</structfield> + value for the synchronized slots on the standby at the time of failover. + Only slots that have attained "ready" sync_state ('r') on the standby + before failover can be used for logical replication after failover. Slots + that have not yet reached 'r' state (they are still 'i') will be dropped, + therefore logical replication for those slots cannot be resumed. For + example, if the synchronized slot could not become sync-ready on standby + due to a disabled subscription, then the subscription cannot be resumed + after failover even when it is enabled. + If the primary is idle, then the synchronized slots on the standby may + take a noticeable time to reach the ready ('r') sync_state. This can + be sped up by calling the + <function>pg_log_standby_snapshot</function> function on the primary. + </para> 2a. /sync-ready on standby/sync-ready on the standby/ ~ 2b. Should "If the primary is idle" be in a new paragraph? ====== doc/src/sgml/system-views.sgml 3. + <para> + The hot standby can have any of these sync_state values for the slots but + on a hot standby, the slots with state 'r' and 'i' can neither be used + for logical decoding nor dropped by the user. + The sync_state has no meaning on the primary server; the primary + sync_state value is default 'n' for all slots but may (if leftover + from a promoted standby) also be 'r'. + </para></entry> I still feel we are exposing too much useless information about the primary server values. Isn't it sufficient to just say "The sync_state values have no meaning on a primary server.", and not bother to mention what those meaningless values might be -- e.g. if they are meaningless then who cares what they are or how they got there? ====== src/backend/replication/logical/slotsync.c 4. synchronize_one_slot + /* Slot ready for sync, so sync it. */ + if (sync_state == SYNCSLOT_STATE_READY) + { + /* + * Sanity check: With hot_standby_feedback enabled and + * invalidations handled appropriately as above, this should never + * happen. + */ + if (remote_slot->restart_lsn < slot->data.restart_lsn) + elog(ERROR, + "not synchronizing local slot \"%s\" LSN(%X/%X)" + " to remote slot's LSN(%X/%X) as synchronization " + " would move it backwards", remote_slot->name, + LSN_FORMAT_ARGS(slot->data.restart_lsn), + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); + + if (remote_slot->confirmed_lsn != slot->data.confirmed_flush || + remote_slot->restart_lsn != slot->data.restart_lsn || + remote_slot->catalog_xmin != slot->data.catalog_xmin) + { + /* Update LSN of slot to remote slot's current position */ + local_slot_update(remote_slot); + ReplicationSlotSave(); + slot_updated = true; + } + } + /* Slot not ready yet, let's attempt to make it sync-ready now. 
*/ + else if (sync_state == SYNCSLOT_STATE_INITIATED) + { + /* + * Wait for the primary server to catch-up. Refer to the comment + * atop the file for details on this wait. + */ + if (remote_slot->restart_lsn < slot->data.restart_lsn || + TransactionIdPrecedes(remote_slot->catalog_xmin, + slot->data.catalog_xmin)) + { + if (!wait_for_primary_slot_catchup(wrconn, remote_slot, NULL)) + { + ReplicationSlotRelease(); + return false; + } + } + + /* + * Wait for primary is over, update the lsns and mark the slot as + * READY for further syncs. + */ + local_slot_update(remote_slot); + SpinLockAcquire(&slot->mutex); + slot->data.sync_state = SYNCSLOT_STATE_READY; + SpinLockRelease(&slot->mutex); + + /* Save the changes */ + ReplicationSlotMarkDirty(); + ReplicationSlotSave(); + slot_updated = true; + + ereport(LOG, + errmsg("newly locally created slot \"%s\" is sync-ready now", + remote_slot->name)); + } 4a. It would be more natural in the code if you do the SYNCSLOT_STATE_INITIATED logic before the SYNCSLOT_STATE_READY because that is the order those states come in. ~ 4b. I'm not sure if it is worth it, but I was thinking that some duplicate code can be avoided by doing if/if instead of if/else if (sync_state == SYNCSLOT_STATE_INITIATED) { .. } if (sync_state == SYNCSLOT_STATE_READY) { } By arranging it this way maybe the SYNCSLOT_STATE_INITIATED code block doesn't need to do anything except update the sync_state = SYNCSLOT_STATE_READY; Then it can just fall through to the SYNCSLOT_STATE_READY logic to do all the local_slot_update(remote_slot); etc in just one place. ~~~ 5. check_primary_info + * Checks the primary server info. + * + * Using the specified primary server connection, check whether we are cascading + * standby. It also validates primary_slot_name for non-cascading-standbys. + */ +static void +check_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) 5a. /we are cascading/we are a cascading/ 5b. /non-cascading-standbys./non-cascading standbys./ ~~~ 6. + CommitTransactionCommand(); + + *am_cascading_standby = false; Maybe it's simpler just to set this default false up-front, replacing the current assert. BEFORE: + Assert(am_cascading_standby != NULL); AFTER: *am_cascading_standby = false; /* maybe overwrite later */ ~~~ 7. +/* + * Exit the slot sync worker with given exit-code. + */ +static void +slotsync_worker_exit(const char *msg, int code) +{ + ereport(LOG, errmsg("%s", msg)); + proc_exit(code); +} This could be written differently (don't pass the exit code, instead pass a bool) like: static void slotsync_worker_exit(const char *msg, bool restart_worker) By doing it this way, you can keep the special exit code values (0,1) within this function where you can comment all about them instead of having scattered comments about exit codes in the callers. SUGGESTION ereport(LOG, errmsg("%s", msg)); /* <some big comment here about how the code causes the worker to restart or not> */ proc_exit(restart_worker ? 1 : 0); ~~~ 8. slotsync_reread_config + if (restart) + { + char *msg = "slot sync worker will restart because of a parameter change"; + + /* + * The exit code 1 will make postmaster restart the slot sync worker. + */ + slotsync_worker_exit(msg, 1 /* proc_exit code */ ); + } Shouldn't that message be written as _(), so that it will get translated? SUGGESTION slotsync_worker_exit(_("slot sync worker will restart because of a parameter change"), true /* restart worker */ ); ~~~ 9. 
ProcessSlotSyncInterrupts + CHECK_FOR_INTERRUPTS(); + + if (ShutdownRequestPending) + { + char *msg = "replication slot sync worker is shutting down on receiving SIGINT"; + + walrcv_disconnect(wrconn); + + /* + * The exit code 0 means slot sync worker will not be restarted by + * postmaster. + */ + slotsync_worker_exit(msg, 0 /* proc_exit code */ ); + } Shouldn't that message be written as _(), so that it will be translated? SUGGESTION slotsync_worker_exit(_("replication slot sync worker is shutting down on receiving SIGINT"), false /* don't restart worker */ ); ~~~ 10. +/* + * Cleanup function for logical replication launcher. + * + * Called on logical replication launcher exit. + */ +static void +slotsync_worker_onexit(int code, Datum arg) +{ + SpinLockAcquire(&SlotSyncWorker->mutex); + SlotSyncWorker->pid = InvalidPid; + SpinLockRelease(&SlotSyncWorker->mutex); +} IMO it would make sense for this function to be defined adjacent to the slotsync_worker_exit() function. ~~~ 11. ReplSlotSyncWorkerMain + /* + * Using the specified primary server connection, check whether we are + * cascading standby and validates primary_slot_name for + * non-cascading-standbys. + */ + check_primary_info(wrconn, &am_cascading_standby); ... + /* Recheck if it is still a cascading standby */ + if (am_cascading_standby) + check_primary_info(wrconn, &am_cascading_standby); Those 2 above calls could be combined if you want. By defaulting the am_cascading_standby = true when declared, then you could put this code at the top of the loop instead of having the same code in 2 places: + if (am_cascading_standby) + check_primary_info(wrconn, &am_cascading_standby); ====== src/include/commands/subscriptioncmds.h 12. #include "parser/parse_node.h" +#include "replication/walreceiver.h" There is #include, but no other code change. Is this needed? ====== src/include/replication/slot.h 13. + /* + * Synchronization state for a logical slot. + * + * The standby can have any value among the possible values of 'i','r' and + * 'n'. For primary, the default is 'n' for all slots but may also be 'r' + * if leftover from a promoted standby. + */ + char sync_state; + All that is OK now, but I keep circling back to my original thought that since this state has no meaning for the primary server then a) why do we even care what potential values it might have there, and b) isn't it better to call this field 'standby_sync_state' to emphasize it only has meaning for the standby? e.g. SUGGESTION /* * Synchronization state for a logical slot. * * The standby can have any value among the possible values of 'i','r' and * 'n'. For the primary, this field value has no meaning. */ char standby_sync_state; ====== Kind Regards, Peter Smith. Fujitsu Australia
Here are some comments for the patch v49-0002. (This is in addition to my review comments for v48-0002 [1]) ====== src/backend/access/transam/xlogrecovery.c 1. FinishWalRecovery + * + * We do not update the sync_state from READY to NONE here, as any failed + * update could leave some slots in the 'NONE' state, causing issues during + * slot sync after restarting the server as a standby. While updating after + * switching to the new timeline is an option, it does not simplify the + * handling for both READY and NONE state slots. Therefore, we retain the + * READY state slots after promotion as they can provide useful information + * about their origin. + */ Do you know if that wording is correct? e.g., If you were updating from READY to NONE and there was a failed update, that would leave some slots still in a READY state, right? So why does the comment say "could leave some slots in the 'NONE' state"? ====== src/backend/replication/slot.c 2. ReplicationSlotAlter + /* + * Do not allow users to drop the slots which are currently being synced + * from the primary to the standby. + */ + if (RecoveryInProgress() && + MyReplicationSlot->data.sync_state != SYNCSLOT_STATE_NONE) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot alter replication slot \"%s\"", name), + errdetail("This slot is being synced from the primary server."))); + The comment looks wrong -- should say "Do not allow users to alter..." ====== 3. +################################################## +# Test that synchronized slot can neither be decoded nor dropped by the user +################################################## + 3a, /Test that synchronized slot/Test that a synchronized slot/ 3b. Isn't there a missing test? Should this part also check that it cannot ALTER the replication slot being synced? e.g. test for the new v49 error message that was added in ReplicationSlotAlter() ~~~ 4. +# Disable hot_standby_feedback +$standby1->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;'); +$standby1->restart; + Can there be a comment added to explain why you are doing the 'hot_standby_feedback' toggle? ~~~ 5. +################################################## +# Promote the standby1 to primary. Confirm that: +# a) the sync-ready('r') slot 'lsub1_slot' is retained on the new primary +# b) the initiated('i') slot 'logical_slot' is dropped on promotion +# c) logical replication for regress_mysub1 is resumed succesfully after failover +################################################## /succesfully/successfully/ ~~~ 6. + +# Confirm that data in tab_mypub3 replicated on subscriber +is( $subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}), + "$primary_row_count", + 'data replicated from the new primary'); The comment is wrong -- it names a different table ('tab_mypub3' ?) to what the SQL says. ====== [1] My v48-0002 review comments. https://www.postgresql.org/message-id/CAHut%2BPsyZQZ1A4XcKw-D%3DvcTg16pN9Dw0PzE8W_X7Yz_bv00rQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Dec 19, 2023 at 6:58 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some comments for the patch v49-0002. > > (This is in addition to my review comments for v48-0002 [1]) > > ====== > src/backend/access/transam/xlogrecovery.c > > > 1. FinishWalRecovery > > + * > + * We do not update the sync_state from READY to NONE here, as any failed > + * update could leave some slots in the 'NONE' state, causing issues during > + * slot sync after restarting the server as a standby. While updating after > + * switching to the new timeline is an option, it does not simplify the > + * handling for both READY and NONE state slots. Therefore, we retain the > + * READY state slots after promotion as they can provide useful information > + * about their origin. > + */ > > Do you know if that wording is correct? e.g., If you were updating > from READY to NONE and there was a failed update, that would leave > some slots still in a READY state, right? So why does the comment say > "could leave some slots in the 'NONE' state"? > The comment is correct because after restart the server will start as 'standby', so 'READY' marked slots are okay but the slots that we changed to 'NONE' would now appear as user-created slots which would be wrong. -- With Regards, Amit Kapila.
On Tue, Dec 19, 2023 at 4:51 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > ====== > doc/src/sgml/system-views.sgml > > 3. > + <para> > + The hot standby can have any of these sync_state values for the slots but > + on a hot standby, the slots with state 'r' and 'i' can neither be used > + for logical decoding nor dropped by the user. > + The sync_state has no meaning on the primary server; the primary > + sync_state value is default 'n' for all slots but may (if leftover > + from a promoted standby) also be 'r'. > + </para></entry> > > I still feel we are exposing too much useless information about the > primary server values. > > Isn't it sufficient to just say "The sync_state values have no meaning > on a primary server.", and not bother to mention what those > meaningless values might be -- e.g. if they are meaningless then who > cares what they are or how they got there? > I feel it would be good to mention somewhere that primary can have slots in 'r' state, if not here, some other place. > > 7. > +/* > + * Exit the slot sync worker with given exit-code. > + */ > +static void > +slotsync_worker_exit(const char *msg, int code) > +{ > + ereport(LOG, errmsg("%s", msg)); > + proc_exit(code); > +} > > This could be written differently (don't pass the exit code, instead > pass a bool) like: > > static void > slotsync_worker_exit(const char *msg, bool restart_worker) > > By doing it this way, you can keep the special exit code values (0,1) > within this function where you can comment all about them instead of > having scattered comments about exit codes in the callers. > > SUGGESTION > ereport(LOG, errmsg("%s", msg)); > /* <some big comment here about how the code causes the worker to > restart or not> */ > proc_exit(restart_worker ? 1 : 0); > Hmm, I don't see the need for this function in the first place. We can use proc_exit in the two callers directly. -- With Regards, Amit Kapila.
On Tue, Dec 19, 2023 at 11:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Dec 19, 2023 at 4:51 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > ====== > > doc/src/sgml/system-views.sgml > > > > 3. > > + <para> > > + The hot standby can have any of these sync_state values for the slots but > > + on a hot standby, the slots with state 'r' and 'i' can neither be used > > + for logical decoding nor dropped by the user. > > + The sync_state has no meaning on the primary server; the primary > > + sync_state value is default 'n' for all slots but may (if leftover > > + from a promoted standby) also be 'r'. > > + </para></entry> > > > > I still feel we are exposing too much useless information about the > > primary server values. > > > > Isn't it sufficient to just say "The sync_state values have no meaning > > on a primary server.", and not bother to mention what those > > meaningless values might be -- e.g. if they are meaningless then who > > cares what they are or how they got there? > > > > I feel it would be good to mention somewhere that primary can have > slots in 'r' state, if not here, some other place. > > > > > 7. > > +/* > > + * Exit the slot sync worker with given exit-code. > > + */ > > +static void > > +slotsync_worker_exit(const char *msg, int code) > > +{ > > + ereport(LOG, errmsg("%s", msg)); > > + proc_exit(code); > > +} > > > > This could be written differently (don't pass the exit code, instead > > pass a bool) like: > > > > static void > > slotsync_worker_exit(const char *msg, bool restart_worker) > > > > By doing it this way, you can keep the special exit code values (0,1) > > within this function where you can comment all about them instead of > > having scattered comments about exit codes in the callers. > > > > SUGGESTION > > ereport(LOG, errmsg("%s", msg)); > > /* <some big comment here about how the code causes the worker to > > restart or not> */ > > proc_exit(restart_worker ? 1 : 0); > > > > Hmm, I don't see the need for this function in the first place. We can > use proc_exit in the two callers directly. > > -- > With Regards, > Amit Kapila. PFA v50 patch-set which addresses comments for v48-0002 and v49-0002 given in [1], [2] and [3]. TODO: --Fix CFBot failure. --Work on correctness of test to merge patch003 to patch002 [1]: https://www.postgresql.org/message-id/CAA4eK1Ko-EBBDkea2R8V8PeveGg10PBswCF7JQdnRu%2BMJP%2BYBQ%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAHut%2BPsyZQZ1A4XcKw-D%3DvcTg16pN9Dw0PzE8W_X7Yz_bv00rQ%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CAHut%2BPv86wBZiyOLHxycd8Yj9%3Dk5kzVa1x7Gbp%2B%3Dc1VGT9TG2w%40mail.gmail.com thanks Shveta
Attachment
On Mon, Dec 18, 2023 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Dec 15, 2023 at 11:03 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > Sorry, I missed attaching the patch. PFA v48. > > > > Few comments on v48_0002 > ======================== Thanks for reviewing. These are addressed in v50. Please find my comments inline for some of these. > 1. > +static void > +slotsync_reread_config(WalReceiverConn *wrconn) > { > ... > + pfree(old_primary_conninfo); > + pfree(old_primary_slotname); > + > + if (restart) > + { > + char *msg = "slot sync worker will restart because of a parameter change"; > + > + /* > + * The exit code 1 will make postmaster restart the slot sync worker. > + */ > + slotsync_worker_exit(msg, 1 /* proc_exit code */ ); > + } > ... > > I don't see the need to explicitly pfree in case we are already > exiting the process because anyway the memory will be released. We can > avoid using the 'restart' variable for this. I have moved pfree to the end where we do not exit the worker. Removed restart variable. >Also, probably, directly > exiting here makes sense and at another place where this function is > used. I see that in maybe_reread_subscription(), we exit with a 0 code > and still apply worker restarts, so why use a different exit code > here? > Logical rep worker is started by logical rep launcher and it has different logic of restarting it. OTOH, slot-sync worker is started by the postmaster and the postmaster starts any of its bgworkers only if the worker had an abnormal exit and restart_time is given during registration of the worker. Thus we need exit_code here. I have removed the new function added though. > 2. > +static void > +check_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) > { > ... > + remote_in_recovery = DatumGetBool(slot_getattr(tupslot, 1, &isnull)); > + Assert(!isnull); > + > + /* No need to check further, return that we are cascading standby */ > + if (remote_in_recovery) > + { > + *am_cascading_standby = true; > + CommitTransactionCommand(); > + return; > ... > } > > Don't we need to clear the result and tuple in case of early return? Yes, it was needed. Modified. > > 3. It would be a good idea to mention about requirements like a > physical slot on primary, hot_standby_feedback, etc. in the commit > message. > > 4. > +static bool > +wait_for_primary_slot_catchup(WalReceiverConn *wrconn, RemoteSlot *remote_slot, > + bool *wait_attempts_exceeded) > { > ... > + tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); > + if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot)) > + { > + ereport(WARNING, > + (errmsg("slot \"%s\" creation aborted", remote_slot->name), > + errdetail("This slot was not found on the primary server"))); > ... > + /* > + * It is possible to get null values for LSN and Xmin if slot is > + * invalidated on the primary server, so handle accordingly. > + */ > + new_invalidated = DatumGetBool(slot_getattr(tupslot, 1, &isnull)); > + Assert(!isnull); > + > + new_restart_lsn = DatumGetLSN(slot_getattr(tupslot, 2, &isnull)); > + if (new_invalidated || isnull) > + { > + ereport(WARNING, > + (errmsg("slot \"%s\" creation aborted", remote_slot->name), > + errdetail("This slot was invalidated on the primary server"))); > ... > } > > a. The errdetail message should end with a full stop. Please check all > other errdetail messages in the patch to follow the same guideline. > b. 
I think saying slot creation aborted is not completely correct > because we would have created the slot especially when it is in 'i' > state. Can we change it to something like: "aborting initial sync for > slot \"%s\""? > c. Also, if the remote_slot is invalidated, ideally, we can even drop > the local slot but it seems that the patch will drop the same before > the next-sync cycle with any other slot that needs to be dropped. If > so, can we add the comment to indicate the same? I have added comments. Basically, it will be dropped in the caller only if it is in the 'RS_EPHEMERAL' state; if it is already persisted, it will be maintained as is but marked as invalidated in the caller, and its sync will be skipped from the next cycle onwards. > > 5. > +static void > +local_slot_update(RemoteSlot *remote_slot) > +{ > + Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE); > + > + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn); > + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn, > + remote_slot->catalog_xmin); > + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn, > + remote_slot->restart_lsn); > + > + SpinLockAcquire(&MyReplicationSlot->mutex); > + MyReplicationSlot->data.invalidated = remote_slot->invalidated; > + SpinLockRelease(&MyReplicationSlot->mutex); > ... > ... > > If required, the invalidated flag is updated in the caller as well, so > why do we need to update it here as well? > It was needed by the part where the slot does not exist and we need to create a new slot. I have now moved the invalidation check into the caller; we do not create the slot at all if the remote_slot is found to be invalidated at the beginning. And if it is invalidated in the middle of the wait logic, then it will be dropped by ReplicationSlotRelease. > -- > With Regards, > Amit Kapila.
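To make the exit-code discussion above concrete: a background worker registered with a finite bgw_restart_time is restarted by the postmaster only after an abnormal (non-zero) exit, while exit code 0 (or registering with BGW_NEVER_RESTART) means it is not restarted. Below is a minimal, illustrative registration sketch, not the patch's actual code; the start time, restart interval, and entry-point wiring are assumptions (ReplSlotSyncWorkerMain is the worker entry point named in the patch).

```
#include "postgres.h"

#include "postmaster/bgworker.h"

/*
 * Illustrative sketch: register a slot sync worker so that the postmaster
 * restarts it after abnormal exits.
 *
 * Exit-code semantics inside the worker:
 *   proc_exit(1) - abnormal exit; the postmaster restarts the worker after
 *                  bgw_restart_time seconds.
 *   proc_exit(0) - clean exit; the worker is not restarted.
 */
static void
register_slotsync_worker_sketch(void)
{
	BackgroundWorker bgw;

	memset(&bgw, 0, sizeof(bgw));
	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
	bgw.bgw_start_time = BgWorkerStart_ConsistentState;	/* assumed */
	bgw.bgw_restart_time = 10;	/* seconds; assumed value */
	snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
	snprintf(bgw.bgw_function_name, BGW_MAXLEN, "ReplSlotSyncWorkerMain");
	snprintf(bgw.bgw_name, BGW_MAXLEN, "slot sync worker");
	snprintf(bgw.bgw_type, BGW_MAXLEN, "slot sync worker");

	RegisterBackgroundWorker(&bgw);
}
```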
On Tue, Dec 19, 2023 at 4:51 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v48-0002 > Thanks for reviewing. Most of these are addressed in v50. Please find my comments for the rest. > ====== > doc/src/sgml/config.sgml > > 1. > + If slot synchronization is enabled then it is also necessary to > + specify <literal>dbname</literal> in the > + <varname>primary_conninfo</varname> string. This will only > be used for > + slot synchronization. It is ignored for streaming. > > I felt the "If slot synchronization is enabled" part should also > include an xref to the enable_slotsync GUC, otherwise there is no > information here about how to enable it. > > SUGGESTION > If slot synchronization is enabled (see XXX) .... > > ====== > doc/src/sgml/logicaldecoding.sgml > > 2. > + <para> > + The ability to resume logical replication after failover depends upon the > + <link linkend="view-pg-replication-slots">pg_replication_slots</link>.<structfield>sync_state</structfield> > + value for the synchronized slots on the standby at the time of failover. > + Only slots that have attained "ready" sync_state ('r') on the standby > + before failover can be used for logical replication after failover. Slots > + that have not yet reached 'r' state (they are still 'i') will be dropped, > + therefore logical replication for those slots cannot be resumed. For > + example, if the synchronized slot could not become sync-ready on standby > + due to a disabled subscription, then the subscription cannot be resumed > + after failover even when it is enabled. > + If the primary is idle, then the synchronized slots on the standby may > + take a noticeable time to reach the ready ('r') sync_state. This can > + be sped up by calling the > + <function>pg_log_standby_snapshot</function> function on the primary. > + </para> > > 2a. > /sync-ready on standby/sync-ready on the standby/ > > ~ > > 2b. > Should "If the primary is idle" be in a new paragraph? > > ====== > doc/src/sgml/system-views.sgml > > 3. > + <para> > + The hot standby can have any of these sync_state values for the slots but > + on a hot standby, the slots with state 'r' and 'i' can neither be used > + for logical decoding nor dropped by the user. > + The sync_state has no meaning on the primary server; the primary > + sync_state value is default 'n' for all slots but may (if leftover > + from a promoted standby) also be 'r'. > + </para></entry> > > I still feel we are exposing too much useless information about the > primary server values. > > Isn't it sufficient to just say "The sync_state values have no meaning > on a primary server.", and not bother to mention what those > meaningless values might be -- e.g. if they are meaningless then who > cares what they are or how they got there? > I am retaining the original info till we find a better place for it as suggested by Amit in [1] > ====== > src/backend/replication/logical/slotsync.c > > 4. synchronize_one_slot > > + /* Slot ready for sync, so sync it. */ > + if (sync_state == SYNCSLOT_STATE_READY) > + { > + /* > + * Sanity check: With hot_standby_feedback enabled and > + * invalidations handled appropriately as above, this should never > + * happen. 
> + */ > + if (remote_slot->restart_lsn < slot->data.restart_lsn) > + elog(ERROR, > + "not synchronizing local slot \"%s\" LSN(%X/%X)" > + " to remote slot's LSN(%X/%X) as synchronization " > + " would move it backwards", remote_slot->name, > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > + > + if (remote_slot->confirmed_lsn != slot->data.confirmed_flush || > + remote_slot->restart_lsn != slot->data.restart_lsn || > + remote_slot->catalog_xmin != slot->data.catalog_xmin) > + { > + /* Update LSN of slot to remote slot's current position */ > + local_slot_update(remote_slot); > + ReplicationSlotSave(); > + slot_updated = true; > + } > + } > + /* Slot not ready yet, let's attempt to make it sync-ready now. */ > + else if (sync_state == SYNCSLOT_STATE_INITIATED) > + { > + /* > + * Wait for the primary server to catch-up. Refer to the comment > + * atop the file for details on this wait. > + */ > + if (remote_slot->restart_lsn < slot->data.restart_lsn || > + TransactionIdPrecedes(remote_slot->catalog_xmin, > + slot->data.catalog_xmin)) > + { > + if (!wait_for_primary_slot_catchup(wrconn, remote_slot, NULL)) > + { > + ReplicationSlotRelease(); > + return false; > + } > + } > + > + /* > + * Wait for primary is over, update the lsns and mark the slot as > + * READY for further syncs. > + */ > + local_slot_update(remote_slot); > + SpinLockAcquire(&slot->mutex); > + slot->data.sync_state = SYNCSLOT_STATE_READY; > + SpinLockRelease(&slot->mutex); > + > + /* Save the changes */ > + ReplicationSlotMarkDirty(); > + ReplicationSlotSave(); > + slot_updated = true; > + > + ereport(LOG, > + errmsg("newly locally created slot \"%s\" is sync-ready now", > + remote_slot->name)); > + } > > 4a. > It would be more natural in the code if you do the > SYNCSLOT_STATE_INITIATED logic before the SYNCSLOT_STATE_READY because > that is the order those states come in. > > ~ > > 4b. > I'm not sure if it is worth it, but I was thinking that some duplicate > code can be avoided by doing if/if instead of if/else > > if (sync_state == SYNCSLOT_STATE_INITIATED) > { > .. > } > if (sync_state == SYNCSLOT_STATE_READY) > { > } > > By arranging it this way maybe the SYNCSLOT_STATE_INITIATED code block > doesn't need to do anything except update the sync_state = > SYNCSLOT_STATE_READY; Then it can just fall through to the > SYNCSLOT_STATE_READY logic to do all the > local_slot_update(remote_slot); etc in just one place. > We want to mark the slot as sync-ready once initial sync is over (i.e. confirmed_lsn != NULL). But if we try to optimize as above, we will end up marking it as sync-read before initial-sync itself in local_slot_update() which does not sound like a good idea. > ~~~ > > 5. check_primary_info > > + * Checks the primary server info. > + * > + * Using the specified primary server connection, check whether we > are cascading > + * standby. It also validates primary_slot_name for non-cascading-standbys. > + */ > +static void > +check_primary_info(WalReceiverConn *wrconn, bool *am_cascading_standby) > > 5a. > /we are cascading/we are a cascading/ > > 5b. > /non-cascading-standbys./non-cascading standbys./ > > ~~~ > > 6. > + CommitTransactionCommand(); > + > + *am_cascading_standby = false; > > Maybe it's simpler just to set this default false up-front, replacing > the current assert. > > BEFORE: > + Assert(am_cascading_standby != NULL); > > AFTER: > *am_cascading_standby = false; /* maybe overwrite later */ > Sure, moved default false up-front. But do we need to replace assert? 
I think assert is needed to make sure we are not accessing null-pointer later. > ~~~ > > 7. > +/* > + * Exit the slot sync worker with given exit-code. > + */ > +static void > +slotsync_worker_exit(const char *msg, int code) > +{ > + ereport(LOG, errmsg("%s", msg)); > + proc_exit(code); > +} > > This could be written differently (don't pass the exit code, instead > pass a bool) like: > > static void > slotsync_worker_exit(const char *msg, bool restart_worker) > > By doing it this way, you can keep the special exit code values (0,1) > within this function where you can comment all about them instead of > having scattered comments about exit codes in the callers. > > SUGGESTION > ereport(LOG, errmsg("%s", msg)); > /* <some big comment here about how the code causes the worker to > restart or not> */ > proc_exit(restart_worker ? 1 : 0); > I have removed slotsync_worker_exit() function as suggested by Amit in [1]. Thus few of the suggestions (7,8,10) are no longer valid relevant. > ~~~ > > 8. slotsync_reread_config > > + if (restart) > + { > + char *msg = "slot sync worker will restart because of a parameter change"; > + > + /* > + * The exit code 1 will make postmaster restart the slot sync worker. > + */ > + slotsync_worker_exit(msg, 1 /* proc_exit code */ ); > + } > > Shouldn't that message be written as _(), so that it will get translated? > > SUGGESTION > slotsync_worker_exit(_("slot sync worker will restart because of a > parameter change"), true /* restart worker */ ); > > ~~~ > > 9. ProcessSlotSyncInterrupts > > + CHECK_FOR_INTERRUPTS(); > + > + if (ShutdownRequestPending) > + { > + char *msg = "replication slot sync worker is shutting down on > receiving SIGINT"; > + > + walrcv_disconnect(wrconn); > + > + /* > + * The exit code 0 means slot sync worker will not be restarted by > + * postmaster. > + */ > + slotsync_worker_exit(msg, 0 /* proc_exit code */ ); > + } > > Shouldn't that message be written as _(), so that it will be translated? > > SUGGESTION > slotsync_worker_exit(_("replication slot sync worker is shutting down > on receiving SIGINT"), false /* don't restart worker */ ); > > ~~~ > > 10. > +/* > + * Cleanup function for logical replication launcher. > + * > + * Called on logical replication launcher exit. > + */ > +static void > +slotsync_worker_onexit(int code, Datum arg) > +{ > + SpinLockAcquire(&SlotSyncWorker->mutex); > + SlotSyncWorker->pid = InvalidPid; > + SpinLockRelease(&SlotSyncWorker->mutex); > +} > > IMO it would make sense for this function to be defined adjacent to > the slotsync_worker_exit() function. > > ~~~ > > 11. ReplSlotSyncWorkerMain > > + /* > + * Using the specified primary server connection, check whether we are > + * cascading standby and validates primary_slot_name for > + * non-cascading-standbys. > + */ > + check_primary_info(wrconn, &am_cascading_standby); > ... > + /* Recheck if it is still a cascading standby */ > + if (am_cascading_standby) > + check_primary_info(wrconn, &am_cascading_standby); > > Those 2 above calls could be combined if you want. By defaulting the > am_cascading_standby = true when declared, then you could put this > code at the top of the loop instead of having the same code in 2 > places: > > + if (am_cascading_standby) > + check_primary_info(wrconn, &am_cascading_standby); I am not very sure about this change. Yes, as you stated logic-wise it could be combined. But the current flow looks more neat while reading the code. Initializing 'am_cascading_standby' as TRUE could be slightly confusing for the reader. 
> ====== > src/include/commands/subscriptioncmds.h > > 12. > #include "parser/parse_node.h" > +#include "replication/walreceiver.h" > > There is #include, but no other code change. Is this needed? > > ====== > src/include/replication/slot.h > > 13. > + /* > + * Synchronization state for a logical slot. > + * > + * The standby can have any value among the possible values of 'i','r' and > + * 'n'. For primary, the default is 'n' for all slots but may also be 'r' > + * if leftover from a promoted standby. > + */ > + char sync_state; > + > > All that is OK now, but I keep circling back to my original thought > that since this state has no meaning for the primary server then > > a) why do we even care what potential values it might have there, and > b) isn't it better to call this field 'standby_sync_state' to > emphasize it only has meaning for the standby? > > e.g. > SUGGESTION > /* > * Synchronization state for a logical slot. > * > * The standby can have any value among the possible values of 'i','r' and > * 'n'. For the primary, this field value has no meaning. > */ > char standby_sync_state; > 'sync_state' still looks a better choice to me (discussed with others too offline). If we get more objections to this name, I can consider changing this. [1]: https://www.postgresql.org/message-id/CAA4eK1Kh2cj5vjknAxibpp8Dn%2BjjVwT%2BF7oMPT1P861s_ZrDXQ%40mail.gmail.com thanks Shveta
On Tue, Dec 19, 2023 at 6:58 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some comments for the patch v49-0002. > Thanks for reviewing. I have addressed these in v50. > (This is in addition to my review comments for v48-0002 [1]) > > ====== > src/backend/access/transam/xlogrecovery.c > > > 1. FinishWalRecovery > > + * > + * We do not update the sync_state from READY to NONE here, as any failed > + * update could leave some slots in the 'NONE' state, causing issues during > + * slot sync after restarting the server as a standby. While updating after > + * switching to the new timeline is an option, it does not simplify the > + * handling for both READY and NONE state slots. Therefore, we retain the > + * READY state slots after promotion as they can provide useful information > + * about their origin. > + */ > > Do you know if that wording is correct? e.g., If you were updating > from READY to NONE and there was a failed update, that would leave > some slots still in a READY state, right? So why does the comment say > "could leave some slots in the 'NONE' state"? > yes, it the comment is correct as stated in [1] [1]: https://www.postgresql.org/message-id/CAA4eK1LoJSbFJwa%3D97_5qHNAVfOkmfc40W_SFMVBbm6r0%3DPXHQ%40mail.gmail.com > ====== > src/backend/replication/slot.c > > 2. ReplicationSlotAlter > > + /* > + * Do not allow users to drop the slots which are currently being synced > + * from the primary to the standby. > + */ > + if (RecoveryInProgress() && > + MyReplicationSlot->data.sync_state != SYNCSLOT_STATE_NONE) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot alter replication slot \"%s\"", name), > + errdetail("This slot is being synced from the primary server."))); > + > > The comment looks wrong -- should say "Do not allow users to alter..." > > ====== > > 3. > +################################################## > +# Test that synchronized slot can neither be decoded nor dropped by the user > +################################################## > + > > 3a, > /Test that synchronized slot/Test that a synchronized slot/ > > 3b. > Isn't there a missing test? Should this part also check that it cannot > ALTER the replication slot being synced? e.g. test for the new v49 > error message that was added in ReplicationSlotAlter() > > ~~~ > > 4. > +# Disable hot_standby_feedback > +$standby1->safe_psql('postgres', 'ALTER SYSTEM SET > hot_standby_feedback = off;'); > +$standby1->restart; > + > > Can there be a comment added to explain why you are doing the > 'hot_standby_feedback' toggle? > > ~~~ > > 5. > +################################################## > +# Promote the standby1 to primary. Confirm that: > +# a) the sync-ready('r') slot 'lsub1_slot' is retained on the new primary > +# b) the initiated('i') slot 'logical_slot' is dropped on promotion > +# c) logical replication for regress_mysub1 is resumed succesfully > after failover > +################################################## > > /succesfully/successfully/ > > ~~~ > > 6. > + > +# Confirm that data in tab_mypub3 replicated on subscriber > +is( $subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}), > + "$primary_row_count", > + 'data replicated from the new primary'); > > The comment is wrong -- it names a different table ('tab_mypub3' ?) to > what the SQL says. > > ====== > [1] My v48-0002 review comments. > https://www.postgresql.org/message-id/CAHut%2BPsyZQZ1A4XcKw-D%3DvcTg16pN9Dw0PzE8W_X7Yz_bv00rQ%40mail.gmail.com > > Kind Regards, > Peter Smith. 
> Fujitsu Australia
Dear Shveta, I have resumed reviewing the patch. I will play with it some more, but I can post some cosmetic comments now. ==== walsender.c 01. WalSndWaitForStandbyConfirmation ``` + sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp()); ``` It works well, but I'm not sure whether we should use WalSndComputeSleeptime() because the function won't be called by a walsender. 02. WalSndWaitForStandbyConfirmation ``` + ConditionVariableTimedSleep(&WalSndCtl->wal_confirm_rcv_cv, sleeptime, + WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION) ``` Hmm, is it OK to use the same event as WalSndWaitForWal()? IIUC it should be avoided. 03. WalSndShmemInit() ``` + + ConditionVariableInit(&WalSndCtl->wal_confirm_rcv_cv); ``` Unnecessary blank line? ~~~~~ 050_standby_failover_slots_sync.pl 04. General My pgperltidy modified your test. Please check. 05. ``` # Create publication on the primary ``` Missing "a" before publication? 06. ``` $subscriber1->init(allows_streaming => 'logical'); ... $subscriber2->init(allows_streaming => 'logical'); ``` IIUC, these settings are not needed. 07. ``` my $primary_insert_time = time(); ``` The variable is not used. 08. ``` # Stop the standby associated with the specified physical replication slot so # that the logical replication slot won't receive changes until the standby # slot's restart_lsn is advanced or the slot is removed from the # standby_slot_names list ``` Missing comma? 09. ``` $back_q->query_until(qr//, "SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);\n"); ``` Not sure - do we have to close the back_q connection? 10. ``` # Remove the standby from the standby_slot_names list and reload the # configuration $primary->adjust_conf('postgresql.conf', 'standby_slot_names', "''"); $primary->psql('postgres', "SELECT pg_reload_conf()"); ``` a. Missing comma? b. I counted, and the reload function in perl (e.g., `$primary->reload;`) is used more often. Do you have a reason to use pg_reload_conf()? 11. ``` # Now that the standby lsn has advanced, the primary must send the decoded # changes to the subscription. $publisher->wait_for_catchup('regress_mysub1'); ``` Is the comment correct? I think the primary sends data because the GUC was modified. 12. ``` # Put the standby back on the primary_slot_name for the rest of the tests $primary->adjust_conf('postgresql.conf', 'standby_slot_names', 'sb1_slot'); $primary->restart(); ``` Just to confirm - you used restart() here because we must ensure the GUC change is propagated to all backends, right? ~~~~~ wait_event_names.txt 13. ``` +WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION "Waiting for the WAL to be received by physical standby in WAL sender process." ``` But there is a possibility that backend processes may also wait on this event, right? Best Regards, Hayato Kuroda FUJITSU LIMITED
On Tue, Dec 19, 2023 at 5:17 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Dec 18, 2023 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Dec 15, 2023 at 11:03 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > Sorry, I missed attaching the patch. PFA v48. > > > > > > > Few comments on v48_0002 > > ======================== > > Thanks for reviewing. These are addressed in v50. > I was still reviewing the v48 version and have a few comments as below. If some of these are already addressed or not relevant, feel free to ignore them. 1. + /* + * Slot sync worker can be stopped at any time. + * Use exit status 1 so the background worker is restarted. We don't need to start the second line of comment in a separate line. 2. + * The assumption is that these dropped local invalidated slots will get + * recreated in next sync-cycle and it is okay to drop and recreate such slots In the above line '.. local invalidated ..' sounds redundant. Shall we remove it? 3. + if (remote_slot->confirmed_lsn > WalRcv->latestWalEnd) + { + SpinLockRelease(&WalRcv->mutex); + elog(ERROR, "skipping sync of slot \"%s\" as the received slot sync " This error message looks odd to me. At least, it should be exiting instead of skipping because we won't continue after this. 4. + /* User created slot with the same name exists, raise ERROR. */ + if (sync_state == SYNCSLOT_STATE_NONE) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("skipping sync of slot \"%s\" as it is a user created" + " slot", remote_slot->name), + errdetail("This slot has failover enabled on the primary and" + " thus is sync candidate but user created slot with" + " the same name already exists on the standby"))); Same problem as above. The skipping in error message doesn't seem to be suitable for the purpose. Additionally, errdetail message should end with a full stop. 5. + /* Slot ready for sync, so sync it. */ + if (sync_state == SYNCSLOT_STATE_READY) + { + /* + * Sanity check: With hot_standby_feedback enabled and + * invalidations handled appropriately as above, this should never + * happen. + */ + if (remote_slot->restart_lsn < slot->data.restart_lsn) + elog(ERROR, + "not synchronizing local slot \"%s\" LSN(%X/%X)" + " to remote slot's LSN(%X/%X) as synchronization " The start of the error message sounds odd. Shall we say 'cannot synchronize ...'? 6. All except one of the callers of local_slot_update() marks the slot dirty and the same is required as well. I think the remaining caller should also mark it dirty and we should move ReplicationSlotMarkDirty() in the caller space. 7. +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) { ... + /* + * Copy the invalidation cause from remote only if local slot is not + * invalidated locally, we don't want to overwrite existing one. + */ + if (slot->data.invalidated == RS_INVAL_NONE) + { + SpinLockAcquire(&slot->mutex); + slot->data.invalidated = remote_slot->invalidated; + SpinLockRelease(&slot->mutex); + } ... It doesn't seem that after changing the invalidated flag, we always mark the slot dirty. Am, I missing something? 8. + /* + * Drop local slots that no longer need to be synced. Do it before + * synchronize_one_slot to allow dropping of slots before actual sync + * which are invalidated locally while still valid on the primary server. + */ + drop_obsolete_slots(remote_slot_list); The second part of the above comment seems redundant as that is obvious. 9. 
+static WalReceiverConn * +remote_connect(void) +{ + WalReceiverConn *wrconn = NULL; + char *err; + + wrconn = walrcv_connect(PrimaryConnInfo, true, false, + cluster_name[0] ? cluster_name : "slotsyncworker", + &err); + if (wrconn == NULL) + ereport(ERROR, + (errcode(ERRCODE_CONNECTION_FAILURE), + errmsg("could not connect to the primary server: %s", err))); + return wrconn; +} Do we need a function for this? It appears to be called from just one place, so not sure if it is helpful to have a function for this. -- With Regards, Amit Kapila.
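Regarding point 7 above (copying the invalidation cause without always dirtying the slot), the following is a minimal sketch of the kind of sequence being suggested. It is not the patch's code; it assumes the slot is the currently acquired MyReplicationSlot, as in synchronize_one_slot().

```
#include "postgres.h"

#include "replication/slot.h"
#include "storage/spin.h"

/*
 * Sketch for point 7: copy an invalidation cause into the currently acquired
 * slot and make sure it reaches disk.
 */
static void
set_slot_invalidation_cause(ReplicationSlot *slot,
							ReplicationSlotInvalidationCause cause)
{
	Assert(slot == MyReplicationSlot);

	/* Don't overwrite an existing local invalidation. */
	if (slot->data.invalidated != RS_INVAL_NONE || cause == RS_INVAL_NONE)
		return;

	SpinLockAcquire(&slot->mutex);
	slot->data.invalidated = cause;
	SpinLockRelease(&slot->mutex);

	/* Without these, the new cause could be lost after a restart. */
	ReplicationSlotMarkDirty();
	ReplicationSlotSave();
}
```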
On Tue, Dec 19, 2023 at 5:30 PM shveta malik <shveta.malik@gmail.com> wrote: > > Thanks for reviewing. I have addressed these in v50. > I was looking at this patch to see if something smaller could be independently committable. I think we can extract pg_get_slot_invalidation_cause() and commit it as that function could be independently useful as well. What do you think? -- With Regards, Amit Kapila.
On Tue, Dec 19, 2023 at 6:35 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > ==== > walsender.c > > 01. WalSndWaitForStandbyConfirmation > > ``` > + sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp()); > ``` > > It works well, but I'm not sure whether we should use WalSndComputeSleeptime() > because the function won't be called by walsender. > I don't think it is correct to use this function because it is walsender specific, for example, it uses 'last_reply_timestamp' which won't be even initialized in the backend environment. We need to probably use a different logic for sleep here or need to use a hard-coded value. I think we should change the name of functions like WalSndWaitForStandbyConfirmation() as they are no longer used by walsender. IIRC, earlier, we had a common logic to wait from both walsender and SQL APIs which led to this naming but that is no longer true with the latest patch. > 02.WalSndWaitForStandbyConfirmation > > ``` > + ConditionVariableTimedSleep(&WalSndCtl->wal_confirm_rcv_cv, sleeptime, > + WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION) > ``` > > Hmm, is it OK to use the same event as WalSndWaitForWal()? IIUC it should be avoided. > Agreed. So, how about using WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION so that we can use it both from the backend and walsender? -- With Regards, Amit Kapila.
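To make the direction discussed here a little more concrete, below is a very rough sketch of a backend-safe wait that uses a hard-coded sleep instead of WalSndComputeSleeptime() and a wait event shared between walsender and backends. None of it is the patch's actual code: wal_confirm_rcv_cv is the condition variable added by the patch, WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION is the event name proposed above, and StandbyConfirmedFlush() is a purely hypothetical predicate standing in for "all slots in standby_slot_names have confirmed this LSN".

```
#include "postgres.h"

#include "access/xlogdefs.h"
#include "miscadmin.h"
#include "replication/walsender_private.h"
#include "storage/condition_variable.h"
#include "utils/wait_event.h"

/* Hypothetical predicate; see the note above. */
extern bool StandbyConfirmedFlush(XLogRecPtr lsn);

/* Hard-coded sleep between rechecks, per the suggestion above. */
#define STANDBY_CONFIRM_SLEEP_MS	1000L

static void
WaitForStandbyConfirmation(XLogRecPtr wait_for_lsn)
{
	ConditionVariablePrepareToSleep(&WalSndCtl->wal_confirm_rcv_cv);

	while (!StandbyConfirmedFlush(wait_for_lsn))
	{
		/* React to shutdown requests and other interrupts. */
		CHECK_FOR_INTERRUPTS();

		(void) ConditionVariableTimedSleep(&WalSndCtl->wal_confirm_rcv_cv,
										   STANDBY_CONFIRM_SLEEP_MS,
										   WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION);
	}

	ConditionVariableCancelSleep();
}
```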
Here are some comments for the patch v50-0002. ====== GENERAL (I made a short study of all the ereports in this patch -- here are some findings) ~~~ 0.1 Don't need the parentheses. Checking all the ereports I see that half of them have the redundant parentheses and half of them do not; You might as well make them all use the new style where the extra parentheses are not needed. e.g. + ereport(LOG, + (errmsg("skipping slot synchronization"), + errdetail("enable_syncslot is disabled."))); e.g. + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot drop replication slot \"%s\"", name), + errdetail("This slot is being synced from the primary server."))); and many more like this. Search for all the ereports. ~~~ 0.2 + ereport(LOG, + (errmsg("dropped replication slot \"%s\" of dbid %d as it " + "was not sync-ready", NameStr(s->data.name), + s->data.database))); I felt maybe that could be: errmsg("dropped replication slot \"%s\" of dbid %d", ... errdetail("It was not sync-ready.") (now this shares the same errmsg with another ereport) ~~~ 0.3. + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("skipping sync of slot \"%s\" as it is a user created" + " slot", remote_slot->name), + errdetail("This slot has failover enabled on the primary and" + " thus is sync candidate but user created slot with" + " the same name already exists on the standby."))); This seemed too wordy. Can't it be shortened (maybe like below) without losing any of the vital information? errmsg("skipping sync of slot \"%s\"", ...) errdetail("A user-created slot with the same name already exists on the standby.") ~~~ 0.4 + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("exiting from slot synchronization due to bad configuration"), + /* translator: second %s is a GUC variable name */ + errdetail("The primary slot \"%s\" specified by %s is not valid.", + PrimarySlotName, "primary_slot_name"))); /The primary slot/The primary server slot/ ~~~ 0.5 + ereport(ERROR, + (errmsg("could not fetch primary_slot_name \"%s\" info from the " + "primary: %s", PrimarySlotName, res->err))); /primary:/primary server:/ ~~~ 0.6 The continuations for long lines are inconsistent. Sometimes there are trailing spaces and sometimes there are leading spaces. And sometimes there are both at the same time which would cause double-spacing in the message! Please make them all the same. I think using leading spaces is easier but YMMV. e.g. + elog(ERROR, + "not synchronizing local slot \"%s\" LSN(%X/%X)" + " to remote slot's LSN(%X/%X) as synchronization " + " would move it backwards", remote_slot->name, + LSN_FORMAT_ARGS(slot->data.restart_lsn), + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); ====== src/backend/replication/logical/slotsync.c 1. 
check_primary_info + /* No need to check further, return that we are cascading standby */ + if (remote_in_recovery) + { + *am_cascading_standby = true; + ExecClearTuple(tupslot); + walrcv_clear_result(res); + CommitTransactionCommand(); + return; + } + + valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); + Assert(!isnull); + + if (!valid) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("exiting from slot synchronization due to bad configuration"), + /* translator: second %s is a GUC variable name */ + errdetail("The primary slot \"%s\" specified by %s is not valid.", + PrimarySlotName, "primary_slot_name"))); + ExecClearTuple(tupslot); + walrcv_clear_result(res); + CommitTransactionCommand(); +} Now that there is a common cleanup/return code this function be reduced further like below: SUGGESTION if (remote_in_recovery) { /* No need to check further, return that we are cascading standby */ *am_cascading_standby = true; } else { /* We are a normal standby. */ valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); Assert(!isnull); if (!valid) ... } ExecClearTuple(tupslot); walrcv_clear_result(res); CommitTransactionCommand(); } ~~~ 2. ReplSlotSyncWorkerMain + /* + * One can promote the standby and we can no longer be a cascading + * standby. So recheck here. + */ + if (am_cascading_standby) + check_primary_info(wrconn, &am_cascading_standby); Minor rewording of that new comment. SUGGESTION If the standby was promoted then what was previously a cascading standby might no longer be one, so recheck each time. ====== src/test/recovery/t/050_verify_slot_order.pl 3. +################################################## +# Test that a synchronized slot can not be decoded, altered and dropped by the user +################################################## /and dropped/or dropped/ ~~~ 4. + +($result, $stdout, $stderr) = $standby1->psql( + 'postgres', + qq[ALTER_REPLICATION_SLOT lsub1_slot (failover);], + replication => 'database'); +ok($stderr =~ /ERROR: cannot alter replication slot "lsub1_slot"/, + "synced slot on standby cannot be altered"); + Add a comment for this test part SUGGESTION Attempting to alter a synced slot should result in an error ~~~ 5. IMO it would be better if the tests were done in the same order mentioned in the comment. So either change the tests or change the comment. ====== Kind Regards, Peter Smith. Fujitsu Australia
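For reference, the "new style" mentioned in comment 0.1 simply drops the extra parentheses around the auxiliary calls; a small sketch using one of the messages quoted above (the wrapper function is only for illustration):

```
#include "postgres.h"

/*
 * New-style ereport (PostgreSQL 12 and later): the errcode()/errmsg()/
 * errdetail() calls no longer need an extra set of parentheses.
 */
static void
report_cannot_drop_synced_slot(const char *name)
{
	ereport(ERROR,
			errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
			errmsg("cannot drop replication slot \"%s\"", name),
			errdetail("This slot is being synced from the primary server."));
}
```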
Dear Amit, Shveta, > > walsender.c > > > > 01. WalSndWaitForStandbyConfirmation > > > > ``` > > + sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp()); > > ``` > > > > It works well, but I'm not sure whether we should use > WalSndComputeSleeptime() > > because the function won't be called by walsender. > > > > I don't think it is correct to use this function because it is > walsender specific, for example, it uses 'last_reply_timestamp' which > won't be even initialized in the backend environment. We need to > probably use a different logic for sleep here or need to use a > hard-coded value. Oh, you are right. I haven't look until the func. > I think we should change the name of functions like > WalSndWaitForStandbyConfirmation() as they are no longer used by > walsender. IIRC, earlier, we had a common logic to wait from both > walsender and SQL APIs which led to this naming but that is no longer > true with the latest patch. How about "WaitForStandbyConfirmation", which is simpler? There are some functions like "WaitForParallelWorkersToFinish", "WaitForProcSignalBarrier" and so on. > > 02.WalSndWaitForStandbyConfirmation > > > > ``` > > + ConditionVariableTimedSleep(&WalSndCtl->wal_confirm_rcv_cv, > sleeptime, > > + > WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION) > > ``` > > > > Hmm, is it OK to use the same event as WalSndWaitForWal()? IIUC it should be > avoided. > > > > Agreed. So, how about using > WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION > so that we can use it both from the backend and walsender? Seems right. Note again that a description of .txt file must be also fixed. Anyway, further comments on v50-0001. ~~~~~ protocol.sgml 01. create_replication_slot ``` + <varlistentry> + <term><literal>FAILOVER { 'true' | 'false' }</literal></term> + <listitem> + <para> + If true, the slot is enabled to be synced to the physical + standbys so that logical replication can be resumed after failover. + </para> + </listitem> + </varlistentry> ``` IIUC, the true/false is optional. libpqwalreceiver does not add the boolean. Also you can follow the notation of `TWO_PHASE`. 02. alter_replication_slot ``` + <variablelist> + <varlistentry> + <term><literal>FAILOVER { 'true' | 'false' }</literal></term> + <listitem> + <para> + If true, the slot is enabled to be synced to the physical + standbys so that logical replication can be resumed after failover. + </para> + </listitem> + </varlistentry> + </variablelist> ``` Apart from above, this boolean is mandatory, right? But you can follow other notation. ~~~~~~~ slot.c 03. validate_standby_slots ``` + /* Need a modifiable copy of string. */ ... + /* Verify syntax and parse string into a list of identifiers. */ ``` Unnecessary comma? 04. validate_standby_slots ``` + if (!ok || !ReplicationSlotCtl) + { + pfree(rawname); + list_free(elemlist); + return ok; + } ``` It may be more efficient to exit earlier when ReplicationSlotCtl is NULL. ~~~~~~~ walsender.c 05. PhysicalWakeupLogicalWalSnd ``` +/* + * Wake up the logical walsender processes with failover-enabled slots if the + * physical slot of the current walsender is specified in standby_slot_names + * GUC. + */ +void +PhysicalWakeupLogicalWalSnd(void) ``` The function can be called from backend processes, but you said "the current walsender" in the comment. 06. 
WalSndRereadConfigAndReInitSlotList ``` + char *pre_standby_slot_names; + + ProcessConfigFile(PGC_SIGHUP); + + /* + * If we are running on a standby, there is no need to reload + * standby_slot_names since we do not support syncing slots to cascading + * standbys. + */ + if (RecoveryInProgress()) + return; + + pre_standby_slot_names = pstrdup(standby_slot_names); ``` I felt that we must preserve pre_standby_slot_names before calling ProcessConfigFile(). 07. WalSndFilterStandbySlots I felt the prefix "WalSnd" may not be needed because both backend processes and walsender will call the function. Best Regards, Hayato Kuroda FUJITSU LIMITED
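As a rough illustration of the validate_standby_slots() discussion (comments 03 and 04), here is a minimal sketch of a GUC check hook for a comma-separated list of slot names. Only the syntax check is shown, which has to run even when ReplicationSlotCtl has not been set up yet; the hook name is hypothetical.

```
#include "postgres.h"

#include "nodes/pg_list.h"
#include "utils/guc.h"
#include "utils/varlena.h"

static bool
check_slot_name_list(char **newval, void **extra, GucSource source)
{
	char	   *rawname;
	List	   *elemlist;
	bool		ok;

	/* Need a modifiable copy of the string */
	rawname = pstrdup(*newval);

	/* Verify syntax and parse the string into a list of identifiers */
	ok = SplitIdentifierString(rawname, ',', &elemlist);
	if (!ok)
		GUC_check_errdetail("List syntax is invalid.");

	pfree(rawname);
	list_free(elemlist);

	return ok;
}
```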
On Wed, Dec 20, 2023 at 9:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Dec 19, 2023 at 5:30 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > Thanks for reviewing. I have addressed these in v50. > > > > I was looking at this patch to see if something smaller could be > independently committable. I think we can extract > pg_get_slot_invalidation_cause() and commit it as that function could > be independently useful as well. What do you think? > Sure, forked another thread [1] [1]: https://www.postgresql.org/message-id/CAJpy0uBpr0ym12%2B0mXpjcRFA6N%3DanX%2BYk9aGU4EJhHNu%3DfWykQ%40mail.gmail.com thanks Shveta
On Wed, Dec 20, 2023 at 3:29 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Wed, Dec 20, 2023 at 9:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Dec 19, 2023 at 5:30 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > Thanks for reviewing. I have addressed these in v50. > > > > > > > I was looking at this patch to see if something smaller could be > > independently committable. I think we can extract > > pg_get_slot_invalidation_cause() and commit it as that function could > > be independently useful as well. What do you think? > > > > Sure, forked another thread [1] > [1]: https://www.postgresql.org/message-id/CAJpy0uBpr0ym12%2B0mXpjcRFA6N%3DanX%2BYk9aGU4EJhHNu%3DfWykQ%40mail.gmail.com > Thanks, thinking more, we can split the patch into the following three patches which can be committed separately: (a) allowing the failover property to be set for a slot via SQL API and subscription commands, (b) the sync slot worker infrastructure, and (c) the GUC standby_slot_names and the corresponding wait logic on the server side. Thoughts? -- With Regards, Amit Kapila.
On Wednesday, December 20, 2023 4:03 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote: Hi, > > Dear Amit, Shveta, > > > > walsender.c > > > > > > 01. WalSndWaitForStandbyConfirmation > > > > > > ``` > > > + sleeptime = > WalSndComputeSleeptime(GetCurrentTimestamp()); > > > ``` > > > > > > It works well, but I'm not sure whether we should use > > WalSndComputeSleeptime() > > > because the function won't be called by walsender. > > > > > > > I don't think it is correct to use this function because it is > > walsender specific, for example, it uses 'last_reply_timestamp' which > > won't be even initialized in the backend environment. We need to > > probably use a different logic for sleep here or need to use a > > hard-coded value. > > Oh, you are right. I haven't look until the func. > > > I think we should change the name of functions like > > WalSndWaitForStandbyConfirmation() as they are no longer used by > > walsender. IIRC, earlier, we had a common logic to wait from both > > walsender and SQL APIs which led to this naming but that is no longer > > true with the latest patch. > > How about "WaitForStandbyConfirmation", which is simpler? There are some > functions like "WaitForParallelWorkersToFinish", "WaitForProcSignalBarrier" > and so on. Thanks for the comments. I think WaitForStandbyConfirmation is OK. And I removed the WalSnd prefix for these functions and move them to slot.c where the standby_slot_names is declared. > > > > 02.WalSndWaitForStandbyConfirmation > > > > > > ``` > > > + ConditionVariableTimedSleep(&WalSndCtl->wal_confirm_rcv_cv, > > sleeptime, > > > + > > WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION) > > > ``` > > > > > > Hmm, is it OK to use the same event as WalSndWaitForWal()? IIUC it > > > should be > > avoided. > > > > > > > Agreed. So, how about using > > WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION > > so that we can use it both from the backend and walsender? > > Seems right. Note again that a description of .txt file must be also fixed. Changed. > > Anyway, further comments on v50-0001. > > ~~~~~ > protocol.sgml > > 01. create_replication_slot > > ``` > + <varlistentry> > + <term><literal>FAILOVER { 'true' | 'false' }</literal></term> > + <listitem> > + <para> > + If true, the slot is enabled to be synced to the physical > + standbys so that logical replication can be resumed after failover. > + </para> > + </listitem> > + </varlistentry> > ``` > > IIUC, the true/false is optional. libpqwalreceiver does not add the boolean. > Also you can follow the notation of `TWO_PHASE`. Changed. > > 02. alter_replication_slot > > ``` > + <variablelist> > + <varlistentry> > + <term><literal>FAILOVER { 'true' | 'false' }</literal></term> > + <listitem> > + <para> > + If true, the slot is enabled to be synced to the physical > + standbys so that logical replication can be resumed after failover. > + </para> > + </listitem> > + </varlistentry> > + </variablelist> > ``` > > Apart from above, this boolean is mandatory, right? > But you can follow other notation. > Right, changed it to optional to be consistent with others. > > ~~~~~~~ > slot.c > > 03. validate_standby_slots > > ``` > + /* Need a modifiable copy of string. */ > ... > + /* Verify syntax and parse string into a list of identifiers. */ > ``` > > Unnecessary comma? You mean comma or period ? I think the current style is OK. > > 04. 
validate_standby_slots > > ``` > + if (!ok || !ReplicationSlotCtl) > + { > + pfree(rawname); > + list_free(elemlist); > + return ok; > + } > ``` > > It may be more efficient to exit earlier when ReplicationSlotCtl is NULL. I think even if ReplicationSlotCtl is NULL, we still need to check the syntax of the slot names. > > ~~~~~~~ > walsender.c > > 05. PhysicalWakeupLogicalWalSnd > > ``` > +/* > + * Wake up the logical walsender processes with failover-enabled slots > +if the > + * physical slot of the current walsender is specified in > +standby_slot_names > + * GUC. > + */ > +void > +PhysicalWakeupLogicalWalSnd(void) > ``` > > The function can be called from backend processes, but you said "the current > walsender" > in the comment. Changed the words. > > 06. WalSndRereadConfigAndReInitSlotList > > ``` > + char *pre_standby_slot_names; > + > + ProcessConfigFile(PGC_SIGHUP); > + > + /* > + * If we are running on a standby, there is no need to reload > + * standby_slot_names since we do not support syncing slots to > cascading > + * standbys. > + */ > + if (RecoveryInProgress()) > + return; > + > + pre_standby_slot_names = pstrdup(standby_slot_names); > ``` > > I felt that we must preserve pre_standby_slot_names before calling > ProcessConfigFile(). > Good catch. Fixed. > > 07. WalSndFilterStandbySlots > > I felt the prefix "WalSnd" may not be needed because both backend processes > and walsender will call the function. Right, renamed. Attach the V51 patch set which addressed Kuroda-san's comments. I also tried to improve the test in 0003 to make it stable. Best Regards, Hou zj
Attachment
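Regarding comment 05 above (PhysicalWakeupLogicalWalSnd being callable from backends as well as walsenders), the wakeup side boils down to a broadcast on the shared condition variable, and nothing about it is walsender-specific. A minimal sketch, where SlotIsInStandbySlotNames() and the cv argument stand in for the patch's own names:

```
#include "postgres.h"

#include "replication/slot.h"
#include "storage/condition_variable.h"

/* Hypothetical: is this physical slot listed in standby_slot_names? */
extern bool SlotIsInStandbySlotNames(const char *slotname);

/*
 * Once a physical standby confirms receipt of WAL, broadcast on the
 * condition variable that waiting logical walsenders (or backends) are
 * sleeping on.
 */
static void
WakeupConfirmationWaitersSketch(ReplicationSlot *physical_slot,
								ConditionVariable *cv)
{
	if (SlotIsInStandbySlotNames(NameStr(physical_slot->data.name)))
		ConditionVariableBroadcast(cv);
}
```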
On Tuesday, December 19, 2023 9:05 PM Kuroda, Hayato/黒田 隼人 <kuroda.hayato@fujitsu.com> wrote: > > Dear Shveta, > > I resumed to review the patch. I will play more about it, but I can post some > cosmetic comments. Thanks for the comments. > > ==== > walsender.c > > 01. WalSndWaitForStandbyConfirmation > > ``` > + sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp()); > ``` > > It works well, but I'm not sure whether we should use > WalSndComputeSleeptime() > because the function won't be called by walsender. Changed to a hard-coded value. > > 02.WalSndWaitForStandbyConfirmation > > ``` > + ConditionVariableTimedSleep(&WalSndCtl->wal_confirm_rcv_cv, > sleeptime, > + > WAIT_EVENT_WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION) > ``` > > Hmm, is it OK to use the same event as WalSndWaitForWal()? IIUC it should be > avoided. As discussed, I change the event name to a more common one, so that it makes sense to use it in both places. > > 03. WalSndShmemInit() > > ``` > + > + ConditionVariableInit(&WalSndCtl->wal_confirm_rcv_cv); > ``` > > Unnecessary blank? Removed. > > ~~~~~ > 050_standby_failover_slots_sync.pl > > 04. General > > My pgperltidy modified your test. Please check. Will run this in next version. > > 05. > > ``` > # Create publication on the primary > ``` > > Missing "a" before publication? Changed. > > 06. > > ``` > $subscriber1->init(allows_streaming => 'logical'); > ... > $subscriber2->init(allows_streaming => 'logical'); > ``` > > IIUC, these settings are not needed. Yeah, removed. > > 07. > > ``` > my $primary_insert_time = time(); > ``` > > The variable is not used. Removed. > > 08. > > ``` > # Stop the standby associated with the specified physical replication slot so > # that the logical replication slot won't receive changes until the standby > # slot's restart_lsn is advanced or the slot is removed from the > # standby_slot_names list > ``` > > Missing comma? Added. > > 09. > > ``` > $back_q->query_until(qr//, > "SELECT pg_logical_slot_get_changes('test_slot', NULL, NULL);\n"); > ``` > > Not sure, should we have to close the back_q connection? Added the quit. > > 10. > > ``` > # Remove the standby from the standby_slot_names list and reload the > # configuration > $primary->adjust_conf('postgresql.conf', 'standby_slot_names', "''"); > $primary->psql('postgres', "SELECT pg_reload_conf()"); > ``` > a. > Missing comma? > > b. > I counted and reload function in perl (e.g., `$primary->reload;`) is more often > to > be used. Do you have a reason to use pg_reload_conf()? I think it was copied from other places, changed to ->reload. > > 11. > > ``` > # Now that the standby lsn has advanced, the primary must send the decoded > # changes to the subscription. > $publisher->wait_for_catchup('regress_mysub1'); > ``` > > Is the comment correct? I think primary sends data because the GUC is > modified. Fixed. > > 12. > > ``` > # Put the standby back on the primary_slot_name for the rest of the tests > $primary->adjust_conf('postgresql.conf', 'standby_slot_names', 'sb1_slot'); > $primary->restart(); > ``` > > Just to confirm - you used restart() here because we must ensure the GUC > change is > propagated to all backends, right? Yes, but I think restart is not necessary, so I changed it to reload. > > ~~~~~ > wait_event_names.txt > > 13. > > ``` > +WAL_SENDER_WAIT_FOR_STANDBY_CONFIRMATION "Waiting for the > WAL to be received by physical standby in WAL sender process." > ``` > > But there is a possibility that backend processes may wait with the event, right? Adjusted. Best Regards, Hou zj
On Wednesday, December 20, 2023 8:42 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Attach the V51 patch set which addressed Kuroda-san's comments. > I also tried to improve the test in 0003 to make it stable. The patches conflict with a recent commit dc21234. Here is the rebased V51_2 version; there are no code changes in this version. Best Regards, Hou zj
Attachment
Here is a minor comment for v51-0001 ====== src/backend/replication/slot.c 1. +void +RereadConfigAndReInitSlotList(List **standby_slots) +{ + char *pre_standby_slot_names; + + /* + * If we are running on a standby, there is no need to reload + * standby_slot_names since we do not support syncing slots to cascading + * standbys. + */ + if (RecoveryInProgress()) + { + ProcessConfigFile(PGC_SIGHUP); + return; + } + + pre_standby_slot_names = pstrdup(standby_slot_names); + + ProcessConfigFile(PGC_SIGHUP); + + if (strcmp(pre_standby_slot_names, standby_slot_names) != 0) + { + list_free(*standby_slots); + *standby_slots = GetStandbySlotList(true); + } + + pfree(pre_standby_slot_names); +} Consider below, which seems a simpler way to do that but with just one return point and without duplicating the ProcessConfigFile calls: SUGGESTION { char *pre_standby_slot_names = pstrdup(standby_slot_names); ProcessConfigFile(PGC_SIGHUP); if (!RecoveryInProgress()) { if (strcmp(pre_standby_slot_names, standby_slot_names) != 0) { list_free(*standby_slots); *standby_slots = GetStandbySlotList(true); } } pfree(pre_standby_slot_names); } ====== Kind Regards, Peter Smith. Fujitsu Australia
On Thursday, December 21, 2023 12:25 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here is a minor comment for v51-0001 > > ====== > src/backend/replication/slot.c > > 1. > +void > +RereadConfigAndReInitSlotList(List **standby_slots) { > + char *pre_standby_slot_names; > + > + /* > + * If we are running on a standby, there is no need to reload > + * standby_slot_names since we do not support syncing slots to > + cascading > + * standbys. > + */ > + if (RecoveryInProgress()) > + { > + ProcessConfigFile(PGC_SIGHUP); > + return; > + } > + > + pre_standby_slot_names = pstrdup(standby_slot_names); > + > + ProcessConfigFile(PGC_SIGHUP); > + > + if (strcmp(pre_standby_slot_names, standby_slot_names) != 0) { > + list_free(*standby_slots); *standby_slots = GetStandbySlotList(true); > + } > + > + pfree(pre_standby_slot_names); > +} > > Consider below, which seems a simpler way to do that but with just one return > point and without duplicating the ProcessConfigFile calls: > > SUGGESTION > { > char *pre_standby_slot_names = pstrdup(standby_slot_names); > > ProcessConfigFile(PGC_SIGHUP); > > if (!RecoveryInProgress()) > { > if (strcmp(pre_standby_slot_names, standby_slot_names) != 0) > { > list_free(*standby_slots); > *standby_slots = GetStandbySlotList(true); > } > } > > pfree(pre_standby_slot_names); > } Thanks for the suggestion. I also thought about this, but I'd like to avoid allocating/freeing memory for the pre_standby_slot_names if not needed. Best Regards, Hou zj
On Thu, Dec 21, 2023 at 11:30 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thursday, December 21, 2023 12:25 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Here is a minor comment for v51-0001 > > > > ====== > > src/backend/replication/slot.c > > > > 1. > > +void > > +RereadConfigAndReInitSlotList(List **standby_slots) { > > + char *pre_standby_slot_names; > > + > > + /* > > + * If we are running on a standby, there is no need to reload > > + * standby_slot_names since we do not support syncing slots to > > + cascading > > + * standbys. > > + */ > > + if (RecoveryInProgress()) > > + { > > + ProcessConfigFile(PGC_SIGHUP); > > + return; > > + } > > + > > + pre_standby_slot_names = pstrdup(standby_slot_names); > > + > > + ProcessConfigFile(PGC_SIGHUP); > > + > > + if (strcmp(pre_standby_slot_names, standby_slot_names) != 0) { > > + list_free(*standby_slots); *standby_slots = GetStandbySlotList(true); > > + } > > + > > + pfree(pre_standby_slot_names); > > +} > > > > Consider below, which seems a simpler way to do that but with just one return > > point and without duplicating the ProcessConfigFile calls: > > > > SUGGESTION > > { > > char *pre_standby_slot_names = pstrdup(standby_slot_names); > > > > ProcessConfigFile(PGC_SIGHUP); > > > > if (!RecoveryInProgress()) > > { > > if (strcmp(pre_standby_slot_names, standby_slot_names) != 0) > > { > > list_free(*standby_slots); > > *standby_slots = GetStandbySlotList(true); > > } > > } > > > > pfree(pre_standby_slot_names); > > } > > Thanks for the suggestion. I also thought about this, but I'd like to avoid > allocating/freeing memory for the pre_standby_slot_names if not needed. > > Best Regards, > Hou zj > > PFA v52. Changes are: 1) Addressed comments given for v48-002 in [1] and v50-002 in [2] 2) Merged patch003 (test improvement) to patch002 itself. 3) Restructured code around ReplicationSlotDrop to remove extra arg 'user_cmd' 4) Fixed a bug wherein promotion flow was breaking. The pid of slot-sync worker was nullified in slotsync_worker_onexit() before the worker can release the acquired slot in ReplicationSlotShmemExit(). Due to this, the startup process which relies on worker's pid tried to drop the 'i' state slots assuming the slot sync worker has stopped whereas the slot sync worker was trying to modify the slot concurrently, resulting into the problem. This was due to the fact that slotsync_worker_onexit() was registered with before_shmem_exit(). It should instead be registered using on_shmem_exit(). Corrected it now. Thanks Hou-San for working on this. [1]: https://www.postgresql.org/message-id/CAA4eK1J5zTmm4NE4os59WgU4AZPNb74X-n67pY8SkoDfzsN_jA%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAHut%2BPvocO_bwwz7kD-4mLnFRCLOK3i0ocLyGDvLQKzkhzEjTg%40mail.gmail.com
Attachment
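For readers following item 4) above, the fix relies on the documented ordering of shared-memory exit callbacks: all before_shmem_exit() callbacks, including ReplicationSlotShmemExit() which releases any slot the process still holds, run before the on_shmem_exit() callbacks, so clearing the worker's pid belongs in an on_shmem_exit() callback. A minimal sketch of the registration; the callback body is elided and the function names only loosely mirror the patch.

```
#include "postgres.h"

#include "storage/ipc.h"

static void
slotsync_worker_onexit_sketch(int code, Datum arg)
{
	/* clear SlotSyncWorker->pid under its spinlock (omitted in this sketch) */
}

static void
register_slotsync_cleanup_sketch(void)
{
	/* runs only after all before_shmem_exit() callbacks have completed */
	on_shmem_exit(slotsync_worker_onexit_sketch, (Datum) 0);
}
```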
On Wed, Dec 20, 2023 at 12:02 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some comments for the patch v50-0002. Thank You for the feedback. I have addressed these in v52. > ====== > GENERAL > > (I made a short study of all the ereports in this patch -- here are > some findings) > > ~~~ > > 0.1 Don't need the parentheses. > > Checking all the ereports I see that half of them have the redundant > parentheses and half of them do not; You might as well make them all > use the new style where the extra parentheses are not needed. > > e.g. > + ereport(LOG, > + (errmsg("skipping slot synchronization"), > + errdetail("enable_syncslot is disabled."))); > > e.g. > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot drop replication slot \"%s\"", name), > + errdetail("This slot is being synced from the primary server."))); > > and many more like this. Search for all the ereports. > > ~~~ > > 0.2 > + ereport(LOG, > + (errmsg("dropped replication slot \"%s\" of dbid %d as it " > + "was not sync-ready", NameStr(s->data.name), > + s->data.database))); > > I felt maybe that could be: > > errmsg("dropped replication slot \"%s\" of dbid %d", ... > errdetail("It was not sync-ready.") > > (now this shares the same errmsg with another ereport) > > ~~~ > > 0.3. > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("skipping sync of slot \"%s\" as it is a user created" > + " slot", remote_slot->name), > + errdetail("This slot has failover enabled on the primary and" > + " thus is sync candidate but user created slot with" > + " the same name already exists on the standby."))); > > This seemed too wordy. Can't it be shortened (maybe like below) > without losing any of the vital information? > > errmsg("skipping sync of slot \"%s\"", ...) > errdetail("A user-created slot with the same name already exists on > the standby.") I have modified it a little bit more. Please see now. I wanted to add the info that slot-sync worker is exiting instead of skipping a slot and that the concerned slot is a failover slot on primary. These were the other comments around the same. > ~~~ > > 0.4 > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("exiting from slot synchronization due to bad configuration"), > + /* translator: second %s is a GUC variable name */ > + errdetail("The primary slot \"%s\" specified by %s is not valid.", > + PrimarySlotName, "primary_slot_name"))); > > /The primary slot/The primary server slot/ > > ~~~ > > 0.5 > + ereport(ERROR, > + (errmsg("could not fetch primary_slot_name \"%s\" info from the " > + "primary: %s", PrimarySlotName, res->err))); > > /primary:/primary server:/ > > ~~~ > > 0.6 > The continuations for long lines are inconsistent. Sometimes there are > trailing spaces and sometimes there are leading spaces. And sometimes > there are both at the same time which would cause double-spacing in > the message! Please make them all the same. I think using leading > spaces is easier but YMMV. > > e.g. > + elog(ERROR, > + "not synchronizing local slot \"%s\" LSN(%X/%X)" > + " to remote slot's LSN(%X/%X) as synchronization " > + " would move it backwards", remote_slot->name, > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > > ====== > src/backend/replication/logical/slotsync.c > > 1. 
check_primary_info > > + /* No need to check further, return that we are cascading standby */ > + if (remote_in_recovery) > + { > + *am_cascading_standby = true; > + ExecClearTuple(tupslot); > + walrcv_clear_result(res); > + CommitTransactionCommand(); > + return; > + } > + > + valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > + Assert(!isnull); > + > + if (!valid) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("exiting from slot synchronization due to bad configuration"), > + /* translator: second %s is a GUC variable name */ > + errdetail("The primary slot \"%s\" specified by %s is not valid.", > + PrimarySlotName, "primary_slot_name"))); > + ExecClearTuple(tupslot); > + walrcv_clear_result(res); > + CommitTransactionCommand(); > +} > > Now that there is a common cleanup/return code this function be > reduced further like below: > > SUGGESTION > > if (remote_in_recovery) > { > /* No need to check further, return that we are cascading standby */ > *am_cascading_standby = true; > } > else > { > /* We are a normal standby. */ > > valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > Assert(!isnull); > > if (!valid) > ... > } > > ExecClearTuple(tupslot); > walrcv_clear_result(res); > CommitTransactionCommand(); > } > > ~~~ > > 2. ReplSlotSyncWorkerMain > > + /* > + * One can promote the standby and we can no longer be a cascading > + * standby. So recheck here. > + */ > + if (am_cascading_standby) > + check_primary_info(wrconn, &am_cascading_standby); > > Minor rewording of that new comment. > > SUGGESTION > If the standby was promoted then what was previously a cascading > standby might no longer be one, so recheck each time. > > ====== > src/test/recovery/t/050_verify_slot_order.pl > > 3. > +################################################## > +# Test that a synchronized slot can not be decoded, altered and > dropped by the user > +################################################## > > /and dropped/or dropped/ > > ~~~ > > 4. > + > +($result, $stdout, $stderr) = $standby1->psql( > + 'postgres', > + qq[ALTER_REPLICATION_SLOT lsub1_slot (failover);], > + replication => 'database'); > +ok($stderr =~ /ERROR: cannot alter replication slot "lsub1_slot"/, > + "synced slot on standby cannot be altered"); > + > > Add a comment for this test part > > SUGGESTION > Attempting to alter a synced slot should result in an error > > ~~~ > > 5. > IMO it would be better if the tests were done in the same order > mentioned in the comment. So either change the tests or change the > comment. > > ====== > Kind Regards, > Peter Smith. > Fujitsu Australia
Hi, On Thu, Dec 21, 2023 at 02:23:12AM +0000, Zhijie Hou (Fujitsu) wrote: > On Wednesday, December 20, 2023 8:42 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > > Attach the V51 patch set which addressed Kuroda-san's comments. > > I also tried to improve the test in 0003 to make it stable. > > The patches conflict with a recent commit dc21234. > Here is the rebased V51_2 version, there is no code changes in this version. > Thanks! I've a few remarks regarding 0001: 1 === In the commit message what about replacing "Allow logical walsenders to wait for the physical standbys" with "Force some logical walsenders to wait for the physical standbys"? Also I think it would be better to first explain what we are trying to achieve and after explain how we do it (adding a new flag in CREATE SUBSCRIPTION and so on). 2 === + <listitem> + <para> + List of physical replication slots that logical replication slots with + failover enabled waits for. Worth to add a few words about what we are actually waiting for? 3 === + ereport(ERROR, + (errcode(ERRCODE_PROTOCOL_VIOLATION), + errmsg("could not alter replication slot \"%s\" on publisher: %s", + slotname, pchomp(PQerrorMessage(conn->streamConn))))); should we mention "on publisher" here, what about removing the word "publisher"? 4 === @@ -248,10 +262,13 @@ ReplicationSlotValidateName(const char *name, int elevel) * during getting changes, if the two_phase option is enabled it can skip * prepare because by that time start decoding point has been moved. So the * user will only get commit prepared. + * failover: If enabled, allows the slot to be synced to physical standbys so + * that logical replication can be resumed after failover. s/allows/forces ? 5 === + bool ok; parse_ok maybe? 6 === + /* Need a modifiable copy of string. */ + rawname = pstrdup(*newval); It seems to me that the single line comments in the neighborhood functions (see RestoreSlotFromDisk() for example) don't finish with ".". Worth to follow the same format for all what we add in slot.c? 7 === +static void +parseAlterReplSlotOptions(AlterReplicationSlotCmd *cmd, bool *failover) ParseAlterReplSlotOptions instead? 8 === + * We do not need to change the failover to false if the server + * does not support failover (e.g. pre-PG17) Missing "." at the end. 9 === + * See comments above for twophasestate, same holds true for + * 'failover' Missing "." at the end. 10 === +++ b/src/include/replication/walsender.h @@ -12,6 +12,8 @@ #ifndef _WALSENDER_H #define _WALSENDER_H +#include "access/xlogdefs.h" Is this include needed? 11 === + * When the wait event is WAIT_FOR_STANDBY_CONFIRMATION, wait on another + * CV that is woken up by physical walsenders when the walreceiver has + * confirmed the receipt of LSN. s/that is woken up by/that is broadcasted by/ ? 12 === We are mentioning in several places that the replication can be resumed after a failover. Should we add a few words about possible lag? (see [1]) [1]: https://www.postgresql.org/message-id/CAA4eK1KihniOK21mEVYtSOHRQiGNyToUmENWp7hPbH_PMsqzkA%40mail.gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Dear Shveta, Thanks for updating the patch! Here is my comments for v52-0002. ~~~~~ system-views.sgml 01. ``` + + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>sync_state</structfield> <type>char</type> + </para> + <para> + Defines slot synchronization state. This is meaningful on the physical + standby which has configured <xref linkend="guc-enable-syncslot"/> = true. + Possible values are: + <itemizedlist> + <listitem> + <para><literal>n</literal> = none for user created slots, ... ``` Hmm. I'm not sure why we must show a single character to a user. I'm OK for pg_subscription.srsubstate because it is a "catalog" - the actual value would be recorded in the heap. But pg_replication_slot is just a view so that we can replace internal representations to other strings. E.g., pg_replication_slots.wal_status. How about using {none, initialized, ready} or something? ~~~~~ postmaster.c 02. bgworker_should_start_now ``` + if (start_time == BgWorkerStart_ConsistentState_HotStandby && + pmState != PM_RUN) + return true; ``` I'm not sure the second condition is really needed. The line will be executed when pmState is PM_HOT_STANDBY. Is there a possibility that pmState is changed around here? ~~~~~ libpqwalreceiver.c 03. PQWalReceiverFunctions ``` + .walrcv_get_dbname_from_conninfo = libpqrcv_get_dbname_from_conninfo, ``` Just to confirm - is there a rule for ordering? ~~~~~ slotsync.c 04. SlotSyncWorkerCtx ``` typedef struct SlotSyncWorkerCtx { pid_t pid; slock_t mutex; } SlotSyncWorkerCtx; SlotSyncWorkerCtx *SlotSyncWorker = NULL; ``` Per other files like launcher.c, should we use a name like "SlotSyncWorkerCtxStruct"? 05. SlotSyncWorkerRegister() Your coding will work well, but there is another approach which validates slotsync parameters here. In this case, the postmaster should exit ASAP. This can notify that there are some wrong settings to users earlier. Thought? 06. wait_for_primary_slot_catchup ``` + CHECK_FOR_INTERRUPTS(); + + /* Handle any termination request if any */ + ProcessSlotSyncInterrupts(wrconn); ``` ProcessSlotSyncInterrupts() also has CHECK_FOR_INTERRUPTS(), so no need to call. 07. wait_for_primary_slot_catchup ``` + /* + * XXX: Is waiting for 2 seconds before retrying enough or more or + * less? + */ + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + 2000L, + WAIT_EVENT_REPL_SLOTSYNC_PRIMARY_CATCHUP); + + ResetLatch(MyLatch); + + /* Emergency bailout if postmaster has died */ + if (rc & WL_POSTMASTER_DEATH) + proc_exit(1); ``` Is there any reasons not to use WL_EXIT_ON_PM_DEATH event? If not, you can use. 08. synchronize_slots ``` + SpinLockAcquire(&WalRcv->mutex); + if (!WalRcv || + (WalRcv->slotname[0] == '\0') || + XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) + { ... ``` Assuming that WalRcv is still NULL. In this case, does the first SpinLockAcquire() lead a segmentation fault? 09. synchronize_slots ``` + elog(DEBUG2, "slot sync worker's query:%s \n", s.data); ``` The query is not dynamical one, so I think no need to print even if the debug mode. 10. synchronize_one_slot IIUC, this function can synchronize slots even if the used plugin on primary is not installed on the secondary server. If the slot is created by the slotsync worker, users will recognize it after the server is promoted and the decode is starting. I felt it is not good specification. Can we detect in the validation phase? ~~~~~ not the source code 11. I tested the typical case - promoting a publisher from a below diagram. 
A physical replication slot "physical" was specified as standby_slot_names. ``` node A (primary) --> node B (secondary) | | node C (subscriber) ``` And after the promoting, below lines were periodically output on logfiles for node B and C. ``` WARNING: replication slot "physical" specified in parameter "standby_slot_names" does not exist, ignoring ``` Do you have idea to suppress the warning? IIUC it is a normal behavior of the walsender so that we cannot avoid the periodical outputs. The steps of the test was as follows: 1. stop the node A via pg_ctl stop 2. promota the node B via pg_ctl promote 3. change the connection string of the subscription via ALTER SUBSCRIPTION ... CONNECTION ... Best Regards, Hayato Kuroda FUJITSU LIMITED
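On comment 08, the point is simply that WalRcv must be tested before its spinlock is touched; a minimal sketch of the NULL-safe ordering (the helper name is made up):

```
#include "postgres.h"

#include "access/xlogdefs.h"
#include "replication/walreceiver.h"
#include "storage/spin.h"

/*
 * Test WalRcv itself before touching WalRcv->mutex, otherwise the
 * SpinLockAcquire() would dereference a NULL pointer.  The fields are
 * copied out and examined after the lock is released.
 */
static bool
walreceiver_ready_sketch(XLogRecPtr *latest_wal_end)
{
	bool		have_slot;

	if (!WalRcv)
		return false;

	SpinLockAcquire(&WalRcv->mutex);
	have_slot = (WalRcv->slotname[0] != '\0');
	*latest_wal_end = WalRcv->latestWalEnd;
	SpinLockRelease(&WalRcv->mutex);

	return have_slot && !XLogRecPtrIsInvalid(*latest_wal_end);
}
```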
On Thursday, December 21, 2023 5:39 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Thu, Dec 21, 2023 at 02:23:12AM +0000, Zhijie Hou (Fujitsu) wrote: > > On Wednesday, December 20, 2023 8:42 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > > > Attach the V51 patch set which addressed Kuroda-san's comments. > > > I also tried to improve the test in 0003 to make it stable. > > > > The patches conflict with a recent commit dc21234. > > Here is the rebased V51_2 version, there is no code changes in this version. > > > > Thanks! > > I've a few remarks regarding 0001: Thanks for the comments! > > 1 === > > In the commit message what about replacing "Allow logical walsenders to wait > for the physical standbys" with "Force some logical walsenders to wait for the > physical standbys"? I feel 'Allow' is OK, as the GUC standby_slot_names is optional for user. ISTM, 'force' means we always wait for physical standbys regardless of the GUC. > > Also I think it would be better to first explain what we are trying to achieve and > after explain how we do it (adding a new flag in CREATE SUBSCRIPTION and so > on). Noted. We are about to split the patches, so will improve each commit message after that. > > 4 === > > @@ -248,10 +262,13 @@ ReplicationSlotValidateName(const char *name, int > elevel) > * during getting changes, if the two_phase option is enabled it can skip > * prepare because by that time start decoding point has been moved. So > the > * user will only get commit prepared. > + * failover: If enabled, allows the slot to be synced to physical standbys so > + * that logical replication can be resumed after failover. > > s/allows/forces ? I think whether the slot is synced also depends on the GUC setting on standby, so I feel 'allow' is fine here. > > 5 === > > + bool ok; > > parse_ok maybe? The flag is also used to store the slot type check result, so I feel 'ok' is better here. > > 6 === > > + /* Need a modifiable copy of string. */ > + rawname = pstrdup(*newval); > > It seems to me that the single line comments in the neighborhood functions > (see > RestoreSlotFromDisk() for example) don't finish with ".". Worth to follow the > same format for all what we add in slot.c? I felt we have both styles in slot.c, but it seems Kuroda-san also prefer removing the ".", so will address. > > 7 === > > +static void > +parseAlterReplSlotOptions(AlterReplicationSlotCmd *cmd, bool *failover) > > ParseAlterReplSlotOptions instead? I think it followed parseCreateReplSlotOptions, but I agree that it looks inconsistent with other names. Will address. > 11 === > > + * When the wait event is WAIT_FOR_STANDBY_CONFIRMATION, wait on > another > + * CV that is woken up by physical walsenders when the walreceiver has > + * confirmed the receipt of LSN. > > s/that is woken up by/that is broadcasted by/ ? Will reword the comment here. > > 12 === > > We are mentioning in several places that the replication can be resumed after a > failover. Should we add a few words about possible lag? (see [1]) > > [1]: > https://www.postgresql.org/message-id/CAA4eK1KihniOK21mEVYtSOHRQiG > NyToUmENWp7hPbH_PMsqzkA%40mail.gmail.com It feels like the implementation detail to me, but noted. We will think more about the document. The comments not mentioned above look good to me. Best Regards, Hou zj
On Fri, Dec 22, 2023 at 3:11 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thursday, December 21, 2023 5:39 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > > > On Thu, Dec 21, 2023 at 02:23:12AM +0000, Zhijie Hou (Fujitsu) wrote: > > > On Wednesday, December 20, 2023 8:42 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > Attach the V51 patch set which addressed Kuroda-san's comments. > > > > I also tried to improve the test in 0003 to make it stable. > > > > > > The patches conflict with a recent commit dc21234. > > > Here is the rebased V51_2 version, there is no code changes in this version. > > > > > > > Thanks! > > > > I've a few remarks regarding 0001: > > Thanks for the comments! > > > > > 1 === > > > > In the commit message what about replacing "Allow logical walsenders to wait > > for the physical standbys" with "Force some logical walsenders to wait for the > > physical standbys"? > > I feel 'Allow' is OK, as the GUC standby_slot_names is optional for user. ISTM, 'force' > means we always wait for physical standbys regardless of the GUC. > > > > > Also I think it would be better to first explain what we are trying to achieve and > > after explain how we do it (adding a new flag in CREATE SUBSCRIPTION and so > > on). > > Noted. We are about to split the patches, so will improve each commit message after that. > > > > > 4 === > > > > @@ -248,10 +262,13 @@ ReplicationSlotValidateName(const char *name, int > > elevel) > > * during getting changes, if the two_phase option is enabled it can skip > > * prepare because by that time start decoding point has been moved. So > > the > > * user will only get commit prepared. > > + * failover: If enabled, allows the slot to be synced to physical standbys so > > + * that logical replication can be resumed after failover. > > > > s/allows/forces ? > > I think whether the slot is synced also depends on the > GUC setting on standby, so I feel 'allow' is fine here. > > > > > 5 === > > > > + bool ok; > > > > parse_ok maybe? > > The flag is also used to store the slot type check result, so I feel 'ok' is > better here. > > > > > 6 === > > > > + /* Need a modifiable copy of string. */ > > + rawname = pstrdup(*newval); > > > > It seems to me that the single line comments in the neighborhood functions > > (see > > RestoreSlotFromDisk() for example) don't finish with ".". Worth to follow the > > same format for all what we add in slot.c? > > I felt we have both styles in slot.c, but it seems Kuroda-san also > prefer removing the ".", so will address. > > > > > 7 === > > > > +static void > > +parseAlterReplSlotOptions(AlterReplicationSlotCmd *cmd, bool *failover) > > > > ParseAlterReplSlotOptions instead? > > I think it followed parseCreateReplSlotOptions, but I agree that it looks > inconsistent with other names. Will address. > > > 11 === > > > > + * When the wait event is WAIT_FOR_STANDBY_CONFIRMATION, wait on > > another > > + * CV that is woken up by physical walsenders when the walreceiver has > > + * confirmed the receipt of LSN. > > > > s/that is woken up by/that is broadcasted by/ ? > > Will reword the comment here. > > > > > 12 === > > > > We are mentioning in several places that the replication can be resumed after a > > failover. Should we add a few words about possible lag? (see [1]) > > > > [1]: > > https://www.postgresql.org/message-id/CAA4eK1KihniOK21mEVYtSOHRQiG > > NyToUmENWp7hPbH_PMsqzkA%40mail.gmail.com > > It feels like the implementation detail to me, but noted. 
We will think more > about the document. > > > The comments not mentioned above look good to me. > > Best Regards, > Hou zj PFA v53. Changes are: patch001: 1) Addressed comments in [1] for v51-001. Thanks Hou-san for working on this. patch002: 2) Addressed comments in [2] for v52-002. 3) Fixed CFBot failure. The failure was caused by an assert in wait_for_primary_slot_catchup() for null confirmed_lsn received. In wait_for_primary_slot_catchup(), we had an assumption that if restart_lsn is valid and 'conflicting' is also false, then we must have non-null confirmed_lsn. But this is not true. It is possible to get null values for confirmed_lsn and catalog_xmin if on the primary server the slot is just created with a valid restart_lsn and slot-sync worker has fetched the slot before the primary server could set valid confirmed_lsn and catalog_xmin. In pg_create_logical_replication_slot(), there is a small window between CreateInitDecodingContext-->ReplicationSlotReserveWal() which sets restart_lsn and DecodingContextFindStartpoint() which sets confirmed_lsn. If the slot-sync worker fetches the slot in this window, confirmed_lsn received will be NULL. Corrected the code to remove assert and added one additional condition that confirmed_lsn should be valid before moving the slot to 'r'. [1]: https://www.postgresql.org/message-id/ZYQHvgBpH0GgQaJK%40ip-10-97-1-34.eu-west-3.compute.internal [2]: https://www.postgresql.org/message-id/TY3PR01MB98893274D5A4FD4F86CC04A0F595A%40TY3PR01MB9889.jpnprd01.prod.outlook.com thanks Shveta
Attachment
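To illustrate change 3) above, the guard amounts to treating a remote slot as sync-ready only once all of its positions have been set on the primary; a minimal sketch, with a stand-in struct rather than the patch's actual type:

```
#include "postgres.h"

#include "access/transam.h"
#include "access/xlogdefs.h"

/* Stand-in for the patch's own remote-slot representation. */
typedef struct RemoteSlotSketch
{
	XLogRecPtr	restart_lsn;
	XLogRecPtr	confirmed_lsn;
	TransactionId catalog_xmin;
	bool		conflicting;
} RemoteSlotSketch;

/*
 * A remote slot is sync-ready only once restart_lsn, confirmed_flush LSN
 * and catalog_xmin have all been set on the primary and the slot is not
 * conflicting.
 */
static bool
remote_slot_is_sync_ready(const RemoteSlotSketch *remote)
{
	return !remote->conflicting &&
		!XLogRecPtrIsInvalid(remote->restart_lsn) &&
		!XLogRecPtrIsInvalid(remote->confirmed_lsn) &&
		TransactionIdIsValid(remote->catalog_xmin);
}
```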
On Thu, Dec 21, 2023 at 6:37 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Shveta, > > Thanks for updating the patch! Here is my comments for v52-0002. Thanks for the feedback Kuroda-san. I have addressed these in v53. > ~~~~~ > system-views.sgml > > 01. > > ``` > + > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>sync_state</structfield> <type>char</type> > + </para> > + <para> > + Defines slot synchronization state. This is meaningful on the physical > + standby which has configured <xref linkend="guc-enable-syncslot"/> = true. > + Possible values are: > + <itemizedlist> > + <listitem> > + <para><literal>n</literal> = none for user created slots, > ... > ``` > > Hmm. I'm not sure why we must show a single character to a user. I'm OK for > pg_subscription.srsubstate because it is a "catalog" - the actual value would be > recorded in the heap. But pg_replication_slot is just a view so that we can replace > internal representations to other strings. E.g., pg_replication_slots.wal_status. > How about using {none, initialized, ready} or something? Done. > ~~~~~ > postmaster.c > > 02. bgworker_should_start_now > > ``` > + if (start_time == BgWorkerStart_ConsistentState_HotStandby && > + pmState != PM_RUN) > + return true; > ``` > > I'm not sure the second condition is really needed. The line will be executed when > pmState is PM_HOT_STANDBY. Is there a possibility that pmState is changed around here? 'case PM_RUN:' is a fall-through and thus we need to have this second condition under 'case PM_HOT_STANDBY' for BgWorkerStart_ConsistentState_HotStandby to avoid the worker getting started on non-standby. > ~~~~~ > libpqwalreceiver.c > > 03. PQWalReceiverFunctions > > ``` > + .walrcv_get_dbname_from_conninfo = libpqrcv_get_dbname_from_conninfo, > ``` > > Just to confirm - is there a rule for ordering? No, I think. I am not aware of any. > ~~~~~ > slotsync.c > > 04. SlotSyncWorkerCtx > > ``` > typedef struct SlotSyncWorkerCtx > { > pid_t pid; > slock_t mutex; > } SlotSyncWorkerCtx; > > SlotSyncWorkerCtx *SlotSyncWorker = NULL; > ``` > > Per other files like launcher.c, should we use a name like "SlotSyncWorkerCtxStruct"? Modified. > 05. SlotSyncWorkerRegister() > > Your coding will work well, but there is another approach which validates > slotsync parameters here. In this case, the postmaster should exit ASAP. This can > notify that there are some wrong settings to users earlier. Thought? I think the postmaster should not exit. IMO, slot-sync worker being a child process of postmaster, should not control start or exit of postmaster. The worker should only exit itself if slot-sync GUCs are not set. Have you seen any other case where postmaster exits if any of its bgworker processes has invalid GUCs? > 06. wait_for_primary_slot_catchup > > ``` > + CHECK_FOR_INTERRUPTS(); > + > + /* Handle any termination request if any */ > + ProcessSlotSyncInterrupts(wrconn); > ``` > > ProcessSlotSyncInterrupts() also has CHECK_FOR_INTERRUPTS(), so no need to call. yes, removed. > 07. wait_for_primary_slot_catchup > > ``` > + /* > + * XXX: Is waiting for 2 seconds before retrying enough or more or > + * less? 
> + */ > + rc = WaitLatch(MyLatch, > + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, > + 2000L, > + WAIT_EVENT_REPL_SLOTSYNC_PRIMARY_CATCHUP); > + > + ResetLatch(MyLatch); > + > + /* Emergency bailout if postmaster has died */ > + if (rc & WL_POSTMASTER_DEATH) > + proc_exit(1); > ``` > > Is there any reasons not to use WL_EXIT_ON_PM_DEATH event? If not, you can use. I think we should use WL_EXIT_ON_PM_DEATH. Corrected now. > 08. synchronize_slots > > ``` > + SpinLockAcquire(&WalRcv->mutex); > + if (!WalRcv || > + (WalRcv->slotname[0] == '\0') || > + XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) > + { > ... > ``` > > Assuming that WalRcv is still NULL. In this case, does the first SpinLockAcquire() > lead a segmentation fault? It may. Thanks for pointing this out. Modified. > 09. synchronize_slots > > ``` > + elog(DEBUG2, "slot sync worker's query:%s \n", s.data); > ``` > > The query is not dynamical one, so I think no need to print even if the debug > mode. Okay. Removed. > 10. synchronize_one_slot > > IIUC, this function can synchronize slots even if the used plugin on primary is > not installed on the secondary server. If the slot is created by the slotsync > worker, users will recognize it after the server is promoted and the decode is > starting. I felt it is not good specification. Can we detect in the validation > phase? Noted the concern. Let me review more on this. I will revert back. > ~~~~~ > not the source code > > 11. > > I tested the typical case - promoting a publisher from a below diagram. > A physical replication slot "physical" was specified as standby_slot_names. > > ``` > node A (primary) --> node B (secondary) > | > | > node C (subscriber) > ``` > > And after the promoting, below lines were periodically output on logfiles for > node B and C. > > ``` > WARNING: replication slot "physical" specified in parameter "standby_slot_names" does not exist, ignoring > ``` It seems like you have set standby_slot_names on the standby, that is why promoted standby is emitting this warning. It is not recommended to set it on standby as it is the primary GUC. Having said that, I understand that even on primary, we may get this repeated warning if standby_slot_names is not set correctly. This WARNING is intentional, as the user should know that this setting is wrong. So I am not sure if we should suppress this. I would like to know what others think on this. > Do you have idea to suppress the warning? IIUC it is a normal behavior of the > walsender so that we cannot avoid the periodical outputs. > > The steps of the test was as follows: > > 1. stop the node A via pg_ctl stop > 2. promota the node B via pg_ctl promote > 3. change the connection string of the subscription via ALTER SUBSCRIPTION ... CONNECTION ... > thanks Shveta
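For the WL_EXIT_ON_PM_DEATH change mentioned above, a minimal sketch of the simplified retry sleep; the wait-event name is the one used by the patch, not something that exists in core:

```
#include "postgres.h"

#include "miscadmin.h"
#include "storage/latch.h"
#include "utils/wait_event.h"

/*
 * With WL_EXIT_ON_PM_DEATH the latch machinery exits the process itself
 * if the postmaster dies, so no explicit WL_POSTMASTER_DEATH check is
 * needed after the wait.
 */
static void
sleep_before_retry_sketch(void)
{
	(void) WaitLatch(MyLatch,
					 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					 2000L,
					 WAIT_EVENT_REPL_SLOTSYNC_PRIMARY_CATCHUP);
	ResetLatch(MyLatch);

	CHECK_FOR_INTERRUPTS();
}
```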
Hi, On Fri, Dec 22, 2023 at 04:02:21PM +0530, shveta malik wrote: > PFA v53. Changes are: Thanks! > patch002: > 2) Addressed comments in [2] for v52-002. > 3) Fixed CFBot failure. The failure was caused by an assert in > wait_for_primary_slot_catchup() for null confirmed_lsn received. In > wait_for_primary_slot_catchup(), we had an assumption that if > restart_lsn is valid and 'conflicting' is also false, then we must > have non-null confirmed_lsn. But this is not true. It is possible to > get null values for confirmed_lsn and catalog_xmin if on the primary > server the slot is just created with a valid restart_lsn and slot-sync > worker has fetched the slot before the primary server could set valid > confirmed_lsn and catalog_xmin. In > pg_create_logical_replication_slot(), there is a small window between > CreateInitDecodingContext-->ReplicationSlotReserveWal() which sets > restart_lsn and DecodingContextFindStartpoint() which sets > confirmed_lsn. If the slot-sync worker fetches the slot in this > window, confirmed_lsn received will be NULL. Corrected the code to > remove assert and added one additional condition that confirmed_lsn > should be valid before moving the slot to 'r'. > Looking at v53-0002 commit message: It states: " If a logical slot on the primary is valid but is invalidated on the standby, then that slot is dropped and recreated on the standby in next sync-cycle. " and one of the reasons mentioned is: " - The primary changes wal_level to a level lower than logical. " I think that as long as there is still a logical replication slot on the primary, that should not be possible. The primary should fail to start with messages like: " 2023-12-22 14:06:09.281 UTC [31824] FATAL: logical replication slot "logical_slot" exists, but wal_level < logical " Now, if: - The standby is shutdown - All the logical replication slots are removed on the primary - wal_level is set to < logical on the primary and it is restarted Then when the standby starts, the "synced" slots will be invalidated and later removed but not re-created on the next sync-cycle (because they don't exist anymore on the primary). Worth to reword a bit that part? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Dec 22, 2023 at 7:59 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Fri, Dec 22, 2023 at 04:02:21PM +0530, shveta malik wrote: > > PFA v53. Changes are: > > Thanks! > > > patch002: > > 2) Addressed comments in [2] for v52-002. > > 3) Fixed CFBot failure. The failure was caused by an assert in > > wait_for_primary_slot_catchup() for null confirmed_lsn received. In > > wait_for_primary_slot_catchup(), we had an assumption that if > > restart_lsn is valid and 'conflicting' is also false, then we must > > have non-null confirmed_lsn. But this is not true. It is possible to > > get null values for confirmed_lsn and catalog_xmin if on the primary > > server the slot is just created with a valid restart_lsn and slot-sync > > worker has fetched the slot before the primary server could set valid > > confirmed_lsn and catalog_xmin. In > > pg_create_logical_replication_slot(), there is a small window between > > CreateInitDecodingContext-->ReplicationSlotReserveWal() which sets > > restart_lsn and DecodingContextFindStartpoint() which sets > > confirmed_lsn. If the slot-sync worker fetches the slot in this > > window, confirmed_lsn received will be NULL. Corrected the code to > > remove assert and added one additional condition that confirmed_lsn > > should be valid before moving the slot to 'r'. > > > > Looking at v53-0002 commit message: > > It states: > > " > If a logical slot on the primary is valid but is invalidated on the standby, > then that slot is dropped and recreated on the standby in next sync-cycle. > " > > and one of the reasons mentioned is: > > " > - The primary changes wal_level to a level lower than logical. > " > > I think that as long at there is still logical replication slot on the primary > that should not be possible. The primary should fail to start with messages like: > > " > 2023-12-22 14:06:09.281 UTC [31824] FATAL: logical replication slot "logical_slot" exists, but wal_level < logical > " Yes, right. It fails in such a case. > > Now, if: > > - The standby is shutdown > - All the logical replication slots are removed on the primary > - wal_level is set to < logical on the primary and it is restarted > > Then when the standby starts, the "synced" slots will be invalidated and later > removed but not re-created on the next sync-cycle (because they don't exist > anymore on the primary). > > Worth to reword a bit that part? yes, will change these details. Thanks! > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
On Thu, Dec 21, 2023 at 6:37 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > 10. synchronize_one_slot > > IIUC, this function can synchronize slots even if the used plugin on primary is > not installed on the secondary server. If the slot is created by the slotsync > worker, users will recognize it after the server is promoted and the decode is > starting. I felt it is not good specification. Can we detect in the validation > phase? > I think we should be able to detect it if we want but do we want to add this restriction considering that users can always install the required plugins after standby gets promoted? I think we can do either way in this case but as we are not going to use these slots till the standby node is promoted, it seems okay to validate the plugins after promotion once users use the synced slots. -- With Regards, Amit Kapila.
Dear Amit, > I think we should be able to detect it if we want but do we want to > add this restriction considering that users can always install the > required plugins after standby gets promoted? I think we can do either > way in this case but as we are not going to use these slots till the > standby node is promoted, it seems okay to validate the plugins after > promotion once users use the synced slots. Personally it should be detected, but I want to hear opinions from others. Below are my reasons: 1) We can avoid a possibility that users miss the installation of plugins. Basically we should detect before the issue will really occur. 2) Rules around here might be inconsistent. Slots which will be synchronized can be created either way: a) manual creation via SQL function, or b) automatic creation by slotsync worker. In case of a), the decoding context is created when creation so that the plugin must be installed. Case b), however, we allow not to install beforehand. I felt it might be confused for users. Thought? Best Regards, Hayato Kuroda FUJITSU LIMITED
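If early detection were chosen, one conceivable check (not what the current patch does) would be to resolve the output plugin's init symbol the same way LoadOutputPlugin() does; load_external_function() already raises an error on its own if the library cannot be loaded at all. A sketch, with a hypothetical function name:

```
#include "postgres.h"

#include "fmgr.h"
#include "replication/output_plugin.h"

/*
 * Try to resolve the plugin's _PG_output_plugin_init symbol, the same
 * entry point LoadOutputPlugin() uses, and complain if it is missing.
 */
static void
validate_output_plugin_installed(const char *plugin)
{
	LogicalOutputPluginInit plugin_init;

	plugin_init = (LogicalOutputPluginInit)
		load_external_function(plugin, "_PG_output_plugin_init", false, NULL);

	if (plugin_init == NULL)
		ereport(ERROR,
				errmsg("output plugin \"%s\" does not provide _PG_output_plugin_init",
					   plugin));
}
```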
On Tue, Dec 26, 2023 at 3:00 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > I think we should be able to detect it if we want but do we want to > > add this restriction considering that users can always install the > > required plugins after standby gets promoted? I think we can do either > > way in this case but as we are not going to use these slots till the > > standby node is promoted, it seems okay to validate the plugins after > > promotion once users use the synced slots. > > Personally, I think it should be detected, but I want to hear opinions from others. > Below are my reasons: > > 1) > We can avoid the possibility that users miss installing the plugin. Basically, > we should detect the problem before it actually occurs. > > 2) > The rules here might otherwise be inconsistent. Slots which will be synchronized can be > created in either of two ways: > > a) manual creation via SQL function, or > b) automatic creation by the slotsync worker. > > In case a), the decoding context is created at slot creation time, so the plugin > must already be installed. In case b), however, we allow the plugin not to be installed > beforehand. I felt this might be confusing for users. Thoughts? > I think way (a) could lead to the setting of incorrect LSNs (restart_lsn and confirmed_flush_lsn) considering they are not copied from the primary. -- With Regards, Amit Kapila.
On Wednesday, December 20, 2023 7:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 20, 2023 at 3:29 PM shveta malik <shveta.malik@gmail.com> > wrote: > > > > On Wed, Dec 20, 2023 at 9:12 AM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > On Tue, Dec 19, 2023 at 5:30 PM shveta malik <shveta.malik@gmail.com> > wrote: > > > > > > > > Thanks for reviewing. I have addressed these in v50. > > > > > > > > > > I was looking at this patch to see if something smaller could be > > > independently committable. I think we can extract > > > pg_get_slot_invalidation_cause() and commit it as that function > > > could be independently useful as well. What do you think? > > > > > > > Sure, forked another thread [1] > > [1]: > > > https://www.postgresql.org/message-id/CAJpy0uBpr0ym12%2B0mXpjcRFA6 > N%3D > > anX%2BYk9aGU4EJhHNu%3DfWykQ%40mail.gmail.com > > > > Thanks, thinking more, we can split the patch into the following three patches > which can be committed separately (a) Allowing the failover property to be set > for a slot via SQL API and subscription commands > (b) sync slot worker infrastructure (c) GUC standby_slot_names and the the > corresponding wait logic in server-side. > > Thoughts? I agree. Here is the V54 patch set which was split based on the suggestion. The commit message in each patch is also improved. Best Regards, Hou zj
Attachment
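To make (a) above concrete, the failover property would be set roughly as follows under the proposed patches (slot, subscription, and connection values below are placeholders; this is the SQL surface the patch set proposes, not released behaviour):

    -- Proposed SQL API: create a logical slot marked as a failover candidate
    -- (the trailing boolean is the new 'failover' argument added by the patch).
    SELECT pg_create_logical_replication_slot('myslot', 'pgoutput',
                                               false,  -- temporary
                                               false,  -- two_phase
                                               true);  -- failover

    -- Proposed subscription option:
    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=primary.example.com dbname=postgres user=repluser'
        PUBLICATION mypub
        WITH (failover = true);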
On Tue, Dec 26, 2023 at 4:41 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Wednesday, December 20, 2023 7:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Dec 20, 2023 at 3:29 PM shveta malik <shveta.malik@gmail.com> > > wrote: > > > > > > On Wed, Dec 20, 2023 at 9:12 AM Amit Kapila <amit.kapila16@gmail.com> > > wrote: > > > > > > > > On Tue, Dec 19, 2023 at 5:30 PM shveta malik <shveta.malik@gmail.com> > > wrote: > > > > > > > > > > Thanks for reviewing. I have addressed these in v50. > > > > > > > > > > > > > I was looking at this patch to see if something smaller could be > > > > independently committable. I think we can extract > > > > pg_get_slot_invalidation_cause() and commit it as that function > > > > could be independently useful as well. What do you think? > > > > > > > > > > Sure, forked another thread [1] > > > [1]: > > > > > https://www.postgresql.org/message-id/CAJpy0uBpr0ym12%2B0mXpjcRFA6 > > N%3D > > > anX%2BYk9aGU4EJhHNu%3DfWykQ%40mail.gmail.com > > > > > > > Thanks, thinking more, we can split the patch into the following three patches > > which can be committed separately (a) Allowing the failover property to be set > > for a slot via SQL API and subscription commands > > (b) sync slot worker infrastructure (c) GUC standby_slot_names and the the > > corresponding wait logic in server-side. > > > > Thoughts? > > I agree. Here is the V54 patch set which was split based on the suggestion. > The commit message in each patch is also improved. > I would like to revisit the current dependency of slotsync worker on dbname used in 002 patch. Currently we accept dbname in primary_conninfo and thus the user has to make sure to provide one (by manually altering it) even in case of a conf file auto-generated by "pg_basebackup -R". Thus I would like to discuss if there are better ways to do it. Complete background is as follow: We need dbname for 2 purposes: 1) to connect to remote db in order to run SELECT queries to fetch the info needed by slotsync worker. 2) to make connection in slot-sync worker itself in order to be able to use libpq APIs for 1) We run 3 kind of select queries in slot-sync worker currently: a) To fetch all failover slots (logical slots) info at once in synchronize_slots(). b) To fetch a particular slot info during wait_for_primary_slot_catchup() logic (logical slot). c) To validate primary slot (physical one) and also to distinguish between standby and cascading standby by running pg_is_in_recovery(). 1) One approach to avoid dependency on dbname is using commands instead of SELECT. This will need implementing LIST_SLOTS command for a), and for b) we can use LIST_SLOTS and fetch everything (even though it is not needed) or have LIST_SLOTS with a filter on slot-name or extend READ_REPLICATION_SLOT, and for c) we can have some other command to get pg_is_in_recovery() info. But, I feel by relying on commands we will be making the extension of the slot-sync feature difficult. In future, if there is some more requirement to fetch any other info, then there too we have to implement a command. I am not sure if it is good and extensible approach. 2) Another way to avoid asking for a dbname in primary_conninfo is to use the default dbname internally. This brings us to two questions: 'How' and 'Which default db'? 2.1) To answer 'How': Using default dbname is simpler for the purpose of slot-sync worker having its own db-connection, but is a little tricky for the purpose of connection to remote_db. 
This is because we have to inject this dbname internally in our connection-info. 2.1.1) Say we use primary_conninfo (i.e. original one w/o dbname), then currently it could have 2 formats: a) The simple "=" format for key-value pairs, example: 'user=replication host=127.0.0.1 port=5433 dbname=postgres'. b) URI format, example: postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp We can distinguish between the 2 formats using 'uri_prefix_length' but injecting the dbname part will be messy specially for URI format. If we want to do it w/o injecting and only by changing libpq interfaces to accept dbname separately apart from conninfo, then there is no current simpler way available. It will need a good amount of changes in libpq. 2.1.2) Another way is to not rely on primary_conninfo directly but rely on 'WalRcv->conninfo' in order to connect to remote_db. This is because the latter is never URI format, it is some parsed format and appending may work. As an example, primary_conninfo = 'postgresql://replication@localhost:5433', WalRcv->conninfo loaded internally is: "user=replication passfile=/home/shveta/.pgpass channel_binding=prefer dbname=replication host=localhost port=5433 fallback_application_name=walreceiver sslmode=prefer sslcompression=0 sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable", '\000' So we can try appending our default dbname to this. But all the defaults loaded in WalRcv->conninfo need some careful analysis to figure out if they work for slot-sync worker case. 2.2) Now coming to 'Which default db': 2.2.1) If we use 'template1' as default db, it may block 'create db' operations on primary for the time when the slot-sync worker is connected to remote using this dbname. Example: postgres=# create database newdb1; ERROR: source database "template1" is being accessed by other users DETAIL: There is 1 other session using the database. 2.2.2) If we use 'postgres' as default db, there are chances that it can be dropped as unlike 'template1', it is allowed to be dropped by user, and if slotsync worker is connected to it, user may see: newdb1=# drop database postgres; ERROR: database "postgres" is being accessed by other users DETAIL: There is 1 other session using the database. But once the slot-sync worker or standby goes down, user can always drop this and next time slot-sync worker may not be able to come up. ================ As explained, there is no clean approach to avoid dbname dependency and thus making us implement it this way where we ask dbname in primary_conninfo. It will be good to know what others think on this and if there are better ways to do it. thanks Shveta
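For context, this is what the current patch expects the user to do by hand after "pg_basebackup -R", i.e. add a dbname to the generated setting (host and database below are placeholders):

    # postgresql.auto.conf on the standby, edited manually to add dbname:
    primary_conninfo = 'host=primary.example.com port=5432 user=replication dbname=postgres'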
Hi, Thank you for working on this. On Tue, Dec 26, 2023 at 9:27 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Dec 26, 2023 at 4:41 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Wednesday, December 20, 2023 7:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Dec 20, 2023 at 3:29 PM shveta malik <shveta.malik@gmail.com> > > > wrote: > > > > > > > > On Wed, Dec 20, 2023 at 9:12 AM Amit Kapila <amit.kapila16@gmail.com> > > > wrote: > > > > > > > > > > On Tue, Dec 19, 2023 at 5:30 PM shveta malik <shveta.malik@gmail.com> > > > wrote: > > > > > > > > > > > > Thanks for reviewing. I have addressed these in v50. > > > > > > > > > > > > > > > > I was looking at this patch to see if something smaller could be > > > > > independently committable. I think we can extract > > > > > pg_get_slot_invalidation_cause() and commit it as that function > > > > > could be independently useful as well. What do you think? > > > > > > > > > > > > > Sure, forked another thread [1] > > > > [1]: > > > > > > > https://www.postgresql.org/message-id/CAJpy0uBpr0ym12%2B0mXpjcRFA6 > > > N%3D > > > > anX%2BYk9aGU4EJhHNu%3DfWykQ%40mail.gmail.com > > > > > > > > > > Thanks, thinking more, we can split the patch into the following three patches > > > which can be committed separately (a) Allowing the failover property to be set > > > for a slot via SQL API and subscription commands > > > (b) sync slot worker infrastructure (c) GUC standby_slot_names and the the > > > corresponding wait logic in server-side. > > > > > > Thoughts? > > > > I agree. Here is the V54 patch set which was split based on the suggestion. > > The commit message in each patch is also improved. > > > > I would like to revisit the current dependency of slotsync worker on > dbname used in 002 patch. Currently we accept dbname in > primary_conninfo and thus the user has to make sure to provide one (by > manually altering it) even in case of a conf file auto-generated by > "pg_basebackup -R". > Thus I would like to discuss if there are better ways to do it. > Complete background is as follow: > > We need dbname for 2 purposes: > > 1) to connect to remote db in order to run SELECT queries to fetch the > info needed by slotsync worker. > 2) to make connection in slot-sync worker itself in order to be able > to use libpq APIs for 1) > > We run 3 kind of select queries in slot-sync worker currently: > > a) To fetch all failover slots (logical slots) info at once in > synchronize_slots(). > b) To fetch a particular slot info during > wait_for_primary_slot_catchup() logic (logical slot). > c) To validate primary slot (physical one) and also to distinguish > between standby and cascading standby by running pg_is_in_recovery(). > > 1) One approach to avoid dependency on dbname is using commands > instead of SELECT. This will need implementing LIST_SLOTS command for > a), and for b) we can use LIST_SLOTS and fetch everything (even though > it is not needed) or have LIST_SLOTS with a filter on slot-name or > extend READ_REPLICATION_SLOT, and for c) we can have some other > command to get pg_is_in_recovery() info. But, I feel by relying on > commands we will be making the extension of the slot-sync feature > difficult. In future, if there is some more requirement to fetch any > other info, > then there too we have to implement a command. I am not sure if it is > good and extensible approach. > > 2) Another way to avoid asking for a dbname in primary_conninfo is to > use the default dbname internally. 
This brings us to two questions: > 'How' and 'Which default db'? > > 2.1) To answer 'How': > Using default dbname is simpler for the purpose of slot-sync worker > having its own db-connection, but is a little tricky for the purpose > of connection to remote_db. This is because we have to inject this > dbname internally in our connection-info. > > 2.1.1) Say we use primary_conninfo (i.e. original one w/o dbname), > then currently it could have 2 formats: > > a) The simple "=" format for key-value pairs, example: > 'user=replication host=127.0.0.1 port=5433 dbname=postgres'. > b) URI format, example: > postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp > > We can distinguish between the 2 formats using 'uri_prefix_length' but > injecting the dbname part will be messy specially for URI format. If > we want to do it w/o injecting and only by changing libpq interfaces > to accept dbname separately apart from conninfo, then there is no > current simpler way available. It will need a good amount of changes > in libpq. > > 2.1.2) Another way is to not rely on primary_conninfo directly but > rely on 'WalRcv->conninfo' in order to connect to remote_db. This is > because the latter is never URI format, it is some parsed format and > appending may work. As an example, primary_conninfo = > 'postgresql://replication@localhost:5433', WalRcv->conninfo loaded > internally is: > "user=replication passfile=/home/shveta/.pgpass channel_binding=prefer > dbname=replication host=localhost port=5433 > fallback_application_name=walreceiver sslmode=prefer sslcompression=0 > sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2 > gssencmode=disable krbsrvname=postgres gssdelegation=0 > target_session_attrs=any load_balance_hosts=disable", '\000' > > So we can try appending our default dbname to this. But all the > defaults loaded in WalRcv->conninfo need some careful analysis to > figure out if they work for slot-sync worker case. > > 2.2) Now coming to 'Which default db': > > 2.2.1) If we use 'template1' as default db, it may block 'create db' > operations on primary for the time when the slot-sync worker is > connected to remote using this dbname. Example: > > postgres=# create database newdb1; > ERROR: source database "template1" is being accessed by other users > DETAIL: There is 1 other session using the database. > > 2.2.2) If we use 'postgres' as default db, there are chances that it > can be dropped as unlike 'template1', it is allowed to be dropped by > user, and if slotsync worker is connected to it, user may see: > newdb1=# drop database postgres; > ERROR: database "postgres" is being accessed by other users > DETAIL: There is 1 other session using the database. > > But once the slot-sync worker or standby goes down, user can always > drop this and next time slot-sync worker may not be able to come up. > Other random ideas for discussion are: 3) The slotsync worker uses primary_conninfo but also uses a new GUC parameter, say slot_sync_dbname, to specify the database to connect. The slot_sync_dbname overwrites the dbname if primary_conninfo also specifies it. If both don't have a dbname, raise an error. 4) The slotsync worker uses a new GUC parameter, say slot_sync_conninfo, to specify the connection string to the primary aside from primary_conninfo. And pg_basebackup -R generates slot_sync_conninfo as well if required (new option required). BTW given that the slotsync worker executes only normal SQL queries, is there any reason why it uses a replication connection? 
It's slightly odd to me that the pg_stat_replication view shows one entry that remains in the "startup" state. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
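For reference, the effect described here is visible on the primary with a plain query (column names as in existing releases):

    -- On the primary: the slot-sync worker's connection shows up next to the
    -- walreceiver's, with the same application_name but stuck in "startup".
    SELECT pid, application_name, state
    FROM pg_stat_replication;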
On Wed, Dec 27, 2023 at 11:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Hi, > > Thank you for working on this. > > On Tue, Dec 26, 2023 at 9:27 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Tue, Dec 26, 2023 at 4:41 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Wednesday, December 20, 2023 7:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Wed, Dec 20, 2023 at 3:29 PM shveta malik <shveta.malik@gmail.com> > > > > wrote: > > > > > > > > > > On Wed, Dec 20, 2023 at 9:12 AM Amit Kapila <amit.kapila16@gmail.com> > > > > wrote: > > > > > > > > > > > > On Tue, Dec 19, 2023 at 5:30 PM shveta malik <shveta.malik@gmail.com> > > > > wrote: > > > > > > > > > > > > > > Thanks for reviewing. I have addressed these in v50. > > > > > > > > > > > > > > > > > > > I was looking at this patch to see if something smaller could be > > > > > > independently committable. I think we can extract > > > > > > pg_get_slot_invalidation_cause() and commit it as that function > > > > > > could be independently useful as well. What do you think? > > > > > > > > > > > > > > > > Sure, forked another thread [1] > > > > > [1]: > > > > > > > > > https://www.postgresql.org/message-id/CAJpy0uBpr0ym12%2B0mXpjcRFA6 > > > > N%3D > > > > > anX%2BYk9aGU4EJhHNu%3DfWykQ%40mail.gmail.com > > > > > > > > > > > > > Thanks, thinking more, we can split the patch into the following three patches > > > > which can be committed separately (a) Allowing the failover property to be set > > > > for a slot via SQL API and subscription commands > > > > (b) sync slot worker infrastructure (c) GUC standby_slot_names and the the > > > > corresponding wait logic in server-side. > > > > > > > > Thoughts? > > > > > > I agree. Here is the V54 patch set which was split based on the suggestion. > > > The commit message in each patch is also improved. > > > > > > > I would like to revisit the current dependency of slotsync worker on > > dbname used in 002 patch. Currently we accept dbname in > > primary_conninfo and thus the user has to make sure to provide one (by > > manually altering it) even in case of a conf file auto-generated by > > "pg_basebackup -R". > > Thus I would like to discuss if there are better ways to do it. > > Complete background is as follow: > > > > We need dbname for 2 purposes: > > > > 1) to connect to remote db in order to run SELECT queries to fetch the > > info needed by slotsync worker. > > 2) to make connection in slot-sync worker itself in order to be able > > to use libpq APIs for 1) > > > > We run 3 kind of select queries in slot-sync worker currently: > > > > a) To fetch all failover slots (logical slots) info at once in > > synchronize_slots(). > > b) To fetch a particular slot info during > > wait_for_primary_slot_catchup() logic (logical slot). > > c) To validate primary slot (physical one) and also to distinguish > > between standby and cascading standby by running pg_is_in_recovery(). > > > > 1) One approach to avoid dependency on dbname is using commands > > instead of SELECT. This will need implementing LIST_SLOTS command for > > a), and for b) we can use LIST_SLOTS and fetch everything (even though > > it is not needed) or have LIST_SLOTS with a filter on slot-name or > > extend READ_REPLICATION_SLOT, and for c) we can have some other > > command to get pg_is_in_recovery() info. But, I feel by relying on > > commands we will be making the extension of the slot-sync feature > > difficult. 
In future, if there is some more requirement to fetch any > > other info, > > then there too we have to implement a command. I am not sure if it is > > good and extensible approach. > > > > 2) Another way to avoid asking for a dbname in primary_conninfo is to > > use the default dbname internally. This brings us to two questions: > > 'How' and 'Which default db'? > > > > 2.1) To answer 'How': > > Using default dbname is simpler for the purpose of slot-sync worker > > having its own db-connection, but is a little tricky for the purpose > > of connection to remote_db. This is because we have to inject this > > dbname internally in our connection-info. > > > > 2.1.1) Say we use primary_conninfo (i.e. original one w/o dbname), > > then currently it could have 2 formats: > > > > a) The simple "=" format for key-value pairs, example: > > 'user=replication host=127.0.0.1 port=5433 dbname=postgres'. > > b) URI format, example: > > postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp > > > > We can distinguish between the 2 formats using 'uri_prefix_length' but > > injecting the dbname part will be messy specially for URI format. If > > we want to do it w/o injecting and only by changing libpq interfaces > > to accept dbname separately apart from conninfo, then there is no > > current simpler way available. It will need a good amount of changes > > in libpq. > > > > 2.1.2) Another way is to not rely on primary_conninfo directly but > > rely on 'WalRcv->conninfo' in order to connect to remote_db. This is > > because the latter is never URI format, it is some parsed format and > > appending may work. As an example, primary_conninfo = > > 'postgresql://replication@localhost:5433', WalRcv->conninfo loaded > > internally is: > > "user=replication passfile=/home/shveta/.pgpass channel_binding=prefer > > dbname=replication host=localhost port=5433 > > fallback_application_name=walreceiver sslmode=prefer sslcompression=0 > > sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2 > > gssencmode=disable krbsrvname=postgres gssdelegation=0 > > target_session_attrs=any load_balance_hosts=disable", '\000' > > > > So we can try appending our default dbname to this. But all the > > defaults loaded in WalRcv->conninfo need some careful analysis to > > figure out if they work for slot-sync worker case. > > > > 2.2) Now coming to 'Which default db': > > > > 2.2.1) If we use 'template1' as default db, it may block 'create db' > > operations on primary for the time when the slot-sync worker is > > connected to remote using this dbname. Example: > > > > postgres=# create database newdb1; > > ERROR: source database "template1" is being accessed by other users > > DETAIL: There is 1 other session using the database. > > > > 2.2.2) If we use 'postgres' as default db, there are chances that it > > can be dropped as unlike 'template1', it is allowed to be dropped by > > user, and if slotsync worker is connected to it, user may see: > > newdb1=# drop database postgres; > > ERROR: database "postgres" is being accessed by other users > > DETAIL: There is 1 other session using the database. > > > > But once the slot-sync worker or standby goes down, user can always > > drop this and next time slot-sync worker may not be able to come up. > > > > Other random ideas for discussion are: > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC > parameter, say slot_sync_dbname, to specify the database to connect. 
> The slot_sync_dbname overwrites the dbname if primary_conninfo also > specifies it. If both don't have a dbname, raise an error. > > 4) The slotsync worker uses a new GUC parameter, say > slot_sync_conninfo, to specify the connection string to the primary > aside from primary_conninfo. And pg_basebackup -R generates > slot_sync_conninfo as well if required (new option required). > > BTW given that the slotsync worker executes only normal SQL queries, > is there any reason why it uses a replication connection? Thank You for the feedback. Do you mean why are we using libpqwalreceiver.c APIs instead of using libpq directly? I was not aware if there is any way to connect if we want to run SQL queries. I initially tried using 'PQconnectdbParams' but couldn't make it work. Perhaps it is to be used only by front-end and extensions as the header files indicate as well: * libpq-fe.h : This file contains definitions for structures and externs for functions used by frontend postgres applications. * libpq-be-fe-helpers.h: Helper functions for using libpq in extensions . Code built directly into the backend is not allowed to link to libpq directly. Do you mean some other kind of connection here? thanks Shveta
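For anyone trying to follow the code, the walreceiver-API route described above looks roughly like the sketch below. This is not the patch's actual code: the signatures follow the current (PG 16-era) libpqwalreceiver API, the library is assumed to have been loaded with load_file("libpqwalreceiver", false), the usual walreceiver.h/pg_type_d.h includes are assumed, and the application name is a placeholder:

    char       *err = NULL;
    WalReceiverConn *wrconn;
    WalRcvExecResult *res;
    Oid         rettypes[1] = {BOOLOID};

    /* A logical connection requires a dbname in the conninfo -- which is
     * exactly the dependency being discussed in this thread. */
    wrconn = walrcv_connect(PrimaryConnInfo, true /* logical */,
                            false /* must_use_password */,
                            "slotsyncworker", &err);
    if (wrconn == NULL)
        ereport(ERROR,
                (errmsg("could not connect to the primary server: %s", err)));

    res = walrcv_exec(wrconn, "SELECT pg_is_in_recovery()", 1, rettypes);
    /* ... check res->status and read the returned tuplestore here ... */
    walrcv_clear_result(res);
    walrcv_disconnect(wrconn);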
On Wed, Dec 27, 2023 at 11:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Dec 26, 2023 at 9:27 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > I would like to revisit the current dependency of slotsync worker on > > dbname used in 002 patch. Currently we accept dbname in > > primary_conninfo and thus the user has to make sure to provide one (by > > manually altering it) even in case of a conf file auto-generated by > > "pg_basebackup -R". > > Thus I would like to discuss if there are better ways to do it. > > Complete background is as follow: > > > > We need dbname for 2 purposes: > > > > 1) to connect to remote db in order to run SELECT queries to fetch the > > info needed by slotsync worker. > > 2) to make connection in slot-sync worker itself in order to be able > > to use libpq APIs for 1) > > > > We run 3 kind of select queries in slot-sync worker currently: > > > > a) To fetch all failover slots (logical slots) info at once in > > synchronize_slots(). > > b) To fetch a particular slot info during > > wait_for_primary_slot_catchup() logic (logical slot). > > c) To validate primary slot (physical one) and also to distinguish > > between standby and cascading standby by running pg_is_in_recovery(). > > > > 1) One approach to avoid dependency on dbname is using commands > > instead of SELECT. This will need implementing LIST_SLOTS command for > > a), and for b) we can use LIST_SLOTS and fetch everything (even though > > it is not needed) or have LIST_SLOTS with a filter on slot-name or > > extend READ_REPLICATION_SLOT, and for c) we can have some other > > command to get pg_is_in_recovery() info. But, I feel by relying on > > commands we will be making the extension of the slot-sync feature > > difficult. In future, if there is some more requirement to fetch any > > other info, > > then there too we have to implement a command. I am not sure if it is > > good and extensible approach. > > > > 2) Another way to avoid asking for a dbname in primary_conninfo is to > > use the default dbname internally. This brings us to two questions: > > 'How' and 'Which default db'? > > > > 2.1) To answer 'How': > > Using default dbname is simpler for the purpose of slot-sync worker > > having its own db-connection, but is a little tricky for the purpose > > of connection to remote_db. This is because we have to inject this > > dbname internally in our connection-info. > > > > 2.1.1) Say we use primary_conninfo (i.e. original one w/o dbname), > > then currently it could have 2 formats: > > > > a) The simple "=" format for key-value pairs, example: > > 'user=replication host=127.0.0.1 port=5433 dbname=postgres'. > > b) URI format, example: > > postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp > > > > We can distinguish between the 2 formats using 'uri_prefix_length' but > > injecting the dbname part will be messy specially for URI format. If > > we want to do it w/o injecting and only by changing libpq interfaces > > to accept dbname separately apart from conninfo, then there is no > > current simpler way available. It will need a good amount of changes > > in libpq. > > > > 2.1.2) Another way is to not rely on primary_conninfo directly but > > rely on 'WalRcv->conninfo' in order to connect to remote_db. This is > > because the latter is never URI format, it is some parsed format and > > appending may work. 
As an example, primary_conninfo = > > 'postgresql://replication@localhost:5433', WalRcv->conninfo loaded > > internally is: > > "user=replication passfile=/home/shveta/.pgpass channel_binding=prefer > > dbname=replication host=localhost port=5433 > > fallback_application_name=walreceiver sslmode=prefer sslcompression=0 > > sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2 > > gssencmode=disable krbsrvname=postgres gssdelegation=0 > > target_session_attrs=any load_balance_hosts=disable", '\000' > > > > So we can try appending our default dbname to this. But all the > > defaults loaded in WalRcv->conninfo need some careful analysis to > > figure out if they work for slot-sync worker case. > > > > 2.2) Now coming to 'Which default db': > > > > 2.2.1) If we use 'template1' as default db, it may block 'create db' > > operations on primary for the time when the slot-sync worker is > > connected to remote using this dbname. Example: > > > > postgres=# create database newdb1; > > ERROR: source database "template1" is being accessed by other users > > DETAIL: There is 1 other session using the database. > > > > 2.2.2) If we use 'postgres' as default db, there are chances that it > > can be dropped as unlike 'template1', it is allowed to be dropped by > > user, and if slotsync worker is connected to it, user may see: > > newdb1=# drop database postgres; > > ERROR: database "postgres" is being accessed by other users > > DETAIL: There is 1 other session using the database. > > > > But once the slot-sync worker or standby goes down, user can always > > drop this and next time slot-sync worker may not be able to come up. > > > > Other random ideas for discussion are: > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC > parameter, say slot_sync_dbname, to specify the database to connect. > The slot_sync_dbname overwrites the dbname if primary_conninfo also > specifies it. If both don't have a dbname, raise an error. > Would the users prefer to provide a value for a separate GUC instead of changing primary_conninfo? It is possible that we can have some users prefer to use one GUC and others prefer a separate GUC but we should add a new GUC if we are sure that is what users would prefer. Also, even if have to consider this option, I think we can easily later add a new GUC to provide a dbname in addition to having the provision of giving it in primary_conninfo. Also, I think having a separate GUC for dbanme has some complexity in terms of appending the dbname to primary_conninfo as pointed out by Shveta. > 4) The slotsync worker uses a new GUC parameter, say > slot_sync_conninfo, to specify the connection string to the primary > aside from primary_conninfo. And pg_basebackup -R generates > slot_sync_conninfo as well if required (new option required). > Yeah, this is worth considering but won't slot_sync_conninfo be mostly a duplicate of primary_conninfo apart from dbname? I am not sure if the benefit outweighs the disadvantage of having mostly similar information in two GUCs. -- With Regards, Amit Kapila.
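Just to visualize idea (3) above, the configuration would look something like this; the GUC name and values are purely hypothetical, nothing below exists today:

    # Hypothetical, per idea (3): keep primary_conninfo dbname-free and point
    # the sync worker at a database via a separate setting.
    primary_conninfo = 'host=primary.example.com port=5432 user=replication'
    slot_sync_dbname = 'postgres'    # would override any dbname in primary_conninfo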
On Wed, Dec 27, 2023 at 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 27, 2023 at 11:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Dec 26, 2023 at 9:27 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > I would like to revisit the current dependency of slotsync worker on > > > dbname used in 002 patch. Currently we accept dbname in > > > primary_conninfo and thus the user has to make sure to provide one (by > > > manually altering it) even in case of a conf file auto-generated by > > > "pg_basebackup -R". > > > Thus I would like to discuss if there are better ways to do it. > > > Complete background is as follow: > > > > > > We need dbname for 2 purposes: > > > > > > 1) to connect to remote db in order to run SELECT queries to fetch the > > > info needed by slotsync worker. > > > 2) to make connection in slot-sync worker itself in order to be able > > > to use libpq APIs for 1) > > > > > > We run 3 kind of select queries in slot-sync worker currently: > > > > > > a) To fetch all failover slots (logical slots) info at once in > > > synchronize_slots(). > > > b) To fetch a particular slot info during > > > wait_for_primary_slot_catchup() logic (logical slot). > > > c) To validate primary slot (physical one) and also to distinguish > > > between standby and cascading standby by running pg_is_in_recovery(). > > > > > > 1) One approach to avoid dependency on dbname is using commands > > > instead of SELECT. This will need implementing LIST_SLOTS command for > > > a), and for b) we can use LIST_SLOTS and fetch everything (even though > > > it is not needed) or have LIST_SLOTS with a filter on slot-name or > > > extend READ_REPLICATION_SLOT, and for c) we can have some other > > > command to get pg_is_in_recovery() info. But, I feel by relying on > > > commands we will be making the extension of the slot-sync feature > > > difficult. In future, if there is some more requirement to fetch any > > > other info, > > > then there too we have to implement a command. I am not sure if it is > > > good and extensible approach. > > > > > > 2) Another way to avoid asking for a dbname in primary_conninfo is to > > > use the default dbname internally. This brings us to two questions: > > > 'How' and 'Which default db'? > > > > > > 2.1) To answer 'How': > > > Using default dbname is simpler for the purpose of slot-sync worker > > > having its own db-connection, but is a little tricky for the purpose > > > of connection to remote_db. This is because we have to inject this > > > dbname internally in our connection-info. > > > > > > 2.1.1) Say we use primary_conninfo (i.e. original one w/o dbname), > > > then currently it could have 2 formats: > > > > > > a) The simple "=" format for key-value pairs, example: > > > 'user=replication host=127.0.0.1 port=5433 dbname=postgres'. > > > b) URI format, example: > > > postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp > > > > > > We can distinguish between the 2 formats using 'uri_prefix_length' but > > > injecting the dbname part will be messy specially for URI format. If > > > we want to do it w/o injecting and only by changing libpq interfaces > > > to accept dbname separately apart from conninfo, then there is no > > > current simpler way available. It will need a good amount of changes > > > in libpq. > > > > > > 2.1.2) Another way is to not rely on primary_conninfo directly but > > > rely on 'WalRcv->conninfo' in order to connect to remote_db. 
This is > > > because the latter is never URI format, it is some parsed format and > > > appending may work. As an example, primary_conninfo = > > > 'postgresql://replication@localhost:5433', WalRcv->conninfo loaded > > > internally is: > > > "user=replication passfile=/home/shveta/.pgpass channel_binding=prefer > > > dbname=replication host=localhost port=5433 > > > fallback_application_name=walreceiver sslmode=prefer sslcompression=0 > > > sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2 > > > gssencmode=disable krbsrvname=postgres gssdelegation=0 > > > target_session_attrs=any load_balance_hosts=disable", '\000' > > > > > > So we can try appending our default dbname to this. But all the > > > defaults loaded in WalRcv->conninfo need some careful analysis to > > > figure out if they work for slot-sync worker case. > > > > > > 2.2) Now coming to 'Which default db': > > > > > > 2.2.1) If we use 'template1' as default db, it may block 'create db' > > > operations on primary for the time when the slot-sync worker is > > > connected to remote using this dbname. Example: > > > > > > postgres=# create database newdb1; > > > ERROR: source database "template1" is being accessed by other users > > > DETAIL: There is 1 other session using the database. > > > > > > 2.2.2) If we use 'postgres' as default db, there are chances that it > > > can be dropped as unlike 'template1', it is allowed to be dropped by > > > user, and if slotsync worker is connected to it, user may see: > > > newdb1=# drop database postgres; > > > ERROR: database "postgres" is being accessed by other users > > > DETAIL: There is 1 other session using the database. > > > > > > But once the slot-sync worker or standby goes down, user can always > > > drop this and next time slot-sync worker may not be able to come up. > > > > > > > Other random ideas for discussion are: > > > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC > > parameter, say slot_sync_dbname, to specify the database to connect. > > The slot_sync_dbname overwrites the dbname if primary_conninfo also > > specifies it. If both don't have a dbname, raise an error. > > > > Would the users prefer to provide a value for a separate GUC instead > of changing primary_conninfo? It is possible that we can have some > users prefer to use one GUC and others prefer a separate GUC but we > should add a new GUC if we are sure that is what users would prefer. > Also, even if have to consider this option, I think we can easily > later add a new GUC to provide a dbname in addition to having the > provision of giving it in primary_conninfo. > > Also, I think having a separate GUC for dbanme has some complexity in > terms of appending the dbname to primary_conninfo as pointed out by > Shveta. > > > 4) The slotsync worker uses a new GUC parameter, say > > slot_sync_conninfo, to specify the connection string to the primary > > aside from primary_conninfo. And pg_basebackup -R generates > > slot_sync_conninfo as well if required (new option required). > > > > Yeah, this is worth considering but won't slot_sync_conninfo be mostly > a duplicate of primary_conninfo apart from dbname? I am not sure if > the benefit outweighs the disadvantage of having mostly similar > information in two GUCs. > > -- > With Regards, > Amit Kapila. PFA v55. It has fixes for 2 CFBot failures seen on v53 and 1 CFBot failure seen on v54. 
patch002: 1) In a 32-bit environment, a Datum for an int64 is passed by reference (it holds a pointer), and thus the call below leads to a NULL pointer dereference if the concerned attribute is NULL. Corrected it now. DatumGetLSN(slot_getattr(tupslot, 3, &isnull)); 2) During slot creation on the standby, it is possible to get a NULL confirmed_lsn from the primary even for a valid slot with a valid restart_lsn. This may happen when a slot has just been created on the primary with a valid restart_lsn and the slot-sync worker fetched it before the primary could set a valid confirmed_lsn. And thus, along with waiting for the remote_slot's restart_lsn to catch up, we also need to check for a non-null confirmed_lsn of the remote_slot. patch003: 3) Another intermittent failure was due to an unstable test added in 050_standby_failover_slots_sync.pl. It has now been removed. The other tests already provide the coverage which the problematic test was trying to achieve. Thank you, Hou-san, for working on this. thanks Shveta
Attachment
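Regarding fix 1) above, the defensive pattern boils down to checking isnull before interpreting the Datum; a sketch (attribute number 3 is just the position from the example above, not a definitive layout):

    /* On 32-bit platforms an int64 Datum is pass-by-reference, so calling
     * DatumGetLSN() on a NULL attribute would dereference a NULL pointer. */
    bool        isnull;
    Datum       d = slot_getattr(tupslot, 3, &isnull);
    XLogRecPtr  restart_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d);

    if (XLogRecPtrIsInvalid(restart_lsn))
        return;     /* remote slot not ready yet; skip it in this sync cycle */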
On Wed, Dec 27, 2023 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Dec 27, 2023 at 11:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Dec 26, 2023 at 9:27 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > I would like to revisit the current dependency of slotsync worker on > > > dbname used in 002 patch. Currently we accept dbname in > > > primary_conninfo and thus the user has to make sure to provide one (by > > > manually altering it) even in case of a conf file auto-generated by > > > "pg_basebackup -R". > > > Thus I would like to discuss if there are better ways to do it. > > > Complete background is as follow: > > > > > > We need dbname for 2 purposes: > > > > > > 1) to connect to remote db in order to run SELECT queries to fetch the > > > info needed by slotsync worker. > > > 2) to make connection in slot-sync worker itself in order to be able > > > to use libpq APIs for 1) > > > > > > We run 3 kind of select queries in slot-sync worker currently: > > > > > > a) To fetch all failover slots (logical slots) info at once in > > > synchronize_slots(). > > > b) To fetch a particular slot info during > > > wait_for_primary_slot_catchup() logic (logical slot). > > > c) To validate primary slot (physical one) and also to distinguish > > > between standby and cascading standby by running pg_is_in_recovery(). > > > > > > 1) One approach to avoid dependency on dbname is using commands > > > instead of SELECT. This will need implementing LIST_SLOTS command for > > > a), and for b) we can use LIST_SLOTS and fetch everything (even though > > > it is not needed) or have LIST_SLOTS with a filter on slot-name or > > > extend READ_REPLICATION_SLOT, and for c) we can have some other > > > command to get pg_is_in_recovery() info. But, I feel by relying on > > > commands we will be making the extension of the slot-sync feature > > > difficult. In future, if there is some more requirement to fetch any > > > other info, > > > then there too we have to implement a command. I am not sure if it is > > > good and extensible approach. > > > > > > 2) Another way to avoid asking for a dbname in primary_conninfo is to > > > use the default dbname internally. This brings us to two questions: > > > 'How' and 'Which default db'? > > > > > > 2.1) To answer 'How': > > > Using default dbname is simpler for the purpose of slot-sync worker > > > having its own db-connection, but is a little tricky for the purpose > > > of connection to remote_db. This is because we have to inject this > > > dbname internally in our connection-info. > > > > > > 2.1.1) Say we use primary_conninfo (i.e. original one w/o dbname), > > > then currently it could have 2 formats: > > > > > > a) The simple "=" format for key-value pairs, example: > > > 'user=replication host=127.0.0.1 port=5433 dbname=postgres'. > > > b) URI format, example: > > > postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp > > > > > > We can distinguish between the 2 formats using 'uri_prefix_length' but > > > injecting the dbname part will be messy specially for URI format. If > > > we want to do it w/o injecting and only by changing libpq interfaces > > > to accept dbname separately apart from conninfo, then there is no > > > current simpler way available. It will need a good amount of changes > > > in libpq. > > > > > > 2.1.2) Another way is to not rely on primary_conninfo directly but > > > rely on 'WalRcv->conninfo' in order to connect to remote_db. 
This is > > > because the latter is never URI format, it is some parsed format and > > > appending may work. As an example, primary_conninfo = > > > 'postgresql://replication@localhost:5433', WalRcv->conninfo loaded > > > internally is: > > > "user=replication passfile=/home/shveta/.pgpass channel_binding=prefer > > > dbname=replication host=localhost port=5433 > > > fallback_application_name=walreceiver sslmode=prefer sslcompression=0 > > > sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2 > > > gssencmode=disable krbsrvname=postgres gssdelegation=0 > > > target_session_attrs=any load_balance_hosts=disable", '\000' > > > > > > So we can try appending our default dbname to this. But all the > > > defaults loaded in WalRcv->conninfo need some careful analysis to > > > figure out if they work for slot-sync worker case. > > > > > > 2.2) Now coming to 'Which default db': > > > > > > 2.2.1) If we use 'template1' as default db, it may block 'create db' > > > operations on primary for the time when the slot-sync worker is > > > connected to remote using this dbname. Example: > > > > > > postgres=# create database newdb1; > > > ERROR: source database "template1" is being accessed by other users > > > DETAIL: There is 1 other session using the database. > > > > > > 2.2.2) If we use 'postgres' as default db, there are chances that it > > > can be dropped as unlike 'template1', it is allowed to be dropped by > > > user, and if slotsync worker is connected to it, user may see: > > > newdb1=# drop database postgres; > > > ERROR: database "postgres" is being accessed by other users > > > DETAIL: There is 1 other session using the database. > > > > > > But once the slot-sync worker or standby goes down, user can always > > > drop this and next time slot-sync worker may not be able to come up. > > > > > > > Other random ideas for discussion are: > > > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC > > parameter, say slot_sync_dbname, to specify the database to connect. > > The slot_sync_dbname overwrites the dbname if primary_conninfo also > > specifies it. If both don't have a dbname, raise an error. > > > > Would the users prefer to provide a value for a separate GUC instead > of changing primary_conninfo? It is possible that we can have some > users prefer to use one GUC and others prefer a separate GUC but we > should add a new GUC if we are sure that is what users would prefer. > Also, even if have to consider this option, I think we can easily > later add a new GUC to provide a dbname in addition to having the > provision of giving it in primary_conninfo. I think having two separate GUCs is more flexible for example when users want to change the dbname to connect. It makes sense that the slotsync worker wants to use the same connection string as the walreceiver uses. But I guess today most primary_conninfo settings that are set manually or are generated by tools such as pg_basebackup don't have dbname. If we require a dbname in primary_conninfo, many tools will need to be changed. Once the connection string is generated, it would be tricky to change the dbname in it, as Shveta mentioned. The users will have to carefully select the database to connect when taking a base backup. > > Also, I think having a separate GUC for dbanme has some complexity in > terms of appending the dbname to primary_conninfo as pointed out by > Shveta. I think we don't necessarily need to append the dbname to the connection string in order to specify/change the database to connect. 
PQconnectdbParams() overrides the database name to connect if the dbname parameter appears twice in the connection keyword. The documentation[1] says: When expand_dbname is non-zero, the value for the first dbname key word is checked to see if it is a connection string. If so, it is “expanded” into the individual connection parameters extracted from the string. The value is considered to be a connection string, rather than just a database name, if it contains an equal sign (=) or it begins with a URI scheme designator. (More details on connection string formats appear in Section 33.1.1.) Only the first occurrence of dbname is treated in this way; any subsequent dbname parameter is processed as a plain database name. In general the parameter arrays are processed from start to end. If any key word is repeated, the last value (that is not NULL or empty) is used. This rule applies in particular when a key word found in a connection string conflicts with one appearing in the keywords array. Thus, the programmer may determine whether array entries can override or be overridden by values taken from a connection string. Array entries appearing before an expanded dbname entry can be overridden by fields of the connection string, and in turn those fields are overridden by array entries appearing after dbname (but, again, only if those entries supply non-empty values). If the slotsync worker needs to use libpqwalreceiver to connect the primary, we will need to change libpqrcv_connect(). But we have the infrastructure to change the database name to connect without changing the connection string, at least. > > > 4) The slotsync worker uses a new GUC parameter, say > > slot_sync_conninfo, to specify the connection string to the primary > > aside from primary_conninfo. And pg_basebackup -R generates > > slot_sync_conninfo as well if required (new option required). > > > > Yeah, this is worth considering but won't slot_sync_conninfo be mostly > a duplicate of primary_conninfo apart from dbname? I am not sure if > the benefit outweighs the disadvantage of having mostly similar > information in two GUCs. Agreed. Regards, [1] https://www.postgresql.org/docs/devel/libpq-connect.html#LIBPQ-PQCONNECTDBPARAMS -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
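As a concrete illustration of the expand_dbname behaviour quoted above (a frontend-style sketch; in the backend the equivalent change would live in libpqrcv_connect(), which already builds such a keyword/value array; conninfo_from_guc and the chosen database name are placeholders):

    #include <libpq-fe.h>

    /* Expand an existing connection string, then force the dbname: per the
     * docs quoted above, keywords after the expanded dbname override its fields. */
    const char *keys[] = {"dbname", "dbname", NULL};
    const char *vals[] = {conninfo_from_guc,  /* e.g. the value of primary_conninfo */
                          "postgres",         /* overriding database name */
                          NULL};
    PGconn     *conn = PQconnectdbParams(keys, vals, /* expand_dbname = */ 1);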
On Wed, Dec 27, 2023 at 7:13 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Wed, Dec 27, 2023 at 11:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Hi, > > > > Thank you for working on this. > > > > On Tue, Dec 26, 2023 at 9:27 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > On Tue, Dec 26, 2023 at 4:41 PM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > On Wednesday, December 20, 2023 7:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Wed, Dec 20, 2023 at 3:29 PM shveta malik <shveta.malik@gmail.com> > > > > > wrote: > > > > > > > > > > > > On Wed, Dec 20, 2023 at 9:12 AM Amit Kapila <amit.kapila16@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > On Tue, Dec 19, 2023 at 5:30 PM shveta malik <shveta.malik@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > > > Thanks for reviewing. I have addressed these in v50. > > > > > > > > > > > > > > > > > > > > > > I was looking at this patch to see if something smaller could be > > > > > > > independently committable. I think we can extract > > > > > > > pg_get_slot_invalidation_cause() and commit it as that function > > > > > > > could be independently useful as well. What do you think? > > > > > > > > > > > > > > > > > > > Sure, forked another thread [1] > > > > > > [1]: > > > > > > > > > > > https://www.postgresql.org/message-id/CAJpy0uBpr0ym12%2B0mXpjcRFA6 > > > > > N%3D > > > > > > anX%2BYk9aGU4EJhHNu%3DfWykQ%40mail.gmail.com > > > > > > > > > > > > > > > > Thanks, thinking more, we can split the patch into the following three patches > > > > > which can be committed separately (a) Allowing the failover property to be set > > > > > for a slot via SQL API and subscription commands > > > > > (b) sync slot worker infrastructure (c) GUC standby_slot_names and the the > > > > > corresponding wait logic in server-side. > > > > > > > > > > Thoughts? > > > > > > > > I agree. Here is the V54 patch set which was split based on the suggestion. > > > > The commit message in each patch is also improved. > > > > > > > > > > I would like to revisit the current dependency of slotsync worker on > > > dbname used in 002 patch. Currently we accept dbname in > > > primary_conninfo and thus the user has to make sure to provide one (by > > > manually altering it) even in case of a conf file auto-generated by > > > "pg_basebackup -R". > > > Thus I would like to discuss if there are better ways to do it. > > > Complete background is as follow: > > > > > > We need dbname for 2 purposes: > > > > > > 1) to connect to remote db in order to run SELECT queries to fetch the > > > info needed by slotsync worker. > > > 2) to make connection in slot-sync worker itself in order to be able > > > to use libpq APIs for 1) > > > > > > We run 3 kind of select queries in slot-sync worker currently: > > > > > > a) To fetch all failover slots (logical slots) info at once in > > > synchronize_slots(). > > > b) To fetch a particular slot info during > > > wait_for_primary_slot_catchup() logic (logical slot). > > > c) To validate primary slot (physical one) and also to distinguish > > > between standby and cascading standby by running pg_is_in_recovery(). > > > > > > 1) One approach to avoid dependency on dbname is using commands > > > instead of SELECT. 
This will need implementing LIST_SLOTS command for > > > a), and for b) we can use LIST_SLOTS and fetch everything (even though > > > it is not needed) or have LIST_SLOTS with a filter on slot-name or > > > extend READ_REPLICATION_SLOT, and for c) we can have some other > > > command to get pg_is_in_recovery() info. But, I feel by relying on > > > commands we will be making the extension of the slot-sync feature > > > difficult. In future, if there is some more requirement to fetch any > > > other info, > > > then there too we have to implement a command. I am not sure if it is > > > good and extensible approach. > > > > > > 2) Another way to avoid asking for a dbname in primary_conninfo is to > > > use the default dbname internally. This brings us to two questions: > > > 'How' and 'Which default db'? > > > > > > 2.1) To answer 'How': > > > Using default dbname is simpler for the purpose of slot-sync worker > > > having its own db-connection, but is a little tricky for the purpose > > > of connection to remote_db. This is because we have to inject this > > > dbname internally in our connection-info. > > > > > > 2.1.1) Say we use primary_conninfo (i.e. original one w/o dbname), > > > then currently it could have 2 formats: > > > > > > a) The simple "=" format for key-value pairs, example: > > > 'user=replication host=127.0.0.1 port=5433 dbname=postgres'. > > > b) URI format, example: > > > postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp > > > > > > We can distinguish between the 2 formats using 'uri_prefix_length' but > > > injecting the dbname part will be messy specially for URI format. If > > > we want to do it w/o injecting and only by changing libpq interfaces > > > to accept dbname separately apart from conninfo, then there is no > > > current simpler way available. It will need a good amount of changes > > > in libpq. > > > > > > 2.1.2) Another way is to not rely on primary_conninfo directly but > > > rely on 'WalRcv->conninfo' in order to connect to remote_db. This is > > > because the latter is never URI format, it is some parsed format and > > > appending may work. As an example, primary_conninfo = > > > 'postgresql://replication@localhost:5433', WalRcv->conninfo loaded > > > internally is: > > > "user=replication passfile=/home/shveta/.pgpass channel_binding=prefer > > > dbname=replication host=localhost port=5433 > > > fallback_application_name=walreceiver sslmode=prefer sslcompression=0 > > > sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2 > > > gssencmode=disable krbsrvname=postgres gssdelegation=0 > > > target_session_attrs=any load_balance_hosts=disable", '\000' > > > > > > So we can try appending our default dbname to this. But all the > > > defaults loaded in WalRcv->conninfo need some careful analysis to > > > figure out if they work for slot-sync worker case. > > > > > > 2.2) Now coming to 'Which default db': > > > > > > 2.2.1) If we use 'template1' as default db, it may block 'create db' > > > operations on primary for the time when the slot-sync worker is > > > connected to remote using this dbname. Example: > > > > > > postgres=# create database newdb1; > > > ERROR: source database "template1" is being accessed by other users > > > DETAIL: There is 1 other session using the database. 
> > > > > > 2.2.2) If we use 'postgres' as default db, there are chances that it > > > can be dropped as unlike 'template1', it is allowed to be dropped by > > > user, and if slotsync worker is connected to it, user may see: > > > newdb1=# drop database postgres; > > > ERROR: database "postgres" is being accessed by other users > > > DETAIL: There is 1 other session using the database. > > > > > > But once the slot-sync worker or standby goes down, user can always > > > drop this and next time slot-sync worker may not be able to come up. > > > > > > > Other random ideas for discussion are: > > > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC > > parameter, say slot_sync_dbname, to specify the database to connect. > > The slot_sync_dbname overwrites the dbname if primary_conninfo also > > specifies it. If both don't have a dbname, raise an error. > > > > 4) The slotsync worker uses a new GUC parameter, say > > slot_sync_conninfo, to specify the connection string to the primary > > aside from primary_conninfo. And pg_basebackup -R generates > > slot_sync_conninfo as well if required (new option required). > > > > BTW given that the slotsync worker executes only normal SQL queries, > > is there any reason why it uses a replication connection? > > Thank You for the feedback. > Do you mean why are we using libpqwalreceiver.c APIs instead of using > libpq directly? Yes, I meant to use libpq directly, to connect a backend process but not a walsender process. > I was not aware if there is any way to connect if we > want to run SQL queries. I initially tried using 'PQconnectdbParams' > but couldn't make it work. Perhaps it is to be used only by front-end > and extensions as the header files indicate as well: > * libpq-fe.h : This file contains definitions for structures and > externs for functions used by frontend postgres applications. > * libpq-be-fe-helpers.h: Helper functions for using libpq in > extensions . Code built directly into the backend is not allowed to > link to libpq directly. Oh I didn't know that. Thank you for pointing it out. But I'm still concerned it could confuse users that pg_stat_replication keeps showing one entry that remains as "startup" state. It has the same application_name as the walreceiver uses. For example, when users want to check the particular replication connection, it's common to filter the entries by the application name. But it will end up having duplicate entries having different states. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Dec 29, 2023 at 7:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Dec 27, 2023 at 7:13 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Wed, Dec 27, 2023 at 11:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I was not aware if there is any way to connect if we > > want to run SQL queries. I initially tried using 'PQconnectdbParams' > > but couldn't make it work. Perhaps it is to be used only by front-end > > and extensions as the header files indicate as well: > > * libpq-fe.h : This file contains definitions for structures and > > externs for functions used by frontend postgres applications. > > * libpq-be-fe-helpers.h: Helper functions for using libpq in > > extensions . Code built directly into the backend is not allowed to > > link to libpq directly. > > Oh I didn't know that. Thank you for pointing it out. > > But I'm still concerned it could confuse users that > pg_stat_replication keeps showing one entry that remains as "startup" > state. It has the same application_name as the walreceiver uses. For > example, when users want to check the particular replication > connection, it's common to filter the entries by the application name. > But it will end up having duplicate entries having different states. > Valid point. The main reason for using cluster_name is that if multiple standbys connect to the same primary, their slot-sync workers would otherwise all have the same application_name, 'slotsyncworker'. The other alternative could be to use {cluster_name}_slotsyncworker, which will probably address your concern and would also give us a way to differentiate among the slot-sync workers of different standbys. -- With Regards, Amit Kapila.
On Fri, Dec 29, 2023 at 6:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Dec 27, 2023 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC > > > parameter, say slot_sync_dbname, to specify the database to connect. > > > The slot_sync_dbname overwrites the dbname if primary_conninfo also > > > specifies it. If both don't have a dbname, raise an error. > > > > > > > Would the users prefer to provide a value for a separate GUC instead > > of changing primary_conninfo? It is possible that we can have some > > users prefer to use one GUC and others prefer a separate GUC but we > > should add a new GUC if we are sure that is what users would prefer. > > Also, even if have to consider this option, I think we can easily > > later add a new GUC to provide a dbname in addition to having the > > provision of giving it in primary_conninfo. > > I think having two separate GUCs is more flexible for example when > users want to change the dbname to connect. It makes sense that the > slotsync worker wants to use the same connection string as the > walreceiver uses. But I guess today most primary_conninfo settings > that are set manually or are generated by tools such as pg_basebackup > don't have dbname. If we require a dbname in primary_conninfo, many > tools will need to be changed. Once the connection string is > generated, it would be tricky to change the dbname in it, as Shveta > mentioned. The users will have to carefully select the database to > connect when taking a base backup. > I see your point and agree that users need to be careful. I was trying to compare it with other places like the conninfo used with a subscription where no separate dbname needs to be provided. Now, here the situation is not the same because the same conninfo is used for different purposes (walreceiver doesn't require dbname (dbname is ignored even if present) whereas slotsyncworker requires dbname). I was just trying to see if we can avoid having a new GUC for this purpose. Does anyone else have an opinion on this matter? -- With Regards, Amit Kapila.
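For reference, a minimal sketch of what the no-new-GUC approach discussed above would look like on the standby, with the dbname simply included in primary_conninfo (host, port, user, and database are illustrative values only):

  -- On the standby: the walreceiver ignores dbname, while the slot-sync
  -- worker would use it for its SQL connection to the primary.
  ALTER SYSTEM SET primary_conninfo = 'host=primary.example.com port=5432 user=replicator dbname=postgres';
  SELECT pg_reload_conf();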
On Fri, Dec 29, 2023 at 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Dec 29, 2023 at 6:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Dec 27, 2023 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC > > > > parameter, say slot_sync_dbname, to specify the database to connect. > > > > The slot_sync_dbname overwrites the dbname if primary_conninfo also > > > > specifies it. If both don't have a dbname, raise an error. > > > > > > > > > > Would the users prefer to provide a value for a separate GUC instead > > > of changing primary_conninfo? It is possible that we can have some > > > users prefer to use one GUC and others prefer a separate GUC but we > > > should add a new GUC if we are sure that is what users would prefer. > > > Also, even if have to consider this option, I think we can easily > > > later add a new GUC to provide a dbname in addition to having the > > > provision of giving it in primary_conninfo. > > > > I think having two separate GUCs is more flexible for example when > > users want to change the dbname to connect. It makes sense that the > > slotsync worker wants to use the same connection string as the > > walreceiver uses. But I guess today most primary_conninfo settings > > that are set manually or are generated by tools such as pg_basebackup > > don't have dbname. If we require a dbname in primary_conninfo, many > > tools will need to be changed. Once the connection string is > > generated, it would be tricky to change the dbname in it, as Shveta > > mentioned. The users will have to carefully select the database to > > connect when taking a base backup. > > > > I see your point and agree that users need to be careful. I was trying > to compare it with other places like the conninfo used with a > subscription where no separate dbname needs to be provided. Now, here > the situation is not the same because the same conninfo is used for > different purposes (walreceiver doesn't require dbname (dbname is > ignored even if present) whereas slotsyncworker requires dbname). I > was just trying to see if we can avoid having a new GUC for this > purpose. Does anyone else have an opinion on this matter? > > -- > With Regards, > Amit Kapila. Attaching the rebased patches. A recent commit 9a17be1e2 has resulted in conflicts in pg_dump changes. thanks Shveta
Attachment
On Fri, Dec 29, 2023 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Dec 29, 2023 at 7:18 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Dec 27, 2023 at 7:13 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > On Wed, Dec 27, 2023 at 11:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > I was not aware if there is any way to connect if we > > > want to run SQL queries. I initially tried using 'PQconnectdbParams' > > > but couldn't make it work. Perhaps it is to be used only by front-end > > > and extensions as the header files indicate as well: > > > * libpq-fe.h : This file contains definitions for structures and > > > externs for functions used by frontend postgres applications. > > > * libpq-be-fe-helpers.h: Helper functions for using libpq in > > > extensions . Code built directly into the backend is not allowed to > > > link to libpq directly. > > > > Oh I didn't know that. Thank you for pointing it out. > > > > But I'm still concerned it could confuse users that > > pg_stat_replication keeps showing one entry that remains as "startup" > > state. Okay. I understand your concern. I have attached a PoC patch (v55_02-0004) which attempts to implement a non-replication connection in the slotsync worker. By doing so, pg_stat_replication will not show its entry, while pg_stat_activity will still show it with 'state' as either "active" or "idle". Currently, since we are not using any of the replication commands, a non-replication connection suits well. But in the future, if there is a requirement to execute an existing (or new) replication command in the slotsync worker, then that cannot be done over a non-replication connection; it will either need some changes on the non-replication side or will need the replication connection itself. >> It has the same application_name as the walreceiver uses. For > > example, when users want to check the particular replication > > connection, it's common to filter the entries by the application name. > > But it will end up having duplicate entries having different states. > > > > Valid point. The main reason for using cluster_name is that if > multiple standby's connect to the same primary, all will have the same > application_name as 'slotsyncworker'. The other alternative could be > to use {cluster_name}_slotsyncworker, which will probably address your > concern and we can have to provision to differentiate among > slotsyncworkers from different standby's. The topup patch has also changed app_name to {cluster_name}_slotsyncworker so that we do not confuse the walreceiver and slotsyncworker entries. Please note that there is no change in the rest of the patches; the changes are in the additional 0004 patch alone. > -- > With Regards, > Amit Kapila.
Attachment
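A small sketch of how the behavior of the PoC described above could be verified (the application_name pattern assumes the {cluster_name}_slotsyncworker naming proposed in the 0004 patch):

  -- On the primary: with a non-replication connection, the slot-sync worker
  -- should no longer appear in pg_stat_replication ...
  SELECT application_name, state FROM pg_stat_replication;
  -- ... but should still be visible in pg_stat_activity, with state
  -- 'active' or 'idle'.
  SELECT application_name, state
  FROM pg_stat_activity
  WHERE application_name LIKE '%slotsyncworker';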
On Fri, Dec 29, 2023 at 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Dec 29, 2023 at 6:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Dec 27, 2023 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC > > > > parameter, say slot_sync_dbname, to specify the database to connect. > > > > The slot_sync_dbname overwrites the dbname if primary_conninfo also > > > > specifies it. If both don't have a dbname, raise an error. > > > > > > > > > > Would the users prefer to provide a value for a separate GUC instead > > > of changing primary_conninfo? It is possible that we can have some > > > users prefer to use one GUC and others prefer a separate GUC but we > > > should add a new GUC if we are sure that is what users would prefer. > > > Also, even if have to consider this option, I think we can easily > > > later add a new GUC to provide a dbname in addition to having the > > > provision of giving it in primary_conninfo. > > > > I think having two separate GUCs is more flexible for example when > > users want to change the dbname to connect. It makes sense that the > > slotsync worker wants to use the same connection string as the > > walreceiver uses. But I guess today most primary_conninfo settings > > that are set manually or are generated by tools such as pg_basebackup > > don't have dbname. If we require a dbname in primary_conninfo, many > > tools will need to be changed. Once the connection string is > > generated, it would be tricky to change the dbname in it, as Shveta > > mentioned. The users will have to carefully select the database to > > connect when taking a base backup. > > > > I see your point and agree that users need to be careful. I was trying > to compare it with other places like the conninfo used with a > subscription where no separate dbname needs to be provided. Now, here > the situation is not the same because the same conninfo is used for > different purposes (walreceiver doesn't require dbname (dbname is > ignored even if present) whereas slotsyncworker requires dbname). I > was just trying to see if we can avoid having a new GUC for this > purpose. Does anyone else have an opinion on this matter? > Bertrand, Dilip, and others involved in this thread or otherwise, see if you can share an opinion on the above point because it would be good to get some more opinions before we decide to add a new GUC (for dbname) for slotsync worker. -- With Regards, Amit Kapila.
Hi, On Wed, Jan 03, 2024 at 04:20:03PM +0530, Amit Kapila wrote: > On Fri, Dec 29, 2023 at 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Dec 29, 2023 at 6:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Wed, Dec 27, 2023 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC > > > > > parameter, say slot_sync_dbname, to specify the database to connect. > > > > > The slot_sync_dbname overwrites the dbname if primary_conninfo also > > > > > specifies it. If both don't have a dbname, raise an error. > > > > > > > > > > > > > Would the users prefer to provide a value for a separate GUC instead > > > > of changing primary_conninfo? It is possible that we can have some > > > > users prefer to use one GUC and others prefer a separate GUC but we > > > > should add a new GUC if we are sure that is what users would prefer. > > > > Also, even if have to consider this option, I think we can easily > > > > later add a new GUC to provide a dbname in addition to having the > > > > provision of giving it in primary_conninfo. > > > > > > I think having two separate GUCs is more flexible for example when > > > users want to change the dbname to connect. It makes sense that the > > > slotsync worker wants to use the same connection string as the > > > walreceiver uses. But I guess today most primary_conninfo settings > > > that are set manually or are generated by tools such as pg_basebackup > > > don't have dbname. If we require a dbname in primary_conninfo, many > > > tools will need to be changed. Once the connection string is > > > generated, it would be tricky to change the dbname in it, as Shveta > > > mentioned. The users will have to carefully select the database to > > > connect when taking a base backup. > > > > > > > I see your point and agree that users need to be careful. I was trying > > to compare it with other places like the conninfo used with a > > subscription where no separate dbname needs to be provided. Now, here > > the situation is not the same because the same conninfo is used for > > different purposes (walreceiver doesn't require dbname (dbname is > > ignored even if present) whereas slotsyncworker requires dbname). I > > was just trying to see if we can avoid having a new GUC for this > > purpose. Does anyone else have an opinion on this matter? > > > > Bertrand, Dilip, and others involved in this thread or otherwise, see > if you can share an opinion on the above point because it would be > good to get some more opinions before we decide to add a new GUC (for > dbname) for slotsync worker. > I think that as long as enable_syncslot is off then there is no need to add the dbname in primary_conninfo (means there is no need to change an existing primary_conninfo for the ones that don't use the sync slot feature). So given that primary_conninfo does not necessary need to be changed (for ones that don't use the sync slot feature) and that adding a new GUC looks more a one-way door change to me, I'd vote to keep the patch as it is (we can still revisit this later on and add a new GUC if we feel the need based on user's feedback). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tuesday, January 2, 2024 6:32 PM shveta malik <shveta.malik@gmail.com> wrote: > On Fri, Dec 29, 2023 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> > > The topup patch has also changed app_name to > {cluster_name}_slotsyncworker so that we do not confuse between walreceiver > and slotsyncworker entry. > > Please note that there is no change in rest of the patches, changes are in > additional 0004 patch alone. Attached is the V56 patch set, which supports ALTER SUBSCRIPTION SET (failover). This is useful when users want to refresh the publication tables: they can now set the failover option to false and then execute the refresh command. Best Regards, Hou zj
Attachment
On Wed, Jan 3, 2024 at 6:33 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Tuesday, January 2, 2024 6:32 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Dec 29, 2023 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> > > > > The topup patch has also changed app_name to > > {cluster_name}_slotsyncworker so that we do not confuse between walreceiver > > and slotsyncworker entry. > > > > Please note that there is no change in rest of the patches, changes are in > > additional 0004 patch alone. > > Attach the V56 patch set which supports ALTER SUBSCRIPTION SET (failover). > This is useful when user want to refresh the publication tables, they can now alter the > failover option to false and then execute the refresh command. > > Best Regards, > Hou zj The patches no longer apply to HEAD due to a recent commit 007693f. I am working on rebasing and will post the new patches soon thanks Shveta
On Thu, Jan 4, 2024 at 9:18 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Wed, Jan 3, 2024 at 6:33 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Tuesday, January 2, 2024 6:32 PM shveta malik <shveta.malik@gmail.com> wrote: > > > On Fri, Dec 29, 2023 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> > > > > > > The topup patch has also changed app_name to > > > {cluster_name}_slotsyncworker so that we do not confuse between walreceiver > > > and slotsyncworker entry. > > > > > > Please note that there is no change in rest of the patches, changes are in > > > additional 0004 patch alone. > > > > Attach the V56 patch set which supports ALTER SUBSCRIPTION SET (failover). > > This is useful when user want to refresh the publication tables, they can now alter the > > failover option to false and then execute the refresh command. > > > > Best Regards, > > Hou zj > > The patches no longer apply to HEAD due to a recent commit 007693f. I > am working on rebasing and will post the new patches soon > > thanks > Shveta Commit 007693f has changed 'conflicting' to 'conflict_reason', so adjusted the code around that in the slotsync worker. Also removed function 'pg_get_slot_invalidation_cause' as now conflict_reason tells the same. PFA rebased patches with above changes. thanks Shveta
Attachment
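For reference, a quick way to see what the rename means in practice (a sketch; the exact column set depends on the commit and the applied patches):

  -- 'conflict_reason' replaces the old boolean 'conflicting' column and
  -- reports the invalidation cause that pg_get_slot_invalidation_cause()
  -- used to return.
  SELECT slot_name, conflict_reason
  FROM pg_replication_slots
  WHERE slot_type = 'logical';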
On Wed, Jan 3, 2024 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Dec 29, 2023 at 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I see your point and agree that users need to be careful. I was trying > > to compare it with other places like the conninfo used with a > > subscription where no separate dbname needs to be provided. Now, here > > the situation is not the same because the same conninfo is used for > > different purposes (walreceiver doesn't require dbname (dbname is > > ignored even if present) whereas slotsyncworker requires dbname). I > > was just trying to see if we can avoid having a new GUC for this > > purpose. Does anyone else have an opinion on this matter? > > > > Bertrand, Dilip, and others involved in this thread or otherwise, see > if you can share an opinion on the above point because it would be > good to get some more opinions before we decide to add a new GUC (for > dbname) for slotsync worker. IMHO, as of now we can rely on just primary_conninfo and let the user modify it to add the dbname. In the future, if this creates some discomfort or we see complaints about the usage, then we can expand the behavior by providing an additional GUC for the dbname. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 3, 2024 at 4:57 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Wed, Jan 03, 2024 at 04:20:03PM +0530, Amit Kapila wrote: > > On Fri, Dec 29, 2023 at 12:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Fri, Dec 29, 2023 at 6:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Wed, Dec 27, 2023 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > > > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC > > > > > > parameter, say slot_sync_dbname, to specify the database to connect. > > > > > > The slot_sync_dbname overwrites the dbname if primary_conninfo also > > > > > > specifies it. If both don't have a dbname, raise an error. > > > > > > > > > > > > > > > > Would the users prefer to provide a value for a separate GUC instead > > > > > of changing primary_conninfo? It is possible that we can have some > > > > > users prefer to use one GUC and others prefer a separate GUC but we > > > > > should add a new GUC if we are sure that is what users would prefer. > > > > > Also, even if have to consider this option, I think we can easily > > > > > later add a new GUC to provide a dbname in addition to having the > > > > > provision of giving it in primary_conninfo. > > > > > > > > I think having two separate GUCs is more flexible for example when > > > > users want to change the dbname to connect. It makes sense that the > > > > slotsync worker wants to use the same connection string as the > > > > walreceiver uses. But I guess today most primary_conninfo settings > > > > that are set manually or are generated by tools such as pg_basebackup > > > > don't have dbname. If we require a dbname in primary_conninfo, many > > > > tools will need to be changed. Once the connection string is > > > > generated, it would be tricky to change the dbname in it, as Shveta > > > > mentioned. The users will have to carefully select the database to > > > > connect when taking a base backup. > > > > > > > > > > I see your point and agree that users need to be careful. I was trying > > > to compare it with other places like the conninfo used with a > > > subscription where no separate dbname needs to be provided. Now, here > > > the situation is not the same because the same conninfo is used for > > > different purposes (walreceiver doesn't require dbname (dbname is > > > ignored even if present) whereas slotsyncworker requires dbname). I > > > was just trying to see if we can avoid having a new GUC for this > > > purpose. Does anyone else have an opinion on this matter? > > > > > > > Bertrand, Dilip, and others involved in this thread or otherwise, see > > if you can share an opinion on the above point because it would be > > good to get some more opinions before we decide to add a new GUC (for > > dbname) for slotsync worker. > > > > I think that as long as enable_syncslot is off then there is no need to add the > dbname in primary_conninfo (means there is no need to change an existing primary_conninfo > for the ones that don't use the sync slot feature). > > So given that primary_conninfo does not necessary need to be changed (for ones that > don't use the sync slot feature) and that adding a new GUC looks more a one-way door > change to me, I'd vote to keep the patch as it is (we can still revisit this later > on and add a new GUC if we feel the need based on user's feedback). > Okay, thanks for the feedback. 
Dilip also shares the same opinion, so let's wait and see if there is any strong argument to add this new GUC. -- With Regards, Amit Kapila.
Hi, On Thu, Jan 04, 2024 at 10:27:31AM +0530, shveta malik wrote: > On Thu, Jan 4, 2024 at 9:18 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Wed, Jan 3, 2024 at 6:33 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Tuesday, January 2, 2024 6:32 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Fri, Dec 29, 2023 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> > > > > > > > > The topup patch has also changed app_name to > > > > {cluster_name}_slotsyncworker so that we do not confuse between walreceiver > > > > and slotsyncworker entry. > > > > > > > > Please note that there is no change in rest of the patches, changes are in > > > > additional 0004 patch alone. > > > > > > Attach the V56 patch set which supports ALTER SUBSCRIPTION SET (failover). > > > This is useful when user want to refresh the publication tables, they can now alter the > > > failover option to false and then execute the refresh command. > > > > > > Best Regards, > > > Hou zj > > > > The patches no longer apply to HEAD due to a recent commit 007693f. I > > am working on rebasing and will post the new patches soon > > > > thanks > > Shveta > > Commit 007693f has changed 'conflicting' to 'conflict_reason', so > adjusted the code around that in the slotsync worker. > > Also removed function 'pg_get_slot_invalidation_cause' as now > conflict_reason tells the same. > > PFA rebased patches with above changes. > Thanks! Looking at 0004: 1 ==== -libpqrcv_connect(const char *conninfo, bool logical, bool must_use_password, - const char *appname, char **err) +libpqrcv_connect(const char *conninfo, bool replication, bool logical, + bool must_use_password, const char *appname, char **err) What about adjusting the preceding comment a bit to describe what the new replication parameter is for? 2 ==== + /* We can not have logical w/o replication */ what about replacing w/o by without? 3 === + if(!replication) + Assert(!logical); + + if (replication) { what about using "if () else" instead (to avoid unnecessary test)? Having said that the patch seems a reasonable way to implement non-replication connection in slotsync worker. 4 === Looking closer, the only place where walrcv_connect() is called with replication set to false and logical set to false is in ReplSlotSyncWorkerMain(). That does make sense, but what do you think about creating dedicated libpqslotsyncwrkr_connect and slotsyncwrkr_connect (instead of using the libpqrcv_connect / walrcv_connect ones)? That way we could make use of slotsyncwrkr_connect() in ReplSlotSyncWorkerMain() as I think it's confusing to use "rcv" functions while the process using them is not of backend type walreceiver. I'm not sure that worth the extra complexity though, what do you think? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Jan 4, 2024 at 7:24 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Thu, Jan 04, 2024 at 10:27:31AM +0530, shveta malik wrote: > > On Thu, Jan 4, 2024 at 9:18 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > On Wed, Jan 3, 2024 at 6:33 PM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > On Tuesday, January 2, 2024 6:32 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > On Fri, Dec 29, 2023 at 10:25 AM Amit Kapila <amit.kapila16@gmail.com> > > > > > > > > > > The topup patch has also changed app_name to > > > > > {cluster_name}_slotsyncworker so that we do not confuse between walreceiver > > > > > and slotsyncworker entry. > > > > > > > > > > Please note that there is no change in rest of the patches, changes are in > > > > > additional 0004 patch alone. > > > > > > > > Attach the V56 patch set which supports ALTER SUBSCRIPTION SET (failover). > > > > This is useful when user want to refresh the publication tables, they can now alter the > > > > failover option to false and then execute the refresh command. > > > > > > > > Best Regards, > > > > Hou zj > > > > > > The patches no longer apply to HEAD due to a recent commit 007693f. I > > > am working on rebasing and will post the new patches soon > > > > > > thanks > > > Shveta > > > > Commit 007693f has changed 'conflicting' to 'conflict_reason', so > > adjusted the code around that in the slotsync worker. > > > > Also removed function 'pg_get_slot_invalidation_cause' as now > > conflict_reason tells the same. > > > > PFA rebased patches with above changes. > > > > Thanks! > > Looking at 0004: > > 1 ==== > > -libpqrcv_connect(const char *conninfo, bool logical, bool must_use_password, > - const char *appname, char **err) > +libpqrcv_connect(const char *conninfo, bool replication, bool logical, > + bool must_use_password, const char *appname, char **err) > > What about adjusting the preceding comment a bit to describe what the new replication > parameter is for? > > 2 ==== > > + /* We can not have logical w/o replication */ > > what about replacing w/o by without? > > 3 === > > + if(!replication) > + Assert(!logical); > + > + if (replication) > { > > what about using "if () else" instead (to avoid unnecessary test)? > > > Having said that the patch seems a reasonable way to implement non-replication > connection in slotsync worker. > > 4 === > > Looking closer, the only place where walrcv_connect() is called with replication > set to false and logical set to false is in ReplSlotSyncWorkerMain(). > > That does make sense, but what do you think about creating dedicated libpqslotsyncwrkr_connect > and slotsyncwrkr_connect (instead of using the libpqrcv_connect / walrcv_connect ones)? > > That way we could make use of slotsyncwrkr_connect() in ReplSlotSyncWorkerMain() > as I think it's confusing to use "rcv" functions while the process using them is > not of backend type walreceiver. > > I'm not sure that worth the extra complexity though, what do you think? I gave it a thought earlier, but then I was not sure even if I create a new function w/o "rcv" in it then where should it be placed as the existing file name itself is libpq'walreceiver'.c. Shall we be creating a new file then? But it does not seem good to create a new setup (new file, function pointers other stuff) around 1 function. And thus reusing the same function with 'replication' (new arg) felt like a better choice than other options. 
If, in the future, any other module needs to do the same, it can use the current walrcv_connect() with replication=false. If I make it specific to the slot-sync worker, then it will not be reusable by other modules (if needed). thanks Shveta
On Fri, Jan 5, 2024 at 8:59 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Jan 4, 2024 at 7:24 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > 4 === > > > > Looking closer, the only place where walrcv_connect() is called with replication > > set to false and logical set to false is in ReplSlotSyncWorkerMain(). > > > > That does make sense, but what do you think about creating dedicated libpqslotsyncwrkr_connect > > and slotsyncwrkr_connect (instead of using the libpqrcv_connect / walrcv_connect ones)? > > > > That way we could make use of slotsyncwrkr_connect() in ReplSlotSyncWorkerMain() > > as I think it's confusing to use "rcv" functions while the process using them is > > not of backend type walreceiver. > > > > I'm not sure that worth the extra complexity though, what do you think? > > I gave it a thought earlier, but then I was not sure even if I create > a new function w/o "rcv" in it then where should it be placed as the > existing file name itself is libpq'walreceiver'.c. Shall we be > creating a new file then? But it does not seem good to create a new > setup (new file, function pointers other stuff) around 1 function. > And thus reusing the same function with 'replication' (new arg) felt > like a better choice than other options. If in future, there is any > other module trying to do the same, then it can use current > walrcv_connect() with rep=false. If I make it specific to slot-sync > worker, then it will not be reusable by other modules (if needed). > I agree that the benefit of creating a new API is not very clear. How about adjusting the description in the file header of libpqwalreceiver.c. I think apart from walreceiver, it is now also used by logical replication workers and with this patch by the slotsync worker as well. -- With Regards, Amit Kapila.
Hi, On Fri, Jan 05, 2024 at 10:00:53AM +0530, Amit Kapila wrote: > On Fri, Jan 5, 2024 at 8:59 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Thu, Jan 4, 2024 at 7:24 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > 4 === > > > > > > Looking closer, the only place where walrcv_connect() is called with replication > > > set to false and logical set to false is in ReplSlotSyncWorkerMain(). > > > > > > That does make sense, but what do you think about creating dedicated libpqslotsyncwrkr_connect > > > and slotsyncwrkr_connect (instead of using the libpqrcv_connect / walrcv_connect ones)? > > > > > > That way we could make use of slotsyncwrkr_connect() in ReplSlotSyncWorkerMain() > > > as I think it's confusing to use "rcv" functions while the process using them is > > > not of backend type walreceiver. > > > > > > I'm not sure that worth the extra complexity though, what do you think? > > > > I gave it a thought earlier, but then I was not sure even if I create > > a new function w/o "rcv" in it then where should it be placed as the > > existing file name itself is libpq'walreceiver'.c. Shall we be > > creating a new file then? But it does not seem good to create a new > > setup (new file, function pointers other stuff) around 1 function. Yeah... > > And thus reusing the same function with 'replication' (new arg) felt > > like a better choice than other options. If in future, there is any > > other module trying to do the same, then it can use current > > walrcv_connect() with rep=false. If I make it specific to slot-sync > > worker, then it will not be reusable by other modules (if needed). Yeah good point, it would need to be more generic. > I agree that the benefit of creating a new API is not very clear. Yeah, that would be more for cosmetic purpose (and avoid using a WalReceiverConn while a PGconn could/should suffice). > How > about adjusting the description in the file header of > libpqwalreceiver.c. Agree, that seems to be a better option (not sure that building the new API is worth the extra work). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Jan 5, 2024 at 8:59 AM shveta malik <shveta.malik@gmail.com> wrote: > I was going through the patch set again, and I have a question. The below comments say that we keep the failover option as PENDING until we have done the initial table sync, which seems fine. But what happens if we add a new table to the publication and refresh the subscription? In such a case, does this go back to the PENDING state or something else? + * As a result, we enable the failover option for the main slot only after the + * initial sync is complete. The failover option is implemented as a tri-state + * with values DISABLED, PENDING, and ENABLED. The state transition process + * between these values is the same as the two_phase option (see TWO_PHASE + * TRANSACTIONS for details). -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, Jan 5, 2024 at 4:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Jan 5, 2024 at 8:59 AM shveta malik <shveta.malik@gmail.com> wrote: > > > I was going the the patch set again, I have a question. The below > comments say that we keep the failover option as PENDING until we have > done the initial table sync which seems fine. But what happens if we > add a new table to the publication and refresh the subscription? In > such a case does this go back to the PENDING state or something else? > At this stage, such an operation is prohibited. Users need to disable the failover option first, then perform the above operation, and after that failover option can be re-enabled. -- With Regards, Amit Kapila.
On Fri, Jan 5, 2024 at 5:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 5, 2024 at 4:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, Jan 5, 2024 at 8:59 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > I was going the the patch set again, I have a question. The below > > comments say that we keep the failover option as PENDING until we have > > done the initial table sync which seems fine. But what happens if we > > add a new table to the publication and refresh the subscription? In > > such a case does this go back to the PENDING state or something else? > > > > At this stage, such an operation is prohibited. Users need to disable > the failover option first, then perform the above operation, and after > that failover option can be re-enabled. Okay, that makes sense to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
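A sketch of the procedure described above for adding a new table to the publication (the subscription name is hypothetical):

  -- Disable failover, perform the refresh, then re-enable failover.
  ALTER SUBSCRIPTION regress_mysub1 SET (failover = false);
  ALTER SUBSCRIPTION regress_mysub1 REFRESH PUBLICATION;
  ALTER SUBSCRIPTION regress_mysub1 SET (failover = true);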
Here are some review comments for patch v57-0001. ====== doc/src/sgml/protocol.sgml 1. CREATE_REPLICATION_SLOT ... FAILOVER + <varlistentry> + <term><literal>FAILOVER [ <replaceable class="parameter">boolean</replaceable> ]</literal></term> + <listitem> + <para> + If true, the slot is enabled to be synced to the physical + standbys so that logical replication can be resumed after failover. + </para> + </listitem> + </varlistentry> This syntax says passing the boolean value is optional. So the default needs to the specified here in the docs (like what the TWO_PHASE option does). ~~~ 2. ALTER_REPLICATION_SLOT ... FAILOVER + <variablelist> + <varlistentry> + <term><literal>FAILOVER [ <replaceable class="parameter">boolean</replaceable> ]</literal></term> + <listitem> + <para> + If true, the slot is enabled to be synced to the physical + standbys so that logical replication can be resumed after failover. + </para> + </listitem> + </varlistentry> + </variablelist> This syntax says passing the boolean value is optional. So it needs to be specified here in the docs that not passing a value would be the same as passing the value true. ====== doc/src/sgml/ref/alter_subscription.sgml 3. + If <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> + is enabled, you can temporarily disable it in order to execute these commands. /in order to/to/ ~~~ 4. + <para> + When altering the + <link linkend="sql-createsubscription-params-with-slot-name"><literal>slot_name</literal></link>, + the <literal>failover</literal> property of the new slot may differ from the + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> + parameter specified in the subscription. When creating the slot, + ensure the slot failover property matches the + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> + parameter value of the subscription. + </para> 4a. the <literal>failover</literal> property of the new slot may differ Maybe it would be more clear if that said "the failover property value of the named slot...". ~ 4b. In the "failover property matches" part should that failover also be rendered as <literal> like before in the same paragraph? ====== doc/src/sgml/system-views.sgml 5. + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>failover</structfield> <type>bool</type> + </para> + <para> + True if this logical slot is enabled to be synced to the physical + standbys so that logical replication can be resumed from the new primary + after failover. Always false for physical slots. + </para></entry> + </row> /True if this logical slot is enabled.../True if this is a logical slot enabled.../ ====== src/backend/commands/subscriptioncmds.c 6. CreateSubscription + /* + * Even if failover is set, don't create the slot with failover + * enabled. Will enable it once all the tables are synced and + * ready. The intention is that if failover happens at the time of + * table-sync, user should re-launch the subscription instead of + * relying on main slot (if synced) with no table-sync data + * present. When the subscription has no tables, leave failover as + * false to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to + * work. + */ + if (opts.failover && !opts.copy_data && tables != NIL) + failover_enabled = true; AFAICT it might be possible for this to set failover_enabled = true if copy_data is false. 
So failover_enabled would be true when later calling: walrcv_create_slot(wrconn, opts.slot_name, false, twophase_enabled, failover_enabled, CRS_NOEXPORT_SNAPSHOT, NULL); Isn't that contrary to what this comment said: "Even if failover is set, don't create the slot with failover enabled" ~~~ 7. AlterSubscription. case ALTER_SUBSCRIPTION_OPTIONS: + if (!sub->slotname) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot set failover for subscription that does not have slot name"))); /for subscription that does not have slot name/for a subscription that does not have a slot name/ ====== .../libpqwalreceiver/libpqwalreceiver.c 8. + if (PQresultStatus(res) != PGRES_COMMAND_OK) + ereport(ERROR, + (errcode(ERRCODE_PROTOCOL_VIOLATION), + errmsg("could not alter replication slot \"%s\"", + slotname))); This used to display the error message like pchomp(PQerrorMessage(conn->streamConn)) but it was removed. Is it OK? ====== src/backend/replication/logical/tablesync.c 9. + if (MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING) + ereport(LOG, + /* translator: %s is a subscription option */ + (errmsg("logical replication apply worker for subscription \"%s\" will restart so that %s can be enabled", + MySubscription->name, "two_phase"))); + + if (MySubscription->failoverstate == LOGICALREP_FAILOVER_STATE_PENDING) + ereport(LOG, + /* translator: %s is a subscription option */ + (errmsg("logical replication apply worker for subscription \"%s\" will restart so that %s can be enabled", + MySubscription->name, "failover"))); Those errors have multiple %s, so the translator's comment should say "the 2nd %s is a..." ~~~ 10. void -UpdateTwoPhaseState(Oid suboid, char new_state) +EnableTwoPhaseFailoverTriState(Oid suboid, bool enable_twophase, + bool enable_failover) I felt the function name was a bit confusing. Maybe it is simpler to call it like "EnableTriState" or "EnableSubTriState" -- the parameters anyway specify what actual state(s) will be set. ====== src/backend/replication/logical/worker.c 11. + /* Update twophase and/or failover */ + EnableTwoPhaseFailoverTriState(MySubscription->oid, twophase_pending, + failover_pending); + if (twophase_pending) + MySubscription->twophasestate = LOGICALREP_TWOPHASE_STATE_ENABLED; + + if (failover_pending) + MySubscription->failoverstate = LOGICALREP_FAILOVER_STATE_ENABLED; Can't you pass the MySubscription as a parameter and then the EnableTwoPhaseFailoverTriState can also set these LOGICALREP_TWOPHASE_STATE_ENABLED/LOGICALREP_FAILOVER_STATE_ENABLED states within the Enable* function? ====== src/backend/replication/repl_gram.y 12. %token K_CREATE_REPLICATION_SLOT %token K_DROP_REPLICATION_SLOT +%token K_ALTER_REPLICATION_SLOT and + create_replication_slot drop_replication_slot + alter_replication_slot identify_system read_replication_slot + timeline_history show upload_manifest and | create_replication_slot | drop_replication_slot + | alter_replication_slot and | K_CREATE_REPLICATION_SLOT { $$ = "create_replication_slot"; } | K_DROP_REPLICATION_SLOT { $$ = "drop_replication_slot"; } + | K_ALTER_REPLICATION_SLOT { $$ = "alter_replication_slot"; } etc. ~ Although it makes no difference IMO it is more natural to code everything in the order: create, alter, drop. ====== src/backend/replication/repl_scanner.l 13. 
CREATE_REPLICATION_SLOT { return K_CREATE_REPLICATION_SLOT; } DROP_REPLICATION_SLOT { return K_DROP_REPLICATION_SLOT; } +ALTER_REPLICATION_SLOT { return K_ALTER_REPLICATION_SLOT; } and case K_CREATE_REPLICATION_SLOT: case K_DROP_REPLICATION_SLOT: + case K_ALTER_REPLICATION_SLOT: Although it makes no difference IMO it is more natural to code everything in the order: create, alter, drop. ====== src/backend/replication/slot.c 14. + if (SlotIsPhysical(MyReplicationSlot)) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot use %s with a physical replication slot", + "ALTER_REPLICATION_SLOT")); /with a/for a/ ====== src/backend/replication/walsender.c 15. +static void +ParseAlterReplSlotOptions(AlterReplicationSlotCmd *cmd, bool *failover) +{ + ListCell *lc; + bool failover_given = false; + + /* Parse options */ + foreach(lc, cmd->options) + { + DefElem *defel = (DefElem *) lfirst(lc); AFAIK there are some new-style macros now you can use for this code. e.g. foreach_ptr? See [1]. ~~~ 16. + if (strcmp(defel->defname, "failover") == 0) + { + if (failover_given) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("conflicting or redundant options"))); + failover_given = true; + *failover = defGetBoolean(defel); + } The documented syntax showed that passing the boolean value for the FAILOVER option is not mandatory. Does this code work if the boolean value is not passed? ====== src/bin/psql/tab-complete.c 17. I think "ALTER SUBSCRIPTION ... SET (failover)" is possible, but the ALTER SUBSCRIPTION tab completion code is missing. ====== src/include/nodes/replnodes.h 18. +/* ---------------------- + * ALTER_REPLICATION_SLOT command + * ---------------------- + */ +typedef struct AlterReplicationSlotCmd +{ + NodeTag type; + char *slotname; + List *options; +} AlterReplicationSlotCmd; + + Same as an earlier comment. Although it makes no difference IMO it is more natural to define these structs in the order: CreateReplicationSlotCmd, then AlterReplicationSlotCmd, then DropReplicationSlotCmd. ====== .../t/050_standby_failover_slots_sync.pl 19. + +# Copyright (c) 2023, PostgreSQL Global Development Group + /2023/2024/ ~~~ 20. +# Create another subscription (using the same slot created above) that enables +# failover. +$subscriber1->safe_psql( + 'postgres', qq[ + CREATE TABLE tab_int (a int PRIMARY KEY); + CREATE SUBSCRIPTION regress_mysub1 CONNECTION '$publisher_connstr' PUBLICATION regress_mypub WITH (slot_name = lsub1_slot, copy_data=false, failover = true, create_slot = false); The comment should not say "Create another subscription" because this is the first subscription being created. /another/a/ ~~~ 21. +################################################## +# Test if changing the failover property of a subscription updates the +# corresponding failover property of the slot. +################################################## /Test if/Test that/ ====== src/test/regress/sql/subscription.sql 22. +CREATE SUBSCRIPTION regress_testsub CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, failover = true); + +\dRs+ This is currently only testing the explicit "failover=true". Maybe you can also test the other kinds work as expected: - explicit "SET (failover=false)" - explicit "SET (failover)" with no value specified ====== [1] https://github.com/postgres/postgres/commit/14dd0f27d7cd56ffae9ecdbe324965073d01a9ff Kind Regards, Peter Smith. Fujitsu Australia
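A sketch of what the additional cases suggested in comment 22 above might look like, following the style of the existing test (the subscription names are hypothetical):

  -- explicit failover = false
  CREATE SUBSCRIPTION regress_testsub2 CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, failover = false);
  \dRs+
  -- explicit failover with no value specified (defGetBoolean treats it as true)
  CREATE SUBSCRIPTION regress_testsub3 CONNECTION 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, failover);
  \dRs+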
On Monday, January 8, 2024 2:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Jan 5, 2024 at 5:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Jan 5, 2024 at 4:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Fri, Jan 5, 2024 at 8:59 AM shveta malik <shveta.malik@gmail.com> > wrote: > > > > > > > I was going the the patch set again, I have a question. The below > > > comments say that we keep the failover option as PENDING until we > > > have done the initial table sync which seems fine. But what happens > > > if we add a new table to the publication and refresh the > > > subscription? In such a case does this go back to the PENDING state or > something else? > > > > > > > At this stage, such an operation is prohibited. Users need to disable > > the failover option first, then perform the above operation, and after > > that failover option can be re-enabled. > > Okay, that makes sense to me. During the off-list discussion, Sawada-san proposed an idea which can release the restriction for table sync: instead of relying on the latest WAL position, we can utilize the remote restart_lsn to reserve the WAL when creating a new synced slot on the standby. This approach eliminates the need to wait for the primary server to catch up, thus improving the speed of synced slot creation on the standby in most scenarios. By using this approach, the limitation that prevents users from performing table sync while failover is enabled can be removed. In previous versions, this restriction existed because table sync slots were often incompletely synchronized to the standby (the slots on the primary could not catch up with the synced slots). With this approach, the table sync slots can be efficiently synced to the standby in most cases. However, there could still be rare cases where the WAL around the remote restart_lsn has been removed on the standby; in that case we will try to reserve the last remaining WAL and mark the slot as temporary, and these temporary slots will be converted to persistent once the remote restart_lsn catches up. We think this idea is promising, and here is the V58 patch set which tries to implement it. The summary of changes for each patch is as follows: V58-0001 1) Enables failover for table sync slots. 2) Removes the restriction on table sync when failover is enabled. 3) Removes tri-state handling for the failover state. 4) Renames failoverstate to failover. 5) Addresses Peter's comments[1]. V58-0002 1) Adds documentation about how to resume logical replication after failover. 2) Doesn't sync temporary slots from the primary server anymore. 3) Fixes one spinlock miss. 4) Fixes one CFbot warning. 5) Fixes a bug where last_update_time is not initialized. 6) Reserves WAL based on the remote restart_lsn. 7) Improves and adjusts the tests. 8) Removes the separate function wait_for_primary_slot_catchup() and integrates its logic of marking the slot as ready into the main loop. 9) Removes the 'i' state of sync_state. The slots that need to wait for the primary to catch up will be marked as TEMPORARY, and they will be converted to PERSISTENT once the remote restart_lsn catches up. Thanks Shveta for working on 1) to 4). V58-0003 Rebases the tests. V58-0004: Addresses Bertrand's comments[2]. Thanks Shveta for working on this. TODO: Add documentation to guide users on how to identify whether the table sync slots and the main slot are READY, so that logical replication can be resumed by subscribing to the new primary.
[1] https://www.postgresql.org/message-id/CAHut%2BPvbbPz1%3DT4bzY0_GotUK460Eih41Twjt%3DczJ1z2J8SGEw%40mail.gmail.com [2] https://www.postgresql.org/message-id/ZZa4pLFCe2mAks1m%40ip-10-97-1-34.eu-west-3.compute.internal Best Regards, Hou zj
Attachment
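A sketch of how the intermediate state described above could be observed on the standby ('sync_state' is the column used by this patch set and is assumed here; the exact column set may differ across patch versions):

  -- Synced slots that are still waiting for the primary to catch up stay
  -- temporary; once the remote restart_lsn catches up they are persisted.
  SELECT slot_name, temporary, sync_state
  FROM pg_replication_slots;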
On Tuesday, January 9, 2024 9:17 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for patch v57-0001. Thanks for the comments! > > ====== > doc/src/sgml/protocol.sgml > > 1. CREATE_REPLICATION_SLOT ... FAILOVER > > + <varlistentry> > + <term><literal>FAILOVER [ <replaceable > class="parameter">boolean</replaceable> ]</literal></term> > + <listitem> > + <para> > + If true, the slot is enabled to be synced to the physical > + standbys so that logical replication can be resumed after failover. > + </para> > + </listitem> > + </varlistentry> > > This syntax says passing the boolean value is optional. So the default needs to > the specified here in the docs (like what the TWO_PHASE option does). > > ~~~ > > 2. ALTER_REPLICATION_SLOT ... FAILOVER > > + <variablelist> > + <varlistentry> > + <term><literal>FAILOVER [ <replaceable > class="parameter">boolean</replaceable> ]</literal></term> > + <listitem> > + <para> > + If true, the slot is enabled to be synced to the physical > + standbys so that logical replication can be resumed after failover. > + </para> > + </listitem> > + </varlistentry> > + </variablelist> > > This syntax says passing the boolean value is optional. So it needs to be > specified here in the docs that not passing a value would be the same as > passing the value true. The behavior that "not passing a value would be the same as passing the value true " is due to the rule of defGetBoolean(). And all the options of commands in this document behave the same in this case, therefore I think we'd better add document for it in a general place in a separate patch/thread instead of mentioning this in each option's paragraph. > > ====== > doc/src/sgml/ref/alter_subscription.sgml > > 3. > + If <link > linkend="sql-createsubscription-params-with-failover"><literal>failover</lit > eral></link> > + is enabled, you can temporarily disable it in order to execute > these commands. > > /in order to/to/ This part has been removed due to design change. > > ~~~ > > 4. > + <para> > + When altering the > + <link > linkend="sql-createsubscription-params-with-slot-name"><literal>slot_nam > e</literal></link>, > + the <literal>failover</literal> property of the new slot may > differ from the > + <link > linkend="sql-createsubscription-params-with-failover"><literal>failover</lit > eral></link> > + parameter specified in the subscription. When creating the slot, > + ensure the slot failover property matches the > + <link > linkend="sql-createsubscription-params-with-failover"><literal>failover</lit > eral></link> > + parameter value of the subscription. > + </para> > > 4a. > the <literal>failover</literal> property of the new slot may differ > > Maybe it would be more clear if that said "the failover property value of the > named slot...". Changed. > > ~ > > 4b. > In the "failover property matches" part should that failover also be rendered as > <literal> like before in the same paragraph? Added. > > ====== > doc/src/sgml/system-views.sgml > > 5. > + <row> > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>failover</structfield> <type>bool</type> > + </para> > + <para> > + True if this logical slot is enabled to be synced to the physical > + standbys so that logical replication can be resumed from the new > primary > + after failover. Always false for physical slots. > + </para></entry> > + </row> > > /True if this logical slot is enabled.../True if this is a logical slot enabled.../ Changed. 
> > ====== > src/backend/commands/subscriptioncmds.c > > 6. CreateSubscription > > + /* > + * Even if failover is set, don't create the slot with failover > + * enabled. Will enable it once all the tables are synced and > + * ready. The intention is that if failover happens at the time of > + * table-sync, user should re-launch the subscription instead of > + * relying on main slot (if synced) with no table-sync data > + * present. When the subscription has no tables, leave failover as > + * false to allow ALTER SUBSCRIPTION ... REFRESH PUBLICATION to > + * work. > + */ > + if (opts.failover && !opts.copy_data && tables != NIL) > + failover_enabled = true; > > AFAICT it might be possible for this to set failover_enabled = true if copy_data > is false. So failover_enabled would be true when later > calling: > walrcv_create_slot(wrconn, opts.slot_name, false, twophase_enabled, > failover_enabled, CRS_NOEXPORT_SNAPSHOT, NULL); > > Isn't that contrary to what this comment said: "Even if failover is set, don't > create the slot with failover enabled" This part has been removed due to design change. > > ~~~ > > 7. AlterSubscription. case ALTER_SUBSCRIPTION_OPTIONS: > > + if (!sub->slotname) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot set failover for subscription that does not have slot > + name"))); > > /for subscription that does not have slot name/for a subscription that does not > have a slot name/ Changed. > > ====== > .../libpqwalreceiver/libpqwalreceiver.c > > 8. > + if (PQresultStatus(res) != PGRES_COMMAND_OK) ereport(ERROR, > + (errcode(ERRCODE_PROTOCOL_VIOLATION), > + errmsg("could not alter replication slot \"%s\"", slotname))); > > This used to display the error message like > pchomp(PQerrorMessage(conn->streamConn)) but it was removed. Is it OK? I added this back. > > ====== > src/backend/replication/logical/tablesync.c > > 9. > + if (MySubscription->twophasestate == > + LOGICALREP_TWOPHASE_STATE_PENDING) > + ereport(LOG, > + /* translator: %s is a subscription option */ (errmsg("logical > + replication apply worker for subscription \"%s\" > will restart so that %s can be enabled", > + MySubscription->name, "two_phase"))); > + > + if (MySubscription->failoverstate == > + LOGICALREP_FAILOVER_STATE_PENDING) > + ereport(LOG, > + /* translator: %s is a subscription option */ (errmsg("logical > + replication apply worker for subscription \"%s\" > will restart so that %s can be enabled", > + MySubscription->name, "failover"))); > > Those errors have multiple %s, so the translator's comment should say "the > 2nd %s is a..." This part has been removed due to design change. > > ~~~ > > 10. > void > -UpdateTwoPhaseState(Oid suboid, char new_state) > +EnableTwoPhaseFailoverTriState(Oid suboid, bool enable_twophase, > + bool enable_failover) > > I felt the function name was a bit confusing. Maybe it is simpler to call it like > "EnableTriState" or "EnableSubTriState" -- the parameters anyway specify what > actual state(s) will be set. This part has been removed due to design change. > > ====== > src/backend/replication/logical/worker.c > > 11. 
> + /* Update twophase and/or failover */ > + EnableTwoPhaseFailoverTriState(MySubscription->oid, twophase_pending, > + failover_pending); > + if (twophase_pending) > + MySubscription->twophasestate = > LOGICALREP_TWOPHASE_STATE_ENABLED; > + > + if (failover_pending) > + MySubscription->failoverstate = LOGICALREP_FAILOVER_STATE_ENABLED; > > Can't you pass the MySubscription as a parameter and then the > EnableTwoPhaseFailoverTriState can also set these > LOGICALREP_TWOPHASE_STATE_ENABLED/LOGICALREP_FAILOVER_STATE_EN > ABLED > states within the Enable* function? This part has been removed due to design change. > > ====== > src/backend/replication/repl_gram.y > > 12. > %token K_CREATE_REPLICATION_SLOT > %token K_DROP_REPLICATION_SLOT > +%token K_ALTER_REPLICATION_SLOT > > and > > + create_replication_slot drop_replication_slot alter_replication_slot > + identify_system read_replication_slot timeline_history show > + upload_manifest > > and > > | create_replication_slot > | drop_replication_slot > + | alter_replication_slot > > and > > | K_CREATE_REPLICATION_SLOT { $$ = "create_replication_slot"; } > | K_DROP_REPLICATION_SLOT { $$ = "drop_replication_slot"; } > + | K_ALTER_REPLICATION_SLOT { $$ = "alter_replication_slot"; } > > etc. > > ~ > > Although it makes no difference IMO it is more natural to code everything in > the order: create, alter, drop. > > ====== > src/backend/replication/repl_scanner.l > > 13. > CREATE_REPLICATION_SLOT { return K_CREATE_REPLICATION_SLOT; } > DROP_REPLICATION_SLOT { return K_DROP_REPLICATION_SLOT; } > +ALTER_REPLICATION_SLOT { return K_ALTER_REPLICATION_SLOT; } > > and > > case K_CREATE_REPLICATION_SLOT: > case K_DROP_REPLICATION_SLOT: > + case K_ALTER_REPLICATION_SLOT: > > Although it makes no difference IMO it is more natural to code everything in > the order: create, alter, drop. Personally, I am not sure if it looks better, so I didn’t change this. > > ====== > src/backend/replication/slot.c > > 14. > + if (SlotIsPhysical(MyReplicationSlot)) > + ereport(ERROR, > + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot use %s with a physical replication slot", > + "ALTER_REPLICATION_SLOT")); > > /with a/for a/ This is to be consistent with another error message, so I didn’t change this. errmsg("cannot use %s with a logical replication slot", "READ_REPLICATION_SLOT")); > > ====== > src/backend/replication/walsender.c > > 15. > +static void > +ParseAlterReplSlotOptions(AlterReplicationSlotCmd *cmd, bool *failover) > +{ > + ListCell *lc; > + bool failover_given = false; > + > + /* Parse options */ > + foreach(lc, cmd->options) > + { > + DefElem *defel = (DefElem *) lfirst(lc); > > AFAIK there are some new-style macros now you can use for this code. > e.g. foreach_ptr? See [1]. Changed. > > ~~~ > > 16. > + if (strcmp(defel->defname, "failover") == 0) { if (failover_given) > + ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), errmsg("conflicting or > + redundant options"))); failover_given = true; *failover = > + defGetBoolean(defel); } > > The documented syntax showed that passing the boolean value for the > FAILOVER option is not mandatory. Does this code work if the boolean value is > not passed? It works, defGetBoolean will handle this case. > > ====== > src/bin/psql/tab-complete.c > > 17. > I think "ALTER SUBSCRIPTION ... SET (failover)" is possible, but the ALTER > SUBSCRIPTION tab completion code is missing. Added. > ====== > .../t/050_standby_failover_slots_sync.pl > > 19. 
> + > +# Copyright (c) 2023, PostgreSQL Global Development Group > + > > /2023/2024/ Changed. > > ~~~ > > 20. > +# Create another subscription (using the same slot created above) that > +enables # failover. > +$subscriber1->safe_psql( > + 'postgres', qq[ > + CREATE TABLE tab_int (a int PRIMARY KEY); CREATE SUBSCRIPTION > +regress_mysub1 CONNECTION '$publisher_connstr' > PUBLICATION regress_mypub WITH (slot_name = lsub1_slot, copy_data=false, > failover = true, create_slot = false); > > The comment should not say "Create another subscription" because this is the > first subscription being created. > > /another/a/ Changed. > > ~~~ > > 21. > +################################################## > +# Test if changing the failover property of a subscription updates the > +# corresponding failover property of the slot. > +################################################## > > /Test if/Test that/ Changed. > > ====== > src/test/regress/sql/subscription.sql > > 22. > +CREATE SUBSCRIPTION regress_testsub CONNECTION > 'dbname=regress_doesnotexist' PUBLICATION testpub WITH (connect = false, > failover = true); > + > +\dRs+ > > This is currently only testing the explicit "failover=true". > > Maybe you can also test the other kinds work as expected: > - explicit "SET (failover=false)" > - explicit "SET (failover)" with no value specified I think these tests don't add enough value to catch future bugs, so I prefer not to add these. Best Regards, Hou zj
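To illustrate the new-style list macro referred to in comment 15 above, here is a minimal sketch of the options loop written with foreach_ptr from pg_list.h. The function and variable names mirror the quoted hunk but are reproduced approximately, not taken verbatim from the patch:

static void
ParseAlterReplSlotOptions(AlterReplicationSlotCmd *cmd, bool *failover)
{
    bool        failover_given = false;

    /* foreach_ptr declares "defel" and casts each list cell to DefElem * */
    foreach_ptr(DefElem, defel, cmd->options)
    {
        if (strcmp(defel->defname, "failover") == 0)
        {
            if (failover_given)
                ereport(ERROR,
                        (errcode(ERRCODE_SYNTAX_ERROR),
                         errmsg("conflicting or redundant options")));
            failover_given = true;

            /*
             * defGetBoolean() treats a bare option (no value given) as
             * true, which is also why comment 16's case works.
             */
            *failover = defGetBoolean(defel);
        }
    }
}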
On Tue, Jan 9, 2024 at 5:44 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > V58-0002 > +static bool +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) { ... + /* Slot ready for sync, so sync it. */ + else + { + /* + * Sanity check: With hot_standby_feedback enabled and + * invalidations handled appropriately as above, this should never + * happen. + */ + if (remote_slot->restart_lsn < slot->data.restart_lsn) + elog(ERROR, + "cannot synchronize local slot \"%s\" LSN(%X/%X)" + " to remote slot's LSN(%X/%X) as synchronization" + " would move it backwards", remote_slot->name, + LSN_FORMAT_ARGS(slot->data.restart_lsn), + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); ... } I was thinking about the above code in the patch and as far as I can think this can only occur if the same name slot is re-created with prior restart_lsn after the existing slot is dropped. Normally, the newly created slot (with the same name) will have higher restart_lsn but one can mimic it by copying some older slot by using pg_copy_logical_replication_slot(). I don't think as mentioned in comments even if hot_standby_feedback is temporarily set to off, the above shouldn't happen. It can only lead to invalidated slots on standby. To close the above race, I could think of the following ways: 1. Drop and re-create the slot. 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves ahead of local_slot's LSN then we can update it; but as mentioned in your previous comment, we need to update all other fields as well. If we follow this then we probably need to have a check for catalog_xmin as well. Now, related to this the other case which needs some handling is what if the remote_slot's restart_lsn is greater than local_slot's restart_lsn but it is a re-created slot with the same name. In that case, I think the other properties like 'two_phase', 'plugin' could be different. So, is simply copying those sufficient or do we need to do something else as well? -- With Regards, Amit Kapila.
Here are some review comments for the patch v58-0001 ====== doc/src/sgml/catalogs.sgml 1. + <para> + If true, the associated replication slots (i.e. the main slot and the + table sync slots) in the upstream database are enabled to be + synchronized to the physical standbys. + </para></entry> It seems the other single-sentence descriptions on this page have no period (.) so for consistency maybe you should remove it here also. ====== src/backend/commands/subscriptioncmds.c 2. AlterSubscription + /* + * Do not allow changing the failover state if the + * subscription is enabled. This is because the failover + * state of the slot on the publisher cannot be modified if + * the slot is currently being acquired by the apply + * worker. + */ /being acquired/acquired/ ~~~ 3. values[Anum_pg_subscription_subfailover - 1] = BoolGetDatum(opts.failover); replaces[Anum_pg_subscription_subfailover - 1] = true; /* * The failover state of the slot should be changed after * the catalog update is completed. */ set_failover = true; AFAICT you don't need to introduce a new variable 'set_failover'. Instead, you can test like: BEFORE if (set_failover) AFTER if (replaces[Anum_pg_subscription_subfailover - 1]) ====== src/backend/replication/logical/tablesync.c 4. walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ , false /* two_phase */ , + MySubscription->failover /* failover */ , CRS_USE_SNAPSHOT, origin_startpos); The "/* failover */ comment is unnecessary now that you pass the boolean field with the same descriptive name. ====== src/include/catalog/pg_subscription.h 5. CATALOG + bool subfailover; /* True if the associated replication slots + * (i.e. the main slot and the table sync + * slots) in the upstream database are enabled + * the upstream database are enabled to be + * synchronized to the physical standbys. */ + The wording of the comment is broken (it says "are enabled" 2x). SUGGESTION True if the associated replication slots (i.e. the main slot and the table sync slots) in the upstream database are enabled to be synchronized to the physical standbys. ~~~ 6. Subscription + bool failover; /* Indicates if the associated replication + * slots (i.e. the main slot and the table sync + * slots) in the upstream database are enabled + * to be synchronized to the physical + * standbys. */ This comment can say "True if...", so it will be the same as the earlier CATALOG comment for 'subfailover'. ====== Kind Regards, Peter Smith. Fujitsu Australia.
On Tue, Jan 9, 2024 at 5:44 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > comments on 0002 1. +/* Worker's nap time in case of regular activity on the primary server */ +#define WORKER_DEFAULT_NAPTIME_MS 10L /* 10 ms */ + +/* Worker's nap time in case of no-activity on the primary server */ +#define WORKER_INACTIVITY_NAPTIME_MS 10000L /* 10 sec */ Instead of directly switching between 10ms to 10s shouldn't we increase the nap time gradually? I mean it can go beyond 10 sec as well but instead of directly switching from 10ms to 10 sec we can increase it every time with some multiplier and keep a max limit up to which it can grow. Although we can reset back to 10ms directly as soon as we observe some activity. 2. SlotSyncWorkerCtxStruct add this to typedefs. list file 3. +/* + * Update local slot metadata as per remote_slot's positions + */ +static void +local_slot_update(RemoteSlot *remote_slot) +{ + Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE); + + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn); + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn, + remote_slot->catalog_xmin); + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn, + remote_slot->restart_lsn); +} IIUC on the standby we just want to overwrite what we get from primary no? If so why we are using those APIs that are meant for the actual decoding slots where it needs to take certain logical decisions instead of mere overwriting? 4. +/* + * Helper function for drop_obsolete_slots() + * + * Drops synced slot identified by the passed in name. + */ +static void +drop_synced_slots_internal(const char *name, bool nowait) Suggestion to add one line to explain no wait in the header 5. +/* + * Helper function to check if local_slot is present in remote_slots list. + * + * It also checks if logical slot is locally invalidated i.e. invalidated on + * the standby but valid on the primary server. If found so, it sets + * locally_invalidated to true. + */ Instead of saying "but valid on the primary server" better to mention it in the remote_slots list, because here this function is just checking the remote_slots list regardless of whether the list came from. Mentioning primary seems like it might fetch directly from the primary in this function so this is a bit confusing. 6. +/* + * Check that all necessary GUCs for slot synchronization are set + * appropriately. If not, raise an ERROR. + */ +static void +validate_slotsync_parameters(char **dbname) The function name just says 'validate_slotsync_parameters' but it also gets the dbname so I think it better we change the name accordingly also instead of passing dbname as a parameter just return it directly. There is no need to pass this extra parameter and make the function return void. 7. + tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); + tuple_ok = tuplestore_gettupleslot(res->tuplestore, true, false, tupslot); + Assert(tuple_ok); /* It must return one tuple */ Comments say 'It must return one tuple' but asserting just for at least one tuple shouldn't we enhance assert so that it checks that we got exactly one tuple? 8. /* No need to check further, return that we are cascading standby */ + *am_cascading_standby = true; we are not returning immediately we are just setting am_cascading_standby to true so adjust comments accordingly 9. + /* No need to check further, return that we are cascading standby */ + *am_cascading_standby = true; + } + else + { + /* We are a normal standby. 
*/ Single-line comments do not follow the uniform pattern for the full stop, either use a full stop for all single-line comments or none, at least follow the same rule in a file or nearby comments. 10. + errmsg("exiting from slot synchronization due to bad configuration"), + errhint("%s must be defined.", "primary_slot_name")); Why we are using the constant string "primary_slot_name" as a variable in this error formatting? 11. + /* + * Hot_standby_feedback must be enabled to cooperate with the physical + * replication slot, which allows informing the primary about the xmin and + * catalog_xmin values on the standby. I do not like capitalizing the first letter of the 'hot_standby_feedback' which is a GUC parameter -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
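As a concrete reading of comment 1 above, a back-off along these lines would double the nap time while the primary is idle and snap back to the minimum as soon as a slot is updated. This is a sketch only: the helper name wait_for_slot_activity() follows later patch versions, but the constants and the wait-event argument are placeholders, not the patch's final code:

/* Sketch: bounds for the sync worker's nap time (values are illustrative) */
#define MIN_SLOTSYNC_NAPTIME_MS     10L         /* 10 ms */
#define MAX_SLOTSYNC_NAPTIME_MS     10000L      /* 10 s */

static long sleep_ms = MIN_SLOTSYNC_NAPTIME_MS;

static void
wait_for_slot_activity(bool some_slot_updated)
{
    if (some_slot_updated)
        sleep_ms = MIN_SLOTSYNC_NAPTIME_MS;     /* activity seen: be responsive again */
    else
        sleep_ms = Min(sleep_ms * 2, MAX_SLOTSYNC_NAPTIME_MS); /* idle: back off gradually */

    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     sleep_ms,
                     0 /* wait-event id omitted in this sketch */);
    ResetLatch(MyLatch);
}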
On Wednesday, January 10, 2024 2:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Jan 9, 2024 at 5:44 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > comments on 0002 Thanks for the comments ! > > 1. > +/* Worker's nap time in case of regular activity on the primary server */ > +#define WORKER_DEFAULT_NAPTIME_MS 10L /* 10 ms */ > + > +/* Worker's nap time in case of no-activity on the primary server */ > +#define WORKER_INACTIVITY_NAPTIME_MS 10000L /* 10 sec > */ > > Instead of directly switching between 10ms to 10s shouldn't we increase the > nap time gradually? I mean it can go beyond 10 sec as well but instead of > directly switching from 10ms to 10 sec we can increase it every time with some > multiplier and keep a max limit up to which it can grow. Although we can reset > back to 10ms directly as soon as we observe some activity. Agreed. I changed the strategy similar to what we do in the walsummarizer. > > 2. > SlotSyncWorkerCtxStruct add this to typedefs. list file > > 3. > +/* > + * Update local slot metadata as per remote_slot's positions */ static > +void local_slot_update(RemoteSlot *remote_slot) { > +Assert(MyReplicationSlot->data.invalidated == RS_INVAL_NONE); > + > + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn); > + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn, > + remote_slot->catalog_xmin); > + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn, > + remote_slot->restart_lsn); > +} > > IIUC on the standby we just want to overwrite what we get from primary no? If > so why we are using those APIs that are meant for the actual decoding slots > where it needs to take certain logical decisions instead of mere overwriting? I think we don't have a strong reason to use these APIs, but it was convenient to use these APIs as they can take care of updating the slots info and will call functions like, ReplicationSlotsComputeRequiredXmin, ReplicationSlotsComputeRequiredLSN internally. Or do you prefer directly overwriting the fields and call these manually ? > > 4. > +/* > + * Helper function for drop_obsolete_slots() > + * > + * Drops synced slot identified by the passed in name. > + */ > +static void > +drop_synced_slots_internal(const char *name, bool nowait) > > Suggestion to add one line to explain no wait in the header The 'nowait' flag is not necessary now, so removed. > > 5. > +/* > + * Helper function to check if local_slot is present in remote_slots list. > + * > + * It also checks if logical slot is locally invalidated i.e. > +invalidated on > + * the standby but valid on the primary server. If found so, it sets > + * locally_invalidated to true. > + */ > > Instead of saying "but valid on the primary server" better to mention it in the > remote_slots list, because here this function is just checking the remote_slots > list regardless of whether the list came from. Mentioning primary seems like it > might fetch directly from the primary in this function so this is a bit confusing. Adjusted. > > 6. > +/* > + * Check that all necessary GUCs for slot synchronization are set > + * appropriately. If not, raise an ERROR. > + */ > +static void > +validate_slotsync_parameters(char **dbname) > > > The function name just says 'validate_slotsync_parameters' but it also gets the > dbname so I think it better we change the name accordingly also instead of > passing dbname as a parameter just return it directly. > There > is no need to pass this extra parameter and make the function return void. Renamed. > > 7. 
> + tupslot = MakeSingleTupleTableSlot(res->tupledesc, > + &TTSOpsMinimalTuple); tuple_ok = > + tuplestore_gettupleslot(res->tuplestore, true, false, tupslot); > + Assert(tuple_ok); /* It must return one tuple */ > > Comments say 'It must return one tuple' but asserting just for at least one tuple > shouldn't we enhance assert so that it checks that we got exactly one tuple? Changed to use tuplestore_tuple_count. > > 8. > /* No need to check further, return that we are cascading standby */ > + *am_cascading_standby = true; > > we are not returning immediately we are just setting am_cascading_standby to > true so adjust comments accordingly Adjusted. > > 9. > + /* No need to check further, return that we are cascading standby */ > + *am_cascading_standby = true; } else { > + /* We are a normal standby. */ > > Single-line comments do not follow the uniform pattern for the full stop, either > use a full stop for all single-line comments or none, at least follow the same rule > in a file or nearby comments. Adjusted. > > 10. > + errmsg("exiting from slot synchronization due to bad configuration"), > + errhint("%s must be defined.", "primary_slot_name")); > > Why we are using the constant string "primary_slot_name" as a variable in this > error formatting? It was suggested to make it friendly to the translator, as the GUC doesn't needs to be translated and it can avoid adding multiple similar message to be translated. > > 11. > + /* > + * Hot_standby_feedback must be enabled to cooperate with the physical > + * replication slot, which allows informing the primary about the xmin > + and > + * catalog_xmin values on the standby. > > I do not like capitalizing the first letter of the 'hot_standby_feedback' which is a > GUC parameter Changed. Here is the V59 patch set which addressed above comments and comments from Peter[1]. [1] https://www.postgresql.org/message-id/CAHut%2BPu34_dYj9MnV6n3cPsssEx57YaO6Pg0d9mDryQZX2Mx3g%40mail.gmail.com Best Regards, Hou zj
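For comment 7 above, the stricter check can be expressed with tuplestore_tuple_count(); a small sketch under the same assumptions as the quoted hunk (res being the walrcv_exec() result):

/* The remote query is expected to return exactly one tuple */
Assert(tuplestore_tuple_count(res->tuplestore) == 1);

tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple);
tuple_ok = tuplestore_gettupleslot(res->tuplestore, true, false, tupslot);
Assert(tuple_ok);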

Here are some review comments for patch v58-0002 (FYI - I quickly checked with the latest v59-0002 and AFAIK all these review comments below are still relevant) ====== Commit message 1. If a logical slot is invalidated on the primary, slot on the standby is also invalidated. ~ /slot on the standby/then that slot on the standby/ ====== doc/src/sgml/logicaldecoding.sgml 2. In order to resume logical replication after failover from the synced logical slots, it is required that 'conninfo' in subscriptions are altered to point to the new primary server using ALTER SUBSCRIPTION ... CONNECTION. It is recommended that subscriptions are first disabled before promoting the standby and are enabled back once these are altered as above after failover. ~ Minor rewording mainly to reduce a long sentence. SUGGESTION To resume logical replication after failover from the synced logical slots, the subscription's 'conninfo' must be altered to point to the new primary server. This is done using ALTER SUBSCRIPTION ... CONNECTION. It is recommended that subscriptions are first disabled before promoting the standby and are enabled back after altering the connection string. ====== doc/src/sgml/system-views.sgml 3. + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>synced</structfield> <type>bool</type> + </para> + <para> + True if this logical slot was synced from a primary server. + </para> + <para> SUGGESTION True if this is a logical slot that was synced from a primary server. ====== src/backend/access/transam/xlogrecovery.c 4. + /* + * Shutdown the slot sync workers to prevent potential conflicts between + * user processes and slotsync workers after a promotion. + * + * We do not update the 'synced' column from true to false here, as any + * failed update could leave some slot's 'synced' column as false. This + * could cause issues during slot sync after restarting the server as a + * standby. While updating after switching to the new timeline is an + * option, it does not simplify the handling for 'synced' column. + * Therefore, we retain the 'synced' column as true after promotion as they + * can provide useful information about their origin. + */ Minor comment wording changes. BEFORE ...any failed update could leave some slot's 'synced' column as false. SUGGESTION ...any failed update could leave 'synced' column false for some slots. ~ BEFORE Therefore, we retain the 'synced' column as true after promotion as they can provide useful information about their origin. SUGGESTION Therefore, we retain the 'synced' column as true after promotion as it may provide useful information about the slot origin. ====== src/backend/replication/logical/slotsync.c 5. + * While creating the slot on physical standby, if the local restart_lsn and/or + * local catalog_xmin is ahead of those on the remote then the worker cannot + * create the local slot in sync with the primary server because that would + * mean moving the local slot backwards and the standby might not have WALs + * retained for old LSN. In this case, the worker will mark the slot as + * RS_TEMPORARY. Once the primary server catches up, it will move the slot to + * RS_PERSISTENT and will perform the sync periodically. /will move the slot to RS_PERSISTENT/will mark the slot as RS_PERSISTENT/ ~~~ 6. drop_synced_slots_internal +/* + * Helper function for drop_obsolete_slots() + * + * Drops synced slot identified by the passed in name. 
+ */ +static void +drop_synced_slots_internal(const char *name, bool nowait) +{ + Assert(MyReplicationSlot == NULL); + + ReplicationSlotAcquire(name, nowait); + + Assert(MyReplicationSlot->data.synced); + + ReplicationSlotDropAcquired(); +} IMO you don't need this function. AFAICT it is only called from one place and does not result in fewer lines of code. ~~~ 7. get_local_synced_slots + /* Check if it is logical synchronized slot */ + if (s->in_use && SlotIsLogical(s) && s->data.synced) + { + local_slots = lappend(local_slots, s); + } Do you need to check SlotIsLogical(s) here? I thought s->data.synced can never be true for physical slots. I felt you could write this like blelow: if (s->in_use s->data.synced) { Assert(SlotIsLogical(s)); local_slots = lappend(local_slots, s); } ~~~ 8. check_sync_slot_on_remote +static bool +check_sync_slot_on_remote(ReplicationSlot *local_slot, List *remote_slots, + bool *locally_invalidated) +{ + ListCell *lc; + + foreach(lc, remote_slots) + { + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc); I think you can use the new style foreach_ptr list macros here. ~~~ 9. drop_obsolete_slots +drop_obsolete_slots(List *remote_slot_list) +{ + List *local_slots = NIL; + ListCell *lc; + + local_slots = get_local_synced_slots(); + + foreach(lc, local_slots) + { + ReplicationSlot *local_slot = (ReplicationSlot *) lfirst(lc); I think you can use the new style foreach_ptr list macros here. ~~~ 10. reserve_wal_for_slot + Assert(slot != NULL); + Assert(slot->data.restart_lsn == InvalidXLogRecPtr); You can use the macro XLogRecPtrIsInvalid(lot->data.restart_lsn) ~~~ 11. update_and_persist_slot +/* + * Update the LSNs and persist the slot for further syncs if the remote + * restart_lsn and catalog_xmin have caught up with the local ones. Otherwise, + * persist the slot and return. + * + * Return true if the slot is marked READY, otherwise false. + */ +static bool +update_and_persist_slot(RemoteSlot *remote_slot) 11a. The comment says "Otherwise, persist the slot and return" but there is a return false which doesn't seem to persist anything so it seems contrary to the comment. ~ 11b. "slot is marked READY" -- IIUC the synced states no longer exist in v58 so this comment maybe should not be referring to READY anymore. Or maybe there just needs to be more explanation about the difference between 'synced' and the state you call "READY". ~~~ 12. synchronize_one_slot + * The slot is created as a temporary slot and stays in same state until the + * initialization is complete. The initialization is considered to be completed + * once the remote_slot catches up with locally reserved position and local + * slot is updated. The slot is then persisted. I think this comment is related to the "READY" mentioned by update_and_persist_slot. Still, perhaps the terminology needs to be made consistent across all these comments -- e.g. "considered to be completed" versus "READY" versus "sync-ready" etc. ~~~ 13. + ReplicationSlotCreate(remote_slot->name, true, RS_TEMPORARY, + remote_slot->two_phase, + remote_slot->failover, + true); This review comment is similar to elsewhere in this post. Consider commenting on the new parameter like "true /* synced */" ~~~ 14. synchronize_slots + /* + * It is possible to get null values for LSN and Xmin if slot is + * invalidated on the primary server, so handle accordingly. + */ + remote_slot->confirmed_lsn = !slot_attisnull(tupslot, 3) ? 
+ DatumGetLSN(slot_getattr(tupslot, 3, &isnull)) : + InvalidXLogRecPtr; + + remote_slot->restart_lsn = !slot_attisnull(tupslot, 4) ? + DatumGetLSN(slot_getattr(tupslot, 4, &isnull)) : + InvalidXLogRecPtr; + + remote_slot->catalog_xmin = !slot_attisnull(tupslot, 5) ? + DatumGetTransactionId(slot_getattr(tupslot, 5, &isnull)) : + InvalidTransactionId; Isn't this the same functionality as the older v51 code that was written differently? I felt the old style (without ignoring the 'isnull') was more readable. v51 + remote_slot->confirmed_lsn = DatumGetLSN(slot_getattr(tupslot, 3, &isnull)); + if (isnull) + remote_slot->confirmed_lsn = InvalidXLogRecPtr; v58 + remote_slot->confirmed_lsn = !slot_attisnull(tupslot, 3) ? + DatumGetLSN(slot_getattr(tupslot, 3, &isnull)) : + InvalidXLogRecPtr; If you prefer a ternary, it might be cleaner to do it like: Datum d; ... d = slot_getattr(tupslot, 3, &isnull); remote_slot->confirmed_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d); ... ~~~ 15. + + /* Drop local slots that no longer need to be synced. */ + drop_obsolete_slots(remote_slot_list); + + /* Now sync the slots locally */ + foreach(lc, remote_slot_list) + { + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc); + + some_slot_updated |= synchronize_one_slot(wrconn, remote_slot); + } Here you can use the new list macro like foreach_ptr. ~~~ 16. ReplSlotSyncWorkerMain + wrconn = walrcv_connect(PrimaryConnInfo, true, false, + cluster_name[0] ? cluster_name : "slotsyncworker", + &err); + if (wrconn == NULL) + ereport(ERROR, + errcode(ERRCODE_CONNECTION_FAILURE), + errmsg("could not connect to the primary server: %s", err)); Typically, I saw other PG code doing "if (!wrconn)" instead of "if (wrconn == NULL)" ====== src/backend/replication/slotfuncs.c 17. create_physical_replication_slot ReplicationSlotCreate(name, false, temporary ? RS_TEMPORARY : RS_PERSISTENT, false, - false); + false, false); IMO passing parameters like "false, false, false" becomes a bit difficult to understand from the caller's POV so it might be good to comment on the parameter like: ReplicationSlotCreate(name, false, temporary ? RS_TEMPORARY : RS_PERSISTENT, false, false, false /* synced */); (there are a few other places like this where the same review comment applies) ~~~ 18. create_logical_replication_slot ReplicationSlotCreate(name, true, temporary ? RS_TEMPORARY : RS_EPHEMERAL, two_phase, - failover); + failover, false); Same as above. Maybe comment on the parameter like "false /* synced */" ~~~ 19. pg_get_replication_slots case RS_INVAL_WAL_REMOVED: - values[i++] = CStringGetTextDatum("wal_removed"); + values[i++] = CStringGetTextDatum(SLOT_INVAL_WAL_REMOVED_TEXT); break; case RS_INVAL_HORIZON: - values[i++] = CStringGetTextDatum("rows_removed"); + values[i++] = CStringGetTextDatum(SLOT_INVAL_HORIZON_TEXT); break; case RS_INVAL_WAL_LEVEL: - values[i++] = CStringGetTextDatum("wal_level_insufficient"); + values[i++] = CStringGetTextDatum(SLOT_INVAL_WAL_LEVEL_TEXT); break; IMO this code and the #defines that it uses can be written and pushed as an independent patch. ====== src/backend/replication/walsender.c 20. CreateReplicationSlot ReplicationSlotCreate(cmd->slotname, false, cmd->temporary ? RS_TEMPORARY : RS_PERSISTENT, - false, false); + false, false, false); Consider commenting the parameter like "false /* synced */" ~~~ 21. ReplicationSlotCreate(cmd->slotname, true, cmd->temporary ? 
RS_TEMPORARY : RS_EPHEMERAL, - two_phase, failover); + two_phase, failover, false); Consider commenting the parameter like "false /* synced */" ====== src/include/replication/slot.h 22. +/* + * The possible values for 'conflict_reason' returned in + * pg_get_replication_slots. + */ +#define SLOT_INVAL_WAL_REMOVED_TEXT "wal_removed" +#define SLOT_INVAL_HORIZON_TEXT "rows_removed" +#define SLOT_INVAL_WAL_LEVEL_TEXT "wal_level_insufficient" IMO these #defines and also the code in pg_get_replication_slots() that uses them can be written and pushed as an independent patch. ====== .../t/050_standby_failover_slots_sync.pl 23. +# Wait for the standby to start sync +$standby1->start; But there is no waiting here? Maybe the comment should say like "Start the standby so that slot syncing can begin" ~~~ 24. +# Wait for the standby to finish sync +my $offset = -s $standby1->logfile; +$standby1->wait_for_log( + qr/LOG: ( [A-Z0-9]+:)? newly locally created slot \"lsub1_slot\" is sync-ready now/, + $offset); SUGGESTION # Wait for the standby to finish slot syncing ~~~ 25. +# Confirm that logical failover slot is created on the standby and is sync +# ready. +is($standby1->safe_psql('postgres', + q{SELECT failover, synced FROM pg_replication_slots WHERE slot_name = 'lsub1_slot';}), + "t|t", + 'logical slot has failover as true and synced as true on standby'); SUGGESTION # Confirm that the logical failover slot is created on the standby and is flagged as 'synced' ~~~ 26. +$subscriber1->safe_psql( + 'postgres', qq[ + CREATE TABLE tab_int (a int PRIMARY KEY); + ALTER SUBSCRIPTION regress_mysub1 REFRESH PUBLICATION; +]); + +$subscriber1->wait_for_subscription_sync; Add a comment like # Subscribe to the new table data and wait for it to arrive ~~~ 27. +# Disable hot_standby_feedback temporarily to stop slot sync worker otherwise +# the concerned testing scenarios here may be interrupted by different error: +# 'ERROR: replication slot is active for PID ..' + +$standby1->safe_psql('postgres', 'ALTER SYSTEM SET hot_standby_feedback = off;'); +$standby1->restart; Remove the blank line. ~~~ 28. +is($standby1->safe_psql('postgres', + q{SELECT slot_name FROM pg_replication_slots WHERE slot_name = 'lsub1_slot';}), + 'lsub1_slot', + 'synced slot retained on the new primary'); There should be some comment like: SUGGESTION # Confirm the synced slot 'lsub1_slot' is retained on the new primary ~~~ 29. +# Confirm that data in tab_int replicated on subscriber +is( $subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}), + "20", + 'data replicated from the new primary'); /replicated on subscriber/replicated on the subscriber/ ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi, On Wed, Jan 10, 2024 at 12:23:14PM +0000, Zhijie Hou (Fujitsu) wrote: > On Wednesday, January 10, 2024 2:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn); > > + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn, > > + remote_slot->catalog_xmin); > > + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn, > > + remote_slot->restart_lsn); > > +} > > > > IIUC on the standby we just want to overwrite what we get from primary no? If > > so why we are using those APIs that are meant for the actual decoding slots > > where it needs to take certain logical decisions instead of mere overwriting? > > I think we don't have a strong reason to use these APIs, but it was convenient to > use these APIs as they can take care of updating the slots info and will call > functions like, ReplicationSlotsComputeRequiredXmin, > ReplicationSlotsComputeRequiredLSN internally. Or do you prefer directly overwriting > the fields and call these manually ? I'd vote for using the APIs as I think it will be harder to maintain if we are not using them (means ensure the "direct" overwriting still makes sense over time). FWIW, pg_failover_slots also rely on those APIs from what I can see. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Jan 11, 2024 at 1:19 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Wed, Jan 10, 2024 at 12:23:14PM +0000, Zhijie Hou (Fujitsu) wrote: > > On Wednesday, January 10, 2024 2:26 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > + LogicalConfirmReceivedLocation(remote_slot->confirmed_lsn); > > > + LogicalIncreaseXminForSlot(remote_slot->confirmed_lsn, > > > + remote_slot->catalog_xmin); > > > + LogicalIncreaseRestartDecodingForSlot(remote_slot->confirmed_lsn, > > > + remote_slot->restart_lsn); > > > +} > > > > > > IIUC on the standby we just want to overwrite what we get from primary no? If > > > so why we are using those APIs that are meant for the actual decoding slots > > > where it needs to take certain logical decisions instead of mere overwriting? > > > > I think we don't have a strong reason to use these APIs, but it was convenient to > > use these APIs as they can take care of updating the slots info and will call > > functions like, ReplicationSlotsComputeRequiredXmin, > > ReplicationSlotsComputeRequiredLSN internally. Or do you prefer directly overwriting > > the fields and call these manually ? > > I'd vote for using the APIs as I think it will be harder to maintain if we are > not using them (means ensure the "direct" overwriting still makes sense over time). +1 PFA v60 which addresses: 1) Peter's comment in [1] 2) Peter's off list suggestion to convert sleep_quanta to sleep_ms and simplify the logic in wait_for_slot_activity() [1]: https://www.postgresql.org/message-id/CAHut%2BPtJAAPghc4GPt0k%3DjeMz1qu4H7mnaDifOHsVsMqi-qOLA%40mail.gmail.com thanks Shveta
On Thu, Jan 11, 2024 at 7:28 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for patch v58-0002 Thank You for the feedback. These are addressed in v60. Please find my response inline for a few. > (FYI - I quickly checked with the latest v59-0002 and AFAIK all these > review comments below are still relevant) > > ====== > Commit message > > 1. > If a logical slot is invalidated on the primary, slot on the standby is also > invalidated. > > ~ > > /slot on the standby/then that slot on the standby/ > > ====== > doc/src/sgml/logicaldecoding.sgml > > 2. > In order to resume logical replication after failover from the synced > logical slots, it is required that 'conninfo' in subscriptions are > altered to point to the new primary server using ALTER SUBSCRIPTION > ... CONNECTION. It is recommended that subscriptions are first > disabled before promoting the standby and are enabled back once these > are altered as above after failover. > > ~ > > Minor rewording mainly to reduce a long sentence. > > SUGGESTION > To resume logical replication after failover from the synced logical > slots, the subscription's 'conninfo' must be altered to point to the > new primary server. This is done using ALTER SUBSCRIPTION ... > CONNECTION. It is recommended that subscriptions are first disabled > before promoting the standby and are enabled back after altering the > connection string. > > ====== > doc/src/sgml/system-views.sgml > > 3. > + <entry role="catalog_table_entry"><para role="column_definition"> > + <structfield>synced</structfield> <type>bool</type> > + </para> > + <para> > + True if this logical slot was synced from a primary server. > + </para> > + <para> > > SUGGESTION > True if this is a logical slot that was synced from a primary server. > > ====== > src/backend/access/transam/xlogrecovery.c > > 4. > + /* > + * Shutdown the slot sync workers to prevent potential conflicts between > + * user processes and slotsync workers after a promotion. > + * > + * We do not update the 'synced' column from true to false here, as any > + * failed update could leave some slot's 'synced' column as false. This > + * could cause issues during slot sync after restarting the server as a > + * standby. While updating after switching to the new timeline is an > + * option, it does not simplify the handling for 'synced' column. > + * Therefore, we retain the 'synced' column as true after promotion as they > + * can provide useful information about their origin. > + */ > > Minor comment wording changes. > > BEFORE > ...any failed update could leave some slot's 'synced' column as false. > SUGGESTION > ...any failed update could leave 'synced' column false for some slots. > > ~ > > BEFORE > Therefore, we retain the 'synced' column as true after promotion as > they can provide useful information about their origin. > SUGGESTION > Therefore, we retain the 'synced' column as true after promotion as it > may provide useful information about the slot origin. > > ====== > src/backend/replication/logical/slotsync.c > > 5. > + * While creating the slot on physical standby, if the local restart_lsn and/or > + * local catalog_xmin is ahead of those on the remote then the worker cannot > + * create the local slot in sync with the primary server because that would > + * mean moving the local slot backwards and the standby might not have WALs > + * retained for old LSN. In this case, the worker will mark the slot as > + * RS_TEMPORARY. 
Once the primary server catches up, it will move the slot to > + * RS_PERSISTENT and will perform the sync periodically. > > /will move the slot to RS_PERSISTENT/will mark the slot as RS_PERSISTENT/ > > ~~~ > > 6. drop_synced_slots_internal > +/* > + * Helper function for drop_obsolete_slots() > + * > + * Drops synced slot identified by the passed in name. > + */ > +static void > +drop_synced_slots_internal(const char *name, bool nowait) > +{ > + Assert(MyReplicationSlot == NULL); > + > + ReplicationSlotAcquire(name, nowait); > + > + Assert(MyReplicationSlot->data.synced); > + > + ReplicationSlotDropAcquired(); > +} > > IMO you don't need this function. AFAICT it is only called from one > place and does not result in fewer lines of code. > > ~~~ > > 7. get_local_synced_slots > > + /* Check if it is logical synchronized slot */ > + if (s->in_use && SlotIsLogical(s) && s->data.synced) > + { > + local_slots = lappend(local_slots, s); > + } > > Do you need to check SlotIsLogical(s) here? I thought s->data.synced > can never be true for physical slots. I felt you could write this like > blelow: > > if (s->in_use s->data.synced) > { > Assert(SlotIsLogical(s)); > local_slots = lappend(local_slots, s); > } > > ~~~ > > 8. check_sync_slot_on_remote > > +static bool > +check_sync_slot_on_remote(ReplicationSlot *local_slot, List *remote_slots, > + bool *locally_invalidated) > +{ > + ListCell *lc; > + > + foreach(lc, remote_slots) > + { > + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc); > > I think you can use the new style foreach_ptr list macros here. > > ~~~ > > 9. drop_obsolete_slots > > +drop_obsolete_slots(List *remote_slot_list) > +{ > + List *local_slots = NIL; > + ListCell *lc; > + > + local_slots = get_local_synced_slots(); > + > + foreach(lc, local_slots) > + { > + ReplicationSlot *local_slot = (ReplicationSlot *) lfirst(lc); > > I think you can use the new style foreach_ptr list macros here. > > ~~~ > > 10. reserve_wal_for_slot > > + Assert(slot != NULL); > + Assert(slot->data.restart_lsn == InvalidXLogRecPtr); > > You can use the macro XLogRecPtrIsInvalid(lot->data.restart_lsn) > > ~~~ > > 11. update_and_persist_slot > > +/* > + * Update the LSNs and persist the slot for further syncs if the remote > + * restart_lsn and catalog_xmin have caught up with the local ones. Otherwise, > + * persist the slot and return. > + * > + * Return true if the slot is marked READY, otherwise false. > + */ > +static bool > +update_and_persist_slot(RemoteSlot *remote_slot) > > 11a. > The comment says "Otherwise, persist the slot and return" but there is > a return false which doesn't seem to persist anything so it seems > contrary to the comment. > > ~ > > 11b. > "slot is marked READY" -- IIUC the synced states no longer exist in > v58 so this comment maybe should not be referring to READY anymore. Or > maybe there just needs to be more explanation about the difference > between 'synced' and the state you call "READY". > > ~~~ > > 12. synchronize_one_slot > > + * The slot is created as a temporary slot and stays in same state until the > + * initialization is complete. The initialization is considered to be completed > + * once the remote_slot catches up with locally reserved position and local > + * slot is updated. The slot is then persisted. > > I think this comment is related to the "READY" mentioned by > update_and_persist_slot. Still, perhaps the terminology needs to be > made consistent across all these comments -- e.g. "considered to be > completed" versus "READY" versus "sync-ready" etc. 
> > ~~~ > > 13. > + ReplicationSlotCreate(remote_slot->name, true, RS_TEMPORARY, > + remote_slot->two_phase, > + remote_slot->failover, > + true); > > > This review comment is similar to elsewhere in this post. Consider > commenting on the new parameter like "true /* synced */" > > ~~~ > > 14. synchronize_slots > > + /* > + * It is possible to get null values for LSN and Xmin if slot is > + * invalidated on the primary server, so handle accordingly. > + */ > + remote_slot->confirmed_lsn = !slot_attisnull(tupslot, 3) ? > + DatumGetLSN(slot_getattr(tupslot, 3, &isnull)) : > + InvalidXLogRecPtr; > + > + remote_slot->restart_lsn = !slot_attisnull(tupslot, 4) ? > + DatumGetLSN(slot_getattr(tupslot, 4, &isnull)) : > + InvalidXLogRecPtr; > + > + remote_slot->catalog_xmin = !slot_attisnull(tupslot, 5) ? > + DatumGetTransactionId(slot_getattr(tupslot, 5, &isnull)) : > + InvalidTransactionId; > > Isn't this the same functionality as the older v51 code that was > written differently? I felt the old style (without ignoring the > 'isnull') was more readable. > > v51 > + remote_slot->confirmed_lsn = DatumGetLSN(slot_getattr(tupslot, 3, &isnull)); > + if (isnull) > + remote_slot->confirmed_lsn = InvalidXLogRecPtr; > > v58 > + remote_slot->confirmed_lsn = !slot_attisnull(tupslot, 3) ? > + DatumGetLSN(slot_getattr(tupslot, 3, &isnull)) : > + InvalidXLogRecPtr; > > If you prefer a ternary, it might be cleaner to do it like: We got a CFBot failure, where the v51's way was crashing in a 32-bit env, because there a Datum for int64 is regarded as a pointer and thus it resulted in NULL pointer access if slot_getattr() returned NULL. Please see DatumGetInt64(). > Datum d; > ... > d = slot_getattr(tupslot, 3, &isnull); > remote_slot->confirmed_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d); Okay, I see. This can also be done. I kind of missed this line earlier, I can consider it in the next version. > ~~~ > > 15. > + > + /* Drop local slots that no longer need to be synced. */ > + drop_obsolete_slots(remote_slot_list); > + > + /* Now sync the slots locally */ > + foreach(lc, remote_slot_list) > + { > + RemoteSlot *remote_slot = (RemoteSlot *) lfirst(lc); > + > + some_slot_updated |= synchronize_one_slot(wrconn, remote_slot); > + } > > Here you can use the new list macro like foreach_ptr. > > ~~~ > > 16. ReplSlotSyncWorkerMain > > + wrconn = walrcv_connect(PrimaryConnInfo, true, false, > + cluster_name[0] ? cluster_name : "slotsyncworker", > + &err); > + if (wrconn == NULL) > + ereport(ERROR, > + errcode(ERRCODE_CONNECTION_FAILURE), > + errmsg("could not connect to the primary server: %s", err)); > > > Typically, I saw other PG code doing "if (!wrconn)" instead of "if > (wrconn == NULL)" > > > ====== > src/backend/replication/slotfuncs.c > > 17. create_physical_replication_slot > > ReplicationSlotCreate(name, false, > temporary ? RS_TEMPORARY : RS_PERSISTENT, false, > - false); > + false, false); > > IMO passing parameters like "false, false, false" becomes a bit > difficult to understand from the caller's POV so it might be good to > comment on the parameter like: > > ReplicationSlotCreate(name, false, > temporary ? RS_TEMPORARY : RS_PERSISTENT, false, > false, false /* synced */); > > (there are a few other places like this where the same review comment applies) > > ~~~ > > 18. create_logical_replication_slot > > ReplicationSlotCreate(name, true, > temporary ? RS_TEMPORARY : RS_EPHEMERAL, two_phase, > - failover); > + failover, false); > > Same as above. 
Maybe comment on the parameter like "false /* synced */" > > ~~~ > > 19. pg_get_replication_slots > > case RS_INVAL_WAL_REMOVED: > - values[i++] = CStringGetTextDatum("wal_removed"); > + values[i++] = CStringGetTextDatum(SLOT_INVAL_WAL_REMOVED_TEXT); > break; > > case RS_INVAL_HORIZON: > - values[i++] = CStringGetTextDatum("rows_removed"); > + values[i++] = CStringGetTextDatum(SLOT_INVAL_HORIZON_TEXT); > break; > > case RS_INVAL_WAL_LEVEL: > - values[i++] = CStringGetTextDatum("wal_level_insufficient"); > + values[i++] = CStringGetTextDatum(SLOT_INVAL_WAL_LEVEL_TEXT); > break; > > IMO this code and the #defines that it uses can be written and pushed > as an independent patch. Okay, let me review this one and #22 which mentions the same. > ====== > src/backend/replication/walsender.c > > 20. CreateReplicationSlot > > ReplicationSlotCreate(cmd->slotname, false, > cmd->temporary ? RS_TEMPORARY : RS_PERSISTENT, > - false, false); > + false, false, false); > > Consider commenting the parameter like "false /* synced */" > > ~~~ > > 21. > ReplicationSlotCreate(cmd->slotname, true, > cmd->temporary ? RS_TEMPORARY : RS_EPHEMERAL, > - two_phase, failover); > + two_phase, failover, false); > > Consider commenting the parameter like "false /* synced */" > > ====== > src/include/replication/slot.h > > 22. > +/* > + * The possible values for 'conflict_reason' returned in > + * pg_get_replication_slots. > + */ > +#define SLOT_INVAL_WAL_REMOVED_TEXT "wal_removed" > +#define SLOT_INVAL_HORIZON_TEXT "rows_removed" > +#define SLOT_INVAL_WAL_LEVEL_TEXT "wal_level_insufficient" > > IMO these #defines and also the code in pg_get_replication_slots() > that uses them can be written and pushed as an independent patch. > > ====== > .../t/050_standby_failover_slots_sync.pl > > 23. > +# Wait for the standby to start sync > +$standby1->start; > > But there is no waiting here? Maybe the comment should say like "Start > the standby so that slot syncing can begin" > > ~~~ > > 24. > +# Wait for the standby to finish sync > +my $offset = -s $standby1->logfile; > +$standby1->wait_for_log( > + qr/LOG: ( [A-Z0-9]+:)? newly locally created slot \"lsub1_slot\" is > sync-ready now/, > + $offset); > > SUGGESTION > # Wait for the standby to finish slot syncing > > ~~~ > > 25. > +# Confirm that logical failover slot is created on the standby and is sync > +# ready. > +is($standby1->safe_psql('postgres', > + q{SELECT failover, synced FROM pg_replication_slots WHERE slot_name > = 'lsub1_slot';}), > + "t|t", > + 'logical slot has failover as true and synced as true on standby'); > > SUGGESTION > # Confirm that the logical failover slot is created on the standby and > is flagged as 'synced' > > ~~~ > > 26. > +$subscriber1->safe_psql( > + 'postgres', qq[ > + CREATE TABLE tab_int (a int PRIMARY KEY); > + ALTER SUBSCRIPTION regress_mysub1 REFRESH PUBLICATION; > +]); > + > +$subscriber1->wait_for_subscription_sync; > > Add a comment like > > # Subscribe to the new table data and wait for it to arrive > > ~~~ > > 27. > +# Disable hot_standby_feedback temporarily to stop slot sync worker otherwise > +# the concerned testing scenarios here may be interrupted by different error: > +# 'ERROR: replication slot is active for PID ..' > + > +$standby1->safe_psql('postgres', 'ALTER SYSTEM SET > hot_standby_feedback = off;'); > +$standby1->restart; > > Remove the blank line. > > ~~~ > > 28. 
> +is($standby1->safe_psql('postgres', > + q{SELECT slot_name FROM pg_replication_slots WHERE slot_name = > 'lsub1_slot';}), > + 'lsub1_slot', > + 'synced slot retained on the new primary'); > > There should be some comment like: > > SUGGESTION > # Confirm the synced slot 'lsub1_slot' is retained on the new primary > > ~~~ > > 29. > +# Confirm that data in tab_int replicated on subscriber > +is( $subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}), > + "20", > + 'data replicated from the new primary'); > > /replicated on subscriber/replicated on the subscriber/ > > > ====== > Kind Regards, > Peter Smith. > Fujitsu Australia
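Tying together comment 14 and the 32-bit explanation above, a hedged sketch of the null-safe fetch (the attribute numbers follow the quoted hunk):

Datum       d;
bool        isnull;

/*
 * Fetch the datum first and convert only when it is not null: on 32-bit
 * builds DatumGetLSN() reads the Datum as a pointer, so converting the
 * zero Datum returned for a NULL attribute is what crashed the v51 coding.
 */
d = slot_getattr(tupslot, 3, &isnull);
remote_slot->confirmed_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d);

d = slot_getattr(tupslot, 4, &isnull);
remote_slot->restart_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d);

d = slot_getattr(tupslot, 5, &isnull);
remote_slot->catalog_xmin = isnull ? InvalidTransactionId : DatumGetTransactionId(d);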
On Wed, Jan 10, 2024 at 5:53 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > IIUC on the standby we just want to overwrite what we get from primary no? If > > so why we are using those APIs that are meant for the actual decoding slots > > where it needs to take certain logical decisions instead of mere overwriting? > > I think we don't have a strong reason to use these APIs, but it was convenient to > use these APIs as they can take care of updating the slots info and will call > functions like, ReplicationSlotsComputeRequiredXmin, > ReplicationSlotsComputeRequiredLSN internally. Or do you prefer directly overwriting > the fields and call these manually ? I might be missing something but do you want to call ReplicationSlotsComputeRequiredXmin() kind of functions in standby? I mean those will ultimately update the catalog xmin and replication xmin in Procarray and that prevents Vacuum from cleaning up some of the required xids. But on standby, those shared memory parameters are not used IIUC. In my opinion on standby, we just need to update the values in the local slots and whatever we get from remote slots without taking all the logical decisions in the hope that they will all fall into a particular path, for example, if you see LogicalIncreaseXminForSlot(), it is doing following steps of operations as shown below[1]. These all make sense when you are doing candidate-based updation where we first mark the candidates and then update the candidate to real value once you get the confirmation for the LSN. Now following all this logic looks completely weird unless this can fall in a different path I feel it will do some duplicate steps as well. For example in local_slot_update(), first you call LogicalConfirmReceivedLocation() which will set the 'data.confirmed_flush' and then you will call LogicalIncreaseXminForSlot() which will set the 'updated_xmin = true;' and will again call LogicalConfirmReceivedLocation(). I don't think this is the correct way of reusing the function unless you need to go through those paths and I am missing something. [1] LogicalIncreaseXminForSlot() { if (TransactionIdPrecedesOrEquals(xmin, slot->data.catalog_xmin)) { } else if (current_lsn <= slot->data.confirmed_flush) { } else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr) { } if (updated_xmin) LogicalConfirmReceivedLocation(slot->data.confirmed_flush); } -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > +static bool > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) > { > ... > + /* Slot ready for sync, so sync it. */ > + else > + { > + /* > + * Sanity check: With hot_standby_feedback enabled and > + * invalidations handled appropriately as above, this should never > + * happen. > + */ > + if (remote_slot->restart_lsn < slot->data.restart_lsn) > + elog(ERROR, > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > + " to remote slot's LSN(%X/%X) as synchronization" > + " would move it backwards", remote_slot->name, > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > ... > } > > I was thinking about the above code in the patch and as far as I can > think this can only occur if the same name slot is re-created with > prior restart_lsn after the existing slot is dropped. Normally, the > newly created slot (with the same name) will have higher restart_lsn > but one can mimic it by copying some older slot by using > pg_copy_logical_replication_slot(). > > I don't think as mentioned in comments even if hot_standby_feedback is > temporarily set to off, the above shouldn't happen. It can only lead > to invalidated slots on standby. > > To close the above race, I could think of the following ways: > 1. Drop and re-create the slot. > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > ahead of local_slot's LSN then we can update it; but as mentioned in > your previous comment, we need to update all other fields as well. If > we follow this then we probably need to have a check for catalog_xmin > as well. > The second point as mentioned is slightly misleading, so let me try to rephrase it once again: Emit LOG/WARNING in this case and once remote_slot's LSN moves ahead of local_slot's LSN then we can update it; additionally, we need to update all other fields like two_phase as well. If we follow this then we probably need to have a check for catalog_xmin as well along remote_slot's restart_lsn. > Now, related to this the other case which needs some handling is what > if the remote_slot's restart_lsn is greater than local_slot's > restart_lsn but it is a re-created slot with the same name. In that > case, I think the other properties like 'two_phase', 'plugin' could be > different. So, is simply copying those sufficient or do we need to do > something else as well? > Bertrand, Dilip, Sawada-San, and others, please share your opinion on this problem as I think it is important to handle this race condition. -- With Regards, Amit Kapila.
On Thu, Jan 11, 2024 at 3:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Jan 10, 2024 at 5:53 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > > > IIUC on the standby we just want to overwrite what we get from primary no? If > > > so why we are using those APIs that are meant for the actual decoding slots > > > where it needs to take certain logical decisions instead of mere overwriting? > > > > I think we don't have a strong reason to use these APIs, but it was convenient to > > use these APIs as they can take care of updating the slots info and will call > > functions like, ReplicationSlotsComputeRequiredXmin, > > ReplicationSlotsComputeRequiredLSN internally. Or do you prefer directly overwriting > > the fields and call these manually ? > > I might be missing something but do you want to call > ReplicationSlotsComputeRequiredXmin() kind of functions in standby? I > mean those will ultimately update the catalog xmin and replication > xmin in Procarray and that prevents Vacuum from cleaning up some of > the required xids. But on standby, those shared memory parameters are > not used IIUC. > These xmins are required for logical slots as we allow logical decoding on standby (see GetOldestSafeDecodingTransactionId()). We also invalidate such slots if the required rows are removed on standby. Similarly, we need ReplicationSlotsComputeRequiredLSN() to avoid getting the required WAL removed. > In my opinion on standby, we just need to update the values in the > local slots and whatever we get from remote slots without taking all > the logical decisions in the hope that they will all fall into a > particular path, for example, if you see LogicalIncreaseXminForSlot(), > it is doing following steps of operations as shown below[1]. These > all make sense when you are doing candidate-based updation where we > first mark the candidates and then update the candidate to real value > once you get the confirmation for the LSN. Now following all this > logic looks completely weird unless this can fall in a different path > I feel it will do some duplicate steps as well. For example in > local_slot_update(), first you call LogicalConfirmReceivedLocation() > which will set the 'data.confirmed_flush' and then you will call > LogicalIncreaseXminForSlot() which will set the 'updated_xmin = true;' > and will again call LogicalConfirmReceivedLocation(). > In case (else if (slot->candidate_xmin_lsn == InvalidXLogRecPtr)), even the updated_xmin is not getting set to true which means there is a chance that we will never update the required xmin values. I don't think > this is the correct way of reusing the function unless you need to go > through those paths and I am missing something. > I agree with this conclusion and also think that we should directly update the required fields and call functions like ReplicationSlotsComputeRequiredLSN() wherever required. -- With Regards, Amit Kapila.
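To make the agreed-upon alternative concrete, here is a rough sketch, not the patch code, assuming the synced slot has already been acquired as MyReplicationSlot: overwrite the local slot's positions directly and then recompute the global horizons explicitly:

static void
local_slot_update(RemoteSlot *remote_slot)
{
    ReplicationSlot *slot = MyReplicationSlot;

    Assert(slot->data.invalidated == RS_INVAL_NONE);

    /* Overwrite the positions with what the primary reported */
    SpinLockAcquire(&slot->mutex);
    slot->data.confirmed_flush = remote_slot->confirmed_lsn;
    slot->data.restart_lsn = remote_slot->restart_lsn;
    slot->data.catalog_xmin = remote_slot->catalog_xmin;
    slot->effective_catalog_xmin = remote_slot->catalog_xmin;
    SpinLockRelease(&slot->mutex);

    ReplicationSlotMarkDirty();

    /* Recompute the oldest xmin/LSN that must be retained across all slots */
    ReplicationSlotsComputeRequiredXmin(false);
    ReplicationSlotsComputeRequiredLSN();
}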
Hi, On Thu, Jan 11, 2024 at 04:22:56PM +0530, Amit Kapila wrote: > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > +static bool > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) > > { > > ... > > + /* Slot ready for sync, so sync it. */ > > + else > > + { > > + /* > > + * Sanity check: With hot_standby_feedback enabled and > > + * invalidations handled appropriately as above, this should never > > + * happen. > > + */ > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) > > + elog(ERROR, > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > + " to remote slot's LSN(%X/%X) as synchronization" > > + " would move it backwards", remote_slot->name, > > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > > ... > > } > > > > I was thinking about the above code in the patch and as far as I can > > think this can only occur if the same name slot is re-created with > > prior restart_lsn after the existing slot is dropped. Normally, the > > newly created slot (with the same name) will have higher restart_lsn > > but one can mimic it by copying some older slot by using > > pg_copy_logical_replication_slot(). > > > > I don't think as mentioned in comments even if hot_standby_feedback is > > temporarily set to off, the above shouldn't happen. It can only lead > > to invalidated slots on standby. I also think so. > > > > To close the above race, I could think of the following ways: > > 1. Drop and re-create the slot. > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > > ahead of local_slot's LSN then we can update it; but as mentioned in > > your previous comment, we need to update all other fields as well. If > > we follow this then we probably need to have a check for catalog_xmin > > as well. IIUC, this would be a sync slot (so not usable until promotion) that could not be used anyway (invalidated), so I'll vote for drop / re-create then. > > Now, related to this the other case which needs some handling is what > > if the remote_slot's restart_lsn is greater than local_slot's > > restart_lsn but it is a re-created slot with the same name. In that > > case, I think the other properties like 'two_phase', 'plugin' could be > > different. So, is simply copying those sufficient or do we need to do > > something else as well? > > > I'm not sure to follow here. If the remote slot is re-created then it would be also dropped / re-created locally, or am I missing something? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Jan 11, 2024 at 9:11 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Thu, Jan 11, 2024 at 04:22:56PM +0530, Amit Kapila wrote: > > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > +static bool > > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) > > > { > > > ... > > > + /* Slot ready for sync, so sync it. */ > > > + else > > > + { > > > + /* > > > + * Sanity check: With hot_standby_feedback enabled and > > > + * invalidations handled appropriately as above, this should never > > > + * happen. > > > + */ > > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) > > > + elog(ERROR, > > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > > + " to remote slot's LSN(%X/%X) as synchronization" > > > + " would move it backwards", remote_slot->name, > > > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > > > ... > > > } > > > > > > I was thinking about the above code in the patch and as far as I can > > > think this can only occur if the same name slot is re-created with > > > prior restart_lsn after the existing slot is dropped. Normally, the > > > newly created slot (with the same name) will have higher restart_lsn > > > but one can mimic it by copying some older slot by using > > > pg_copy_logical_replication_slot(). > > > > > > I don't think as mentioned in comments even if hot_standby_feedback is > > > temporarily set to off, the above shouldn't happen. It can only lead > > > to invalidated slots on standby. > > I also think so. > > > > > > > To close the above race, I could think of the following ways: > > > 1. Drop and re-create the slot. > > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > > > ahead of local_slot's LSN then we can update it; but as mentioned in > > > your previous comment, we need to update all other fields as well. If > > > we follow this then we probably need to have a check for catalog_xmin > > > as well. > > IIUC, this would be a sync slot (so not usable until promotion) that could > not be used anyway (invalidated), so I'll vote for drop / re-create then. > No, it can happen for non-sync slots as well. > > > Now, related to this the other case which needs some handling is what > > > if the remote_slot's restart_lsn is greater than local_slot's > > > restart_lsn but it is a re-created slot with the same name. In that > > > case, I think the other properties like 'two_phase', 'plugin' could be > > > different. So, is simply copying those sufficient or do we need to do > > > something else as well? > > > > > > > I'm not sure to follow here. If the remote slot is re-created then it would > be also dropped / re-created locally, or am I missing something? > As our slot-syncing mechanism is asynchronous (from time to time we check the slot information on primary), isn't it possible that the same name slot is dropped and recreated between slot-sync worker's checks? -- With Regards, Amit Kapila.
On Thursday, January 11, 2024 11:42 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: Hi, > On Thu, Jan 11, 2024 at 04:22:56PM +0530, Amit Kapila wrote: > > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > +static bool > > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot > > > +*remote_slot) > > > { > > > ... > > > + /* Slot ready for sync, so sync it. */ else { > > > + /* > > > + * Sanity check: With hot_standby_feedback enabled and > > > + * invalidations handled appropriately as above, this should never > > > + * happen. > > > + */ > > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) elog(ERROR, > > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > > + " to remote slot's LSN(%X/%X) as synchronization" > > > + " would move it backwards", remote_slot->name, > > > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > > > ... > > > } > > > > > > I was thinking about the above code in the patch and as far as I can > > > think this can only occur if the same name slot is re-created with > > > prior restart_lsn after the existing slot is dropped. Normally, the > > > newly created slot (with the same name) will have higher restart_lsn > > > but one can mimic it by copying some older slot by using > > > pg_copy_logical_replication_slot(). > > > > > > I don't think as mentioned in comments even if hot_standby_feedback > > > is temporarily set to off, the above shouldn't happen. It can only > > > lead to invalidated slots on standby. > > I also think so. > > > > > > > To close the above race, I could think of the following ways: > > > 1. Drop and re-create the slot. > > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > > > ahead of local_slot's LSN then we can update it; but as mentioned in > > > your previous comment, we need to update all other fields as well. > > > If we follow this then we probably need to have a check for > > > catalog_xmin as well. > > IIUC, this would be a sync slot (so not usable until promotion) that could not be > used anyway (invalidated), so I'll vote for drop / re-create then. Such race can happen when user drop and re-create the same failover slot on primary as well. For example, user dropped one failover slot and them immediately created a new one by copying from an old slot(using pg_copy_logical_replication_slot). Then the slotsync worker will find the restart_lsn of this slot go backwards. The steps: ---- SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput', false, false, true); SELECT 'init' FROM pg_create_logical_replication_slot('test', 'pgoutput', false, false, true); - Advance the restart_lsn of 'test' slot CREATE TABLE test2(a int); INSERT INTO test2 SELECT generate_series(1,10000,1); SELECT slot_name FROM pg_replication_slot_advance('test', pg_current_wal_lsn()); - re-create the test slot but based on the old isolation_slot. SELECT pg_drop_replication_slot('test'); SELECT 'copy' FROM pg_copy_logical_replication_slot('isolation_slot', 'test'); Then the restart_lsn of 'test' slot will go backwards. Best Regards, Hou zj
On Thu, Jan 11, 2024 at 9:11 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Thu, Jan 11, 2024 at 04:22:56PM +0530, Amit Kapila wrote: > > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > +static bool > > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) > > > { > > > ... > > > + /* Slot ready for sync, so sync it. */ > > > + else > > > + { > > > + /* > > > + * Sanity check: With hot_standby_feedback enabled and > > > + * invalidations handled appropriately as above, this should never > > > + * happen. > > > + */ > > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) > > > + elog(ERROR, > > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > > + " to remote slot's LSN(%X/%X) as synchronization" > > > + " would move it backwards", remote_slot->name, > > > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > > > ... > > > } > > > > > > I was thinking about the above code in the patch and as far as I can > > > think this can only occur if the same name slot is re-created with > > > prior restart_lsn after the existing slot is dropped. Normally, the > > > newly created slot (with the same name) will have higher restart_lsn > > > but one can mimic it by copying some older slot by using > > > pg_copy_logical_replication_slot(). > > > > > > I don't think as mentioned in comments even if hot_standby_feedback is > > > temporarily set to off, the above shouldn't happen. It can only lead > > > to invalidated slots on standby. > > I also think so. > > > > > > > To close the above race, I could think of the following ways: > > > 1. Drop and re-create the slot. > > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > > > ahead of local_slot's LSN then we can update it; but as mentioned in > > > your previous comment, we need to update all other fields as well. If > > > we follow this then we probably need to have a check for catalog_xmin > > > as well. > > IIUC, this would be a sync slot (so not usable until promotion) that could > not be used anyway (invalidated), so I'll vote for drop / re-create then. > The one more drawback I see is that in such a case (where the slot could have dropped on primary) is that it is not advisable to keep it on standby. So, I also think we should drop and re-create the slots in this case unless I am missing something here. -- With Regards, Amit Kapila.
Hi, On Fri, Jan 12, 2024 at 08:42:39AM +0530, Amit Kapila wrote: > On Thu, Jan 11, 2024 at 9:11 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > On Thu, Jan 11, 2024 at 04:22:56PM +0530, Amit Kapila wrote: > > > > > > > > To close the above race, I could think of the following ways: > > > > 1. Drop and re-create the slot. > > > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > > > > ahead of local_slot's LSN then we can update it; but as mentioned in > > > > your previous comment, we need to update all other fields as well. If > > > > we follow this then we probably need to have a check for catalog_xmin > > > > as well. > > > > IIUC, this would be a sync slot (so not usable until promotion) that could > > not be used anyway (invalidated), so I'll vote for drop / re-create then. > > > > No, it can happen for non-sync slots as well. Yeah, I meant that we could decide to drop/re-create only for sync slots. > > > > > Now, related to this the other case which needs some handling is what > > > > if the remote_slot's restart_lsn is greater than local_slot's > > > > restart_lsn but it is a re-created slot with the same name. In that > > > > case, I think the other properties like 'two_phase', 'plugin' could be > > > > different. So, is simply copying those sufficient or do we need to do > > > > something else as well? > > > > > > > > > > > I'm not sure to follow here. If the remote slot is re-created then it would > > be also dropped / re-created locally, or am I missing something? > > > > As our slot-syncing mechanism is asynchronous (from time to time we > check the slot information on primary), isn't it possible that the > same name slot is dropped and recreated between slot-sync worker's > checks? > Yeah, I should have thought harder ;-) So for this case: if we had an easy way to detect that a remote slot has been dropped/re-created, then I think we would also drop and re-create it on the standby. If so, I think we should then update all the fields (that we're currently updating in the "create locally" case) when we detect that (at least) one of the following differs: - dboid - plugin - two_phase Maybe the "best" approach would be to have a way to detect that a slot has been re-created on the primary (but that would mean relying on more than the slot name to "identify" a slot, and probably adding a new member to the struct to do so). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
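To make the comparison being proposed here concrete, this is a rough sketch (not the patch's code) of how the standby could decide that a same-named remote slot looks re-created, by comparing the properties that cannot change for a live slot. The RemoteSlot fields are the ones the patch is already quoted as carrying (database, plugin, two_phase); what to do on a mismatch (drop/re-create vs. overwrite everything) is exactly the open question in this sub-thread.

#include "postgres.h"

#include "commands/dbcommands.h"    /* get_database_oid() */
#include "replication/slot.h"

/* Reduced stand-in for the patch's RemoteSlot; only the fields used here. */
typedef struct RemoteSlot
{
    char    *database;
    char    *plugin;
    bool     two_phase;
} RemoteSlot;

/*
 * Return true if the remote slot's immutable properties no longer match
 * the local copy, i.e. the slot was most likely dropped and re-created
 * under the same name on the primary between two sync cycles.
 */
static bool
remote_slot_looks_recreated(ReplicationSlot *local_slot, RemoteSlot *remote_slot)
{
    Oid     remote_dbid = get_database_oid(remote_slot->database, false);

    if (local_slot->data.database != remote_dbid)
        return true;
    if (strcmp(NameStr(local_slot->data.plugin), remote_slot->plugin) != 0)
        return true;
    if (local_slot->data.two_phase != remote_slot->two_phase)
        return true;

    return false;
}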
Hi, On Fri, Jan 12, 2024 at 03:46:00AM +0000, Zhijie Hou (Fujitsu) wrote: > On Thursday, January 11, 2024 11:42 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > > On Thu, Jan 11, 2024 at 04:22:56PM +0530, Amit Kapila wrote: > > IIUC, this would be a sync slot (so not usable until promotion) that could not be > > used anyway (invalidated), so I'll vote for drop / re-create then. > > Such race can happen when user drop and re-create the same failover slot on > primary as well. For example, user dropped one failover slot and them > immediately created a new one by copying from an old slot(using > pg_copy_logical_replication_slot). Then the slotsync worker will find the > restart_lsn of this slot go backwards. > > The steps: > ---- > SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'pgoutput', false, false, true); > SELECT 'init' FROM pg_create_logical_replication_slot('test', 'pgoutput', false, false, true); > > - Advance the restart_lsn of 'test' slot > CREATE TABLE test2(a int); > INSERT INTO test2 SELECT generate_series(1,10000,1); > SELECT slot_name FROM pg_replication_slot_advance('test', pg_current_wal_lsn()); > > - re-create the test slot but based on the old isolation_slot. > SELECT pg_drop_replication_slot('test'); > SELECT 'copy' FROM pg_copy_logical_replication_slot('isolation_slot', 'test'); > > Then the restart_lsn of 'test' slot will go backwards. Yeah, that's right. BTW, I think it's worth to add those "corner cases" in the TAP tests related to the sync slot feature (the more coverage the better). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Jan 11, 2024 at 7:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > +static bool > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) > > { > > ... > > + /* Slot ready for sync, so sync it. */ > > + else > > + { > > + /* > > + * Sanity check: With hot_standby_feedback enabled and > > + * invalidations handled appropriately as above, this should never > > + * happen. > > + */ > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) > > + elog(ERROR, > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > + " to remote slot's LSN(%X/%X) as synchronization" > > + " would move it backwards", remote_slot->name, > > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > > ... > > } > > > > I was thinking about the above code in the patch and as far as I can > > think this can only occur if the same name slot is re-created with > > prior restart_lsn after the existing slot is dropped. Normally, the > > newly created slot (with the same name) will have higher restart_lsn > > but one can mimic it by copying some older slot by using > > pg_copy_logical_replication_slot(). > > > > I don't think as mentioned in comments even if hot_standby_feedback is > > temporarily set to off, the above shouldn't happen. It can only lead > > to invalidated slots on standby. > > > > To close the above race, I could think of the following ways: > > 1. Drop and re-create the slot. > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > > ahead of local_slot's LSN then we can update it; but as mentioned in > > your previous comment, we need to update all other fields as well. If > > we follow this then we probably need to have a check for catalog_xmin > > as well. > > > > The second point as mentioned is slightly misleading, so let me try to > rephrase it once again: Emit LOG/WARNING in this case and once > remote_slot's LSN moves ahead of local_slot's LSN then we can update > it; additionally, we need to update all other fields like two_phase as > well. If we follow this then we probably need to have a check for > catalog_xmin as well along remote_slot's restart_lsn. > > > Now, related to this the other case which needs some handling is what > > if the remote_slot's restart_lsn is greater than local_slot's > > restart_lsn but it is a re-created slot with the same name. In that > > case, I think the other properties like 'two_phase', 'plugin' could be > > different. So, is simply copying those sufficient or do we need to do > > something else as well? > > > > Bertrand, Dilip, Sawada-San, and others, please share your opinion on > this problem as I think it is important to handle this race condition. Is there any good use case of copying a failover slot in the first place? If it's not a normal use case and we can probably live without it, why not always disable failover during the copy? FYI we always disable two_phase on copied slots. It seems to me that copying a failover slot could lead to problems, as long as we synchronize slots based on their names. IIUC without the copy, this pass should never happen. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Jan 12, 2024 at 5:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Jan 11, 2024 at 7:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > +static bool > > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) > > > { > > > ... > > > + /* Slot ready for sync, so sync it. */ > > > + else > > > + { > > > + /* > > > + * Sanity check: With hot_standby_feedback enabled and > > > + * invalidations handled appropriately as above, this should never > > > + * happen. > > > + */ > > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) > > > + elog(ERROR, > > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > > + " to remote slot's LSN(%X/%X) as synchronization" > > > + " would move it backwards", remote_slot->name, > > > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > > > ... > > > } > > > > > > I was thinking about the above code in the patch and as far as I can > > > think this can only occur if the same name slot is re-created with > > > prior restart_lsn after the existing slot is dropped. Normally, the > > > newly created slot (with the same name) will have higher restart_lsn > > > but one can mimic it by copying some older slot by using > > > pg_copy_logical_replication_slot(). > > > > > > I don't think as mentioned in comments even if hot_standby_feedback is > > > temporarily set to off, the above shouldn't happen. It can only lead > > > to invalidated slots on standby. > > > > > > To close the above race, I could think of the following ways: > > > 1. Drop and re-create the slot. > > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > > > ahead of local_slot's LSN then we can update it; but as mentioned in > > > your previous comment, we need to update all other fields as well. If > > > we follow this then we probably need to have a check for catalog_xmin > > > as well. > > > > > > > The second point as mentioned is slightly misleading, so let me try to > > rephrase it once again: Emit LOG/WARNING in this case and once > > remote_slot's LSN moves ahead of local_slot's LSN then we can update > > it; additionally, we need to update all other fields like two_phase as > > well. If we follow this then we probably need to have a check for > > catalog_xmin as well along remote_slot's restart_lsn. > > > > > Now, related to this the other case which needs some handling is what > > > if the remote_slot's restart_lsn is greater than local_slot's > > > restart_lsn but it is a re-created slot with the same name. In that > > > case, I think the other properties like 'two_phase', 'plugin' could be > > > different. So, is simply copying those sufficient or do we need to do > > > something else as well? > > > > > > > Bertrand, Dilip, Sawada-San, and others, please share your opinion on > > this problem as I think it is important to handle this race condition. > > Is there any good use case of copying a failover slot in the first > place? If it's not a normal use case and we can probably live without > it, why not always disable failover during the copy? FYI we always > disable two_phase on copied slots. It seems to me that copying a > failover slot could lead to problems, as long as we synchronize slots > based on their names. IIUC without the copy, this pass should never > happen. 
> > Regards, > > -- > Masahiko Sawada > Amazon Web Services: https://aws.amazon.com

There are multiple approaches discussed and tried when it comes to starting a slot-sync worker. I am summarizing all of them here:

1) Make the slotsync worker an auxiliary process (like checkpointer, walwriter, walreceiver etc). The benefit this approach provides is that start and stop can be controlled in a more flexible way, as each auxiliary process can have different checks before starting and different stop conditions. But it needs code duplication for process management (start, stop, crash handling, signals etc) and currently it does not support a db-connection smoothly (none of the auxiliary processes has one so far). We attempted to make the slot-sync worker an auxiliary process and faced some challenges. The slot-sync worker needs a db-connection and thus needs InitPostgres(). But AuxiliaryProcessMain() and InitPostgres() are not compatible, as both invoke common functions and end up setting many callback functions twice (with different args). Also, InitPostgres() does 'MyBackendId' initialization (which further triggers some stuff) which is not needed for an auxiliary process, and so on. Thus, in order to make the slot-sync worker an auxiliary process, we would need something similar to InitPostgres() (a trimmed-down working version), which needs further detailed analysis.

2) Make the slotsync worker a 'special' process like AutoVacLauncher, which is neither an auxiliary process nor a bgworker. It allows a db-connection and also provides the flexibility to have start and stop conditions for the process. But it needs a lot of code duplication around start, stop, fork (windows, non-windows), crash management and so on. It also needs to do a lot of process initialization by itself (which is otherwise done internally by the aux and bgworker infra). And I am not sure if we should be adding a new process as a 'special' one when postgres already provides the bgworker and auxiliary process infrastructure.

3) Make the slotsync worker a bgworker. Here we just need to register our process as a bgworker (RegisterBackgroundWorker()) by providing a relevant start_time and restart_time, and then the process management is well taken care of. It does not need any code duplication and allows a db-connection smoothly in the registered process. The only thing it lacks is the flexibility of having a start condition, which then forces us to have 'enable_syncslot' as a PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I feel enable_syncslot is something that will not be changed frequently, and with the benefits provided by the bgworker infra, this seems a reasonably good approach to choose.

4) Another option is to have the Logical Replication Launcher (or a new process) launch the slot-sync worker. But going by the current design where we have only 1 slotsync worker, maintaining an additional manager process may be an overhead. Especially if we go with the 'Logical Replication Launcher', some extra changes will be needed there: it will need a start_time change from BgWorkerStart_RecoveryFinished to BgWorkerStart_ConsistentState (doable, but wanted to mention the change). And given that the 'Logical Replication Launcher' does not have a db-connection currently, if in future the slotsync validation checks need to execute some SQL query, it cannot do that easily; it would either need the launcher to get a db-connection or need new commands to be implemented for the same.
Thus weighing pros and cons of all these options, we have currently implemented the bgworker approach (approach 3). Any feedback is welcome. Thanks Shveta
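For readers not familiar with the bgworker route, approach (3) boils down to a registration call at postmaster startup, roughly like the sketch below. The function name, worker name, and the enable_syncslot check are assumptions based on this thread, not the patch's code; the patch additionally introduces a dedicated BgWorkerStart_ConsistentState_HotStandby start time, whereas this sketch uses the existing BgWorkerStart_ConsistentState. The PGC_POSTMASTER limitation mentioned above comes from the fact that this decision is made once, before the postmaster forks anything.

#include "postgres.h"

#include "postmaster/bgworker.h"

extern bool enable_syncslot;    /* the patch's GUC */

/*
 * Sketch of approach (3): register the slot-sync worker as a regular
 * background worker while the postmaster is starting up.
 */
void
RegisterSlotSyncWorker(void)
{
    BackgroundWorker worker;

    /*
     * Evaluated once at postmaster start, which is why enable_syncslot
     * ends up being PGC_POSTMASTER with this approach.
     */
    if (!enable_syncslot)
        return;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
        BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_ConsistentState;
    worker.bgw_restart_time = 60;   /* restart interval in seconds */
    snprintf(worker.bgw_library_name, sizeof(worker.bgw_library_name),
             "postgres");
    snprintf(worker.bgw_function_name, sizeof(worker.bgw_function_name),
             "ReplSlotSyncWorkerMain");
    snprintf(worker.bgw_name, sizeof(worker.bgw_name),
             "replication slot sync worker");
    snprintf(worker.bgw_type, sizeof(worker.bgw_type),
             "replication slot sync worker");
    worker.bgw_main_arg = (Datum) 0;
    worker.bgw_notify_pid = 0;

    RegisterBackgroundWorker(&worker);
}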
On Friday, January 12, 2024 8:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: Hi, > > On Thu, Jan 11, 2024 at 7:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jan 9, 2024 at 6:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > +static bool > > > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot > > > +*remote_slot) > > > { > > > ... > > > + /* Slot ready for sync, so sync it. */ else { > > > + /* > > > + * Sanity check: With hot_standby_feedback enabled and > > > + * invalidations handled appropriately as above, this should never > > > + * happen. > > > + */ > > > + if (remote_slot->restart_lsn < slot->data.restart_lsn) elog(ERROR, > > > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > > + " to remote slot's LSN(%X/%X) as synchronization" > > > + " would move it backwards", remote_slot->name, > > > + LSN_FORMAT_ARGS(slot->data.restart_lsn), > > > + LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > > > ... > > > } > > > > > > I was thinking about the above code in the patch and as far as I can > > > think this can only occur if the same name slot is re-created with > > > prior restart_lsn after the existing slot is dropped. Normally, the > > > newly created slot (with the same name) will have higher restart_lsn > > > but one can mimic it by copying some older slot by using > > > pg_copy_logical_replication_slot(). > > > > > > I don't think as mentioned in comments even if hot_standby_feedback > > > is temporarily set to off, the above shouldn't happen. It can only > > > lead to invalidated slots on standby. > > > > > > To close the above race, I could think of the following ways: > > > 1. Drop and re-create the slot. > > > 2. Emit LOG/WARNING in this case and once remote_slot's LSN moves > > > ahead of local_slot's LSN then we can update it; but as mentioned in > > > your previous comment, we need to update all other fields as well. > > > If we follow this then we probably need to have a check for > > > catalog_xmin as well. > > > > > > > The second point as mentioned is slightly misleading, so let me try to > > rephrase it once again: Emit LOG/WARNING in this case and once > > remote_slot's LSN moves ahead of local_slot's LSN then we can update > > it; additionally, we need to update all other fields like two_phase as > > well. If we follow this then we probably need to have a check for > > catalog_xmin as well along remote_slot's restart_lsn. > > > > > Now, related to this the other case which needs some handling is > > > what if the remote_slot's restart_lsn is greater than local_slot's > > > restart_lsn but it is a re-created slot with the same name. In that > > > case, I think the other properties like 'two_phase', 'plugin' could > > > be different. So, is simply copying those sufficient or do we need > > > to do something else as well? > > > > > > > Bertrand, Dilip, Sawada-San, and others, please share your opinion on > > this problem as I think it is important to handle this race condition. > > Is there any good use case of copying a failover slot in the first place? If it's not > a normal use case and we can probably live without it, why not always disable > failover during the copy? FYI we always disable two_phase on copied slots. It > seems to me that copying a failover slot could lead to problems, as long as we > synchronize slots based on their names. IIUC without the copy, this pass should > never happen. Thanks for the suggestion. I also don't have a use case for this. Attach the V61 patch set that addresses this suggestion. 
And here is the summary of the changes made in each patch. V61-0001 1. Reverts the changes in copy_replication_slot. V61-0002 1. Adds the documents for the steps that user needs to follow to ensure the standby is ready for failover 2. Directly update the fields restart_lsn/confirmed_flush/catalog_xmin instead of using APIs like LogicalConfirmReceivedLocation 3. Updates all the fields(two_phase, failover, plugin) when syncing the slots 4. fixes CFbot failures. 5. Some code style adjustment. (pending comments in last version) 6. Remove some unnecessary Assert and variable assignment (off-list comments from Peter) Thanks Shveta for working on 4 and 5. V61-0003 1. Some documents update related to standby_slot_names and the steps for failover. V61-0004 - No change. Best Regards, Hou zj
Attachment
On Fri, Jan 12, 2024 at 12:07 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Fri, Jan 12, 2024 at 08:42:39AM +0530, Amit Kapila wrote: > > On Thu, Jan 11, 2024 at 9:11 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > I'm not sure to follow here. If the remote slot is re-created then it would > > > be also dropped / re-created locally, or am I missing something? > > > > > > > As our slot-syncing mechanism is asynchronous (from time to time we > > check the slot information on primary), isn't it possible that the > > same name slot is dropped and recreated between slot-sync worker's > > checks? > > > > Yeah, I should have thought harder ;-) So for this case, let's imagine that If we > had an easy way to detect that a remote slot has been drop/re-created then I think > we would also drop and re-create it on the standby too. > > If so, I think we should then update all the fields (that we're currently updating > in the "create locally" case) when we detect that (at least) one of the following differs: > > - dboid > - plugin > - two_phase > Right, I think even if any of restart/confirmed LSN's or xmin has changed then also there is no harm in simply copying all the fields from remote_slot as done by Hou-San in latest patch. > Maybe the "best" approach would be to have a way to detect that a slot has been > re-created on the primary (but that would mean rely on more than the slot name > to "identify" a slot and probably add a new member to the struct to do so). > Right, I also thought so but not sure further complicating the slot machinery is worth detecting this case explicitly. If we see any problem with the idea discussed then we may need to think something along those lines. -- With Regards, Amit Kapila.
On Fri, Jan 12, 2024 at 5:50 PM shveta malik <shveta.malik@gmail.com> wrote: > > There are multiple approaches discussed and tried when it comes to > starting a slot-sync worker. I am summarizing all here: > > 1) Make slotsync worker as an Auxiliary Process (like checkpointer, > walwriter, walreceiver etc). The benefit this approach provides is, it > can control begin and stop in a more flexible way as each auxiliary > process could have different checks before starting and can have > different stop conditions. But it needs code duplication for process > management(start, stop, crash handling, signals etc) and currently it > does not support db-connection smoothly (none of the auxiliary process > has one so far) > As slotsync worker needs to perform transactions and access syscache, we can't make it an auxiliary process as that doesn't initialize the required stuff like syscache. Also, see the comment "Auxiliary processes don't run transactions ..." in AuxiliaryProcessMain() which means this is not an option. > > 2) Make slotsync worker as a 'special' process like AutoVacLauncher > which is neither an Auxiliary process nor a bgworker one. It allows > db-connection and also provides flexibility to have start and stop > conditions for a process. > Yeah, due to these reasons, I think this option is worth considering and another plus point is that this allows us to make enable_syncslot a PGC_SIGHUP GUC rather than a PGC_POSTMASTER. > > 3) Make slotysnc worker a bgworker. Here we just need to register our > process as a bgworker (RegisterBackgroundWorker()) by providing a > relevant start_time and restart_time and then the process management > is well taken care of. It does not need any code-duplication and > allows db-connection smoothly in registered process. The only thing it > lacks is that it does not provide flexibility of having > start-condition which then makes us to have 'enable_syncslot' as > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I > feel enable_syncslot is something which will not be changed frequently > and with the benefits provided by bgworker infra, it seems a > reasonably good option to choose this approach. > I agree but it may be better to make it a PGC_SIGHUP parameter. > 4) Another option is to have Logical Replication Launcher(or a new > process) to launch slot-sync worker. But going by the current design > where we have only 1 slotsync worker, it may be an overhead to have an > additional manager process maintained. > I don't see any good reason to have an additional launcher process here. > > Thus weighing pros and cons of all these options, we have currently > implemented the bgworker approach (approach 3). Any feedback is > welcome. > I vote to go for (2) unless we face difficulties in doing so but (3) is also okay especially if others also think so. -- With Regards, Amit Kapila.
Hi, On Sat, Jan 13, 2024 at 10:05:52AM +0530, Amit Kapila wrote: > On Fri, Jan 12, 2024 at 12:07 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > Maybe the "best" approach would be to have a way to detect that a slot has been > > re-created on the primary (but that would mean rely on more than the slot name > > to "identify" a slot and probably add a new member to the struct to do so). > > > > Right, I also thought so but not sure further complicating the slot > machinery is worth detecting this case explicitly. If we see any > problem with the idea discussed then we may need to think something > along those lines. Yeah, let's see. On one hand that would require extra work, but on the other hand it would also probably simplify (and make less bug-prone in the mid-to-long term?) other parts of the code. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Jan 15, 2024 at 2:54 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Sat, Jan 13, 2024 at 10:05:52AM +0530, Amit Kapila wrote: > > On Fri, Jan 12, 2024 at 12:07 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > Maybe the "best" approach would be to have a way to detect that a slot has been > > > re-created on the primary (but that would mean rely on more than the slot name > > > to "identify" a slot and probably add a new member to the struct to do so). > > > > > > > Right, I also thought so but not sure further complicating the slot > > machinery is worth detecting this case explicitly. If we see any > > problem with the idea discussed then we may need to think something > > along those lines. > > Yeah, let's see. On one side that would require extra work but on the other side > that would also probably simplify (and less bug prone in the mid-long term?) > other parts of the code. > After following Sawada-San's suggestion to not copy the 'failover' option there doesn't seem to be much special handling, so there is probably less to simplify. -- With Regards, Amit Kapila.
Here are some review comments for patch v61-0002 ====== doc/src/sgml/logical-replication.sgml 1. + <sect2 id="logical-replication-failover-examples"> + <title>Examples: logical replication failover</title> The current documentation structure (after the patch is applied) looks like this: 30.1. Publication 30.2. Subscription 30.2.1. Replication Slot Management 30.2.2. Examples: Set Up Logical Replication 30.2.3. Examples: Deferred Replication Slot Creation 30.2.4. Examples: logical replication failover I don't think it is ideal. Firstly, I think this new section is not just "Examples:"; it is more like instructions for steps to check if a successful failover is possible. IMO call it something like "Logical Replication Failover" or "Replication Slot Failover". Secondly, I don't think this new section strictly belongs underneath the "Subscription" section anymore because IMO it is just as much about the promotion of the publications. Now that you are adding this new (2nd) section about slots, I think the whole structure of this document should be changed like below: SUGGESTION #1 (make a new section 30.3 just for slot-related topics) 30.1. Publication 30.2. Subscription 30.2.1. Examples: Set Up Logical Replication 30.3. Logical Replication Slots 30.3.1. Replication Slot Management 30.3.2. Examples: Deferred Replication Slot Creation 30.3.3. Logical Replication Failover ~ SUGGESTION #2 (keep the existing structure, but give the failover its own new section 30.3) 30.1. Publication 30.2. Subscription 30.2.1. Replication Slot Management 30.2.2. Examples: Set Up Logical Replication 30.2.3. Examples: Deferred Replication Slot Creation 30.3 Logical Replication Failover ~ SUGGESTION #2a (and maybe later you can extract some of the failover examples further) 30.1. Publication 30.2. Subscription 30.2.1. Replication Slot Management 30.2.2. Examples: Set Up Logical Replication 30.2.3. Examples: Deferred Replication Slot Creation 30.3 Logical Replication Failover 30.3.1. Examples: Checking if failover ready ~~~ 2. + <para> + In a logical replication setup, if the publisher server is also the primary + server of the streaming replication, the logical slots on the primary server + can be synchronized to the standby server by specifying <literal>failover = true</literal> + when creating the subscription. Enabling failover ensures a seamless + transition of the subscription to the promoted standby, allowing it to + subscribe to the new primary server without any data loss. + </para> I was initially confused by the wording. How about like below: SUGGESTION When the publisher server is the primary server of a streaming replication, the logical slots on that primary server can be synchronized to the standby server by specifying <literal>failover = true</literal> when creating subscriptions for those publications. Enabling failover ensures a seamless transition of those subscriptions after the standby is promoted. They can continue subscribing to publications now on the new primary server without any data loss. ~~~ 3. + <para> + However, the replication slots are copied asynchronously, which means it's necessary + to confirm that replication slots have been synced to the standby server + before the failover happens. Additionally, to ensure a successful failover, + the standby server must not lag behind the subscriber. 
To confirm + that the standby server is ready for failover, follow these steps: + </para> Minor rewording SUGGESTION Because the slot synchronization logic copies asynchronously, it is necessary to confirm that replication slots have been synced to the standby server before the failover happens. Furthermore, to ensure a successful failover, the standby server must not be lagging behind the subscriber. To confirm that the standby server is indeed ready for failover, follow these 2 steps: ~~~ 4. The instructions said "follow these steps", so the next parts should be rendered as 2 "steps" (using <procedure> markup?) SUGGESTION (show as steps 1,2 and also some minor rewording of the step heading) 1. Confirm that all the necessary logical replication slots have been synced to the standby server. 2. Confirm that the standby server is not lagging behind the subscribers. ~~~ 5. + <para> + Check if all the necessary logical replication slots have been synced to + the standby server. + </para> SUGGESTION Confirm that all the necessary logical replication slots have been synced to the standby server. ~~~ 6. + <listitem> + <para> + On logical subscriber, fetch the slot names that should be synced to the + standby that we plan to promote. SUGGESTION Firstly, on the subscriber node, use the following SQL to identify the slot names that should be... ~~~ 7. +<programlisting> +test_sub=# SELECT + array_agg(slotname) AS slots + FROM + (( + SELECT r.srsubid AS subid, CONCAT('pg_' || srsubid || '_sync_' || srrelid || '_' || ctl.system_identifier) AS slotname + FROM pg_control_system() ctl, pg_subscription_rel r, pg_subscription s + WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND s.subfailover + ) UNION ( + SELECT oid AS subid, subslotname as slotname + FROM pg_subscription + WHERE subfailover + )); 7a Maybe this ought to include "pg_catalog" schemas? ~ 7b. For consistency, maybe it is better to use a table alias "FROM pg_subscription s" in the UNION also ~~~ 8. + <listitem> + <para> + Check that the logical replication slots exist on the standby server. SUGGESTION Next, check that the logical replication slots identified above exist on the standby server. ~~~ 9. +<programlisting> +test_standby=# SELECT bool_and(synced AND NOT temporary AND conflict_reason IS NULL) AS failover_ready + FROM pg_replication_slots + WHERE slot_name in ('slots'); + failover_ready +---------------- + t 9a. Maybe this ought to include "pg_catalog" schemas? ~ 9b. IIUC that 'slots' reference is supposed to be those names that were found in the prior step. If so, then that point needs to be made clear, and anyway in this case 'slots' is not compatible with the 'sub' name returned by your first SQL. ~~~ 10. + <listitem> + <para> + Query the last replayed WAL on the logical subscriber. SUGGESTION Firstly, on the subscriber node check the last replayed WAL. ~~~ 11. +<programlisting> +test_sub=# SELECT + MAX(remote_lsn) AS remote_lsn_on_subscriber + FROM + (( + SELECT (CASE WHEN r.srsubstate = 'f' THEN pg_replication_origin_progress(CONCAT('pg_' || r.srsubid || '_' || r.srrelid), false) + WHEN r.srsubstate = 's' THEN r.srsublsn END) as remote_lsn + FROM pg_subscription_rel r, pg_subscription s + WHERE r.srsubstate IN ('f', 's') AND s.oid = r.srsubid AND s.subfailover + ) UNION ( + SELECT pg_replication_origin_progress(CONCAT('pg_' || s.oid), false) AS remote_lsn + FROM pg_subscription s + WHERE subfailover + )); 11a. Maybe this ought to include "pg_catalog" schemas? ~ 11b. /WHERE subfailover/WHERE s.subfailover/ ~~~ 12. 
+ <listitem> + <para> + On the standby server, check that the last-received WAL location + is ahead of the replayed WAL location on the subscriber. SUGGESTION Next, on the standby server check that the last-received WAL location is ahead of the replayed WAL location on the subscriber identified above. ~~~ 13. +</programlisting></para> + </listitem> + <listitem> + <para> + On the standby server, check that the last-received WAL location + is ahead of the replayed WAL location on the subscriber. +<programlisting> +test_standby=# SELECT pg_last_wal_receive_lsn() >= 'remote_lsn_on_subscriber'::pg_lsn AS failover_ready; + failover_ready +---------------- + t IIUC the 'remote_lsn_on_subscriber' is supposed to represent the substitution of the value found in the subscriber server. In this example maybe it would be: SELECT pg_last_wal_receive_lsn() >= '0/3000388'::pg_lsn AS failover_ready; maybe that point can be made more clearly. ~~~ 14. + <para> + If the result (failover_ready) of both above steps is true, it means it is + okay to subscribe to the standby server. + </para> 14a. failover_ready should be rendered as literal. ~ 14b. Does this say what you intended, or did you mean something more like "the standby can be promoted and existing subscriptions will be able to continue without data loss" ====== src/backend/replication/logical/slotsync.c 15. local_slot_update +/* + * Try to update local slot metadata based on the data from the remote slot. + * + * Return false if the data of the remote slot is the same as the local slot. + * Otherwise, return true. + */ There's not really any "try to" here; it either does it if needed or doesn't do it because it's not needed. SUGGESTION If necessary, update local slot metadata based on the data from the remote slot. If no update was needed (the data of the remote slot is the same as the local slot) return false, otherwise true. ~~~ 16. + bool updated_xmin; + bool updated_restart; + Oid dbid; + ReplicationSlot *slot = MyReplicationSlot; + + Assert(slot->data.invalidated == RS_INVAL_NONE); + + updated_xmin = (remote_slot->catalog_xmin != slot->data.catalog_xmin); + updated_restart = (remote_slot->restart_lsn != slot->data.restart_lsn); + dbid = get_database_oid(remote_slot->database, false); + + if (namestrcmp(&slot->data.plugin, remote_slot->plugin) == 0 && + slot->data.database == dbid && !updated_restart && !updated_xmin && + remote_slot->two_phase == slot->data.two_phase && + remote_slot->failover == slot->data.failover && + remote_slot->confirmed_lsn == slot->data.confirmed_flush) + return false; It seems a bit strange to have boolean flags for some of the differences (updated_xmin, updated_restart) but not for the others. I expected it should be for all (e.g. updated_twophase, updated_failover, ...) or none of them. ~~~ 17. synchronize_one_slot + slot_updated = local_slot_update(remote_slot); + + /* Make sure the slot changes persist across server restart */ + if (slot_updated) + { + ReplicationSlotMarkDirty(); + ReplicationSlotSave(); + } IMO this code would be simpler if written like below because then 'slot_updated' is only ever assigned when true instead of maybe overwriting the default again with false: SUGGESTION /* Make sure the slot changes persist across server restart */ if (local_slot_update(remote_slot)) { slot_updated = true; ReplicationSlotMarkDirty(); ReplicationSlotSave(); } ====== src/backend/replication/slot.c 18. 
ReplicationSlotPersist - TEMPORARY v EPHEMERAL I noticed this ReplicationSlotPersist() from v59-0002 was reverted: - * Convert a slot that's marked as RS_EPHEMERAL to a RS_PERSISTENT slot, - * guaranteeing it will be there after an eventual crash. + * Convert a slot that's marked as RS_EPHEMERAL or RS_TEMPORARY to a + * RS_PERSISTENT slot, guaranteeing it will be there after an eventual crash. AFAIK in v61 you are still calling this function with RS_TEMPORARY which is now contrary to the current function comment if you don't change it to also mention RS_TEMPORARY. ====== Kind Regards, Peter Smith. Fujitsu Australia
On Sat, Jan 13, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 12, 2024 at 5:50 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > There are multiple approaches discussed and tried when it comes to > > starting a slot-sync worker. I am summarizing all here: > > > > 1) Make slotsync worker as an Auxiliary Process (like checkpointer, > > walwriter, walreceiver etc). The benefit this approach provides is, it > > can control begin and stop in a more flexible way as each auxiliary > > process could have different checks before starting and can have > > different stop conditions. But it needs code duplication for process > > management(start, stop, crash handling, signals etc) and currently it > > does not support db-connection smoothly (none of the auxiliary process > > has one so far) > > > > As slotsync worker needs to perform transactions and access syscache, > we can't make it an auxiliary process as that doesn't initialize the > required stuff like syscache. Also, see the comment "Auxiliary > processes don't run transactions ..." in AuxiliaryProcessMain() which > means this is not an option. > > > > > 2) Make slotsync worker as a 'special' process like AutoVacLauncher > > which is neither an Auxiliary process nor a bgworker one. It allows > > db-connection and also provides flexibility to have start and stop > > conditions for a process. > > > > Yeah, due to these reasons, I think this option is worth considering > and another plus point is that this allows us to make enable_syncslot > a PGC_SIGHUP GUC rather than a PGC_POSTMASTER. > > > > > 3) Make slotysnc worker a bgworker. Here we just need to register our > > process as a bgworker (RegisterBackgroundWorker()) by providing a > > relevant start_time and restart_time and then the process management > > is well taken care of. It does not need any code-duplication and > > allows db-connection smoothly in registered process. The only thing it > > lacks is that it does not provide flexibility of having > > start-condition which then makes us to have 'enable_syncslot' as > > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I > > feel enable_syncslot is something which will not be changed frequently > > and with the benefits provided by bgworker infra, it seems a > > reasonably good option to choose this approach. > > > > I agree but it may be better to make it a PGC_SIGHUP parameter. > > > 4) Another option is to have Logical Replication Launcher(or a new > > process) to launch slot-sync worker. But going by the current design > > where we have only 1 slotsync worker, it may be an overhead to have an > > additional manager process maintained. > > > > I don't see any good reason to have an additional launcher process here. > > > > > Thus weighing pros and cons of all these options, we have currently > > implemented the bgworker approach (approach 3). Any feedback is > > welcome. > > > > I vote to go for (2) unless we face difficulties in doing so but (3) > is also okay especially if others also think so. I am not against any of the approaches but I still feel that when we have a standard way of doing things (bgworker) we should not keep adding code to do things in a special way unless there is a strong reason to do so. Now we need to decide if 'enable_syncslot' being PGC_POSTMASTER is a strong reason to go the non-standard way? 
If yes, then we should think of option 2; otherwise, option 3 seems better in my understanding (which may be limited by my short experience here), so I am all ears to what others think on this. thanks Shveta
On Tue, Jan 16, 2024 at 9:03 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Sat, Jan 13, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Jan 12, 2024 at 5:50 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > There are multiple approaches discussed and tried when it comes to > > > starting a slot-sync worker. I am summarizing all here: > > > > > > 1) Make slotsync worker as an Auxiliary Process (like checkpointer, > > > walwriter, walreceiver etc). The benefit this approach provides is, it > > > can control begin and stop in a more flexible way as each auxiliary > > > process could have different checks before starting and can have > > > different stop conditions. But it needs code duplication for process > > > management(start, stop, crash handling, signals etc) and currently it > > > does not support db-connection smoothly (none of the auxiliary process > > > has one so far) > > > > > > > As slotsync worker needs to perform transactions and access syscache, > > we can't make it an auxiliary process as that doesn't initialize the > > required stuff like syscache. Also, see the comment "Auxiliary > > processes don't run transactions ..." in AuxiliaryProcessMain() which > > means this is not an option. > > > > > > > > 2) Make slotsync worker as a 'special' process like AutoVacLauncher > > > which is neither an Auxiliary process nor a bgworker one. It allows > > > db-connection and also provides flexibility to have start and stop > > > conditions for a process. > > > > > > > Yeah, due to these reasons, I think this option is worth considering > > and another plus point is that this allows us to make enable_syncslot > > a PGC_SIGHUP GUC rather than a PGC_POSTMASTER. > > > > > > > > 3) Make slotysnc worker a bgworker. Here we just need to register our > > > process as a bgworker (RegisterBackgroundWorker()) by providing a > > > relevant start_time and restart_time and then the process management > > > is well taken care of. It does not need any code-duplication and > > > allows db-connection smoothly in registered process. The only thing it > > > lacks is that it does not provide flexibility of having > > > start-condition which then makes us to have 'enable_syncslot' as > > > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I > > > feel enable_syncslot is something which will not be changed frequently > > > and with the benefits provided by bgworker infra, it seems a > > > reasonably good option to choose this approach. > > > > > > > I agree but it may be better to make it a PGC_SIGHUP parameter. > > > > > 4) Another option is to have Logical Replication Launcher(or a new > > > process) to launch slot-sync worker. But going by the current design > > > where we have only 1 slotsync worker, it may be an overhead to have an > > > additional manager process maintained. > > > > > > > I don't see any good reason to have an additional launcher process here. > > > > > > > > Thus weighing pros and cons of all these options, we have currently > > > implemented the bgworker approach (approach 3). Any feedback is > > > welcome. > > > > > > > I vote to go for (2) unless we face difficulties in doing so but (3) > > is also okay especially if others also think so. > > I am not against any of the approaches but I still feel that when we > have a standard way of doing things (bgworker) we should not keep > adding code to do things in a special way unless there is a strong > reason to do so. 
Now we need to decide if 'enable_syncslot' being > PGC_POSTMASTER is a strong reason to go the non-standard way? > Agreed and as said earlier I think it is better to make it a PGC_SIGHUP. Also, not sure we can say it is a non-standard way as the autovacuum launcher is already handled in the same way. One more minor thing is that it will save us from having a new bgworker state BgWorkerStart_ConsistentState_HotStandby as introduced by this patch. > If yes, > then we should think of option 2 else option 3 seems better in my > understanding (which may be limited due to my short experience here), > so I am all ears to what others think on this. > I also think it would be better if more people share their opinion on this matter. -- With Regards, Amit Kapila.
On Tue, Jan 16, 2024 at 9:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 16, 2024 at 9:03 AM shveta malik <shveta.malik@gmail.com> wrote: > > > Agreed and as said earlier I think it is better to make it a > PGC_SIGHUP. Also, not sure we can say it is a non-standard way as > already autovacuum launcher is handled in the same way. One more minor > thing is it will save us for having a new bgworker state > BgWorkerStart_ConsistentState_HotStandby as introduced by this patch. Yeah, it's not a nonstandard way. But bgworker provides a lot of built-in infrastructure which we would otherwise have to maintain ourselves if we opt for option 2. From the simplicity point of view I would have preferred option 3, but I would rather have this be PGC_SIGHUP than have that simplicity. But anyway, if there are issues in doing so then we can keep it PGC_POSTMASTER; it's worth trying this out. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
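For what it's worth, the GUC side of PGC_SIGHUP is the easy part: a guc_tables.c entry along the lines of the sketch below (placement, category, and wording assumed, not taken from the patch) is all that changes there. The real work is deciding what the already-running (or not-yet-running) worker does when the value flips at runtime, which is what the process-model choice above is about.

/* Sketch of a ConfigureNamesBool[] entry in guc_tables.c; only the
 * PGC_SIGHUP context is the point here. */
{
    {"enable_syncslot", PGC_SIGHUP, REPLICATION_STANDBY,
        gettext_noop("Enables a standby to synchronize logical replication slots from the primary."),
        NULL},
    &enable_syncslot,
    false,
    NULL, NULL, NULL
},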
On Tue, Jan 16, 2024 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 16, 2024 at 9:03 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Sat, Jan 13, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Fri, Jan 12, 2024 at 5:50 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > There are multiple approaches discussed and tried when it comes to > > > > starting a slot-sync worker. I am summarizing all here: > > > > > > > > 1) Make slotsync worker as an Auxiliary Process (like checkpointer, > > > > walwriter, walreceiver etc). The benefit this approach provides is, it > > > > can control begin and stop in a more flexible way as each auxiliary > > > > process could have different checks before starting and can have > > > > different stop conditions. But it needs code duplication for process > > > > management(start, stop, crash handling, signals etc) and currently it > > > > does not support db-connection smoothly (none of the auxiliary process > > > > has one so far) > > > > > > > > > > As slotsync worker needs to perform transactions and access syscache, > > > we can't make it an auxiliary process as that doesn't initialize the > > > required stuff like syscache. Also, see the comment "Auxiliary > > > processes don't run transactions ..." in AuxiliaryProcessMain() which > > > means this is not an option. > > > > > > > > > > > 2) Make slotsync worker as a 'special' process like AutoVacLauncher > > > > which is neither an Auxiliary process nor a bgworker one. It allows > > > > db-connection and also provides flexibility to have start and stop > > > > conditions for a process. > > > > > > > > > > Yeah, due to these reasons, I think this option is worth considering > > > and another plus point is that this allows us to make enable_syncslot > > > a PGC_SIGHUP GUC rather than a PGC_POSTMASTER. > > > > > > > > > > > 3) Make slotysnc worker a bgworker. Here we just need to register our > > > > process as a bgworker (RegisterBackgroundWorker()) by providing a > > > > relevant start_time and restart_time and then the process management > > > > is well taken care of. It does not need any code-duplication and > > > > allows db-connection smoothly in registered process. The only thing it > > > > lacks is that it does not provide flexibility of having > > > > start-condition which then makes us to have 'enable_syncslot' as > > > > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I > > > > feel enable_syncslot is something which will not be changed frequently > > > > and with the benefits provided by bgworker infra, it seems a > > > > reasonably good option to choose this approach. > > > > > > > > > > I agree but it may be better to make it a PGC_SIGHUP parameter. > > > > > > > 4) Another option is to have Logical Replication Launcher(or a new > > > > process) to launch slot-sync worker. But going by the current design > > > > where we have only 1 slotsync worker, it may be an overhead to have an > > > > additional manager process maintained. > > > > > > > > > > I don't see any good reason to have an additional launcher process here. > > > > > > > > > > > Thus weighing pros and cons of all these options, we have currently > > > > implemented the bgworker approach (approach 3). Any feedback is > > > > welcome. > > > > > > > > > > I vote to go for (2) unless we face difficulties in doing so but (3) > > > is also okay especially if others also think so. 
> > > > I am not against any of the approaches but I still feel that when we > > have a standard way of doing things (bgworker) we should not keep > > adding code to do things in a special way unless there is a strong > > reason to do so. Now we need to decide if 'enable_syncslot' being > > PGC_POSTMASTER is a strong reason to go the non-standard way? > > > > Agreed and as said earlier I think it is better to make it a > PGC_SIGHUP. Also, not sure we can say it is a non-standard way as > already autovacuum launcher is handled in the same way. One more minor > thing is it will save us for having a new bgworker state > BgWorkerStart_ConsistentState_HotStandby as introduced by this patch. Why do we need to add a new BgWorkerStart_ConsistentState_HotStandby for the slotsync worker? Isn't it sufficient that the slotsync worker exits if not in hot standby mode? Is there any technical difficulty or obstacle to make the slotsync worker start using bgworker after reloading the config file? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
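[To make the alternative being asked about concrete, here is a minimal sketch, not the patch's actual code, of a worker registered with the existing BgWorkerStart_ConsistentState start time that simply exits when the server is not in recovery. The function name is illustrative. As the replies below note, the drawback is that such a worker would still be launched on a primary at every server start only to exit immediately.]

#include "postgres.h"
#include "access/xlog.h"		/* RecoveryInProgress() */
#include "storage/ipc.h"		/* proc_exit() */

void
SlotSyncWorkerMainSketch(Datum main_arg)
{
	(void) main_arg;			/* unused in this sketch */

	/* Not a standby (or already promoted): nothing to synchronize. */
	if (!RecoveryInProgress())
	{
		ereport(LOG,
				(errmsg("slot sync worker exiting because the server is not in recovery")));
		proc_exit(0);
	}

	/* ... the normal slot-synchronization loop would go here ... */
}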
Hi, On Sat, Jan 13, 2024 at 12:53:50PM +0530, Amit Kapila wrote: > On Fri, Jan 12, 2024 at 5:50 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > There are multiple approaches discussed and tried when it comes to > > starting a slot-sync worker. I am summarizing all here: > > > > 1) Make slotsync worker as an Auxiliary Process (like checkpointer, > > walwriter, walreceiver etc). The benefit this approach provides is, it > > can control begin and stop in a more flexible way as each auxiliary > > process could have different checks before starting and can have > > different stop conditions. But it needs code duplication for process > > management(start, stop, crash handling, signals etc) and currently it > > does not support db-connection smoothly (none of the auxiliary process > > has one so far) > > > > As slotsync worker needs to perform transactions and access syscache, > we can't make it an auxiliary process as that doesn't initialize the > required stuff like syscache. Also, see the comment "Auxiliary > processes don't run transactions ..." in AuxiliaryProcessMain() which > means this is not an option. > > > > > 2) Make slotsync worker as a 'special' process like AutoVacLauncher > > which is neither an Auxiliary process nor a bgworker one. It allows > > db-connection and also provides flexibility to have start and stop > > conditions for a process. > > > > Yeah, due to these reasons, I think this option is worth considering > and another plus point is that this allows us to make enable_syncslot > a PGC_SIGHUP GUC rather than a PGC_POSTMASTER. > > > > > 3) Make slotysnc worker a bgworker. Here we just need to register our > > process as a bgworker (RegisterBackgroundWorker()) by providing a > > relevant start_time and restart_time and then the process management > > is well taken care of. It does not need any code-duplication and > > allows db-connection smoothly in registered process. The only thing it > > lacks is that it does not provide flexibility of having > > start-condition which then makes us to have 'enable_syncslot' as > > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I > > feel enable_syncslot is something which will not be changed frequently > > and with the benefits provided by bgworker infra, it seems a > > reasonably good option to choose this approach. > > > > I agree but it may be better to make it a PGC_SIGHUP parameter. > > > 4) Another option is to have Logical Replication Launcher(or a new > > process) to launch slot-sync worker. But going by the current design > > where we have only 1 slotsync worker, it may be an overhead to have an > > additional manager process maintained. > > > > I don't see any good reason to have an additional launcher process here. > > > > > Thus weighing pros and cons of all these options, we have currently > > implemented the bgworker approach (approach 3). Any feedback is > > welcome. > > > > I vote to go for (2) unless we face difficulties in doing so but (3) > is also okay especially if others also think so. > Yeah, I think that (2) would be the "ideal" one but (3) is fine too. I think that if we think/see that (2) is too "complicated"/long to implement maybe we could do (3) initially and switch to (2) later. What I mean by that is that I don't think that not doing (2) should be a blocker. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Jan 16, 2024 at 12:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Jan 16, 2024 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jan 16, 2024 at 9:03 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > On Sat, Jan 13, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Fri, Jan 12, 2024 at 5:50 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > > > There are multiple approaches discussed and tried when it comes to > > > > > starting a slot-sync worker. I am summarizing all here: > > > > > > > > > > 1) Make slotsync worker as an Auxiliary Process (like checkpointer, > > > > > walwriter, walreceiver etc). The benefit this approach provides is, it > > > > > can control begin and stop in a more flexible way as each auxiliary > > > > > process could have different checks before starting and can have > > > > > different stop conditions. But it needs code duplication for process > > > > > management(start, stop, crash handling, signals etc) and currently it > > > > > does not support db-connection smoothly (none of the auxiliary process > > > > > has one so far) > > > > > > > > > > > > > As slotsync worker needs to perform transactions and access syscache, > > > > we can't make it an auxiliary process as that doesn't initialize the > > > > required stuff like syscache. Also, see the comment "Auxiliary > > > > processes don't run transactions ..." in AuxiliaryProcessMain() which > > > > means this is not an option. > > > > > > > > > > > > > > 2) Make slotsync worker as a 'special' process like AutoVacLauncher > > > > > which is neither an Auxiliary process nor a bgworker one. It allows > > > > > db-connection and also provides flexibility to have start and stop > > > > > conditions for a process. > > > > > > > > > > > > > Yeah, due to these reasons, I think this option is worth considering > > > > and another plus point is that this allows us to make enable_syncslot > > > > a PGC_SIGHUP GUC rather than a PGC_POSTMASTER. > > > > > > > > > > > > > > 3) Make slotysnc worker a bgworker. Here we just need to register our > > > > > process as a bgworker (RegisterBackgroundWorker()) by providing a > > > > > relevant start_time and restart_time and then the process management > > > > > is well taken care of. It does not need any code-duplication and > > > > > allows db-connection smoothly in registered process. The only thing it > > > > > lacks is that it does not provide flexibility of having > > > > > start-condition which then makes us to have 'enable_syncslot' as > > > > > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I > > > > > feel enable_syncslot is something which will not be changed frequently > > > > > and with the benefits provided by bgworker infra, it seems a > > > > > reasonably good option to choose this approach. > > > > > > > > > > > > > I agree but it may be better to make it a PGC_SIGHUP parameter. > > > > > > > > > 4) Another option is to have Logical Replication Launcher(or a new > > > > > process) to launch slot-sync worker. But going by the current design > > > > > where we have only 1 slotsync worker, it may be an overhead to have an > > > > > additional manager process maintained. > > > > > > > > > > > > > I don't see any good reason to have an additional launcher process here. > > > > > > > > > > > > > > Thus weighing pros and cons of all these options, we have currently > > > > > implemented the bgworker approach (approach 3). Any feedback is > > > > > welcome. 
> > > > > > > > > > > > > I vote to go for (2) unless we face difficulties in doing so but (3) > > > > is also okay especially if others also think so. > > > > > > I am not against any of the approaches but I still feel that when we > > > have a standard way of doing things (bgworker) we should not keep > > > adding code to do things in a special way unless there is a strong > > > reason to do so. Now we need to decide if 'enable_syncslot' being > > > PGC_POSTMASTER is a strong reason to go the non-standard way? > > > > > > > Agreed and as said earlier I think it is better to make it a > > PGC_SIGHUP. Also, not sure we can say it is a non-standard way as > > already autovacuum launcher is handled in the same way. One more minor > > thing is it will save us for having a new bgworker state > > BgWorkerStart_ConsistentState_HotStandby as introduced by this patch. > > Why do we need to add a new BgWorkerStart_ConsistentState_HotStandby > for the slotsync worker? Isn't it sufficient that the slotsync worker > exits if not in hot standby mode? It is doable, but that will mean starting slot-sync worker even on primary on every server restart which does not seem like a good idea. We wanted to have a way where-in it does not start itself in non-standby mode. > Is there any technical difficulty or obstacle to make the slotsync > worker start using bgworker after reloading the config file? When we register slotsync worker as bgworker, we can only register the bgworker before initializing shared memory, we cannot register dynamically in the cycle of ServerLoop and thus we do not have flexibility of registering/deregistering the bgworker (or controlling the bgworker start) based on config parameters each time they change. We can always start slot-sync worker and let it check if enable_syncslot is ON. If not, exit and retry the next time when postmaster will restart it after restart_time(60sec). The downside of this approach is, even if any user does not want slot-sync functionality and thus has permanently disabled 'enable_syncslot', it will keep on restarting and exiting there. thanks Shveta
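[To illustrate the registration constraint described above: a static background worker is registered once, while the postmaster is starting up and before shared memory is initialized, so its start condition and restart interval are fixed at that point and cannot be re-evaluated on a SIGHUP. A rough sketch follows; it is not the patch's actual registration code, and the 60-second value just mirrors the restart_time mentioned above.]

#include "postgres.h"
#include "postmaster/bgworker.h"

static void
RegisterSlotSyncWorkerSketch(void)
{
	BackgroundWorker bgw;

	memset(&bgw, 0, sizeof(bgw));
	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION;	/* needs a DB connection */
	bgw.bgw_start_time = BgWorkerStart_ConsistentState;	/* fixed at registration time */
	bgw.bgw_restart_time = 60;	/* seconds before the postmaster restarts it if it exits with code 1 */
	snprintf(bgw.bgw_library_name, sizeof(bgw.bgw_library_name), "postgres");
	snprintf(bgw.bgw_function_name, sizeof(bgw.bgw_function_name), "ReplSlotSyncWorkerMain");
	snprintf(bgw.bgw_name, sizeof(bgw.bgw_name), "slot sync worker");
	snprintf(bgw.bgw_type, sizeof(bgw.bgw_type), "slot sync worker");

	/* Must run during postmaster startup; cannot be called later from ServerLoop. */
	RegisterBackgroundWorker(&bgw);
}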
On Tue, Jan 16, 2024 at 3:10 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Jan 16, 2024 at 12:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Jan 16, 2024 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Jan 16, 2024 at 9:03 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > On Sat, Jan 13, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Fri, Jan 12, 2024 at 5:50 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > > > > > There are multiple approaches discussed and tried when it comes to > > > > > > starting a slot-sync worker. I am summarizing all here: > > > > > > > > > > > > 1) Make slotsync worker as an Auxiliary Process (like checkpointer, > > > > > > walwriter, walreceiver etc). The benefit this approach provides is, it > > > > > > can control begin and stop in a more flexible way as each auxiliary > > > > > > process could have different checks before starting and can have > > > > > > different stop conditions. But it needs code duplication for process > > > > > > management(start, stop, crash handling, signals etc) and currently it > > > > > > does not support db-connection smoothly (none of the auxiliary process > > > > > > has one so far) > > > > > > > > > > > > > > > > As slotsync worker needs to perform transactions and access syscache, > > > > > we can't make it an auxiliary process as that doesn't initialize the > > > > > required stuff like syscache. Also, see the comment "Auxiliary > > > > > processes don't run transactions ..." in AuxiliaryProcessMain() which > > > > > means this is not an option. > > > > > > > > > > > > > > > > > 2) Make slotsync worker as a 'special' process like AutoVacLauncher > > > > > > which is neither an Auxiliary process nor a bgworker one. It allows > > > > > > db-connection and also provides flexibility to have start and stop > > > > > > conditions for a process. > > > > > > > > > > > > > > > > Yeah, due to these reasons, I think this option is worth considering > > > > > and another plus point is that this allows us to make enable_syncslot > > > > > a PGC_SIGHUP GUC rather than a PGC_POSTMASTER. > > > > > > > > > > > > > > > > > 3) Make slotysnc worker a bgworker. Here we just need to register our > > > > > > process as a bgworker (RegisterBackgroundWorker()) by providing a > > > > > > relevant start_time and restart_time and then the process management > > > > > > is well taken care of. It does not need any code-duplication and > > > > > > allows db-connection smoothly in registered process. The only thing it > > > > > > lacks is that it does not provide flexibility of having > > > > > > start-condition which then makes us to have 'enable_syncslot' as > > > > > > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I > > > > > > feel enable_syncslot is something which will not be changed frequently > > > > > > and with the benefits provided by bgworker infra, it seems a > > > > > > reasonably good option to choose this approach. > > > > > > > > > > > > > > > > I agree but it may be better to make it a PGC_SIGHUP parameter. > > > > > > > > > > > 4) Another option is to have Logical Replication Launcher(or a new > > > > > > process) to launch slot-sync worker. But going by the current design > > > > > > where we have only 1 slotsync worker, it may be an overhead to have an > > > > > > additional manager process maintained. 
> > > > > > > > > > > > > > > > I don't see any good reason to have an additional launcher process here. > > > > > > > > > > > > > > > > > Thus weighing pros and cons of all these options, we have currently > > > > > > implemented the bgworker approach (approach 3). Any feedback is > > > > > > welcome. > > > > > > > > > > > > > > > > I vote to go for (2) unless we face difficulties in doing so but (3) > > > > > is also okay especially if others also think so. > > > > > > > > I am not against any of the approaches but I still feel that when we > > > > have a standard way of doing things (bgworker) we should not keep > > > > adding code to do things in a special way unless there is a strong > > > > reason to do so. Now we need to decide if 'enable_syncslot' being > > > > PGC_POSTMASTER is a strong reason to go the non-standard way? > > > > > > > > > > Agreed and as said earlier I think it is better to make it a > > > PGC_SIGHUP. Also, not sure we can say it is a non-standard way as > > > already autovacuum launcher is handled in the same way. One more minor > > > thing is it will save us for having a new bgworker state > > > BgWorkerStart_ConsistentState_HotStandby as introduced by this patch. > > > > Why do we need to add a new BgWorkerStart_ConsistentState_HotStandby > > for the slotsync worker? Isn't it sufficient that the slotsync worker > > exits if not in hot standby mode? > > It is doable, but that will mean starting slot-sync worker even on > primary on every server restart which does not seem like a good idea. > We wanted to have a way where-in it does not start itself in > non-standby mode. > > > Is there any technical difficulty or obstacle to make the slotsync > > worker start using bgworker after reloading the config file? > > When we register slotsync worker as bgworker, we can only register the > bgworker before initializing shared memory, we cannot register > dynamically in the cycle of ServerLoop and thus we do not have > flexibility of registering/deregistering the bgworker (or controlling > the bgworker start) based on config parameters each time they change. > We can always start slot-sync worker and let it check if > enable_syncslot is ON. If not, exit and retry the next time when > postmaster will restart it after restart_time(60sec). The downside of > this approach is, even if any user does not want slot-sync > functionality and thus has permanently disabled 'enable_syncslot', it > will keep on restarting and exiting there. PFA v62. Details: v62-001: No change. v62-002: 1) Addressed slotsync.c related comments by Peter in [1]. 2) Addressed CFBot failure where there was a crash in 32 bit env while accessing DatumGetLSN 3) Addressed another CFBot failure where the test for '050_standby_failover_slots_sync.pl' was hanging. Thanks Hou-San for this fix. v62-003: It is a new patch which attempts to implement slot-sync worker as a special process which is neither a bgworker nor an Auxiliary process. Here we get the benefit of converting enable_syncslot to a PGC_SIGHUP Guc rather than PGC_POSTMASTER. We launch the slot-sync worker only if it is hot-standby and 'enable_syncslot' is ON. v62-004: Small change in document. v62-005: No change v62-006: Separated the failover-ready validation steps into this separate doc-patch (which were earlier present in v61-002 and v61-003). Also addressed some of the doc comments by Peter in [1]. Thanks Hou-San for providing this patch. 
[1]: https://www.postgresql.org/message-id/CAHut%2BPteZVNx1jQ6Hs3mEdoC%3DDNALVpJJ2mZDYim7sU-04tiaw%40mail.gmail.com thanks Shveta
Attachment
- v62-0004-Allow-logical-walsenders-to-wait-for-the-physica.patch
- v62-0005-Non-replication-connection-and-app_name-change.patch
- v62-0002-Add-logical-slot-sync-capability-to-the-physical.patch
- v62-0001-Enable-setting-failover-property-for-a-slot-thro.patch
- v62-0003-Slot-sync-worker-as-a-special-process.patch
- v62-0006-Document-the-steps-to-check-if-the-standby-is-re.patch
On Tuesday, January 16, 2024 9:27 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for patch v61-0002 Thanks for the comments. > > ====== > doc/src/sgml/logical-replication.sgml > > 1. > + <sect2 id="logical-replication-failover-examples"> > + <title>Examples: logical replication failover</title> > > The current documentation structure (after the patch is applied) looks > like this: > > 30.1. Publication > 30.2. Subscription > 30.2.1. Replication Slot Management > 30.2.2. Examples: Set Up Logical Replication > 30.2.3. Examples: Deferred Replication Slot Creation > 30.2.4. Examples: logical replication failover > > I don't think it is ideal. > > Firstly, I think this new section is not just "Examples:"; it is more > like instructions for steps to check if a successful failover is > possible. IMO call it something like "Logical Replication Failover" or > "Replication Slot Failover". > > Secondly, I don't think this new section strictly belongs underneath > the "Subscription" section anymore because IMO it is just as much > about the promotion of the publications. Now that you are adding this > new (2nd) section about slots, I think the whole structure of this > document should be changed like below: > > SUGGESTION #1 (make a new section 30.3 just for slot-related topics) > > 30.1. Publication > 30.2. Subscription > 30.2.1. Examples: Set Up Logical Replication > 30.3. Logical Replication Slots > 30.3.1. Replication Slot Management > 30.3.2. Examples: Deferred Replication Slot Creation > 30.3.3. Logical Replication Failover > > ~ > > SUGGESTION #2 (keep the existing structure, but give the failover its > own new section 30.3) > > 30.1. Publication > 30.2. Subscription > 30.2.1. Replication Slot Management > 30.2.2. Examples: Set Up Logical Replication > 30.2.3. Examples: Deferred Replication Slot Creation > 30.3 Logical Replication Failover I used this version for now as I am sure about changing other section. > > ~ > > SUGGESTION #2a (and maybe later you can extract some of the failover > examples further) > > 30.1. Publication > 30.2. Subscription > 30.2.1. Replication Slot Management > 30.2.2. Examples: Set Up Logical Replication > 30.2.3. Examples: Deferred Replication Slot Creation > 30.3 Logical Replication Failover > 30.3.1. Examples: Checking if failover ready > > ~~~ > > 2. > + <para> > + In a logical replication setup, if the publisher server is also the primary > + server of the streaming replication, the logical slots on the > primary server > + can be synchronized to the standby server by specifying > <literal>failover = true</literal> > + when creating the subscription. Enabling failover ensures a seamless > + transition of the subscription to the promoted standby, allowing it to > + subscribe to the new primary server without any data loss. > + </para> > > I was initially confused by the wording. How about like below: > > SUGGESTION > When the publisher server is the primary server of a streaming > replication, the logical slots on that primary server can be > synchronized to the standby server by specifying <literal>failover = > true</literal> when creating subscriptions for those publications. > Enabling failover ensures a seamless transition of those subscriptions > after the standby is promoted. They can continue subscribing to > publications now on the new primary server without any data loss. Changed as suggested. > > ~~~ > > 3. 
> + <para> > + However, the replication slots are copied asynchronously, which > means it's necessary > + to confirm that replication slots have been synced to the standby server > + before the failover happens. Additionally, to ensure a successful failover, > + the standby server must not lag behind the subscriber. To confirm > + that the standby server is ready for failover, follow these steps: > + </para> > > Minor rewording > > SUGGESTION > Because the slot synchronization logic copies asynchronously, it is > necessary to confirm that replication slots have been synced to the > standby server before the failover happens. Furthermore, to ensure a > successful failover, the standby server must not be lagging behind the > subscriber. To confirm that the standby server is indeed ready for > failover, follow these 2 steps: Changed as suggested. > > ~~~ > > 4. > The instructions said "follow these steps", so the next parts should > be rendered as 2 "steps" (using <procedure> markup?) > > SUGGESTION (show as steps 1,2 and also some minor rewording of the > step heading) > > 1. Confirm that all the necessary logical replication slots have been > synced to the standby server. > 2. Confirm that the standby server is not lagging behind the subscribers. > Changed as suggested. > ~~~ > > 5. > + <para> > + Check if all the necessary logical replication slots have been synced to > + the standby server. > + </para> > > SUGGESTION > Confirm that all the necessary logical replication slots have been > synced to the standby server. > Changed as suggested. > ~~~ > > 6. > + <listitem> > + <para> > + On logical subscriber, fetch the slot names that should be synced to > the > + standby that we plan to promote. > > SUGGESTION > Firstly, on the subscriber node, use the following SQL to identify the > slot names that should be... > Changed as suggested. > ~~~ > > 7. > +<programlisting> > +test_sub=# SELECT > + array_agg(slotname) AS slots > + FROM > + (( > + SELECT r.srsubid AS subid, CONCAT('pg_' || srsubid || > '_sync_' || srrelid || '_' || ctl.system_identifier) AS slotname > + FROM pg_control_system() ctl, pg_subscription_rel r, > pg_subscription s > + WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND > s.subfailover > + ) UNION ( > + SELECT oid AS subid, subslotname as slotname > + FROM pg_subscription > + WHERE subfailover > + )); > > 7a > Maybe this ought to include "pg_catalog" schemas? After searching other query examples, I think most of them don’t add this for either function or system table. So, I didn’t add this. > > ~ > > 7b. > For consistency, maybe it is better to use a table alias "FROM > pg_subscription s" in the UNION also Added. > > ~~~ > > 8. > + <listitem> > + <para> > + Check that the logical replication slots exist on the standby server. > > SUGGESTION > Next, check that the logical replication slots identified above exist > on the standby server. Changed as suggested. > > ~~~ > > 9. > +<programlisting> > +test_standby=# SELECT bool_and(synced AND NOT temporary AND > conflict_reason IS NULL) AS failover_ready > + FROM pg_replication_slots > + WHERE slot_name in ('slots'); > + failover_ready > +---------------- > + t > > 9a. > Maybe this ought to include "pg_catalog" schemas? Same as above. > > ~ > > 9b. > IIUC that 'slots' reference is supposed to be those names that were > found in the prior step. If so, then that point needs to be made > clear, and anyway in this case 'slots' is not compatible with the > 'sub' name returned by your first SQL. Changed as suggested. > > ~~~ > > 10. 
> + <listitem> > + <para> > + Query the last replayed WAL on the logical subscriber. > > SUGGESTION > Firstly, on the subscriber node check the last replayed WAL. > Changed as suggested. > ~~~ > > 11. > +<programlisting> > +test_sub=# SELECT > + MAX(remote_lsn) AS remote_lsn_on_subscriber > + FROM > + (( > + SELECT (CASE WHEN r.srsubstate = 'f' THEN > pg_replication_origin_progress(CONCAT('pg_' || r.srsubid || '_' || > r.srrelid), false) > + WHEN r.srsubstate = 's' THEN r.srsublsn > END) as remote_lsn > + FROM pg_subscription_rel r, pg_subscription s > + WHERE r.srsubstate IN ('f', 's') AND s.oid = r.srsubid > AND s.subfailover > + ) UNION ( > + SELECT pg_replication_origin_progress(CONCAT('pg_' || > s.oid), false) AS remote_lsn > + FROM pg_subscription s > + WHERE subfailover > + )); > > 11a. > Maybe this ought to include "pg_catalog" schemas? Same as above. > > ~ > > 11b. > /WHERE subfailover/WHERE s.subfailover/ > > ~~~ > > 12. > + <listitem> > + <para> > + On the standby server, check that the last-received WAL location > + is ahead of the replayed WAL location on the subscriber. > > SUGGESTION > Next, on the standby server check that the last-received WAL location > is ahead of the replayed WAL location on the subscriber identified > above. > Changed as suggested. > ~~~ > > 13. > +</programlisting></para> > + </listitem> > + <listitem> > + <para> > + On the standby server, check that the last-received WAL location > + is ahead of the replayed WAL location on the subscriber. > +<programlisting> > +test_standby=# SELECT pg_last_wal_receive_lsn() >= > 'remote_lsn_on_subscriber'::pg_lsn AS failover_ready; > + failover_ready > +---------------- > + t > > IIUC the 'remote_lsn_on_subscriber' is supposed to represent the > substitution of the value found in the subscriber server. In this > example maybe it would be: > SELECT pg_last_wal_receive_lsn() >= '0/3000388'::pg_lsn AS failover_ready; > > maybe that point can be made more clearly. I have changed it to use the actual LSN got in last step. > > ~~~ > > 14. > + <para> > + If the result (failover_ready) of both above steps is true, it means it is > + okay to subscribe to the standby server. > + </para> > > 14a. > failover_ready should be rendered as literal. Added. > > ~ > > 14b. > Does this say what you intended, or did you mean something more like > "the standby can be promoted and existing subscriptions will be able > to continue without data loss" I used the later part of your suggestion as I think promotion depends not only on logical replication part. Best Regards, Hou zj
On Sat, Jan 13, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 12, 2024 at 5:50 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > There are multiple approaches discussed and tried when it comes to > > starting a slot-sync worker. I am summarizing all here: > > > > 1) Make slotsync worker as an Auxiliary Process (like checkpointer, > > walwriter, walreceiver etc). The benefit this approach provides is, it > > can control begin and stop in a more flexible way as each auxiliary > > process could have different checks before starting and can have > > different stop conditions. But it needs code duplication for process > > management(start, stop, crash handling, signals etc) and currently it > > does not support db-connection smoothly (none of the auxiliary process > > has one so far) > > > > As slotsync worker needs to perform transactions and access syscache, > we can't make it an auxiliary process as that doesn't initialize the > required stuff like syscache. Also, see the comment "Auxiliary > processes don't run transactions ..." in AuxiliaryProcessMain() which > means this is not an option. > > > > > 2) Make slotsync worker as a 'special' process like AutoVacLauncher > > which is neither an Auxiliary process nor a bgworker one. It allows > > db-connection and also provides flexibility to have start and stop > > conditions for a process. > > > > Yeah, due to these reasons, I think this option is worth considering > and another plus point is that this allows us to make enable_syncslot > a PGC_SIGHUP GUC rather than a PGC_POSTMASTER. > > > > > 3) Make slotysnc worker a bgworker. Here we just need to register our > > process as a bgworker (RegisterBackgroundWorker()) by providing a > > relevant start_time and restart_time and then the process management > > is well taken care of. It does not need any code-duplication and > > allows db-connection smoothly in registered process. The only thing it > > lacks is that it does not provide flexibility of having > > start-condition which then makes us to have 'enable_syncslot' as > > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I > > feel enable_syncslot is something which will not be changed frequently > > and with the benefits provided by bgworker infra, it seems a > > reasonably good option to choose this approach. > > > > I agree but it may be better to make it a PGC_SIGHUP parameter. > > > 4) Another option is to have Logical Replication Launcher(or a new > > process) to launch slot-sync worker. But going by the current design > > where we have only 1 slotsync worker, it may be an overhead to have an > > additional manager process maintained. > > > > I don't see any good reason to have an additional launcher process here. > > > > > Thus weighing pros and cons of all these options, we have currently > > implemented the bgworker approach (approach 3). Any feedback is > > welcome. > > > > I vote to go for (2) unless we face difficulties in doing so but (3) > is also okay especially if others also think so. Okay. Attempted approach 2 as a separate patch in v62-0003. Approach 3 (bgworker) is still maintained in v62-002. thanks Shveta
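[For contrast with the static bgworker registration, the "special process" approach lets the postmaster re-evaluate the launch condition in its main loop, the way it does for the autovacuum launcher, which is what makes a PGC_SIGHUP enable_syncslot workable. A rough sketch of the kind of check that would live in postmaster.c; SlotSyncWorkerPID and StartSlotSyncWorker() are illustrative names, not the actual code of v62-0003.]

/* In postmaster.c (sketch only): */
static pid_t SlotSyncWorkerPID = 0;

static void
MaybeStartSlotSyncWorker(void)
{
	/* Re-checked from ServerLoop, so a reloaded enable_syncslot takes effect. */
	if (SlotSyncWorkerPID == 0 &&
		pmState == PM_HOT_STANDBY &&
		enable_syncslot)
		SlotSyncWorkerPID = StartSlotSyncWorker();	/* fork_process() wrapper, like StartAutoVacLauncher() */
}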
About v62-0001: As stated in the patch comment: But note that this commit does not yet include the capability to actually sync the replication slot; the next patch will address that. ~~~ Because of this, I think it might be prudent to separate the documentation portion from this patch so that it can be pushed later when the actual synchronize capability also gets pushed. It would not be good for the PG documentation on HEAD to be describing behaviour that does not yet exist. (e.g. if patch 0001 is pushed early, but then there is some delay or problems getting the subsequent patches committed). ====== Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Jan 16, 2024 at 6:40 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Jan 16, 2024 at 12:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Jan 16, 2024 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Jan 16, 2024 at 9:03 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > On Sat, Jan 13, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Fri, Jan 12, 2024 at 5:50 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > > > > > There are multiple approaches discussed and tried when it comes to > > > > > > starting a slot-sync worker. I am summarizing all here: > > > > > > > > > > > > 1) Make slotsync worker as an Auxiliary Process (like checkpointer, > > > > > > walwriter, walreceiver etc). The benefit this approach provides is, it > > > > > > can control begin and stop in a more flexible way as each auxiliary > > > > > > process could have different checks before starting and can have > > > > > > different stop conditions. But it needs code duplication for process > > > > > > management(start, stop, crash handling, signals etc) and currently it > > > > > > does not support db-connection smoothly (none of the auxiliary process > > > > > > has one so far) > > > > > > > > > > > > > > > > As slotsync worker needs to perform transactions and access syscache, > > > > > we can't make it an auxiliary process as that doesn't initialize the > > > > > required stuff like syscache. Also, see the comment "Auxiliary > > > > > processes don't run transactions ..." in AuxiliaryProcessMain() which > > > > > means this is not an option. > > > > > > > > > > > > > > > > > 2) Make slotsync worker as a 'special' process like AutoVacLauncher > > > > > > which is neither an Auxiliary process nor a bgworker one. It allows > > > > > > db-connection and also provides flexibility to have start and stop > > > > > > conditions for a process. > > > > > > > > > > > > > > > > Yeah, due to these reasons, I think this option is worth considering > > > > > and another plus point is that this allows us to make enable_syncslot > > > > > a PGC_SIGHUP GUC rather than a PGC_POSTMASTER. > > > > > > > > > > > > > > > > > 3) Make slotysnc worker a bgworker. Here we just need to register our > > > > > > process as a bgworker (RegisterBackgroundWorker()) by providing a > > > > > > relevant start_time and restart_time and then the process management > > > > > > is well taken care of. It does not need any code-duplication and > > > > > > allows db-connection smoothly in registered process. The only thing it > > > > > > lacks is that it does not provide flexibility of having > > > > > > start-condition which then makes us to have 'enable_syncslot' as > > > > > > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I > > > > > > feel enable_syncslot is something which will not be changed frequently > > > > > > and with the benefits provided by bgworker infra, it seems a > > > > > > reasonably good option to choose this approach. > > > > > > > > > > > > > > > > I agree but it may be better to make it a PGC_SIGHUP parameter. > > > > > > > > > > > 4) Another option is to have Logical Replication Launcher(or a new > > > > > > process) to launch slot-sync worker. But going by the current design > > > > > > where we have only 1 slotsync worker, it may be an overhead to have an > > > > > > additional manager process maintained. 
> > > > > > > > > > > > > > > > I don't see any good reason to have an additional launcher process here. > > > > > > > > > > > > > > > > > Thus weighing pros and cons of all these options, we have currently > > > > > > implemented the bgworker approach (approach 3). Any feedback is > > > > > > welcome. > > > > > > > > > > > > > > > > I vote to go for (2) unless we face difficulties in doing so but (3) > > > > > is also okay especially if others also think so. > > > > > > > > I am not against any of the approaches but I still feel that when we > > > > have a standard way of doing things (bgworker) we should not keep > > > > adding code to do things in a special way unless there is a strong > > > > reason to do so. Now we need to decide if 'enable_syncslot' being > > > > PGC_POSTMASTER is a strong reason to go the non-standard way? > > > > > > > > > > Agreed and as said earlier I think it is better to make it a > > > PGC_SIGHUP. Also, not sure we can say it is a non-standard way as > > > already autovacuum launcher is handled in the same way. One more minor > > > thing is it will save us for having a new bgworker state > > > BgWorkerStart_ConsistentState_HotStandby as introduced by this patch. > > > > Why do we need to add a new BgWorkerStart_ConsistentState_HotStandby > > for the slotsync worker? Isn't it sufficient that the slotsync worker > > exits if not in hot standby mode? > > It is doable, but that will mean starting slot-sync worker even on > primary on every server restart which does not seem like a good idea. > We wanted to have a way where-in it does not start itself in > non-standby mode. Understood. Another idea would be that the startup process dynamically registers the slotsync worker if hot_standby is enabled. But it doesn't seem like the right approach. > > > Is there any technical difficulty or obstacle to make the slotsync > > worker start using bgworker after reloading the config file? > > When we register slotsync worker as bgworker, we can only register the > bgworker before initializing shared memory, we cannot register > dynamically in the cycle of ServerLoop and thus we do not have > flexibility of registering/deregistering the bgworker (or controlling > the bgworker start) based on config parameters each time they change. > We can always start slot-sync worker and let it check if > enable_syncslot is ON. If not, exit and retry the next time when > postmaster will restart it after restart_time(60sec). The downside of > this approach is, even if any user does not want slot-sync > functionality and thus has permanently disabled 'enable_syncslot', it > will keep on restarting and exiting there. Thanks for the explanation. It sounds like it's not impossible but would require some work. If allowing bgworkers to start also on SIGUP is a general improvement, we can implement it later while having enable_syncslot PGC_POSTMASTER at this time. Then, we will be able to make the enable_syncslot PGC_SIGUP later. BTW I think I found a race condition in the v61 patch to cause that the slotsync worker continues working even after promotion (I've not tested with v62 patch though). At the time when the startup shutdown the slotsync worker in FinishWalRecovery(), the postmaster's pmState is still PM_HOT_STANDBY. And if the slotsync worker is not running when the startup process attempts to shutdown it, ShutDownSlotSync() does nothing. 
Therefore, if the startup process does not actually shut down the slotsync worker (because it was not running at that point), the postmaster could relaunch the worker before the state transitions to PM_RUN. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Here is a review comment for the latest v62-0002 changes. ====== src/backend/replication/logical/slotsync.c 1. + if (namestrcmp(&slot->data.plugin, remote_slot->plugin) == 0 && + slot->data.database == dbid && + remote_slot->restart_lsn == slot->data.restart_lsn && + remote_slot->catalog_xmin == slot->data.catalog_xmin && + remote_slot->two_phase == slot->data.two_phase && + remote_slot->failover == slot->data.failover && + remote_slot->confirmed_lsn == slot->data.confirmed_flush) + return false; For consistency, I think it would be better to always code the remote slot value on the LHS and the local slot value on the RHS, instead of the current random mix. And rename 'dbid' to 'remote_dbid' for name consistency too. SUGGESTION if (namestrcmp(remote_slot->plugin, &slot->data.plugin) == 0 && remote_dbid == slot->data.database && remote_slot->restart_lsn == slot->data.restart_lsn && remote_slot->catalog_xmin == slot->data.catalog_xmin && remote_slot->two_phase == slot->data.two_phase && remote_slot->failover == slot->data.failover && remote_slot->confirmed_lsn == slot->data.confirmed_flush) return false; ====== Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Jan 16, 2024 at 10:57 PM shveta malik <shveta.malik@gmail.com> wrote: > ... > v62-006: > Separated the failover-ready validation steps into this separate > doc-patch (which were earlier present in v61-002 and v61-003). Also > addressed some of the doc comments by Peter in [1]. > Thanks Hou-San for providing this patch. > > [1]: https://www.postgresql.org/message-id/CAHut%2BPteZVNx1jQ6Hs3mEdoC%3DDNALVpJJ2mZDYim7sU-04tiaw%40mail.gmail.com > Thanks for addressing my previous review in the new patch 0006. I checked it again and below are a few more comments ====== 1. GENERAL I was wondering if some other documentation (like somewhere from chapter 27, or maybe the pgctl promote docs?) should be referring back to this new information about how to decide if the standby is ready for promotion. ====== doc/src/sgml/logical-replication.sgml 2. + + <para> + Because the slot synchronization logic copies asynchronously, it is + necessary to confirm that replication slots have been synced to the standby + server before the failover happens. Furthermore, to ensure a successful + failover, the standby server must not be lagging behind the subscriber. It + is highly recommended to use <varname>standby_slot_names</varname> to + prevent the subscriber from consuming changes faster than the hot standby. + To confirm that the standby server is indeed ready for failover, follow + these 2 steps: + </para> For easier navigation, perhaps that standby_slot_names should include a link back to where the standby_slot_names GUC is described. ~~~ 3. + <substeps> + <step performance="required"> + <para> + Firstly, on the subscriber node, use the following SQL to identify the + slot names that should be synced to the standby that we plan to promote. Minor change to wording. SUGGESTION Firstly, on the subscriber node, use the following SQL to identify which slots should be synced to the standby that we plan to promote. ~~~ 4. +<programlisting> +test_sub=# SELECT + array_agg(slotname) AS slots + FROM + (( + SELECT r.srsubid AS subid, CONCAT('pg_' || srsubid || '_sync_' || srrelid || '_' || ctl.system_identifier) AS slotname + FROM pg_control_system() ctl, pg_subscription_rel r, pg_subscription s + WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND s.subfailover + ) UNION ( + SELECT s.oid AS subid, s.subslotname as slotname + FROM pg_subscription s + WHERE s.subfailover + )); + slots +------- + {sub} +(1 row) +</programlisting></para> + </step> I think the example might be better if the result shows > 1 slot. e.g. {sub1,sub2,sub3} This would also make the next step 1.b. more clear. ~~~ 5. +</programlisting></para> + </step> + <step performance="required"> + <para> + Next, check that the logical replication slots identified above exist on + the standby server. This step can be skipped if + <varname>standby_slot_names</varname> has been correctly configured. +<programlisting> +test_standby=# SELECT bool_and(synced AND NOT temporary AND conflict_reason IS NULL) AS failover_ready + FROM pg_replication_slots + WHERE slot_name in ('sub'); + failover_ready +---------------- + t +(1 row) +</programlisting></para> 5a. (uppercase SQL keyword) /in/IN/ ~ 5b. I felt this might be easier to understand if the SQL gives a two-column result instead of one all-of-nothing T/F where you might no be sure which slot was the one giving a problem. e.g. failover_ready | slot --------------------- t | sub1 t | sub2 f | sub3 ... ~~~ 6. + <para> + Firstly, on the subscriber node check the last replayed WAL. 
If the + query result is NULL, it indicates that the subscriber has not yet + replayed any WAL. Therefore, the next step can be skipped, as the + standby server must be ahead of the subscriber. IMO all of that part "If the query result is NULL" does not really belong here because it describes skipping the *next* step. So, it would be better to say this in the next step. Something like: SUGGESTION (for step 2b) Next, on the standby server check that the last-received WAL location is ahead of the replayed WAL location on the subscriber identified above. If the above SQL result was NULL, it means the subscriber has not yet replayed any WAL, so the standby server must be ahead of the subscriber, and this step can be skipped. ~~~ 7. +<programlisting> +test_sub=# SELECT + MAX(remote_lsn) AS remote_lsn_on_subscriber + FROM + (( + SELECT (CASE WHEN r.srsubstate = 'f' THEN pg_replication_origin_progress(CONCAT('pg_' || r.srsubid || '_' || r.srrelid), false) + WHEN r.srsubstate IN ('s', 'r') THEN r.srsublsn END) as remote_lsn + FROM pg_subscription_rel r, pg_subscription s + WHERE r.srsubstate IN ('f', 's', 'r') AND s.oid = r.srsubid AND s.subfailover + ) UNION ( + SELECT pg_replication_origin_progress(CONCAT('pg_' || s.oid), false) AS remote_lsn + FROM pg_subscription s + WHERE subfailover + )); + remote_lsn_on_subscriber +-------------------------- + 0/3000388 +</programlisting></para> 7a. (uppercase SQL keyword) /as/AS/ ~ 7b. missing table alias /WHERE subfailover/WHERE s.subfailover/ ~~~ 8. + </step> + <step performance="required"> + <para> + Next, on the standby server check that the last-received WAL location + is ahead of the replayed WAL location on the subscriber identified above. See the review comment above (#6) which suggested adding some more info here. ====== Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Jan 17, 2024 at 6:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Jan 16, 2024 at 6:40 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Tue, Jan 16, 2024 at 12:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Jan 16, 2024 at 1:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Tue, Jan 16, 2024 at 9:03 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > > > On Sat, Jan 13, 2024 at 12:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > On Fri, Jan 12, 2024 at 5:50 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > > > > > > > There are multiple approaches discussed and tried when it comes to > > > > > > > starting a slot-sync worker. I am summarizing all here: > > > > > > > > > > > > > > 1) Make slotsync worker as an Auxiliary Process (like checkpointer, > > > > > > > walwriter, walreceiver etc). The benefit this approach provides is, it > > > > > > > can control begin and stop in a more flexible way as each auxiliary > > > > > > > process could have different checks before starting and can have > > > > > > > different stop conditions. But it needs code duplication for process > > > > > > > management(start, stop, crash handling, signals etc) and currently it > > > > > > > does not support db-connection smoothly (none of the auxiliary process > > > > > > > has one so far) > > > > > > > > > > > > > > > > > > > As slotsync worker needs to perform transactions and access syscache, > > > > > > we can't make it an auxiliary process as that doesn't initialize the > > > > > > required stuff like syscache. Also, see the comment "Auxiliary > > > > > > processes don't run transactions ..." in AuxiliaryProcessMain() which > > > > > > means this is not an option. > > > > > > > > > > > > > > > > > > > > 2) Make slotsync worker as a 'special' process like AutoVacLauncher > > > > > > > which is neither an Auxiliary process nor a bgworker one. It allows > > > > > > > db-connection and also provides flexibility to have start and stop > > > > > > > conditions for a process. > > > > > > > > > > > > > > > > > > > Yeah, due to these reasons, I think this option is worth considering > > > > > > and another plus point is that this allows us to make enable_syncslot > > > > > > a PGC_SIGHUP GUC rather than a PGC_POSTMASTER. > > > > > > > > > > > > > > > > > > > > 3) Make slotysnc worker a bgworker. Here we just need to register our > > > > > > > process as a bgworker (RegisterBackgroundWorker()) by providing a > > > > > > > relevant start_time and restart_time and then the process management > > > > > > > is well taken care of. It does not need any code-duplication and > > > > > > > allows db-connection smoothly in registered process. The only thing it > > > > > > > lacks is that it does not provide flexibility of having > > > > > > > start-condition which then makes us to have 'enable_syncslot' as > > > > > > > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I > > > > > > > feel enable_syncslot is something which will not be changed frequently > > > > > > > and with the benefits provided by bgworker infra, it seems a > > > > > > > reasonably good option to choose this approach. > > > > > > > > > > > > > > > > > > > I agree but it may be better to make it a PGC_SIGHUP parameter. > > > > > > > > > > > > > 4) Another option is to have Logical Replication Launcher(or a new > > > > > > > process) to launch slot-sync worker. 
But going by the current design > > > > > > > where we have only 1 slotsync worker, it may be an overhead to have an > > > > > > > additional manager process maintained. > > > > > > > > > > > > > > > > > > > I don't see any good reason to have an additional launcher process here. > > > > > > > > > > > > > > > > > > > > Thus weighing pros and cons of all these options, we have currently > > > > > > > implemented the bgworker approach (approach 3). Any feedback is > > > > > > > welcome. > > > > > > > > > > > > > > > > > > > I vote to go for (2) unless we face difficulties in doing so but (3) > > > > > > is also okay especially if others also think so. > > > > > > > > > > I am not against any of the approaches but I still feel that when we > > > > > have a standard way of doing things (bgworker) we should not keep > > > > > adding code to do things in a special way unless there is a strong > > > > > reason to do so. Now we need to decide if 'enable_syncslot' being > > > > > PGC_POSTMASTER is a strong reason to go the non-standard way? > > > > > > > > > > > > > Agreed and as said earlier I think it is better to make it a > > > > PGC_SIGHUP. Also, not sure we can say it is a non-standard way as > > > > already autovacuum launcher is handled in the same way. One more minor > > > > thing is it will save us for having a new bgworker state > > > > BgWorkerStart_ConsistentState_HotStandby as introduced by this patch. > > > > > > Why do we need to add a new BgWorkerStart_ConsistentState_HotStandby > > > for the slotsync worker? Isn't it sufficient that the slotsync worker > > > exits if not in hot standby mode? > > > > It is doable, but that will mean starting slot-sync worker even on > > primary on every server restart which does not seem like a good idea. > > We wanted to have a way where-in it does not start itself in > > non-standby mode. > > Understood. > > Another idea would be that the startup process dynamically registers > the slotsync worker if hot_standby is enabled. But it doesn't seem > like the right approach. > > > > > > Is there any technical difficulty or obstacle to make the slotsync > > > worker start using bgworker after reloading the config file? > > > > When we register slotsync worker as bgworker, we can only register the > > bgworker before initializing shared memory, we cannot register > > dynamically in the cycle of ServerLoop and thus we do not have > > flexibility of registering/deregistering the bgworker (or controlling > > the bgworker start) based on config parameters each time they change. > > We can always start slot-sync worker and let it check if > > enable_syncslot is ON. If not, exit and retry the next time when > > postmaster will restart it after restart_time(60sec). The downside of > > this approach is, even if any user does not want slot-sync > > functionality and thus has permanently disabled 'enable_syncslot', it > > will keep on restarting and exiting there. > > Thanks for the explanation. It sounds like it's not impossible but > would require some work. If allowing bgworkers to start also on SIGUP > is a general improvement, we can implement it later while having > enable_syncslot PGC_POSTMASTER at this time. Then, we will be able to > make the enable_syncslot PGC_SIGUP later. > > BTW I think I found a race condition in the v61 patch to cause that > the slotsync worker continues working even after promotion (I've not > tested with v62 patch though). 
At the time when the startup shutdown > the slotsync worker in FinishWalRecovery(), the postmaster's pmState > is still PM_HOT_STANDBY. And if the slotsync worker is not running > when the startup process attempts to shutdown it, ShutDownSlotSync() > does nothing. Therefore, after the startup process doesn't shutdown > the slotsync worker, the postmaster could relaunch the slotsync worker > before its state transition to PM_RUN. Yes, this race condition exists. We have attempted to fix it in v62-003 by introducing a 'stopSignaled' bool in the slot-sync worker's shared memory. The startup process sets it to true before shutting down the slotsync worker, and if the postmaster meanwhile ends up restarting the worker, ReplSlotSyncWorkerMain() will exit on seeing 'stopSignaled' set. This is along the lines of WalReceiver, where the postmaster starts it, the startup process shuts it down, and the similar race condition is handled by the state machinery in place (see WALRCV_STOPPING and WALRCV_STOPPED in ShutdownWalRcv() and WalReceiverMain()). Having said that, I feel that in the slotsync worker case: --I need to pull the race-condition fix around 'stopSignaled' into patch002 instead. --Also, in ShutDownSlotSync(), I need to move setting 'stopSignaled' to true before we exit on finding that 'SlotSyncWorker->pid' is InvalidPid. This also takes care of the scenario where no slot-sync worker was present when the startup process tried to shut it down; we need the flag set in that corner case too. thanks Shveta
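[A compact sketch of the ordering being described, illustrative only and not the v62/v63 code: the startup process sets stopSignaled under the spinlock before it checks the PID, and a concurrently launched worker checks the same flag before doing any work. It assumes the patch's SlotSyncWorker shared-memory struct with mutex, pid and stopSignaled fields; the signal used and the waiting logic are assumptions.]

/* Startup process side, called at the end of recovery (sketch). */
void
ShutDownSlotSyncSketch(void)
{
	SpinLockAcquire(&SlotSyncWorker->mutex);

	/* Set this first so that a worker launched concurrently will see it. */
	SlotSyncWorker->stopSignaled = true;

	if (SlotSyncWorker->pid == InvalidPid)
	{
		/* No worker running, but the flag above still blocks a relaunch. */
		SpinLockRelease(&SlotSyncWorker->mutex);
		return;
	}

	kill(SlotSyncWorker->pid, SIGINT);	/* the actual signal choice is an assumption */
	SpinLockRelease(&SlotSyncWorker->mutex);

	/* ... wait here until the worker clears its PID ... */
}

/* Worker side, early in ReplSlotSyncWorkerMain() (sketch). */
void
SlotSyncWorkerStartupCheckSketch(void)
{
	SpinLockAcquire(&SlotSyncWorker->mutex);
	Assert(SlotSyncWorker->pid == InvalidPid);
	if (SlotSyncWorker->stopSignaled)
	{
		/* Promotion in progress; do not start syncing slots. */
		SpinLockRelease(&SlotSyncWorker->mutex);
		proc_exit(0);
	}
	SlotSyncWorker->pid = MyProcPid;
	SpinLockRelease(&SlotSyncWorker->mutex);
}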
A review on v62-006: failover-ready validation steps doc - + Next, check that the logical replication slots identified above exist on + the standby server. This step can be skipped if + <varname>standby_slot_names</varname> has been correctly configured. +<programlisting> +test_standby=# SELECT bool_and(synced AND NOT temporary AND conflict_reason IS NULL) AS failover_ready + FROM pg_replication_slots + WHERE slot_name in ('sub'); + failover_ready +---------------- + t +(1 row) This query does not ensure that all the logical replication slots exist on standby. Due to the 'IN ('slots')' check, it will return 'true' even if only one or a few slots exist.
Hi, On Tue, Jan 16, 2024 at 05:27:05PM +0530, shveta malik wrote: > PFA v62. Details: Thanks! > v62-003: > It is a new patch which attempts to implement slot-sync worker as a > special process which is neither a bgworker nor an Auxiliary process. > Here we get the benefit of converting enable_syncslot to a PGC_SIGHUP > Guc rather than PGC_POSTMASTER. We launch the slot-sync worker only if > it is hot-standby and 'enable_syncslot' is ON. The implementation looks reasonable to me (from what I can see some parts is copy/paste from an already existing "special" process and some parts are "sync slot" specific) which makes fully sense. A few remarks: 1 === + * Was it the slot sycn worker? Typo: sycn 2 === + * ones), and no walwriter, autovac launcher or bgwriter or slot sync Instead? "* ones), and no walwriter, autovac launcher, bgwriter or slot sync" 3 === + * restarting slot slyc worker. If stopSignaled is set, the worker will Typo: slyc 4 === +/* Flag to tell if we are in an slot sync worker process */ s/an/a/ ? 5 === (coming from v62-0002) + Assert(tuplestore_tuple_count(res->tuplestore) == 1); Is it even possible for the related query to not return only one row? (I think the "count" ensures it). 6 === if (conninfo_changed || primary_slotname_changed || + old_enable_syncslot != enable_syncslot || (old_hot_standby_feedback != hot_standby_feedback)) { ereport(LOG, errmsg("slot sync worker will restart because of" " a parameter change")); I don't think "slot sync worker will restart" is true if one change enable_syncslot from on to off. IMHO, v62-003 is in good shape and could be merged in v62-002 (that would ease the review). But let's wait to see if others think differently. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Jan 17, 2024 at 3:08 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Tue, Jan 16, 2024 at 05:27:05PM +0530, shveta malik wrote: > > PFA v62. Details: > > Thanks! > > > v62-003: > > It is a new patch which attempts to implement slot-sync worker as a > > special process which is neither a bgworker nor an Auxiliary process. > > Here we get the benefit of converting enable_syncslot to a PGC_SIGHUP > > Guc rather than PGC_POSTMASTER. We launch the slot-sync worker only if > > it is hot-standby and 'enable_syncslot' is ON. > > The implementation looks reasonable to me (from what I can see some parts is > copy/paste from an already existing "special" process and some parts are > "sync slot" specific) which makes fully sense. > > A few remarks: > > 1 === > + * Was it the slot sycn worker? > > Typo: sycn > > 2 === > + * ones), and no walwriter, autovac launcher or bgwriter or slot sync > > Instead? "* ones), and no walwriter, autovac launcher, bgwriter or slot sync" > > 3 === > + * restarting slot slyc worker. If stopSignaled is set, the worker will > > Typo: slyc > > 4 === > +/* Flag to tell if we are in an slot sync worker process */ > > s/an/a/ ? > > 5 === (coming from v62-0002) > + Assert(tuplestore_tuple_count(res->tuplestore) == 1); > > Is it even possible for the related query to not return only one row? (I think the > "count" ensures it). > > 6 === > if (conninfo_changed || > primary_slotname_changed || > + old_enable_syncslot != enable_syncslot || > (old_hot_standby_feedback != hot_standby_feedback)) > { > ereport(LOG, > errmsg("slot sync worker will restart because of" > " a parameter change")); > > I don't think "slot sync worker will restart" is true if one change enable_syncslot > from on to off. > > IMHO, v62-003 is in good shape and could be merged in v62-002 (that would ease > the review). But let's wait to see if others think differently. > > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com PFA v63. --It addresses comments by Peter given in [1], [2], comment by Nisha given in [3], comments by Bertrand given in [4] --It also moves race-condition fix from patch003 to patch002 as suggested by Swada-san offlist. Race-condition is mentioned in [5] All the changes are in patch02, patch003 and patch006. [1]: https://www.postgresql.org/message-id/CAHut%2BPuECB8fNBfXMdTHSMKF9kL%3D0XqPw1Am4NVahfJSSHzoYg%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAHut%2BPt0uum%2B6Hg5UDofWMEJWhVEyArM1b0_B94UJmRcQmz7DA%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CABdArM73qdHyA0nteDLAQrfKNHRP%2B5Qq6p8uobg5bkE3EWiC%2Bg%40mail.gmail.com [4]: https://www.postgresql.org/message-id/ZaegJe9JpUiQeV%2BD%40ip-10-97-1-34.eu-west-3.compute.internal [5]: https://www.postgresql.org/message-id/CAD21AoA5izeKpp9Ei4Cd745pKX3wn-TRvhhmPFEW9UY1nx%2B_aw%40mail.gmail.com thanks Shveta
Attachment
- v63-0004-Allow-logical-walsenders-to-wait-for-the-physica.patch
- v63-0003-Slot-sync-worker-as-a-special-process.patch
- v63-0001-Enable-setting-failover-property-for-a-slot-thro.patch
- v63-0005-Non-replication-connection-and-app_name-change.patch
- v63-0002-Add-logical-slot-sync-capability-to-the-physical.patch
- v63-0006-Document-the-steps-to-check-if-the-standby-is-re.patch
On Wed, Jan 17, 2024 at 3:08 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Tue, Jan 16, 2024 at 05:27:05PM +0530, shveta malik wrote: > > PFA v62. Details: > > Thanks! > > > v62-003: > > It is a new patch which attempts to implement slot-sync worker as a > > special process which is neither a bgworker nor an Auxiliary process. > > Here we get the benefit of converting enable_syncslot to a PGC_SIGHUP > > Guc rather than PGC_POSTMASTER. We launch the slot-sync worker only if > > it is hot-standby and 'enable_syncslot' is ON. > > The implementation looks reasonable to me (from what I can see some parts is > copy/paste from an already existing "special" process and some parts are > "sync slot" specific) which makes fully sense. Thanks for the feedback. I have addressed the comments in v63 except 5th one. > A few remarks: > > 1 === > + * Was it the slot sycn worker? > > Typo: sycn > > 2 === > + * ones), and no walwriter, autovac launcher or bgwriter or slot sync > > Instead? "* ones), and no walwriter, autovac launcher, bgwriter or slot sync" > > 3 === > + * restarting slot slyc worker. If stopSignaled is set, the worker will > > Typo: slyc > > 4 === > +/* Flag to tell if we are in an slot sync worker process */ > > s/an/a/ ? > > 5 === (coming from v62-0002) > + Assert(tuplestore_tuple_count(res->tuplestore) == 1); > > Is it even possible for the related query to not return only one row? (I think the > "count" ensures it). I think you are right. This assertion was added sometime back on the basis of feedback on hackers. Let me review that again. I can consider this comment in the next version. > 6 === > if (conninfo_changed || > primary_slotname_changed || > + old_enable_syncslot != enable_syncslot || > (old_hot_standby_feedback != hot_standby_feedback)) > { > ereport(LOG, > errmsg("slot sync worker will restart because of" > " a parameter change")); > > I don't think "slot sync worker will restart" is true if one change enable_syncslot > from on to off. Yes, right. I have changed the log-msg in this specific case. > > IMHO, v62-003 is in good shape and could be merged in v62-002 (that would ease > the review). But let's wait to see if others think differently. > > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
I have one question about the new code in v63-0002. ====== src/backend/replication/logical/slotsync.c 1. ReplSlotSyncWorkerMain + Assert(SlotSyncWorker->pid == InvalidPid); + + /* + * Startup process signaled the slot sync worker to stop, so if meanwhile + * postmaster ended up starting the worker again, exit. + */ + if (SlotSyncWorker->stopSignaled) + { + SpinLockRelease(&SlotSyncWorker->mutex); + proc_exit(0); + } Can we be sure a worker crash can't occur (in ShutDownSlotSync?) in such a way that SlotSyncWorker->stopSignaled was already assigned true, but SlotSyncWorker->pid was not yet reset to InvalidPid; e.g. Is the Assert above still OK? ====== Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Jan 17, 2024 at 4:00 PM shveta malik <shveta.malik@gmail.com> wrote: > > PFA v63. > 1. + /* User created slot with the same name exists, raise ERROR. */ + if (!synced) + ereport(ERROR, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("exiting from slot synchronization on receiving" + " the failover slot \"%s\" from the primary server", + remote_slot->name), + errdetail("A user-created slot with the same name already" + " exists on the standby.")); I think here primary error message should contain the reason for failure. Something like: "exiting from slot synchronization because same name slot already exists on standby" then we can add more details in errdetail. 2. +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) { ... + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); + xmin_horizon = GetOldestSafeDecodingTransactionId(true); + SpinLockAcquire(&slot->mutex); + slot->data.catalog_xmin = xmin_horizon; + SpinLockRelease(&slot->mutex); ... } Here, why slot->effective_catalog_xmin is not updated? The same is required by a later call to ReplicationSlotsComputeRequiredXmin(). I see that the prior version v60-0002 has the corresponding change but it is missing in the latest version. Any reason? 3. + * Return true either if the slot is marked as RS_PERSISTENT (sync-ready) or + * is synced periodically (if it was already sync-ready). Return false + * otherwise. + */ +static bool +update_and_persist_slot(RemoteSlot *remote_slot) The second part of the above comment (or is synced periodically (if it was already sync-ready)) is not clear to me. Does it intend to describe the case when we try to update the already created temp slot in the last call. If so, that is not very clear because periodically sounds like it can be due to repeated sync for sync-ready slot. 4. +update_and_persist_slot(RemoteSlot *remote_slot) { ... + (void) local_slot_update(remote_slot); ... } Can we write a comment to state the reason why we don't care about the return value here? -- With Regards, Amit Kapila.
On Wednesday, January 17, 2024 6:30 PM shveta malik <shveta.malik@gmail.com> wrote: > PFA v63. I analyzed the security of the slotsync worker and its replication connection a bit, and didn't find any issues. Here are the details: 1) data security First, we are using the role specified in primary_conninfo; that role is required to have the REPLICATION or SUPERUSER privilege [1], which means it is reasonable for the role to modify and read replication slots on the primary. On the primary, the slotsync worker only queries the pg_replication_slots view, which doesn't involve any system or user table access, so I think it's safe. On the standby server, the slot sync worker does not read or write any user tables either, so there is no risk of executing arbitrary code in triggers. 2) privilege check The SQL query issued by the slotsync worker goes through the normal privilege checks on the primary. If I revoke the execute privilege on pg_get_replication_slots from the replication user, then the slotsync worker is no longer able to query the pg_replication_slots view. The same is true for the pg_is_in_recovery function. The slotsync worker keeps reporting an ERROR after the revoke, which is as expected. Based on the above, I don't see any security issues with the slotsync worker. [1] https://www.postgresql.org/docs/16/runtime-config-replication.html#GUC-PRIMARY-CONNINFO Best Regards, Hou zj
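To make the privilege-check part of the analysis above easy to reproduce, here is a minimal SQL sketch along those lines; it is an illustration only, the role name sync_user is an assumption, and the REVOKEs target PUBLIC because EXECUTE on these catalog functions is normally granted to PUBLIC:

-- Run on the primary as a superuser; sync_user stands in for the role from primary_conninfo.
REVOKE EXECUTE ON FUNCTION pg_get_replication_slots() FROM PUBLIC;
REVOKE EXECUTE ON FUNCTION pg_is_in_recovery() FROM PUBLIC;

-- Connected as sync_user (assumed to be a non-superuser), both of these should now
-- fail with a permission-denied error, matching the ERROR the worker keeps reporting.
SELECT * FROM pg_replication_slots;
SELECT pg_is_in_recovery();

-- Undo the experiment.
GRANT EXECUTE ON FUNCTION pg_get_replication_slots() TO PUBLIC;
GRANT EXECUTE ON FUNCTION pg_is_in_recovery() TO PUBLIC;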
On Tue, Jan 9, 2024 at 11:15 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Tuesday, January 9, 2024 9:17 AM Peter Smith <smithpb2250@gmail.com> wrote: > > ... > > > > 2. ALTER_REPLICATION_SLOT ... FAILOVER > > > > + <variablelist> > > + <varlistentry> > > + <term><literal>FAILOVER [ <replaceable > > class="parameter">boolean</replaceable> ]</literal></term> > > + <listitem> > > + <para> > > + If true, the slot is enabled to be synced to the physical > > + standbys so that logical replication can be resumed after failover. > > + </para> > > + </listitem> > > + </varlistentry> > > + </variablelist> > > > > This syntax says passing the boolean value is optional. So it needs to be > > specified here in the docs that not passing a value would be the same as > > passing the value true. > > The behavior that "not passing a value would be the same as passing the value > true " is due to the rule of defGetBoolean(). And all the options of commands > in this document behave the same in this case, therefore I think we'd better > add document for it in a general place in a separate patch/thread instead of > mentioning this in each option's paragraph. > Hi Hou-san, I did as suggested and posted a patch for this in another thread [1]. Please see if it is OK. ====== [1] https://www.postgresql.org/message-id/CAHut%2BPtDWSmW8uiRJF1LfGQJikmo7V2jdysLuRmtsanNZc7fNw%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
On Thu, Jan 18, 2024 at 10:31 AM Peter Smith <smithpb2250@gmail.com> wrote: > > I have one question about the new code in v63-0002. > > ====== > src/backend/replication/logical/slotsync.c > > 1. ReplSlotSyncWorkerMain > > + Assert(SlotSyncWorker->pid == InvalidPid); > + > + /* > + * Startup process signaled the slot sync worker to stop, so if meanwhile > + * postmaster ended up starting the worker again, exit. > + */ > + if (SlotSyncWorker->stopSignaled) > + { > + SpinLockRelease(&SlotSyncWorker->mutex); > + proc_exit(0); > + } > > Can we be sure a worker crash can't occur (in ShutDownSlotSync?) in > such a way that SlotSyncWorker->stopSignaled was already assigned > true, but SlotSyncWorker->pid was not yet reset to InvalidPid; > > e.g. Is the Assert above still OK? We are good with the Assert here. I tried below cases: 1) When slotsync worker is say killed using 'kill', it is considered as SIGTERM; slot sync worker invokes 'slotsync_worker_onexit()' before going down and thus sets SlotSyncWorker->pid = InvalidPid. This means when it is restarted (considering we have put the breakpoints in such a way that postmaster had already reached do_start_bgworker() before promotion finished), it is able to see stopSignaled set but pid is InvalidPid and thus we are good. 2) Another case is when we kill slot sync worker using 'kill -9' (or say we make it crash), in such a case, postmaster signals each sibling process to quit (including startup process) and cleans up the shared memory used by each (including SlotSyncWorker). In such a case promotion fails. And if slot sync worker is started again, it will find pid as InvalidPid. So we are good. thanks Shveta
On Wed, Jan 17, 2024 at 7:30 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Wed, Jan 17, 2024 at 3:08 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On Tue, Jan 16, 2024 at 05:27:05PM +0530, shveta malik wrote: > > > PFA v62. Details: > > > > Thanks! > > > > > v62-003: > > > It is a new patch which attempts to implement slot-sync worker as a > > > special process which is neither a bgworker nor an Auxiliary process. > > > Here we get the benefit of converting enable_syncslot to a PGC_SIGHUP > > > Guc rather than PGC_POSTMASTER. We launch the slot-sync worker only if > > > it is hot-standby and 'enable_syncslot' is ON. > > > > The implementation looks reasonable to me (from what I can see some parts is > > copy/paste from an already existing "special" process and some parts are > > "sync slot" specific) which makes fully sense. > > > > A few remarks: > > > > 1 === > > + * Was it the slot sycn worker? > > > > Typo: sycn > > > > 2 === > > + * ones), and no walwriter, autovac launcher or bgwriter or slot sync > > > > Instead? "* ones), and no walwriter, autovac launcher, bgwriter or slot sync" > > > > 3 === > > + * restarting slot slyc worker. If stopSignaled is set, the worker will > > > > Typo: slyc > > > > 4 === > > +/* Flag to tell if we are in an slot sync worker process */ > > > > s/an/a/ ? > > > > 5 === (coming from v62-0002) > > + Assert(tuplestore_tuple_count(res->tuplestore) == 1); > > > > Is it even possible for the related query to not return only one row? (I think the > > "count" ensures it). > > > > 6 === > > if (conninfo_changed || > > primary_slotname_changed || > > + old_enable_syncslot != enable_syncslot || > > (old_hot_standby_feedback != hot_standby_feedback)) > > { > > ereport(LOG, > > errmsg("slot sync worker will restart because of" > > " a parameter change")); > > > > I don't think "slot sync worker will restart" is true if one change enable_syncslot > > from on to off. > > > > IMHO, v62-003 is in good shape and could be merged in v62-002 (that would ease > > the review). But let's wait to see if others think differently. > > > > Regards, > > > > -- > > Bertrand Drouvot > > PostgreSQL Contributors Team > > RDS Open Source Databases > > Amazon Web Services: https://aws.amazon.com > > > PFA v63. > > --It addresses comments by Peter given in [1], [2], comment by Nisha > given in [3], comments by Bertrand given in [4] > --It also moves race-condition fix from patch003 to patch002 as > suggested by Swada-san offlist. Race-condition is mentioned in [5] > Thank you for updating the patch. I have some comments: --- + latestWalEnd = GetWalRcvLatestWalEnd(); + if (remote_slot->confirmed_lsn > latestWalEnd) + { + elog(ERROR, "exiting from slot synchronization as the received slot sync" + " LSN %X/%X for slot \"%s\" is ahead of the standby position %X/%X", + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), + remote_slot->name, + LSN_FORMAT_ARGS(latestWalEnd)); + } IIUC GetWalRcvLatestWalEnd () returns walrcv->latestWalEnd, which is typically the primary server's flush position and doesn't mean the LSN where the walreceiver received/flushed up to. Does it really happen that the slot's confirmed_flush_lsn is higher than the primary's flush lsn? 
--- After dropping a database on the primary, I got the following LOG (PID 2978463 is the slotsync worker on the standby): LOG: still waiting for backend with PID 2978463 to accept ProcSignalBarrier CONTEXT: WAL redo at 0/301CE00 for Database/DROP: dir 1663/16384 Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Jan 17, 2024 at 4:06 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > 5 === (coming from v62-0002) > > + Assert(tuplestore_tuple_count(res->tuplestore) == 1); > > > > Is it even possible for the related query to not return only one row? (I think the > > "count" ensures it). > > I think you are right. This assertion was added sometime back on the > basis of feedback on hackers. Let me review that again. I can consider > this comment in the next version. > OTOH, can't we keep the assert as it is but remove "= 1" from "count(*) = 1" in the query. There shouldn't be more than one slot with same name on the primary. Or, am I missing something? -- With Regards, Amit Kapila.
On Fri, Jan 19, 2024 at 11:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 5 === (coming from v62-0002) > > > + Assert(tuplestore_tuple_count(res->tuplestore) == 1); > > > > > > Is it even possible for the related query to not return only one row? (I think the > > > "count" ensures it). > > > > I think you are right. This assertion was added sometime back on the > > basis of feedback on hackers. Let me review that again. I can consider > > this comment in the next version. > > > > OTOH, can't we keep the assert as it is but remove "= 1" from > "count(*) = 1" in the query. There shouldn't be more than one slot > with same name on the primary. Or, am I missing something? There will be 1 record max and 0 record if the primary_slot_name is invalid. Keeping 'count(*)=1' gives the benefit that it will straight away give us true/false indicating if we are good or not wrt primary_slot_name. I feel Assert can be removed and we can simply have: if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot)) elog(ERROR, "failed to fetch primary_slot_name tuple"); thanks Shveta
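For readers following the discussion, the probe in question is roughly the following; this is a sketch reconstructed from the thread rather than the literal query in the 0002 patch, and the slot name sb1_slot is a placeholder for whatever primary_slot_name is set to. It checks, in a single round trip, that the primary is not itself in recovery and that the configured physical slot exists, and the aggregate guarantees exactly one result row, which is what the Assert-versus-elog debate is about:

SELECT pg_is_in_recovery(), count(*) = 1
FROM pg_replication_slots
WHERE slot_type = 'physical' AND slot_name = 'sb1_slot';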
Here are some review comments for patch v63-0003. ====== Commit Message 1. This patch attempts to start slot-sync worker as a special process which is neither a bgworker nor an Auxiliary process. The benefit we get here is we can control the start-conditions of the worker which further allows us to 'enable_syncslot' as PGC_SIGHUP which was otherwise a PGC_POSTMASTER GUC when slotsync worker was registered as bgworker. ~ missing word? /allows us to/allows us to define/ ====== src/backend/postmaster/postmaster.c 2. process_pm_child_exit + /* + * Was it the slot sync worker? Normal exit or FATAL exit (FATAL can + * be caused by libpqwalreceiver on receiving shutdown request by the + * startup process during promotion) can be ignored; we'll start a new + * one at the next iteration of the postmaster's main loop, if + * necessary. Any other exit condition is treated as a crash. + */ + if (pid == SlotSyncWorkerPID) + { + SlotSyncWorkerPID = 0; + if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus)) + HandleChildCrash(pid, exitstatus, + _("Slotsync worker process")); + continue; + } 2a. I think the 2nd sentence is easier to read if written like: Normal exit or FATAL exit can be ignored (FATAL can be caused by libpqwalreceiver on receiving shutdown request by the startup process during promotion); ~ 2b. All other names nearby are lowercase so maybe change "Slotsync worker process" to ""slotsync worker process" or ""slot sync worker process". ====== src/backend/replication/logical/slotsync.c 3. check_primary_info if (!valid) - ereport(ERROR, + { + *primary_slot_invalid = true; + ereport(LOG, errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("exiting from slot synchronization due to bad configuration"), + errmsg("skipping slot synchronization due to bad configuration"), /* translator: second %s is a GUC variable name */ errdetail("The primary server slot \"%s\" specified by %s is not valid.", PrimarySlotName, "primary_slot_name")); + } Somehow it seems more appropriate for the *caller* to decide what to do (e.g. "skipping...") when the primary slot is invalid. See also the next review comment #4b -- maybe just change this LOG to say "bad configuration for slot synchronization". ~~~ 4. /* * Check that all necessary GUCs for slot synchronization are set - * appropriately. If not, raise an ERROR. + * appropriately. If not, log the message and pass 'valid' as false + * to the caller. * * If all checks pass, extracts the dbname from the primary_conninfo GUC and * returns it. */ static char * -validate_parameters_and_get_dbname(void) +validate_parameters_and_get_dbname(bool *valid) 4a. This feels back-to-front. I think a "validate" function should return boolean. It can return the dbname as a side-effect only when it is valid. SUGGESTION static boolean validate_parameters_and_get_dbname(char *dbname) ~ 4b. It was a bit different when there were ERRORs but now they are LOGs somehow it seems wrong for this function to say what the *caller* will do. Maybe you can rewrite all the errmsg so the don't say "skipping" but they just say "bad configuration for slot synchronization" If valid is false then you can LOG "skipping" at the caller... ~~~ 5. wait_for_valid_params_and_get_dbname + dbname = validate_parameters_and_get_dbname(&valid); + if (valid) + break; + else This code will be simpler when the function is change to return boolean as suggested above in #4a. Also the 'else' is unnecessary. SUGGESTION if (validate_parameters_and_get_dbname(&dbname) break; ~ 6. 
+ if (rc & WL_LATCH_SET) + ResetLatch(MyLatch); + + } + } Unnecessary blank line. ~~~ 7. slotsync_reread_config + if (old_enable_syncslot != enable_syncslot) + { + /* + * We have reached here, so old value must be true and new must be + * false. + */ + Assert(old_enable_syncslot); + Assert(!enable_syncslot); I felt it would be better just to say Assert(enable_syncslot); at the top of this function (before the ProcessConfigFile). Then none of this other comment/assert if really needed because it should be self-evident. ~~~ 8. StartSlotSyncWorker int StartSlotSyncWorker(void) { pid_t pid; #ifdef EXEC_BACKEND switch ((pid = slotsyncworker_forkexec())) #else switch ((pid = fork_process())) #endif { case -1: ereport(LOG, (errmsg("could not fork slot sync worker process: %m"))); return 0; #ifndef EXEC_BACKEND case 0: /* in postmaster child ... */ InitPostmasterChild(); /* Close the postmaster's sockets */ ClosePostmasterPorts(false); ReplSlotSyncWorkerMain(0, NULL); break; #endif default: return (int) pid; } /* shouldn't get here */ return 0; } The switch code can be rearranged so you don't need the #ifndef SUGGESTION #ifdef EXEC_BACKEND switch ((pid = slotsyncworker_forkexec())) { #else switch ((pid = fork_process())) { case 0: /* in postmaster child ... */ InitPostmasterChild(); /* Close the postmaster's sockets */ ClosePostmasterPorts(false); ReplSlotSyncWorkerMain(0, NULL); break; #endif case -1: ereport(LOG, (errmsg("could not fork slot sync worker process: %m"))); return 0; default: return (int) pid; } ====== src/backend/storage/lmgr/proc.c 9. InitProcess * this; it probably should.) + * + * Slot sync worker does not participate in it, see comments atop Backend. */ - if (IsUnderPostmaster && !IsAutoVacuumLauncherProcess()) + if (IsUnderPostmaster && !IsAutoVacuumLauncherProcess() && + !IsLogicalSlotSyncWorker()) MarkPostmasterChildActive(); 9a. /does not participate in it/also does not participate in it/ ~ 9b. It's not clear where "atop Backend" is referring to. ~~~ 10. * way, so tell the postmaster we've cleaned up acceptably well. (XXX * autovac launcher should be included here someday) */ - if (IsUnderPostmaster && !IsAutoVacuumLauncherProcess()) + if (IsUnderPostmaster && !IsAutoVacuumLauncherProcess() && + !IsLogicalSlotSyncWorker()) MarkPostmasterChildInactive(); Should this comment also be updated to mention slot sync worker? ====== src/backend/utils/activity/pgstat_io.c 11. pgstat_tracks_io_bktype case B_WAL_SENDER: + case B_SLOTSYNC_WORKER: return true; } Notice all the other enums were arrange in alphabetical order, so do the same here. ====== src/backend/utils/init/miscinit.c 12. GetBackendTypeDesc + case B_SLOTSYNC_WORKER: + backendDesc = "slotsyncworker"; + break; } All the other case are in alphabetical order, same as the enum values, so do the same here. ~~~ 13. InitializeSessionUserIdStandalone * This function should only be called in single-user mode, in autovacuum * workers, and in background workers. */ - Assert(!IsUnderPostmaster || IsAutoVacuumWorkerProcess() || IsBackgroundWorker); + Assert(!IsUnderPostmaster || IsAutoVacuumWorkerProcess() || + IsLogicalSlotSyncWorker() || IsBackgroundWorker); Looks like this Assert has a stale comment that should be updated. ====== src/include/miscadmin.h 14. GetBackendTypeDesc B_WAL_SUMMARIZER, + B_SLOTSYNC_WORKER, B_WAL_WRITER, } BackendType; It seems strange to jam this new value among the other B_WAL enums. Anyway, it looks like everything else is in alphabetical order, so we do that too. ====== Kind Regards, Peter Smith. 
Fujitsu Australia
Hi, On Fri, Jan 19, 2024 at 11:46:51AM +0530, shveta malik wrote: > On Fri, Jan 19, 2024 at 11:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > 5 === (coming from v62-0002) > > > > + Assert(tuplestore_tuple_count(res->tuplestore) == 1); > > > > > > > > Is it even possible for the related query to not return only one row? (I think the > > > > "count" ensures it). > > > > > > I think you are right. This assertion was added sometime back on the > > > basis of feedback on hackers. Let me review that again. I can consider > > > this comment in the next version. > > > > > > > OTOH, can't we keep the assert as it is but remove "= 1" from > > "count(*) = 1" in the query. There shouldn't be more than one slot > > with same name on the primary. Or, am I missing something? > > There will be 1 record max and 0 record if the primary_slot_name is > invalid. I think we'd have exactly one record in all the cases (due to the count):

postgres=# SELECT pg_is_in_recovery(), count(*) FROM pg_replication_slots WHERE 1 = 2;
 pg_is_in_recovery | count
-------------------+-------
 f                 |     0
(1 row)

postgres=# SELECT pg_is_in_recovery(), count(*) FROM pg_replication_slots WHERE 1 = 1;
 pg_is_in_recovery | count
-------------------+-------
 f                 |     1
(1 row)

> Keeping 'count(*)=1' gives the benefit that it will straight > away give us true/false indicating if we are good or not wrt > primary_slot_name. I feel Assert can be removed and we can simply > have: > > if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot)) > elog(ERROR, "failed to fetch primary_slot_name tuple"); > I'd also vote for keeping it as it is and remove the Assert. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Jan 19, 2024 at 10:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > Thank you for updating the patch. I have some comments: > > --- > + latestWalEnd = GetWalRcvLatestWalEnd(); > + if (remote_slot->confirmed_lsn > latestWalEnd) > + { > + elog(ERROR, "exiting from slot synchronization as the > received slot sync" > + " LSN %X/%X for slot \"%s\" is ahead of the > standby position %X/%X", > + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), > + remote_slot->name, > + LSN_FORMAT_ARGS(latestWalEnd)); > + } > > IIUC GetWalRcvLatestWalEnd () returns walrcv->latestWalEnd, which is > typically the primary server's flush position and doesn't mean the LSN > where the walreceiver received/flushed up to. Yes. I think it makes more sense to use something that actually reports the flushed position. I gave it a try by replacing GetWalRcvLatestWalEnd() with GetWalRcvFlushRecPtr(), but I see a problem here. Let's say I have enabled the slot-sync feature in a running standby; in that case we are all good (flushedUpto is the same as the actual flush position indicated by LogstreamResult.Flush). But if I restart the standby, then I observed that the startup process sets flushedUpto to some value 'x' (see [1]), while when the wal-receiver starts, it sets 'LogstreamResult.Flush' to another value (see [2]) which is always greater than 'x'. And we do not update flushedUpto with the 'LogstreamResult.Flush' value in the walreceiver until we actually do an operation on the primary. Performing a data change on the primary sends WAL to the standby, which then hits XLogWalRcvFlush() and updates flushedUpto to the same value as LogstreamResult.Flush. Until then we have a situation where slots received on the standby are ahead of flushedUpto, and thus the slotsync worker keeps on erroring out. I have yet to find out why flushedUpto is set to a lower value than 'LogstreamResult.Flush' at the start of the standby. Or maybe I am using the wrong function (GetWalRcvFlushRecPtr()) and should be using something else instead? [1]: Startup process sets 'flushedUpto' here: ReadPageInternal-->XLogPageRead-->WaitForWALToBecomeAvailable-->RequestXLogStreaming [2]: Walreceiver sets 'LogstreamResult.Flush' here but does not update 'flushedUpto' here: WalReceiverMain(): LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL) > Does it really happen > that the slot's confirmed_flush_lsn is higher than the primary's flush > lsn? It may happen if we have not configured standby_slot_names on the primary. In such a case, slots may get updated without confirming that the standby has received the change, and thus the slot-sync worker may fetch slots whose LSNs are ahead of the latest WAL position on the standby. thanks Shveta
On Fri, Jan 19, 2024 at 1:42 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Fri, Jan 19, 2024 at 11:46:51AM +0530, shveta malik wrote: > > Keeping 'count(*)=1' gives the benefit that it will straight > > away give us true/false indicating if we are good or not wrt > > primary_slot_name. I feel Assert can be removed and we can simply > > have: > > > > if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot)) > > elog(ERROR, "failed to fetch primary_slot_name tuple"); > > > > I'd also vote for keeping it as it is and remove the Assert. Sure, retained the query as is. Removed Assert. PFA v64. Changes are: 1) Addressed comments by Amit in [1]. 2) Addressed offlist comments given by Peter for documentation patch06. 3) Moved some docs to patch04 which were wrongly placed in patch02. 4) Addressed 1 pending comment from Bertrand (as stated above) to remove redundant Assert from check_primary_info() TODO: Address comments by Peter given in [2] [1]: https://www.postgresql.org/message-id/CAA4eK1LBnCjxBi7vPam0OfxsTEyHdvqx7goKxi1ePU45oz%3Dkhg%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAHut%2BPt5Pk_xJkb54oahR%2Bf9oawgfnmbpewvkZPgnRhoJ3gkYg%40mail.gmail.com thanks Shveta
Attachment
- v64-0003-Slot-sync-worker-as-a-special-process.patch
- v64-0005-Non-replication-connection-and-app_name-change.patch
- v64-0004-Allow-logical-walsenders-to-wait-for-the-physica.patch
- v64-0001-Enable-setting-failover-property-for-a-slot-thro.patch
- v64-0002-Add-logical-slot-sync-capability-to-the-physical.patch
- v64-0006-Document-the-steps-to-check-if-the-standby-is-re.patch
On Thu, Jan 18, 2024 at 4:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > 2. > +synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot) > { > ... > + LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); > + xmin_horizon = GetOldestSafeDecodingTransactionId(true); > + SpinLockAcquire(&slot->mutex); > + slot->data.catalog_xmin = xmin_horizon; > + SpinLockRelease(&slot->mutex); > ... > } > > Here, why slot->effective_catalog_xmin is not updated? The same is > required by a later call to ReplicationSlotsComputeRequiredXmin(). I > see that the prior version v60-0002 has the corresponding change but > it is missing in the latest version. Any reason? I think it was a mistake in v61. Added it back in v64.. > > 3. > + * Return true either if the slot is marked as RS_PERSISTENT (sync-ready) or > + * is synced periodically (if it was already sync-ready). Return false > + * otherwise. > + */ > +static bool > +update_and_persist_slot(RemoteSlot *remote_slot) > > The second part of the above comment (or is synced periodically (if it > was already sync-ready)) is not clear to me. Does it intend to > describe the case when we try to update the already created temp slot > in the last call. If so, that is not very clear because periodically > sounds like it can be due to repeated sync for sync-ready slot. The comment was as per old functionality where this function was doing persist and save both. In v61 code changed, but comment was not updated. I have changed it now in v64. > 4. > +update_and_persist_slot(RemoteSlot *remote_slot) > { > ... > + (void) local_slot_update(remote_slot); > ... > } > > Can we write a comment to state the reason why we don't care about the > return value here? Since it is the first time 'local_slot_update' is happening on any slot, the return value must be true i.e. local_slot_update() should not skip the update. I have thus added an Assert on return value now (in v64). thanks Shveta
On Wed, Jan 17, 2024 at 4:00 PM shveta malik <shveta.malik@gmail.com> wrote: > I had some off-list discussions with Sawada-San, Hou-San, and Shveta on the topic of extending replication commands instead of using the current model where we fetch the required slot information via SQL using a database connection. I would like to summarize the discussion and would like to know the thoughts of others on this topic. In the current patch, we launch the slotsync worker on the physical standby, which connects to the specified database (currently we let users specify the required dbname in primary_conninfo) on the primary. It then fetches the required information for failover marked slots from the primary and also does some primitive checks on the upstream node via SQL (the additional checks are like whether the upstream node has the specified physical slot or whether the upstream node is a primary node or a standby node). To fetch the required information it uses the libpqwalreceiver API, which is mostly apt for this purpose as it supports SQL execution, but for this patch, we don't need a replication connection, so we extend the libpqwalreceiver connect API. Now, the concern related to this could be that users would probably need to change existing mechanisms/tools to update primary_conninfo, and one of the alternatives proposed is to have an additional GUC like slot_sync_dbname. Users won't be able to drop the database this worker is connected to (aka whatever is specified in slot_sync_dbname), but as the user herself sets up the configuration it shouldn't be a big deal. Then we also discussed whether extending libpqwalreceiver's connect API is a good idea and whether we need to further extend it in the future. As far as I can see, the slotsync worker's primary requirement is to execute SQL queries, for which the current API is sufficient, and I don't see something that needs any drastic change in this API. Note that the tablesync worker, which executes SQL, also uses these APIs, so we may need something in the future for either of those. Then finally we need a slotsync worker to also connect to a database to use SQL and fetch results. Now, let us consider if we extend the replication commands like READ_REPLICATION_SLOT and/or introduce a new set of replication commands to fetch the required information; then we don't need a DB connection with the primary or a database connection in the slotsync worker. As per my current understanding, it is quite doable, but I think we will slowly go in the direction of making replication commands something like SQL, because today we need to extend them to fetch all slots info that have failover marked as true, check the existence of a particular replication slot, etc. Then tomorrow, if we want to extend this work to have multiple slotsync workers, say workers per db, then we have to extend the replication command to fetch per-database failover marked slots. To me, it sounds more like we are slowly adding SQL-like features to replication commands. Apart from this, when we are reading per-db replication slots without connecting to a database, we probably need some additional protection mechanism so that the database won't get dropped. Considering all this, it seems that for now extending replication commands can probably simplify a few things like those mentioned above, but using SQL with a db-connection is more extendable. Thoughts? -- With Regards, Amit Kapila.
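As a concrete illustration of the SQL-with-db-connection approach, the kind of query the slotsync worker could issue over a plain connection to the primary might look roughly like the following; this is a sketch based on the description above, not the patch's literal query, and the boolean failover column is the slot property added by the 0001 patch in this thread:

SELECT slot_name, plugin, database, two_phase,
       restart_lsn, confirmed_flush_lsn, catalog_xmin
FROM pg_replication_slots
WHERE slot_type = 'logical' AND failover AND NOT temporary;

Whether this stays a plain SELECT or becomes a dedicated replication command is exactly the trade-off weighed in this message.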
On Fri, Jan 19, 2024 at 5:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 17, 2024 at 4:00 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > I had some off-list discussions with Sawada-San, Hou-San, and Shveta > on the topic of extending replication commands instead of using the > current model where we fetch the required slot information via SQL > using a database connection. I would like to summarize the discussion > and would like to know the thoughts of others on this topic. > > In the current patch, we launch the slotsync worker on physical > standby which connects to the specified database (currently we let > users specify the required dbname in primary_conninfo) on the primary. > It then fetches the required information for failover marked slots > from the primary and also does some primitive checks on the upstream > node via SQL (the additional checks are like whether the upstream node > has a specified physical slot or whether the upstream node is a > primary node or a standby node). To fetch the required information it > uses a libpqwalreciever API which is mostly apt for this purpose as it > supports SQL execution but for this patch, we don't need a replication > connection, so we extend the libpqwalreciever connect API. What sort of extension we have done to 'libpqwalreciever'? Is it something like by default this supports replication connections so we have done an extension to the API so that we can provide an option whether to create a replication connection or a normal connection? > Now, the concerns related to this could be that users would probably > need to change existing mechanisms/tools to update priamry_conninfo > and one of the alternatives proposed is to have an additional GUC like > slot_sync_dbname. Users won't be able to drop the database this worker > is connected to aka whatever is specified in slot_sync_dbname but as > the user herself sets up the configuration it shouldn't be a big deal. Yeah for this purpose users may use template1 or so which they generally don't plan to drop. So in case the user wants to drop that database user needs to turn off the slot syncing option and then it can be done? > Then we also discussed whether extending libpqwalreceiver's connect > API is a good idea and whether we need to further extend it in the > future. As far as I can see, slotsync worker's primary requirement is > to execute SQL queries which the current API is sufficient, and don't > see something that needs any drastic change in this API. Note that > tablesync worker that executes SQL also uses these APIs, so we may > need something in the future for either of those. Then finally we need > a slotsync worker to also connect to a database to use SQL and fetch > results. While looking into the patch v64-0002 I could not exactly point out what sort of extensions are there in libpqwalreceiver.c, I just saw one extra API for fetching the dbname from connection info? > Now, let us consider if we extend the replication commands like > READ_REPLICATION_SLOT and or introduce a new set of replication > commands to fetch the required information then we don't need a DB > connection with primary or a connection in slotsync worker. As per my > current understanding, it is quite doable but I think we will slowly > go in the direction of making replication commands something like SQL > because today we need to extend it to fetch all slots info that have > failover marked as true, the existence of a particular replication, > etc. 
Then tomorrow, if we want to extend this work to have multiple > slotsync workers say workers perdb then we have to extend the > replication command to fetch per-database failover marked slots. To > me, it sounds more like we are slowly adding SQL-like features to > replication commands. > > Apart from this when we are reading per-db replication slots without > connecting to a database, we probably need some additional protection > mechanism so that the database won't get dropped. Something like locking the database only while fetching the slots? > Considering all this it seems that for now probably extending > replication commands can simplify a few things like mentioned above > but using SQL's with db-connection is more extendable. Even I have similar thoughts. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sat, Jan 20, 2024 at 10:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Jan 19, 2024 at 5:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jan 17, 2024 at 4:00 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > I had some off-list discussions with Sawada-San, Hou-San, and Shveta > > on the topic of extending replication commands instead of using the > > current model where we fetch the required slot information via SQL > > using a database connection. I would like to summarize the discussion > > and would like to know the thoughts of others on this topic. > > > > In the current patch, we launch the slotsync worker on physical > > standby which connects to the specified database (currently we let > > users specify the required dbname in primary_conninfo) on the primary. > > It then fetches the required information for failover marked slots > > from the primary and also does some primitive checks on the upstream > > node via SQL (the additional checks are like whether the upstream node > > has a specified physical slot or whether the upstream node is a > > primary node or a standby node). To fetch the required information it > > uses a libpqwalreciever API which is mostly apt for this purpose as it > > supports SQL execution but for this patch, we don't need a replication > > connection, so we extend the libpqwalreciever connect API. > > What sort of extension we have done to 'libpqwalreciever'? Is it > something like by default this supports replication connections so we > have done an extension to the API so that we can provide an option > whether to create a replication connection or a normal connection? > Yeah and in the future there could be more as well. The other function added walrcv_get_dbname_from_conninfo doesn't appear to be a problem either for now. > > Now, the concerns related to this could be that users would probably > > need to change existing mechanisms/tools to update priamry_conninfo > > and one of the alternatives proposed is to have an additional GUC like > > slot_sync_dbname. Users won't be able to drop the database this worker > > is connected to aka whatever is specified in slot_sync_dbname but as > > the user herself sets up the configuration it shouldn't be a big deal. > > Yeah for this purpose users may use template1 or so which they > generally don't plan to drop. > Using template1 has other problems like users won't be able to create a new database. See [2] (point number 2.2) > > So in case the user wants to drop that > database user needs to turn off the slot syncing option and then it > can be done? > Right. > > Then we also discussed whether extending libpqwalreceiver's connect > > API is a good idea and whether we need to further extend it in the > > future. As far as I can see, slotsync worker's primary requirement is > > to execute SQL queries which the current API is sufficient, and don't > > see something that needs any drastic change in this API. Note that > > tablesync worker that executes SQL also uses these APIs, so we may > > need something in the future for either of those. Then finally we need > > a slotsync worker to also connect to a database to use SQL and fetch > > results. > > While looking into the patch v64-0002 I could not exactly point out > what sort of extensions are there in libpqwalreceiver.c, I just saw > one extra API for fetching the dbname from connection info? > Right, the worry was that we may need it in the future. 
> > Now, let us consider if we extend the replication commands like > > READ_REPLICATION_SLOT and or introduce a new set of replication > > commands to fetch the required information then we don't need a DB > > connection with primary or a connection in slotsync worker. As per my > > current understanding, it is quite doable but I think we will slowly > > go in the direction of making replication commands something like SQL > > because today we need to extend it to fetch all slots info that have > > failover marked as true, the existence of a particular replication, > > etc. Then tomorrow, if we want to extend this work to have multiple > > slotsync workers say workers perdb then we have to extend the > > replication command to fetch per-database failover marked slots. To > > me, it sounds more like we are slowly adding SQL-like features to > > replication commands. > > > > Apart from this when we are reading per-db replication slots without > > connecting to a database, we probably need some additional protection > > mechanism so that the database won't get dropped. > > Something like locking the database only while fetching the slots? > Possible, but can we lock the database from an auxiliary process? > > Considering all this it seems that for now probably extending > > replication commands can simplify a few things like mentioned above > > but using SQL's with db-connection is more extendable. > > Even I have similar thoughts. > Thanks. [1] - https://www.postgresql.org/message-id/CAJpy0uBhPx1MDHh903XpFAhpBH23KzVXyg_4VjH2zXk81oGi1w%40mail.gmail.com -- With Regards, Amit Kapila.
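For context, the standby-side setup being discussed (the dbname carried in primary_conninfo, sync enabled via a reloadable GUC) could be sketched like this; the connection string, slot name, and role are placeholders, and enable_syncslot is the GUC name used by the patch versions in this thread:

ALTER SYSTEM SET primary_conninfo = 'host=primary port=5432 user=repl_user dbname=postgres';
ALTER SYSTEM SET primary_slot_name = 'sb1_slot';
ALTER SYSTEM SET hot_standby_feedback = on;
ALTER SYSTEM SET enable_syncslot = on;
SELECT pg_reload_conf();

With the 0003 patch making enable_syncslot PGC_SIGHUP, turning it off before dropping the database named in primary_conninfo should only need a reload rather than a restart.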
On Fri, Jan 19, 2024 at 4:18 PM shveta malik <shveta.malik@gmail.com> wrote: > > PFA v64. V64 fails to apply to HEAD due to a recent commit. Rebased it. PFA v64_2. It has no new changes. thanks Shveta
Attachment
- v64_2-0002-Add-logical-slot-sync-capability-to-the-physic.patch
- v64_2-0003-Slot-sync-worker-as-a-special-process.patch
- v64_2-0001-Enable-setting-failover-property-for-a-slot-th.patch
- v64_2-0005-Non-replication-connection-and-app_name-change.patch
- v64_2-0004-Allow-logical-walsenders-to-wait-for-the-physi.patch
- v64_2-0006-Document-the-steps-to-check-if-the-standby-is-.patch
On Fri, Jan 19, 2024 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 17, 2024 at 4:00 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > I had some off-list discussions with Sawada-San, Hou-San, and Shveta > on the topic of extending replication commands instead of using the > current model where we fetch the required slot information via SQL > using a database connection. I would like to summarize the discussion > and would like to know the thoughts of others on this topic. > > In the current patch, we launch the slotsync worker on physical > standby which connects to the specified database (currently we let > users specify the required dbname in primary_conninfo) on the primary. > It then fetches the required information for failover marked slots > from the primary and also does some primitive checks on the upstream > node via SQL (the additional checks are like whether the upstream node > has a specified physical slot or whether the upstream node is a > primary node or a standby node). To fetch the required information it > uses a libpqwalreciever API which is mostly apt for this purpose as it > supports SQL execution but for this patch, we don't need a replication > connection, so we extend the libpqwalreciever connect API. > > Now, the concerns related to this could be that users would probably > need to change existing mechanisms/tools to update priamry_conninfo > and one of the alternatives proposed is to have an additional GUC like > slot_sync_dbname. Users won't be able to drop the database this worker > is connected to aka whatever is specified in slot_sync_dbname but as > the user herself sets up the configuration it shouldn't be a big deal. > Then we also discussed whether extending libpqwalreceiver's connect > API is a good idea and whether we need to further extend it in the > future. As far as I can see, slotsync worker's primary requirement is > to execute SQL queries which the current API is sufficient, and don't > see something that needs any drastic change in this API. Note that > tablesync worker that executes SQL also uses these APIs, so we may > need something in the future for either of those. Then finally we need > a slotsync worker to also connect to a database to use SQL and fetch > results. > > Now, let us consider if we extend the replication commands like > READ_REPLICATION_SLOT and or introduce a new set of replication > commands to fetch the required information then we don't need a DB > connection with primary or a connection in slotsync worker. As per my > current understanding, it is quite doable but I think we will slowly > go in the direction of making replication commands something like SQL > because today we need to extend it to fetch all slots info that have > failover marked as true, the existence of a particular replication, > etc. Then tomorrow, if we want to extend this work to have multiple > slotsync workers say workers perdb then we have to extend the > replication command to fetch per-database failover marked slots. To > me, it sounds more like we are slowly adding SQL-like features to > replication commands. > > Apart from this when we are reading per-db replication slots without > connecting to a database, we probably need some additional protection > mechanism so that the database won't get dropped. > > Considering all this it seems that for now probably extending > replication commands can simplify a few things like mentioned above > but using SQL's with db-connection is more extendable. > > Thoughts? 
Bertrand, and others, do you have an opinion on this matter? -- With Regards, Amit Kapila.
On Fri, Jan 19, 2024 at 3:55 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Jan 19, 2024 at 10:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > Thank you for updating the patch. I have some comments: > > > > --- > > + latestWalEnd = GetWalRcvLatestWalEnd(); > > + if (remote_slot->confirmed_lsn > latestWalEnd) > > + { > > + elog(ERROR, "exiting from slot synchronization as the > > received slot sync" > > + " LSN %X/%X for slot \"%s\" is ahead of the > > standby position %X/%X", > > + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), > > + remote_slot->name, > > + LSN_FORMAT_ARGS(latestWalEnd)); > > + } > > > > IIUC GetWalRcvLatestWalEnd () returns walrcv->latestWalEnd, which is > > typically the primary server's flush position and doesn't mean the LSN > > where the walreceiver received/flushed up to. > > yes. I think it makes more sense to use something which actually tells > flushed-position. I gave it a try by replacing GetWalRcvLatestWalEnd() > with GetWalRcvFlushRecPtr() but I see a problem here. Lets say I have > enabled the slot-sync feature in a running standby, in that case we > are all good (flushedUpto is the same as actual flush-position > indicated by LogstreamResult.Flush). But if I restart standby, then I > observed that the startup process sets flushedUpto to some value 'x' > (see [1]) while when the wal-receiver starts, it sets > 'LogstreamResult.Flush' to another value (see [2]) which is always > greater than 'x'. And we do not update flushedUpto with the > 'LogstreamResult.Flush' value in walreceiver until we actually do an > operation on primary. Performing a data change on primary sends WALs > to standby which then hits XLogWalRcvFlush() and updates flushedUpto > same as LogstreamResult.Flush. Until then we have a situation where > slots received on standby are ahead of flushedUpto and thus slotsync > worker keeps one erroring out. I am yet to find out why flushedUpto is > set to a lower value than 'LogstreamResult.Flush' at the start of > standby. Or maybe am I using the wrong function > GetWalRcvFlushRecPtr() and should be using something else instead? > Can we think of using GetStandbyFlushRecPtr()? We probably need to expose this function, if this works for the required purpose. -- With Regards, Amit Kapila.
Hi, On Fri, Jan 19, 2024 at 05:23:53PM +0530, Amit Kapila wrote: > On Wed, Jan 17, 2024 at 4:00 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > Now, the concerns related to this could be that users would probably > need to change existing mechanisms/tools to update priamry_conninfo Yeah, for the ones that want the sync slot feature. > and one of the alternatives proposed is to have an additional GUC like > slot_sync_dbname. Users won't be able to drop the database this worker > is connected to aka whatever is specified in slot_sync_dbname but as > the user herself sets up the configuration it shouldn't be a big deal. Same point of view here. > Then we also discussed whether extending libpqwalreceiver's connect > API is a good idea and whether we need to further extend it in the > future. As far as I can see, slotsync worker's primary requirement is > to execute SQL queries which the current API is sufficient, and don't > see something that needs any drastic change in this API. Note that > tablesync worker that executes SQL also uses these APIs, so we may > need something in the future for either of those. Then finally we need > a slotsync worker to also connect to a database to use SQL and fetch > results. > On my side the nits concerns about using the libpqrcv_connect / walrcv_connect are: - cosmetic: the "rcv" do not really align with the sync slot worker - we're using a WalReceiverConn, while a PGconn should suffice. From what I can see the "overhead" is (1 byte + 7 bytes hole + 8 bytes). I don't think that's a big deal even if we switch to a multi sync slot worker design later on. Those have already been discussed in [1] and I'm fine with them. > Now, let us consider if we extend the replication commands like > READ_REPLICATION_SLOT and or introduce a new set of replication > commands to fetch the required information then we don't need a DB > connection with primary or a connection in slotsync worker. As per my > current understanding, it is quite doable but I think we will slowly > go in the direction of making replication commands something like SQL > because today we need to extend it to fetch all slots info that have > failover marked as true, the existence of a particular replication, > etc. Then tomorrow, if we want to extend this work to have multiple > slotsync workers say workers perdb then we have to extend the > replication command to fetch per-database failover marked slots. To > me, it sounds more like we are slowly adding SQL-like features to > replication commands. Agree. Also it seems to me that extending the replication commands is more like a one-way door change. > Apart from this when we are reading per-db replication slots without > connecting to a database, we probably need some additional protection > mechanism so that the database won't get dropped. > > Considering all this it seems that for now probably extending > replication commands can simplify a few things like mentioned above > but using SQL's with db-connection is more extendable. I'd vote for using a SQL db-connection (like we are doing currently). It seems more extendable and more a two-way door (as compared to extending the replication commands): I think it still gives us the flexibility to switch to extending the replication commands if we want to in the future. [1]: https://www.postgresql.org/message-id/ZZe6sok7IWmhKReU%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Monday, January 22, 2024 11:36 AM shveta malik <shveta.malik@gmail.com> wrote: Hi, > On Fri, Jan 19, 2024 at 4:18 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > PFA v64. > > V64 fails to apply to HEAD due to a recent commit. Rebased it. PFA v64_2. It has > no new changes. I noticed a few things while analyzing the patch. 1. sleep_ms = Min(sleep_ms * 2, MAX_WORKER_NAPTIME_MS); The initial value of sleep_ms is 0 (the default value for a static variable), which will never be advanced by this expression. We should initialize sleep_ms to a positive number. 2. /* Wait a bit, we don't expect to have to wait long */ rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, 10L, WAIT_EVENT_BGWORKER_SHUTDOWN); The slotsync worker is not a bgworker anymore after the 0003 patch, so I think a new wait event is needed here. 3. slot->effective_catalog_xmin = xmin_horizon; The assignment is also needed in local_slot_update() to make ReplicationSlotsComputeRequiredXmin work. Best Regards, Hou zj
On Mon, Jan 22, 2024 at 1:11 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > Thanks for sharing the feedback. > > > Then we also discussed whether extending libpqwalreceiver's connect > > API is a good idea and whether we need to further extend it in the > > future. As far as I can see, slotsync worker's primary requirement is > > to execute SQL queries which the current API is sufficient, and don't > > see something that needs any drastic change in this API. Note that > > tablesync worker that executes SQL also uses these APIs, so we may > > need something in the future for either of those. Then finally we need > > a slotsync worker to also connect to a database to use SQL and fetch > > results. > > > > On my side the nits concerns about using the libpqrcv_connect / walrcv_connect are: > > - cosmetic: the "rcv" do not really align with the sync slot worker > But note that the same API is even used for apply worker as well. One can think that this is a connection used to receive WAL or slot_info. minor comments on the patch: ======================= 1. + /* First time slot update, the function must return true */ + Assert(local_slot_update(remote_slot)); Isn't moving this code to Assert in update_and_persist_slot() wrong? It will make this function call no-op in non-assert builds? 2. + ereport(LOG, + errmsg("newly locally created slot \"%s\" is sync-ready now", I think even without 'locally' in the above LOG message, it is clear. 3. +/* + * walrcv_get_dbinfo_for_failover_slots_fn + * + * Run LIST_DBID_FOR_FAILOVER_SLOTS on primary server to get the + * list of unique DBIDs for failover logical slots + */ +typedef List *(*walrcv_get_dbinfo_for_failover_slots_fn) (WalReceiverConn *conn); This looks like a leftover from the previous version of the patch. -- With Regards, Amit Kapila.
On Mon, Jan 22, 2024 at 3:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > minor comments on the patch: > ======================= PFA v65. It addresses the comments by Peter in [1], the comments by Hou-san in [2], and the comments by Amit in [3] and [4]. TODO: Analyze the issue reported by Sawada-san in [5] (pt 2); disallow subscription creation on standby with failover=true (as we do not support sync on cascading standbys). [1]: https://www.postgresql.org/message-id/CAHut%2BPt5Pk_xJkb54oahR%2Bf9oawgfnmbpewvkZPgnRhoJ3gkYg%40mail.gmail.com [2]: https://www.postgresql.org/message-id/OS0PR01MB57160C7184E17C6765AAE38294752%40OS0PR01MB5716.jpnprd01.prod.outlook.com [3]: https://www.postgresql.org/message-id/CAA4eK1JPB-zpGYTbVOP5Qp26tNQPMjDuYzNZ%2Ba9RFiN5nE1tEA%40mail.gmail.com [4]: https://www.postgresql.org/message-id/CAA4eK1Jhy1-bsu6vc0%3DNja7aw5-EK_%3D101pnnuM3ATqTA8%2B%3DSg%40mail.gmail.com [5]: https://www.postgresql.org/message-id/CAD21AoBgzONdt3o5mzbQ4MtqAE%3DWseiXUOq0LMqne-nWGjZBsA%40mail.gmail.com thanks Shveta
Attachment
- v65-0004-Allow-logical-walsenders-to-wait-for-the-physica.patch
- v65-0005-Non-replication-connection-and-app_name-change.patch
- v65-0001-Enable-setting-failover-property-for-a-slot-thro.patch
- v65-0002-Add-logical-slot-sync-capability-to-the-physical.patch
- v65-0003-Slot-sync-worker-as-a-special-process.patch
- v65-0006-Document-the-steps-to-check-if-the-standby-is-re.patch
On Mon, Jan 22, 2024 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 19, 2024 at 3:55 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Fri, Jan 19, 2024 at 10:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > Thank you for updating the patch. I have some comments: > > > > > > --- > > > + latestWalEnd = GetWalRcvLatestWalEnd(); > > > + if (remote_slot->confirmed_lsn > latestWalEnd) > > > + { > > > + elog(ERROR, "exiting from slot synchronization as the > > > received slot sync" > > > + " LSN %X/%X for slot \"%s\" is ahead of the > > > standby position %X/%X", > > > + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), > > > + remote_slot->name, > > > + LSN_FORMAT_ARGS(latestWalEnd)); > > > + } > > > > > > IIUC GetWalRcvLatestWalEnd () returns walrcv->latestWalEnd, which is > > > typically the primary server's flush position and doesn't mean the LSN > > > where the walreceiver received/flushed up to. > > > > yes. I think it makes more sense to use something which actually tells > > flushed-position. I gave it a try by replacing GetWalRcvLatestWalEnd() > > with GetWalRcvFlushRecPtr() but I see a problem here. Lets say I have > > enabled the slot-sync feature in a running standby, in that case we > > are all good (flushedUpto is the same as actual flush-position > > indicated by LogstreamResult.Flush). But if I restart standby, then I > > observed that the startup process sets flushedUpto to some value 'x' > > (see [1]) while when the wal-receiver starts, it sets > > 'LogstreamResult.Flush' to another value (see [2]) which is always > > greater than 'x'. And we do not update flushedUpto with the > > 'LogstreamResult.Flush' value in walreceiver until we actually do an > > operation on primary. Performing a data change on primary sends WALs > > to standby which then hits XLogWalRcvFlush() and updates flushedUpto > > same as LogstreamResult.Flush. Until then we have a situation where > > slots received on standby are ahead of flushedUpto and thus slotsync > > worker keeps on erroring out. I am yet to find out why flushedUpto is > > set to a lower value than 'LogstreamResult.Flush' at the start of > > standby. Or maybe am I using the wrong function > > GetWalRcvFlushRecPtr() and should be using something else instead? > > > > Can we think of using GetStandbyFlushRecPtr()? We probably need to > expose this function, if this works for the required purpose. I think we can. For the record, the problem with using flushedUpto (or GetWalRcvFlushRecPtr()) directly is that it is not set to the latest flushed position immediately after startup. It points to some prior location (perhaps a segment or page start) after startup until some data is flushed next, which then updates it to the latest flushed position, so we cannot use it directly. GetStandbyFlushRecPtr(), OTOH, takes care of this, i.e. it returns the correct flushed location at any point in time. I have changed v65 to use this one. thanks Shveta
On Fri, Jan 19, 2024 at 11:48 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for patch v63-0003. Thanks Peter. I have addressed all in v65. > > 4b. > It was a bit different when there were ERRORs but now they are LOGs; > somehow it seems wrong for this function to say what the *caller* will > do. Maybe you can rewrite all the errmsg so they don't say "skipping" > but they just say "bad configuration for slot synchronization" > > If valid is false then you can LOG "skipping" at the caller... I have made this change, but now in the log file we see 3 log lines like below; does it seem apt? Was the earlier one better, where we get the info in 2 lines? [34416] LOG: bad configuration for slot synchronization [34416] HINT: hot_standby_feedback must be enabled. [34416] LOG: skipping slot synchronization thanks Shveta
On Sat, Jan 20, 2024 at 7:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Jan 20, 2024 at 10:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, Jan 19, 2024 at 5:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Jan 17, 2024 at 4:00 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > > > I had some off-list discussions with Sawada-San, Hou-San, and Shveta > > > on the topic of extending replication commands instead of using the > > > current model where we fetch the required slot information via SQL > > > using a database connection. I would like to summarize the discussion > > > and would like to know the thoughts of others on this topic. > > > > > > In the current patch, we launch the slotsync worker on physical > > > standby which connects to the specified database (currently we let > > > users specify the required dbname in primary_conninfo) on the primary. > > > It then fetches the required information for failover marked slots > > > from the primary and also does some primitive checks on the upstream > > > node via SQL (the additional checks are like whether the upstream node > > > has a specified physical slot or whether the upstream node is a > > > primary node or a standby node). To fetch the required information it > > > uses a libpqwalreciever API which is mostly apt for this purpose as it > > > supports SQL execution but for this patch, we don't need a replication > > > connection, so we extend the libpqwalreciever connect API. > > > > What sort of extension we have done to 'libpqwalreciever'? Is it > > something like by default this supports replication connections so we > > have done an extension to the API so that we can provide an option > > whether to create a replication connection or a normal connection? > > > > Yeah and in the future there could be more as well. The other function > added walrcv_get_dbname_from_conninfo doesn't appear to be a problem > either for now. > > > > Now, the concerns related to this could be that users would probably > > > need to change existing mechanisms/tools to update priamry_conninfo I'm concerned about this. In fact, a primary_conninfo value generated by pg_basebackup does not work with enable_syncslot. > > > and one of the alternatives proposed is to have an additional GUC like > > > slot_sync_dbname. Users won't be able to drop the database this worker > > > is connected to aka whatever is specified in slot_sync_dbname but as > > > the user herself sets up the configuration it shouldn't be a big deal. > > > > Yeah for this purpose users may use template1 or so which they > > generally don't plan to drop. > > > > Using template1 has other problems like users won't be able to create > a new database. See [2] (point number 2.2) > > > > > So in case the user wants to drop that > > database user needs to turn off the slot syncing option and then it > > can be done? > > > > Right. If the user wants to continue using slot syncing, they need to switch the database to connect. Which requires modifying primary_conninfo and reloading the configuration file. Which further leads to restarting the physical replication. If they use synchronous replication, it means the application temporarily stops during that. > > > > Then we also discussed whether extending libpqwalreceiver's connect > > > API is a good idea and whether we need to further extend it in the > > > future. 
As far as I can see, slotsync worker's primary requirement is > > > to execute SQL queries which the current API is sufficient, and don't > > > see something that needs any drastic change in this API. Note that > > > tablesync worker that executes SQL also uses these APIs, so we may > > > need something in the future for either of those. Then finally we need > > > a slotsync worker to also connect to a database to use SQL and fetch > > > results. > > > > While looking into the patch v64-0002 I could not exactly point out > > what sort of extensions are there in libpqwalreceiver.c, I just saw > > one extra API for fetching the dbname from connection info? > > > > Right, the worry was that we may need it in the future. Yes. IIUC the slotsync worker uses libpqwalreceiver to establish a non-replication connection and to execute SQL query. But neither of them are relevant with replication. I'm a bit concerned that when we need to extend the slotsync feature in the future we will end up extending libpqwalreceiver, even if the new feature is not also relevant with replication. > > > > Now, let us consider if we extend the replication commands like > > > READ_REPLICATION_SLOT and or introduce a new set of replication > > > commands to fetch the required information then we don't need a DB > > > connection with primary or a connection in slotsync worker. As per my > > > current understanding, it is quite doable but I think we will slowly > > > go in the direction of making replication commands something like SQL > > > because today we need to extend it to fetch all slots info that have > > > failover marked as true, the existence of a particular replication, > > > etc. Then tomorrow, if we want to extend this work to have multiple > > > slotsync workers say workers perdb then we have to extend the > > > replication command to fetch per-database failover marked slots. To > > > me, it sounds more like we are slowly adding SQL-like features to > > > replication commands. Right. How about filtering slots on the standby side? That is, for example, the LIST_SLOT command returns all slots and the slotsync worker filters out non-failover slots. Also such command could potentially be used also in client tools like pg_basebackup, pg_receivewal, and pg_recvlogical to list the available replication slots to specify. > > > Considering all this it seems that for now probably extending > > > replication commands can simplify a few things like mentioned above > > > but using SQL's with db-connection is more extendable. > > Agreed. Having said that, considering Amit, Bertrand, and Dilip already agreed with the current design (using SQL's with db-connection), I might be worrying too much. So we can probably go with the current design and improve it if we find some problems. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
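To make the "SQL over a db-connection" approach discussed above concrete, the slotsync worker's fetch presumably boils down to something like the following sketch. The exact column list, constants, and error wording are guesses rather than the patch's code; the failover column is the one the patch adds to pg_replication_slots:

WalRcvExecResult *res;
Oid			slotRow[3] = {TEXTOID, TEXTOID, LSNOID};

/* Illustrative only: fetch failover-enabled logical slots over a normal
 * (non-replication) connection established through libpqwalreceiver. */
res = walrcv_exec(wrconn,
				  "SELECT slot_name, plugin, confirmed_flush_lsn"
				  " FROM pg_catalog.pg_replication_slots"
				  " WHERE failover AND NOT temporary",
				  3, slotRow);
if (res->status != WALRCV_OK_TUPLES)
	ereport(ERROR,
			errmsg("could not fetch failover slot information from the primary: %s",
				   res->err));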
On Mon, Jan 22, 2024 at 5:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Sat, Jan 20, 2024 at 7:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sat, Jan 20, 2024 at 10:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Fri, Jan 19, 2024 at 5:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Wed, Jan 17, 2024 at 4:00 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > > > > > > I had some off-list discussions with Sawada-San, Hou-San, and Shveta > > > > on the topic of extending replication commands instead of using the > > > > current model where we fetch the required slot information via SQL > > > > using a database connection. I would like to summarize the discussion > > > > and would like to know the thoughts of others on this topic. > > > > > > > > In the current patch, we launch the slotsync worker on physical > > > > standby which connects to the specified database (currently we let > > > > users specify the required dbname in primary_conninfo) on the primary. > > > > It then fetches the required information for failover marked slots > > > > from the primary and also does some primitive checks on the upstream > > > > node via SQL (the additional checks are like whether the upstream node > > > > has a specified physical slot or whether the upstream node is a > > > > primary node or a standby node). To fetch the required information it > > > > uses a libpqwalreciever API which is mostly apt for this purpose as it > > > > supports SQL execution but for this patch, we don't need a replication > > > > connection, so we extend the libpqwalreciever connect API. > > > > > > What sort of extension we have done to 'libpqwalreciever'? Is it > > > something like by default this supports replication connections so we > > > have done an extension to the API so that we can provide an option > > > whether to create a replication connection or a normal connection? > > > > > > > Yeah and in the future there could be more as well. The other function > > added walrcv_get_dbname_from_conninfo doesn't appear to be a problem > > either for now. > > > > > > Now, the concerns related to this could be that users would probably > > > > need to change existing mechanisms/tools to update priamry_conninfo > > I'm concerned about this. In fact, a primary_conninfo value generated > by pg_basebackup does not work with enable_syncslot. > Right, but if we want can't we extend pg_basebackup to do that? It is just that I am not sure that it is a good idea to extend pg_basebackup in the first version. > > > > and one of the alternatives proposed is to have an additional GUC like > > > > slot_sync_dbname. Users won't be able to drop the database this worker > > > > is connected to aka whatever is specified in slot_sync_dbname but as > > > > the user herself sets up the configuration it shouldn't be a big deal. > > > > > > Yeah for this purpose users may use template1 or so which they > > > generally don't plan to drop. > > > > > > > Using template1 has other problems like users won't be able to create > > a new database. See [2] (point number 2.2) > > > > > > > > So in case the user wants to drop that > > > database user needs to turn off the slot syncing option and then it > > > can be done? > > > > > > > Right. > > If the user wants to continue using slot syncing, they need to switch > the database to connect. Which requires modifying primary_conninfo and > reloading the configuration file. 
Which further leads to restarting > the physical replication. If they use synchronous replication, it > means the application temporarily stops during that. > Yes, that would be an inconvenience but the point is we don't expect this to change often. > > > > > > Then we also discussed whether extending libpqwalreceiver's connect > > > > API is a good idea and whether we need to further extend it in the > > > > future. As far as I can see, slotsync worker's primary requirement is > > > > to execute SQL queries which the current API is sufficient, and don't > > > > see something that needs any drastic change in this API. Note that > > > > tablesync worker that executes SQL also uses these APIs, so we may > > > > need something in the future for either of those. Then finally we need > > > > a slotsync worker to also connect to a database to use SQL and fetch > > > > results. > > > > > > While looking into the patch v64-0002 I could not exactly point out > > > what sort of extensions are there in libpqwalreceiver.c, I just saw > > > one extra API for fetching the dbname from connection info? > > > > > > > Right, the worry was that we may need it in the future. > > Yes. IIUC the slotsync worker uses libpqwalreceiver to establish a > non-replication connection and to execute SQL query. But neither of > them are relevant with replication. > But we are already using libpqwalreceiver to execute SQL queries via tablesync worker. I'm a bit concerned that when we > need to extend the slotsync feature in the future we will end up > extending libpqwalreceiver, even if the new feature is not also > relevant with replication. > > > > > > > Now, let us consider if we extend the replication commands like > > > > READ_REPLICATION_SLOT and or introduce a new set of replication > > > > commands to fetch the required information then we don't need a DB > > > > connection with primary or a connection in slotsync worker. As per my > > > > current understanding, it is quite doable but I think we will slowly > > > > go in the direction of making replication commands something like SQL > > > > because today we need to extend it to fetch all slots info that have > > > > failover marked as true, the existence of a particular replication, > > > > etc. Then tomorrow, if we want to extend this work to have multiple > > > > slotsync workers say workers perdb then we have to extend the > > > > replication command to fetch per-database failover marked slots. To > > > > me, it sounds more like we are slowly adding SQL-like features to > > > > replication commands. > > Right. How about filtering slots on the standby side? That is, for > example, the LIST_SLOT command returns all slots and the slotsync > worker filters out non-failover slots. > Yeah, we can do that but it could be unnecessary network overhead when there are very few failover slots. And it may not be just one time, we need to fetch slot information periodically. -- With Regards, Amit Kapila.
On Mon, Jan 22, 2024 at 9:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jan 22, 2024 at 5:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Sat, Jan 20, 2024 at 7:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Sat, Jan 20, 2024 at 10:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > > > On Fri, Jan 19, 2024 at 5:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Wed, Jan 17, 2024 at 4:00 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > > > > > > > > > I had some off-list discussions with Sawada-San, Hou-San, and Shveta > > > > > on the topic of extending replication commands instead of using the > > > > > current model where we fetch the required slot information via SQL > > > > > using a database connection. I would like to summarize the discussion > > > > > and would like to know the thoughts of others on this topic. > > > > > > > > > > In the current patch, we launch the slotsync worker on physical > > > > > standby which connects to the specified database (currently we let > > > > > users specify the required dbname in primary_conninfo) on the primary. > > > > > It then fetches the required information for failover marked slots > > > > > from the primary and also does some primitive checks on the upstream > > > > > node via SQL (the additional checks are like whether the upstream node > > > > > has a specified physical slot or whether the upstream node is a > > > > > primary node or a standby node). To fetch the required information it > > > > > uses a libpqwalreciever API which is mostly apt for this purpose as it > > > > > supports SQL execution but for this patch, we don't need a replication > > > > > connection, so we extend the libpqwalreciever connect API. > > > > > > > > What sort of extension we have done to 'libpqwalreciever'? Is it > > > > something like by default this supports replication connections so we > > > > have done an extension to the API so that we can provide an option > > > > whether to create a replication connection or a normal connection? > > > > > > > > > > Yeah and in the future there could be more as well. The other function > > > added walrcv_get_dbname_from_conninfo doesn't appear to be a problem > > > either for now. > > > > > > > > Now, the concerns related to this could be that users would probably > > > > > need to change existing mechanisms/tools to update priamry_conninfo > > > > I'm concerned about this. In fact, a primary_conninfo value generated > > by pg_basebackup does not work with enable_syncslot. > > > > Right, but if we want can't we extend pg_basebackup to do that? It is > just that I am not sure that it is a good idea to extend pg_basebackup > in the first version. Okay. > > > > > > and one of the alternatives proposed is to have an additional GUC like > > > > > slot_sync_dbname. Users won't be able to drop the database this worker > > > > > is connected to aka whatever is specified in slot_sync_dbname but as > > > > > the user herself sets up the configuration it shouldn't be a big deal. > > > > > > > > Yeah for this purpose users may use template1 or so which they > > > > generally don't plan to drop. > > > > > > > > > > Using template1 has other problems like users won't be able to create > > > a new database. See [2] (point number 2.2) > > > > > > > > > > > So in case the user wants to drop that > > > > database user needs to turn off the slot syncing option and then it > > > > can be done? > > > > > > > > > > Right. 
> > > > If the user wants to continue using slot syncing, they need to switch > > the database to connect. Which requires modifying primary_conninfo and > > reloading the configuration file. Which further leads to restarting > > the physical replication. If they use synchronous replication, it > > means the application temporarily stops during that. > > > > Yes, that would be an inconvenience but the point is we don't expect > this to change often. > > > > > > > > > Then we also discussed whether extending libpqwalreceiver's connect > > > > > API is a good idea and whether we need to further extend it in the > > > > > future. As far as I can see, slotsync worker's primary requirement is > > > > > to execute SQL queries which the current API is sufficient, and don't > > > > > see something that needs any drastic change in this API. Note that > > > > > tablesync worker that executes SQL also uses these APIs, so we may > > > > > need something in the future for either of those. Then finally we need > > > > > a slotsync worker to also connect to a database to use SQL and fetch > > > > > results. > > > > > > > > While looking into the patch v64-0002 I could not exactly point out > > > > what sort of extensions are there in libpqwalreceiver.c, I just saw > > > > one extra API for fetching the dbname from connection info? > > > > > > > > > > Right, the worry was that we may need it in the future. > > > > Yes. IIUC the slotsync worker uses libpqwalreceiver to establish a > > non-replication connection and to execute SQL query. But neither of > > them are relevant with replication. > > > > But we are already using libpqwalreceiver to execute SQL queries via > tablesync worker. IIUC tablesync workers do both SQL queries and replication commands. I think the slotsync worker is the first background process who does only SQL queries in a non-replication command ( using libpqwalreceiver). > > I'm a bit concerned that when we > > need to extend the slotsync feature in the future we will end up > > extending libpqwalreceiver, even if the new feature is not also > > relevant with replication. > > > > > > > > > > Now, let us consider if we extend the replication commands like > > > > > READ_REPLICATION_SLOT and or introduce a new set of replication > > > > > commands to fetch the required information then we don't need a DB > > > > > connection with primary or a connection in slotsync worker. As per my > > > > > current understanding, it is quite doable but I think we will slowly > > > > > go in the direction of making replication commands something like SQL > > > > > because today we need to extend it to fetch all slots info that have > > > > > failover marked as true, the existence of a particular replication, > > > > > etc. Then tomorrow, if we want to extend this work to have multiple > > > > > slotsync workers say workers perdb then we have to extend the > > > > > replication command to fetch per-database failover marked slots. To > > > > > me, it sounds more like we are slowly adding SQL-like features to > > > > > replication commands. > > > > Right. How about filtering slots on the standby side? That is, for > > example, the LIST_SLOT command returns all slots and the slotsync > > worker filters out non-failover slots. > > > > Yeah, we can do that but it could be unnecessary network overhead when > there are very few failover slots. And it may not be just one time, we > need to fetch slot information periodically. True. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Here are some review comments for v65-0002 ====== 0. General - GUCs in messages I think it would be better for the GUC names to all be quoted. It's not a rule (yet), but OTOH it seems to be the consensus most people want. See [1]. This might impact the following messages: 0.1 + ereport(ERROR, + errmsg("could not fetch primary_slot_name \"%s\" info from the" + " primary server: %s", PrimarySlotName, res->err)); SUGGESTION errmsg("could not fetch primary server slot \"%s\" info from the primary server: %s", ...) errhint("Check if \"primary_slot_name\" is configured correctly."); ~~~ 0.2 + if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot)) + elog(ERROR, "failed to fetch primary_slot_name tuple"); SUGGESTION elog(ERROR, "failed to fetch tuple for the primary server slot specified by \"primary_slot_name\""); ~~~ 0.3 + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("exiting from slot synchronization due to bad configuration"), + /* translator: second %s is a GUC variable name */ + errdetail("The primary server slot \"%s\" specified by %s is not valid.", + PrimarySlotName, "primary_slot_name")); /specified by %s/specified by \"%s\"/ ~~~ 0.4 + ereport(ERROR, + /* translator: %s is a GUC variable name */ + errmsg("exiting from slot synchronization due to bad configuration"), + errhint("%s must be defined.", "primary_slot_name")); /%s must be defined./\"%s\" must be defined./ ~~~ 0.5 + ereport(ERROR, + /* translator: %s is a GUC variable name */ + errmsg("exiting from slot synchronization due to bad configuration"), + errhint("%s must be enabled.", "hot_standby_feedback")); /%s must be enabled./\"%s\" must be enabled./ ~~~ 0.6 + ereport(ERROR, + /* translator: %s is a GUC variable name */ + errmsg("exiting from slot synchronization due to bad configuration"), + errhint("wal_level must be >= logical.")); errhint("\"wal_level\" must be >= logical.")) ~~~ 0.7 + if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0) + ereport(ERROR, + /* translator: %s is a GUC variable name */ + errmsg("exiting from slot synchronization due to bad configuration"), + errhint("%s must be defined.", "primary_conninfo")); /%s must be defined./\"%s\" must be defined./ ~~~ 0.8 + ereport(ERROR, + + /* + * translator: 'dbname' is a specific option; %s is a GUC variable + * name + */ + errmsg("exiting from slot synchronization due to bad configuration"), + errhint("'dbname' must be specified in %s.", "primary_conninfo")); /must be specified in %s./must be specified in \"%s\"./ ~~~ 0.9 + ereport(LOG, + errmsg("skipping slot synchronization"), + errdetail("enable_syncslot is disabled.")); errdetail("\"enable_syncslot\" is disabled.")); ====== src/backend/replication/logical/slotsync.c 1. +/* Min and Max sleep time for slot sync worker */ +#define MIN_WORKER_NAPTIME_MS 200 +#define MAX_WORKER_NAPTIME_MS 30000 /* 30s */ + +/* + * Sleep time in ms between slot-sync cycles. + * See wait_for_slot_activity() for how we adjust this + */ +static long sleep_ms = MIN_WORKER_NAPTIME_MS; These all belong together, so I think they share a combined comment like: SUGGESTION The sleep time (ms) between slot-sync cycles varies dynamically (within a MIN/MAX range) according to slot activity. See wait_for_slot_activity() for details. ~~~ 2. update_and_persist_slot + /* First time slot update, the function must return true */ + if(!local_slot_update(remote_slot)) + elog(ERROR, "failed to update slot"); Missing whitespace after 'if' ~~~ 3. 
synchronize_one_slot + ereport(ERROR, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("exiting from slot synchronization because same" + " name slot \"%s\" already exists on standby", + remote_slot->name), + errdetail("A user-created slot with the same name as" + " failover slot already exists on the standby.")); 3a. /on standby/on the standby/ ~ 3b. Now the errmsg is changed, the errdetail doesn't seem so useful. Isn't it repeating pretty much the same information as in the errmsg? ====== src/backend/replication/walsender.c 4. GetStandbyFlushRecPtr /* - * Returns the latest point in WAL that has been safely flushed to disk, and - * can be sent to the standby. This should only be called when in recovery, - * ie. we're streaming to a cascaded standby. + * Returns the latest point in WAL that has been safely flushed to disk. + * This should only be called when in recovery. + * Since it says "This should only be called when in recovery", should there also be a check for that (e.g. RecoveryInProgress) in the added Assert? ====== src/include/replication/walreceiver.h 5. typedef char *(*walrcv_identify_system_fn) (WalReceiverConn *conn, TimeLineID *primary_tli); +/* + * walrcv_get_dbname_from_conninfo_fn + * + * Returns the dbid from the primary_conninfo + */ +typedef char *(*walrcv_get_dbname_from_conninfo_fn) (const char *conninfo); It looks like a blank line that previously existed has been lost. ====== [1] https://www.postgresql.org/message-id/CAHut%2BPsf3NewXbsFKY88Qn1ON1_dMD6343MuWdMiiM2Ds9a_wA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
On Mon, Jan 22, 2024 at 8:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jan 22, 2024 at 9:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > Yes. IIUC the slotsync worker uses libpqwalreceiver to establish a > > > non-replication connection and to execute SQL query. But neither of > > > them are relevant with replication. > > > > > > > But we are already using libpqwalreceiver to execute SQL queries via > > tablesync worker. > > IIUC tablesync workers do both SQL queries and replication commands. I > think the slotsync worker is the first background process who does > only SQL queries in a non-replication command ( using > libpqwalreceiver). > Yes, I agree, but till now we haven't seen any problem with the same. -- With Regards, Amit Kapila.
>
> On Mon, Jan 22, 2024 at 3:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > minor comments on the patch:
> > =======================
>
> PFA v65 addressing the comments.
>
> Addressed comments by Peter in [1], comments by Hou-San in [2],
> comments by Amit in [3] and [4]
>
> TODO:
> Analyze the issue reported by Swada-san in [5] (pt 2)
> Disallow subscription creation on standby with failover=true (as we do
> not support sync on cascading standbys)
>
> [1]: https://www.postgresql.org/message-id/CAHut%2BPt5Pk_xJkb54oahR%2Bf9oawgfnmbpewvkZPgnRhoJ3gkYg%40mail.gmail.com
> [2]: https://www.postgresql.org/message-id/OS0PR01MB57160C7184E17C6765AAE38294752%40OS0PR01MB5716.jpnprd01.prod.outlook.com
> [3]: https://www.postgresql.org/message-id/CAA4eK1JPB-zpGYTbVOP5Qp26tNQPMjDuYzNZ%2Ba9RFiN5nE1tEA%40mail.gmail.com
> [4]: https://www.postgresql.org/message-id/CAA4eK1Jhy1-bsu6vc0%3DNja7aw5-EK_%3D101pnnuM3ATqTA8%2B%3DSg%40mail.gmail.com
> [5]: https://www.postgresql.org/message-id/CAD21AoBgzONdt3o5mzbQ4MtqAE%3DWseiXUOq0LMqne-nWGjZBsA%40mail.gmail.com
>
>
I was doing some testing on this. What I noticed is that creating subscriptions with failover enabled is taking a lot longer compared with a subscription with failover disabled. The setup has the primary configured with standby_slot_names, and that standby has enable_syncslot turned on.
Publisher has one publication, no tables.
subscriber:
postgres=# \timing
Timing is on.
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres host=localhost port=6972' PUBLICATION pub with (failover = true);
NOTICE: created replication slot "sub" on publisher
CREATE SUBSCRIPTION
Time: 10011.829 ms (00:10.012)
== drop the sub
postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres host=localhost port=6972' PUBLICATION pub with (failover = false);
NOTICE: created replication slot "sub" on publisher
CREATE SUBSCRIPTION
Time: 46.317 ms
With failover=true, it takes 10011 ms while failover=false takes 46 ms.
I don't see a similar delay when creating slot on the primary with pg_create_logical_replication_slot() with failover flag enabled.
Then on primary:
postgres=# SELECT 'init' FROM pg_create_logical_replication_slot('lsub2_slot', 'pgoutput', false, false, true);
?column?
----------
init
(1 row)
Time: 36.125 ms
postgres=# SELECT 'init' FROM pg_create_logical_replication_slot('lsub1_slot', 'pgoutput', false, false, false);
?column?
----------
init
(1 row)
Time: 53.981 ms
regards,
Ajin Cherian
Fujitsu Australia
On Tue, Jan 23, 2024 at 2:38 PM Ajin Cherian <itsajin@gmail.com> wrote: > > I was doing some testing on this. What I noticed is that creating subscriptions with failover enabled is taking a lot longer compared with a subscription with failover disabled. The setup has the primary configured with standby_slot_names and that standby is enabled with enable_syncslot turned on. > Thanks Ajin for testing the patch. PFA v66 which fixes this issue. The overall changes in this version are: patch 001: 1) Restricted enabling failover for user-created slots on standby. 2) Fixed a wrong NOTICE during alter-sub which was always saying that 'changed the failover state to false' even if it was switched to true. patch 002: 3) Addressed Peter's comment in [1] patch 003: 4) Fixed the drop-db issue reported by Sawada-San in [2] 5) Added other signal-handlers. 6) Fixed CFBot Windows compilation failure. patch 004: 7) Fixed the issue reported by Ajin above in [3]. The performance issue was due to the additional wait in WalSndWaitForWal() for failover slots. Create Subscription calls DecodingContextFindStartpoint() which then reads WALs to build the initial snapshot, which ends up calling WalSndWaitForWal(), which waits for standby confirmation in the case of failover slots. Addressed it by skipping the wait during Create Sub as it is not needed there. We now wait only if 'replication_active' is true. Thanks Nisha for reporting the NOTICE issue (addressed in 2) and working on issue #6. Thanks Hou-San for working on #7. [1]: https://www.postgresql.org/message-id/CAHut%2BPs6p6Km8_Hfy6X0KTuyqBKkhC84u23sQnnkhqkHuDL%2BDQ%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAD21AoBgzONdt3o5mzbQ4MtqAE%3DWseiXUOq0LMqne-nWGjZBsA%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CAFPTHDbsZ%2BpxAubb9d9BwVNt5OB3_2s77bG6nHcAgUPPhEVmMQ%40mail.gmail.com thanks Shveta
Attachment
- v66-0003-Slot-sync-worker-as-a-special-process.patch
- v66-0001-Enable-setting-failover-property-for-a-slot-thro.patch
- v66-0005-Non-replication-connection-and-app_name-change.patch
- v66-0004-Allow-logical-walsenders-to-wait-for-the-physica.patch
- v66-0002-Add-logical-slot-sync-capability-to-the-physical.patch
- v66-0006-Document-the-steps-to-check-if-the-standby-is-re.patch
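For the CREATE SUBSCRIPTION delay described in the mail above, the fix boils down to making the wait in WalSndWaitForWal() conditional. A rough sketch of that shape; only replication_active and the slot's failover flag come from the thread, and the helper name is hypothetical:

/*
 * Sketch only: wait for the standbys listed in standby_slot_names to confirm
 * only once streaming has actually started (replication_active), i.e. not
 * while the initial snapshot is being built during CREATE SUBSCRIPTION.
 */
if (MyReplicationSlot->data.failover && replication_active)
	WalSndWaitForStandbyConfirmation(loc);	/* hypothetical helper */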
On Tue, Jan 23, 2024 at 9:45 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v65-0002 Thanks Peter for the feedback. I have addressed these in v66. > > 4. GetStandbyFlushRecPtr > > /* > - * Returns the latest point in WAL that has been safely flushed to disk, and > - * can be sent to the standby. This should only be called when in recovery, > - * ie. we're streaming to a cascaded standby. > + * Returns the latest point in WAL that has been safely flushed to disk. > + * This should only be called when in recovery. > + * > > Since it says "This should only be called when in recovery", should > there also be a check for that (e.g. RecoveryInProgress) in the added > Assert? Since 'am_cascading_walsender' and 'IsLogicalSlotSyncWorker' make sense 'in recovery' only, I think an explicit check for 'RecoveryInProgress' is not needed here. But I can add it if others also think it is needed. thanks Shveta
Here are some comments for patch v66-0001. ====== doc/src/sgml/catalogs.sgml 1. + <para> + If true, the associated replication slots (i.e. the main slot and the + table sync slots) in the upstream database are enabled to be + synchronized to the physical standbys + </para></entry> /physical standbys/physical standby/ I wondered if it is better just to say singular "standby" instead of "standbys" in places like this; e.g. plural might imply cascading for some readers. There are a number of examples like this, so I've repeated the same comment multiple times below. If you disagree, please just ignore all of them. ====== doc/src/sgml/func.sgml 2. that the decoding of prepared transactions is enabled for this - slot. A call to this function has the same effect as the replication - protocol command <literal>CREATE_REPLICATION_SLOT ... LOGICAL</literal>. + slot. The optional fifth parameter, + <parameter>failover</parameter>, when set to true, + specifies that this slot is enabled to be synced to the + physical standbys so that logical replication can be resumed + after failover. A call to this function has the same effect as + the replication protocol command + <literal>CREATE_REPLICATION_SLOT ... LOGICAL</literal>. </para></entry> (same as above) /physical standbys/physical standby/ Also, I don't see anything else on this page using plural "standbys". ====== doc/src/sgml/protocol.sgml 3. CREATE_REPLICATION_SLOT + <varlistentry> + <term><literal>FAILOVER [ <replaceable class="parameter">boolean</replaceable> ]</literal></term> + <listitem> + <para> + If true, the slot is enabled to be synced to the physical + standbys so that logical replication can be resumed after failover. + The default is false. + </para> + </listitem> + </varlistentry> (same as above) /physical standbys/physical standby/ ~~~ 4. ALTER_REPLICATION_SLOT + <variablelist> + <varlistentry> + <term><literal>FAILOVER [ <replaceable class="parameter">boolean</replaceable> ]</literal></term> + <listitem> + <para> + If true, the slot is enabled to be synced to the physical + standbys so that logical replication can be resumed after failover. + </para> + </listitem> + </varlistentry> + </variablelist> (same as above) /physical standbys/physical standby/ ====== doc/src/sgml/ref/create_subscription.sgml 5. + <varlistentry id="sql-createsubscription-params-with-failover"> + <term><literal>failover</literal> (<type>boolean</type>)</term> + <listitem> + <para> + Specifies whether the replication slots associated with the subscription + are enabled to be synced to the physical standbys so that logical + replication can be resumed from the new primary after failover. + The default is <literal>false</literal>. + </para> + </listitem> + </varlistentry> (same as above) /physical standbys/physical standby/ ====== doc/src/sgml/system-views.sgml 6. + + <row> + <entry role="catalog_table_entry"><para role="column_definition"> + <structfield>failover</structfield> <type>bool</type> + </para> + <para> + True if this is a logical slot enabled to be synced to the physical + standbys so that logical replication can be resumed from the new primary + after failover. Always false for physical slots. + </para></entry> + </row> (same as above) /physical standbys/physical standby/ ====== src/backend/commands/subscriptioncmds.c 7. 
+ if (IsSet(opts.specified_opts, SUBOPT_FAILOVER)) + { + if (!sub->slotname) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot set failover for a subscription that does not have a slot name"))); + + /* + * Do not allow changing the failover state if the + * subscription is enabled. This is because the failover + * state of the slot on the publisher cannot be modified if + * the slot is currently acquired by the apply worker. + */ + if (sub->enabled) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot set %s for enabled subscription", + "failover"))); + + values[Anum_pg_subscription_subfailover - 1] = + BoolGetDatum(opts.failover); + replaces[Anum_pg_subscription_subfailover - 1] = true; + } The first message is not consistent with the second. The "failover" option maybe should be extracted so it won't be translated. SUGGESTION errmsg("cannot set %s for a subscription that does not have a slot name", "failover") ~~~ 8. AlterSubscription + if (!wrconn) + ereport(ERROR, + (errcode(ERRCODE_CONNECTION_FAILURE), + errmsg("could not connect to the publisher: %s", err))); + Need to keep an eye on the patch proposed by Nisha [1] for messages similar to this one, so in case that gets pushed this code should be changed appropriately. ====== src/backend/replication/slot.c 9. * during getting changes, if the two_phase option is enabled it can skip * prepare because by that time start decoding point has been moved. So the * user will only get commit prepared. + * failover: If enabled, allows the slot to be synced to physical standbys so + * that logical replication can be resumed after failover. */ (same as earlier) /physical standbys/physical standby/ ~~~ 10. + /* + * Do not allow users to alter slots to enable failover on the standby + * as we do not support sync to the cascading standby. + */ + if (RecoveryInProgress() && failover) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot alter replication slot to have failover" + " enabled on the standby")); I felt the errmsg could be expressed with less ambiguity: SUGGESTION: cannot enable failover for a replication slot on the standby ====== src/backend/replication/slotfuncs.c 11. create_physical_replication_slot /* acquire replication slot, this will check for conflicting names */ ReplicationSlotCreate(name, false, - temporary ? RS_TEMPORARY : RS_PERSISTENT, false); + temporary ? RS_TEMPORARY : RS_PERSISTENT, false, + false); Having an inline comment might be helpful here instead of passing "false,false" SUGGESTION ReplicationSlotCreate(name, false, temporary ? RS_TEMPORARY : RS_PERSISTENT, false, false /* failover */); ~~~ 12. create_logical_replication_slot + /* + * Do not allow users to create the slots with failover enabled on the + * standby as we do not support sync to the cascading standby. + */ + if (RecoveryInProgress() && failover) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot create replication slot with failover" + " enabled on the standby")); (similar to previous comment) SUGGESTION: cannot enable failover for a replication slot created on the standby ~~~ 13. copy_replication_slot * hence pass find_startpoint false. confirmed_flush will be set * below, by copying from the source slot. + * + * To avoid potential issues with the slotsync worker when the + * restart_lsn of a replication slot goes backwards, we set the + * failover option to false here. 
This situation occurs when a slot on + * the primary server is dropped and immediately replaced with a new + * slot of the same name, created by copying from another existing + * slot. However, the slotsync worker will only observe the restart_lsn + * of the same slot going backwards. */ create_logical_replication_slot(NameStr(*dst_name), plugin, temporary, false, + false, src_restart_lsn, false); (similar to an earlier comment) Having an inline comment might be helpful here. e.g. false /* failover */, ====== src/backend/replication/walreceiver.c 14. - walrcv_create_slot(wrconn, slotname, true, false, 0, NULL); + walrcv_create_slot(wrconn, slotname, true, false, false, 0, NULL); (similar to an earlier comment) Having an inline comment might be helpful here: SUGGESTION walrcv_create_slot(wrconn, slotname, true, false, false /* failover */, 0, NULL); ====== src/backend/replication/walsender.c 15. CreateReplicationSlot ReplicationSlotCreate(cmd->slotname, false, cmd->temporary ? RS_TEMPORARY : RS_PERSISTENT, - false); + false, false); (similar to an earlier comment) Having an inline comment might be helpful here. e.g. false /* failover */, ~~~ 16. CreateReplicationSlot + /* + * Do not allow users to create the slots with failover enabled on the + * standby as we do not support sync to the cascading standby. + */ + if (RecoveryInProgress() && failover) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot create replication slot with failover" + " enabled on the standby")); + /* * Initially create persistent slot as ephemeral - that allows us to * nicely handle errors during initialization because it'll get @@ -1243,7 +1265,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd) */ ReplicationSlotCreate(cmd->slotname, true, cmd->temporary ? RS_TEMPORARY : RS_EPHEMERAL, - two_phase); + two_phase, failover); This errmsg seems to be repeated in a few places, so I wondered if this code can be refactored to call direct to create_logical_replication_slot() so the errmsg can be just once in a common place. OTOH, if it cannot be refactored, then needs to be using same errmsg as suggested by earlier review comments (see above). ====== src/include/catalog/pg_subscription.h 17. + bool subfailover; /* True if the associated replication slots + * (i.e. the main slot and the table sync + * slots) in the upstream database are enabled + * to be synchronized to the physical + * standbys. */ (same as earlier) /physical standbys/physical standby/ ~~~ 18. + bool failover; /* True if the associated replication slots + * (i.e. the main slot and the table sync + * slots) in the upstream database are enabled + * to be synchronized to the physical + * standbys. */ (same as earlier) /physical standbys/physical standby/ ====== src/include/replication/slot.h 19. + + /* + * Is this a failover slot (sync candidate for physical standbys)? Only + * relevant for logical slots on the primary server. + */ + bool failover; (same as earlier) /physical standbys/physical standby/ ====== [1] Nisha errmsg - https://www.postgresql.org/message-id/CABdArM5-VR4Akt_AHap_0Ofne0cTcsdnN6FcNe%2BMU8eXsa_ERQ%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Jan 24, 2024 at 8:52 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some comments for patch v66-0001. > > ====== > doc/src/sgml/catalogs.sgml > > 1. > + <para> > + If true, the associated replication slots (i.e. the main slot and the > + table sync slots) in the upstream database are enabled to be > + synchronized to the physical standbys > + </para></entry> > > /physical standbys/physical standby/ > > I wondered if it is better just to say singular "standby" instead of > "standbys" in places like this; e.g. plural might imply cascading for > some readers. > I don't think it is confusing, as we use it in a similar way in the docs. We can probably avoid using "physical" in places similar to the above, as that is implied. > > > ====== > src/backend/replication/slotfuncs.c > > 11. create_physical_replication_slot > > /* acquire replication slot, this will check for conflicting names */ > ReplicationSlotCreate(name, false, > - temporary ? RS_TEMPORARY : RS_PERSISTENT, false); > + temporary ? RS_TEMPORARY : RS_PERSISTENT, false, > + false); > > Having an inline comment might be helpful here instead of passing "false,false" > > SUGGESTION > ReplicationSlotCreate(name, false, > temporary ? RS_TEMPORARY : RS_PERSISTENT, false, > false /* failover */); > I don't think we follow the practice of using inline comments. I feel that sometimes makes code difficult to read, especially when we have multiple such parameters. -- With Regards, Amit Kapila.
On Mon, Jan 22, 2024 at 3:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 19, 2024 at 3:55 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Fri, Jan 19, 2024 at 10:35 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > Thank you for updating the patch. I have some comments: > > > > > > --- > > > + latestWalEnd = GetWalRcvLatestWalEnd(); > > > + if (remote_slot->confirmed_lsn > latestWalEnd) > > > + { > > > + elog(ERROR, "exiting from slot synchronization as the > > > received slot sync" > > > + " LSN %X/%X for slot \"%s\" is ahead of the > > > standby position %X/%X", > > > + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), > > > + remote_slot->name, > > > + LSN_FORMAT_ARGS(latestWalEnd)); > > > + } > > > > > > IIUC GetWalRcvLatestWalEnd () returns walrcv->latestWalEnd, which is > > > typically the primary server's flush position and doesn't mean the LSN > > > where the walreceiver received/flushed up to. > > > > yes. I think it makes more sense to use something which actually tells > > flushed-position. I gave it a try by replacing GetWalRcvLatestWalEnd() > > with GetWalRcvFlushRecPtr() but I see a problem here. Lets say I have > > enabled the slot-sync feature in a running standby, in that case we > > are all good (flushedUpto is the same as actual flush-position > > indicated by LogstreamResult.Flush). But if I restart standby, then I > > observed that the startup process sets flushedUpto to some value 'x' > > (see [1]) while when the wal-receiver starts, it sets > > 'LogstreamResult.Flush' to another value (see [2]) which is always > > greater than 'x'. And we do not update flushedUpto with the > > 'LogstreamResult.Flush' value in walreceiver until we actually do an > > operation on primary. Performing a data change on primary sends WALs > > to standby which then hits XLogWalRcvFlush() and updates flushedUpto > > same as LogstreamResult.Flush. Until then we have a situation where > > slots received on standby are ahead of flushedUpto and thus slotsync > > worker keeps one erroring out. I am yet to find out why flushedUpto is > > set to a lower value than 'LogstreamResult.Flush' at the start of > > standby. Or maybe am I using the wrong function > > GetWalRcvFlushRecPtr() and should be using something else instead? > > > > Can we think of using GetStandbyFlushRecPtr()? We probably need to > expose this function, if this works for the required purpose. GetStandbyFlushRecPtr() seems good. But do we really want to raise an ERROR in this case? IIUC this case could happen often when the slot used by the standby is not listed in standby_slot_names. I think we can just skip such a slot to synchronize and check it the next time. Here are random comments on slotsyncworker.c (v66): --- The postmaster relaunches the slotsync worker without intervals. So if a connection string in primary_conninfo is not correct, many errors are emitted. --- +/* GUC variable */ +bool enable_syncslot = false; Is enable_syncslot a really good name? We use "enable" prefix only for planner parameters such as enable_seqscan, and it seems to me that "slot" is not specific. 
Other candidates are: * synchronize_replication_slots = on|off * synchronize_failover_slots = on|off --- + elog(ERROR, + "cannot synchronize local slot \"%s\" LSN(%X/%X)" + " to remote slot's LSN(%X/%X) as synchronization" + " would move it backwards", remote_slot->name, Many error messages in slotsync.c are split across several lines, which I think reduces greppability when the user looks for the error message in the source code. --- + SpinLockAcquire(&slot->mutex); + slot->data.database = get_database_oid(remote_slot->database, false); + namestrcpy(&slot->data.plugin, remote_slot->plugin); We should not access syscaches while holding a spinlock. --- + SpinLockAcquire(&slot->mutex); + slot->data.database = get_database_oid(remote_slot->database, false); + namestrcpy(&slot->data.plugin, remote_slot->plugin); + SpinLockRelease(&slot->mutex); Similarly, it's better to avoid calling namestrcpy() while holding a spinlock, as we do in CreateInitDecodingContext(). --- + SpinLockAcquire(&SlotSyncWorker->mutex); + + SlotSyncWorker->stopSignaled = true; + + if (SlotSyncWorker->pid == InvalidPid) + { + SpinLockRelease(&SlotSyncWorker->mutex); + return; + } + + kill(SlotSyncWorker->pid, SIGINT); + + SpinLockRelease(&SlotSyncWorker->mutex); It's better to avoid making a system call while holding a spinlock. --- + BackgroundWorkerUnblockSignals(); I think it's no longer necessary. --- + ereport(LOG, + /* translator: %s is a GUC variable name */ + errmsg("bad configuration for slot synchronization"), + errhint("\"wal_level\" must be >= logical.")); There is no '%s' in the errmsg string. --- +/* + * Cleanup function for logical replication launcher. + * + * Called on logical replication launcher exit. + */ IIUC this function is never called by the logical replication launcher. --- + /* + * The slot sync worker can not get here because it will only stop when it + * receives a SIGINT from the logical replication launcher, or when there + * is an error. + */ + Assert(false); This comment is not correct. IIUC the slotsync worker receives a SIGINT from the startup process. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
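For the spinlock comments above, the usual shape of the fix is to do the syscache lookup and the string copy outside the critical section and only publish plain assignments under the mutex, as CreateInitDecodingContext() already does; a sketch of that shape, not the patch's final code:

Oid			remote_dbid;
NameData	plugin_name;

/* Do the catalog lookup and the copy before taking the spinlock */
remote_dbid = get_database_oid(remote_slot->database, false);
namestrcpy(&plugin_name, remote_slot->plugin);

SpinLockAcquire(&slot->mutex);
slot->data.database = remote_dbid;
slot->data.plugin = plugin_name;	/* plain struct assignment under the lock */
SpinLockRelease(&slot->mutex);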
On Wed, Jan 24, 2024 at 10:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Jan 22, 2024 at 3:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Can we think of using GetStandbyFlushRecPtr()? We probably need to > > expose this function, if this works for the required purpose. > > GetStandbyFlushRecPtr() seems good. But do we really want to raise an > ERROR in this case? IIUC this case could happen often when the slot > used by the standby is not listed in standby_slot_names. > or it can be due to some bug in the code as well. > I think we > can just skip such a slot to synchronize and check it the next time. > How about logging the message and then skipping the sync step? This will at least make users aware that they might have missed setting standby_slot_names. > Here are random comments on slotsyncworker.c (v66): > > +/* GUC variable */ > +bool enable_syncslot = false; > > Is enable_syncslot a really good name? We use "enable" prefix only for > planner parameters such as enable_seqscan, and it seems to me that > "slot" is not specific. Other candidates are: > > * synchronize_replication_slots = on|off > * synchronize_failover_slots = on|off > I would prefer the second one. Would it be better to just say sync_failover_slots? -- With Regards, Amit Kapila.
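A rough sketch of that log-and-skip shape, assuming the caller iterates over the remote slots and treats a false return as "retry in the next sync cycle"; names and message wording are illustrative, not the patch's code:

XLogRecPtr	standby_flush_lsn = GetStandbyFlushRecPtr(NULL);	/* assuming NULL timeline is accepted */

if (remote_slot->confirmed_lsn > standby_flush_lsn)
{
	ereport(LOG,
			errmsg("skipping synchronization of slot \"%s\"", remote_slot->name),
			errdetail("The remote slot's LSN %X/%X is ahead of the standby's flush position %X/%X.",
					  LSN_FORMAT_ARGS(remote_slot->confirmed_lsn),
					  LSN_FORMAT_ARGS(standby_flush_lsn)),
			errhint("Check that this standby's slot is listed in \"standby_slot_names\" on the primary."));
	return false;		/* check this slot again in the next sync cycle */
}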
On Wed, Jan 24, 2024 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 24, 2024 at 10:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Mon, Jan 22, 2024 at 3:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Can we think of using GetStandbyFlushRecPtr()? We probably need to > > > expose this function, if this works for the required purpose. > > > > GetStandbyFlushRecPtr() seems good. But do we really want to raise an > > ERROR in this case? IIUC this case could happen often when the slot > > used by the standby is not listed in standby_slot_names. > > > > or it can be due to some bug in the code as well. > > > I think we > > can just skip such a slot to synchronize and check it the next time. > > > > How about logging the message and then skipping the sync step? This > will at least make users aware that they could be missing to set > standby_slot_names. +1 > > > Here are random comments on slotsyncworker.c (v66): > > > > +/* GUC variable */ > > +bool enable_syncslot = false; > > > > Is enable_syncslot a really good name? We use "enable" prefix only for > > planner parameters such as enable_seqscan, and it seems to me that > > "slot" is not specific. Other candidates are: > > > > * synchronize_replication_slots = on|off > > * synchronize_failover_slots = on|off > > > > I would prefer the second one. Would it be better to just say > sync_failover_slots? Works for me. But if we want to extend this option for non-failover slots as well in the future, synchronize_replication_slots (or sync_replication_slots) seems better. We can extend it by having an enum later. For example, the values can be on, off, or failover etc. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
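If the GUC were later generalized to an enum as suggested here, it would presumably follow the usual config_enum_entry pattern; a purely illustrative sketch, with none of these names settled:

typedef enum SyncReplicationSlotsMode
{
	SYNC_REPLICATION_SLOTS_OFF,
	SYNC_REPLICATION_SLOTS_FAILOVER,	/* sync only failover-enabled logical slots */
	SYNC_REPLICATION_SLOTS_ON			/* reserved for "all supported slots" later */
} SyncReplicationSlotsMode;

static const struct config_enum_entry sync_replication_slots_options[] = {
	{"off", SYNC_REPLICATION_SLOTS_OFF, false},
	{"failover", SYNC_REPLICATION_SLOTS_FAILOVER, false},
	{"on", SYNC_REPLICATION_SLOTS_ON, false},
	{NULL, 0, false}
};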
On Wed, Jan 24, 2024 at 11:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jan 24, 2024 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > +/* GUC variable */ > > > +bool enable_syncslot = false; > > > > > > Is enable_syncslot a really good name? We use "enable" prefix only for > > > planner parameters such as enable_seqscan, and it seems to me that > > > "slot" is not specific. Other candidates are: > > > > > > * synchronize_replication_slots = on|off > > > * synchronize_failover_slots = on|off > > > > > > > I would prefer the second one. Would it be better to just say > > sync_failover_slots? > > Works for me. But if we want to extend this option for non-failover > slots as well in the future, synchronize_replication_slots (or > sync_replication_slots) seems better. We can extend it by having an > enum later. For example, the values can be on, off, or failover etc. > I see your point. Let us see if others have any suggestions on this. -- With Regards, Amit Kapila.
Hi, On Wed, Jan 24, 2024 at 01:51:54PM +0530, Amit Kapila wrote: > On Wed, Jan 24, 2024 at 11:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Jan 24, 2024 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > +/* GUC variable */ > > > > +bool enable_syncslot = false; > > > > > > > > Is enable_syncslot a really good name? We use "enable" prefix only for > > > > planner parameters such as enable_seqscan, and it seems to me that > > > > "slot" is not specific. Other candidates are: > > > > > > > > * synchronize_replication_slots = on|off > > > > * synchronize_failover_slots = on|off > > > > > > > > > > I would prefer the second one. Would it be better to just say > > > sync_failover_slots? > > > > Works for me. But if we want to extend this option for non-failover > > slots as well in the future, synchronize_replication_slots (or > > sync_replication_slots) seems better. We can extend it by having an > > enum later. For example, the values can be on, off, or failover etc. > > > > I see your point. Let us see if others have any suggestions on this. I also see Sawada-San's point and I'd vote for "sync_replication_slots". Then for the current feature I think "failover" and "on" should be the values to turn the feature on (assuming "on" would mean "all kind of supported slots"). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Jan 23, 2024 at 5:13 PM shveta malik <shveta.malik@gmail.com> wrote: > > Thanks Ajin for testing the patch. PFA v66 which fixes this issue. > I think we should try to commit the patch as all of the design concerns are resolved now. To achieve that, can we split the failover setting patch into the following: (a) setting failover property via SQL commands and display it in pg_replication_slots (b) replication protocol command (c) failover property via subscription commands? It will make each patch smaller and it would be easier to detect any problem in the same after commit. -- With Regards, Amit Kapila.
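For reference, a minimal sketch of surfaces (a) and (c) of that proposed split, assuming the interfaces land roughly as in the patch set under discussion (slot, publication, and subscription names here are hypothetical; surface (b), the replication protocol command, is sketched further down in the thread):

    -- (a) failover property via SQL command, displayed in pg_replication_slots
    SELECT pg_create_logical_replication_slot('myslot', 'pgoutput', failover => true);
    SELECT slot_name, failover FROM pg_replication_slots;

    -- (c) failover property via subscription commands
    CREATE SUBSCRIPTION mysub CONNECTION 'host=primary dbname=postgres'
        PUBLICATION mypub WITH (failover = true);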
On Wed, Jan 24, 2024 at 2:38 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Wed, Jan 24, 2024 at 01:51:54PM +0530, Amit Kapila wrote: > > On Wed, Jan 24, 2024 at 11:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Wed, Jan 24, 2024 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > > > > +/* GUC variable */ > > > > > +bool enable_syncslot = false; > > > > > > > > > > Is enable_syncslot a really good name? We use "enable" prefix only for > > > > > planner parameters such as enable_seqscan, and it seems to me that > > > > > "slot" is not specific. Other candidates are: > > > > > > > > > > * synchronize_replication_slots = on|off > > > > > * synchronize_failover_slots = on|off > > > > > > > > > > > > > I would prefer the second one. Would it be better to just say > > > > sync_failover_slots? > > > > > > Works for me. But if we want to extend this option for non-failover > > > slots as well in the future, synchronize_replication_slots (or > > > sync_replication_slots) seems better. We can extend it by having an > > > enum later. For example, the values can be on, off, or failover etc. > > > > > > > I see your point. Let us see if others have any suggestions on this. > > I also see Sawada-San's point and I'd vote for "sync_replication_slots". Then for > the current feature I think "failover" and "on" should be the values to turn the > feature on (assuming "on" would mean "all kind of supported slots"). Even if others agree and we change this GUC name to "sync_replication_slots", I feel we should keep the values as "on" and "off" currently, where "on" would mean 'sync failover slots' (docs can state that clearly). I do not think we should support sync of "all kinds of supported slots" in the first version. Maybe we can think about it for future versions. thanks Shveta
On Wed, Jan 24, 2024 at 4:09 PM shveta malik <shveta.malik@gmail.com> wrote: > > Even if others agree and we change this GUC name to > "sync_replication_slots", I feel we should keep the values as "on" and > "off" currently, where "on" would mean 'sync failover slots' (docs can > state that clearly). I do not think we should support sync of "all > kinds of supported slots" in the first version. Maybe we can think > about it for future versions. PFA v67. Note that the GUC (enable_syncslot) name is unchanged. Once we have final agreement on the name, we can make the change in the next version. Changes in v67 are: 1) Addressed comments by Peter given in [1]. 2) Addressed comments by Swada-San given in [2]. 3) Removed syncing 'failover' on standby from remote_slot. The 'failover' field will be false for synced slots. Since we do not support sync to cascading standbys yet, thus failover=true was misleading and unused there. Thanks Hou-San for contributing in 2. Changes are split across patch001,002 and 003. TODO: --Split patch-001 as suggested in [3]. --Change GUC name. [1]: https://www.postgresql.org/message-id/CAHut%2BPu_uK%3D%3DM%2BVmCMug7m7O6LAwpC05A%3DT7zP8c4G2-hS%2Bbdg%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAD21AoApGoTZu7D_7%3DbVYQqKnj%2BPZ2Rz%2Bnc8Ky1HPQMS_XL6%2BA%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CAA4eK1Lxvfq9RwOEsguiMCrKPUc1He9UGz1_wi0N0cJaXFa4Eg%40mail.gmail.com thanks Shveta
Attachment
- v67-0005-Non-replication-connection-and-app_name-change.patch
- v67-0001-Enable-setting-failover-property-for-a-slot-thro.patch
- v67-0002-Add-logical-slot-sync-capability-to-the-physical.patch
- v67-0003-Slot-sync-worker-as-a-special-process.patch
- v67-0004-Allow-logical-walsenders-to-wait-for-the-physica.patch
- v67-0006-Document-the-steps-to-check-if-the-standby-is-re.patch
Here are some review comments for the patch v67-0001. ====== 1. There are a couple of places checking for failover usage on a standby. + if (RecoveryInProgress() && failover) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable failover for a replication slot" + " created on the standby")); and + if (RecoveryInProgress() && failover) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable failover for a replication slot" + " on the standby")); IMO the conditions should be written the other way around (failover && RecoveryInProgress()) to avoid the unnecessary function calls when 'failover' flag is probably mostly default false anyway. ====== Kind Regards, Peter Smith. Fujitsu Australia
On Wednesday, January 24, 2024 6:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 23, 2024 at 5:13 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > Thanks Ajin for testing the patch. PFA v66 which fixes this issue. > > > > I think we should try to commit the patch as all of the design concerns are > resolved now. To achieve that, can we split the failover setting patch into the > following: (a) setting failover property via SQL commands and display it in > pg_replication_slots (b) replication protocol command (c) failover property via > subscription commands? > > It will make each patch smaller and it would be easier to detect any problem in > the same after commit. Agreed. I split the original 0001 patch into 3 patches as suggested. Here is the V68 patch set. Best Regards, Hou zj
Attachment
- v68-0005-Slot-sync-worker-as-a-special-process.patch
- v68-0006-Allow-logical-walsenders-to-wait-for-the-physica.patch
- v68-0007-Non-replication-connection-and-app_name-change.patch
- v68-0008-Document-the-steps-to-check-if-the-standby-is-re.patch
- v68-0001-Add-the-failover-property-to-replication-slot.patch
- v68-0002-Allow-setting-failover-property-in-the-replicati.patch
- v68-0003-Add-a-failover-option-to-subscriptions.patch
- v68-0004-Add-logical-slot-sync-capability-to-the-physical.patch
On Wed, Jan 24, 2024 at 5:17 PM shveta malik <shveta.malik@gmail.com> wrote: > > PFA v67. Note that the GUC (enable_syncslot) name is unchanged. Once > we have final agreement on the name, we can make the change in the > next version. > > Changes in v67 are: > > 1) Addressed comments by Peter given in [1]. > 2) Addressed comments by Swada-San given in [2]. > 3) Removed syncing 'failover' on standby from remote_slot. The > 'failover' field will be false for synced slots. Since we do not > support sync to cascading standbys yet, thus failover=true was > misleading and unused there. > But what will happen after the standby is promoted? After promotion, ideally, it should have failover enabled, so that the slots can be synced. Also, note that corresponding subscriptions still have the failover flag enabled. I think we should copy the 'failover' option for the synced slots. -- With Regards, Amit Kapila.
On Wednesday, January 24, 2024 1:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Here are random comments on slotsyncworker.c (v66): Thanks for the comments: > > --- > + elog(ERROR, > + "cannot synchronize local slot \"%s\" LSN(%X/%X)" > + " to remote slot's LSN(%X/%X) as synchronization" > + " would move it backwards", remote_slot->name, > > Many error messages in slotsync.c are splitted into several lines, but I think it > would reduce the greppability when the user looks for the error message in the > source code. Thanks for the suggestion! we combined most of the messages in the new version patch. Although some messages including the above one were kept splitted, because It's too long(> 120 col including the indent) to fit into the screen, so I feel it's better to keep these messages splitted. Best Regards, Hou zj
Here are some review comments for v67-0002. ====== src/backend/replication/logical/slotsync.c 1. +/* The sleep time (ms) between slot-sync cycles varies dynamically + * (within a MIN/MAX range) according to slot activity. See + * wait_for_slot_activity() for details. + */ +#define MIN_WORKER_NAPTIME_MS 200 +#define MAX_WORKER_NAPTIME_MS 30000 /* 30s */ + +static long sleep_ms = MIN_WORKER_NAPTIME_MS; In my previous review for this, I meant for there to be no whitespace between the #defines and the static long sleep_ms so the prior comment then clearly belongs to all 3 lines ~~~ 2. synchronize_one_slot + /* + * Sanity check: Make sure that concerned WAL is received and flushed + * before syncing slot to target lsn received from the primary server. + * + * This check should never pass as on the primary server, we have waited + * for the standby's confirmation before updating the logical slot. + */ + latestFlushPtr = GetStandbyFlushRecPtr(NULL); + if (remote_slot->confirmed_lsn > latestFlushPtr) + { + ereport(LOG, + errmsg("skipping slot synchronization as the received slot sync" + " LSN %X/%X for slot \"%s\" is ahead of the standby position %X/%X", + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), + remote_slot->name, + LSN_FORMAT_ARGS(latestFlushPtr))); + + return false; + } Previously in v65 this was an elog, but now it is an ereport. But since this is a sanity check condition that "should never pass" wasn't the elog the more appropriate choice? ~~~ 3. synchronize_one_slot + /* + * We don't want any complicated code while holding a spinlock, so do + * namestrcpy() and get_database_oid() outside. + */ + namestrcpy(&plugin_name, remote_slot->plugin); + dbid = get_database_oid(remote_slot->database, false); IMO just simplify the whole comment, here and for the other similar comment in local_slot_update(). SUGGESTION /* Avoid expensive operations while holding a spinlock. */ ~~~ 4. synchronize_slots + /* Construct the remote_slot tuple and synchronize each slot locally */ + tupslot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple); + while (tuplestore_gettupleslot(res->tuplestore, true, false, tupslot)) + { + bool isnull; + RemoteSlot *remote_slot = palloc0(sizeof(RemoteSlot)); + Datum d; + + remote_slot->name = TextDatumGetCString(slot_getattr(tupslot, 1, &isnull)); + Assert(!isnull); + + remote_slot->plugin = TextDatumGetCString(slot_getattr(tupslot, 2, &isnull)); + Assert(!isnull); + + /* + * It is possible to get null values for LSN and Xmin if slot is + * invalidated on the primary server, so handle accordingly. + */ + d = slot_getattr(tupslot, 3, &isnull); + remote_slot->confirmed_lsn = isnull ? InvalidXLogRecPtr : + DatumGetLSN(d); + + d = slot_getattr(tupslot, 4, &isnull); + remote_slot->restart_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d); + + d = slot_getattr(tupslot, 5, &isnull); + remote_slot->catalog_xmin = isnull ? InvalidTransactionId : + DatumGetTransactionId(d); + + remote_slot->two_phase = DatumGetBool(slot_getattr(tupslot, 6, &isnull)); + Assert(!isnull); + + remote_slot->database = TextDatumGetCString(slot_getattr(tupslot, + 7, &isnull)); + Assert(!isnull); + + d = slot_getattr(tupslot, 8, &isnull); + remote_slot->invalidated = isnull ? RS_INVAL_NONE : + get_slot_invalidation_cause(TextDatumGetCString(d)); Would it be better to get rid of the hardwired column numbers and then be able to use the SLOTSYNC_COLUMN_COUNT already defined as a sanity check? SUGGESTION int col = 0; ... 
remote_slot->name = TextDatumGetCString(slot_getattr(tupslot, ++col, &isnull)); ... remote_slot->plugin = TextDatumGetCString(slot_getattr(tupslot, ++col, &isnull)); ... d = slot_getattr(tupslot, ++col, &isnull); remote_slot->confirmed_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d); ... d = slot_getattr(tupslot, ++col, &isnull); remote_slot->restart_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d); ... d = slot_getattr(tupslot, ++col, &isnull); remote_slot->catalog_xmin = isnull ? InvalidTransactionId : DatumGetTransactionId(d); ... remote_slot->two_phase = DatumGetBool(slot_getattr(tupslot, ++col, &isnull)); ... remote_slot->database = TextDatumGetCString(slot_getattr(tupslot, ++col, &isnull)); ... d = slot_getattr(tupslot, ++col, &isnull); remote_slot->invalidated = isnull ? RS_INVAL_NONE : get_slot_invalidation_cause(TextDatumGetCString(d)); /* Sanity check */ Asert(col == SLOTSYNC_COLUMN_COUNT); ~~~ 5. +static char * +validate_parameters_and_get_dbname(void) +{ + char *dbname; These are configuration issues, so probably all these ereports could also set errcode(ERRCODE_INVALID_PARAMETER_VALUE). ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi, On Wed, Jan 24, 2024 at 04:09:15PM +0530, shveta malik wrote: > On Wed, Jan 24, 2024 at 2:38 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > I also see Sawada-San's point and I'd vote for "sync_replication_slots". Then for > > the current feature I think "failover" and "on" should be the values to turn the > > feature on (assuming "on" would mean "all kind of supported slots"). > > Even if others agree and we change this GUC name to > "sync_replication_slots", I feel we should keep the values as "on" and > "off" currently, where "on" would mean 'sync failover slots' (docs can > state that clearly). I gave more thoughts on it and I think the values should only be "failover" or "off". The reason is that if we allow "on" and change the "on" behavior in future versions (to support more than failover slots) then that would change the behavior for the ones that used "on". That's right that we can mention it in the docs, but there is still the risk of users not reading the doc (that's why I think that it would be good if we can put this extra "safety" in the code too). > I do not think we should support sync of "all > kinds of supported slots" in the first version. Maybe we can think > about it for future versions. Yeah I think the same (I was mentioning the future "on" behavior up-thread). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
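To make the naming trade-off concrete, a sketch of how the standby-side setting would look under the two proposals (both spellings are exactly what is still being decided here, so treat them as hypothetical):

    -- proposal 1: boolean, meaning "sync failover slots" for now
    ALTER SYSTEM SET sync_failover_slots = on;

    -- proposal 2: enum-style, extensible later (e.g. off / failover / on)
    ALTER SYSTEM SET sync_replication_slots = 'failover';

    -- followed by a reload or restart, depending on the GUC's final context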
On Thu, Jan 25, 2024 at 9:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > 3) Removed syncing 'failover' on standby from remote_slot. The > > 'failover' field will be false for synced slots. Since we do not > > support sync to cascading standbys yet, thus failover=true was > > misleading and unused there. > > > > But what will happen after the standby is promoted? After promotion, > ideally, it should have failover enabled, so that the slots can be > synced. Also, note that corresponding subscriptions still have the > failover flag enabled. I think we should copy the 'failover' option > for the synced slots. Yes, right, missed this point earlier. I will make the change in the next version. thanks Shveta
Hi, On Thu, Jan 25, 2024 at 02:57:30AM +0000, Zhijie Hou (Fujitsu) wrote: > On Wednesday, January 24, 2024 6:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jan 23, 2024 at 5:13 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > Thanks Ajin for testing the patch. PFA v66 which fixes this issue. > > > > > > > I think we should try to commit the patch as all of the design concerns are > > resolved now. To achieve that, can we split the failover setting patch into the > > following: (a) setting failover property via SQL commands and display it in > > pg_replication_slots (b) replication protocol command (c) failover property via > > subscription commands? > > > > It will make each patch smaller and it would be easier to detect any problem in > > the same after commit. > > Agreed. I split the original 0001 patch into 3 patches as suggested. > Here is the V68 patch set. Thanks! Some comments. Looking at 0002: 1 === + <para>The following options are supported:</para> What about "The following option is supported"? (as currently only the "FAILOVER" is) 2 === What about adding some TAP tests too? (I can see that ALTER_REPLICATION_SLOT test is added in v68-0004 but I think having some in 0002 would make sense too). Looking at 0003: 1 === + parameter specified in the subscription. When creating the slot, + ensure the slot <literal>failover</literal> property matches the + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> + parameter value of the subscription. What about explaining what would be the consequence of not doing so? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
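For reference, a rough sketch of how the new protocol pieces from 0002/0004 would be exercised from psql over a logical replication connection (slot name and database are hypothetical, and the exact option syntax is still subject to review):

    $ psql "dbname=postgres replication=database" \
        -c "CREATE_REPLICATION_SLOT myslot LOGICAL pgoutput (FAILOVER);"
    $ psql "dbname=postgres replication=database" \
        -c "ALTER_REPLICATION_SLOT myslot (FAILOVER false);"

The TAP tests being discussed would assert roughly this behaviour, including the failover column flipping in pg_replication_slots.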
On Thu, Jan 25, 2024 at 1:25 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Thu, Jan 25, 2024 at 02:57:30AM +0000, Zhijie Hou (Fujitsu) wrote: > > > > Agreed. I split the original 0001 patch into 3 patches as suggested. > > Here is the V68 patch set. Thanks, I have pushed 0001. > > Thanks! > > Some comments. > > Looking at 0002: > > 1 === > > + <para>The following options are supported:</para> > > What about "The following option is supported"? (as currently only the "FAILOVER" > is) > > 2 === > > What about adding some TAP tests too? (I can see that ALTER_REPLICATION_SLOT test > is added in v68-0004 but I think having some in 0002 would make sense too). > The subscription tests in v68-0003 will test this functionality. The one advantage of adding separate tests for this is that if in the future we extend this replication command, it could be convenient to test various options. However, the same could be said about existing replication commands as well. But is it worth having extra tests which will be anyway covered in the next commit in a few days? I understand that it is a good idea and makes one comfortable to have tests for each separate commit but OTOH, in the longer term it will just be adding more test time without achieving much benefit. I think we can tell explicitly in the commit message of this patch that the subsequent commit will cover the tests for this functionality One minor comment on 0002: + so that logical replication can be resumed after failover. + </para> Can we move this and similar comments or doc changes to the later 0004 patch where we are syncing the slots? -- With Regards, Amit Kapila.
On Thu, Jan 25, 2024 at 10:39 AM Peter Smith <smithpb2250@gmail.com> wrote: > 2. synchronize_one_slot > > + /* > + * Sanity check: Make sure that concerned WAL is received and flushed > + * before syncing slot to target lsn received from the primary server. > + * > + * This check should never pass as on the primary server, we have waited > + * for the standby's confirmation before updating the logical slot. > + */ > + latestFlushPtr = GetStandbyFlushRecPtr(NULL); > + if (remote_slot->confirmed_lsn > latestFlushPtr) > + { > + ereport(LOG, > + errmsg("skipping slot synchronization as the received slot sync" > + " LSN %X/%X for slot \"%s\" is ahead of the standby position %X/%X", > + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), > + remote_slot->name, > + LSN_FORMAT_ARGS(latestFlushPtr))); > + > + return false; > + } > > Previously in v65 this was an elog, but now it is an ereport. But > since this is a sanity check condition that "should never pass" wasn't > the elog the more appropriate choice? We realized that this scenario can be frequently hit when the user has not set standby_slot_names on primary. And thus ereport makes more sense. But I agree that this comment is misleading. We will adjust the comment in the next version. thanks Shveta
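Since this situation is easy to get into, a minimal sketch of the primary-side setting involved, using the GUC name from the current patch set (the physical slot name is hypothetical):

    -- on the primary: physical slots of the standbys that logical walsenders
    -- associated with failover slots must wait for before sending changes
    ALTER SYSTEM SET standby_slot_names = 'physical_standby1';
    SELECT pg_reload_conf();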
Hi, On Thu, Jan 25, 2024 at 03:54:45PM +0530, Amit Kapila wrote: > On Thu, Jan 25, 2024 at 1:25 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > On Thu, Jan 25, 2024 at 02:57:30AM +0000, Zhijie Hou (Fujitsu) wrote: > > > > > > Agreed. I split the original 0001 patch into 3 patches as suggested. > > > Here is the V68 patch set. > > Thanks, I have pushed 0001. > > > > > Thanks! > > > > Some comments. > > > > Looking at 0002: > > > > 1 === > > > > + <para>The following options are supported:</para> > > > > What about "The following option is supported"? (as currently only the "FAILOVER" > > is) > > > > 2 === > > > > What about adding some TAP tests too? (I can see that ALTER_REPLICATION_SLOT test > > is added in v68-0004 but I think having some in 0002 would make sense too). > > > > The subscription tests in v68-0003 will test this functionality. The > one advantage of adding separate tests for this is that if in the > future we extend this replication command, it could be convenient to > test various options. However, the same could be said about existing > replication commands as well. I initially did check for "START_REPLICATION" and I saw it's part of 006_logical_decoding.pl (but did not check all the "REPLICATION" commands). That said, it's more a Nit and I think it's fine with having the test in v68-0004 (as it is currently done) + the ones in v68-0003. > But is it worth having extra tests which > will be anyway covered in the next commit in a few days? > > I understand that it is a good idea and makes one comfortable to have > tests for each separate commit but OTOH, in the longer term it will > just be adding more test time without achieving much benefit. I think > we can tell explicitly in the commit message of this patch that the > subsequent commit will cover the tests for this functionality Yeah, I think that's enough (at least someone reading the commit message, the diff changes and not following this dedicated thread closely would know the lack of test is not a miss). > One minor comment on 0002: > + so that logical replication can be resumed after failover. > + </para> > > Can we move this and similar comments or doc changes to the later 0004 > patch where we are syncing the slots? Sure. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thursday, January 25, 2024 6:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 25, 2024 at 1:25 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > On Thu, Jan 25, 2024 at 02:57:30AM +0000, Zhijie Hou (Fujitsu) wrote: > > > > > > Agreed. I split the original 0001 patch into 3 patches as suggested. > > > Here is the V68 patch set. > > Thanks, I have pushed 0001. > > > > > Thanks! > > > > Some comments. > > > > Looking at 0002: > > > > 1 === > > > > + <para>The following options are supported:</para> > > > > What about "The following option is supported"? (as currently only the > "FAILOVER" > > is) > > > > 2 === > > > > What about adding some TAP tests too? (I can see that > > ALTER_REPLICATION_SLOT test is added in v68-0004 but I think having some > in 0002 would make sense too). > > > > The subscription tests in v68-0003 will test this functionality. The one > advantage of adding separate tests for this is that if in the future we extend this > replication command, it could be convenient to test various options. However, > the same could be said about existing replication commands as well. But is it > worth having extra tests which will be anyway covered in the next commit in a > few days? > > I understand that it is a good idea and makes one comfortable to have tests for > each separate commit but OTOH, in the longer term it will just be adding more > test time without achieving much benefit. I think we can tell explicitly in the > commit message of this patch that the subsequent commit will cover the tests > for this functionality Agreed. > > One minor comment on 0002: > + so that logical replication can be resumed after failover. > + </para> > > Can we move this and similar comments or doc changes to the later 0004 patch > where we are syncing the slots? Thanks for the comment. Here is the V69 patch set which includes the following changes. V69-0001, V69-0002 1) Addressed Bertrand's comments[1]. V69-0003 1) Addressed Peter's comment in [2], [3] 2) Addressed Amit's comment in [4] and above. 3) Fixed one issue that the startup process may report ERROR if it tries to drop the same slot that the slotsync worker is acquiring. Now we take shared lock on db in slot-sync worker before we create, update or drop any of its slots. This is done to prevent potential conflict with ReplicationSlotsDropDBSlots() in case that database is dropped in parallel. V69-0004 1) Rebased and fixed one CFbot failure. V69-0005, V69-0006, V69-0007 1) Rebased. Thanks Shveta for rebasing and working for the changes on 0003~0007. [1] https://www.postgresql.org/message-id/ZbIT9Kj3d8TFD8h6%40ip-10-97-1-34.eu-west-3.compute.internal [2]: https://www.postgresql.org/message-id/CAHut%2BPt2oLfxv_%3DGN23dOOduKHBHdAkCvwSZiwSbtTJFFbQm-w%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CAHut%2BPtsDYPbg7qM1nGWtJcSQBQ5JH%3DLmgyqwqBPL9k%2Bz8f5Ew%40mail.gmail.com [4]: https://www.postgresql.org/message-id/CAA4eK1%2B4PhO-f4%2B2fForG6MOEj3jbtee_PYPtwtgww%3DonC5DSQ%40mail.gmail.com Best Regards, Hou zj
Attachment
- v69-0006-Non-replication-connection-and-app_name-change.patch
- v69-0007-Document-the-steps-to-check-if-the-standby-is-re.patch
- v69-0003-Add-logical-slot-sync-capability-to-the-physical.patch
- v69-0004-Slot-sync-worker-as-a-special-process.patch
- v69-0005-Allow-logical-walsenders-to-wait-for-the-physica.patch
- v69-0002-Add-a-failover-option-to-subscriptions.patch
- v69-0001-Allow-setting-failover-property-in-the-replicati.patch
Hi, On Thu, Jan 25, 2024 at 01:11:50PM +0000, Zhijie Hou (Fujitsu) wrote: > Here is the V69 patch set which includes the following changes. > > V69-0001, V69-0002 > 1) Addressed Bertrand's comments[1]. Thanks! V69-0001 LGTM. As far V69-0002 I just have one more last remark: + */ + if (sub->enabled) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot set %s for enabled subscription", + "failover"))); Worth to add a test for it in 050_standby_failover_slots_sync.pl? (I had a quick look and it does not seem to be covered). Remarks, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
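For context, a sketch of the behaviour such a test would pin down (subscription name hypothetical; the error text is the one quoted above):

    ALTER SUBSCRIPTION regress_mysub1 SET (failover = true);
    ERROR:  cannot set failover for enabled subscription

    -- the supported sequence: disable first, change failover, re-enable
    ALTER SUBSCRIPTION regress_mysub1 DISABLE;
    ALTER SUBSCRIPTION regress_mysub1 SET (failover = true);
    ALTER SUBSCRIPTION regress_mysub1 ENABLE;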
On Thu, Jan 25, 2024 at 6:42 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Here is the V69 patch set which includes the following changes. > > V69-0001, V69-0002 > Few minor comments on v69-0001 1. In libpqrcv_create_slot(), I see we are using two types of syntaxes based on 'use_new_options_syntax' (aka server_version >= 15) whereas this new 'failover' option doesn't follow that. What is the reason of the same? I thought it is because older versions anyway won't support this option. However, I guess we should follow the syntax of the old server and let it error out. BTW, did you test this patch with old server versions (say < 15 and >=15) by directly using replication commands, if so, what is the behavior of same? 2. } - + if (failover) + appendStringInfoString(&cmd, "FAILOVER, "); Spurious line removal. Also, to follow a coding pattern similar to nearby code, let's have one empty line after handling of failover. 3. +/* ALTER_REPLICATION_SLOT slot */ +alter_replication_slot: + K_ALTER_REPLICATION_SLOT IDENT '(' generic_option_list ')' I think it would be better if we follow the create style by specifying syntax in comments as that can make the code easier to understand after future extensions to this command if any. See create_replication_slot: /* CREATE_REPLICATION_SLOT slot [TEMPORARY] PHYSICAL [options] */ K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_PHYSICAL create_slot_options -- With Regards, Amit Kapila.
On Saturday, January 27, 2024 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 25, 2024 at 6:42 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > Here is the V69 patch set which includes the following changes. > > > > V69-0001, V69-0002 > > > > Few minor comments on v69-0001 > 1. In libpqrcv_create_slot(), I see we are using two types of syntaxes based on > 'use_new_options_syntax' (aka server_version >= 15) whereas this new 'failover' > option doesn't follow that. What is the reason of the same? I thought it is > because older versions anyway won't support this option. However, I guess we > should follow the syntax of the old server and let it error out. Changed as suggested. > BTW, did you test > this patch with old server versions (say < 15 and >=15) by directly using > replication commands, if so, what is the behavior of same? Yes, I tested it. We cannot use new failover option or new alter_replication_slot on server <17, the errors we will get are as follows: Using failover option in create_replication_slot on server 15 ~ 16 ERROR: unrecognized option: failover Using failover option in create_replication_slot on server < 15 ERROR: syntax error Alter_replication_slot on server < 17 ERROR: syntax error at or near "ALTER_REPLICATION_SLOT" > > 2. > } > - > + if (failover) > + appendStringInfoString(&cmd, "FAILOVER, "); > > Spurious line removal. Also, to follow a coding pattern similar to nearby code, > let's have one empty line after handling of failover. Changed. > > 3. > +/* ALTER_REPLICATION_SLOT slot */ > +alter_replication_slot: > + K_ALTER_REPLICATION_SLOT IDENT '(' generic_option_list ')' > > I think it would be better if we follow the create style by specifying syntax in > comments as that can make the code easier to understand after future > extensions to this command if any. See > create_replication_slot: > /* CREATE_REPLICATION_SLOT slot [TEMPORARY] PHYSICAL [options] */ > K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_PHYSICAL > create_slot_options Changed. Attach the V70 patch set which addressed above comments and Bertrand's comments in [1] [1] https://www.postgresql.org/message-id/ZbNt1oRZRcdIAw2c%40ip-10-97-1-34.eu-west-3.compute.internal Best Regards, Hou zj
Attachment
- v70-0006-Non-replication-connection-and-app_name-change.patch
- v70-0004-Slot-sync-worker-as-a-special-process.patch
- v70-0007-Document-the-steps-to-check-if-the-standby-is-re.patch
- v70-0001-Allow-setting-failover-property-in-the-replicati.patch
- v70-0002-Add-a-failover-option-to-subscriptions.patch
- v70-0003-Add-logical-slot-sync-capability-to-the-physical.patch
- v70-0005-Allow-logical-walsenders-to-wait-for-the-physica.patch
Here are some review comments for v70-0001. ====== doc/src/sgml/protocol.sgml 1. Related to this, please also review my other patch to the same docs page protocol.sgml [1]. ====== src/backend/replication/logical/tablesync.c 2. walrcv_create_slot(LogRepWorkerWalRcvConn, slotname, false /* permanent */ , false /* two_phase */ , + false, CRS_USE_SNAPSHOT, origin_startpos); I know it was previously mentioned in this thread that inline parameter comments are unnecessary, but here they are already in the existing code so shouldn't we do the same? ====== src/backend/replication/repl_gram.y 3. +/* ALTER_REPLICATION_SLOT slot options */ +alter_replication_slot: + K_ALTER_REPLICATION_SLOT IDENT '(' generic_option_list ')' + { + AlterReplicationSlotCmd *cmd; + cmd = makeNode(AlterReplicationSlotCmd); + cmd->slotname = $2; + cmd->options = $4; + $$ = (Node *) cmd; + } + ; + IMO write that comment with parentheses, so it matches the code. SUGGESTION ALTER_REPLICATION_SLOT slot ( options ) ====== [1] https://www.postgresql.org/message-id/CAHut%2BPtDWSmW8uiRJF1LfGQJikmo7V2jdysLuRmtsanNZc7fNw%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
On Mon, Jan 29, 2024 at 9:21 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v70-0001. > > ====== > doc/src/sgml/protocol.sgml > > 1. > Related to this, please also review my other patch to the same docs > page protocol.sgml [1]. > We can check that separately. > ====== > src/backend/replication/logical/tablesync.c > > 2. > walrcv_create_slot(LogRepWorkerWalRcvConn, > slotname, false /* permanent */ , false /* two_phase */ , > + false, > CRS_USE_SNAPSHOT, origin_startpos); > > I know it was previously mentioned in this thread that inline > parameter comments are unnecessary, but here they are already in the > existing code so shouldn't we do the same? > I think it is better to remove the even existing ones as those many times make code difficult to read. -- With Regards, Amit Kapila.
On Sat, Jan 27, 2024 at 12:02 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Attach the V70 patch set which addressed above comments and Bertrand's comments in [1] > Since v70-0001 is pushed, rebased and attached v70_2 patches. There are no new changes. thanks Shveta
Attachment
- v70_2-0002-Add-logical-slot-sync-capability-to-the-physic.patch
- v70_2-0001-Add-a-failover-option-to-subscriptions.patch
- v70_2-0004-Allow-logical-walsenders-to-wait-for-the-physi.patch
- v70_2-0003-Slot-sync-worker-as-a-special-process.patch
- v70_2-0005-Non-replication-connection-and-app_name-change.patch
- v70_2-0006-Document-the-steps-to-check-if-the-standby-is-.patch
Hi, On Mon, Jan 29, 2024 at 10:24:11AM +0530, shveta malik wrote: > On Sat, Jan 27, 2024 at 12:02 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > Attach the V70 patch set which addressed above comments and Bertrand's comments in [1] > > > > Since v70-0001 is pushed, rebased and attached v70_2 patches. There > are no new changes. Thanks! Looking at 0001: + When altering the + <link linkend="sql-createsubscription-params-with-slot-name"><literal>slot_name</literal></link>, + the <literal>failover</literal> property value of the named slot may differ from the + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> + parameter specified in the subscription. When creating the slot, + ensure the slot <literal>failover</literal> property matches the + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> + parameter value of the subscription. Otherwise, the slot on the publisher may + not be enabled to be synced to standbys. Not related to this patch series but while at it shouldn't we also add a few words about two_phase too? (I mean ensure the slot property matchs the subscription one). Or would it be better to create a dedicated patch (outside of this thread) for the "two_phase" remark? (If so I can take care of it). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
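A sketch of the kind of check such documentation would point users at, assuming the catalog columns from the patches under discussion (subfailover) next to the existing two_phase information; the slot and subscription names are hypothetical:

    -- on the publisher: properties of the manually created slot
    SELECT slot_name, two_phase, failover
    FROM pg_replication_slots WHERE slot_name = 'mysub';

    -- on the subscriber: the corresponding subscription options
    SELECT subname, subtwophasestate, subfailover
    FROM pg_subscription WHERE subname = 'mysub';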
On Mon, Jan 29, 2024 at 2:22 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Mon, Jan 29, 2024 at 10:24:11AM +0530, shveta malik wrote: > > On Sat, Jan 27, 2024 at 12:02 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > Attach the V70 patch set which addressed above comments and Bertrand's comments in [1] > > > > > > > Since v70-0001 is pushed, rebased and attached v70_2 patches. There > > are no new changes. > > Thanks! > > Looking at 0001: > > + When altering the > + <link linkend="sql-createsubscription-params-with-slot-name"><literal>slot_name</literal></link>, > + the <literal>failover</literal> property value of the named slot may differ from the > + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> > + parameter specified in the subscription. When creating the slot, > + ensure the slot <literal>failover</literal> property matches the > + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> > + parameter value of the subscription. Otherwise, the slot on the publisher may > + not be enabled to be synced to standbys. > > Not related to this patch series but while at it shouldn't we also add a few > words about two_phase too? (I mean ensure the slot property matchs the > subscription one). > > Or would it be better to create a dedicated patch (outside of this thread) for > the "two_phase" remark? (If so I can take care of it). > I think it is better to create a separate patch for two_phase after this patch gets committed. -- With Regards, Amit Kapila.
Hi, On Mon, Jan 29, 2024 at 02:35:52PM +0530, Amit Kapila wrote: > On Mon, Jan 29, 2024 at 2:22 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > Looking at 0001: > > > > + When altering the > > + <link linkend="sql-createsubscription-params-with-slot-name"><literal>slot_name</literal></link>, > > + the <literal>failover</literal> property value of the named slot may differ from the > > + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> > > + parameter specified in the subscription. When creating the slot, > > + ensure the slot <literal>failover</literal> property matches the > > + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> > > + parameter value of the subscription. Otherwise, the slot on the publisher may > > + not be enabled to be synced to standbys. > > > > Not related to this patch series but while at it shouldn't we also add a few > > words about two_phase too? (I mean ensure the slot property matchs the > > subscription one). > > > > Or would it be better to create a dedicated patch (outside of this thread) for > > the "two_phase" remark? (If so I can take care of it). > > > > I think it is better to create a separate patch for two_phase after > this patch gets committed. Yeah, makes sense, will do, thanks! Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Jan 29, 2024 at 9:35 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > 2. > > walrcv_create_slot(LogRepWorkerWalRcvConn, > > slotname, false /* permanent */ , false /* two_phase */ , > > + false, > > CRS_USE_SNAPSHOT, origin_startpos); > > > > I know it was previously mentioned in this thread that inline > > parameter comments are unnecessary, but here they are already in the > > existing code so shouldn't we do the same? > > > > I think it is better to remove the even existing ones as those many > times make code difficult to read. I had earlier added inline comments in callers of ReplicationSlotCreate() and walrcv_connect() for the new args 'synced' and 'replication' respectively; those changes are removed from patch002 and patch005 now. Also improved the alter-sub doc in patch001 as suggested by Peter offlist. PFA the v71 patch set with the above changes. thanks Shveta
Attachment
- v71-0005-Non-replication-connection-and-app_name-change.patch
- v71-0004-Allow-logical-walsenders-to-wait-for-the-physica.patch
- v71-0003-Slot-sync-worker-as-a-special-process.patch
- v71-0001-Add-a-failover-option-to-subscriptions.patch
- v71-0002-Add-logical-slot-sync-capability-to-the-physical.patch
- v71-0006-Document-the-steps-to-check-if-the-standby-is-re.patch
On Mon, Jan 29, 2024 at 3:11 PM shveta malik <shveta.malik@gmail.com> wrote: > > PFA v71 patch set with above changes. > Few comments on 0001 =================== 1. parse_subscription_options() { ... /* * We've been explicitly asked to not connect, that requires some * additional processing. */ if (!opts->connect && IsSet(supported_opts, SUBOPT_CONNECT)) { Here, along with other options, we need an explicit check for failover, so that if connect=false and failover=true, the statement should give error. I was expecting the below statement to fail but it passed with WARNING. postgres=# create subscription sub2 connection 'dbname=postgres' publication pub2 with(connect=false, failover=true); WARNING: subscription was created, but is not connected HINT: To initiate replication, you must manually create the replication slot, enable the subscription, and refresh the subscription. CREATE SUBSCRIPTION 2. @@ -148,6 +153,10 @@ typedef struct Subscription List *publications; /* List of publication names to subscribe to */ char *origin; /* Only publish data originating from the * specified origin */ + bool failover; /* True if the associated replication slots + * (i.e. the main slot and the table sync + * slots) in the upstream database are enabled + * to be synchronized to the standbys. */ } Subscription; Let's add this new field immediately after "bool runasowner;" as is done for other boolean members. This will help avoid increasing the size of the structure due to alignment when we add any new pointer field in the future. Also, that would be consistent with what we do for other new boolean members. -- With Regards, Amit Kapila.
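A sketch of the behaviour being asked for in comment 1, reusing the statement from above (the exact error wording would be up to the patch):

    -- expected to be rejected once the explicit check is added:
    CREATE SUBSCRIPTION sub2 CONNECTION 'dbname=postgres' PUBLICATION pub2
        WITH (connect = false, failover = true);

    -- instead, leave failover at its default and set it later, once the
    -- replication slot has been created manually:
    CREATE SUBSCRIPTION sub2 CONNECTION 'dbname=postgres' PUBLICATION pub2
        WITH (connect = false);
    ALTER SUBSCRIPTION sub2 SET (failover = true);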
On Monday, January 29, 2024 7:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jan 29, 2024 at 3:11 PM shveta malik <shveta.malik@gmail.com> > wrote: > > > > PFA v71 patch set with above changes. > > > > Few comments on 0001 Thanks for the comments. > =================== > 1. > parse_subscription_options() > { > ... > /* > * We've been explicitly asked to not connect, that requires some > * additional processing. > */ > if (!opts->connect && IsSet(supported_opts, SUBOPT_CONNECT)) { > > Here, along with other options, we need an explicit check for failover, so that if > connect=false and failover=true, the statement should give error. I was > expecting the below statement to fail but it passed with WARNING. > postgres=# create subscription sub2 connection 'dbname=postgres' > publication pub2 with(connect=false, failover=true); > WARNING: subscription was created, but is not connected > HINT: To initiate replication, you must manually create the replication slot, > enable the subscription, and refresh the subscription. > CREATE SUBSCRIPTION Added. > > 2. > @@ -148,6 +153,10 @@ typedef struct Subscription > List *publications; /* List of publication names to subscribe to */ > char *origin; /* Only publish data originating from the > * specified origin */ > + bool failover; /* True if the associated replication slots > + * (i.e. the main slot and the table sync > + * slots) in the upstream database are enabled > + * to be synchronized to the standbys. */ > } Subscription; > > Let's add this new field immediately after "bool runasowner;" as is done for > other boolean members. This will help avoid increasing the size of the structure > due to alignment when we add any new pointer field in the future. Also, that > would be consistent with what we do for other new boolean members. Moved this field as suggested. Attach the V72-0001 which addressed above comments, other patches will be rebased and posted after pushing first patch. Thanks Shveta for helping address the comments. Best Regards, Hou zj
Attachment
On Monday, January 29, 2024 9:17 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Monday, January 29, 2024 7:30 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Mon, Jan 29, 2024 at 3:11 PM shveta malik <shveta.malik@gmail.com> > > wrote: > > > > > > PFA v71 patch set with above changes. > > > > > > > Few comments on 0001 > > Thanks for the comments. > > > =================== > > 1. > > parse_subscription_options() > > { > > ... > > /* > > * We've been explicitly asked to not connect, that requires some > > * additional processing. > > */ > > if (!opts->connect && IsSet(supported_opts, SUBOPT_CONNECT)) { > > > > Here, along with other options, we need an explicit check for > > failover, so that if connect=false and failover=true, the statement > > should give error. I was expecting the below statement to fail but it passed > with WARNING. > > postgres=# create subscription sub2 connection 'dbname=postgres' > > publication pub2 with(connect=false, failover=true); > > WARNING: subscription was created, but is not connected > > HINT: To initiate replication, you must manually create the > > replication slot, enable the subscription, and refresh the subscription. > > CREATE SUBSCRIPTION > > Added. > > > > > 2. > > @@ -148,6 +153,10 @@ typedef struct Subscription > > List *publications; /* List of publication names to subscribe to */ > > char *origin; /* Only publish data originating from the > > * specified origin */ > > + bool failover; /* True if the associated replication slots > > + * (i.e. the main slot and the table sync > > + * slots) in the upstream database are enabled > > + * to be synchronized to the standbys. */ > > } Subscription; > > > > Let's add this new field immediately after "bool runasowner;" as is > > done for other boolean members. This will help avoid increasing the > > size of the structure due to alignment when we add any new pointer > > field in the future. Also, that would be consistent with what we do for other > new boolean members. > > Moved this field as suggested. > > Attach the V72-0001 which addressed above comments, other patches will be > rebased and posted after pushing first patch. Thanks Shveta for helping > address the comments. Apart from above comments. The new V72 patch also includes the followings changes. 1. Moved the test 'altering failover for enabled sub' to the tap-test where most of the alter-sub behaviors are tested. 2. Rename the tap-test from 050_standby_failover_slots_sync.pl to 040_standby_failover_slots_sync.pl (the big number 050 was used to avoid conflict with other newly committed tests). And add the test into meson.build which was missed. Best Regards, Hou zj
Here are some review comments for v72-0001 ====== doc/src/sgml/ref/alter_subscription.sgml 1. + parameter value of the subscription. Otherwise, the slot on the + publisher may behave differently from what subscription's + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> + option says. The slot on the publisher could either be + synced to the standbys even when the subscription's + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> + option is disabled or could be disabled for sync + even when the subscription's + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> + option is enabled. + </para> It is a bit wordy to keep saying "disabled/enabled" BEFORE The slot on the publisher could either be synced to the standbys even when the subscription's failover option is disabled or could be disabled for sync even when the subscription's failover option is enabled. SUGGESTION The slot on the publisher could be synced to the standbys even when the subscription's failover = false or may not be syncing even when the subscription's failover = true. ====== .../t/040_standby_failover_slots_sync.pl 2. +# Enable subscription +$subscriber1->safe_psql('postgres', + "ALTER SUBSCRIPTION regress_mysub1 ENABLE"); + +# Disable failover for enabled subscription +my ($result, $stdout, $stderr) = $subscriber1->psql('postgres', + "ALTER SUBSCRIPTION regress_mysub1 SET (failover = false)"); +ok( $stderr =~ /ERROR: cannot set failover for enabled subscription/, + "altering failover is not allowed for enabled subscription"); + Currently, those tests are under scope the big comment: +################################################## +# Test that changing the failover property of a subscription updates the +# corresponding failover property of the slot. +################################################## But that comment is not quite relevant to these tests. So, add another one just these: SUGGESTION: ################################################## # Test that cannot modify the failover option for enabled subscriptions. ################################################## ====== Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Jan 30, 2024 at 7:29 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v72-0001 > > ====== > doc/src/sgml/ref/alter_subscription.sgml > > 1. > + parameter value of the subscription. Otherwise, the slot on the > + publisher may behave differently from what subscription's > + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> > + option says. The slot on the publisher could either be > + synced to the standbys even when the subscription's > + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> > + option is disabled or could be disabled for sync > + even when the subscription's > + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link> > + option is enabled. > + </para> > > It is a bit wordy to keep saying "disabled/enabled" > > BEFORE > The slot on the publisher could either be synced to the standbys even > when the subscription's failover option is disabled or could be > disabled for sync even when the subscription's failover option is > enabled. > > SUGGESTION > The slot on the publisher could be synced to the standbys even when > the subscription's failover = false or may not be syncing even when > the subscription's failover = true. > I think it is a matter of personal preference because I find the existing wording in the patch easier to follow. So, I would like to retain that as it is. -- With Regards, Amit Kapila.
On Mon, Jan 29, 2024 at 6:47 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Monday, January 29, 2024 7:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > =================== > > 1. > > parse_subscription_options() > > { > > ... > > /* > > * We've been explicitly asked to not connect, that requires some > > * additional processing. > > */ > > if (!opts->connect && IsSet(supported_opts, SUBOPT_CONNECT)) { > > > > Here, along with other options, we need an explicit check for failover, so that if > > connect=false and failover=true, the statement should give error. I was > > expecting the below statement to fail but it passed with WARNING. > > postgres=# create subscription sub2 connection 'dbname=postgres' > > publication pub2 with(connect=false, failover=true); > > WARNING: subscription was created, but is not connected > > HINT: To initiate replication, you must manually create the replication slot, > > enable the subscription, and refresh the subscription. > > CREATE SUBSCRIPTION > > Added. > In this regard, I feel we don't need to dump/restore the 'FAILOVER' option non-binary upgrade paths similar to the 'ENABLE' option. For binary upgrade, if the failover option is enabled, then we can enable it using Alter Subscription SET (failover=true). Let's add one test corresponding to this behavior in postgresql\src\bin\pg_upgrade\t\004_subscription. Additionally, we need to update the pg_dump docs for the 'failover' option. See "When dumping logical replication subscriptions, .." [1]. I think we also need to update the connect option docs in CREATE SUBSCRIPTION [2]. [1] - https://www.postgresql.org/docs/devel/app-pgdump.html [2] - https://www.postgresql.org/docs/devel/sql-createsubscription.html -- With Regards, Amit Kapila.
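For reference, a sketch of the post-restore steps the updated pg_dump documentation would describe, assuming the slot is created manually on the publisher (slot, plugin, and subscription names are hypothetical):

    -- on the publisher: create the slot the restored subscription will use,
    -- with failover enabled so it can be synchronized to standbys
    SELECT pg_create_logical_replication_slot('mysub', 'pgoutput', failover => true);

    -- on the subscriber: mark the restored subscription before enabling it
    ALTER SUBSCRIPTION mysub SET (failover = true);
    ALTER SUBSCRIPTION mysub ENABLE;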
On Tue, Jan 30, 2024 at 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > In this regard, I feel we don't need to dump/restore the 'FAILOVER' > option non-binary upgrade paths similar to the 'ENABLE' option. For > binary upgrade, if the failover option is enabled, then we can enable > it using Alter Subscription SET (failover=true). Let's add one test > corresponding to this behavior in > postgresql\src\bin\pg_upgrade\t\004_subscription. Changed pg_dump behaviour as suggested and added additional test. > Additionally, we need to update the pg_dump docs for the 'failover' > option. See "When dumping logical replication subscriptions, .." [1]. > I think we also need to update the connect option docs in CREATE > SUBSCRIPTION [2]. Updated docs. > [1] - https://www.postgresql.org/docs/devel/app-pgdump.html > [2] - https://www.postgresql.org/docs/devel/sql-createsubscription.html PFA v73-0001 which addresses the above comments. Other patches will be rebased and posted after pushing this one. Thanks Hou-San for adding pg_upgrade test for failover. thanks Shveta
Attachment
On Tue, Jan 30, 2024 at 4:06 PM shveta malik <shveta.malik@gmail.com> wrote: > > PFA v73-0001 which addresses the above comments. Other patches will be rebased and posted after pushing this one. Since v73-0001 is pushed, PFA the rest of the patches. Changes are: 1) Rebased the patches. 2) Ran pgindent on all. 3) patch001: Updated logicaldecoding.sgml for the dbname requirement in primary_conninfo for slot synchronization. thanks Shveta
Attachment
- v74-0002-Slot-sync-worker-as-a-special-process.patch
- v74-0003-Allow-logical-walsenders-to-wait-for-the-physica.patch
- v74-0005-Document-the-steps-to-check-if-the-standby-is-re.patch
- v74-0004-Non-replication-connection-and-app_name-change.patch
- v74-0001-Add-logical-slot-sync-capability-to-the-physical.patch
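To illustrate the documentation change in point 3 above, a sketch of the standby-side settings involved in slot synchronization (host, port, user, and database are hypothetical; enable_syncslot is the GUC name still under discussion up-thread):

    -- on the standby: primary_conninfo must now include a dbname so the
    -- slot-sync worker can connect to the primary
    ALTER SYSTEM SET primary_conninfo = 'host=primary port=5432 user=repluser dbname=postgres';
    ALTER SYSTEM SET enable_syncslot = on;
    -- the patch set also expects hot_standby_feedback and a physical slot
    -- (primary_slot_name) to be in use; reload or restart as appropriate
    SELECT pg_reload_conf();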
Hi, On Mon, Jan 29, 2024 at 09:15:57AM +0000, Bertrand Drouvot wrote: > Hi, > > On Mon, Jan 29, 2024 at 02:35:52PM +0530, Amit Kapila wrote: > > I think it is better to create a separate patch for two_phase after > > this patch gets committed. > > Yeah, makes sense, will do, thanks! It's done in [1]. [1]: https://www.postgresql.org/message-id/ZbkYrLPhH%2BRxpZlW%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, I saw that v73-0001 was pushed, but it included some last-minute changes that I did not get a chance to check yesterday. Here are some review comments for the new parts of that patch. ====== doc/src/sgml/ref/create_subscription.sgml 1. connect (boolean) Specifies whether the CREATE SUBSCRIPTION command should connect to the publisher at all. The default is true. Setting this to false will force the values of create_slot, enabled, copy_data, and failover to false. (You cannot combine setting connect to false with setting create_slot, enabled, copy_data, or failover to true.) ~ I don't think the first part "Setting this to false will force the values ... failover to false." is strictly correct. I think is correct to say all those *other* properties (create_slot, enabled, copy_data) are forced to false because those otherwise have default true values. But the 'failover' has default false, so it cannot get force-changed at all because you can't set connect to false when failover is true as the second part ("You cannot combine...") explains. IMO remove 'failover' from that first sentence. ~~~ 2. <para> Since no connection is made when this option is <literal>false</literal>, no tables are subscribed. To initiate replication, you must manually create the replication slot, enable - the subscription, and refresh the subscription. See + the failover if required, enable the subscription, and refresh the + subscription. See <xref linkend="logical-replication-subscription-examples-deferred-slot"/> for examples. </para> IMO "see the failover if required" is very vague. See what failover? The slot property failover, or the subscription option failover? And "see" it for what purpose? I think the intention was probably to say something like "ensure the manually created slot has the same matching failover property value as the subscriber failover option", but that is not clear from the current text. ====== doc/src/sgml/ref/pg_dump.sgml 3. dump can be restored without requiring network access to the remote servers. It is then up to the user to reactivate the subscriptions in a suitable way. If the involved hosts have changed, the connection - information might have to be changed. It might also be appropriate to + information might have to be changed. If the subscription needs to + be enabled for + <link linkend="sql-createsubscription-params-with-failover"><literal>failover</literal></link>, + then same needs to be done by executing + <link linkend="sql-altersubscription-params-set"> + <literal>ALTER SUBSCRIPTION ... SET(failover = true)</literal></link> + after the slot has been created. It might also be appropriate to "then same needs to be done" (English?) BEFORE If the subscription needs to be enabled for failover, then same needs to be done by executing ALTER SUBSCRIPTION ... SET(failover = true) after the slot has been created. SUGGESTION If the subscription needs to be enabled for failover, execute ALTER SUBSCRIPTION ... SET(failover = true) after the slot has been created. ====== src/backend/commands/subscriptioncmds.c 4. #define SUBOPT_RUN_AS_OWNER 0x00001000 -#define SUBOPT_LSN 0x00002000 -#define SUBOPT_ORIGIN 0x00004000 +#define SUBOPT_FAILOVER 0x00002000 +#define SUBOPT_LSN 0x00004000 +#define SUBOPT_ORIGIN 0x00008000 + A spurious blank line was added. ====== src/bin/pg_upgrade/t/004_subscription.pl 5. -# The subscription's running status should be preserved. Old subscription -# regress_sub1 should be enabled and old subscription regress_sub2 should be -# disabled. 
+# The subscription's running status and failover option should be preserved. +# Old subscription regress_sub1 should have enabled and failover as true while +# old subscription regress_sub2 should have enabled and failover as false. $result = $new_sub->safe_psql('postgres', - "SELECT subname, subenabled FROM pg_subscription ORDER BY subname"); -is( $result, qq(regress_sub1|t -regress_sub2|f), + "SELECT subname, subenabled, subfailover FROM pg_subscription ORDER BY subname"); +is( $result, qq(regress_sub1|t|t +regress_sub2|f|f), "check that the subscription's running status are preserved"); ~ Calling those "old subscriptions" seems misleading. Aren't these the new/upgraded subscriptions being checked here? Should the comment be more like: # The subscription's running status and failover option should be preserved. # Upgraded regress_sub1 should still have enabled and failover as true. # Upgraded regress_sub2 should still have enabled and failover as false. ====== Kind Regards, Peter Smith. Fujitsu Australia.
On Wed, Jan 31, 2024 at 7:27 AM Peter Smith <smithpb2250@gmail.com> wrote: > > I saw that v73-0001 was pushed, but it included some last-minute > changes that I did not get a chance to check yesterday. > > Here are some review comments for the new parts of that patch. > > ====== > doc/src/sgml/ref/create_subscription.sgml > > 1. > connect (boolean) > > Specifies whether the CREATE SUBSCRIPTION command should connect > to the publisher at all. The default is true. Setting this to false > will force the values of create_slot, enabled, copy_data, and failover > to false. (You cannot combine setting connect to false with setting > create_slot, enabled, copy_data, or failover to true.) > > ~ > > I don't think the first part "Setting this to false will force the > values ... failover to false." is strictly correct. > > I think is correct to say all those *other* properties (create_slot, > enabled, copy_data) are forced to false because those otherwise have > default true values. > So, won't when connect=false, the user has to explicitly provide such values (create_slot, enabled, etc.) as false? If so, is using 'force' strictly correct? > ~~~ > > 2. > <para> > Since no connection is made when this option is > <literal>false</literal>, no tables are subscribed. To initiate > replication, you must manually create the replication slot, enable > - the subscription, and refresh the subscription. See > + the failover if required, enable the subscription, and refresh the > + subscription. See > <xref > linkend="logical-replication-subscription-examples-deferred-slot"/> > for examples. > </para> > > IMO "see the failover if required" is very vague. See what failover? > AFAICS, the committed docs says: "To initiate replication, you must manually create the replication slot, enable the failover if required, ...". I am not sure what you are referring to. > > ====== > src/bin/pg_upgrade/t/004_subscription.pl > > 5. > -# The subscription's running status should be preserved. Old subscription > -# regress_sub1 should be enabled and old subscription regress_sub2 should be > -# disabled. > +# The subscription's running status and failover option should be preserved. > +# Old subscription regress_sub1 should have enabled and failover as true while > +# old subscription regress_sub2 should have enabled and failover as false. > $result = > $new_sub->safe_psql('postgres', > - "SELECT subname, subenabled FROM pg_subscription ORDER BY subname"); > -is( $result, qq(regress_sub1|t > -regress_sub2|f), > + "SELECT subname, subenabled, subfailover FROM pg_subscription ORDER > BY subname"); > +is( $result, qq(regress_sub1|t|t > +regress_sub2|f|f), > "check that the subscription's running status are preserved"); > > ~ > > Calling those "old subscriptions" seems misleading. Aren't these the > new/upgraded subscriptions being checked here? > Again the quoted wording is not introduced by this patch. But, I see your point and it is better if you can start a separate thread for it. -- With Regards, Amit Kapila.
On Wed, Jan 31, 2024 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 31, 2024 at 7:27 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > I saw that v73-0001 was pushed, but it included some last-minute > > changes that I did not get a chance to check yesterday. > > > > Here are some review comments for the new parts of that patch. > > > > ====== > > doc/src/sgml/ref/create_subscription.sgml > > > > 1. > > connect (boolean) > > > > Specifies whether the CREATE SUBSCRIPTION command should connect > > to the publisher at all. The default is true. Setting this to false > > will force the values of create_slot, enabled, copy_data, and failover > > to false. (You cannot combine setting connect to false with setting > > create_slot, enabled, copy_data, or failover to true.) > > > > ~ > > > > I don't think the first part "Setting this to false will force the > > values ... failover to false." is strictly correct. > > > > I think is correct to say all those *other* properties (create_slot, > > enabled, copy_data) are forced to false because those otherwise have > > default true values. > > > > So, won't when connect=false, the user has to explicitly provide such > values (create_slot, enabled, etc.) as false? If so, is using 'force' > strictly correct? Perhaps the original docs text could be worded differently; I think the word "force" here just meant setting connection=false forces/causes/makes those other options behave "as if" they had been set to false without the user explicitly doing anything to them. The point is they are made to behave *differently* to their normal defaults. So, connect=false ==> this actually sets enabled=false (you can see this with \dRs+), which is different to the default setting of 'enabled' connect=false ==> will not create a slot (because there is no connection), which is different to the default behaviour for 'create-slot' connect=false ==> will not copy tables (because there is no connection), which is different to the default behaviour for 'copy_data;' OTOH, failover is different connect=false ==> failover is not possible (because there is no connection), but the 'failover' default is false anyway, so no change to the default behaviour for this one. > > > ~~~ > > > > 2. > > <para> > > Since no connection is made when this option is > > <literal>false</literal>, no tables are subscribed. To initiate > > replication, you must manually create the replication slot, enable > > - the subscription, and refresh the subscription. See > > + the failover if required, enable the subscription, and refresh the > > + subscription. See > > <xref > > linkend="logical-replication-subscription-examples-deferred-slot"/> > > for examples. > > </para> > > > > IMO "see the failover if required" is very vague. See what failover? > > > > AFAICS, the committed docs says: "To initiate replication, you must > manually create the replication slot, enable the failover if required, > ...". I am not sure what you are referring to. > My mistake. I was misreading the patch code without applying the patch. Sorry for the noise. > > > > ====== > > src/bin/pg_upgrade/t/004_subscription.pl > > > > 5. > > -# The subscription's running status should be preserved. Old subscription > > -# regress_sub1 should be enabled and old subscription regress_sub2 should be > > -# disabled. > > +# The subscription's running status and failover option should be preserved. 
> > +# Old subscription regress_sub1 should have enabled and failover as true while > > +# old subscription regress_sub2 should have enabled and failover as false. > > $result = > > $new_sub->safe_psql('postgres', > > - "SELECT subname, subenabled FROM pg_subscription ORDER BY subname"); > > -is( $result, qq(regress_sub1|t > > -regress_sub2|f), > > + "SELECT subname, subenabled, subfailover FROM pg_subscription ORDER > > BY subname"); > > +is( $result, qq(regress_sub1|t|t > > +regress_sub2|f|f), > > "check that the subscription's running status are preserved"); > > > > ~ > > > > Calling those "old subscriptions" seems misleading. Aren't these the > > new/upgraded subscriptions being checked here? > > > > Again the quoted wording is not introduced by this patch. But, I see > your point and it is better if you can start a separate thread for it. > OK. I created a separate thread for this [1] ====== [1] https://www.postgresql.org/message-id/CAHut+Pu1usLPHRySPTacY1K_Q-ddSRXNFhmj_2u1NfqBC1ytng@mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia.
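To illustrate the point with a minimal sketch (the subscription, publication, and connection values are made up; subfailover is the catalog column checked by the pg_upgrade test discussed above):

    -- connect = false: no connection is made, so no slot is created,
    -- no tables are copied, and the subscription is left disabled.
    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=primary dbname=postgres'
        PUBLICATION mypub
        WITH (connect = false);

    -- enabled is forced away from its default (true), while failover
    -- simply keeps its default (false).
    SELECT subname, subenabled, subfailover FROM pg_subscription;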
On Wednesday, January 31, 2024 9:57 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi, > > I saw that v73-0001 was pushed, but it included some last-minute > changes that I did not get a chance to check yesterday. > > Here are some review comments for the new parts of that patch. > > ====== > doc/src/sgml/ref/create_subscription.sgml > > 1. > connect (boolean) > > Specifies whether the CREATE SUBSCRIPTION command should connect > to the publisher at all. The default is true. Setting this to false > will force the values of create_slot, enabled, copy_data, and failover > to false. (You cannot combine setting connect to false with setting > create_slot, enabled, copy_data, or failover to true.) > > ~ > > I don't think the first part "Setting this to false will force the > values ... failover to false." is strictly correct. > > I think is correct to say all those *other* properties (create_slot, > enabled, copy_data) are forced to false because those otherwise have > default true values. But the 'failover' has default false, so it > cannot get force-changed at all because you can't set connect to false > when failover is true as the second part ("You cannot combine...") > explains. > > IMO remove 'failover' from that first sentence. > > > 3. > dump can be restored without requiring network access to the remote > servers. It is then up to the user to reactivate the subscriptions in a > suitable way. If the involved hosts have changed, the connection > - information might have to be changed. It might also be appropriate to > + information might have to be changed. If the subscription needs to > + be enabled for > + <link > linkend="sql-createsubscription-params-with-failover"><literal>failover</lit > eral></link>, > + then same needs to be done by executing > + <link linkend="sql-altersubscription-params-set"> > + <literal>ALTER SUBSCRIPTION ... SET(failover = true)</literal></link> > + after the slot has been created. It might also be appropriate to > > "then same needs to be done" (English?) > > BEFORE > If the subscription needs to be enabled for failover, then same needs > to be done by executing ALTER SUBSCRIPTION ... SET(failover = true) > after the slot has been created. > > SUGGESTION > If the subscription needs to be enabled for failover, execute ALTER > SUBSCRIPTION ... SET(failover = true) after the slot has been created. > > ====== > src/backend/commands/subscriptioncmds.c > > 4. > #define SUBOPT_RUN_AS_OWNER 0x00001000 > -#define SUBOPT_LSN 0x00002000 > -#define SUBOPT_ORIGIN 0x00004000 > +#define SUBOPT_FAILOVER 0x00002000 > +#define SUBOPT_LSN 0x00004000 > +#define SUBOPT_ORIGIN 0x00008000 > + > > A spurious blank line was added. > Here is a small patch to address the comment 3 and 4. The discussion for comment 1 is still going on, so we can update the patch once it's concluded. Best Regards, Hou zj
Attachment
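To make the documented sequence concrete, here is a rough sketch of the deferred-slot workflow (object names are made up; the failover handling assumes the option added in v73-0001):

    -- On the subscriber: create the subscription without connecting.
    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=primary dbname=postgres'
        PUBLICATION mypub
        WITH (connect = false);

    -- On the publisher: create the replication slot manually.
    SELECT pg_create_logical_replication_slot('mysub', 'pgoutput');

    -- On the subscriber: enable failover if required (while the
    -- subscription is still disabled), then enable and refresh it.
    ALTER SUBSCRIPTION mysub SET (failover = true);
    ALTER SUBSCRIPTION mysub ENABLE;
    ALTER SUBSCRIPTION mysub REFRESH PUBLICATION;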
On Tue, Jan 30, 2024 at 9:53 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Jan 30, 2024 at 4:06 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > PFA v73-0001 which addresses the above comments. Other patches will be > > rebased and posted after pushing this one. > > Since v73-0001 is pushed, PFA rest of the patches. Changes are: > > 1) Rebased the patches. > 2) Ran pg_indent on all. > 3) patch001: Updated logicaldecoding.sgml for dbname requirement in > primary_conninfo for slot-synchronization. > Thank you for updating the patches. As for the slotsync worker patch, is there any reason why 0001, 0002, and 0004 patches are still separated? Beside, here are some comments on v74 0001, 0002, and 0004 patches: --- +static char * +wait_for_valid_params_and_get_dbname(void) +{ + char *dbname; + int rc; + + /* Sanity check. */ + Assert(enable_syncslot); + + for (;;) + { + if (validate_parameters_and_get_dbname(&dbname)) + break; + ereport(LOG, errmsg("skipping slot synchronization")); + + ProcessSlotSyncInterrupts(NULL); When reading this function, I expected that the slotsync worker would resume working once the parameters became valid, but it was not correct. For example, if I changed hot_standby_feedback from off to on, the slotsync worker reads the config file, exits, and then restarts. Given that the slotsync worker ends up exiting on parameter changes anyway, why do we want to have it wait for parameters to become valid? IIUC even if the slotsync worker exits when a parameter is not valid, it restarts at some intervals. --- +bool +SlotSyncWorkerCanRestart(void) +{ +#define SLOTSYNC_RESTART_INTERVAL_SEC 10 + IIUC depending on how busy the postmaster is and the timing, the user could wait for 1 min to re-launch the slotsync worker. But I think the user might want to re-launch the slotsync worker more quickly for example when the slotsync worker restarts due to parameter changes. IIUC SloSyncWorkerCanRestart() doesn't consider the fact that the slotsync worker previously exited with 0 or 1. --- + /* We are a normal standby */ + valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); + Assert(!isnull); What do you mean by "normal standby"? --- + appendStringInfo(&cmd, + "SELECT pg_is_in_recovery(), count(*) = 1" + " FROM pg_replication_slots" + " WHERE slot_type='physical' AND slot_name=%s", + quote_literal_cstr(PrimarySlotName)); I think we need to make "pg_replication_slots" schema-qualified. --- + errdetail("The primary server slot \"%s\" specified by" + " \"%s\" is not valid.", + PrimarySlotName, "primary_slot_name")); and + errmsg("slot sync worker will shutdown because" + " %s is disabled", "enable_syncslot")); It's better to write it in one line for better greppability. --- When I dropped a database on the primary that has a failover slot, I got the following logs on the standby: 2024-01-31 17:25:21.750 JST [1103933] FATAL: replication slot "s" is active for PID 1103935 2024-01-31 17:25:21.750 JST [1103933] CONTEXT: WAL redo at 0/3020D20 for Database/DROP: dir 1663/16384 2024-01-31 17:25:21.751 JST [1103930] LOG: startup process (PID 1103933) exited with exit code 1 It seems that because the slotsync worker created the slot on the standby, the slot's active_pid is still valid. That is why the startup process could not drop the slot. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
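On the schema-qualification point, presumably the query would end up looking something like this (a sketch only, with a literal slot name in place of the quoted parameter):

    SELECT pg_catalog.pg_is_in_recovery(), count(*) = 1
      FROM pg_catalog.pg_replication_slots
     WHERE slot_type = 'physical' AND slot_name = 'my_primary_slot';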
On Wed, Jan 31, 2024 at 2:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Thank you for updating the patches. As for the slotsync worker patch, > is there any reason why 0001, 0002, and 0004 patches are still > separated? > No specific reason, it could be easier to review those parts. > > Beside, here are some comments on v74 0001, 0002, and 0004 patches: > > --- > +static char * > +wait_for_valid_params_and_get_dbname(void) > +{ > + char *dbname; > + int rc; > + > + /* Sanity check. */ > + Assert(enable_syncslot); > + > + for (;;) > + { > + if (validate_parameters_and_get_dbname(&dbname)) > + break; > + ereport(LOG, errmsg("skipping slot synchronization")); > + > + ProcessSlotSyncInterrupts(NULL); > > When reading this function, I expected that the slotsync worker would > resume working once the parameters became valid, but it was not > correct. For example, if I changed hot_standby_feedback from off to > on, the slotsync worker reads the config file, exits, and then > restarts. Given that the slotsync worker ends up exiting on parameter > changes anyway, why do we want to have it wait for parameters to > become valid? > Right, the reason for waiting is to avoid repeated re-start of slotsync worker if the required parameter is not changed. To follow that, I think we should simply continue when the required parameter is changed and is valid. But, I think during actual slotsync, if connection_info is changed then there is no option but to restart. > > --- > +bool > +SlotSyncWorkerCanRestart(void) > +{ > +#define SLOTSYNC_RESTART_INTERVAL_SEC 10 > + > > IIUC depending on how busy the postmaster is and the timing, the user > could wait for 1 min to re-launch the slotsync worker. But I think the > user might want to re-launch the slotsync worker more quickly for > example when the slotsync worker restarts due to parameter changes. > IIUC SloSyncWorkerCanRestart() doesn't consider the fact that the > slotsync worker previously exited with 0 or 1. > Considering my previous where we don't want to restart for a required parameter change, isn't it better to avoid repeated restart (say when the user gave an invalid dbname)? BTW, I think this restart interval is added based on your previous complaint [1]. > > --- > When I dropped a database on the primary that has a failover slot, I > got the following logs on the standby: > > 2024-01-31 17:25:21.750 JST [1103933] FATAL: replication slot "s" is > active for PID 1103935 > 2024-01-31 17:25:21.750 JST [1103933] CONTEXT: WAL redo at 0/3020D20 > for Database/DROP: dir 1663/16384 > 2024-01-31 17:25:21.751 JST [1103930] LOG: startup process (PID > 1103933) exited with exit code 1 > > It seems that because the slotsync worker created the slot on the > standby, the slot's active_pid is still valid. > But we release the slot after sync. And we do take a shared lock on the database to make the startup process wait for slotsync. There is one gap which is that we don't reset active_pid for temp slots in ReplicationSlotRelease(), so for temp slots such an error can occur but OTOH, we immediately make the slot persistent after sync. As per my understanding, it is only possible to get this error if the initial sync doesn't happen and the slot remains temporary. Is that your case? How did reproduce this? That is why the startup > process could not drop the slot. > [1] - https://www.postgresql.org/message-id/CAD21AoApGoTZu7D_7%3DbVYQqKnj%2BPZ2Rz%2Bnc8Ky1HPQMS_XL6%2BA%40mail.gmail.com -- With Regards, Amit Kapila.
On Wed, Jan 31, 2024 at 7:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 31, 2024 at 2:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Thank you for updating the patches. As for the slotsync worker patch, > > is there any reason why 0001, 0002, and 0004 patches are still > > separated? > > > > No specific reason, it could be easier to review those parts. Okay, I think we can merge 0001 and 0002 at least as we don't need bgworker codes. > > > > > Beside, here are some comments on v74 0001, 0002, and 0004 patches: > > > > --- > > +static char * > > +wait_for_valid_params_and_get_dbname(void) > > +{ > > + char *dbname; > > + int rc; > > + > > + /* Sanity check. */ > > + Assert(enable_syncslot); > > + > > + for (;;) > > + { > > + if (validate_parameters_and_get_dbname(&dbname)) > > + break; > > + ereport(LOG, errmsg("skipping slot synchronization")); > > + > > + ProcessSlotSyncInterrupts(NULL); > > > > When reading this function, I expected that the slotsync worker would > > resume working once the parameters became valid, but it was not > > correct. For example, if I changed hot_standby_feedback from off to > > on, the slotsync worker reads the config file, exits, and then > > restarts. Given that the slotsync worker ends up exiting on parameter > > changes anyway, why do we want to have it wait for parameters to > > become valid? > > > > Right, the reason for waiting is to avoid repeated re-start of > slotsync worker if the required parameter is not changed. To follow > that, I think we should simply continue when the required parameter is > changed and is valid. But, I think during actual slotsync, if > connection_info is changed then there is no option but to restart. Agreed. > > > > --- > > +bool > > +SlotSyncWorkerCanRestart(void) > > +{ > > +#define SLOTSYNC_RESTART_INTERVAL_SEC 10 > > + > > > > IIUC depending on how busy the postmaster is and the timing, the user > > could wait for 1 min to re-launch the slotsync worker. But I think the > > user might want to re-launch the slotsync worker more quickly for > > example when the slotsync worker restarts due to parameter changes. > > IIUC SloSyncWorkerCanRestart() doesn't consider the fact that the > > slotsync worker previously exited with 0 or 1. > > > > Considering my previous where we don't want to restart for a required > parameter change, isn't it better to avoid repeated restart (say when > the user gave an invalid dbname)? BTW, I think this restart interval > is added based on your previous complaint [1]. I think it's useful that the slotsync worker restarts immediately when a required parameter is changed but waits to restart when it exits with an error. IIUC the apply worker does so; if it restarts due to a subscription parameter change, it resets the last-start time so that the launcher will restart it without waiting. But if it exits with an error, the launcher waits for wal_retrieve_retry_interval. I don't think the slotsync worker must follow this behavior but I feel it's useful behavior. 
> > > > > --- > > When I dropped a database on the primary that has a failover slot, I > > got the following logs on the standby: > > > > 2024-01-31 17:25:21.750 JST [1103933] FATAL: replication slot "s" is > > active for PID 1103935 > > 2024-01-31 17:25:21.750 JST [1103933] CONTEXT: WAL redo at 0/3020D20 > > for Database/DROP: dir 1663/16384 > > 2024-01-31 17:25:21.751 JST [1103930] LOG: startup process (PID > > 1103933) exited with exit code 1 > > > > It seems that because the slotsync worker created the slot on the > > standby, the slot's active_pid is still valid. > > > > But we release the slot after sync. And we do take a shared lock on > the database to make the startup process wait for slotsync. There is > one gap which is that we don't reset active_pid for temp slots in > ReplicationSlotRelease(), so for temp slots such an error can occur > but OTOH, we immediately make the slot persistent after sync. As per > my understanding, it is only possible to get this error if the initial > sync doesn't happen and the slot remains temporary. Is that your case? > How did reproduce this? I created a failover slot manually on the primary and dropped the database where the failover slot is created. So this would not happen in normal cases. BTW I've tested the following switch/fail-back scenario but it seems not to work fine. Am I missing something? Setup: node1 is the primary, node2 is the physical standby for node1, and node3 is the subscriber connecting to node1. Steps: 1. [node1]: create a table and a publication for the table. 2. [node2]: set enable_syncslot = on and start (to receive WALs from node1). 3. [node3]: create a subscription with failover = true for the publication. 4. [node2]: promote to the new standby. 5. [node3]: alter subscription to connect the new primary, node2. 6. [node1]: stop, set enable_syncslot = on (and other required parameters), then start as a new standby. Then I got the error "exiting from slot synchronization because same name slot "test_sub" already exists on the standby". The logical replication slot that was created on the old primary (node1) has been synchronized to the old standby (node2). Therefore on node2, the slot's "synced" field is true. However, once node1 starts as the new standby with slot synchronization, the slotsync worker cannot synchronize the slot because the slot's "synced" field on the primary is false. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Mon, Jan 29, 2024, at 10:17 AM, Zhijie Hou (Fujitsu) wrote:
Attach the V72-0001 which addressed above comments, other patches will be rebased and posted after pushing first patch. Thanks Shveta for helping address the comments.
While working on another patch I noticed a new NOTICE message:
NOTICE: changed the failover state of replication slot "foo" on publisher to false
I wasn't paying much attention to this thread until I started reading the 2
patches that were recently committed. The message above surprised me because
pg_createsubscriber starts to emit this message. The reason is that it doesn't
create the replication slot during the CREATE SUBSCRIPTION. Instead, it creates
the replication slot with failover = false, and no such option is specified
during CREATE SUBSCRIPTION, which means it uses the default value (failover =
false). I expected not to see any message because it is *not* changing the
behavior. I was wrong. It doesn't check the failover state on the publisher; it
just executes walrcv_alter_slot() and emits a message.
IMO if we are changing an outstanding property on node A from node B, node B
already knows (or might know) about that behavior change (because it is sending
the command); node A, however, doesn't (unless log_replication_commands = on --
it is not the default).
Do we really need this message as NOTICE? I would set it to DEBUG1 if it is
worth keeping, or even remove it (if we consider there are other ways to obtain the same
information).
On Wed, Jan 31, 2024 at 9:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jan 31, 2024 at 7:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Considering my previous where we don't want to restart for a required > > parameter change, isn't it better to avoid repeated restart (say when > > the user gave an invalid dbname)? BTW, I think this restart interval > > is added based on your previous complaint [1]. > > I think it's useful that the slotsync worker restarts immediately when > a required parameter is changed but waits to restart when it exits > with an error. IIUC the apply worker does so; if it restarts due to a > subscription parameter change, it resets the last-start time so that > the launcher will restart it without waiting. > Agreed, this idea sounds good to me. > > > > > > > > --- > > > When I dropped a database on the primary that has a failover slot, I > > > got the following logs on the standby: > > > > > > 2024-01-31 17:25:21.750 JST [1103933] FATAL: replication slot "s" is > > > active for PID 1103935 > > > 2024-01-31 17:25:21.750 JST [1103933] CONTEXT: WAL redo at 0/3020D20 > > > for Database/DROP: dir 1663/16384 > > > 2024-01-31 17:25:21.751 JST [1103930] LOG: startup process (PID > > > 1103933) exited with exit code 1 > > > > > > It seems that because the slotsync worker created the slot on the > > > standby, the slot's active_pid is still valid. > > > > > > > But we release the slot after sync. And we do take a shared lock on > > the database to make the startup process wait for slotsync. There is > > one gap which is that we don't reset active_pid for temp slots in > > ReplicationSlotRelease(), so for temp slots such an error can occur > > but OTOH, we immediately make the slot persistent after sync. As per > > my understanding, it is only possible to get this error if the initial > > sync doesn't happen and the slot remains temporary. Is that your case? > > How did reproduce this? > > I created a failover slot manually on the primary and dropped the > database where the failover slot is created. So this would not happen > in normal cases. > Right, it won't happen in normal cases (say for walsender). This can happen in some cases even without this patch as noted in comments just above active_pid check in ReplicationSlotsDropDBSlots(). Now, we need to think whether we should just update the comments above active_pid check to explain this case or try to engineer some solution for this not-so-common case. I guess if we want a solution we need to stop slotsync worker temporarily till the drop database WAL is applied or something like that. > BTW I've tested the following switch/fail-back scenario but it seems > not to work fine. Am I missing something? > > Setup: > node1 is the primary, node2 is the physical standby for node1, and > node3 is the subscriber connecting to node1. > > Steps: > 1. [node1]: create a table and a publication for the table. > 2. [node2]: set enable_syncslot = on and start (to receive WALs from node1). > 3. [node3]: create a subscription with failover = true for the publication. > 4. [node2]: promote to the new standby. > 5. [node3]: alter subscription to connect the new primary, node2. > 6. [node1]: stop, set enable_syncslot = on (and other required > parameters), then start as a new standby. > > Then I got the error "exiting from slot synchronization because same > name slot "test_sub" already exists on the standby". 
> > The logical replication slot that was created on the old primary > (node1) has been synchronized to the old standby (node2). Therefore on > node2, the slot's "synced" field is true. However, once node1 starts > as the new standby with slot synchronization, the slotsync worker > cannot synchronize the slot because the slot's "synced" field on the > primary is false. > Yeah, we avoided doing anything in this case because the user could have manually created another slot with the same name on standby. Unlike WAL slots can be modified on standby as we allow decoding on standby, so we can't allow to overwrite the existing slots. We won't be able to distinguish whether the existing slot was a slot that the user wants to sync with primary or a slot created on standby to perform decoding. I think in this case user first needs to drop the slot on new standby. We probably need to document it as well unless we decide to do something else. What do you think? -- With Regards, Amit Kapila.
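For the record, the manual step mentioned above amounts to something like the following on node1 before it starts synchronizing slots as the new standby (slot name taken from the scenario above):

    -- Drop the leftover failover slot so that the slotsync worker can
    -- recreate it from the new primary (node2).
    SELECT pg_drop_replication_slot('test_sub');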
On Thu, Feb 1, 2024 at 8:15 AM Euler Taveira <euler@eulerto.com> wrote: > > On Mon, Jan 29, 2024, at 10:17 AM, Zhijie Hou (Fujitsu) wrote: > > Attach the V72-0001 which addressed above comments, other patches will be > rebased and posted after pushing first patch. Thanks Shveta for helping address > the comments. > > > While working on another patch I noticed a new NOTICE message: > > NOTICE: changed the failover state of replication slot "foo" on publisher to false > > I wasn't paying much attention to this thread then I start reading the 2 > patches that was recently committed. The message above surprises me because > pg_createsubscriber starts to emit this message. The reason is that it doesn't > create the replication slot during the CREATE SUBSCRIPTION. Instead, it creates > the replication slot with failover = false and no such option is informed > during CREATE SUBSCRIPTION which means it uses the default value (failover = > false). I expect that I don't see any message because it is *not* changing the > behavior. I was wrong. It doesn't check the failover state on publisher, it > just executes walrcv_alter_slot() and emits a message. > > IMO if we are changing an outstanding property on node A from node B, node B > already knows (or might know) about that behavior change (because it is sending > the command), however, node A doesn't (unless log_replication_commands = on -- > it is not the default). > > Do we really need this message as NOTICE? > The reason for adding this NOTICE was to keep it similar to other Notice messages in these commands like create/drop slot. However, here the difference is we may not have altered the slot as the property is already the same as we want to set on the publisher. So, I am not sure whether we should follow the existing behavior or just get rid of it. And then do we remove similar NOTICE in AlterSubscription() as well? Normally, I think NOTICE intends to let users know if we did anything with slots while executing subscription commands. Does anyone else have an opinion on this point? A related point, I think we can avoid setting the 'failover' property in ReplicationSlotAlter() if it is not changed, the advantage is we will avoid saving slots. OTOH, this won't be a frequent operation so we can leave it as it is as well. -- With Regards, Amit Kapila.
On Tue, Jan 30, 2024 at 11:53 PM shveta malik <shveta.malik@gmail.com> wrote:
On Tue, Jan 30, 2024 at 4:06 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> PFA v73-0001 which addresses the above comments. Other patches will be
> rebased and posted after pushing this one.
Since v73-0001 is pushed, PFA rest of the patches. Changes are:
1) Rebased the patches.
2) Ran pg_indent on all.
3) patch001: Updated logicaldecoding.sgml for dbname requirement in
primary_conninfo for slot-synchronization.
thanks
Shveta
Just to test the behaviour, I modified the code to make the failover flag default to "true" while creating a subscription and ran the regression tests. I only saw the expected errors.
1. Make check in postgres root folder - all failures are because of differences when listing subscriptions, as the failover flag is now enabled. The diff is attached for regress.
2. Make check in src/test/subscription - no failures All tests successful.
Files=34, Tests=457, 81 wallclock secs ( 0.14 usr 0.05 sys + 9.53 cusr 13.00 csys = 22.72 CPU)
Result: PASS
3. Make check in src/test/recovery - 3 failures Test Summary Report
-------------------
t/027_stream_regress.pl (Wstat: 256 Tests: 6 Failed: 1)
Failed test: 2
Non-zero exit status: 1
t/035_standby_logical_decoding.pl (Wstat: 7424 Tests: 8 Failed: 0)
Non-zero exit status: 29
Parse errors: No plan found in TAP output t/050_standby_failover_slots_sync.pl (Wstat: 7424 Tests: 5 Failed: 0)
Non-zero exit status: 29
Parse errors: No plan found in TAP output
3a. Analysis of t/027_stream_regress.pl - No, 027 fails with the same issue as "make check" in the postgres root folder (for which I attached the diffs). 027 is about running the standard regression tests with streaming replication. Since the regression tests fail because listing subscriptions now shows failover as enabled, 027 also fails in the same way with streaming replication.
3b. Analysis of t/035_standby_logical_decoding.pl - In this test case, they attempt to create a subscription from the subscriber to the standby ##################################################
# Test that we can subscribe on the standby with the publication # created on the primary.
##################################################
Now, this fails because creating a subscription on the standby with failover enabled will result in an error:
I see the following error in the log:
2024-01-28 23:51:30.425 EST [23332] tap_sub STATEMENT: CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput (FAILOVER, SNAPSHOT 'nothing')
2024-01-28 23:51:30.425 EST [23332] tap_sub ERROR: cannot create replication slot with failover enabled on the standby
I discussed this with Shveta and she agreed that this is the expected behaviour as we don't support failover to cascading standby yet.
3c. Analysis of t/050_standby_failover_slots_sync.pl - This is a new test case created for this patch, and it creates a subscription without failover enabled to make sure that a subscription with failover disabled does not depend on sync on the standby, but this fails because failover is now enabled by default.
In summary, I don't think these issues are actual bugs; they are expected behaviour changes.
regards,
Ajin Cherian
Fujitsu Australia
Here are some review comments for v740001. ====== src/sgml/logicaldecoding.sgml 1. + <sect2 id="logicaldecoding-replication-slots-synchronization"> + <title>Replication Slot Synchronization</title> + <para> + A logical replication slot on the primary can be synchronized to the hot + standby by enabling the <literal>failover</literal> option during slot + creation and setting + <link linkend="guc-enable-syncslot"><varname>enable_syncslot</varname></link> + on the standby. For the synchronization + to work, it is mandatory to have a physical replication slot between the + primary and the standby, and + <link linkend="guc-hot-standby-feedback"><varname>hot_standby_feedback</varname></link> + must be enabled on the standby. It is also necessary to specify a valid + <literal>dbname</literal> in the + <link linkend="guc-primary-conninfo"><varname>primary_conninfo</varname></link> + string, which is used for slot synchronization and is ignored for streaming. + </para> IMO we don't need to repeat that last part ", which is used for slot synchronization and is ignored for streaming." because that is a detail about the primary_conninfo GUC, and the same information is already described in that GUC section. ====== 2. ALTER_REPLICATION_SLOT slot_name ( option [, ...] ) # <para> - If true, the slot is enabled to be synced to the standbys. + If true, the slot is enabled to be synced to the standbys + so that logical replication can be resumed after failover. </para> This also should have the sentence "The default is false.", e.g. the same as the same option in CREATE_REPLICATION_SLOT says. ====== synchronize_one_slot 3. + /* + * Make sure that concerned WAL is received and flushed before syncing + * slot to target lsn received from the primary server. + * + * This check will never pass if on the primary server, user has + * configured standby_slot_names GUC correctly, otherwise this can hit + * frequently. + */ + latestFlushPtr = GetStandbyFlushRecPtr(NULL); + if (remote_slot->confirmed_lsn > latestFlushPtr) BEFORE This check will never pass if on the primary server, user has configured standby_slot_names GUC correctly, otherwise this can hit frequently. SUGGESTION (simpler way to say the same thing?) This will always be the case unless the standby_slot_names GUC is not correctly configured on the primary server. ~~~ 4. + /* User created slot with the same name exists, raise ERROR. */ /User created/User-created/ ~~~ 5. synchronize_slots, and also drop_obsolete_slots + /* + * Use shared lock to prevent a conflict with + * ReplicationSlotsDropDBSlots(), trying to drop the same slot while + * drop-database operation. + */ (same code comment is in a couple of places) SUGGESTION (while -> during, etc.) Use a shared lock to prevent conflicting with ReplicationSlotsDropDBSlots() trying to drop the same slot during a drop-database operation. ~~~ 6. validate_parameters_and_get_dbname strcmp() just for the empty string "" might be overkill. 6a. + if (PrimarySlotName == NULL || strcmp(PrimarySlotName, "") == 0) SUGGESTION if (PrimarySlotName == NULL || *PrimarySlotName == '\0') ~~ 6b. + if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0) SUGGESTION if (PrimaryConnInfo == NULL || *PrimaryConnInfo == '\0') ====== Kind Regards, Peter Smith. Fujitsu Australia
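As a concrete illustration of the prerequisites in that paragraph, here is a rough configuration sketch (GUC names as in the current patch set; host, port, and slot names are made up, and whether a reload or a restart is needed for each setting is not shown):

    -- On the standby:
    ALTER SYSTEM SET enable_syncslot = on;
    ALTER SYSTEM SET hot_standby_feedback = on;
    ALTER SYSTEM SET primary_slot_name = 'physical_slot1';
    ALTER SYSTEM SET primary_conninfo = 'host=primary port=5432 dbname=postgres';
    SELECT pg_reload_conf();

    -- On the primary: the physical slot used by the standby, plus
    -- standby_slot_names as discussed for synchronize_one_slot().
    SELECT pg_create_physical_replication_slot('physical_slot1');
    ALTER SYSTEM SET standby_slot_names = 'physical_slot1';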
On Wed, Jan 31, 2024 at 9:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jan 31, 2024 at 7:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jan 31, 2024 at 2:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > Thank you for updating the patches. As for the slotsync worker patch, > > > is there any reason why 0001, 0002, and 0004 patches are still > > > separated? > > > > > > > No specific reason, it could be easier to review those parts. > > Okay, I think we can merge 0001 and 0002 at least as we don't need > bgworker codes. > Agreed, and I am fine with merging 0001, 0002, and 0004 as suggested by you though I have a few minor comments on 0002 and 0004. I was thinking about what will be a logical way to split the slot sync worker patch (combined result of 0001, 0002, and 0004), and one idea occurred to me is that we can have the first patch as synchronize_solts() API and the functionality required to implement that API then the second patch would be a slot sync worker which uses that API to synchronize slots and does all the required validations. Any thoughts? Few minor comments on 0002 and 0004 ================================ 1. The comments above HandleChildCrash() should mention about slot sync worker 2. --- a/src/backend/storage/lmgr/proc.c +++ b/src/backend/storage/lmgr/proc.c @@ -42,6 +42,7 @@ #include "replication/slot.h" #include "replication/syncrep.h" #include "replication/walsender.h" +#include "replication/logicalworker.h" ... --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -43,6 +43,7 @@ #include "postmaster/autovacuum.h" #include "postmaster/postmaster.h" #include "replication/slot.h" +#include "replication/logicalworker.h" These new includes don't appear to be in alphabetical order. 3. + /* We can not have logical without replication */ + if (!replication) + Assert(!logical); I think we can cover both these conditions via Assert -- With Regards, Amit Kapila.
On Thu, Jan 25, 2024 at 11:26 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Wed, Jan 24, 2024 at 04:09:15PM +0530, shveta malik wrote: > > On Wed, Jan 24, 2024 at 2:38 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > I also see Sawada-San's point and I'd vote for "sync_replication_slots". Then for > > > the current feature I think "failover" and "on" should be the values to turn the > > > feature on (assuming "on" would mean "all kind of supported slots"). > > > > Even if others agree and we change this GUC name to > > "sync_replication_slots", I feel we should keep the values as "on" and > > "off" currently, where "on" would mean 'sync failover slots' (docs can > > state that clearly). > > I gave more thoughts on it and I think the values should only be "failover" or > "off". > > The reason is that if we allow "on" and change the "on" behavior in future > versions (to support more than failover slots) then that would change the behavior > for the ones that used "on". > I again thought on this point and feel that even if we start to sync say physical slots their purpose would also be to allow failover/switchover, otherwise, there is no use of syncing the slots. So, by that theory, we can just go for naming it as sync_failover_slots or simply sync_slots with values 'off' and 'on'. Now, if these are used for switchover then there is an argument that adding 'failover' in the GUC name could be confusing but I feel 'failover' is used widely enough that it shouldn't be a problem for users to understand, otherwise, we can go with simple name like sync_slots as well. Thoughts? -- With Regards, Amit Kapila.
On Wed, Jan 31, 2024 at 10:40 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Wed, Jan 31, 2024 at 2:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > I think is correct to say all those *other* properties (create_slot, > > > enabled, copy_data) are forced to false because those otherwise have > > > default true values. > > > > > > > So, won't when connect=false, the user has to explicitly provide such > > values (create_slot, enabled, etc.) as false? If so, is using 'force' > > strictly correct? > > Perhaps the original docs text could be worded differently; I think > the word "force" here just meant setting connection=false > forces/causes/makes those other options behave "as if" they had been > set to false without the user explicitly doing anything to them. > Okay, I see your point. Let's remove the 'failover' from this part of the sentence. -- With Regards, Amit Kapila.
On Thu, Feb 1, 2024 at 2:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Agreed, and I am fine with merging 0001, 0002, and 0004 as suggested > by you though I have a few minor comments on 0002 and 0004. I was > thinking about what will be a logical way to split the slot sync > worker patch (combined result of 0001, 0002, and 0004), and one idea > occurred to me is that we can have the first patch as > synchronize_solts() API and the functionality required to implement > that API then the second patch would be a slot sync worker which uses > that API to synchronize slots and does all the required validations. > Any thoughts? If we shift 'synchronize_slots()' to the first patch but there is no caller of it, we may have a compiler warning for the same. The only way it can be done is if we temporarily add SQL function on standby which uses 'synchronize_slots()'. This SQL function can then be removed in later patches where we actually have a caller for 'synchronize_slots'. For the time being, I have merged 1,2, and some parts of 4 into a single patch and separated out libpqrc related changes to the first patch. Attached v75 patch-set. Changes are: 1) Re-arranged the patches: 1.1) 'libpqrc' related changes (from v74-001 and v74-004) are separated out in v75-001 as those are independent changes. 1.2) 'Add logical slot sync capability', 'Slot sync worker as special process' and 'App-name changes' are now merged to single patch which makes v75-002. 1.3) 'Wait for physical Standby confirmation' and 'Failover Validation Document' patches are maintained as is (v75-003 and v75-004 now). 2) Addressed comments by Swada-San, Peter and Amit given in [1], [2], [3] and [4] [1]: https://www.postgresql.org/message-id/CAD21AoDUfnnxP%2By2cg%3DLhP-bQXqFE1z4US-no%3Du30J7X%3D4Z6Aw%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAD21AoAv6FwZ6UPNTj6%3D7A%2B3O2m4utzfL8ZGS6X1EGexikG66A%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CAHut%2BPuDUT7X7ieB9uQE%3DCLznaVVcQDO2GexkHe1Xfw%3DSWnkPA%40mail.gmail.com [4]: https://www.postgresql.org/message-id/CAA4eK1K7hLU2ZT1VX2k3e21c%3DkOZySZqfVDJsfE9vAS2AZ0mig%40mail.gmail.com thanks Shveta
Attachment
On Thu, Feb 1, 2024 at 11:21 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v740001. Thanks Peter for the feedback. > ====== > src/sgml/logicaldecoding.sgml > > 1. > + <sect2 id="logicaldecoding-replication-slots-synchronization"> > + <title>Replication Slot Synchronization</title> > + <para> > + A logical replication slot on the primary can be synchronized to the hot > + standby by enabling the <literal>failover</literal> option during slot > + creation and setting > + <link linkend="guc-enable-syncslot"><varname>enable_syncslot</varname></link> > + on the standby. For the synchronization > + to work, it is mandatory to have a physical replication slot between the > + primary and the standby, and > + <link linkend="guc-hot-standby-feedback"><varname>hot_standby_feedback</varname></link> > + must be enabled on the standby. It is also necessary to specify a valid > + <literal>dbname</literal> in the > + <link linkend="guc-primary-conninfo"><varname>primary_conninfo</varname></link> > + string, which is used for slot synchronization and is ignored > for streaming. > + </para> > > IMO we don't need to repeat that last part ", which is used for slot > synchronization and is ignored for streaming." because that is a > detail about the primary_conninfo GUC, and the same information is > already described in that GUC section. Modified in v75. > ====== > > 2. ALTER_REPLICATION_SLOT slot_name ( option [, ...] ) # > > <para> > - If true, the slot is enabled to be synced to the standbys. > + If true, the slot is enabled to be synced to the standbys > + so that logical replication can be resumed after failover. > </para> > > This also should have the sentence "The default is false.", e.g. the > same as the same option in CREATE_REPLICATION_SLOT says. I have not added this. I feel the default value related details should be present in the 'CREATE' part, it is not meaningful for the "ALTER" part. ALTER does not have any defaults, it just modifies the options given by the user. > ====== > synchronize_one_slot > > 3. > + /* > + * Make sure that concerned WAL is received and flushed before syncing > + * slot to target lsn received from the primary server. > + * > + * This check will never pass if on the primary server, user has > + * configured standby_slot_names GUC correctly, otherwise this can hit > + * frequently. > + */ > + latestFlushPtr = GetStandbyFlushRecPtr(NULL); > + if (remote_slot->confirmed_lsn > latestFlushPtr) > > BEFORE > This check will never pass if on the primary server, user has > configured standby_slot_names GUC correctly, otherwise this can hit > frequently. > > SUGGESTION (simpler way to say the same thing?) > This will always be the case unless the standby_slot_names GUC is not > correctly configured on the primary server. It is not true. It will not hit this condition "always" but has higher chances to hit it when standby_slot_names is not configured. I think you meant 'unless the standby_slot_names GUC is correctly configured'. I feel the current comment gives clear info (less confusing) and thus I have not changed it for the time being. I can consider if I get more comments there. > 4. > + /* User created slot with the same name exists, raise ERROR. */ > > /User created/User-created/ Modified. > ~~~ > > 5. synchronize_slots, and also drop_obsolete_slots > > + /* > + * Use shared lock to prevent a conflict with > + * ReplicationSlotsDropDBSlots(), trying to drop the same slot while > + * drop-database operation. 
> + */ > > (same code comment is in a couple of places) > > SUGGESTION (while -> during, etc.) > > Use a shared lock to prevent conflicting with > ReplicationSlotsDropDBSlots() trying to drop the same slot during a > drop-database operation. Modified. > ~~~ > > 6. validate_parameters_and_get_dbname > > strcmp() just for the empty string "" might be overkill. > > 6a. > + if (PrimarySlotName == NULL || strcmp(PrimarySlotName, "") == 0) > > SUGGESTION > if (PrimarySlotName == NULL || *PrimarySlotName == '\0') > > ~~ > > 6b. > + if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0) > > SUGGESTION > if (PrimaryConnInfo == NULL || *PrimaryConnInfo == '\0') Modified. thanks Shveta
On Wed, Jan 31, 2024 at 2:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > --- > +static char * > +wait_for_valid_params_and_get_dbname(void) > +{ > + char *dbname; > + int rc; > + > + /* Sanity check. */ > + Assert(enable_syncslot); > + > + for (;;) > + { > + if (validate_parameters_and_get_dbname(&dbname)) > + break; > + ereport(LOG, errmsg("skipping slot synchronization")); > + > + ProcessSlotSyncInterrupts(NULL); > > When reading this function, I expected that the slotsync worker would > resume working once the parameters became valid, but it was not > correct. For example, if I changed hot_standby_feedback from off to > on, the slotsync worker reads the config file, exits, and then > restarts. Given that the slotsync worker ends up exiting on parameter > changes anyway, why do we want to have it wait for parameters to > become valid? IIUC even if the slotsync worker exits when a parameter > is not valid, it restarts at some intervals. Thanks for the feedback Changed this functionality in v75. Now we do not exit in wait_for_valid_params_and_get_dbname() on GUC change. We re-validate the new values and if found valid, carry on with slot-syncing else continue waiting. > --- > +bool > +SlotSyncWorkerCanRestart(void) > +{ > +#define SLOTSYNC_RESTART_INTERVAL_SEC 10 > + > > IIUC depending on how busy the postmaster is and the timing, the user > could wait for 1 min to re-launch the slotsync worker. But I think the > user might want to re-launch the slotsync worker more quickly for > example when the slotsync worker restarts due to parameter changes. > IIUC SloSyncWorkerCanRestart() doesn't consider the fact that the > slotsync worker previously exited with 0 or 1. Modified this in v75. As you suggested in [1], we reset last_start_time on GUC change before proc_exit, so that the postmaster restarts worker immediately without waiting. > --- > + /* We are a normal standby */ > + valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > + Assert(!isnull); > > What do you mean by "normal standby"? > > --- > + appendStringInfo(&cmd, > + "SELECT pg_is_in_recovery(), count(*) = 1" > + " FROM pg_replication_slots" > + " WHERE slot_type='physical' AND slot_name=%s", > + quote_literal_cstr(PrimarySlotName)); > > I think we need to make "pg_replication_slots" schema-qualified. Modified. > --- > + errdetail("The primary server slot \"%s\" specified by" > + " \"%s\" is not valid.", > + PrimarySlotName, "primary_slot_name")); > > and > > + errmsg("slot sync worker will shutdown because" > + " %s is disabled", "enable_syncslot")); > > It's better to write it in one line for better greppability. Modified. [1]: https://www.postgresql.org/message-id/CAD21AoAv6FwZ6UPNTj6%3D7A%2B3O2m4utzfL8ZGS6X1EGexikG66A%40mail.gmail.com thanks Shveta
On Thu, Feb 1, 2024 at 12:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 31, 2024 at 9:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Jan 31, 2024 at 7:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > Considering my previous where we don't want to restart for a required > > > parameter change, isn't it better to avoid repeated restart (say when > > > the user gave an invalid dbname)? BTW, I think this restart interval > > > is added based on your previous complaint [1]. > > > > I think it's useful that the slotsync worker restarts immediately when > > a required parameter is changed but waits to restart when it exits > > with an error. IIUC the apply worker does so; if it restarts due to a > > subscription parameter change, it resets the last-start time so that > > the launcher will restart it without waiting. > > > > Agreed, this idea sounds good to me. > > > > > > > > > > > > --- > > > > When I dropped a database on the primary that has a failover slot, I > > > > got the following logs on the standby: > > > > > > > > 2024-01-31 17:25:21.750 JST [1103933] FATAL: replication slot "s" is > > > > active for PID 1103935 > > > > 2024-01-31 17:25:21.750 JST [1103933] CONTEXT: WAL redo at 0/3020D20 > > > > for Database/DROP: dir 1663/16384 > > > > 2024-01-31 17:25:21.751 JST [1103930] LOG: startup process (PID > > > > 1103933) exited with exit code 1 > > > > > > > > It seems that because the slotsync worker created the slot on the > > > > standby, the slot's active_pid is still valid. > > > > > > > > > > But we release the slot after sync. And we do take a shared lock on > > > the database to make the startup process wait for slotsync. There is > > > one gap which is that we don't reset active_pid for temp slots in > > > ReplicationSlotRelease(), so for temp slots such an error can occur > > > but OTOH, we immediately make the slot persistent after sync. As per > > > my understanding, it is only possible to get this error if the initial > > > sync doesn't happen and the slot remains temporary. Is that your case? > > > How did reproduce this? > > > > I created a failover slot manually on the primary and dropped the > > database where the failover slot is created. So this would not happen > > in normal cases. > > > > Right, it won't happen in normal cases (say for walsender). This can > happen in some cases even without this patch as noted in comments just > above active_pid check in ReplicationSlotsDropDBSlots(). Now, we need > to think whether we should just update the comments above active_pid > check to explain this case or try to engineer some solution for this > not-so-common case. I guess if we want a solution we need to stop > slotsync worker temporarily till the drop database WAL is applied or > something like that. > > > BTW I've tested the following switch/fail-back scenario but it seems > > not to work fine. Am I missing something? > > > > Setup: > > node1 is the primary, node2 is the physical standby for node1, and > > node3 is the subscriber connecting to node1. > > > > Steps: > > 1. [node1]: create a table and a publication for the table. > > 2. [node2]: set enable_syncslot = on and start (to receive WALs from node1). > > 3. [node3]: create a subscription with failover = true for the publication. > > 4. [node2]: promote to the new standby. > > 5. [node3]: alter subscription to connect the new primary, node2. > > 6. [node1]: stop, set enable_syncslot = on (and other required > > parameters), then start as a new standby. 
> > > > Then I got the error "exiting from slot synchronization because same > > name slot "test_sub" already exists on the standby". > > > > The logical replication slot that was created on the old primary > > (node1) has been synchronized to the old standby (node2). Therefore on > > node2, the slot's "synced" field is true. However, once node1 starts > > as the new standby with slot synchronization, the slotsync worker > > cannot synchronize the slot because the slot's "synced" field on the > > primary is false. > > > > Yeah, we avoided doing anything in this case because the user could > have manually created another slot with the same name on standby. > Unlike WAL slots can be modified on standby as we allow decoding on > standby, so we can't allow to overwrite the existing slots. We won't > be able to distinguish whether the existing slot was a slot that the > user wants to sync with primary or a slot created on standby to > perform decoding. I think in this case user first needs to drop the > slot on new standby.

Yes, but if we do a switch-back further (i.e. in the above case, node1 becomes the primary again and node2 becomes the standby again), the user doesn't need to remove failover slots since they are already marked as "synced". I wonder if we could do something automatically to reduce the user's operation. Also, if we support the slot synchronization feature on a cascading standby in the future, this operation will have to be changed.

Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
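As a concrete sketch of step 5 of the scenario above (the subscription name and connection string are placeholders taken from the scenario, not from the patch):

-- on node3 (subscriber), after node2 has been promoted
ALTER SUBSCRIPTION test_sub DISABLE;
ALTER SUBSCRIPTION test_sub CONNECTION 'host=node2 port=5432 dbname=postgres';
ALTER SUBSCRIPTION test_sub ENABLE;

Disabling first matches the recommendation in the proposed docs to disable subscriptions before promoting the standby and re-enable them after altering the connection string.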
Here are some review comments for v750001. ====== Commit message 1. This patch provides support for non-replication connection in libpqrcv_connect(). ~ 1a. /connection/connections/ ~ 1b. Maybe there needs to be a few more sentences just to describe what you mean by "non-replication connection". ~ 1c. IIUC although the 'replication' parameter is added, in this patch AFAICT every call to the connect function is still passing that argument as true. If that's correct, probably this patch comment should emphasise that this patch doesn't change any functionality at all but is just preparation for later patches which *will* pass false for the replication arg. ~~~ 2. This patch also implements a new API libpqrcv_get_dbname_from_conninfo() to extract database name from the given connection-info ~ /extract database name/the extract database name/ ====== .../libpqwalreceiver/libpqwalreceiver.c 3. + * Apart from walreceiver, the libpq-specific routines here are now being used + * by logical replication worker as well. /worker/workers/ ~~~ 4. libpqrcv_connect /* - * Establish the connection to the primary server for XLOG streaming + * Establish the connection to the primary server. + * + * The connection established could be either a replication one or + * a non-replication one based on input argument 'replication'. And further + * if it is a replication connection, it could be either logical or physical + * based on input argument 'logical'. That first comment ("could be either a replication one or...") seemed a bit meaningless (e.g. it like saying "this boolean argument can be true or false") because it doesn't describe what is the meaning of a "replication connection" versus what is a "non-replication connection". ~~~ 5. /* We can not have logical without replication */ Assert(replication || !logical); if (replication) { keys[++i] = "replication"; vals[i] = logical ? "database" : "true"; if (!logical) { /* * The database name is ignored by the server in replication mode, * but specify "replication" for .pgpass lookup. */ keys[++i] = "dbname"; vals[i] = "replication"; } } keys[++i] = "fallback_application_name"; vals[i] = appname; if (logical) { ... } ~ The Assert already says we cannot be 'logical' if not 'replication', therefore IMO it seemed strange that the code was not refactored to bring that 2nd "if (logical)" code to within the scope of the "if (replication)". e.g. Can't you do something like this: Assert(replication || !logical); if (replication) { ... if (logical) { ... } else { ... } } keys[++i] = "fallback_application_name"; vals[i] = appname; ~~~ 6. libpqrcv_get_dbname_from_conninfo + for (PQconninfoOption *opt = opts; opt->keyword != NULL; ++opt) + { + /* + * If multiple dbnames are specified, then the last one will be + * returned + */ + if (strcmp(opt->keyword, "dbname") == 0 && opt->val && + opt->val[0] != '\0') + dbname = pstrdup(opt->val); + } Should you also pfree the old dbname instead of gathering a bunch of strdups if there happened to be multiple dbnames specified ? SUGGESTION if (strcmp(opt->keyword, "dbname") == 0 && opt->val && *opt->val) { if (dbname) pfree(dbname); dbname = pstrdup(opt->val); } ====== src/include/replication/walreceiver.h 7. /* * walrcv_connect_fn * * Establish connection to a cluster. 'logical' is true if the * connection is logical, and false if the connection is physical. * 'appname' is a name associated to the connection, to use for example * with fallback_application_name or application_name. 
Returns the * details about the connection established, as defined by * WalReceiverConn for each WAL receiver module. On error, NULL is * returned with 'err' including the error generated. */ typedef WalReceiverConn *(*walrcv_connect_fn) (const char *conninfo, bool replication, bool logical, bool must_use_password, const char *appname, char **err); ~ The comment is missing any description of the new parameter 'replication'. ~~~ 8. +/* + * walrcv_get_dbname_from_conninfo_fn + * + * Returns the dbid from the primary_conninfo + */ +typedef char *(*walrcv_get_dbname_from_conninfo_fn) (const char *conninfo); + /dbid/database name/ ====== Kind Regards, Peter Smith. Fujitsu Australia
On Fri, Feb 2, 2024 at 6:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Feb 1, 2024 at 12:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > BTW I've tested the following switch/fail-back scenario but it seems > > > not to work fine. Am I missing something? > > > > > > Setup: > > > node1 is the primary, node2 is the physical standby for node1, and > > > node3 is the subscriber connecting to node1. > > > > > > Steps: > > > 1. [node1]: create a table and a publication for the table. > > > 2. [node2]: set enable_syncslot = on and start (to receive WALs from node1). > > > 3. [node3]: create a subscription with failover = true for the publication. > > > 4. [node2]: promote to the new standby. > > > 5. [node3]: alter subscription to connect the new primary, node2. > > > 6. [node1]: stop, set enable_syncslot = on (and other required > > > parameters), then start as a new standby. > > > > > > Then I got the error "exiting from slot synchronization because same > > > name slot "test_sub" already exists on the standby". > > > > > > The logical replication slot that was created on the old primary > > > (node1) has been synchronized to the old standby (node2). Therefore on > > > node2, the slot's "synced" field is true. However, once node1 starts > > > as the new standby with slot synchronization, the slotsync worker > > > cannot synchronize the slot because the slot's "synced" field on the > > > primary is false. > > > > > > > Yeah, we avoided doing anything in this case because the user could > > have manually created another slot with the same name on standby. > > Unlike WAL slots can be modified on standby as we allow decoding on > > standby, so we can't allow to overwrite the existing slots. We won't > > be able to distinguish whether the existing slot was a slot that the > > user wants to sync with primary or a slot created on standby to > > perform decoding. I think in this case user first needs to drop the > > slot on new standby. > > Yes, but if we do a switch-back further (i.e. in above case, node1 > backs to the primary again and node becomes the standby again), the > user doesn't need to remove failover slots since they are already > marked as "synced". But, I think in this case node-2's timeline will be ahead of node-1, so will we be able to make node-2 follow node-1 again without any additional steps? One thing is not clear to me after promotion the timeline changes in WAL, so the locations in slots will be as per new timelines, after that will it be safe to sync slots from the new primary to old-primary? In general, I think after failover, we recommend running pg_rewind if the old primary has to follow the new primary to account for divergence in WAL. So, not sure we can safely start syncing slots in old-primary from new-primary, consider that in the new primary, the same name slot may have dropped/re-created multiple times. We can probably reset all the fields of the existing slot the first time syncing for an existing slot or do something like that but I think it would be better to just re-create the slot. > I wonder if we could do something automatically to > reduce the user's operation. One possibility is that we forcefully drop/re-create the slot or directly overwrite the slot contents but that would probably be better done via some GUC or slot-level parameter. 
I feel we should leave this for another day. For the first version, we can document that an error will occur if a slot with the same name already exists on the standby, so users need to ensure that no such slot exists on the standby before sync. -- With Regards, Amit Kapila.
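For the documented workaround, a sketch of the manual step on the old primary (the new standby) before it starts syncing -- the slot name is the one from the scenario above, not a fixed name:

SELECT pg_drop_replication_slot('test_sub');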
Hi, On Thu, Feb 01, 2024 at 04:12:43PM +0530, Amit Kapila wrote: > On Thu, Jan 25, 2024 at 11:26 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > On Wed, Jan 24, 2024 at 04:09:15PM +0530, shveta malik wrote: > > > On Wed, Jan 24, 2024 at 2:38 PM Bertrand Drouvot > > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > > I also see Sawada-San's point and I'd vote for "sync_replication_slots". Then for > > > > the current feature I think "failover" and "on" should be the values to turn the > > > > feature on (assuming "on" would mean "all kind of supported slots"). > > > > > > Even if others agree and we change this GUC name to > > > "sync_replication_slots", I feel we should keep the values as "on" and > > > "off" currently, where "on" would mean 'sync failover slots' (docs can > > > state that clearly). > > > > I gave more thoughts on it and I think the values should only be "failover" or > > "off". > > > > The reason is that if we allow "on" and change the "on" behavior in future > > versions (to support more than failover slots) then that would change the behavior > > for the ones that used "on". > > > > I again thought on this point and feel that even if we start to sync > say physical slots their purpose would also be to allow > failover/switchover, otherwise, there is no use of syncing the slots. Yeah, I think this is a good point. > So, by that theory, we can just go for naming it as > sync_failover_slots or simply sync_slots with values 'off' and 'on'. > Now, if these are used for switchover then there is an argument that > adding 'failover' in the GUC name could be confusing but I feel > 'failover' is used widely enough that it shouldn't be a problem for > users to understand, otherwise, we can go with simple name like > sync_slots as well. > I agree and "on"/"off" looks enough to me now. As far the GUC name I've the feeling that "replication" should be part of it, and think that sync_replication_slots is fine. The reason behind is that "sync_slots" could be confusing if in the future other kind of "slot" (other than replication ones) are added in the engine. Thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Thu, Feb 01, 2024 at 05:29:15PM +0530, shveta malik wrote: > Attached v75 patch-set. Changes are: > > 1) Re-arranged the patches: > 1.1) 'libpqrc' related changes (from v74-001 and v74-004) are > separated out in v75-001 as those are independent changes. > 1.2) 'Add logical slot sync capability', 'Slot sync worker as special > process' and 'App-name changes' are now merged to single patch which > makes v75-002. > 1.3) 'Wait for physical Standby confirmation' and 'Failover Validation > Document' patches are maintained as is (v75-003 and v75-004 now). Thanks! I only looked at the commit message for v75-0002 and see that it has changed since the comment done in [1], but it still does not look correct to me. " If a logical slot on the primary is valid but is invalidated on the standby, then that slot is dropped and recreated on the standby in next sync-cycle provided the slot still exists on the primary server. It is okay to recreate such slots as long as these are not consumable on the standby (which is the case currently). This situation may occur due to the following reasons: - The max_slot_wal_keep_size on the standby is insufficient to retain WAL records from the restart_lsn of the slot. - primary_slot_name is temporarily reset to null and the physical slot is removed. - The primary changes wal_level to a level lower than logical. " If a logical decoding slot "still exists on the primary server" then the primary can not change the wal_level to lower than logical, one would get something like: "FATAL: logical replication slot "logical_slot" exists, but wal_level < logical" and then slots won't get invalidated on the standby. I've the feeling that the wal_level conflict part may need to be explained separately? (I think it's not possible that they end up being re-created on the standby for this conflict, they will be simply removed as it would mean the counterpart one on the primary does not exist anymore). [1]: https://www.postgresql.org/message-id/ZYWdSIeAMQQcLmVT%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
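As a side note, the invalidated-slot situations listed in that commit message can be observed on the standby with something like the query below. This is only a sketch: 'synced' is the column this patch set adds to pg_replication_slots, while wal_status and conflicting already exist.

SELECT slot_name, synced, wal_status, conflicting
  FROM pg_replication_slots
 WHERE synced;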
On Fri, Feb 2, 2024 at 9:50 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v750001. > > ~~~ > > 2. > This patch also implements a new API libpqrcv_get_dbname_from_conninfo() > to extract database name from the given connection-info > > ~ > > /extract database name/the extract database name/ > I think it should be "..extract the database name.." > ====== > .../libpqwalreceiver/libpqwalreceiver.c > > > 4. libpqrcv_connect > > /* > - * Establish the connection to the primary server for XLOG streaming > + * Establish the connection to the primary server. > + * > + * The connection established could be either a replication one or > + * a non-replication one based on input argument 'replication'. And further > + * if it is a replication connection, it could be either logical or physical > + * based on input argument 'logical'. > > That first comment ("could be either a replication one or...") seemed > a bit meaningless (e.g. it like saying "this boolean argument can be > true or false") because it doesn't describe what is the meaning of a > "replication connection" versus what is a "non-replication > connection". > The replication connection is a term already used in the code and docs. For example, see the error message: "pg_hba.conf rejects replication connection for host ..". It means that for communication the connection would use replication protocol instead of the normal (one used by queries) protocol. The other possibility could be to individually explain each parameter but I think that is not what we follow in this or related functions. I feel we can use a simple comment like: "This API can be used for both replication and regular connections." > ~~~ > > 5. > /* We can not have logical without replication */ > Assert(replication || !logical); > > if (replication) > { > keys[++i] = "replication"; > vals[i] = logical ? "database" : "true"; > > if (!logical) > { > /* > * The database name is ignored by the server in replication mode, > * but specify "replication" for .pgpass lookup. > */ > keys[++i] = "dbname"; > vals[i] = "replication"; > } > } > > keys[++i] = "fallback_application_name"; > vals[i] = appname; > if (logical) > { > ... > } > > ~ > > The Assert already says we cannot be 'logical' if not 'replication', > therefore IMO it seemed strange that the code was not refactored to > bring that 2nd "if (logical)" code to within the scope of the "if > (replication)". > > e.g. Can't you do something like this: > > Assert(replication || !logical); > > if (replication) > { > ... > if (logical) > { > ... > } > else > { > ... > } > } > keys[++i] = "fallback_application_name"; > vals[i] = appname; > +1. > ~~~ > > 6. libpqrcv_get_dbname_from_conninfo > > + for (PQconninfoOption *opt = opts; opt->keyword != NULL; ++opt) > + { > + /* > + * If multiple dbnames are specified, then the last one will be > + * returned > + */ > + if (strcmp(opt->keyword, "dbname") == 0 && opt->val && > + opt->val[0] != '\0') > + dbname = pstrdup(opt->val); > + } > > Should you also pfree the old dbname instead of gathering a bunch of > strdups if there happened to be multiple dbnames specified ? > > SUGGESTION > if (strcmp(opt->keyword, "dbname") == 0 && opt->val && *opt->val) > { > if (dbname) > pfree(dbname); > dbname = pstrdup(opt->val); > } > makes sense and shouldn't we need to call PQconninfoFree(opts); at the end of libpqrcv_get_dbname_from_conninfo() similar to libpqrcv_check_conninfo()? -- With Regards, Amit Kapila.
On Fri, Feb 2, 2024 at 1:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Feb 2, 2024 at 6:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Feb 1, 2024 at 12:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > BTW I've tested the following switch/fail-back scenario but it seems > > > > not to work fine. Am I missing something? > > > > > > > > Setup: > > > > node1 is the primary, node2 is the physical standby for node1, and > > > > node3 is the subscriber connecting to node1. > > > > > > > > Steps: > > > > 1. [node1]: create a table and a publication for the table. > > > > 2. [node2]: set enable_syncslot = on and start (to receive WALs from node1). > > > > 3. [node3]: create a subscription with failover = true for the publication. > > > > 4. [node2]: promote to the new standby. > > > > 5. [node3]: alter subscription to connect the new primary, node2. > > > > 6. [node1]: stop, set enable_syncslot = on (and other required > > > > parameters), then start as a new standby. > > > > > > > > Then I got the error "exiting from slot synchronization because same > > > > name slot "test_sub" already exists on the standby". > > > > > > > > The logical replication slot that was created on the old primary > > > > (node1) has been synchronized to the old standby (node2). Therefore on > > > > node2, the slot's "synced" field is true. However, once node1 starts > > > > as the new standby with slot synchronization, the slotsync worker > > > > cannot synchronize the slot because the slot's "synced" field on the > > > > primary is false. > > > > > > > > > > Yeah, we avoided doing anything in this case because the user could > > > have manually created another slot with the same name on standby. > > > Unlike WAL slots can be modified on standby as we allow decoding on > > > standby, so we can't allow to overwrite the existing slots. We won't > > > be able to distinguish whether the existing slot was a slot that the > > > user wants to sync with primary or a slot created on standby to > > > perform decoding. I think in this case user first needs to drop the > > > slot on new standby. > > > > Yes, but if we do a switch-back further (i.e. in above case, node1 > > backs to the primary again and node becomes the standby again), the > > user doesn't need to remove failover slots since they are already > > marked as "synced". > > But, I think in this case node-2's timeline will be ahead of node-1, > so will we be able to make node-2 follow node-1 again without any > additional steps? One thing is not clear to me after promotion the > timeline changes in WAL, so the locations in slots will be as per new > timelines, after that will it be safe to sync slots from the new > primary to old-primary? In order for node-1 to go back to the primary again, it needs to be promoted. That is, the node-1's timeline increments and node-2 follows node-1. > > In general, I think after failover, we recommend running pg_rewind if > the old primary has to follow the new primary to account for > divergence in WAL. So, not sure we can safely start syncing slots in > old-primary from new-primary, consider that in the new primary, the > same name slot may have dropped/re-created multiple times. Right. And I missed the point that all replication slots are removed after pg_rewind. It would not be a problem in a failover case. But probably we still need to consider a switchover cas (i.e. switch roles with clean shutdowns) since it doesn't require to run pg_rewind? 
> We can > probably reset all the fields of the existing slot the first time > syncing for an existing slot or do something like that but I think it > would be better to just re-create the slot. > > > > I wonder if we could do something automatically to > > reduce the user's operation. > > One possibility is that we forcefully drop/re-create the slot or > directly overwrite the slot contents but that would probably be better > done via some GUC or slot-level parameter. I feel we should leave this > for another day, for the first version, we can document that an error > will occur if the same name slots on standby exist, so users need to > ensure that there shouldn't be an existing same name slots on standby > before sync. > Hmm, I'm afraid it might not be user-friendly. But probably we can leave it for now as it's not impossible. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Here are some review comments for v750002. (this is a WIP but this is what I found so far...) ====== doc/src/sgml/protocol.sgml 1. > > 2. ALTER_REPLICATION_SLOT slot_name ( option [, ...] ) # > > > > <para> > > - If true, the slot is enabled to be synced to the standbys. > > + If true, the slot is enabled to be synced to the standbys > > + so that logical replication can be resumed after failover. > > </para> > > > > This also should have the sentence "The default is false.", e.g. the > > same as the same option in CREATE_REPLICATION_SLOT says. > > I have not added this. I feel the default value related details should > be present in the 'CREATE' part, it is not meaningful for the "ALTER" > part. ALTER does not have any defaults, it just modifies the options > given by the user. You are correct. My mistake. ====== src/backend/postmaster/bgworker.c 2. #include "replication/logicalworker.h" +#include "replication/worker_internal.h" #include "storage/dsm.h" Is this change needed when the rest of the code is removed? ====== src/backend/replication/logical/slotsync.c 3. synchronize_one_slot > > 3. > > + /* > > + * Make sure that concerned WAL is received and flushed before syncing > > + * slot to target lsn received from the primary server. > > + * > > + * This check will never pass if on the primary server, user has > > + * configured standby_slot_names GUC correctly, otherwise this can hit > > + * frequently. > > + */ > > + latestFlushPtr = GetStandbyFlushRecPtr(NULL); > > + if (remote_slot->confirmed_lsn > latestFlushPtr) > > > > BEFORE > > This check will never pass if on the primary server, user has > > configured standby_slot_names GUC correctly, otherwise this can hit > > frequently. > > > > SUGGESTION (simpler way to say the same thing?) > > This will always be the case unless the standby_slot_names GUC is not > > correctly configured on the primary server. > > It is not true. It will not hit this condition "always" but has higher > chances to hit it when standby_slot_names is not configured. I think > you meant 'unless the standby_slot_names GUC is correctly configured'. > I feel the current comment gives clear info (less confusing) and thus > I have not changed it for the time being. I can consider if I get more > comments there. Hmm. I meant what I wrote. The "This" of my suggested text refers to the previous sentence in the comment (not about "hitting" ?? your condition). TBH, regardless of the wording you choose, I think it will be much clearer to move the comment to be inside the if. SUGGESTION /* * Make sure that concerned WAL is received and flushed before syncing * slot to target lsn received from the primary server. */ latestFlushPtr = GetStandbyFlushRecPtr(NULL); if (remote_slot->confirmed_lsn > latestFlushPtr) { /* * Can get here only when if GUC 'standby_slot_names' on the primary * server was not configured correctly. */ ... } ~~~ 4. +static bool +validate_parameters_and_get_dbname(char **dbname) +{ + /* + * A physical replication slot(primary_slot_name) is required on the + * primary to ensure that the rows needed by the standby are not removed + * after restarting, so that the synchronized slot on the standby will not + * be invalidated. 
+ */ + if (PrimarySlotName == NULL || *PrimarySlotName == '\0') + { + ereport(LOG, + /* translator: %s is a GUC variable name */ + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + errhint("\"%s\" must be defined.", "primary_slot_name")); + return false; + } + + /* + * hot_standby_feedback must be enabled to cooperate with the physical + * replication slot, which allows informing the primary about the xmin and + * catalog_xmin values on the standby. + */ + if (!hot_standby_feedback) + { + ereport(LOG, + /* translator: %s is a GUC variable name */ + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + errhint("\"%s\" must be enabled.", "hot_standby_feedback")); + return false; + } + + /* + * Logical decoding requires wal_level >= logical and we currently only + * synchronize logical slots. + */ + if (wal_level < WAL_LEVEL_LOGICAL) + { + ereport(LOG, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + errhint("\"wal_level\" must be >= logical.")); + return false; + } + + /* + * The primary_conninfo is required to make connection to primary for + * getting slots information. + */ + if (PrimaryConnInfo == NULL || *PrimaryConnInfo == '\0') + { + ereport(LOG, + /* translator: %s is a GUC variable name */ + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + errhint("\"%s\" must be defined.", "primary_conninfo")); + return false; + } + + /* + * The slot sync worker needs a database connection for walrcv_exec to + * work. + */ + *dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); + if (*dbname == NULL) + { + ereport(LOG, + + /* + * translator: 'dbname' is a specific option; %s is a GUC variable + * name + */ + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + errhint("'dbname' must be specified in \"%s\".", "primary_conninfo")); + return false; + } + + return true; +} I wonder if it is better to log all the problems in one go instead of making users stumble onto them one at a time after fixing one and then hitting the next problem. e.g. just set some variable "all_ok = false;" each time instead of all the "return false;" Then at the end of the function just "return all_ok;" ====== Kind Regards, Peter Smith. Fujitsu Australia
On Thu, Feb 1, 2024 at 5:29 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Feb 1, 2024 at 2:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Agreed, and I am fine with merging 0001, 0002, and 0004 as suggested > > by you though I have a few minor comments on 0002 and 0004. I was > > thinking about what will be a logical way to split the slot sync > > worker patch (combined result of 0001, 0002, and 0004), and one idea > > occurred to me is that we can have the first patch as > > synchronize_solts() API and the functionality required to implement > > that API then the second patch would be a slot sync worker which uses > > that API to synchronize slots and does all the required validations. > > Any thoughts? > > If we shift 'synchronize_slots()' to the first patch but there is no > caller of it, we may have a compiler warning for the same. The only > way it can be done is if we temporarily add SQL function on standby > which uses 'synchronize_slots()'. This SQL function can then be > removed in later patches where we actually have a caller for > 'synchronize_slots'. > Can such a SQL function say pg_synchronize_slots() which can sync all slots that have a failover flag set be useful in general apart from just writing tests for this new API? I am thinking maybe users want more control over when to sync the slots and write their bgworker or simply do it just before shutdown once (sort of planned switchover) or at some other pre-defined times. BTW, we also have pg_log_standby_snapshot() which otherwise would be done periodically by background processes. > > 1) Re-arranged the patches: > 1.1) 'libpqrc' related changes (from v74-001 and v74-004) are > separated out in v75-001 as those are independent changes. Bertrand, Sawada-San, and others, do you see a problem with such a split? Can we go ahead with v75_0001 separately after fixing the open comments? -- With Regards, Amit Kapila.
On Fri, Feb 2, 2024 at 10:53 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Thu, Feb 01, 2024 at 04:12:43PM +0530, Amit Kapila wrote: > > On Thu, Jan 25, 2024 at 11:26 AM Bertrand Drouvot > > > > I again thought on this point and feel that even if we start to sync > > say physical slots their purpose would also be to allow > > failover/switchover, otherwise, there is no use of syncing the slots. > > Yeah, I think this is a good point. > > > So, by that theory, we can just go for naming it as > > sync_failover_slots or simply sync_slots with values 'off' and 'on'. > > Now, if these are used for switchover then there is an argument that > > adding 'failover' in the GUC name could be confusing but I feel > > 'failover' is used widely enough that it shouldn't be a problem for > > users to understand, otherwise, we can go with simple name like > > sync_slots as well. > > > > I agree and "on"/"off" looks enough to me now. As far the GUC name I've the > feeling that "replication" should be part of it, and think that sync_replication_slots > is fine. The reason behind is that "sync_slots" could be confusing if in the > future other kind of "slot" (other than replication ones) are added in the engine. > +1 for sync_replication_slots with values as 'on'/'off'. -- With Regards, Amit Kapila.
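Assuming the GUC ends up with this name and stays changeable at reload (otherwise a restart would be needed), turning the feature on for a standby would then look like this sketch:

ALTER SYSTEM SET sync_replication_slots = on;
SELECT pg_reload_conf();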
Hi, On Fri, Feb 02, 2024 at 12:25:30PM +0530, Amit Kapila wrote: > On Thu, Feb 1, 2024 at 5:29 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Thu, Feb 1, 2024 at 2:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Agreed, and I am fine with merging 0001, 0002, and 0004 as suggested > > > by you though I have a few minor comments on 0002 and 0004. I was > > > thinking about what will be a logical way to split the slot sync > > > worker patch (combined result of 0001, 0002, and 0004), and one idea > > > occurred to me is that we can have the first patch as > > > synchronize_solts() API and the functionality required to implement > > > that API then the second patch would be a slot sync worker which uses > > > that API to synchronize slots and does all the required validations. > > > Any thoughts? > > > > If we shift 'synchronize_slots()' to the first patch but there is no > > caller of it, we may have a compiler warning for the same. The only > > way it can be done is if we temporarily add SQL function on standby > > which uses 'synchronize_slots()'. This SQL function can then be > > removed in later patches where we actually have a caller for > > 'synchronize_slots'. > > > > Can such a SQL function say pg_synchronize_slots() which can sync all > slots that have a failover flag set be useful in general apart from > just writing tests for this new API? I am thinking maybe users want > more control over when to sync the slots and write their bgworker or > simply do it just before shutdown once (sort of planned switchover) or > at some other pre-defined times. Big +1 for having this kind of function in user's hands (as the standby's slots may be lagging behind during a switchover for example). As far the name, I think it would make sense to add "replication" or "repl" something like pg_sync_replication_slots()? (that would be aligned with pg_create_logical_replication_slot() and friends). > BTW, we also have > pg_log_standby_snapshot() which otherwise would be done periodically > by background processes. > > > > > 1) Re-arranged the patches: > > 1.1) 'libpqrc' related changes (from v74-001 and v74-004) are > > separated out in v75-001 as those are independent changes. > > Bertrand, Sawada-San, and others, do you see a problem with such a > split? Can we go ahead with v75_0001 separately after fixing the open > comments? I think that makes sense, specially if we're also creating a user callable function to sync the slot(s) at wish. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
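With the function named as proposed here, the planned-switchover usage would boil down to something like this on the standby (a sketch of intended usage only, not of the implementation):

SELECT pg_sync_replication_slots();
-- once the slots are confirmed in sync:
SELECT pg_promote();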
On Fri, Feb 2, 2024 at 1:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > +1 for sync_replication_slots with values as 'on'/'off'. Okay. PFA v76 which changes this GUC name as suggested. It also addressed comments from Peter given in [1] and [2]. [1]: https://www.postgresql.org/message-id/CAHut%2BPvFj8ZOx8-YdMWBS9vxMcmgxwOcA%2BYuJVgrayjhsiszHQ%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAHut%2BPtRJ_4x3re0bPn791PTL6kc2TRm1A2EPY1kjTCax_9F%3DA%40mail.gmail.com thanks Shveta
Attachment
On Fri, Feb 2, 2024 at 12:25 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v750002. Thanks for the feedback Peter. Addressed all in v76 except one. > (this is a WIP but this is what I found so far...) > I wonder if it is better to log all the problems in one go instead of > making users stumble onto them one at a time after fixing one and then > hitting the next problem. e.g. just set some variable "all_ok = > false;" each time instead of all the "return false;" > > Then at the end of the function just "return all_ok;" If we do this way, then we need to find a way to combine the msgs as well, otherwise the same msg will be repeated multiple times. For the concerned functionality (which needs one time config effort by user), I feel the existing way looks okay. We may consider optimizing it if we get more comments here. thanks Shveta
On Friday, February 2, 2024 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Feb 1, 2024 at 5:29 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Thu, Feb 1, 2024 at 2:35 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > Agreed, and I am fine with merging 0001, 0002, and 0004 as suggested > > > by you though I have a few minor comments on 0002 and 0004. I was > > > thinking about what will be a logical way to split the slot sync > > > worker patch (combined result of 0001, 0002, and 0004), and one idea > > > occurred to me is that we can have the first patch as > > > synchronize_solts() API and the functionality required to implement > > > that API then the second patch would be a slot sync worker which > > > uses that API to synchronize slots and does all the required validations. > > > Any thoughts? > > > > If we shift 'synchronize_slots()' to the first patch but there is no > > caller of it, we may have a compiler warning for the same. The only > > way it can be done is if we temporarily add SQL function on standby > > which uses 'synchronize_slots()'. This SQL function can then be > > removed in later patches where we actually have a caller for > > 'synchronize_slots'. > > > > Can such a SQL function say pg_synchronize_slots() which can sync all slots that > have a failover flag set be useful in general apart from just writing tests for this > new API? I am thinking maybe users want more control over when to sync the > slots and write their bgworker or simply do it just before shutdown once (sort > of planned switchover) or at some other pre-defined times. BTW, we also have > pg_log_standby_snapshot() which otherwise would be done periodically by > background processes. Here is an attempt for this. The slotsync worker patch is now splitted into two patches(0002 and 0003). I also adjusted the doc, comments and tests for the new pg_synchronize_slots() function. Best Regards, Hou zj
Attachment
On Monday, February 5, 2024 10:17 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, February 2, 2024 2:56 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Thu, Feb 1, 2024 at 5:29 PM shveta malik <shveta.malik@gmail.com> > wrote: > > > > > > On Thu, Feb 1, 2024 at 2:35 PM Amit Kapila <amit.kapila16@gmail.com> > > wrote: > > > > > > > > Agreed, and I am fine with merging 0001, 0002, and 0004 as > > > > suggested by you though I have a few minor comments on 0002 and > > > > 0004. I was thinking about what will be a logical way to split the > > > > slot sync worker patch (combined result of 0001, 0002, and 0004), > > > > and one idea occurred to me is that we can have the first patch as > > > > synchronize_solts() API and the functionality required to > > > > implement that API then the second patch would be a slot sync > > > > worker which uses that API to synchronize slots and does all the required > validations. > > > > Any thoughts? > > > > > > If we shift 'synchronize_slots()' to the first patch but there is no > > > caller of it, we may have a compiler warning for the same. The only > > > way it can be done is if we temporarily add SQL function on standby > > > which uses 'synchronize_slots()'. This SQL function can then be > > > removed in later patches where we actually have a caller for > > > 'synchronize_slots'. > > > > > > > Can such a SQL function say pg_synchronize_slots() which can sync all > > slots that have a failover flag set be useful in general apart from > > just writing tests for this new API? I am thinking maybe users want > > more control over when to sync the slots and write their bgworker or > > simply do it just before shutdown once (sort of planned switchover) or > > at some other pre-defined times. BTW, we also have > > pg_log_standby_snapshot() which otherwise would be done periodically > > by background processes. > > Here is an attempt for this. The slotsync worker patch is now splitted into two > patches(0002 and 0003). I also adjusted the doc, comments and tests for the > new pg_synchronize_slots() function. There was one miss in the doc that cause CFbot failure, attach the correct version V77_2 here. There are no code changes compared to V77 version. Best Regards, Hou zj
Attachment
On Fri, Feb 2, 2024 at 11:18 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Feb 2, 2024 at 12:25 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Here are some review comments for v750002. > > Thanks for the feedback Peter. Addressed all in v76 except one. > > > (this is a WIP but this is what I found so far...) > > > I wonder if it is better to log all the problems in one go instead of > > making users stumble onto them one at a time after fixing one and then > > hitting the next problem. e.g. just set some variable "all_ok = > > false;" each time instead of all the "return false;" > > > > Then at the end of the function just "return all_ok;" > > If we do this way, then we need to find a way to combine the msgs as > well, otherwise the same msg will be repeated multiple times. For the > concerned functionality (which needs one time config effort by user), > I feel the existing way looks okay. We may consider optimizing it if > we get more comments here. > I don't think combining messages is necessary; I considered these all as different (not the same msg repeated multiple times) since they all have different errhints. I felt a user would only know to make a configuration correction when they are informed something is wrong, so my review point was we could tell them all the wrong things up-front so then those can all be fixed with a "one time config effort by user". Otherwise, if multiple settings (e.g. from the list below) have wrong values, I imagined the user will fix the first reported one, then the next bad config will be reported, then the user will fix that one, then the next bad config will be reported, then the user will fix that one, and so on. It just seemed potentially/unnecessarilly painful. - errhint("\"%s\" must be defined.", "primary_slot_name")); - errhint("\"%s\" must be enabled.", "hot_standby_feedback")); - errhint("\"wal_level\" must be >= logical.")); - errhint("\"%s\" must be defined.", "primary_conninfo")); - errhint("'dbname' must be specified in \"%s\".", "primary_conninfo")); ~ Anyway, I just wanted to explain my review comment some more because maybe my reason wasn't clear the first time. Whatever your decision is, it is fine by me. ====== Kind Regards, Peter Smith. Fujitsu Australia
On Thursday, February 1, 2024 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Feb 1, 2024 at 8:15 AM Euler Taveira <euler@eulerto.com> wrote: > > > > On Mon, Jan 29, 2024, at 10:17 AM, Zhijie Hou (Fujitsu) wrote: > > > > Attach the V72-0001 which addressed above comments, other patches will > be > > rebased and posted after pushing first patch. Thanks Shveta for helping > address > > the comments. > > > > > > While working on another patch I noticed a new NOTICE message: > > > > NOTICE: changed the failover state of replication slot "foo" on publisher to > false > > > > I wasn't paying much attention to this thread then I start reading the 2 > > patches that was recently committed. The message above surprises me > because > > pg_createsubscriber starts to emit this message. The reason is that it doesn't > > create the replication slot during the CREATE SUBSCRIPTION. Instead, it > creates > > the replication slot with failover = false and no such option is informed > > during CREATE SUBSCRIPTION which means it uses the default value (failover > = > > false). I expect that I don't see any message because it is *not* changing the > > behavior. I was wrong. It doesn't check the failover state on publisher, it > > just executes walrcv_alter_slot() and emits a message. > > > > IMO if we are changing an outstanding property on node A from node B, > node B > > already knows (or might know) about that behavior change (because it is > sending > > the command), however, node A doesn't (unless log_replication_commands > = on -- > > it is not the default). > > > > Do we really need this message as NOTICE? > > > > The reason for adding this NOTICE was to keep it similar to other > Notice messages in these commands like create/drop slot. However, here > the difference is we may not have altered the slot as the property is > already the same as we want to set on the publisher. So, I am not sure > whether we should follow the existing behavior or just get rid of it. > And then do we remove similar NOTICE in AlterSubscription() as well? > Normally, I think NOTICE intends to let users know if we did anything > with slots while executing subscription commands. Does anyone else > have an opinion on this point? > > A related point, I think we can avoid setting the 'failover' property > in ReplicationSlotAlter() if it is not changed, the advantage is we > will avoid saving slots. OTOH, this won't be a frequent operation so > we can leave it as it is as well. Here is a patch to remove the NOTICE and improve the ReplicationSlotAlter. The patch also includes few cleanups based on Peter's feedback. Best Regards, Hou zj
Attachment
On Mon, Feb 5, 2024 at 1:29 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> On Monday, February 5, 2024 10:17 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> There was one miss in the doc that cause CFbot failure,
> attach the correct version V77_2 here. There are no code changes compared to V77 version.
>
> Best Regards,
> Hou zj
Just noticed that doc/src/sgml/config.sgml still refers to enable_syncslot instead of sync_replication_slots:
The standbys corresponding to the physical replication slots in
<varname>standby_slot_names</varname> must configure
<literal>enable_syncslot = true</literal> so they can receive
failover logical slots changes from the primary.
regards,
Ajin Cherian
Fujitsu Australia
On Mon, Feb 5, 2024 at 7:59 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > I have pushed the first patch. Next, a few comments on 0002 are as follows: 1. +static bool +validate_parameters_and_get_dbname(char **dbname, int elevel) For 0002, we don't need dbname as out parameter. Also, we can rename the function to validate_slotsync_params() or something like that. Also, for 0003, we don't need to get the dbname from wait_for_valid_params_and_get_dbname(), instead there could be a common function that can be invoked from validate_slotsync_params() and caller of wait function that caches the value of dbname. The other parameter elevel is also not required for 0002. 2. + /* + * Make sure that concerned WAL is received and flushed before syncing + * slot to target lsn received from the primary server. + */ + latestFlushPtr = GetStandbyFlushRecPtr(NULL); + if (remote_slot->confirmed_lsn > latestFlushPtr) + { + /* + * Can get here only if GUC 'standby_slot_names' on the primary server + * was not configured correctly. + */ + ereport(LOG, + errmsg("skipping slot synchronization as the received slot sync" + " LSN %X/%X for slot \"%s\" is ahead of the standby position %X/%X", + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), + remote_slot->name, + LSN_FORMAT_ARGS(latestFlushPtr))); + + return false; In the case of a function invocation, this should be an ERROR. We can move the comment related to 'standby_slot_names' to a later patch where that GUC is introduced. See, if there are other LOGs in the patch that needs to be converted to ERROR. 3. The function pg_sync_replication_slots() should be in file slotfuncs.c and common functionality between this function and slotsync worker can be exposed via a function in slotsync.c. 4. /* + * Using the specified primary server connection, check whether we are + * cascading standby and validates primary_slot_name for + * non-cascading-standbys. + */ + check_primary_info(wrconn, &am_cascading_standby, + &primary_slot_invalid, ERROR); + + if (am_cascading_standby) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot synchronize replication slots to a cascading standby")); primary_slot_invalid is not used in this patch. I think we can allow the function can be executed on cascading_standby as well because this will be used for the planned switchover. 5. I don't see any problem with allowing concurrent processes trying to sync the same slot at the same time as each process will acquire the slot and only one process can acquire the slot at a time, the other will get an ERROR. -- With Regards, Amit Kapila.
On Mon, Feb 5, 2024 at 10:57 AM Ajin Cherian <itsajin@gmail.com> wrote: > > Just noticed that doc/src/sgml/config.sgml still refers to enable_syncslot instead of sync_replication_slots: > > The standbys corresponding to the physical replication slots in > <varname>standby_slot_names</varname> must configure > <literal>enable_syncslot = true</literal> so they can receive > failover logical slots changes from the primary.

Thanks Ajin for pointing this out. Here are the v78 patches with that corrected.

Other changes are:
1) Rebased the patches as the v77-001 is now pushed.
2) Enabled executing pg_sync_replication_slots() on a cascading standby.
3) Rearranged the code around parameter validity checks. Changed function names and changed how dbname is extracted, as suggested by Amit offlist.
4) Rearranged the code around check_primary_info(). Removed output args.
5) Few other trivial changes.

thanks Shveta
Attachment
On Mon, Feb 5, 2024 at 4:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have pushed the first patch. Next, a few comments on 0002 are as follows: Thanks for the feedback Amit. Some of these are addressed in v78. Rest will be addressed in the next version. > 1. > +static bool > +validate_parameters_and_get_dbname(char **dbname, int elevel) > > For 0002, we don't need dbname as out parameter. Also, we can rename > the function to validate_slotsync_params() or something like that. > Also, for 0003, we don't need to get the dbname from > wait_for_valid_params_and_get_dbname(), instead there could be a > common function that can be invoked from validate_slotsync_params() > and caller of wait function that caches the value of dbname. > > The other parameter elevel is also not required for 0002. > > 2. > + /* > + * Make sure that concerned WAL is received and flushed before syncing > + * slot to target lsn received from the primary server. > + */ > + latestFlushPtr = GetStandbyFlushRecPtr(NULL); > + if (remote_slot->confirmed_lsn > latestFlushPtr) > + { > + /* > + * Can get here only if GUC 'standby_slot_names' on the primary server > + * was not configured correctly. > + */ > + ereport(LOG, > + errmsg("skipping slot synchronization as the received slot sync" > + " LSN %X/%X for slot \"%s\" is ahead of the standby position %X/%X", > + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), > + remote_slot->name, > + LSN_FORMAT_ARGS(latestFlushPtr))); > + > + return false; > > In the case of a function invocation, this should be an ERROR. We can > move the comment related to 'standby_slot_names' to a later patch > where that GUC is introduced. See, if there are other LOGs in the > patch that needs to be converted to ERROR. > > 3. The function pg_sync_replication_slots() should be in file > slotfuncs.c and common functionality between this function and > slotsync worker can be exposed via a function in slotsync.c. > > 4. > /* > + * Using the specified primary server connection, check whether we are > + * cascading standby and validates primary_slot_name for > + * non-cascading-standbys. > + */ > + check_primary_info(wrconn, &am_cascading_standby, > + &primary_slot_invalid, ERROR); > + > + if (am_cascading_standby) > + ereport(ERROR, > + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot synchronize replication slots to a cascading standby")); > > primary_slot_invalid is not used in this patch. I think we can allow > the function can be executed on cascading_standby as well because this > will be used for the planned switchover. > > 5. I don't see any problem with allowing concurrent processes trying > to sync the same slot at the same time as each process will acquire > the slot and only one process can acquire the slot at a time, the > other will get an ERROR. > > -- > With Regards, > Amit Kapila.
On Mon, Feb 5, 2024 at 8:26 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Feb 5, 2024 at 10:57 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > Just noticed that doc/src/sgml/config.sgml still refers to enable_synclot instead of sync_replication_slots: > > > > The standbys corresponding to the physical replication slots in > > <varname>standby_slot_names</varname> must configure > > <literal>enable_syncslot = true</literal> so they can receive > > failover logical slots changes from the primary. > > Thanks Ajin for pointing this out. Here are v78 patches, corrected there. > > Other changes are: > > 1) Rebased the patches as the v77-001 is now pushed. > 2) Enabled executing pg_sync_replication_slots() on cascading-standby. > 3) Rearranged the code around parameter validity checks. Changed > function names and changed the way how dbname is extracted as > suggested by Amit offlist. > 4) Rearranged the code around check_primary_info(). Removed output args. > 5) Few other trivial changes. > Thank you for updating the patch! Here are some comments: --- Since Two processes (e.g. the slotsync worker and pg_sync_replication_slots()) concurrently fetch and update the slot information, there is a race condition where slot's confirmed_flush_lsn goes backward. . We have the following check but it doesn't prevent the slot's confirmed_flush_lsn from moving backward if the restart_lsn does't change: /* * Sanity check: As long as the invalidations are handled * appropriately as above, this should never happen. */ if (remote_slot->restart_lsn < slot->data.restart_lsn) elog(ERROR, "cannot synchronize local slot \"%s\" LSN(%X/%X)" " to remote slot's LSN(%X/%X) as synchronization" " would move it backwards", remote_slot->name, LSN_FORMAT_ARGS(slot->data.restart_lsn), LSN_FORMAT_ARGS(remote_slot->restart_lsn)); --- + It is recommended that subscriptions are first disabled before promoting f+ the standby and are enabled back after altering the connection string. I think it's better to describe the reason why it's recommended to disable subscriptions before the standby promotion. --- +/* Slot sync worker objects */ +extern PGDLLIMPORT char *PrimaryConnInfo; +extern PGDLLIMPORT char *PrimarySlotName; These two variables are declared also in xlogrecovery.h. Is it intentional? If so, I think it's better to write comments. --- Global functions and variables used by the slotsync worker are declared in logicalworker.h and worker_internal.h. But is it really okay to make a dependency between the slotsync worker and logical replication workers? IIUC the slotsync worker is conceptually a separate feature from the logical replication. I think the slotsync worker can have its own header file. --- + SELECT r.srsubid AS subid, CONCAT('pg_' || srsubid || '_sync_' || srrelid || '_' || ctl.system_identifier) AS slotname and + SELECT (CASE WHEN r.srsubstate = 'f' THEN pg_replication_origin_progress(CONCAT('pg_' || r.srsubid || '_' || r.srrelid), false) If we use CONCAT function, we can replace '||' with ','. --- + Confirm that the standby server is not lagging behind the subscribers. + This step can be skipped if + <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link> + has been correctly configured. How can the user confirm if standby_slot_names is correctly configured? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
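Regarding the last point, one possible way (only a sketch) to eyeball "the standby is not lagging behind the subscribers" on the primary is to compare the failover slots' confirmed_flush_lsn with the physical standby's reported flush position. The 'failover' column is the one added earlier in this thread; 'standby1' below is a placeholder application_name for the standby's walreceiver connection:

SELECT s.slot_name, s.confirmed_flush_lsn, r.flush_lsn,
       s.confirmed_flush_lsn <= r.flush_lsn AS standby_caught_up
  FROM pg_replication_slots s
       CROSS JOIN pg_stat_replication r
 WHERE s.failover AND r.application_name = 'standby1';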
On Friday, February 2, 2024 2:03 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Thu, Feb 01, 2024 at 05:29:15PM +0530, shveta malik wrote: > > Attached v75 patch-set. Changes are: > > > > 1) Re-arranged the patches: > > 1.1) 'libpqrc' related changes (from v74-001 and v74-004) are > > separated out in v75-001 as those are independent changes. > > 1.2) 'Add logical slot sync capability', 'Slot sync worker as special > > process' and 'App-name changes' are now merged to single patch which > > makes v75-002. > > 1.3) 'Wait for physical Standby confirmation' and 'Failover Validation > > Document' patches are maintained as is (v75-003 and v75-004 now). > > Thanks! > > I only looked at the commit message for v75-0002 and see that it has changed > since the comment done in [1], but it still does not look correct to me. > > " > If a logical slot on the primary is valid but is invalidated on the standby, then > that slot is dropped and recreated on the standby in next sync-cycle provided > the slot still exists on the primary server. It is okay to recreate such slots as long > as these are not consumable on the standby (which is the case currently). This > situation may occur due to the following reasons: > - The max_slot_wal_keep_size on the standby is insufficient to retain WAL > records from the restart_lsn of the slot. > - primary_slot_name is temporarily reset to null and the physical slot is > removed. > - The primary changes wal_level to a level lower than logical. > " > > If a logical decoding slot "still exists on the primary server" then the primary > can not change the wal_level to lower than logical, one would get something > like: > > "FATAL: logical replication slot "logical_slot" exists, but wal_level < logical" > > and then slots won't get invalidated on the standby. I've the feeling that the > wal_level conflict part may need to be explained separately? (I think it's not > possible that they end up being re-created on the standby for this conflict, > they will be simply removed as it would mean the counterpart one on the > primary does not exist anymore). This is possible in some extreme cases, because the slot is synced asynchronously. For example: If on the primary the wal_level is changed to 'replica' and then changed back to 'logical', the standby would receive two XLOG_PARAMETER_CHANGE wals. And before the standby replay these wals, user can create a failover slot on the primary because the wal_level is logical, and if the slotsync worker has synced the slots before startup process replay the XLOG_PARAMETER_CHANGE, then when replaying the XLOG_PARAMETER_CHANGE, the just synced slot will be invalidated. Although I think it doesn't seem a real world case, so I am not sure is it worth separate explanation. Best Regards, Hou zj
Here are some review comments for v78-0001 ====== GENERAL 1. Should the "Chapter 30 Logical Replication" at least have another section that mentions the feature of slot synchronization so the information about it is easier to find? It doesn't need to say much -- just give a reference to the other sections where it is explained already. ====== Commit Message 2. A new 'synced' flag is introduced for replication slots, indicating whether the slot has been synchronized from the primary server. On a standby, synced slots cannot be dropped or consumed, and any attempt to perform logical decoding on them will result in an error. ~ It doesn't say *where* is this new 'synced' flag. ~~~ 3. The logical replication slots on the primary can be synchronized to the hot standby by enabling the failover option during slot creation and calling pg_sync_replication_slots() function on the standby. For the synchronization to work, it is mandatory to have a physical replication slot between the primary and the standby, hot_standby_feedback must be enabled on the standby and a valid dbname must be specified in primary_conninfo. ~ 3a. "by enabling the failover option during slot creation" -- Should you elaborate more about that part by mentioning the failover parameter of the create slot API, or the "failover" option of the CREATE SUBSCRIPTION? ~ 3b. I find it easy to read if the GUC parameters are quoted, but YMMV. /hot_standby_feedback/'hot_standby_feedback'/ /primary_conninfo/'primary_conninfo'/ ~~~ 4. If a logical slot on the primary is valid but is invalidated on the standby, then that slot is dropped and can be recreated on the standby in next pg_sync_replication_slots() call provided the slot still exists on the primary server. It is okay to recreate such slots as long as these are not consumable on the standby (which is the case currently). This situation may occur due to the following reasons: - The max_slot_wal_keep_size on the standby is insufficient to retain WAL records from the restart_lsn of the slot. - primary_slot_name is temporarily reset to null and the physical slot is removed. - The primary changes wal_level to a level lower than logical. ~ 4a. /and can be recreated/but will be recreated/ ~ 4b. (As before, I would quote the GUCs for easier readability) /max_slot_wal_keep_size/'max_slot_wal_keep_size'/ /primary_slot_name/'primary_slot_name'/ /'wal_level'/wal_level/ ====== doc/src/sgml/config.sgml 5. + <para> + To synchronize replication slots (see + <xref linkend="logicaldecoding-replication-slots-synchronization"/>), + it is also necessary to specify a valid <literal>dbname</literal> + in the <varname>primary_conninfo</varname> string. This will only be + used for slot synchronization. It is ignored for streaming. </para> Somehow, I thought the below wording is slightly better (and it also matches the linked section title). YMMV. /To synchronize replication slots/For replication slot synchronization/ ====== src/sgml/func.sgml 6. + <row> + <entry id="pg-sync-replication-slots" role="func_table_entry"><para role="func_signature"> + <indexterm> + <primary>pg_sync_replication_slots</primary> Currently, this is in section "9.27.6 Replication Management Functions", but I wondered if it should also have some mention in the "9.27.4. Recovery Control Functions" section. ====== doc/src/sgml/logicaldecoding.sgml 7. 
+ <title>Replication Slot Synchronization</title> + <para> + A logical replication slot on the primary can be synchronized to the hot + standby by enabling the <literal>failover</literal> option during slot + creation and calling <function>pg_sync_replication_slots</function> + on the standby. For the synchronization + to work, it is mandatory to have a physical replication slot between the + primary and the standby, and + <link linkend="guc-hot-standby-feedback"><varname>hot_standby_feedback</varname></link> + must be enabled on the standby. It is also necessary to specify a valid + <literal>dbname</literal> in the + <link linkend="guc-primary-conninfo"><varname>primary_conninfo</varname></link>. + </para> 7a. Should you elaborate more about the "enabling the failover option during slot creation" part by mentioning the failover parameter of the create slot API, or the "failover" option of the CREATE SUBSCRIPTION? ~ 7b. I think it will be better to include a link to the pg_sync_replication_slots function. ~~~ 8. + <para> + To resume logical replication after failover from the synced logical + slots, the subscription's 'conninfo' must be altered to point to the + new primary server. This is done using + <link linkend="sql-altersubscription-params-connection"><command>ALTER SUBSCRIPTION ... CONNECTION</command></link>. + It is recommended that subscriptions are first disabled before promoting + the standby and are enabled back after altering the connection string. + </para> /and are enabled back/and are re-enabled/ ====== src/backend/replication/logical/slotsync.c 9. + * This file contains the code for slot synchronization on a physical standby + * to fetch logical failover slots information from the primary server, create + * the slots on the standby and synchronize them periodically. IIUC there is no "periodically" logic in this patch 0001 anymore because that is now in a later patch, so this part of the comment maybe needs adjustment. ~~~ 10. + * While creating the slot on physical standby, if the local restart_lsn and/or + * local catalog_xmin is ahead of those on the remote then we cannot create the + * local slot in sync with the primary server because that would mean moving + * the local slot backwards and the standby might not have WALs retained for + * old LSN. In this case, the slot will be marked as RS_TEMPORARY. Once the + * primary server catches up, the slot will be marked as RS_PERSISTENT (which + * means sync-ready) and we can perform the sync periodically. 10a. The wording "While creating the slot [...] then we cannot create the local slot" sounds strange. Maybe it can be reworded like SUGGESTION If the physical standby restart_lsn and/or local catalog_xmin is ahead of those on the remote then we cannot create the local standby slot in sync with the primary server because... ~ 10b. /and we can perform the sync periodically./after which we can call pg_sync_replication_slots() periodically to perform syncs./ ~~~ 11. + * The slots that were synchronized will be dropped if they are currently not + * needed to be synchronized. SUGGESTION Any standby synchronized slots will be dropped if they no longer need to be synchronized. See comment atop drop_obsolete_slots() for more details. ~~~ 12. +static bool +local_slot_update(RemoteSlot * remote_slot, Oid remote_dbid) Space after the pointer (*)? ~~~ 13. +/* + * Drop obsolete slots + * + * Drop the slots that no longer need to be synced i.e. these either do not + * exist on the primary or are no longer enabled for failover. 
+ * + * Additionally, it drops slots that are valid on the primary but got + * invalidated on the standby. This situation may occur due to the following + * reasons: + * - The max_slot_wal_keep_size on the standby is insufficient to retain WAL + * records from the restart_lsn of the slot. + * - primary_slot_name is temporarily reset to null and the physical slot is + * removed. + * - The primary changes wal_level to a level lower than logical. + * + * The assumption is that these dropped slots will get recreated in next + * sync-cycle and it is okay to drop and recreate such slots as long as these + * are not consumable on the standby (which is the case currently). + */ 13a. /Additionally, it drops slots/Additionally, drop any slots/ ~ 13b. /max_slot_wal_keep_size/'max_slot_wal_keep_size'/ /primary_slot_name/'primary_slot_name'/ /wal_level/'wal_level'/ ~ 13c. /The assumption is/The assumptions are/ ~~~ 14. +static bool +update_and_persist_slot(RemoteSlot * remote_slot, Oid remote_dbid) Space after the pointer (*)? ~~~ 15. +static bool +synchronize_one_slot(RemoteSlot * remote_slot, Oid remote_dbid) Space after the pointer (*)? ~~~ 16. + if (remote_slot->confirmed_lsn > latestFlushPtr) + { + /* + * Can get here only if GUC 'standby_slot_names' on the primary server + * was not configured correctly. + */ + ereport(ERROR, + errmsg("skipping slot synchronization as the received slot sync" + " LSN %X/%X for slot \"%s\" is ahead of the standby position %X/%X", + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn), + remote_slot->name, + LSN_FORMAT_ARGS(latestFlushPtr))); + + return false; + } Unreachable return false after ERROR? ~~~ 17. +/* + * Using the specified primary server connection, validates primary_slot_name. + */ The comment seems expressed in a backward way. SUGGESTION Validate the 'primary_slot_name' using the specified primary server connection. ~~~ 18. +static void +validate_primary_slot(WalReceiverConn *wrconn, int slot_invalid_elevel) I think here it is the "configuration" that is wrong, not the "slot". So I suggest removing that word slot from the parameter. /slot_invalid_elevel/invalid_elevel/ ~~~ 19. +/* + * Returns true if all necessary GUCs for slot synchronization are set + * appropriately, otherwise returns false. + */ +static bool +validate_slotsync_params(int elevel) 19a. /Returns true/Return true/ /returns false/return false/ ~ 19b. IMO for consistency better to use the same param name as the previous function /elevel/invalid_elevel/ ~~~ 20. +Datum +pg_sync_replication_slots(PG_FUNCTION_ARGS) +{ + WalReceiverConn *wrconn = NULL; + char *err; + StringInfoData app_name; The wrconn assignment at declaration seems unnecessary since it will be immediately overwritten on the first usage. ~~~ 21. + if (cluster_name[0]) + appendStringInfo(&app_name, "%s_%s", cluster_name, "slotsync"); + else + appendStringInfo(&app_name, "%s", "slotsync"); I wondered why this was coded using format string substitutions instead of like below: if (cluster_name[0]) appendStringInfo(&app_name, "%s_slotsync", cluster_name); else appendStringInfoString(&app_name, "slotsync"); OR if (cluster_name[0]) appendStringInfo(&app_name, "%s_", cluster_name); appendStringInfoString(&app_name, "slotsync"); ~~~ 22. + /* + * Establish the connection to the primary server for slots + * synchronization. + */ + wrconn = walrcv_connect(PrimaryConnInfo, false, false, false, + app_name.data, &err); Unnecessarily verbose? SUGGESTION Connect to the primary server. ~~~ 23. 
+ syncing_slots = true; + + PG_TRY(); + { + /* + * Using the specified primary server connection, validates the slot + * in primary_slot_name. + */ + validate_primary_slot(wrconn, ERROR); + + (void) synchronize_slots(wrconn); + } + PG_FINALLY(); + { + syncing_slots = false; + walrcv_disconnect(wrconn); + } + PG_END_TRY(); 23a. IMO the "syncing_slots = true;" can be deferred until immediately before call to synchronize_slots(); ~ 23b. I felt the comment seems backwards, so can be worded as suggested elsewhere in this post. SUGGESTION Validate the 'primary_slot_name' using the specified primary server connection. OTOH, if you can change the function name to validate_primary_slot_name() then no comment is needed because then it becomes self-explanatory. ====== src/backend/replication/slot.c 24. + /* + * Do not allow users to create the slots with failover enabled on the + * standby as we do not support sync to the cascading standby. + * + * Slots with failover enabled can still be created when doing slot + * synchronization, as it needs to maintain this value in sync with the + * remote slots. + */ + if (failover && RecoveryInProgress() && !IsSyncingReplicationSlots()) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable failover for a replication slot" + " created on the standby")); I felt it started to become confusing using "synchronization" and "sync" in the same sentence. SUGGESTION However, slots with failover enabled can be created during slot synchronization because we need to retain the same values as the remote slot. ====== .../t/040_standby_failover_slots_sync.pl 25. + +$standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();"); Since this is where we use the function added by this patch, it deserves to have a comment. SUGGESTION # Synchronize the primary server slots to the standby. ====== src/tools/pgindent/typedefs.list 26. It looks like 'RemoteSlot' should be included in the typedefs.list file. Probably this is the explanation for the space problems I reported earlier in this post. ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi, Previously ([1] #19 and #22) I had suggested that some conflict_reason code could be split off and pushed first as a prerequisite/ preparatory/ independent patch. At the time, my suggestion was premature because there was still a lot of code under development. But now the affected code is in the first patch 00001, and there are already other precedents of slot-sync preparatory patches getting pushed. So I thought to resurrect this splitting suggestion again, as perhaps is the right time to do it. Details are below: ====== IMO the new #defines for the conflict_reason stuff plus where they get used can be pushed as an independent patch. Specifically, this stuff: ~~~ From src/include/replication/slot.h: +/* + * The possible values for 'conflict_reason' returned in + * pg_get_replication_slots. + */ +#define SLOT_INVAL_WAL_REMOVED_TEXT "wal_removed" +#define SLOT_INVAL_HORIZON_TEXT "rows_removed" +#define SLOT_INVAL_WAL_LEVEL_TEXT "wal_level_insufficient" + ~~~ From src/backend/replication/logical/slotsync.c: Also, IMO this function should live in slot.c; Although slotsync.c might be the only caller, this is not really a slot-sync specific function. +/* + * Maps the pg_replication_slots.conflict_reason text value to + * ReplicationSlotInvalidationCause enum value + */ +static ReplicationSlotInvalidationCause +get_slot_invalidation_cause(char *conflict_reason) +{ + Assert(conflict_reason); + + if (strcmp(conflict_reason, SLOT_INVAL_WAL_REMOVED_TEXT) == 0) + return RS_INVAL_WAL_REMOVED; + else if (strcmp(conflict_reason, SLOT_INVAL_HORIZON_TEXT) == 0) + return RS_INVAL_HORIZON; + else if (strcmp(conflict_reason, SLOT_INVAL_WAL_LEVEL_TEXT) == 0) + return RS_INVAL_WAL_LEVEL; + else + Assert(0); + + /* Keep compiler quiet */ + return RS_INVAL_NONE; +} ~~~ From src/backend/replication/slotfuncs.c: case RS_INVAL_WAL_REMOVED: - values[i++] = CStringGetTextDatum("wal_removed"); + values[i++] = CStringGetTextDatum(SLOT_INVAL_WAL_REMOVED_TEXT); break; case RS_INVAL_HORIZON: - values[i++] = CStringGetTextDatum("rows_removed"); + values[i++] = CStringGetTextDatum(SLOT_INVAL_HORIZON_TEXT); break; case RS_INVAL_WAL_LEVEL: - values[i++] = CStringGetTextDatum("wal_level_insufficient"); + values[i++] = CStringGetTextDatum(SLOT_INVAL_WAL_LEVEL_TEXT); break; ~~~ Thoughts? ====== [1] https://www.postgresql.org/message-id/CAHut%2BPtJAAPghc4GPt0k%3DjeMz1qu4H7mnaDifOHsVsMqi-qOLA%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
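For reference, those three string constants are exactly what users see in the view, so centralizing them would keep slotfuncs.c, slotsync.c, and the documentation consistent. A sketch of how the values surface (slot names and output rows are illustrative only):

  SELECT slot_name, conflict_reason FROM pg_replication_slots;
  --  slot_name  |    conflict_reason
  -- ------------+------------------------
  --  lsub1_slot | wal_removed
  --  lsub2_slot | rows_removed
  --  lsub3_slot | wal_level_insufficient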
On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > --- > Since Two processes (e.g. the slotsync worker and > pg_sync_replication_slots()) concurrently fetch and update the slot > information, there is a race condition where slot's > confirmed_flush_lsn goes backward. > Right, this is possible, though there shouldn't be a problem because anyway, slotsync is an async process. Till we hold restart_lsn, the required WAL won't be removed. Having said that, I can think of two ways to avoid it: (a) We can have some flag in shared memory using which we can detect whether any other process is doing slot syncronization and then either error out at that time or simply wait or may take nowait kind of parameter from user to decide what to do? If this is feasible, we can simply error out for the first version and extend it later if we see any use cases for the same (b) similar to restart_lsn, if confirmed_flush_lsn is getting moved back, raise an error, this is good for now but in future we may still have another similar issue, so I would prefer (a) among these but I am fine if you prefer (b) or have some other ideas like just note down in comments that this is a harmless case and can happen only very rarely. > > --- > + It is recommended that subscriptions are first disabled before promoting > f+ the standby and are enabled back after altering the connection string. > > I think it's better to describe the reason why it's recommended to > disable subscriptions before the standby promotion. > Agreed. The reason I see for this is that if we don't disable the subscription before promotion and changing the connection string there is a chance that the old primary comes back and the subscriber can have some additional data, though the chances of same are less. > --- > +/* Slot sync worker objects */ > +extern PGDLLIMPORT char *PrimaryConnInfo; > +extern PGDLLIMPORT char *PrimarySlotName; > > These two variables are declared also in xlogrecovery.h. Is it > intentional? If so, I think it's better to write comments. > > --- > Global functions and variables used by the slotsync worker are > declared in logicalworker.h and worker_internal.h. But is it really > okay to make a dependency between the slotsync worker and logical > replication workers? IIUC the slotsync worker is conceptually a > separate feature from the logical replication. I think the slotsync > worker can have its own header file. > +1. > > --- > + Confirm that the standby server is not lagging behind the subscribers. > + This step can be skipped if > + <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link> > + has been correctly configured. > > How can the user confirm if standby_slot_names is correctly configured? > I think users can refer to LOGs to see if it has changed since the first time it was configured. I tried by existing parameter and see the following in LOG: LOG: received SIGHUP, reloading configuration files 2024-02-06 11:38:59.069 IST [9240] LOG: parameter "autovacuum" changed to "on" If the user can't confirm then it is better to follow the steps mentioned in the patch. Do you want something else to be written in docs for this? If so, what? -- With Regards, Amit Kapila.
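Besides the server log, the currently loaded value can be inspected directly on the primary once the standby_slot_names GUC from this patch set is in place (a sketch; this confirms what the running server is using, though not that the listed slots are the intended ones):

  -- value the running server is using
  SHOW standby_slot_names;
  -- value currently present in the configuration file (useful after editing, before or after a reload)
  SELECT name, setting, applied FROM pg_file_settings WHERE name = 'standby_slot_names';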
Hi, I took another high-level look at all the funtion names of the slotsync.c file. ====== src/backend/replication/logical/slotsync.c +static bool +local_slot_update(RemoteSlot * remote_slot, Oid remote_dbid) +static List * +get_local_synced_slots(void) +static bool +check_sync_slot_on_remote(ReplicationSlot *local_slot, List *remote_slots, +static void +drop_obsolete_slots(List *remote_slot_list) +static void +reserve_wal_for_slot(XLogRecPtr restart_lsn) +static bool +update_and_persist_slot(RemoteSlot * remote_slot, Oid remote_dbid) +static bool +synchronize_one_slot(RemoteSlot * remote_slot, Oid remote_dbid) +get_slot_invalidation_cause(char *conflict_reason) +static bool +synchronize_slots(WalReceiverConn *wrconn) +static void +validate_primary_slot(WalReceiverConn *wrconn, int slot_invalid_elevel) +static bool +validate_slotsync_params(int elevel) +bool +IsSyncingReplicationSlots(void) +Datum +pg_sync_replication_slots(PG_FUNCTION_ARGS) ~~~ There seems some muddling of names here: - "local" versus ? and "remote" versus "primary"; or sometimes the function does not give an indication. - "sync_slot" versus "synced_slot" versus nothing - "check" versus "validate" - etc. Below are some suggestions (some are unchanged); probably there are better ideas for names but my point is that the current names could be improved: CURRENT SUGGESTION get_local_synced_slots get_local_synced_slots check_sync_slot_on_remote check_local_synced_slot_exists_on_remote drop_obsolete_slots drop_local_synced_slots reserve_wal_for_slot reserve_wal_for_local_slot local_slot_update update_local_synced_slot update_and_persist_slot update_and_persist_local_synced_slot get_slot_invalidation_cause get_slot_conflict_reason synchronize_slots synchronize_remote_slots_to_local synchronize_one_slot synchronize_remote_slot_to_local validate_primary_slot check_remote_synced_slot_exists validate_slotsync_params check_local_config IsSyncingReplicationSlots IsSyncingReplicationSlots pg_sync_replication_slots pg_sync_replication_slots ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi, On Tue, Feb 06, 2024 at 03:19:11AM +0000, Zhijie Hou (Fujitsu) wrote: > On Friday, February 2, 2024 2:03 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On Thu, Feb 01, 2024 at 05:29:15PM +0530, shveta malik wrote: > > > Attached v75 patch-set. Changes are: > > > > > > 1) Re-arranged the patches: > > > 1.1) 'libpqrc' related changes (from v74-001 and v74-004) are > > > separated out in v75-001 as those are independent changes. > > > 1.2) 'Add logical slot sync capability', 'Slot sync worker as special > > > process' and 'App-name changes' are now merged to single patch which > > > makes v75-002. > > > 1.3) 'Wait for physical Standby confirmation' and 'Failover Validation > > > Document' patches are maintained as is (v75-003 and v75-004 now). > > > > Thanks! > > > > I only looked at the commit message for v75-0002 and see that it has changed > > since the comment done in [1], but it still does not look correct to me. > > > > " > > If a logical slot on the primary is valid but is invalidated on the standby, then > > that slot is dropped and recreated on the standby in next sync-cycle provided > > the slot still exists on the primary server. It is okay to recreate such slots as long > > as these are not consumable on the standby (which is the case currently). This > > situation may occur due to the following reasons: > > - The max_slot_wal_keep_size on the standby is insufficient to retain WAL > > records from the restart_lsn of the slot. > > - primary_slot_name is temporarily reset to null and the physical slot is > > removed. > > - The primary changes wal_level to a level lower than logical. > > " > > > > If a logical decoding slot "still exists on the primary server" then the primary > > can not change the wal_level to lower than logical, one would get something > > like: > > > > "FATAL: logical replication slot "logical_slot" exists, but wal_level < logical" > > > > and then slots won't get invalidated on the standby. I've the feeling that the > > wal_level conflict part may need to be explained separately? (I think it's not > > possible that they end up being re-created on the standby for this conflict, > > they will be simply removed as it would mean the counterpart one on the > > primary does not exist anymore). > > This is possible in some extreme cases, because the slot is synced > asynchronously. > > For example: If on the primary the wal_level is changed to 'replica' It means that all the logical slots have been dropped on the primary (if not, it's not possible to change it to a level < logical). > and then > changed back to 'logical', the standby would receive two XLOG_PARAMETER_CHANGE > wals. And before the standby replay these wals, user can create a failover slot And now it is re-created. So the slot has been dropped and recreated on the primary, to it's kind of expected it is also dropped and re-created on the standby (should it be invalidated or not). > Although I think it doesn't seem a real world case, so I am not sure is it worth > separate explanation. Yeah, I don't think your example is worth a separate explanation also because it's expected to see the slot being dropped / re-created anyway (see above). That said, I still think the commit message needs some re-wording, what about? ===== If a logical slot on the primary is valid but is invalidated on the standby, then that slot is dropped and can be recreated on the standby in next pg_sync_replication_slots() call provided the slot still exists on the primary server. 
It is okay to recreate such slots as long as these are not consumable on the standby (which is the case currently). This situation may occur due to the following reasons: - The max_slot_wal_keep_size on the standby is insufficient to retain WAL records from the restart_lsn of the slot. - primary_slot_name is temporarily reset to null and the physical slot is removed. Changing the primary wal_level to a level lower than logical is only possible if the logical slots are removed on the primary, so it's expected to see the slots being removed on the standby too (and re-created if they are re-created on the primary). ===== Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > --- > > Since Two processes (e.g. the slotsync worker and > > pg_sync_replication_slots()) concurrently fetch and update the slot > > information, there is a race condition where slot's > > confirmed_flush_lsn goes backward. > > > > Right, this is possible, though there shouldn't be a problem because > anyway, slotsync is an async process. Till we hold restart_lsn, the > required WAL won't be removed. Having said that, I can think of two > ways to avoid it: (a) We can have some flag in shared memory using > which we can detect whether any other process is doing slot > syncronization and then either error out at that time or simply wait > or may take nowait kind of parameter from user to decide what to do? > If this is feasible, we can simply error out for the first version and > extend it later if we see any use cases for the same (b) similar to > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an > error, this is good for now but in future we may still have another > similar issue, so I would prefer (a) among these but I am fine if you > prefer (b) or have some other ideas like just note down in comments > that this is a harmless case and can happen only very rarely. Thank you for sharing the ideas. I would prefer (a). For (b), the same issue still happens for other fields. > > > > > --- > > + It is recommended that subscriptions are first disabled before promoting > > f+ the standby and are enabled back after altering the connection string. > > > > I think it's better to describe the reason why it's recommended to > > disable subscriptions before the standby promotion. > > > > Agreed. The reason I see for this is that if we don't disable the > subscription before promotion and changing the connection string there > is a chance that the old primary comes back and the subscriber can > have some additional data, though the chances of same are less. > > > --- > > +/* Slot sync worker objects */ > > +extern PGDLLIMPORT char *PrimaryConnInfo; > > +extern PGDLLIMPORT char *PrimarySlotName; > > > > These two variables are declared also in xlogrecovery.h. Is it > > intentional? If so, I think it's better to write comments. > > > > --- > > Global functions and variables used by the slotsync worker are > > declared in logicalworker.h and worker_internal.h. But is it really > > okay to make a dependency between the slotsync worker and logical > > replication workers? IIUC the slotsync worker is conceptually a > > separate feature from the logical replication. I think the slotsync > > worker can have its own header file. > > > > +1. > > > > > --- > > + Confirm that the standby server is not lagging behind the subscribers. > > + This step can be skipped if > > + <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link> > > + has been correctly configured. > > > > How can the user confirm if standby_slot_names is correctly configured? > > > > I think users can refer to LOGs to see if it has changed since the > first time it was configured. I tried by existing parameter and see > the following in LOG: > LOG: received SIGHUP, reloading configuration files > 2024-02-06 11:38:59.069 IST [9240] LOG: parameter "autovacuum" changed to "on" > > If the user can't confirm then it is better to follow the steps > mentioned in the patch. Do you want something else to be written in > docs for this? If so, what? 
IIUC even if a wrong slot name is specified in standby_slot_names, or standby_slot_names is empty, the standby server might not be lagging behind the subscribers at the moment of checking, depending on the timing. But when checking again later, the standby server might lag behind the subscribers. So what I wanted to know is how the user can confirm that a failover-enabled subscription is guaranteed not to get ahead of the failover-candidate standbys (i.e., standbys using the slots listed in standby_slot_names). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > --- > > > Since Two processes (e.g. the slotsync worker and > > > pg_sync_replication_slots()) concurrently fetch and update the slot > > > information, there is a race condition where slot's > > > confirmed_flush_lsn goes backward. > > > > > > > Right, this is possible, though there shouldn't be a problem because > > anyway, slotsync is an async process. Till we hold restart_lsn, the > > required WAL won't be removed. Having said that, I can think of two > > ways to avoid it: (a) We can have some flag in shared memory using > > which we can detect whether any other process is doing slot > > syncronization and then either error out at that time or simply wait > > or may take nowait kind of parameter from user to decide what to do? > > If this is feasible, we can simply error out for the first version and > > extend it later if we see any use cases for the same (b) similar to > > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an > > error, this is good for now but in future we may still have another > > similar issue, so I would prefer (a) among these but I am fine if you > > prefer (b) or have some other ideas like just note down in comments > > that this is a harmless case and can happen only very rarely. > > Thank you for sharing the ideas. I would prefer (a). For (b), the same > issue still happens for other fields. I agree that (a) looks better. On a separate note, while looking at this API pg_sync_replication_slots(PG_FUNCTION_ARGS) shouldn't there be an optional parameter to give one slot or multiple slots or all slots as default, that will give better control to the user no? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
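If such an option were added, the call might look something like this (a purely hypothetical signature and slot names, sketched only to illustrate the idea; the function in the current patch takes no arguments):

  -- current behaviour: synchronize all failover slots from the primary
  SELECT pg_sync_replication_slots();
  -- hypothetical extension: synchronize only the named slots
  SELECT pg_sync_replication_slots(ARRAY['lsub1_slot', 'lsub2_slot']);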
On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I think users can refer to LOGs to see if it has changed since the > > first time it was configured. I tried by existing parameter and see > > the following in LOG: > > LOG: received SIGHUP, reloading configuration files > > 2024-02-06 11:38:59.069 IST [9240] LOG: parameter "autovacuum" changed to "on" > > > > If the user can't confirm then it is better to follow the steps > > mentioned in the patch. Do you want something else to be written in > > docs for this? If so, what? > > IIUC even if a wrong slot name is specified to standby_slot_names or > even standby_slot_names is empty, the standby server might not be > lagging behind the subscribers depending on the timing. But when > checking it the next time, the standby server might lag behind the > subscribers. So what I wanted to know is how the user can confirm if a > failover-enabled subscription is ensured not to go in front of > failover-candidate standbys (i.e., standbys using the slots listed in > standby_slot_names). > But isn't the same explained by two steps ((a) Firstly, on the subscriber node check the last replayed WAL. (b) Next, on the standby server check that the last-received WAL location is ahead of the replayed WAL location on the subscriber identified above.) in the latest *_0004 patch. -- With Regards, Amit Kapila.
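A sketch of those two steps as queries (the subscription OID 16391 is illustrative; the 0004 patch's documented queries derive the 'pg_<subid>' origin names from pg_subscription and pg_subscription_rel):

  -- (a) on the subscriber: the last publisher LSN applied for the subscription
  SELECT pg_replication_origin_progress('pg_16391', false);
  -- (b) on the standby: the last WAL location received from the primary;
  --     the standby is not lagging for that subscription if this is >= the value from (a)
  SELECT pg_last_wal_receive_lsn();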
On Tue, Feb 6, 2024 at 3:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > --- > > > > Since Two processes (e.g. the slotsync worker and > > > > pg_sync_replication_slots()) concurrently fetch and update the slot > > > > information, there is a race condition where slot's > > > > confirmed_flush_lsn goes backward. > > > > > > > > > > Right, this is possible, though there shouldn't be a problem because > > > anyway, slotsync is an async process. Till we hold restart_lsn, the > > > required WAL won't be removed. Having said that, I can think of two > > > ways to avoid it: (a) We can have some flag in shared memory using > > > which we can detect whether any other process is doing slot > > > syncronization and then either error out at that time or simply wait > > > or may take nowait kind of parameter from user to decide what to do? > > > If this is feasible, we can simply error out for the first version and > > > extend it later if we see any use cases for the same (b) similar to > > > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an > > > error, this is good for now but in future we may still have another > > > similar issue, so I would prefer (a) among these but I am fine if you > > > prefer (b) or have some other ideas like just note down in comments > > > that this is a harmless case and can happen only very rarely. > > > > Thank you for sharing the ideas. I would prefer (a). For (b), the same > > issue still happens for other fields. > > I agree that (a) looks better. On a separate note, while looking at > this API pg_sync_replication_slots(PG_FUNCTION_ARGS) shouldn't there > be an optional parameter to give one slot or multiple slots or all > slots as default, that will give better control to the user no? > As of now, we want to give functionality similar to slotsync worker with a difference that users can use this new function for planned switchovers. So, syncing all failover slots by default. I think if there is a use case to selectively sync some of the failover slots then we can probably extend this function and slotsync worker as well. Normally, if the primary goes down due to whatever reason users would want to restart the replication for all the defined publications via existing failover slots. Why would anyone want to do it partially? -- With Regards, Amit Kapila.
On Tue, Feb 6, 2024 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 6, 2024 at 3:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > --- > > > > > Since Two processes (e.g. the slotsync worker and > > > > > pg_sync_replication_slots()) concurrently fetch and update the slot > > > > > information, there is a race condition where slot's > > > > > confirmed_flush_lsn goes backward. > > > > > > > > > > > > > Right, this is possible, though there shouldn't be a problem because > > > > anyway, slotsync is an async process. Till we hold restart_lsn, the > > > > required WAL won't be removed. Having said that, I can think of two > > > > ways to avoid it: (a) We can have some flag in shared memory using > > > > which we can detect whether any other process is doing slot > > > > syncronization and then either error out at that time or simply wait > > > > or may take nowait kind of parameter from user to decide what to do? > > > > If this is feasible, we can simply error out for the first version and > > > > extend it later if we see any use cases for the same (b) similar to > > > > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an > > > > error, this is good for now but in future we may still have another > > > > similar issue, so I would prefer (a) among these but I am fine if you > > > > prefer (b) or have some other ideas like just note down in comments > > > > that this is a harmless case and can happen only very rarely. > > > > > > Thank you for sharing the ideas. I would prefer (a). For (b), the same > > > issue still happens for other fields. > > > > I agree that (a) looks better. On a separate note, while looking at > > this API pg_sync_replication_slots(PG_FUNCTION_ARGS) shouldn't there > > be an optional parameter to give one slot or multiple slots or all > > slots as default, that will give better control to the user no? > > > > As of now, we want to give functionality similar to slotsync worker > with a difference that users can use this new function for planned > switchovers. So, syncing all failover slots by default. I think if > there is a use case to selectively sync some of the failover slots > then we can probably extend this function and slotsync worker as well. > Normally, if the primary goes down due to whatever reason users would > want to restart the replication for all the defined publications via > existing failover slots. Why would anyone want to do it partially? If we consider the usability of such a function (I mean as it is implemented now, without any argument) one use case could be that if the slot sync worker is not keeping up or at some point in time the user doesn't want to wait for the worker to do this instead user can do it by himself. So now if we have such a functionality then it would be even better to extend it to selectively sync the slot. For example, if there is some issue in syncing all slots, maybe some bug or taking a long time to sync because there are a lot of slots but if the user needs to quickly failover and he/she is interested in only a couple of slots then such a option could be helpful. no? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 6, 2024 at 9:35 AM Peter Smith <smithpb2250@gmail.com> wrote: > > ====== > GENERAL > > 1. > Should the "Chapter 30 Logical Replication" at least have another > section that mentions the feature of slot synchronization so the > information about it is easier to find? It doesn't need to say much -- > just give a reference to the other sections where it is explained > already. > We can think of something like Failover/Switchover but that we can do at the end once we get the worker patch and other work not with the first patch. > > 6. > + <row> > + <entry id="pg-sync-replication-slots" > role="func_table_entry"><para role="func_signature"> > + <indexterm> > + <primary>pg_sync_replication_slots</primary> > > Currently, this is in section "9.27.6 Replication Management > Functions", but I wondered if it should also have some mention in the > "9.27.4. Recovery Control Functions" section. > I feel this is more suited to "Replication Management Functions" because the other section talks about functions used during recovery whereas we won't do anything for slotsync during recovery. -- With Regards, Amit Kapila.
On Tue, Feb 6, 2024 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > I think users can refer to LOGs to see if it has changed since the > > > first time it was configured. I tried by existing parameter and see > > > the following in LOG: > > > LOG: received SIGHUP, reloading configuration files > > > 2024-02-06 11:38:59.069 IST [9240] LOG: parameter "autovacuum" changed to "on" > > > > > > If the user can't confirm then it is better to follow the steps > > > mentioned in the patch. Do you want something else to be written in > > > docs for this? If so, what? > > > > IIUC even if a wrong slot name is specified to standby_slot_names or > > even standby_slot_names is empty, the standby server might not be > > lagging behind the subscribers depending on the timing. But when > > checking it the next time, the standby server might lag behind the > > subscribers. So what I wanted to know is how the user can confirm if a > > failover-enabled subscription is ensured not to go in front of > > failover-candidate standbys (i.e., standbys using the slots listed in > > standby_slot_names). > > > > But isn't the same explained by two steps ((a) Firstly, on the > subscriber node check the last replayed WAL. (b) Next, on the standby > server check that the last-received WAL location is ahead of the > replayed WAL location on the subscriber identified above.) in the > latest *_0004 patch. > Additionally, I would like to add that the users can use the queries mentioned in the doc after the primary has failed and before promoting the standby. If she wants to do that when both primary and standby are available, the value of 'standby_slot_names' on primary should be referred. Isn't those two sufficient that there won't be false positives? -- With Regards, Amit Kapila.
On Tue, Feb 6, 2024 at 3:57 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Feb 6, 2024 at 3:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Feb 6, 2024 at 3:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > --- > > > > > > Since Two processes (e.g. the slotsync worker and > > > > > > pg_sync_replication_slots()) concurrently fetch and update the slot > > > > > > information, there is a race condition where slot's > > > > > > confirmed_flush_lsn goes backward. > > > > > > > > > > > > > > > > Right, this is possible, though there shouldn't be a problem because > > > > > anyway, slotsync is an async process. Till we hold restart_lsn, the > > > > > required WAL won't be removed. Having said that, I can think of two > > > > > ways to avoid it: (a) We can have some flag in shared memory using > > > > > which we can detect whether any other process is doing slot > > > > > syncronization and then either error out at that time or simply wait > > > > > or may take nowait kind of parameter from user to decide what to do? > > > > > If this is feasible, we can simply error out for the first version and > > > > > extend it later if we see any use cases for the same (b) similar to > > > > > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an > > > > > error, this is good for now but in future we may still have another > > > > > similar issue, so I would prefer (a) among these but I am fine if you > > > > > prefer (b) or have some other ideas like just note down in comments > > > > > that this is a harmless case and can happen only very rarely. > > > > > > > > Thank you for sharing the ideas. I would prefer (a). For (b), the same > > > > issue still happens for other fields. > > > > > > I agree that (a) looks better. On a separate note, while looking at > > > this API pg_sync_replication_slots(PG_FUNCTION_ARGS) shouldn't there > > > be an optional parameter to give one slot or multiple slots or all > > > slots as default, that will give better control to the user no? > > > > > > > As of now, we want to give functionality similar to slotsync worker > > with a difference that users can use this new function for planned > > switchovers. So, syncing all failover slots by default. I think if > > there is a use case to selectively sync some of the failover slots > > then we can probably extend this function and slotsync worker as well. > > Normally, if the primary goes down due to whatever reason users would > > want to restart the replication for all the defined publications via > > existing failover slots. Why would anyone want to do it partially? > > If we consider the usability of such a function (I mean as it is > implemented now, without any argument) one use case could be that if > the slot sync worker is not keeping up or at some point in time the > user doesn't want to wait for the worker to do this instead user can > do it by himself. > Possibly, but I was imagining that it would be used for planned switchover cases and also for testing the core sync slot functionality in our TAP tests. > So now if we have such a functionality then it would be even better to > extend it to selectively sync the slot. 
For example, if there is some > issue in syncing all slots, maybe some bug or taking a long time to > sync because there are a lot of slots but if the user needs to quickly > failover and he/she is interested in only a couple of slots then such > a option could be helpful. no? > I see your point but not sure how useful it is in the field. I am fine if others also think such a parameter will be useful and anyway I think we can even extend it after v1 is done. -- With Regards, Amit Kapila.
On Tue, Feb 6, 2024 at 12:08 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Hi, I took another high-level look at all the funtion names of the > slotsync.c file. > > > Below are some suggestions (some are unchanged); probably there are > better ideas for names but my point is that the current names could be > improved: > > CURRENT SUGGESTION ... > check_sync_slot_on_remote check_local_synced_slot_exists_on_remote > I think none of this seems to state the purpose of the function. I suggest changing it to local_sync_slot_required() and returning false either if the local_slot doesn't exist in remote_slot_list or is invalidated. -- With Regards, Amit Kapila.
On Tuesday, February 6, 2024 3:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Feb 5, 2024 at 7:56 PM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > > > > > --- > > > Since Two processes (e.g. the slotsync worker and > > > pg_sync_replication_slots()) concurrently fetch and update the slot > > > information, there is a race condition where slot's > > > confirmed_flush_lsn goes backward. > > > > > > > Right, this is possible, though there shouldn't be a problem because > > anyway, slotsync is an async process. Till we hold restart_lsn, the > > required WAL won't be removed. Having said that, I can think of two > > ways to avoid it: (a) We can have some flag in shared memory using > > which we can detect whether any other process is doing slot > > syncronization and then either error out at that time or simply wait > > or may take nowait kind of parameter from user to decide what to do? > > If this is feasible, we can simply error out for the first version and > > extend it later if we see any use cases for the same (b) similar to > > restart_lsn, if confirmed_flush_lsn is getting moved back, raise an > > error, this is good for now but in future we may still have another > > similar issue, so I would prefer (a) among these but I am fine if you > > prefer (b) or have some other ideas like just note down in comments > > that this is a harmless case and can happen only very rarely. > > Thank you for sharing the ideas. I would prefer (a). For (b), the same issue still > happens for other fields. Attached is the V79 patch, which includes the following changes. (Note that only 0001 is sent in this version; we will send the later patches after rebasing.) 1. Addressed all the comments from Amit[1], all the comments from Peter[2] and some of the comments from Sawada-san[3]. 2. Used a flag in shared memory to restrict concurrent slot sync. 3. Added more TAP tests for the pg_sync_replication_slots function. [1] https://www.postgresql.org/message-id/CAA4eK1KGHT9S-Bst_G1CUNQvRep%3DipMs5aTBNRQFVi6TogbJ9w%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAHut%2BPtyoRf3adoLoTrbL6momzkhXAFKz656Vv9YRu4cp%3D6Yig%40mail.gmail.com [3] https://www.postgresql.org/message-id/CAD21AoCEkcTaPb%2BGdOhSQE49_mKJG6D64quHcioJGx6RCqMv%2BQ%40mail.gmail.com Best Regards, Hou zj
Attachment
On Monday, February 5, 2024 10:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Mon, Feb 5, 2024 at 8:26 PM shveta malik <shveta.malik@gmail.com> > wrote: > > > > On Mon, Feb 5, 2024 at 10:57 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > Just noticed that doc/src/sgml/config.sgml still refers to enable_synclot > instead of sync_replication_slots: > > > > > > The standbys corresponding to the physical replication slots in > > > <varname>standby_slot_names</varname> must configure > > > <literal>enable_syncslot = true</literal> so they can receive > > > failover logical slots changes from the primary. > > > > Thanks Ajin for pointing this out. Here are v78 patches, corrected there. > > > > Other changes are: > > > > 1) Rebased the patches as the v77-001 is now pushed. > > 2) Enabled executing pg_sync_replication_slots() on cascading-standby. > > 3) Rearranged the code around parameter validity checks. Changed > > function names and changed the way how dbname is extracted as > > suggested by Amit offlist. > > 4) Rearranged the code around check_primary_info(). Removed output > args. > > 5) Few other trivial changes. > > > > Thank you for updating the patch! Here are some comments: > > --- > Since Two processes (e.g. the slotsync worker and > pg_sync_replication_slots()) concurrently fetch and update the slot information, > there is a race condition where slot's confirmed_flush_lsn goes backward. . We > have the following check but it doesn't prevent the slot's confirmed_flush_lsn > from moving backward if the restart_lsn does't change: > > /* > * Sanity check: As long as the invalidations are handled > * appropriately as above, this should never happen. > */ > if (remote_slot->restart_lsn < slot->data.restart_lsn) > elog(ERROR, > "cannot synchronize local slot \"%s\" LSN(%X/%X)" > " to remote slot's LSN(%X/%X) as synchronization" > " would move it backwards", remote_slot->name, > LSN_FORMAT_ARGS(slot->data.restart_lsn), > LSN_FORMAT_ARGS(remote_slot->restart_lsn)); > As discussed, I added a flag in shared memory to control the concurrent slot sync. > --- > + It is recommended that subscriptions are first disabled before > + promoting > f+ the standby and are enabled back after altering the connection string. > > I think it's better to describe the reason why it's recommended to disable > subscriptions before the standby promotion. Added. > > --- > +/* Slot sync worker objects */ > +extern PGDLLIMPORT char *PrimaryConnInfo; extern PGDLLIMPORT char > +*PrimarySlotName; > > These two variables are declared also in xlogrecovery.h. Is it intentional? If so, I > think it's better to write comments. Will address. > > --- > Global functions and variables used by the slotsync worker are declared in > logicalworker.h and worker_internal.h. But is it really okay to make a > dependency between the slotsync worker and logical replication workers? IIUC > the slotsync worker is conceptually a separate feature from the logical > replication. I think the slotsync worker can have its own header file. Added. > > --- > + SELECT r.srsubid AS subid, CONCAT('pg_' || srsubid || > '_sync_' || srrelid || '_' || ctl.system_identifier) AS slotname > > and > > + SELECT (CASE WHEN r.srsubstate = 'f' THEN > pg_replication_origin_progress(CONCAT('pg_' || r.srsubid || '_' || r.srrelid), false) > > If we use CONCAT function, we can replace '||' with ','. > Will address.
> --- > + Confirm that the standby server is not lagging behind the subscribers. > + This step can be skipped if > + <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link> > + has been correctly configured. > > How can the user confirm if standby_slot_names is correctly configured? Will address after concluding. Thanks Shveta for helping address the comments. Best Regards, Hou zj
On Tue, Feb 6, 2024 at 8:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 6, 2024 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > I think users can refer to LOGs to see if it has changed since the > > > > first time it was configured. I tried by existing parameter and see > > > > the following in LOG: > > > > LOG: received SIGHUP, reloading configuration files > > > > 2024-02-06 11:38:59.069 IST [9240] LOG: parameter "autovacuum" changed to "on" > > > > > > > > If the user can't confirm then it is better to follow the steps > > > > mentioned in the patch. Do you want something else to be written in > > > > docs for this? If so, what? > > > > > > IIUC even if a wrong slot name is specified to standby_slot_names or > > > even standby_slot_names is empty, the standby server might not be > > > lagging behind the subscribers depending on the timing. But when > > > checking it the next time, the standby server might lag behind the > > > subscribers. So what I wanted to know is how the user can confirm if a > > > failover-enabled subscription is ensured not to go in front of > > > failover-candidate standbys (i.e., standbys using the slots listed in > > > standby_slot_names). > > > > > > > But isn't the same explained by two steps ((a) Firstly, on the > > subscriber node check the last replayed WAL. (b) Next, on the standby > > server check that the last-received WAL location is ahead of the > > replayed WAL location on the subscriber identified above.) in the > > latest *_0004 patch. > > > > Additionally, I would like to add that the users can use the queries > mentioned in the doc after the primary has failed and before promoting > the standby. If she wants to do that when both primary and standby are > available, the value of 'standby_slot_names' on primary should be > referred. Isn't those two sufficient that there won't be false > positives? From a user perspective, I'd like to confirm the following two points : 1. replication slots used by subscribers are synchronized to the standby. 2. it's guaranteed that logical replication doesn't go ahead of physical replication to the standby. These checks are necessary at least when building a replication setup (primary, standby, and subscriber). Otherwise, it's too late if we find out that no standby is failover-ready when the primary fails and we're about to do a failover. As for the point 1 above, we can use the step 1 described in the doc. As for point 2, the step 2 described in the doc could return true even if standby_slot_names isn't working. For example, standby_slot_names is empty, the user changed the standby_slot_names but forgot to reload the config file, and the walsender doesn't reflect the standby_slot_names update yet for some reason etc. It's possible that standby's last-received WAL location just happens to be ahead of the replayed WAL location on the subscriber. So even if the check query returns true once, it could return false when we check it again, if standby_slot_names is not working. On the other hand, IIUC if the point 2 is ensured, the check query always returns true. I think it would be good if we could provide a reliable way to check point 2 ideally via SQL queries (especially for tools). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
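For point 1, a minimal check on the standby might look like this (a sketch; synced is the flag added by the 0001 patch, and a synced slot only becomes usable for failover once it is persisted, i.e. no longer temporary):

  SELECT slot_name, synced, temporary FROM pg_replication_slots WHERE failover;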
Here are some review comments for v79-0001 ====== Commit message 1. The logical replication slots on the primary can be synchronized to the hot standby by enabling the "failover" parameter during pg_create_logical_replication_slot() or by enabling "failover" option of the CREATE SUBSCRIPTION command and calling pg_sync_replication_slots() function on the standby. ~ SUGGESTION The logical replication slots on the primary can be synchronized to the hot standby by enabling failover during slot creation (e.g. using the "failover" parameter of pg_create_logical_replication_slot(), or using the "failover" option of the CREATE SUBSCRIPTION command), and then calling pg_sync_replication_slots() function on the standby. ====== 2. + <caution> + <para> + If after executing the function, hot_standby_feedback is disabled on + the standby or the physical slot configured in primary_slot_name is + removed, then it is possible that the necessary rows of the + synchronized slot will be removed by the VACUUM process on the primary + server, resulting in the synchronized slot becoming invalidated. + </para> + </caution> 2a. /If after/If, after/ ~ 2b. Use SGML <variable> for the GUC names (hot_standby_feedback, and primary_slot_name), and consider putting links for them as well. ====== src/sgml/logicaldecoding.sgml 3. + <sect2 id="logicaldecoding-replication-slots-synchronization"> + <title>Replication Slot Synchronization</title> + <para> + A logical replication slot on the primary can be synchronized to the hot + standby by enabling the <literal>failover</literal> option for the slot + and calling <function>pg_sync_replication_slots</function> + on the standby. The <literal>failover</literal> option of the slot + can be enabled either by enabling + <link linkend="sql-createsubscription-params-with-failover"> + <literal>failover</literal></link> + option during subscription creation or by providing <literal>failover</literal> + parameter during + <link linkend="pg-create-logical-replication-slot"> + <function>pg_create_logical_replication_slot</function></link>. IMO it will be better to slightly reword this (like was suggested for the Commit Message). I felt it is also better to refer/link to "CREATE SUBSCRIPTION" instead of saying "during subscription creation". SUGGESTION The logical replication slots on the primary can be synchronized to the hot standby by enabling failover during slot creation (e.g. using the "failover" parameter of pg_create_logical_replication_slot, or using the "failover" option of the CREATE SUBSCRIPTION command), and then calling pg_sync_replication_slots() function on the standby. ~~~ 4. + There are chances that the old primary is up again during the promotion + and if subscriptions are not disabled, the logical subscribers may keep + on receiving the data from the old primary server even after promotion + until the connection string is altered. This may result in the data + inconsistency issues and thus the logical subscribers may not be able + to continue the replication from the new primary server. + </para> 4a. /There are chances/There is a chance/ /may keep on receiving the data/may continue to receive data/ ~ 4b. BEFORE This may result in the data inconsistency issues and thus the logical subscribers may not be able to continue the replication from the new primary server. SUGGESTION This might result in data inconsistency issues, preventing the logical subscribers from being able to continue replication from the new primary server. ~ 4c. 
I felt this whole part "There is a chance..." should be rendered as a <note> or a <caution> or something. ====== src/backend/replication/logical/slotsync.c 5. +/* + * Return true if all necessary GUCs for slot synchronization are set + * appropriately, otherwise return false. + */ +bool +ValidateSlotSyncParams(void) +{ + char *dbname; + + /* + * A physical replication slot(primary_slot_name) is required on the + * primary to ensure that the rows needed by the standby are not removed + * after restarting, so that the synchronized slot on the standby will not + * be invalidated. + */ + if (PrimarySlotName == NULL || *PrimarySlotName == '\0') + { + ereport(ERROR, + /* translator: %s is a GUC variable name */ + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + errhint("\"%s\" must be defined.", "primary_slot_name")); + return false; + } + + /* + * hot_standby_feedback must be enabled to cooperate with the physical + * replication slot, which allows informing the primary about the xmin and + * catalog_xmin values on the standby. + */ + if (!hot_standby_feedback) + { + ereport(ERROR, + /* translator: %s is a GUC variable name */ + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + errhint("\"%s\" must be enabled.", "hot_standby_feedback")); + return false; + } + + /* + * Logical decoding requires wal_level >= logical and we currently only + * synchronize logical slots. + */ + if (wal_level < WAL_LEVEL_LOGICAL) + { + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + errhint("\"wal_level\" must be >= logical.")); + return false; + } + + /* + * The primary_conninfo is required to make connection to primary for + * getting slots information. + */ + if (PrimaryConnInfo == NULL || *PrimaryConnInfo == '\0') + { + ereport(ERROR, + /* translator: %s is a GUC variable name */ + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + errhint("\"%s\" must be defined.", "primary_conninfo")); + return false; + } + + /* + * The slot synchronization needs a database connection for walrcv_exec to + * work. + */ + dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo); + if (dbname == NULL) + { + ereport(ERROR, + + /* + * translator: 'dbname' is a specific option; %s is a GUC variable + * name + */ + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + errhint("'dbname' must be specified in \"%s\".", "primary_conninfo")); + return false; + } + + return true; +} The code of this function has been flip-flopping between versions. Now, it is always giving an ERROR when something is wrong, so all of the "return false" are unreachable. It also means the function comment is wrong, and the boolean return is unused/unnecessary. ~~~ 6. SlotSyncShmemInit +/* + * Allocate and initialize slot sync shared memory. + */ This comment should use the same style wording as the other nearby shmem function comments. SUGGESTION Allocate and initialize the shared memory of slot synchronization. ~~~ 7. +/* + * Cleanup the shared memory of slot synchronization. + */ +static void +SlotSyncShmemExit(int code, Datum arg) Since this is static, should it use the snake case naming convention? -- e.g. slot_sync_shmem_exit. ~~~ 8. +/* + * Register the callback function to clean up the shared memory of slot + * synchronization. 
+ */ +void +SlotSyncInitialize(void) +{ + before_shmem_exit(SlotSyncShmemExit, 0); +} This is only doing registration for cleanup of shmem stuff. So, does it really need it to be a separate function, or can this be registered within SlotSyncShmemInit() itself? ~~~ 9. SyncReplicationSlots + PG_TRY(); + { + validate_primary_slot_name(wrconn); + + (void) synchronize_slots(wrconn); + } + PG_FINALLY(); + { + if (syncing_slots) + { + SpinLockAcquire(&SlotSyncCtx->mutex); + SlotSyncCtx->syncing = false; + SpinLockRelease(&SlotSyncCtx->mutex); + + syncing_slots = false; + } + + walrcv_disconnect(wrconn); + } + PG_END_TRY(); IIUC, the "if (syncing_slots)" part is not really for normal operation, but it is a safe-guard for cleaning up if some unexpected ERROR happens. Maybe there should be a comment to say that. ====== src/test/recovery/t/040_standby_failover_slots_sync.pl 10. +# Confirm that the logical failover slot is created on the standby and is +# flagged as 'synced' +is($standby1->safe_psql('postgres', + q{SELECT count(*) = 2 FROM pg_replication_slots WHERE slot_name IN ('lsub1_slot', 'lsub2_slot') AND synced;}), + "t", + 'logical slots have synced as true on standby'); /slot is created/slots are created/ /and is flagged/and are flagged/ ====== Kind Regards, Peter Smith. Fujitsu Australia
On Tue, Feb 6, 2024 at 7:19 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Attach the V79 patch which includes the following changes. (Note that only > 0001 is sent in this version, we will send the later patches after rebasing) Thanks Hou-San. Please find the rebased patches. There was a conflict after the recent merge, so I rebased patch001. Patch002 and patch004 address a few of Sawada-san's pending comments. No change in patch003 except rebasing. thanks Shveta
Attachment
On Tue, Feb 6, 2024 at 7:21 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > --- > > +/* Slot sync worker objects */ > > +extern PGDLLIMPORT char *PrimaryConnInfo; extern PGDLLIMPORT char > > +*PrimarySlotName; > > > > These two variables are declared also in xlogrecovery.h. Is it intentional? If so, I > > think it's better to write comments. > > Will address. Added comments in v79_2. > > > > > --- > > + SELECT r.srsubid AS subid, CONCAT('pg_' || srsubid || > > '_sync_' || srrelid || '_' || ctl.system_identifier) AS slotname > > > > and > > > > + SELECT (CASE WHEN r.srsubstate = 'f' THEN > > pg_replication_origin_progress(CONCAT('pg_' || r.srsubid || '_' || r.srrelid), false) > > > > If we use CONCAT function, we can replace '||' with ','. > > Modified in v79_2. thanks Shveta
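As a trivial illustration of that review comment (the OIDs are made up), CONCAT takes a comma-separated argument list, so the '||' operators inside it are redundant:

SELECT 'pg_' || 16394 || '_sync_' || 16401 AS using_operator,
       CONCAT('pg_', 16394, '_sync_', 16401) AS using_concat;

Both expressions build the same slot name string, here 'pg_16394_sync_16401'.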
On Mon, Feb 5, 2024 at 9:19 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thursday, February 1, 2024 12:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Feb 1, 2024 at 8:15 AM Euler Taveira <euler@eulerto.com> wrote: > > > > > > > > > While working on another patch I noticed a new NOTICE message: > > > > > > NOTICE: changed the failover state of replication slot "foo" on publisher to > > false > > > > > > I wasn't paying much attention to this thread then I start reading the 2 > > > patches that was recently committed. The message above surprises me > > because > > > pg_createsubscriber starts to emit this message. The reason is that it doesn't > > > create the replication slot during the CREATE SUBSCRIPTION. Instead, it > > creates > > > the replication slot with failover = false and no such option is informed > > > during CREATE SUBSCRIPTION which means it uses the default value (failover > > = > > > false). I expect that I don't see any message because it is *not* changing the > > > behavior. I was wrong. It doesn't check the failover state on publisher, it > > > just executes walrcv_alter_slot() and emits a message. > > > > > > IMO if we are changing an outstanding property on node A from node B, > > node B > > > already knows (or might know) about that behavior change (because it is > > sending > > > the command), however, node A doesn't (unless log_replication_commands > > = on -- > > > it is not the default). > > > > > > Do we really need this message as NOTICE? > > > > > > > The reason for adding this NOTICE was to keep it similar to other > > Notice messages in these commands like create/drop slot. However, here > > the difference is we may not have altered the slot as the property is > > already the same as we want to set on the publisher. So, I am not sure > > whether we should follow the existing behavior or just get rid of it. > > And then do we remove similar NOTICE in AlterSubscription() as well? > > Normally, I think NOTICE intends to let users know if we did anything > > with slots while executing subscription commands. Does anyone else > > have an opinion on this point? > > > > A related point, I think we can avoid setting the 'failover' property > > in ReplicationSlotAlter() if it is not changed, the advantage is we > > will avoid saving slots. OTOH, this won't be a frequent operation so > > we can leave it as it is as well. > > Here is a patch to remove the NOTICE and improve the ReplicationSlotAlter. > The patch also includes few cleanups based on Peter's feedback. > Thanks for the patch. Pushed. -- With Regards, Amit Kapila.
On Wed, Feb 7, 2024 at 9:30 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v79-0001 Thanks for the feedback. Addressed the comments in v80 patch-set. Please find my response inline for few. > src/sgml/logicaldecoding.sgml > 3. > + <sect2 id="logicaldecoding-replication-slots-synchronization"> > + <title>Replication Slot Synchronization</title> > + <para> > + A logical replication slot on the primary can be synchronized to the hot > + standby by enabling the <literal>failover</literal> option for the slot > + and calling <function>pg_sync_replication_slots</function> > + on the standby. The <literal>failover</literal> option of the slot > + can be enabled either by enabling > + <link linkend="sql-createsubscription-params-with-failover"> > + <literal>failover</literal></link> > + option during subscription creation or by providing > <literal>failover</literal> > + parameter during > + <link linkend="pg-create-logical-replication-slot"> > + <function>pg_create_logical_replication_slot</function></link>. > > IMO it will be better to slightly reword this (like was suggested for > the Commit Message). I felt it is also better to refer/link to "CREATE > SUBSCRIPTION" instead of saying "during subscription creation". Regarding link to create-sub, the 'sql-createsubscription-params-with-failover' takes you to the failover property of Create-Subscription page. Won't that suffice? > > 8. > +/* > + * Register the callback function to clean up the shared memory of slot > + * synchronization. > + */ > +void > +SlotSyncInitialize(void) > +{ > + before_shmem_exit(SlotSyncShmemExit, 0); > +} > > This is only doing registration for cleanup of shmem stuff. So, does > it really need it to be a separate function, or can this be registered > within SlotSyncShmemInit() itself? I think it makes more sense to call it from BaseInit() where we have all such calls like InitTemporaryFileAccess(), ReplicationSlotInitialize() etc which do similar callback registrations using before_shmem_exit(). Attached the patches for v80. Overall changes are: --Addressed comments by Peter (which I responded above) and Amit given in [1] and [2]. --Also improved commit msg and comment around 'wal_level' as suggested by Bertrand in [3]. [1]: https://www.postgresql.org/message-id/CAHut%2BPvtysbVd8tj2AADk%3DeNo0VY9Ov9wkBP-K%2B9tj1wRS4M4w%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAA4eK1%2Bar0N1xXnZZ26BG1qO4LHRS8v3wnH9Pnz4BWmk6SDTHw%40mail.gmail.com [3]: https://www.postgresql.org/message-id/ZcHX4SXkqtGe27a6%40ip-10-97-1-34.eu-west-3.compute.internal thanks Shveta
Attachment
On Tue, Feb 6, 2024 at 12:25 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > That said, I still think the commit message needs some re-wording, what about? > > ===== > If a logical slot on the primary is valid but is invalidated on the standby, > then that slot is dropped and can be recreated on the standby in next > pg_sync_replication_slots() call provided the slot still exists on the primary > server. It is okay to recreate such slots as long as these are not consumable > on the standby (which is the case currently). This situation may occur due to > the following reasons: > > - The max_slot_wal_keep_size on the standby is insufficient to retain WAL > records from the restart_lsn of the slot. > - primary_slot_name is temporarily reset to null and the physical slot is > removed. > > Changing the primary wal_level to a level lower than logical is only possible > if the logical slots are removed on the primary, so it's expected to see > the slots being removed on the standby too (and re-created if they are > re-created on the primary). > ===== Thanks for the feedback. I have incorporated the suggestions in v80. thanks Shveta
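For illustration, such invalidations can be spotted on the standby with something like the query below (a sketch only; 'synced' is the column added by this patch set, while 'wal_status' already exists in pg_replication_slots):

SELECT slot_name, synced, wal_status
FROM pg_replication_slots
WHERE synced;

A synced slot whose wal_status has become 'lost' (for example because max_slot_wal_keep_size on the standby was too small) is expected to be dropped and then re-created on a later sync, provided the slot still exists on the primary.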
On Tue, Feb 6, 2024 at 12:08 PM Peter Smith <smithpb2250@gmail.com> wrote: > > There seems some muddling of names here: > - "local" versus ? and "remote" versus "primary"; or sometimes the > function does not give an indication. > - "sync_slot" versus "synced_slot" versus nothing > - "check" versus "validate" > - etc. > > Below are some suggestions (some are unchanged); probably there are > better ideas for names but my point is that the current names could be > improved: > > CURRENT SUGGESTION ... > drop_obsolete_slots drop_local_synced_slots The new name doesn't convey the intent of the function. If we want to have a difference based on remote/local slots then we can probably name it as drop_local_obsolete_slots. > reserve_wal_for_slot reserve_wal_for_local_slot > local_slot_update update_local_synced_slot > update_and_persist_slot update_and_persist_local_synced_slot > The new names sound better in the above cases as the current names appear too generic. > get_slot_invalidation_cause get_slot_conflict_reason > synchronize_slots synchronize_remote_slots_to_local > synchronize_one_slot synchronize_remote_slot_to_local > The new names don't sound like an improvement. > validate_primary_slot check_remote_synced_slot_exists > validate_slotsync_params check_local_config > In the above cases, the current name conveys the intent of function whereas new names sound a bit generic. So, let's not change in this case. -- With Regards, Amit Kapila.
> > So now if we have such a functionality then it would be even better to > > extend it to selectively sync the slot. For example, if there is some > > issue in syncing all slots, maybe some bug or taking a long time to > > sync because there are a lot of slots but if the user needs to quickly > > failover and he/she is interested in only a couple of slots then such > > a option could be helpful. no? > > > > I see your point but not sure how useful it is in the field. I am fine > if others also think such a parameter will be useful and anyway I > think we can even extend it after v1 is done. > Okay, I am fine with that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 7, 2024 at 4:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > ... > > drop_obsolete_slots drop_local_synced_slots > > The new name doesn't convey the intent of the function. If we want to > have a difference based on remote/local slots then we can probably > name it as drop_local_obsolete_slots. > > > reserve_wal_for_slot reserve_wal_for_local_slot > > local_slot_update update_local_synced_slot > > update_and_persist_slot update_and_persist_local_synced_slot > > > > The new names sound better in the above cases as the current names > appear too generic. Sure, made the suggested function name changes. Since there is no other change, I kept the version as v80_2. thanks Shveta
Attachment
We conducted stress testing for the patch with a setup of one primary node with 100 tables and five subscribers, each having 20 subscriptions. We then created three physical standbys syncing the logical replication slots from the primary node. All 100 slots were successfully synced on all three standbys. We then ran the load and monitored LSN convergence using the prescribed SQL checks. Once the standbys were failover-ready, we were able to successfully promote one of the standbys, and all the subscribers seamlessly migrated to the new primary node. We repeated the tests with 200 tables, creating 200 logical replication slots. With the increased load, all the tests completed successfully. Minor errors (not due to the patch) were observed during the tests: 1) When the load was run, the logical replication apply workers on the subscribers started failing due to timeouts. This is not related to the patch, as it happened due to the small "wal_receiver_timeout" setting w.r.t. the load. To confirm, we ran the same load without the patch too, and the same failure happened. 2) There was a buffer overflow exception on the primary node with the '200 replication slots' case. It was not related to the patch, as it was due to an insufficient memory configuration. All the tests were done on Windows as well as Linux environments. Thank you, Ajin, for the stress test and analysis on Linux.
On Wednesday, February 7, 2024 9:13 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Feb 6, 2024 at 8:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Feb 6, 2024 at 3:33 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > On Tue, Feb 6, 2024 at 1:09 PM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > > > > > > > On Tue, Feb 6, 2024 at 3:19 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > > > > > I think users can refer to LOGs to see if it has changed since > > > > > the first time it was configured. I tried by existing parameter > > > > > and see the following in LOG: > > > > > LOG: received SIGHUP, reloading configuration files > > > > > 2024-02-06 11:38:59.069 IST [9240] LOG: parameter "autovacuum" > changed to "on" > > > > > > > > > > If the user can't confirm then it is better to follow the steps > > > > > mentioned in the patch. Do you want something else to be written > > > > > in docs for this? If so, what? > > > > > > > > IIUC even if a wrong slot name is specified to standby_slot_names > > > > or even standby_slot_names is empty, the standby server might not > > > > be lagging behind the subscribers depending on the timing. But > > > > when checking it the next time, the standby server might lag > > > > behind the subscribers. So what I wanted to know is how the user > > > > can confirm if a failover-enabled subscription is ensured not to > > > > go in front of failover-candidate standbys (i.e., standbys using > > > > the slots listed in standby_slot_names). > > > > > > > > > > But isn't the same explained by two steps ((a) Firstly, on the > > > subscriber node check the last replayed WAL. (b) Next, on the > > > standby server check that the last-received WAL location is ahead of > > > the replayed WAL location on the subscriber identified above.) in > > > the latest *_0004 patch. > > > > > > > Additionally, I would like to add that the users can use the queries > > mentioned in the doc after the primary has failed and before promoting > > the standby. If she wants to do that when both primary and standby are > > available, the value of 'standby_slot_names' on primary should be > > referred. Isn't those two sufficient that there won't be false > > positives? > > From a user perspective, I'd like to confirm the following two points : > > 1. replication slots used by subscribers are synchronized to the standby. > 2. it's guaranteed that logical replication doesn't go ahead of physical > replication to the standby. > > These checks are necessary at least when building a replication setup (primary, > standby, and subscriber). Otherwise, it's too late if we find out that no standby > is failover-ready when the primary fails and we're about to do a failover. > > As for the point 1 above, we can use the step 1 described in the doc. > > As for point 2, the step 2 described in the doc could return true even if > standby_slot_names isn't working. For example, standby_slot_names is empty, > the user changed the standby_slot_names but forgot to reload the config file, > and the walsender doesn't reflect the standby_slot_names update yet for some > reason etc. It's possible that standby's last-received WAL location just happens > to be ahead of the replayed WAL location on the subscriber. So even if the > check query returns true once, it could return false when we check it again, if > standby_slot_names is not working. On the other hand, IIUC if the point 2 is > ensured, the check query always returns true. 
I think it would be good if we > could provide a reliable way to check point 2 ideally via SQL queries (especially > for tools). Based on off-list discussions with Sawada-san and Amit, an alternative approach to improve this would be collecting the names of the standby slots that each walsender has waited for, which will be visible in the pg_stat_replication view. By checking this information, users can confirm that the GUC standby_slot_names is correctly configured and that logical replication is not lagging behind the standbys that hold these slots. To achieve this, we can implement the collection of slot information within each logical walsender that has acquired a failover slot, when waiting for the standby to catch up (WalSndWaitForWal). For each valid standby slot that the walsender has waited for, we will store the slot names in a shared memory area specific to each walsender. To optimize performance, we can rebuild the slot names only if the GUC has changed. We can track this by introducing a flag to monitor GUC modifications. When a user queries the pg_stat_replication view, we will retrieve the collected slot names from the shared memory area associated with each walsender. However, before returning the slot names, we can verify their validity once again. If any of the collected slots have been dropped or invalidated during this time, we will exclude them from the result returned to the user. Apart from the above design, I feel that since users currently have a way to detect this manually, as mentioned in the 0004 patch (we can improve the doc if needed), the new view info can be a separate improvement after pushing the main patch set. Best Regards, Hou zj
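Until something like the proposed pg_stat_replication addition exists, the manual check is roughly the following (a sketch that assumes the standby_slot_names GUC from this patch set; pg_stat_replication itself is unchanged):

-- On the primary: which physical slots failover-aware walsenders must wait for.
SHOW standby_slot_names;

-- And whether the corresponding physical walsenders are actually streaming.
SELECT application_name, state, replay_lsn
FROM pg_stat_replication;

This still cannot prove that a given logical walsender has honored standby_slot_names at a particular moment, which is the gap the proposed view change is meant to close.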
Here are some review comments for patch v80_2-0001. ====== Commit message 1. We may also see the the slots invalidated and dropped on the standby if the primary changes 'wal_level' to a level lower than logical. Changing the primary 'wal_level' to a level lower than logical is only possible if the logical slots are removed on the primary server, so it's expected to see the slots being removed on the standby too (and re-created if they are re-created on the primary server). ~ Typo /the the/the/ ====== src/sgml/logicaldecoding.sgml 2. + <para> + The logical replication slots on the primary can be synchronized to + the hot standby by enabling <literal>failover</literal> during slot + creation (e.g. using the <literal>failover</literal> parameter of + <link linkend="pg-create-logical-replication-slot"> + <function>pg_create_logical_replication_slot</function></link>, or + using the <link linkend="sql-createsubscription-params-with-failover"> + <literal>failover</literal></link> option of the CREATE SUBSCRIPTION + command), and then calling <link linkend="pg-sync-replication-slots"> + <function>pg_sync_replication_slots</function></link> + on the standby. For the synchronization to work, it is mandatory to + have a physical replication slot between the primary and the standby, and + <link linkend="guc-hot-standby-feedback"><varname>hot_standby_feedback</varname></link> + must be enabled on the standby. It is also necessary to specify a valid + <literal>dbname</literal> in the + <link linkend="guc-primary-conninfo"><varname>primary_conninfo</varname></link>. + </para> Shveta previously asked: Regarding link to create-sub, the 'sql-createsubscription-params-with-failover' takes you to the failover property of Create-Subscription page. Won't that suffice? PS: Yes, the current links in 80_2 are fine. ~ 2a. In hindsight, maybe it is simpler just to say "option of CREATE SUBSCRIPTION." instead of "option of the CREATE SUBSCRIPTION command." ~ 2b. Anyway, the "CREATE SUBSCRIPTION" should be rendered as a <command> ====== 3. +/* + * Flag to tell if we are syncing replication slots. Unlike the 'syncing' flag + * in SlotSyncCtxStruct, this flag is true only if the current process is + * performing slot synchronization. This flag is also used as safe-guard + * to clean-up shared 'syncing' flag of SlotSyncCtxStruct if some problem + * happens while we are in the process of synchronization. + */ 3a. It looks confusing to use the same word "process" to mean 2 different things. SUGGESTION This flag is also used as a safeguard to reset the shared 'syncing' flag of SlotSyncCtxStruct if some problem occurs while synchronizing. ~ 3b. TBH, I didn't think that 2nd sentence comment needed to be here -- it seemed more appropriate to say this comment inline where it does this logic in the function SyncReplicationSlots() ~~~ 4. local_sync_slot_required +/* + * Helper function to check if local_slot is required to be retained. + * + * Return false either if local_slot does not exist on the remote_slots list or + * is invalidated while the corresponding remote slot in the list is still + * valid, otherwise return true. + */ /does not exist on the remote_slots list/does not exist in the remote_slots list/ /while the corresponding remote slot in the list is still valid/while the corresponding remote slot is still valid/ ~~~ 5. + bool locally_invalidated = false; + bool remote_exists = false; + IMO it is more natural to declare these in the other order since the function logic assigns/tests them in the other order. ~~~ 6. 
+ + if (!remote_exists || locally_invalidated) + return false; + + return true; IMO it would be both simpler and easier to understand if this was written as one line: return remote_exists && !locally_invalidated; ~~~ 7. + * Note: Change of 'wal_level' on the primary server to a level lower than + * logical may also result in slots invalidation and removal on standby. This + * is because such 'wal_level' change is only possible if the logical slots + * are removed on the primary server, so it's expected to see the slots being + * invalidated and removed on the standby too (and re-created if they are + * re-created on the primary). /may also result in slots invalidation/may also result in slot invalidation/ /removal on standby/removal on the standby/ ~~~ 8. + /* Drop the local slot f it is not required to be retained. */ + if (!local_sync_slot_required(local_slot, remote_slot_list)) I didn't think this comment was needed because IMO the function name is self-explanatory. Anyway, if you do want to keep it, then there is a typo to fix: /f it is/if it is/ ~~~ 9. + * Update the LSNs and persist the local synced slot for further syncs if the + * remote restart_lsn and catalog_xmin have caught up with the local ones, + * otherwise do nothing. Something about "persist ... for further syncs" wording seems awkward to me but I wasn't sure exactly what it should be. When I fed this comment into ChatGPT it interpreted "further" as "future" which seemed better. e.g. If the remote restart_lsn and catalog_xmin have caught up with the local ones, then update the LSNs and store the local synced slot for future synchronization; otherwise, do nothing. Maybe that is a better way to express this comment? ~~~ 10. +/* + * Validates if all necessary GUCs for slot synchronization are set + * appropriately, otherwise raise ERROR. + */ /Validates if all/Check all/ ====== Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Feb 7, 2024 at 5:32 PM shveta malik <shveta.malik@gmail.com> wrote: > > Sure, made the suggested function name changes. Since there is no > other change, I kept the version as v80_2. > Few comments on 0001 =================== 1. + * the slots on the standby and synchronize them. This is done on every call + * to SQL function pg_sync_replication_slots. > I think the second sentence can be slightly changed to: "This is done by a call to SQL function pg_sync_replication_slots." or "One can call SQL function pg_sync_replication_slots to invoke this functionality." 2. +update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid) { ... + SpinLockAcquire(&slot->mutex); + slot->data.plugin = plugin_name; + slot->data.database = remote_dbid; + slot->data.two_phase = remote_slot->two_phase; + slot->data.failover = remote_slot->failover; + slot->data.restart_lsn = remote_slot->restart_lsn; + slot->data.confirmed_flush = remote_slot->confirmed_lsn; + slot->data.catalog_xmin = remote_slot->catalog_xmin; + slot->effective_catalog_xmin = remote_slot->catalog_xmin; + SpinLockRelease(&slot->mutex); + + if (remote_slot->catalog_xmin != slot->data.catalog_xmin) + ReplicationSlotsComputeRequiredXmin(false); + + if (remote_slot->restart_lsn != slot->data.restart_lsn) + ReplicationSlotsComputeRequiredLSN(); ... } How is it possible that after assigning the values from remote_slot they can differ from local slot values? 3. + /* + * Find the oldest existing WAL segment file. + * + * Normally, we can determine it by using the last removed segment + * number. However, if no WAL segment files have been removed by a + * checkpoint since startup, we need to search for the oldest segment + * file currently existing in XLOGDIR. + */ + oldest_segno = XLogGetLastRemovedSegno() + 1; + + if (oldest_segno == 1) + oldest_segno = XLogGetOldestSegno(0); I feel this way isn't there a risk that XLogGetOldestSegno() will get us the seg number from some previous timeline which won't make sense to compare segno in reserve_wal_for_local_slot. Shouldn't you need to fetch the current timeline and send as a parameter to this function as that is the timeline on which standby is communicating with primary. 4. + if (remote_slot->confirmed_lsn > latestFlushPtr) + ereport(ERROR, + errmsg("skipping slot synchronization as the received slot sync" I think the internal errors should be reported with elog as you have done at other palces in the patch. 5. +synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid) { ... + /* + * Copy the invalidation cause from remote only if local slot is not + * invalidated locally, we don't want to overwrite existing one. + */ + if (slot->data.invalidated == RS_INVAL_NONE) + { + SpinLockAcquire(&slot->mutex); + slot->data.invalidated = remote_slot->invalidated; + SpinLockRelease(&slot->mutex); + + /* Make sure the invalidated state persists across server restart */ + ReplicationSlotMarkDirty(); + ReplicationSlotSave(); + slot_updated = true; + } ... } Do we need to copy the 'invalidated' from remote to local if both are same? I think this will happen for each slot each time because normally slots won't be invalidated ones, so there is needless writes. 6. + * Returns TRUE if any of the slots gets updated in this sync-cycle. + */ +static bool +synchronize_slots(WalReceiverConn *wrconn) ... ... 
+void +SyncReplicationSlots(WalReceiverConn *wrconn) +{ + PG_TRY(); + { + validate_primary_slot_name(wrconn); + + (void) synchronize_slots(wrconn); For the purpose of 0001, synchronize_slots() doesn't seems to use return value. So, I suggest to change it accordingly and move the return value in the required patch. 7. + /* + * The primary_slot_name is not set yet or WALs not received yet. + * Synchronization is not possible if the walreceiver is not started. + */ + latestWalEnd = GetWalRcvLatestWalEnd(); + SpinLockAcquire(&WalRcv->mutex); + if ((WalRcv->slotname[0] == '\0') || + XLogRecPtrIsInvalid(latestWalEnd)) + { + SpinLockRelease(&WalRcv->mutex); + return false; For the purpose of 0001, we should give WARNING here. -- With Regards, Amit Kapila.
On Thu, Feb 8, 2024 at 12:08 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for patch v80_2-0001. Thanks for the feedback Peter. Addressed the comments in v81. Attached patch001 for early feedback. Rest of the patches need rebasing and thus will post those later. It also addresses comments by Amit in [1]. [1]: https://www.postgresql.org/message-id/CAA4eK1Ldhh_kf-qG-m5BKY0R1SkdBSx5j%2BEzwpie%2BH9GPWWOYA%40mail.gmail.com thanks Shveta
Attachment
On Thu, Feb 8, 2024 at 4:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Few comments on 0001 > =================== Thanks Amit. Addressed these in v81. > 1. > + * the slots on the standby and synchronize them. This is done on every call > + * to SQL function pg_sync_replication_slots. > > > > I think the second sentence can be slightly changed to: "This is done > by a call to SQL function pg_sync_replication_slots." or "One can call > SQL function pg_sync_replication_slots to invoke this functionality." Done. > 2. > +update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid) > { > ... > + SpinLockAcquire(&slot->mutex); > + slot->data.plugin = plugin_name; > + slot->data.database = remote_dbid; > + slot->data.two_phase = remote_slot->two_phase; > + slot->data.failover = remote_slot->failover; > + slot->data.restart_lsn = remote_slot->restart_lsn; > + slot->data.confirmed_flush = remote_slot->confirmed_lsn; > + slot->data.catalog_xmin = remote_slot->catalog_xmin; > + slot->effective_catalog_xmin = remote_slot->catalog_xmin; > + SpinLockRelease(&slot->mutex); > + > + if (remote_slot->catalog_xmin != slot->data.catalog_xmin) > + ReplicationSlotsComputeRequiredXmin(false); > + > + if (remote_slot->restart_lsn != slot->data.restart_lsn) > + ReplicationSlotsComputeRequiredLSN(); > ... > } > > How is it possible that after assigning the values from remote_slot > they can differ from local slot values? It was a mistake while comment fixing in previous versions. Corrected it now. Thanks for catching. > 3. > + /* > + * Find the oldest existing WAL segment file. > + * > + * Normally, we can determine it by using the last removed segment > + * number. However, if no WAL segment files have been removed by a > + * checkpoint since startup, we need to search for the oldest segment > + * file currently existing in XLOGDIR. > + */ > + oldest_segno = XLogGetLastRemovedSegno() + 1; > + > + if (oldest_segno == 1) > + oldest_segno = XLogGetOldestSegno(0); > > I feel this way isn't there a risk that XLogGetOldestSegno() will get > us the seg number from some previous timeline which won't make sense > to compare segno in reserve_wal_for_local_slot. Shouldn't you need to > fetch the current timeline and send as a parameter to this function as > that is the timeline on which standby is communicating with primary. Yes, modified it. > 4. > + if (remote_slot->confirmed_lsn > latestFlushPtr) > + ereport(ERROR, > + errmsg("skipping slot synchronization as the received slot sync" > > I think the internal errors should be reported with elog as you have > done at other palces in the patch. Done. > 5. > +synchronize_one_slot(RemoteSlot *remote_slot, Oid remote_dbid) > { > ... > + /* > + * Copy the invalidation cause from remote only if local slot is not > + * invalidated locally, we don't want to overwrite existing one. > + */ > + if (slot->data.invalidated == RS_INVAL_NONE) > + { > + SpinLockAcquire(&slot->mutex); > + slot->data.invalidated = remote_slot->invalidated; > + SpinLockRelease(&slot->mutex); > + > + /* Make sure the invalidated state persists across server restart */ > + ReplicationSlotMarkDirty(); > + ReplicationSlotSave(); > + slot_updated = true; > + } > ... > } > > Do we need to copy the 'invalidated' from remote to local if both are > same? I think this will happen for each slot each time because > normally slots won't be invalidated ones, so there is needless writes. It is not needed everytime. Optimized it. 
Now we copy only if local_slot's 'invalidated' value is RS_INVAL_NONE while remote-slot's value != RS_INVAL_NONE. > 6. > + * Returns TRUE if any of the slots gets updated in this sync-cycle. > + */ > +static bool > +synchronize_slots(WalReceiverConn *wrconn) > ... > ... > > +void > +SyncReplicationSlots(WalReceiverConn *wrconn) > +{ > + PG_TRY(); > + { > + validate_primary_slot_name(wrconn); > + > + (void) synchronize_slots(wrconn); > > For the purpose of 0001, synchronize_slots() doesn't seems to use > return value. So, I suggest to change it accordingly and move the > return value in the required patch. Modified it. Also changed return values of all related internal functions which were returning slot_updated. > 7. > + /* > + * The primary_slot_name is not set yet or WALs not received yet. > + * Synchronization is not possible if the walreceiver is not started. > + */ > + latestWalEnd = GetWalRcvLatestWalEnd(); > + SpinLockAcquire(&WalRcv->mutex); > + if ((WalRcv->slotname[0] == '\0') || > + XLogRecPtrIsInvalid(latestWalEnd)) > + { > + SpinLockRelease(&WalRcv->mutex); > + return false; > > For the purpose of 0001, we should give WARNING here. I will fix it in the next version. Sorry, I somehow missed it this time. thanks Shveta
On Thu, Feb 8, 2024 at 4:31 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Feb 8, 2024 at 12:08 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Here are some review comments for patch v80_2-0001. > > Thanks for the feedback Peter. Addressed the comments in v81. I missed mentioning that Hou-san helped in addressing some of these comments in v81. Thanks, Hou-san. thanks Shveta
Here are some review comments for patch v81-0001. ====== 1. GENERAL - ReplicationSlotInvalidationCause enum. I was thinking that the ReplicationSlotInvalidationCause should explicitly set RS_INVAL_NONE = 0 (it's zero anyway, but making it explicit with a comment /* Must be zero. */. will stop it from being changed in the future). ------ /* * Slots can be invalidated, e.g. due to max_slot_wal_keep_size. If so, the * 'invalidated' field is set to a value other than _NONE. */ typedef enum ReplicationSlotInvalidationCause { RS_INVAL_NONE = 0, /* Must be zero. */ ... } ReplicationSlotInvalidationCause; ------ The reason to do this is because many places in the patch check for RS_INVAL_NONE, but if RS_INVAL_NONE == 0 is assured, all those code fragments can be simplified and IMO also become more readable. e.g. update_local_synced_slot() BEFORE Assert(slot->data.invalidated == RS_INVAL_NONE); AFTER Assert(!slot->data.invalidated); ~ e.g. local_sync_slot_required() BEFORE locally_invalidated = (remote_slot->invalidated == RS_INVAL_NONE) && (local_slot->data.invalidated != RS_INVAL_NONE); AFTER locally_invalidated = !remote_slot->invalidated && local_slot->data.invalidated; ~ e.g. synchronize_one_slot() BEFORE if (slot->data.invalidated == RS_INVAL_NONE && remote_slot->invalidated != RS_INVAL_NONE) AFTER if (!slot->data.invalidated && remote_slot->invalidated; BEFORE /* Skip the sync of an invalidated slot */ if (slot->data.invalidated != RS_INVAL_NONE) AFTER /* Skip the sync of an invalidated slot */ if (slot->data.invalidated) BEFORE /* Skip creating the local slot if remote_slot is invalidated already */ if (remote_slot->invalidated != RS_INVAL_NONE) AFTER /* Skip creating the local slot if remote_slot is invalidated already */ if (remote_slot->invalidated) ~ e.g. synchronize_slots() BEFORE if ((XLogRecPtrIsInvalid(remote_slot->restart_lsn) || XLogRecPtrIsInvalid(remote_slot->confirmed_lsn) || !TransactionIdIsValid(remote_slot->catalog_xmin)) && remote_slot->invalidated == RS_INVAL_NONE) AFTER if ((XLogRecPtrIsInvalid(remote_slot->restart_lsn) || XLogRecPtrIsInvalid(remote_slot->confirmed_lsn) || !TransactionIdIsValid(remote_slot->catalog_xmin)) && !remote_slot->invalidated) ====== src/backend/replication/logical/slotsync.c 2. update_local_synced_slot + if (strcmp(remote_slot->plugin, NameStr(slot->data.plugin)) == 0 && + remote_dbid == slot->data.database && + !xmin_changed && !restart_lsn_changed && + remote_slot->two_phase == slot->data.two_phase && + remote_slot->failover == slot->data.failover && + remote_slot->confirmed_lsn == slot->data.confirmed_flush) + return false; Consider rearranging the conditions to put the strcmp later -- e.g. might as well avoid the (more expensive?) strcmp if some of those boolean tests are already false. ~~~ 3. + /* + * There is a possibility of parallel database drop by startup + * process and re-creation of new slot by user in the small window + * between getting the slot to drop and locking the db. This new + * user-created slot may end up using the same shared memory as + * that of 'local_slot'. Thus check if local_slot is still the + * synced one before performing actual drop. + */ BEFORE There is a possibility of parallel database drop by startup process and re-creation of new slot by user in the small window between getting the slot to drop and locking the db. 
SUGGESTION In the small window between getting the slot to drop and locking the database, there is a possibility of a parallel database drop by the startup process or the creation of a new slot by the user. ~~~ 4. +/* + * Synchronize single slot to given position. + * + * This creates a new slot if there is no existing one and updates the + * metadata of the slot as per the data received from the primary server. + * + * The slot is created as a temporary slot and stays in the same state until the + * the remote_slot catches up with locally reserved position and local slot is + * updated. The slot is then persisted and is considered as sync-ready for + * periodic syncs. + */ /Synchronize single slot to given position./Synchronize a single slot to the given position./ ~~~ 5. synchronize_slots + /* + * The primary_slot_name is not set yet or WALs not received yet. + * Synchronization is not possible if the walreceiver is not started. + */ + latestWalEnd = GetWalRcvLatestWalEnd(); + SpinLockAcquire(&WalRcv->mutex); + if ((WalRcv->slotname[0] == '\0') || + XLogRecPtrIsInvalid(latestWalEnd)) + { + SpinLockRelease(&WalRcv->mutex); + return; + } + SpinLockRelease(&WalRcv->mutex); The comment talks about the GUC "primary_slot_name", but the code is checking the WalRcv's slotname. It may be the same, but the difference is confusing. ~~~ 6. + /* + * If restart_lsn, confirmed_lsn or catalog_xmin is invalid but slot + * is not invalidated, that means we have fetched the remote_slot in + * its RS_EPHEMERAL state itself. In such a case, avoid syncing it + * yet. We can always sync it in the next sync cycle when the + * remote_slot is persisted and has valid lsn(s) and xmin values. + * + * XXX: In future, if we plan to expose 'slot->data.persistency' in + * pg_replication_slots view, then we can avoid fetching RS_EPHEMERAL + * slots in the first place. + */ SUGGESTION (1st para) If restart_lsn, confirmed_lsn or catalog_xmin is invalid but the slot is valid, that means we have fetched the remote_slot in its RS_EPHEMERAL state. In such a case, don't sync it; we can always sync it in the next ... ~~~ 7. + /* + * Use shared lock to prevent a conflict with + * ReplicationSlotsDropDBSlots(), trying to drop the same slot during + * a drop-database operation. + */ + LockSharedObject(DatabaseRelationId, remote_dbid, 0, AccessShareLock); + + synchronize_one_slot(remote_slot, remote_dbid); + + UnlockSharedObject(DatabaseRelationId, remote_dbid, 0, AccessShareLock); IMO remove the blank lines (e.g., you don't use this kind of formatting for spin locks) ====== Kind Regards, Peter Smith. Fujitsu Australia
On Thursday, February 8, 2024 7:07 PM shveta malik <shveta.malik@gmail.com> > > On Thu, Feb 8, 2024 at 4:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Few comments on 0001 > > =================== > > 7. > > + /* > > + * The primary_slot_name is not set yet or WALs not received yet. > > + * Synchronization is not possible if the walreceiver is not started. > > + */ > > + latestWalEnd = GetWalRcvLatestWalEnd(); > > + SpinLockAcquire(&WalRcv->mutex); if ((WalRcv->slotname[0] == '\0') > > + || > > + XLogRecPtrIsInvalid(latestWalEnd)) > > + { > > + SpinLockRelease(&WalRcv->mutex); > > + return false; > > > > For the purpose of 0001, we should give WARNING here. Fixed. Here is the V82 patch set which includes the following changes: 0001 1. Fixed an oversight: the size of the shared memory for slot sync was not counted in CalculateShmemSize(). 2. Added a warning message if the walreceiver has not started yet. 3. Fixed the above comment. 0002 - 0003 Rebased 0004 1. Added more details clarifying that the user should run the second query on the standby after the primary is down. 2. Mentioned that the query needs to be run on the db that includes the failover subscription. Thanks, Shveta, for working on the changes. Best Regards, Hou zj
Attachment
On Fri, Feb 9, 2024 at 10:00 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Here is the V82 patch set which includes the following changes: > +reserve_wal_for_local_slot(XLogRecPtr restart_lsn) { ... + /* + * Find the oldest existing WAL segment file. + * + * Normally, we can determine it by using the last removed segment + * number. However, if no WAL segment files have been removed by a + * checkpoint since startup, we need to search for the oldest segment + * file currently existing in XLOGDIR. + */ + oldest_segno = XLogGetLastRemovedSegno() + 1; + + if (oldest_segno == 1) + { + TimeLineID cur_timeline; + + GetWalRcvFlushRecPtr(NULL, &cur_timeline); + oldest_segno = XLogGetOldestSegno(cur_timeline); ... ... This means that if the restart_lsn of the slot is from a prior timeline, then the standby needs to wait longer to sync the slot. Ideally, that should be okay: how often can it happen that a slot's restart_lsn is from a timeline older than the current flush timeline on the standby? OTOH, in the prior version of the patch (v80_2-0001*), we searched for the oldest segment across all possible timelines via code like: +reserve_wal_for_local_slot(XLogRecPtr restart_lsn) { ... + */ + oldest_segno = XLogGetLastRemovedSegno() + 1; + + if (oldest_segno == 1) + oldest_segno = XLogGetOldestSegno(0); I don't see a problem either way, as in both scenarios this is a very rare case and doesn't seem to cause any problem, but I would like to know the opinion of others. -- With Regards, Amit Kapila.
On Fri, Feb 9, 2024 at 9:57 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for patch v81-0001. > > ====== > > 1. GENERAL - ReplicationSlotInvalidationCause enum. > > I was thinking that the ReplicationSlotInvalidationCause should > explicitly set RS_INVAL_NONE = 0 (it's zero anyway, but making it > explicit with a comment /* Must be zero. */. will stop it from being > changed in the future). > > ------ > /* > * Slots can be invalidated, e.g. due to max_slot_wal_keep_size. If so, the > * 'invalidated' field is set to a value other than _NONE. > */ > typedef enum ReplicationSlotInvalidationCause > { > RS_INVAL_NONE = 0, /* Must be zero. */ > ... > } ReplicationSlotInvalidationCause; > ------ > > The reason to do this is because many places in the patch check for > RS_INVAL_NONE, but if RS_INVAL_NONE == 0 is assured, all those code > fragments can be simplified and IMO also become more readable. > > e.g. update_local_synced_slot() > > BEFORE > Assert(slot->data.invalidated == RS_INVAL_NONE); > > AFTER > Assert(!slot->data.invalidated); > I find the current code style more intuitive. > > 5. synchronize_slots > > + /* > + * The primary_slot_name is not set yet or WALs not received yet. > + * Synchronization is not possible if the walreceiver is not started. > + */ > + latestWalEnd = GetWalRcvLatestWalEnd(); > + SpinLockAcquire(&WalRcv->mutex); > + if ((WalRcv->slotname[0] == '\0') || > + XLogRecPtrIsInvalid(latestWalEnd)) > + { > + SpinLockRelease(&WalRcv->mutex); > + return; > + } > + SpinLockRelease(&WalRcv->mutex); > > The comment talks about the GUC "primary_slot_name", but the code is > checking the WalRcv's slotname. It may be the same, but the difference > is confusing. > Yeah, in this case, it would be the same because we don't allow slot sync worker unless primary_slot_name is configured in which case WalRcv->slotname refers to primary_slot_name. However, I think it is better to explain here why slot synchronization is not possible or doesn't make sense till walreceiver starts streaming and in which case, won't it be sufficient to just check latestWalEnd? -- With Regards, Amit Kapila.
On Thu, Feb 8, 2024 at 8:01 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Feb 8, 2024 at 12:08 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Here are some review comments for patch v80_2-0001. > > Thanks for the feedback Peter. Addressed the comments in v81. > Attached patch001 for early feedback. Rest of the patches need > rebasing and thus will post those later. > > It also addresses comments by Amit in [1]. Thank you for updating the patch! Here are random comments: --- + ereport(ERROR, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot use replication slot \"%s\" for logical" + " decoding", NameStr(slot->data.name)), + errdetail("This slot is being synced from the primary server."), + errhint("Specify another replication slot.")); + I think it's better to use "synchronized" instead of "synced" for consistency with other places. --- We can create a temporary failover slot on the primary, but such a slot is not synchronized. Do we want to disallow creating it? --- + + /* + * Register the callback function to clean up the shared memory of slot + * synchronization. + */ + SlotSyncInitialize(); I think it would have a wider impact than expected. IIUC this callback is needed only for processes that call synchronize_slots(). Why do we want all processes to register this callback? --- + if (!valid) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + /* translator: second %s is a GUC variable name */ + errdetail("The primary server slot \"%s\" specified by \"%s\" is not valid.", + PrimarySlotName, "primary_slot_name")); + I think that the detail message is not appropriate since the primary_slot_name could actually be a valid name. I think we can rephrase it to something like "The replication slot %s specified by %s does not exist on the primary server". Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
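For context, a rough sketch of the case raised in the second comment above (parameter names as described for the patched pg_create_logical_replication_slot; the slot names and plugin are illustrative):

-- A persistent failover-enabled slot, which is a candidate for synchronization:
SELECT pg_create_logical_replication_slot('lsub1_slot', 'pgoutput', failover => true);

-- A temporary slot can seemingly also be marked failover, yet temporary slots
-- are never synchronized, which is the combination being questioned above:
SELECT pg_create_logical_replication_slot('tmp_failover_slot', 'pgoutput',
                                           temporary => true, failover => true);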
On Friday, February 9, 2024 2:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Feb 8, 2024 at 8:01 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Thu, Feb 8, 2024 at 12:08 PM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > > > Here are some review comments for patch v80_2-0001. > > > > Thanks for the feedback Peter. Addressed the comments in v81. > > Attached patch001 for early feedback. Rest of the patches need > > rebasing and thus will post those later. > > > > It also addresses comments by Amit in [1]. > > Thank you for updating the patch! Here are random comments: Thanks for the comments! > > --- > + > + /* > + * Register the callback function to clean up the shared memory of > slot > + * synchronization. > + */ > + SlotSyncInitialize(); > > I think it would have a wider impact than expected. IIUC this callback is needed > only for processes that call synchronize_slots(). Why do we want all processes > to register this callback? I think the current style is similar to the ReplicationSlotInitialize() above it. For backends, both of them can only be used when the user calls slot SQL functions. So, I think it could be fine to register it in the general place, which also avoids registering the same callback again for the later slotsync worker patch. Another alternative is to register the callback when calling slotsync functions and unregister it after the function call. And register the callback in slotsyncworkmain() for the slotsync worker patch, although this may add a bit more code. Best Regards, Hou zj
FYI -- I checked patch v81-0001 to find which of the #includes are strictly needed. ====== src/backend/replication/logical/slotsync.c 1. +#include "postgres.h" + +#include <time.h> + +#include "access/genam.h" +#include "access/table.h" +#include "access/xlog_internal.h" +#include "access/xlogrecovery.h" +#include "catalog/pg_database.h" +#include "commands/dbcommands.h" +#include "libpq/pqsignal.h" +#include "pgstat.h" +#include "postmaster/bgworker.h" +#include "postmaster/fork_process.h" +#include "postmaster/interrupt.h" +#include "postmaster/postmaster.h" +#include "replication/logical.h" +#include "replication/logicallauncher.h" +#include "replication/walreceiver.h" +#include "replication/slotsync.h" +#include "storage/ipc.h" +#include "storage/lmgr.h" +#include "storage/procarray.h" +#include "tcop/tcopprot.h" +#include "utils/builtins.h" +#include "utils/fmgroids.h" +#include "utils/guc_hooks.h" +#include "utils/pg_lsn.h" +#include "utils/ps_status.h" +#include "utils/timeout.h" +#include "utils/varlena.h" Many of these #includes seem unnecessary. e.g. I was able to remove all those that are commented-out below, and the file still compiles OK with no warnings: #include "postgres.h" //#include <time.h> //#include "access/genam.h" //#include "access/table.h" #include "access/xlog_internal.h" #include "access/xlogrecovery.h" #include "catalog/pg_database.h" #include "commands/dbcommands.h" //#include "libpq/pqsignal.h" //#include "pgstat.h" //#include "postmaster/bgworker.h" //#include "postmaster/fork_process.h" //#include "postmaster/interrupt.h" //#include "postmaster/postmaster.h" #include "replication/logical.h" //#include "replication/logicallauncher.h" //#include "replication/walreceiver.h" #include "replication/slotsync.h" #include "storage/ipc.h" #include "storage/lmgr.h" #include "storage/procarray.h" //#include "tcop/tcopprot.h" #include "utils/builtins.h" //#include "utils/fmgroids.h" //#include "utils/guc_hooks.h" #include "utils/pg_lsn.h" //#include "utils/ps_status.h" //#include "utils/timeout.h" //#include "utils/varlena.h" ====== src/backend/replication/slot.c 2. #include "pgstat.h" +#include "replication/slotsync.h" #include "replication/slot.h" +#include "replication/walsender.h" #include "storage/fd.h" The #include "replication/walsender.h" seems to be unnecessary. ====== src/backend/replication/walsender.c 3. #include "replication/logical.h" +#include "replication/slotsync.h" #include "replication/slot.h" The #include "replication/slotsync.h" is needed, but only for Assert code: Assert(am_cascading_walsender || IsSyncingReplicationSlots()); So you could #ifdef around that #include if you wish to. ====== Kind Regards, Peter Smith. Fujitsu Australia
On Fri, Feb 9, 2024 at 10:00 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > Few comments on 0001 =================== 1. Shouldn't pg_sync_replication_slots() check whether the user has replication privilege? 2. The function declarations in slotsync.h don't seem to be in the same order as they are defined in slotsync.c. For example, see ValidateSlotSyncParams(). The same is true for new functions exposed via walreceiver.h and walsender.h. Please check the patch for other such inconsistencies. 3. +# Wait for the standby to finish sync +$standby1->wait_for_log( + qr/LOG: ( [A-Z0-9]+:)? newly created slot \"lsub1_slot\" is sync-ready now/, + $offset); + +$standby1->wait_for_log( + qr/LOG: ( [A-Z0-9]+:)? newly created slot \"lsub2_slot\" is sync-ready now/, + $offset); + +# Confirm that the logical failover slots are created on the standby and are +# flagged as 'synced' +is($standby1->safe_psql('postgres', + q{SELECT count(*) = 2 FROM pg_replication_slots WHERE slot_name IN ('lsub1_slot', 'lsub2_slot') AND synced;}), + "t", + 'logical slots have synced as true on standby'); Isn't the last test that queried pg_replication_slots sufficient? I think wait_for_log() would be required for slotsync worker or am I missing something? Apart from the above, I have modified a few comments in the attached. -- With Regards, Amit Kapila.
Attachment
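For reference, a sketch of how point 1 above plays out in practice (assuming the REPLICATION-privilege check under discussion and the failover/synced columns exposed by this patch series):

-- On the standby, run by a role with the REPLICATION attribute (or a superuser):
SELECT pg_sync_replication_slots();

-- Afterwards, the synchronized slots should be visible with synced = true:
SELECT slot_name, failover, synced
FROM pg_replication_slots;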
On Friday, February 9, 2024 6:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Feb 9, 2024 at 10:00 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > Few comments on 0001 > =================== > 1. Shouldn't pg_sync_replication_slots() check whether the user has replication > privilege? Yes, added. > > 2. The function declarations in slotsync.h don't seem to be in the same order as > they are defined in slotsync.c. For example, see ValidateSlotSyncParams(). The > same is true for new functions exposed via walreceiver.h and walsender.h. Please > check the patch for other such inconsistencies. I reordered the function declarations. > > 3. > +# Wait for the standby to finish sync > +$standby1->wait_for_log( > + qr/LOG: ( [A-Z0-9]+:)? newly created slot \"lsub1_slot\" is sync-ready > +now/, $offset); > + > +$standby1->wait_for_log( > + qr/LOG: ( [A-Z0-9]+:)? newly created slot \"lsub2_slot\" is sync-ready > +now/, $offset); > + > +# Confirm that the logical failover slots are created on the standby > +and are # flagged as 'synced' > +is($standby1->safe_psql('postgres', > + q{SELECT count(*) = 2 FROM pg_replication_slots WHERE slot_name IN > ('lsub1_slot', 'lsub2_slot') AND synced;}), > + "t", > + 'logical slots have synced as true on standby'); > > Isn't the last test that queried pg_replication_slots sufficient? I think > wait_for_log() would be required for slotsync worker or am I missing something? I think it's not needed in 0001, so removed. > Apart from the above, I have modified a few comments in the attached. Thanks, it looks good to me, so applied. Attach the V83 patch which addressed Peter[1][2], Amit and Sawada-san's[3] comments. Only 0001 is sent in this version, we will send other patches after rebasing. [1] https://www.postgresql.org/message-id/CAHut%2BPvW8s6AYD2UD0xadM%2B3VqBkXP2LjD30LEGRkHUa-Szm%2BQ%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAHut%2BPv88vp9mNxX37c_Bc5FDBsTS%2BdhV02Vgip9Wqwh7GBYSg%40mail.gmail.com [3] https://www.postgresql.org/message-id/CAD21AoDvyLu%3D2-mqfGn_T_3jUamR34w%2BsxKvYnVzKqTCpyq_FQ%40mail.gmail.com Best Regards, Hou zj
Attachment
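For context on comment 1 above (the replication-privilege check for pg_sync_replication_slots()): such a check typically amounts to a few lines at the top of the SQL-callable function. A minimal sketch, assuming the check is made directly with has_rolreplication(); the helper name, its placement, and the exact error wording are illustrative rather than what the posted patch necessarily does:

#include "postgres.h"

#include "miscadmin.h"			/* GetUserId(), has_rolreplication() */

/*
 * Sketch only: refuse to synchronize slots unless the calling role has the
 * REPLICATION attribute.  The function name and messages are hypothetical.
 */
static void
check_replication_privilege(void)
{
	if (!has_rolreplication(GetUserId()))
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("permission denied to synchronize replication slots"),
				 errdetail("Only roles with the REPLICATION attribute may call this function.")));
}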
On Friday, February 9, 2024 12:27 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for patch v81-0001. Thanks for the comments. > . GENERAL - ReplicationSlotInvalidationCause enum. > > I was thinking that the ReplicationSlotInvalidationCause should > explicitly set RS_INVAL_NONE = 0 (it's zero anyway, but making it > explicit with a comment / Must be zero. /. will stop it from being > changed in the future). I think the current code is better, so didn't change this. > 5. synchronize_slots > > + /* > + * The primary_slot_name is not set yet or WALs not received yet. > + * Synchronization is not possible if the walreceiver is not started. > + */ > + latestWalEnd = GetWalRcvLatestWalEnd(); > + SpinLockAcquire(&WalRcv->mutex); > + if ((WalRcv->slotname[0] == '\0') || > + XLogRecPtrIsInvalid(latestWalEnd)) > + { > + SpinLockRelease(&WalRcv->mutex); > + return; > + } > + SpinLockRelease(&WalRcv->mutex); > > The comment talks about the GUC "primary_slot_name", but the code is > checking the WalRcv's slotname. It may be the same, but the difference > is confusing. This part has been removed. > 7. > + /* > + * Use shared lock to prevent a conflict with > + * ReplicationSlotsDropDBSlots(), trying to drop the same slot during > + * a drop-database operation. > + */ > + LockSharedObject(DatabaseRelationId, remote_dbid, 0, AccessShareLock); > + > + synchronize_one_slot(remote_slot, remote_dbid); > + > + UnlockSharedObject(DatabaseRelationId, remote_dbid, 0, > + AccessShareLock); > > IMO remove the blank lines (e.g., you don't use this kind of formatting for spin > locks) I am not sure if it will look better, so didn't change this. Other comments look good. Best Regards, Hou zj
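For readers following the GENERAL comment above: the suggestion (declined here) was simply to pin the first enumerator and document it, roughly as below. The member list follows the invalidation causes present in replication/slot.h around this time; the per-member comments are illustrative.

/* Sketch of the suggested change only; not what the patch does. */
typedef enum ReplicationSlotInvalidationCause
{
	RS_INVAL_NONE = 0,			/* Must be zero. */
	RS_INVAL_WAL_REMOVED,		/* required WAL has been removed */
	RS_INVAL_HORIZON,			/* required rows have been removed */
	RS_INVAL_WAL_LEVEL,			/* wal_level insufficient for the slot */
} ReplicationSlotInvalidationCause;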
On Friday, February 9, 2024 4:13 PM Peter Smith <smithpb2250@gmail.com> wrote: > > FYI -- I checked patch v81-0001 to find which of the #includes are strictly needed. Thanks! > > 1. > ... > > Many of these #includes seem unnecessary. e.g. I was able to remove > all those that are commented-out below, and the file still compiles OK > with no warnings: Removed. > > > ====== > src/backend/replication/slot.c > > > > 2. > #include "pgstat.h" > +#include "replication/slotsync.h" > #include "replication/slot.h" > +#include "replication/walsender.h" > #include "storage/fd.h" > > The #include "replication/walsender.h" seems to be unnecessary. Removed. > > ====== > src/backend/replication/walsender.c > > 3. > #include "replication/logical.h" > +#include "replication/slotsync.h" > #include "replication/slot.h" > > The #include "replication/slotsync.h" is needed, but only for Assert code: > Assert(am_cascading_walsender || IsSyncingReplicationSlots()); > > So you could #ifdef around that #include if you wish to. I am not sure if it's necessary and didn't find similar coding, so didn't change. Best Regards, Hou zj
On Saturday, February 10, 2024 11:37 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Attach the V83 patch which addressed Peter[1][2], Amit and Sawada-san's[3] > comments. Only 0001 is sent in this version, we will send other patches after > rebasing. > > [1] > https://www.postgresql.org/message-id/CAHut%2BPvW8s6AYD2UD0xadM%2B > 3VqBkXP2LjD30LEGRkHUa-Szm%2BQ%40mail.gmail.com > [2] > https://www.postgresql.org/message-id/CAHut%2BPv88vp9mNxX37c_Bc5FDBs > TS%2BdhV02Vgip9Wqwh7GBYSg%40mail.gmail.com > [3] > https://www.postgresql.org/message-id/CAD21AoDvyLu%3D2-mqfGn_T_3jUa > mR34w%2BsxKvYnVzKqTCpyq_FQ%40mail.gmail.com I noticed one cfbot failure that the slot is not synced when the standby is lagging behind the subscriber. I have modified the test to disable the sub before syncing to avoid this failure. Attach the V83_2 patch, no other code changes are included in this version. Best Regards, Hou zj
Attachment
On Fri, Feb 9, 2024 at 4:08 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, February 9, 2024 2:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Thu, Feb 8, 2024 at 8:01 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > On Thu, Feb 8, 2024 at 12:08 PM Peter Smith <smithpb2250@gmail.com> > > wrote: > > > > > > > > Here are some review comments for patch v80_2-0001. > > > > > > Thanks for the feedback Peter. Addressed the comments in v81. > > > Attached patch001 for early feedback. Rest of the patches need > > > rebasing and thus will post those later. > > > > > > It also addresses comments by Amit in [1]. > > > > Thank you for updating the patch! Here are random comments: > > Thanks for the comments! > > > > > --- > > + > > + /* > > + * Register the callback function to clean up the shared memory of > > slot > > + * synchronization. > > + */ > > + SlotSyncInitialize(); > > > > I think it would have a wider impact than expected. IIUC this callback is needed > > only for processes who calls synchronize_slots(). Why do we want all processes > > to register this callback? > > I think the current style is similar to the ReplicationSlotInitialize() above it. For backend, > both of them can only be used when user calls slot SQL functions. So, I think it could be fine to > register it at the general place which can also avoid registering the same again for the later > slotsync worker patch. Yes, but it seems to be a legitimate case since replication slot code involves many functions that need the callback to clear the flag. On the other hand, in the slotsync code, only one function, SyncReplicationSlots(), needs the callback at least in 0001 patch. > Another alternative is to register the callback when calling slotsync functions > and unregister it after the function call. And register the callback in > slotsyncworkmain() for the slotsync worker patch, although this may adds a few > more codes. Another idea is that SyncReplicationSlots() calls synchronize_slots() in PG_ENSURE_ERROR_CLEANUP() block instead of PG_TRY(), to make sure to clear the flag in case of ERROR or FATAL. And the slotsync worker uses the before_shmem_callback to clear the flag. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Sat, Feb 10, 2024 at 5:31 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Feb 9, 2024 at 4:08 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > Another alternative is to register the callback when calling slotsync functions > > and unregister it after the function call. And register the callback in > > slotsyncworkmain() for the slotsync worker patch, although this may adds a few > > more codes. > > Another idea is that SyncReplicationSlots() calls synchronize_slots() > in PG_ENSURE_ERROR_CLEANUP() block instead of PG_TRY(), to make sure > to clear the flag in case of ERROR or FATAL. And the slotsync worker > uses the before_shmem_callback to clear the flag. > +1. This sounds like a better way to clear the flag. -- With Regards, Amit Kapila.
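To make the agreed approach concrete: PG_ENSURE_ERROR_CLEANUP() registers the callback with before_shmem_exit() (covering FATAL and process exit) and wraps the protected section in PG_TRY/PG_CATCH, so the callback also runs before the error is re-thrown on ERROR. A minimal sketch of that pattern for clearing a shared "syncing" flag follows; the shared-struct layout, field names, and callback body are assumptions based on names mentioned in the thread, not the actual patch, and other steps of the real function are omitted:

#include "postgres.h"

#include "replication/walreceiver.h"	/* WalReceiverConn */
#include "storage/ipc.h"				/* PG_ENSURE_ERROR_CLEANUP */
#include "storage/spin.h"

/* Hypothetical shared state; the real SlotSyncCtxStruct may differ. */
typedef struct SlotSyncCtxStruct
{
	bool		syncing;		/* is a slot sync in progress? */
	slock_t		mutex;			/* protects 'syncing' */
} SlotSyncCtxStruct;

static SlotSyncCtxStruct *SlotSyncCtx;	/* points into shared memory */

extern void synchronize_slots(WalReceiverConn *wrconn);	/* as in the patch */

/* Cleanup callback; before_shmem_exit()-style signature (int code, Datum arg). */
static void
slotsync_failure_callback(int code, Datum arg)
{
	/*
	 * The connection is available via 'arg' if the callback needs it; here
	 * we only reset the shared flag so that a later sync is not blocked.
	 */
	SpinLockAcquire(&SlotSyncCtx->mutex);
	SlotSyncCtx->syncing = false;
	SpinLockRelease(&SlotSyncCtx->mutex);
}

void
SyncReplicationSlots(WalReceiverConn *wrconn)
{
	/* Registers the callback and opens a PG_TRY block. */
	PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn));
	{
		synchronize_slots(wrconn);	/* may ERROR out at any point */
	}
	/* Cancels the registration; on ERROR, runs the callback and re-throws. */
	PG_END_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn));
}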
On Saturday, February 10, 2024 9:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Feb 10, 2024 at 5:31 PM Masahiko Sawada <sawada.mshk@gmail.com> > wrote: > > > > On Fri, Feb 9, 2024 at 4:08 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > Another alternative is to register the callback when calling > > > slotsync functions and unregister it after the function call. And > > > register the callback in > > > slotsyncworkmain() for the slotsync worker patch, although this may > > > adds a few more codes. > > > > Another idea is that SyncReplicationSlots() calls synchronize_slots() > > in PG_ENSURE_ERROR_CLEANUP() block instead of PG_TRY(), to make sure > > to clear the flag in case of ERROR or FATAL. And the slotsync worker > > uses the before_shmem_callback to clear the flag. > > > > +1. This sounds like a better way to clear the flag. Agreed. Here is the V84 patch which addressed this. Apart from above, I removed the txn start/end codes from 0001 as they are used in the slotsync worker patch. And I also ran pgindent and pgperltidy for the patch. Best Regards, Hou zj
Attachment
On Sun, Feb 11, 2024 at 6:53 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Agreed. Here is the V84 patch which addressed this. > Few comments: ============= 1. Isn't the new function (pg_sync_replication_slots()) allowed to sync the slots from physical standby to another cascading standby? Won't it be better to simply disallow syncing slots on cascading standby to keep it consistent with slotsync worker behavior? 2. Previously, I commented to keep the declaration and definition of functions in the same order but I see that it still doesn't match in the below case: @@ -44,6 +46,7 @@ extern void WalSndWakeup(bool physical, bool logical); extern void WalSndInitStopping(void); extern void WalSndWaitStopping(void); extern void HandleWalSndInitStopping(void); +extern XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli); extern void WalSndRqstFileReload(void); I think we can keep the new declaration just before WalSndSignals(). That would be more consistent. 3. + <para> + True if this is a logical slot that was synced from a primary server. + </para> + <para> + On a hot standby, the slots with the synced column marked as true can + neither be used for logical decoding nor dropped by the user. The value I don't think we need a separate para here. Apart from this, I have made several cosmetic changes in the attached. Please include these in the next version unless you see any problems. -- With Regards, Amit Kapila.
Attachment
Hi, On Sun, Feb 11, 2024 at 01:23:19PM +0000, Zhijie Hou (Fujitsu) wrote: > On Saturday, February 10, 2024 9:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sat, Feb 10, 2024 at 5:31 PM Masahiko Sawada <sawada.mshk@gmail.com> > > wrote: > > > > > > On Fri, Feb 9, 2024 at 4:08 PM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > Another alternative is to register the callback when calling > > > > slotsync functions and unregister it after the function call. And > > > > register the callback in > > > > slotsyncworkmain() for the slotsync worker patch, although this may > > > > adds a few more codes. > > > > > > Another idea is that SyncReplicationSlots() calls synchronize_slots() > > > in PG_ENSURE_ERROR_CLEANUP() block instead of PG_TRY(), to make sure > > > to clear the flag in case of ERROR or FATAL. And the slotsync worker > > > uses the before_shmem_callback to clear the flag. > > > > > > > +1. This sounds like a better way to clear the flag. > > Agreed. Here is the V84 patch which addressed this. > > Apart from above, I removed the txn start/end codes from 0001 as they are used > in the slotsync worker patch. And I also ran pgindent and pgperltidy for the > patch. > Thanks! A few random comments: 001 === " For the synchronization to work, it is mandatory to have a physical replication slot between the primary and the standby, " Maybe mention "primary_slot_name" here? 002 === + <para> + Synchronize the logical failover slots from the primary server to the standby server. should we say "logical failover replication slots" instead? 003 === + If, after executing the function, + <link linkend="guc-hot-standby-feedback"> + <varname>hot_standby_feedback</varname></link> is disabled on + the standby or the physical slot configured in + <link linkend="guc-primary-slot-name"> + <varname>primary_slot_name</varname></link> is + removed, I think another option that could lead to slot invalidation is if primary_slot_name is NULL or miss-configured. Indeed hot_standby_feedback would be working (for the catalog_xmin) but only as long as the standby is up and running. 004 === + on the standby. For the synchronization to work, it is mandatory to + have a physical replication slot between the primary and the standby, should we mention primary_slot_name here? 005 === + To resume logical replication after failover from the synced logical + slots, the subscription's 'conninfo' must be altered Only in a pub/sub context but not for other ways of using the logical replication slot(s). 006 === + neither be used for logical decoding nor dropped by the user what about "nor dropped manually"? 007 === +typedef struct SlotSyncCtxStruct +{ Should we remove "Struct" from the struct name? 008 === + ereport(LOG, + errmsg("dropped replication slot \"%s\" of dbid %d", + NameStr(local_slot->data.name), + local_slot->data.database)); We emit a message when an "invalidated" slot is dropped but not when we create a slot. Shouldn't we emit a message when we create a synced slot on the standby? I think that could be confusing to see "a drop" message not followed by "a create" one when it's expected (slot valid on the primary for example). 009 === Regarding 040_standby_failover_slots_sync.pl what about adding tests for? 
- synced slot invalidation (and ensure it's recreated once pg_sync_replication_slots() is called and when the slot in primary is valid) - cannot enable failover for a temporary replication slot - replication slots can only be synchronized from a standby server Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Feb 12, 2024 at 3:33 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > A few random comments: > > > 003 === > > + If, after executing the function, > + <link linkend="guc-hot-standby-feedback"> > + <varname>hot_standby_feedback</varname></link> is disabled on > + the standby or the physical slot configured in > + <link linkend="guc-primary-slot-name"> > + <varname>primary_slot_name</varname></link> is > + removed, > > I think another option that could lead to slot invalidation is if primary_slot_name > is NULL or miss-configured. > If the primary_slot_name is NULL then the function will error out. So, not sure, if we need to say anything explicitly here. > Indeed hot_standby_feedback would be working > (for the catalog_xmin) but only as long as the standby is up and running. > ... > > 005 === > > + To resume logical replication after failover from the synced logical > + slots, the subscription's 'conninfo' must be altered > > Only in a pub/sub context but not for other ways of using the logical replication > slot(s). > Right, but what additional information do you want here? I thought we were speaking about the in-build logical replication here so this is okay. > > 008 === > > + ereport(LOG, > + errmsg("dropped replication slot \"%s\" of dbid %d", > + NameStr(local_slot->data.name), > + local_slot->data.database)); > > We emit a message when an "invalidated" slot is dropped but not when we create > a slot. Shouldn't we emit a message when we create a synced slot on the standby? > > I think that could be confusing to see "a drop" message not followed by "a create" > one when it's expected (slot valid on the primary for example). > Isn't the below message for sync-ready slot sufficient? Otherwise, in most cases, we will LOG multiple similar messages. + ereport(LOG, + errmsg("newly created slot \"%s\" is sync-ready now", + remote_slot->name)); -- With Regards, Amit Kapila.
Hi, On Mon, Feb 12, 2024 at 04:19:33PM +0530, Amit Kapila wrote: > On Mon, Feb 12, 2024 at 3:33 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > A few random comments: > > > > > > 003 === > > > > + If, after executing the function, > > + <link linkend="guc-hot-standby-feedback"> > > + <varname>hot_standby_feedback</varname></link> is disabled on > > + the standby or the physical slot configured in > > + <link linkend="guc-primary-slot-name"> > > + <varname>primary_slot_name</varname></link> is > > + removed, > > > > I think another option that could lead to slot invalidation is if primary_slot_name > > is NULL or miss-configured. > > > > If the primary_slot_name is NULL then the function will error out. Yeah right, it had to be non NULL initially so we know there is a physical slot (if not dropped) that should prevent conflicts at the first place (should hsf be on). Please forget about comment 003 then. > > > > 005 === > > > > + To resume logical replication after failover from the synced logical > > + slots, the subscription's 'conninfo' must be altered > > > > Only in a pub/sub context but not for other ways of using the logical replication > > slot(s). > > > > Right, but what additional information do you want here? I thought we > were speaking about the in-build logical replication here so this is > okay. The "Logical Decoding Concepts" sub-chapter also mentions "Logical decoding clients" so I was not sure the part added in the patch was for in-build logical replication only. Or maybe just reword that way "In case of in-build logical replication, to resume after failover from the synced......"? > > > > > 008 === > > > > + ereport(LOG, > > + errmsg("dropped replication slot \"%s\" of dbid %d", > > + NameStr(local_slot->data.name), > > + local_slot->data.database)); > > > > We emit a message when an "invalidated" slot is dropped but not when we create > > a slot. Shouldn't we emit a message when we create a synced slot on the standby? > > > > I think that could be confusing to see "a drop" message not followed by "a create" > > one when it's expected (slot valid on the primary for example). > > > > Isn't the below message for sync-ready slot sufficient? Otherwise, in > most cases, we will LOG multiple similar messages. > > + ereport(LOG, > + errmsg("newly created slot \"%s\" is sync-ready now", > + remote_slot->name)); Yes it is sufficient if we reach it. For example during some test, I was able to go through this code path: Breakpoint 2, update_and_persist_local_synced_slot (remote_slot=0x56450e7c49c0, remote_dbid=5) at slotsync.c:340 340 ReplicationSlot *slot = MyReplicationSlot; (gdb) n 346 if (remote_slot->restart_lsn < slot->data.restart_lsn || (gdb) 347 TransactionIdPrecedes(remote_slot->catalog_xmin, (gdb) 346 if (remote_slot->restart_lsn < slot->data.restart_lsn || (gdb) 358 return; means exiting from update_and_persist_local_synced_slot() without reaching the "newly created slot" message (the slot on the primary was "inactive"). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Monday, February 12, 2024 6:03 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Sun, Feb 11, 2024 at 01:23:19PM +0000, Zhijie Hou (Fujitsu) wrote: > > On Saturday, February 10, 2024 9:10 PM Amit Kapila > <amit.kapila16@gmail.com> wrote: > > > > > > On Sat, Feb 10, 2024 at 5:31 PM Masahiko Sawada > > > <sawada.mshk@gmail.com> > > > wrote: > > > > > > > > On Fri, Feb 9, 2024 at 4:08 PM Zhijie Hou (Fujitsu) > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > Another alternative is to register the callback when calling > > > > > slotsync functions and unregister it after the function call. > > > > > And register the callback in > > > > > slotsyncworkmain() for the slotsync worker patch, although this > > > > > may adds a few more codes. > > > > > > > > Another idea is that SyncReplicationSlots() calls > > > > synchronize_slots() in PG_ENSURE_ERROR_CLEANUP() block instead of > > > > PG_TRY(), to make sure to clear the flag in case of ERROR or > > > > FATAL. And the slotsync worker uses the before_shmem_callback to clear > the flag. > > > > > > > > > > +1. This sounds like a better way to clear the flag. > > > > Agreed. Here is the V84 patch which addressed this. > > > > Apart from above, I removed the txn start/end codes from 0001 as they > > are used in the slotsync worker patch. And I also ran pgindent and > > pgperltidy for the patch. > > > > Thanks! > > A few random comments: Thanks for the comments. > > 001 === > > " > For > the synchronization to work, it is mandatory to have a physical replication slot > between the primary and the standby, " > > Maybe mention "primary_slot_name" here? Added. > > 002 === > > + <para> > + Synchronize the logical failover slots from the primary server to the > standby server. > > should we say "logical failover replication slots" instead? Changed. > > 003 === > > + If, after executing the function, > + <link linkend="guc-hot-standby-feedback"> > + <varname>hot_standby_feedback</varname></link> is disabled > on > + the standby or the physical slot configured in > + <link linkend="guc-primary-slot-name"> > + <varname>primary_slot_name</varname></link> is > + removed, > > I think another option that could lead to slot invalidation is if primary_slot_name > is NULL or miss-configured. Indeed hot_standby_feedback would be working > (for the catalog_xmin) but only as long as the standby is up and running. I didn't change this based on the discussion. > > 004 === > > + on the standby. For the synchronization to work, it is mandatory to > + have a physical replication slot between the primary and the > + standby, > > should we mention primary_slot_name here? Added. > > 005 === > > + To resume logical replication after failover from the synced logical > + slots, the subscription's 'conninfo' must be altered > > Only in a pub/sub context but not for other ways of using the logical replication > slot(s). I am not very sure about this, because the 3-rd part logicalrep can also have their own replication origin, so I didn't change for now, but will think over this. > > 006 === > > + neither be used for logical decoding nor dropped by the user > > what about "nor dropped manually"? Changed. > > 007 === > > +typedef struct SlotSyncCtxStruct > +{ > > Should we remove "Struct" from the struct name? The name was named based on some other comment to be consistent with LogicalReplCtxStruct, so I didn't change this. If other also prefer without struct, we can change it later. 
> 008 === > > + ereport(LOG, > + errmsg("dropped replication slot > \"%s\" of dbid %d", > + > NameStr(local_slot->data.name), > + > + local_slot->data.database)); > > We emit a message when an "invalidated" slot is dropped but not when we > create a slot. Shouldn't we emit a message when we create a synced slot on the > standby? > > I think that could be confusing to see "a drop" message not followed by "a > create" > one when it's expected (slot valid on the primary for example). I think we will report "sync-ready" for a newly synced slot. For newly created temporary slots, I am not sure whether we need to log them, because they will be dropped on promotion anyway. But if others also prefer to log, I am fine with that. > > 009 === > > Regarding 040_standby_failover_slots_sync.pl what about adding tests for? > > - synced slot invalidation (and ensure it's recreated once > pg_sync_replication_slots() is called and when the slot in primary is valid) Will try this in the next version. > - cannot enable failover for a temporary replication slot Added. > - replication slots can only be synchronized from a standby server Added. Best Regards, Hou zj
On Monday, February 12, 2024 5:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Feb 11, 2024 at 6:53 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > Agreed. Here is the V84 patch which addressed this. > > > > Few comments: > ============= > 1. Isn't the new function (pg_sync_replication_slots()) allowed to sync the slots > from physical standby to another cascading standby? > Won't it be better to simply disallow syncing slots on cascading standby to keep > it consistent with slotsync worker behavior? > > 2. > Previously, I commented to keep the declaration and definition of functions in > the same order but I see that it still doesn't match in the below case: > > @@ -44,6 +46,7 @@ extern void WalSndWakeup(bool physical, bool logical); > extern void WalSndInitStopping(void); extern void WalSndWaitStopping(void); > extern void HandleWalSndInitStopping(void); > +extern XLogRecPtr GetStandbyFlushRecPtr(TimeLineID *tli); > extern void WalSndRqstFileReload(void); > > I think we can keep the new declaration just before WalSndSignals(). > That would be more consistent. > > 3. > + <para> > + True if this is a logical slot that was synced from a primary server. > + </para> > + <para> > + On a hot standby, the slots with the synced column marked as true can > + neither be used for logical decoding nor dropped by the user. > + The value > > I don't think we need a separate para here. > > Apart from this, I have made several cosmetic changes in the attached. > Please include these in the next version unless you see any problems. Thanks for the comments, I have addressed them. Here is the new version patch which addressed above and most of Bertrand's comments. TODO: trying to add one test for the case the slot is valid on primary while the synced slots is invalidated on the standby. Best Regards, Houzj
Attachment
On Tue, Feb 13, 2024 at 6:45 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Monday, February 12, 2024 5:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Thanks for the comments, I have addressed them. > > Here is the new version patch which addressed above and > most of Bertrand's comments. Thanks for the patch. I am trying to run valgrind on patch001. I followed the steps given in [1]. It ended up generating 850 log files. Is there a way to figure out whether we have a memory-related problem without going through each log file manually? I also tried running the steps with '--leak-check=summary' (in the first run, it was '--leak-check=no' as suggested in the wiki) with and without the patch and tried comparing the results manually for a few of them. I found the output more or less the same. But this is a mammoth task if we have to do it manually for 850 files. So any pointers here? For reference:
Sample log file with '--leak-check=no'
==00:00:08:44.321 250746== HEAP SUMMARY:
==00:00:08:44.321 250746== in use at exit: 1,298,274 bytes in 290 blocks
==00:00:08:44.321 250746== total heap usage: 11,958 allocs, 7,005 frees, 8,175,630 bytes allocated
==00:00:08:44.321 250746==
==00:00:08:44.321 250746== For a detailed leak analysis, rerun with: --leak-check=full
==00:00:08:44.321 250746==
==00:00:08:44.321 250746== For lists of detected and suppressed errors, rerun with: -s
==00:00:08:44.321 250746== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Sample log file with '--leak-check=summary'
==00:00:00:27.300 265785== HEAP SUMMARY:
==00:00:00:27.300 265785== in use at exit: 1,929,907 bytes in 310 blocks
==00:00:00:27.300 265785== total heap usage: 71,677 allocs, 7,754 frees, 95,750,897 bytes allocated
==00:00:00:27.300 265785==
==00:00:00:27.394 265785== LEAK SUMMARY:
==00:00:00:27.394 265785== definitely lost: 20,507 bytes in 171 blocks
==00:00:00:27.394 265785== indirectly lost: 16,419 bytes in 61 blocks
==00:00:00:27.394 265785== possibly lost: 354,670 bytes in 905 blocks
==00:00:00:27.394 265785== still reachable: 592,586 bytes in 1,473 blocks
==00:00:00:27.394 265785== suppressed: 0 bytes in 0 blocks
==00:00:00:27.394 265785== Rerun with --leak-check=full to see details of leaked memory
==00:00:00:27.394 265785==
==00:00:00:27.394 265785== For lists of detected and suppressed errors, rerun with: -s
==00:00:00:27.394 265785== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
[1]: https://wiki.postgresql.org/wiki/Valgrind thanks Shveta
On Tuesday, February 13, 2024 9:16 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Here is the new version patch which addressed above and most of Bertrand's > comments. > > TODO: trying to add one test for the case the slot is valid on primary while the > synced slots is invalidated on the standby. Here is the V85_2 patch set, which adds the test and fixes one typo; there are no other code changes. Best Regards, Hou zj
Attachment
On Tue, Feb 13, 2024 at 9:38 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Here is the V85_2 patch set that added the test and fixed one typo, > there are no other code changes. > Few comments on the latest changes: ============================== 1. +# Confirm that the invalidated slot has been dropped. +$standby1->wait_for_log(qr/dropped replication slot "lsub1_slot" of dbid 5/, + $log_offset); Is it okay to hardcode dbid 5? I am a bit worried that it can lead to instability in the test. 2. +check_primary_info(WalReceiverConn *wrconn, int elevel) +{ .. + bool primary_info_valid; I don't think for 0001, we need an elevel as an argument, so let's remove it. Additionally, can we change the variable name primary_info_valid to primary_slot_valid? Also, can we change the function name to validate_remote_info() as the remote can be both primary or standby? 3. +SyncReplicationSlots(WalReceiverConn *wrconn) +{ + PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn)); + { + check_primary_info(wrconn, ERROR); + + synchronize_slots(wrconn); + } + PG_END_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, PointerGetDatum(wrconn)); + + walrcv_disconnect(wrconn); It is better to disconnect in the caller where we have made the connection. -- With Regards, Amit Kapila.
Attachment
On Fri, Feb 9, 2024 at 10:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > +reserve_wal_for_local_slot(XLogRecPtr restart_lsn) > { > ... > + /* > + * Find the oldest existing WAL segment file. > + * > + * Normally, we can determine it by using the last removed segment > + * number. However, if no WAL segment files have been removed by a > + * checkpoint since startup, we need to search for the oldest segment > + * file currently existing in XLOGDIR. > + */ > + oldest_segno = XLogGetLastRemovedSegno() + 1; > + > + if (oldest_segno == 1) > + { > + TimeLineID cur_timeline; > + > + GetWalRcvFlushRecPtr(NULL, &cur_timeline); > + oldest_segno = XLogGetOldestSegno(cur_timeline); > ... > ... > > This means that if the restart_lsn of the slot is from the prior > timeline then the standby needs to wait for longer times to sync the > slot. Ideally, it should be okay because I don't think even if > restart_lsn of the slot may be from some prior timeline than the > current flush timeline on standby, how often that case can happen? I tested this behaviour on v85 patch, it is working as expected i.e. if remot_slot's lsn belongs to a prior timeline then on executing pg_sync_replication_slots() function, it creates a temporary slot and waits for primary to catch up. And once primary catches up, the next execution of SQL function persistes the slot and syncs it. Setup: primary-->standby1-->standby2 Steps: 1) Insert data on primary. It gets replicated to both standbys. 2) Create logical slot on primary and execute pg_sync_replication_slots() on standby1. The slot gets synced and persisted on standby1. 3) Shutdown standby2. 4) Insert data on primary. It gets replicated to standby1. 5) Shutdown primary and promote standby1. 6) Insert some data on standby1/new primary directly. 7) Start standby2: It now needs to catch up old data of timeline1 (from step 4) + new data of timeline2 (from step 6) . It does that. On reaching the end of the old timeline, walreceiver gets restarted and starts streaming using the new timeline. 8) Execute pg_sync_replication_slots() on standby2 to sync the slot. Now remote_slot's lsn belongs to a prior timeline on standby2. In my test-run, remote_slot's lsn belonged to segno=4 on standby2, while the oldest segno of current_timline(2) was 6. Thus it created the slot locally with lsn belonging to the oldest segno 6 of cur_timeline(2) but did not persist it as remote_slot's lsn was behind. 9) Now on standby1/new-primary, advance the logical slot by calling pg_replication_slot_advance(). 10) Execute pg_sync_replication_slots() again on standby2, now the local temporary slot gets persisted as the restart_lsn of primary has caught up. thanks Shveta
Hi, On Tue, Feb 13, 2024 at 04:08:23AM +0000, Zhijie Hou (Fujitsu) wrote: > On Tuesday, February 13, 2024 9:16 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > > Here is the new version patch which addressed above and most of Bertrand's > > comments. > > > > TODO: trying to add one test for the case the slot is valid on primary while the > > synced slots is invalidated on the standby. > > Here is the V85_2 patch set that added the test and fixed one typo, > there are no other code changes. Thanks! Out of curiosity I ran a code coverage and the result for slotsync.c can be found in [1]. It appears that: - only one function is not covered (slotsync_failure_callback()). - 84% of the slotsync.c code is covered, the parts that are not are mainly related to "errors". Worth to try to extend the coverage? (I've in mind 731, 739, 766, 778, 786, 796, 808) [1]: https://htmlpreview.github.io/?https://raw.githubusercontent.com/bdrouvot/pg_code_coverage/main/src/backend/replication/logical/slotsync.c.gcov.html Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Feb 13, 2024 at 4:59 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Tue, Feb 13, 2024 at 04:08:23AM +0000, Zhijie Hou (Fujitsu) wrote: > > On Tuesday, February 13, 2024 9:16 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > > > > Here is the new version patch which addressed above and most of Bertrand's > > > comments. > > > > > > TODO: trying to add one test for the case the slot is valid on primary while the > > > synced slots is invalidated on the standby. > > > > Here is the V85_2 patch set that added the test and fixed one typo, > > there are no other code changes. > > Thanks! > > Out of curiosity I ran a code coverage and the result for slotsync.c can be > found in [1]. > > It appears that: > > - only one function is not covered (slotsync_failure_callback()). > - 84% of the slotsync.c code is covered, the parts that are not are mainly > related to "errors". > > Worth to try to extend the coverage? (I've in mind 731, 739, 766, 778, 786, 796, > 808) > All these additional line numbers mentioned by you are ERROR paths. I think if we want we can easily cover most of those but I am not sure if there is a benefit to cover each error path. -- With Regards, Amit Kapila.
On Tuesday, February 13, 2024 2:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 13, 2024 at 9:38 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > Here is the V85_2 patch set that added the test and fixed one typo, > > there are no other code changes. > > > > Few comments on the latest changes: Thanks for the comments. > ============================== > 1. > +# Confirm that the invalidated slot has been dropped. > +$standby1->wait_for_log(qr/dropped replication slot "lsub1_slot" of > +dbid 5/, $log_offset); > > Is it okay to hardcode dbid 5? I am a bit worried that it can lead to instability in > the test. > > 2. > +check_primary_info(WalReceiverConn *wrconn, int elevel) { > .. > + bool primary_info_valid; > > I don't think for 0001, we need an elevel as an argument, so let's remove it. > Additionally, can we change the variable name primary_info_valid to > primary_slot_valid? Also, can we change the function name to > validate_remote_info() as the remote can be both primary or standby? > > 3. > +SyncReplicationSlots(WalReceiverConn *wrconn) { > +PG_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, > +PointerGetDatum(wrconn)); { check_primary_info(wrconn, ERROR); > + > + synchronize_slots(wrconn); > + } > + PG_END_ENSURE_ERROR_CLEANUP(slotsync_failure_callback, > PointerGetDatum(wrconn)); > + > + walrcv_disconnect(wrconn); > > It is better to disconnect in the caller where we have made the connection. All above comments look good to me. Here is the V86 patch that addressed above. This version also includes some other minor changes: 1. Added few comments for the temporary slot creation and XLogGetOldestSegno. 2. Adjusted the doc for the SQL function. 3. Reordered two error messages in slot create function. 4. Fixed few typos. Thanks Shveta for off-list discussions. Best Regards, Hou zj
Attachment
On Tuesday, February 13, 2024 7:30 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Tue, Feb 13, 2024 at 04:08:23AM +0000, Zhijie Hou (Fujitsu) wrote: > > On Tuesday, February 13, 2024 9:16 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > > > Here is the new version patch which addressed above and most of > > > Bertrand's comments. > > > > > > TODO: trying to add one test for the case the slot is valid on > > > primary while the synced slots is invalidated on the standby. > > > > Here is the V85_2 patch set that added the test and fixed one typo, > > there are no other code changes. > > Thanks! > > Out of curiosity I ran a code coverage and the result for slotsync.c can be found > in [1]. > > It appears that: > > - only one function is not covered (slotsync_failure_callback()). Thanks for the test! I think slotsync_failure_callback can be covered more easily in the next slotsync worker patch (on worker exit); I will post that after rebasing. Best Regards, Hou zj
Hi, On Tue, Feb 13, 2024 at 05:20:35PM +0530, Amit Kapila wrote: > On Tue, Feb 13, 2024 at 4:59 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > - 84% of the slotsync.c code is covered, the parts that are not are mainly > > related to "errors". > > > > Worth to try to extend the coverage? (I've in mind 731, 739, 766, 778, 786, 796, > > 808) > > > > All these additional line numbers mentioned by you are ERROR paths. I > think if we want we can easily cover most of those but I am not sure > if there is a benefit to cover each error path. Yeah, I think 731, 739 and one among the remaining ones mentioned up-thread should be enough, thoughts? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Feb 13, 2024 at 9:25 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Tue, Feb 13, 2024 at 05:20:35PM +0530, Amit Kapila wrote: > > On Tue, Feb 13, 2024 at 4:59 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > - 84% of the slotsync.c code is covered, the parts that are not are mainly > > > related to "errors". > > > > > > Worth to try to extend the coverage? (I've in mind 731, 739, 766, 778, 786, 796, > > > 808) > > > > > > > All these additional line numbers mentioned by you are ERROR paths. I > > think if we want we can easily cover most of those but I am not sure > > if there is a benefit to cover each error path. > > Yeah, I think 731, 739 and one among the remaining ones mentioned up-thread should > be enough, thoughts? > I don't know how beneficial those selective ones would be but if I have to pick a few among those then I would pick the ones at 731 and 808. The reason is that 731 is related to cascading standby restriction which we may uplift in the future and at that time one needs to be careful about the behavior, for 808 as well, in the future, we may have a separate GUC for slot_db_name. These may not be good enough reasons as to why we add tests for these ERROR cases but not for others, however, if we have to randomly pick a few among all ERROR paths, these seem better to me than others. -- With Regards, Amit Kapila.
On Wednesday, February 14, 2024 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 13, 2024 at 9:25 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > On Tue, Feb 13, 2024 at 05:20:35PM +0530, Amit Kapila wrote: > > > On Tue, Feb 13, 2024 at 4:59 PM Bertrand Drouvot > > > <bertranddrouvot.pg@gmail.com> wrote: > > > > - 84% of the slotsync.c code is covered, the parts that are not > > > > are mainly related to "errors". > > > > > > > > Worth to try to extend the coverage? (I've in mind 731, 739, 766, > > > > 778, 786, 796, > > > > 808) > > > > > > > > > > All these additional line numbers mentioned by you are ERROR paths. > > > I think if we want we can easily cover most of those but I am not > > > sure if there is a benefit to cover each error path. > > > > Yeah, I think 731, 739 and one among the remaining ones mentioned > > up-thread should be enough, thoughts? > > > > I don't know how beneficial those selective ones would be but if I have to pick a > few among those then I would pick the ones at 731 and 808. The reason is that > 731 is related to cascading standby restriction which we may uplift in the future > and at that time one needs to be careful about the behavior, for 808 as well, in > the future, we may have a separate GUC for slot_db_name. These may not be > good enough reasons as to why we add tests for these ERROR cases but not for > others, however, if we have to randomly pick a few among all ERROR paths, > these seem better to me than others. Here is V87 patch that adds test for the suggested cases. Best Regards, Hou zj
Attachment
On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Here is V87 patch that adds test for the suggested cases. > I have pushed this patch and it leads to a BF failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-14%2004%3A43%3A37 The test failures are: # Failed test 'logical decoding is not allowed on synced slot' # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 272. # Failed test 'synced slot on standby cannot be altered' # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 281. # Failed test 'synced slot on standby cannot be dropped' # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 287. The reason is that in LOGs, we see a different ERROR message than what is expected: 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR: replication slot "lsub1_slot" is active for PID 1760871 Now, we see the slot still active because a test before these tests (# Test that if the synchronized slot is invalidated while the remote slot is still valid, ....) is not able to successfully persist the slot and the synced temporary slot remains active. The reason is clear by referring to below standby LOGS: LOG: connection authorized: user=bf database=postgres application_name=040_standby_failover_slots_sync.pl LOG: statement: SELECT pg_sync_replication_slots(); LOG: dropped replication slot "lsub1_slot" of dbid 5 STATEMENT: SELECT pg_sync_replication_slots(); ... SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots WHERE slot_name = 'lsub1_slot'; In the above LOGs, we should ideally see: "newly created slot "lsub1_slot" is sync-ready now" after the "LOG: dropped replication slot "lsub1_slot" of dbid 5" but lack of that means the test didn't accomplish what it was supposed to. Ideally, the same test should have failed but the pass criteria for the test failed to check whether the slot is persisted or not. The probable reason for failure is that remote_slot's restart_lsn lags behind the oldest WAL segment on standby. Now, in the test, we do ensure that the publisher and subscriber are caught up by following steps: # Enable the subscription to let it catch up to the latest wal position $subscriber1->safe_psql('postgres', "ALTER SUBSCRIPTION regress_mysub1 ENABLE"); $primary->wait_for_catchup('regress_mysub1'); However, this doesn't guarantee that restart_lsn is moved to a position new enough that standby has a WAL corresponding to it. One easy fix is to re-create the subscription with the same slot_name after we have ensured that the slot has been invalidated on standby so that a new restart_lsn is assigned to the slot but it is better to analyze some more why the slot's restart_lsn hasn't moved enough only sometimes. -- With Regards, Amit Kapila.
On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > Here is V87 patch that adds test for the suggested cases. > > > > I have pushed this patch and it leads to a BF failure: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-14%2004%3A43%3A37 > > The test failures are: > # Failed test 'logical decoding is not allowed on synced slot' > # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl > line 272. > # Failed test 'synced slot on standby cannot be altered' > # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl > line 281. > # Failed test 'synced slot on standby cannot be dropped' > # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl > line 287. > > The reason is that in LOGs, we see a different ERROR message than what > is expected: > 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR: > replication slot "lsub1_slot" is active for PID 1760871 > > Now, we see the slot still active because a test before these tests (# > Test that if the synchronized slot is invalidated while the remote > slot is still valid, ....) is not able to successfully persist the > slot and the synced temporary slot remains active. > > The reason is clear by referring to below standby LOGS: > > LOG: connection authorized: user=bf database=postgres > application_name=040_standby_failover_slots_sync.pl > LOG: statement: SELECT pg_sync_replication_slots(); > LOG: dropped replication slot "lsub1_slot" of dbid 5 > STATEMENT: SELECT pg_sync_replication_slots(); > ... > SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots > WHERE slot_name = 'lsub1_slot'; > > In the above LOGs, we should ideally see: "newly created slot > "lsub1_slot" is sync-ready now" after the "LOG: dropped replication > slot "lsub1_slot" of dbid 5" but lack of that means the test didn't > accomplish what it was supposed to. Ideally, the same test should have > failed but the pass criteria for the test failed to check whether the > slot is persisted or not. > > The probable reason for failure is that remote_slot's restart_lsn lags > behind the oldest WAL segment on standby. Now, in the test, we do > ensure that the publisher and subscriber are caught up by following > steps: > # Enable the subscription to let it catch up to the latest wal position > $subscriber1->safe_psql('postgres', > "ALTER SUBSCRIPTION regress_mysub1 ENABLE"); > > $primary->wait_for_catchup('regress_mysub1'); > > However, this doesn't guarantee that restart_lsn is moved to a > position new enough that standby has a WAL corresponding to it. > To ensure that restart_lsn has been moved to a recent position, we need to log XLOG_RUNNING_XACTS and make sure the same is processed as well by walsender. The attached patch does the required change. Hou-San can reproduce this problem by adding additional checkpoints in the test and after applying the attached it fixes the problem. Now, this patch is mostly based on the theory we formed based on LOGs on BF and a reproducer by Hou-San, so still, there is some chance that this doesn't fix the BF failures in which case I'll again look into those. -- With Regards, Amit Kapila.
Attachment
On Wednesday, February 14, 2024 6:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > Here is V87 patch that adds test for the suggested cases. > > > > > > > I have pushed this patch and it leads to a BF failure: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&d > > t=2024-02-14%2004%3A43%3A37 > > > > The test failures are: > > # Failed test 'logical decoding is not allowed on synced slot' > > # at > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f > ailover_slots_sync.pl > > line 272. > > # Failed test 'synced slot on standby cannot be altered' > > # at > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f > ailover_slots_sync.pl > > line 281. > > # Failed test 'synced slot on standby cannot be dropped' > > # at > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f > ailover_slots_sync.pl > > line 287. > > > > The reason is that in LOGs, we see a different ERROR message than what > > is expected: > > 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR: > > replication slot "lsub1_slot" is active for PID 1760871 > > > > Now, we see the slot still active because a test before these tests (# > > Test that if the synchronized slot is invalidated while the remote > > slot is still valid, ....) is not able to successfully persist the > > slot and the synced temporary slot remains active. > > > > The reason is clear by referring to below standby LOGS: > > > > LOG: connection authorized: user=bf database=postgres > > application_name=040_standby_failover_slots_sync.pl > > LOG: statement: SELECT pg_sync_replication_slots(); > > LOG: dropped replication slot "lsub1_slot" of dbid 5 > > STATEMENT: SELECT pg_sync_replication_slots(); ... > > SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots > > WHERE slot_name = 'lsub1_slot'; > > > > In the above LOGs, we should ideally see: "newly created slot > > "lsub1_slot" is sync-ready now" after the "LOG: dropped replication > > slot "lsub1_slot" of dbid 5" but lack of that means the test didn't > > accomplish what it was supposed to. Ideally, the same test should have > > failed but the pass criteria for the test failed to check whether the > > slot is persisted or not. > > > > The probable reason for failure is that remote_slot's restart_lsn lags > > behind the oldest WAL segment on standby. Now, in the test, we do > > ensure that the publisher and subscriber are caught up by following > > steps: > > # Enable the subscription to let it catch up to the latest wal > > position $subscriber1->safe_psql('postgres', > > "ALTER SUBSCRIPTION regress_mysub1 ENABLE"); > > > > $primary->wait_for_catchup('regress_mysub1'); > > > > However, this doesn't guarantee that restart_lsn is moved to a > > position new enough that standby has a WAL corresponding to it. > > > > To ensure that restart_lsn has been moved to a recent position, we need to log > XLOG_RUNNING_XACTS and make sure the same is processed as well by > walsender. The attached patch does the required change. > > Hou-San can reproduce this problem by adding additional checkpoints in the > test and after applying the attached it fixes the problem. 
Now, this patch is > mostly based on the theory we formed based on LOGs on BF and a reproducer > by Hou-San, so still, there is some chance that this doesn't fix the BF failures in > which case I'll again look into those. I have verified that the patch can fix the issue on my machine (after adding a few more checkpoints before the slot invalidation test). I also added one more check in the test to confirm that the synced slot is not a temp slot. Here is the v2 patch. Best Regards, Hou zj
Attachment
Hi, On Wed, Feb 14, 2024 at 10:40:11AM +0000, Zhijie Hou (Fujitsu) wrote: > On Wednesday, February 14, 2024 6:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > Here is V87 patch that adds test for the suggested cases. > > > > > > > > > > I have pushed this patch and it leads to a BF failure: > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&d > > > t=2024-02-14%2004%3A43%3A37 > > > > > > The test failures are: > > > # Failed test 'logical decoding is not allowed on synced slot' > > > # at > > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f > > ailover_slots_sync.pl > > > line 272. > > > # Failed test 'synced slot on standby cannot be altered' > > > # at > > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f > > ailover_slots_sync.pl > > > line 281. > > > # Failed test 'synced slot on standby cannot be dropped' > > > # at > > /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_f > > ailover_slots_sync.pl > > > line 287. > > > > > > The reason is that in LOGs, we see a different ERROR message than what > > > is expected: > > > 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR: > > > replication slot "lsub1_slot" is active for PID 1760871 > > > > > > Now, we see the slot still active because a test before these tests (# > > > Test that if the synchronized slot is invalidated while the remote > > > slot is still valid, ....) is not able to successfully persist the > > > slot and the synced temporary slot remains active. > > > > > > The reason is clear by referring to below standby LOGS: > > > > > > LOG: connection authorized: user=bf database=postgres > > > application_name=040_standby_failover_slots_sync.pl > > > LOG: statement: SELECT pg_sync_replication_slots(); > > > LOG: dropped replication slot "lsub1_slot" of dbid 5 > > > STATEMENT: SELECT pg_sync_replication_slots(); ... > > > SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots > > > WHERE slot_name = 'lsub1_slot'; > > > > > > In the above LOGs, we should ideally see: "newly created slot > > > "lsub1_slot" is sync-ready now" after the "LOG: dropped replication > > > slot "lsub1_slot" of dbid 5" but lack of that means the test didn't > > > accomplish what it was supposed to. Ideally, the same test should have > > > failed but the pass criteria for the test failed to check whether the > > > slot is persisted or not. > > > > > > The probable reason for failure is that remote_slot's restart_lsn lags > > > behind the oldest WAL segment on standby. Now, in the test, we do > > > ensure that the publisher and subscriber are caught up by following > > > steps: > > > # Enable the subscription to let it catch up to the latest wal > > > position $subscriber1->safe_psql('postgres', > > > "ALTER SUBSCRIPTION regress_mysub1 ENABLE"); > > > > > > $primary->wait_for_catchup('regress_mysub1'); > > > > > > However, this doesn't guarantee that restart_lsn is moved to a > > > position new enough that standby has a WAL corresponding to it. > > > > > > > To ensure that restart_lsn has been moved to a recent position, we need to log > > XLOG_RUNNING_XACTS and make sure the same is processed as well by > > walsender. The attached patch does the required change. 
> > > > Hou-San can reproduce this problem by adding additional checkpoints in the > > test and after applying the attached it fixes the problem. Now, this patch is > > mostly based on the theory we formed based on LOGs on BF and a reproducer > > by Hou-San, so still, there is some chance that this doesn't fix the BF failures in > > which case I'll again look into those. > > I have verified that the patch can fix the issue on my machine(after adding few > more checkpoints before slot invalidation test.) I also added one more check in > the test to confirm the synced slot is not temp slot. Here is the v2 patch. Thanks! +# To ensure that restart_lsn has moved to a recent WAL position, we need +# to log XLOG_RUNNING_XACTS and make sure the same is processed as well +$primary->psql('postgres', "CHECKPOINT"); Instead of "CHECKPOINT" wouldn't a less heavy "SELECT pg_log_standby_snapshot();" be enough? Not a big deal but maybe we could do the change while modifying 040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync worker". Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Wed, Feb 14, 2024 at 10:40:11AM +0000, Zhijie Hou (Fujitsu) wrote: > > On Wednesday, February 14, 2024 6:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > To ensure that restart_lsn has been moved to a recent position, we need to log > > > XLOG_RUNNING_XACTS and make sure the same is processed as well by > > > walsender. The attached patch does the required change. > > > > > > Hou-San can reproduce this problem by adding additional checkpoints in the > > > test and after applying the attached it fixes the problem. Now, this patch is > > > mostly based on the theory we formed based on LOGs on BF and a reproducer > > > by Hou-San, so still, there is some chance that this doesn't fix the BF failures in > > > which case I'll again look into those. > > > > I have verified that the patch can fix the issue on my machine(after adding few > > more checkpoints before slot invalidation test.) I also added one more check in > > the test to confirm the synced slot is not temp slot. Here is the v2 patch. > > Thanks! > > +# To ensure that restart_lsn has moved to a recent WAL position, we need > +# to log XLOG_RUNNING_XACTS and make sure the same is processed as well > +$primary->psql('postgres', "CHECKPOINT"); > > Instead of "CHECKPOINT" wouldn't a less heavy "SELECT pg_log_standby_snapshot();" > be enough? > Yeah, that would be enough. However, the test still fails randomly due to the same reason. See [1]. So, as mentioned yesterday, now, I feel it is better to recreate the subscription/slot so that it can get the latest restart_lsn rather than relying on pg_log_standby_snapshot() to move it. > Not a big deal but maybe we could do the change while modifying > 040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync worker". > Right, we can do that or probably this test would have made more sense with a worker patch where we could wait for the slot to be synced. Anyway, let's try to recreate the slot/subscription idea. BTW, do you think that adding a LOG when we are not able to sync will help in debugging such problems? I think eventually we can change it to DEBUG1 but for now, it can help with stabilizing BF and or some other reported issues. [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-02-15%2000%3A14%3A38 -- With Regards, Amit Kapila.
On Thursday, February 15, 2024 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > On Wed, Feb 14, 2024 at 10:40:11AM +0000, Zhijie Hou (Fujitsu) wrote: > > > On Wednesday, February 14, 2024 6:05 PM Amit Kapila > <amit.kapila16@gmail.com> wrote: > > > > > > > > To ensure that restart_lsn has been moved to a recent position, we > > > > need to log XLOG_RUNNING_XACTS and make sure the same is processed > > > > as well by walsender. The attached patch does the required change. > > > > > > > > Hou-San can reproduce this problem by adding additional > > > > checkpoints in the test and after applying the attached it fixes > > > > the problem. Now, this patch is mostly based on the theory we > > > > formed based on LOGs on BF and a reproducer by Hou-San, so still, > > > > there is some chance that this doesn't fix the BF failures in which case I'll > again look into those. > > > > > > I have verified that the patch can fix the issue on my machine(after > > > adding few more checkpoints before slot invalidation test.) I also > > > added one more check in the test to confirm the synced slot is not temp slot. > Here is the v2 patch. > > > > Thanks! > > > > +# To ensure that restart_lsn has moved to a recent WAL position, we > > +need # to log XLOG_RUNNING_XACTS and make sure the same is processed > > +as well $primary->psql('postgres', "CHECKPOINT"); > > > > Instead of "CHECKPOINT" wouldn't a less heavy "SELECT > pg_log_standby_snapshot();" > > be enough? > > > > Yeah, that would be enough. However, the test still fails randomly due to the > same reason. See [1]. So, as mentioned yesterday, now, I feel it is better to > recreate the subscription/slot so that it can get the latest restart_lsn rather than > relying on pg_log_standby_snapshot() to move it. > > > Not a big deal but maybe we could do the change while modifying > > 040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync > worker". > > > > Right, we can do that or probably this test would have made more sense with a > worker patch where we could wait for the slot to be synced. > Anyway, let's try to recreate the slot/subscription idea. BTW, do you think that > adding a LOG when we are not able to sync will help in debugging such > problems? I think eventually we can change it to DEBUG1 but for now, it can help > with stabilizing BF and or some other reported issues. Here is the patch that attempts the re-create sub idea. I also think that a LOG/DEBUG would be useful for such analysis, so the 0002 is to add such a log. Best Regards, Hou zj
Attachment
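Expressed as SQL on the subscriber, the re-create idea boils down to something like the sketch below (the subscription and publication names and the connection string are placeholders; the actual change is made in the TAP test). Re-creating the subscription makes the publisher build the failover slot afresh, so its restart_lsn starts at a recent WAL position instead of having to be advanced afterwards.

DROP SUBSCRIPTION IF EXISTS regress_mysub1;
CREATE SUBSCRIPTION regress_mysub1
    CONNECTION 'host=primary dbname=postgres'
    PUBLICATION regress_mypub
    WITH (failover = true);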
Hi, Since the slotsync function has been committed, I rebased the remaining patches. Here is the V88 patch set. Best Regards, Hou zj
Attachment
On Thu, Feb 15, 2024 at 9:05 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thursday, February 15, 2024 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot > > > > Right, we can do that or probably this test would have made more sense with a > > worker patch where we could wait for the slot to be synced. > > Anyway, let's try to recreate the slot/subscription idea. BTW, do you think that > > adding a LOG when we are not able to sync will help in debugging such > > problems? I think eventually we can change it to DEBUG1 but for now, it can help > > with stabilizing BF and or some other reported issues. > > Here is the patch that attempts the re-create sub idea. > Pushed this. > I also think that a LOG/DEBUG > would be useful for such analysis, so the 0002 is to add such a log. > I feel such a LOG would be useful. + ereport(LOG, + errmsg("waiting for remote slot \"%s\" LSN (%X/%X) and catalog xmin" + " (%u) to pass local slot LSN (%X/%X) and catalog xmin (%u)", I think waiting is a bit misleading here, how about something like: "could not sync slot information as remote slot precedes local slot: remote slot \"%s\": LSN (%X/%X), catalog xmin (%u) local slot: LSN (%X/%X), catalog xmin (%u)" -- With Regards, Amit Kapila.
Hi, On Thu, Feb 15, 2024 at 02:49:54PM +0530, Amit Kapila wrote: > On Thu, Feb 15, 2024 at 9:05 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Thursday, February 15, 2024 10:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot > > > > > > Right, we can do that or probably this test would have made more sense with a > > > worker patch where we could wait for the slot to be synced. > > > Anyway, let's try to recreate the slot/subscription idea. BTW, do you think that > > > adding a LOG when we are not able to sync will help in debugging such > > > problems? I think eventually we can change it to DEBUG1 but for now, it can help > > > with stabilizing BF and or some other reported issues. > > > > Here is the patch that attempts the re-create sub idea. > > > > Pushed this. > > > > I also think that a LOG/DEBUG > > would be useful for such analysis, so the 0002 is to add such a log. > > > > I feel such a LOG would be useful. Same here. > + ereport(LOG, > + errmsg("waiting for remote slot \"%s\" LSN (%X/%X) and catalog xmin" > + " (%u) to pass local slot LSN (%X/%X) and catalog xmin (%u)", > > I think waiting is a bit misleading here, how about something like: > "could not sync slot information as remote slot precedes local slot: > remote slot \"%s\": LSN (%X/%X), catalog xmin (%u) local slot: LSN > (%X/%X), catalog xmin (%u)" > This wording works for me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thursday, February 15, 2024 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Feb 15, 2024 at 9:05 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > On Thursday, February 15, 2024 10:49 AM Amit Kapila > <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot > > > > > > Right, we can do that or probably this test would have made more > > > sense with a worker patch where we could wait for the slot to be synced. > > > Anyway, let's try to recreate the slot/subscription idea. BTW, do > > > you think that adding a LOG when we are not able to sync will help > > > in debugging such problems? I think eventually we can change it to > > > DEBUG1 but for now, it can help with stabilizing BF and or some other > reported issues. > > > > Here is the patch that attempts the re-create sub idea. > > > > Pushed this. > > > > I also think that a LOG/DEBUG > > would be useful for such analysis, so the 0002 is to add such a log. > > > > I feel such a LOG would be useful. > > + ereport(LOG, > + errmsg("waiting for remote slot \"%s\" LSN (%X/%X) and catalog xmin" > + " (%u) to pass local slot LSN (%X/%X) and catalog xmin (%u)", > > I think waiting is a bit misleading here, how about something like: > "could not sync slot information as remote slot precedes local slot: > remote slot \"%s\": LSN (%X/%X), catalog xmin (%u) local slot: LSN (%X/%X), > catalog xmin (%u)" Changed. Attach the v2 patch here. Apart from the new log message. I think we can add one more debug message in reserve_wal_for_local_slot, this could be useful to analyze the failure. And we can also enable the DEBUG log in the 040 tap-test, I see we have similar setting in 010_logical_decoding_timline and logging debug1 message doesn't increase noticable time on my machine. These are done in 0002. Best Regards, Hou zj
Attachment
On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thursday, February 15, 2024 5:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Feb 15, 2024 at 9:05 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > > wrote: > > > > > > On Thursday, February 15, 2024 10:49 AM Amit Kapila > > <amit.kapila16@gmail.com> wrote: > > > > > > > > On Wed, Feb 14, 2024 at 7:26 PM Bertrand Drouvot > > > > > > > > Right, we can do that or probably this test would have made more > > > > sense with a worker patch where we could wait for the slot to be synced. > > > > Anyway, let's try to recreate the slot/subscription idea. BTW, do > > > > you think that adding a LOG when we are not able to sync will help > > > > in debugging such problems? I think eventually we can change it to > > > > DEBUG1 but for now, it can help with stabilizing BF and or some other > > reported issues. > > > > > > Here is the patch that attempts the re-create sub idea. > > > > > > > Pushed this. > > > > > > > I also think that a LOG/DEBUG > > > would be useful for such analysis, so the 0002 is to add such a log. > > > > > > > I feel such a LOG would be useful. > > > > + ereport(LOG, > > + errmsg("waiting for remote slot \"%s\" LSN (%X/%X) and catalog xmin" > > + " (%u) to pass local slot LSN (%X/%X) and catalog xmin (%u)", > > > > I think waiting is a bit misleading here, how about something like: > > "could not sync slot information as remote slot precedes local slot: > > remote slot \"%s\": LSN (%X/%X), catalog xmin (%u) local slot: LSN (%X/%X), > > catalog xmin (%u)" > > Changed. > > Attach the v2 patch here. > > Apart from the new log message. I think we can add one more debug message in > reserve_wal_for_local_slot, this could be useful to analyze the failure. Yeah, that can also be helpful, but the added message looks naive to me. + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, segno); Instead of the above, how about something like: "segno: %ld of purposed restart_lsn for the synced slot, oldest_segno: %ld available"? > And we > can also enable the DEBUG log in the 040 tap-test, I see we have similar > setting in 010_logical_decoding_timline and logging debug1 message doesn't > increase noticable time on my machine. These are done in 0002. > I haven't tested it but I think this can help in debugging BF failures, if any. I am not sure if to keep it always like that but till the time these tests are stabilized, this sounds like a good idea. So, how, about just making test changes as a separate patch so that later if required we can revert/remove it easily? Bertrand, do you have any thoughts on this? -- With Regards, Amit Kapila.
Hi, On Thu, Feb 15, 2024 at 05:00:18PM +0530, Amit Kapila wrote: > On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > Attach the v2 patch here. > > > > Apart from the new log message. I think we can add one more debug message in > > reserve_wal_for_local_slot, this could be useful to analyze the failure. > > Yeah, that can also be helpful, but the added message looks naive to me. > + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, segno); > > Instead of the above, how about something like: "segno: %ld of > purposed restart_lsn for the synced slot, oldest_segno: %ld > available"? Looks good to me. I'm not sure if it would make more sense to elog only if segno < oldest_segno means just before the XLogSegNoOffsetToRecPtr() call? But I'm fine with the proposed location too. > > > And we > > can also enable the DEBUG log in the 040 tap-test, I see we have similar > > setting in 010_logical_decoding_timline and logging debug1 message doesn't > > increase noticable time on my machine. These are done in 0002. > > > > I haven't tested it but I think this can help in debugging BF > failures, if any. I am not sure if to keep it always like that but > till the time these tests are stabilized, this sounds like a good > idea. So, how, about just making test changes as a separate patch so > that later if required we can revert/remove it easily? Bertrand, do > you have any thoughts on this? +1 on having DEBUG log in the 040 tap-test until it's stabilized (I think we took the same approach for 035_standby_logical_decoding.pl IIRC) and then revert it back. Also I was thinking: what about adding an output to pg_sync_replication_slots()? The output could be the number of sync slots that have been created and are not considered as sync-ready during the execution. I think that could be a good addition to v2-0001-Add-a-log-if-remote-slot-didn-t-catch-up-to-local.patch proposed here (should trigger special attention in case of non zero value). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Feb 15, 2024 at 5:46 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Thu, Feb 15, 2024 at 05:00:18PM +0530, Amit Kapila wrote: > > On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > Attach the v2 patch here. > > > > > > Apart from the new log message. I think we can add one more debug message in > > > reserve_wal_for_local_slot, this could be useful to analyze the failure. > > > > Yeah, that can also be helpful, but the added message looks naive to me. > > + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, segno); > > > > Instead of the above, how about something like: "segno: %ld of > > purposed restart_lsn for the synced slot, oldest_segno: %ld > > available"? > > Looks good to me. I'm not sure if it would make more sense to elog only if > segno < oldest_segno means just before the XLogSegNoOffsetToRecPtr() call? > > But I'm fine with the proposed location too. > I am also fine either way but the current location gives required information in more number of cases and could be helpful in debugging this new facility. > > > > > And we > > > can also enable the DEBUG log in the 040 tap-test, I see we have similar > > > setting in 010_logical_decoding_timline and logging debug1 message doesn't > > > increase noticable time on my machine. These are done in 0002. > > > > > > > I haven't tested it but I think this can help in debugging BF > > failures, if any. I am not sure if to keep it always like that but > > till the time these tests are stabilized, this sounds like a good > > idea. So, how, about just making test changes as a separate patch so > > that later if required we can revert/remove it easily? Bertrand, do > > you have any thoughts on this? > > +1 on having DEBUG log in the 040 tap-test until it's stabilized (I think we > took the same approach for 035_standby_logical_decoding.pl IIRC) and then revert > it back. > Good to know! > Also I was thinking: what about adding an output to pg_sync_replication_slots()? > The output could be the number of sync slots that have been created and are > not considered as sync-ready during the execution. > Yeah, we can consider outputting some information via this function like how many slots are synced and persisted but not sure what would be appropriate here. Because one can anyway find that or more information by querying pg_replication_slots. I think we can keep discussing what makes more sense as a return value but let's commit the debug/log patches as they will be helpful to analyze BF failures or any other issues reported. -- With Regards, Amit Kapila.
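To illustrate that last point, the catalog already exposes enough to tell sync-ready slots apart (a sketch, assuming the columns discussed in this thread on a current development build): synced slots that are still temporary are precisely the ones that have not become sync-ready yet.

SELECT slot_name, synced, temporary, conflict_reason
FROM pg_catalog.pg_replication_slots
WHERE synced;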
On Thu, Feb 15, 2024 at 12:07 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Since the slotsync function is committed, I rebased remaining patches. > And here is the V88 patch set. > Please find the improvements in some of the comments in v88_0001* attached. Kindly include these in next version, if you are okay with it. -- With Regards, Amit Kapila.
Attachment
Hi, On Thu, Feb 15, 2024 at 06:13:38PM +0530, Amit Kapila wrote: > On Thu, Feb 15, 2024 at 12:07 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > Since the slotsync function is committed, I rebased remaining patches. > > And here is the V88 patch set. > > Thanks! > > Please find the improvements in some of the comments in v88_0001* > attached. Kindly include these in next version, if you are okay with > it. Looking at v88_0001, random comments: 1 === Commit message "Be enabling slot synchronization" Typo? s:Be/By 2 === + It enables a physical standby to synchronize logical failover slots + from the primary server so that logical subscribers are not blocked + after failover. Not sure "not blocked" is the right wording. "can be resumed from the new primary" maybe? (was discussed in [1]) 3 === +#define SlotSyncWorkerAllowed() \ + (sync_replication_slots && pmState == PM_HOT_STANDBY && \ + SlotSyncWorkerCanRestart()) Maybe add a comment above the macro explaining the logic? 4 === +#include "replication/walreceiver.h" #include "replication/slotsync.h" should be reverse order? 5 === + if (SlotSyncWorker->syncing) { - SpinLockRelease(&SlotSyncCtx->mutex); + SpinLockRelease(&SlotSyncWorker->mutex); ereport(ERROR, errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("cannot synchronize replication slots concurrently")); } worth to add a test in 040_standby_failover_slots_sync.pl for it? 6 === +static void +slotsync_reread_config(bool restart) +{ worth to add test(s) in 040_standby_failover_slots_sync.pl for it? [1]: https://www.postgresql.org/message-id/CAA4eK1JcBG6TJ3o5iUd4z0BuTbciLV3dK4aKgb7OgrNGoLcfSQ%40mail.gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Thu, Feb 15, 2024 at 05:58:47PM +0530, Amit Kapila wrote: > On Thu, Feb 15, 2024 at 5:46 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > Also I was thinking: what about adding an output to pg_sync_replication_slots()? > > The output could be the number of sync slots that have been created and are > > not considered as sync-ready during the execution. > > > > Yeah, we can consider outputting some information via this function > like how many slots are synced and persisted but not sure what would > be appropriate here. Because one can anyway find that or more > information by querying pg_replication_slots. Right, so maybe just return a bool that would indicate that at least one new created slot(s) is/are not sync-ready? (If so, then the details could be found in pg_replication_slots). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thursday, February 15, 2024 8:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Feb 15, 2024 at 5:46 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > On Thu, Feb 15, 2024 at 05:00:18PM +0530, Amit Kapila wrote: > > > On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > Attach the v2 patch here. > > > > > > > > Apart from the new log message. I think we can add one more debug > > > > message in reserve_wal_for_local_slot, this could be useful to analyze the > failure. > > > > > > Yeah, that can also be helpful, but the added message looks naive to me. > > > + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, segno); > > > > > > Instead of the above, how about something like: "segno: %ld of > > > purposed restart_lsn for the synced slot, oldest_segno: %ld > > > available"? > > > > Looks good to me. I'm not sure if it would make more sense to elog > > only if segno < oldest_segno means just before the > XLogSegNoOffsetToRecPtr() call? > > > > But I'm fine with the proposed location too. > > > > I am also fine either way but the current location gives required information in > more number of cases and could be helpful in debugging this new facility. > > > > > > > > And we > > > > can also enable the DEBUG log in the 040 tap-test, I see we have > > > > similar setting in 010_logical_decoding_timline and logging debug1 > > > > message doesn't increase noticable time on my machine. These are done > in 0002. > > > > > > > > > > I haven't tested it but I think this can help in debugging BF > > > failures, if any. I am not sure if to keep it always like that but > > > till the time these tests are stabilized, this sounds like a good > > > idea. So, how, about just making test changes as a separate patch so > > > that later if required we can revert/remove it easily? Bertrand, do > > > you have any thoughts on this? > > > > +1 on having DEBUG log in the 040 tap-test until it's stabilized (I > > +think we > > took the same approach for 035_standby_logical_decoding.pl IIRC) and > > then revert it back. > > > > Good to know! > > > Also I was thinking: what about adding an output to > pg_sync_replication_slots()? > > The output could be the number of sync slots that have been created > > and are not considered as sync-ready during the execution. > > > > Yeah, we can consider outputting some information via this function like how > many slots are synced and persisted but not sure what would be appropriate > here. Because one can anyway find that or more information by querying > pg_replication_slots. I think we can keep discussing what makes more sense as a > return value but let's commit the debug/log patches as they will be helpful to > analyze BF failures or any other issues reported. Agreed. Here is new patch set as suggested. I used debug2 in the 040 as it could provide more information about communication between primary and standby. This also doesn't increase noticeable testing time on my machine for debug build. Best Regards, Hou zj
Attachment
On Friday, February 16, 2024 8:33 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > On Thursday, February 15, 2024 8:29 PM Amit Kapila > <amit.kapila16@gmail.com> wrote: > > > > > On Thu, Feb 15, 2024 at 5:46 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > On Thu, Feb 15, 2024 at 05:00:18PM +0530, Amit Kapila wrote: > > > > On Thu, Feb 15, 2024 at 4:29 PM Zhijie Hou (Fujitsu) > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > Attach the v2 patch here. > > > > > > > > > > Apart from the new log message. I think we can add one more > > > > > debug message in reserve_wal_for_local_slot, this could be > > > > > useful to analyze the > > failure. > > > > > > > > Yeah, that can also be helpful, but the added message looks naive to me. > > > > + elog(DEBUG1, "segno: %ld oldest_segno: %ld", oldest_segno, > > > > + segno); > > > > > > > > Instead of the above, how about something like: "segno: %ld of > > > > purposed restart_lsn for the synced slot, oldest_segno: %ld > > > > available"? > > > > > > Looks good to me. I'm not sure if it would make more sense to elog > > > only if segno < oldest_segno means just before the > > XLogSegNoOffsetToRecPtr() call? > > > > > > But I'm fine with the proposed location too. > > > > > > > I am also fine either way but the current location gives required > > information in more number of cases and could be helpful in debugging this > new facility. > > > > > > > > > > > And we > > > > > can also enable the DEBUG log in the 040 tap-test, I see we have > > > > > similar setting in 010_logical_decoding_timline and logging > > > > > debug1 message doesn't increase noticable time on my machine. > > > > > These are done > > in 0002. > > > > > > > > > > > > > I haven't tested it but I think this can help in debugging BF > > > > failures, if any. I am not sure if to keep it always like that but > > > > till the time these tests are stabilized, this sounds like a good > > > > idea. So, how, about just making test changes as a separate patch > > > > so that later if required we can revert/remove it easily? > > > > Bertrand, do you have any thoughts on this? > > > > > > +1 on having DEBUG log in the 040 tap-test until it's stabilized (I > > > +think we > > > took the same approach for 035_standby_logical_decoding.pl IIRC) and > > > then revert it back. > > > > > > > Good to know! > > > > > Also I was thinking: what about adding an output to > > pg_sync_replication_slots()? > > > The output could be the number of sync slots that have been created > > > and are not considered as sync-ready during the execution. > > > > > > > Yeah, we can consider outputting some information via this function > > like how many slots are synced and persisted but not sure what would > > be appropriate here. Because one can anyway find that or more > > information by querying pg_replication_slots. I think we can keep > > discussing what makes more sense as a return value but let's commit > > the debug/log patches as they will be helpful to analyze BF failures or any > other issues reported. > > Agreed. Here is new patch set as suggested. I used debug2 in the 040 as it could > provide more information about communication between primary and standby. > This also doesn't increase noticeable testing time on my machine for debug > build. Sorry, there was a miss in the DEBUG message, I should have used UINT64_FORMAT for XLogSegNo(uint64) instead of %ld. Here is a small patch to fix this. Best Regards, Hou zj
Attachment
On Fri, Feb 16, 2024 at 11:12 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, February 16, 2024 8:33 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > > > > Yeah, we can consider outputting some information via this function > > > like how many slots are synced and persisted but not sure what would > > > be appropriate here. Because one can anyway find that or more > > > information by querying pg_replication_slots. I think we can keep > > > discussing what makes more sense as a return value but let's commit > > > the debug/log patches as they will be helpful to analyze BF failures or any > > other issues reported. > > > > Agreed. Here is new patch set as suggested. I used debug2 in the 040 as it could > > provide more information about communication between primary and standby. > > This also doesn't increase noticeable testing time on my machine for debug > > build. > > Sorry, there was a miss in the DEBUG message, I should have used > UINT64_FORMAT for XLogSegNo(uint64) instead of %ld. Here is a small patch > to fix this. > Thanks for noticing this. I have pushed all your debug patches. Let's hope that if there is a BF failure next time, we can gather enough information to determine its cause. -- With Regards, Amit Kapila.
Hi, On Fri, Feb 16, 2024 at 12:32:45AM +0000, Zhijie Hou (Fujitsu) wrote: > Agreed. Here is new patch set as suggested. I used debug2 in the 040 as it > could provide more information about communication between primary and standby. > This also doesn't increase noticeable testing time on my machine for debug > build. Same here, and there is no big difference as far as the amount of log generated is concerned: Without the patch: $ du -sh ./src/test/recovery/tmp_check/log/*040* 4.0K ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_cascading_standby.log 24K ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_publisher.log 16K ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_standby1.log 4.0K ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_subscriber1.log 12K ./src/test/recovery/tmp_check/log/regress_log_040_standby_failover_slots_sync With the patch: $ du -sh ./src/test/recovery/tmp_check/log/*040* 4.0K ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_cascading_standby.log 36K ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_publisher.log 48K ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_standby1.log 4.0K ./src/test/recovery/tmp_check/log/040_standby_failover_slots_sync_subscriber1.log 12K ./src/test/recovery/tmp_check/log/regress_log_040_standby_failover_slots_sync Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Feb 16, 2024 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Thanks for noticing this. I have pushed all your debug patches. Let's > hope that if there is a BF failure next time, we can gather enough > information to determine its cause. > There is a new BF failure [1] after adding these LOGs and I think I know what is going wrong. First, let's look at the standby LOGs: 2024-02-16 06:18:18.442 UTC [241414][client backend][2/14:0] DEBUG: segno: 4 of purposed restart_lsn for the synced slot, oldest_segno: 4 available 2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] DEBUG: xmin required by slots: data 0, catalog 741 2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] LOG: could not sync slot information as remote slot precedes local slot: remote slot "lsub1_slot": LSN (0/4000168), catalog xmin (739) local slot: LSN (0/4000168), catalog xmin (741) So, from the above LOG, it is clear that the remote slot's catalog xmin (739) precedes the local catalog xmin (741), which prevents the sync on the standby from completing. Next, let's look at the LOG from the primary around the same time: 2024-02-16 06:18:11.354 UTC [238037][autovacuum worker][5/17:0] DEBUG: analyzing "pg_catalog.pg_depend" 2024-02-16 06:18:11.360 UTC [238037][autovacuum worker][5/17:0] DEBUG: "pg_depend": scanned 13 of 13 pages, containing 1709 live rows and 0 dead rows; 1709 rows in sample, 1709 estimated total rows ... 2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/0:0] DEBUG: Autovacuum VacuumUpdateCosts(db=1, rel=14050, dobalance=yes, cost_limit=200, cost_delay=2 active=yes failsafe=no) 2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/19:0] DEBUG: analyzing "information_schema.sql_features" 2024-02-16 06:18:11.377 UTC [238037][autovacuum worker][5/19:0] DEBUG: "sql_features": scanned 8 of 8 pages, containing 756 live rows and 0 dead rows; 756 rows in sample, 756 estimated total rows This shows that the autovacuum worker has analyzed a catalog table and, to update its statistics in pg_statistic, would have acquired a new transaction id. Now, after the slot creation, a new transaction id that has updated the catalog is generated on the primary and would have been replicated to the standby. Due to this, the catalog_xmin of the primary's slot precedes the standby's catalog_xmin, and we see this failure. As per this theory, we should disable autovacuum on the primary to avoid updates to catalog_xmin values. [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-02-16%2006%3A12%3A59 -- With Regards, Amit Kapila.
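The mechanism described above can be reproduced by hand on an otherwise idle primary (a sketch, assuming a superuser session and no concurrent activity): analyzing a catalog table writes to pg_statistic and therefore consumes a transaction id that modifies the catalogs.

SELECT pg_current_xact_id();   -- assigns an XID to this transaction
ANALYZE pg_catalog.pg_depend;  -- updates pg_statistic, consuming another XID
SELECT pg_current_xact_id();   -- at least two higher than the first value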
Hi, On Fri, Feb 16, 2024 at 01:12:31PM +0530, Amit Kapila wrote: > On Fri, Feb 16, 2024 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Thanks for noticing this. I have pushed all your debug patches. Let's > > hope if there is a BF failure next time, we can gather enough > > information to know the reason of the same. > > > > There is a new BF failure [1] after adding these LOGs and I think I > know what is going wrong. First, let's look at standby LOGs: > > 2024-02-16 06:18:18.442 UTC [241414][client backend][2/14:0] DEBUG: > segno: 4 of purposed restart_lsn for the synced slot, oldest_segno: 4 > available > 2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] DEBUG: > xmin required by slots: data 0, catalog 741 > 2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] LOG:mote > could not sync slot information as reslot precedes local slot: remote > slot "lsub1_slot": LSN (0/4000168), catalog xmin (739) local slot: LSN > (0/4000168), catalog xmin (741) > > So, from the above LOG, it is clear that the remote slot's catalog > xmin (739) precedes the local catalog xmin (741) which makes the sync > on standby to not complete. Yeah, catalog_xmin was the other suspect (with restart_lsn) and agree it is the culprit here. > Next, let's look at the LOG from the primary during the nearby time: > 2024-02-16 06:18:11.354 UTC [238037][autovacuum worker][5/17:0] DEBUG: > analyzing "pg_catalog.pg_depend" > 2024-02-16 06:18:11.360 UTC [238037][autovacuum worker][5/17:0] DEBUG: > "pg_depend": scanned 13 of 13 pages, containing 1709 live rows and 0 > dead rows; 1709 rows in sample, 1709 estimated total rows > ... > 2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/0:0] DEBUG: > Autovacuum VacuumUpdateCosts(db=1, rel=14050, dobalance=yes, > cost_limit=200, cost_delay=2 active=yes failsafe=no) > 2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/19:0] DEBUG: > analyzing "information_schema.sql_features" > 2024-02-16 06:18:11.377 UTC [238037][autovacuum worker][5/19:0] DEBUG: > "sql_features": scanned 8 of 8 pages, containing 756 live rows and 0 > dead rows; 756 rows in sample, 756 estimated total rows > > It shows us that autovacuum worker has analyzed catalog table and for > updating its statistics in pg_statistic table, it would have acquired > a new transaction id. Now, after the slot creation, a new transaction > id that has updated the catalog is generated on primary and would have > been replication to standby. Due to this catalog_xmin of primary's > slot would precede standby's catalog_xmin and we see this failure. > > As per this theory, we should disable autovacuum on primary to avoid > updates to catalog_xmin values. Makes sense to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Friday, February 16, 2024 3:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Feb 16, 2024 at 11:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Thanks for noticing this. I have pushed all your debug patches. Let's > > hope if there is a BF failure next time, we can gather enough > > information to know the reason of the same. > > > > There is a new BF failure [1] after adding these LOGs and I think I know what is > going wrong. First, let's look at standby LOGs: > > 2024-02-16 06:18:18.442 UTC [241414][client backend][2/14:0] DEBUG: > segno: 4 of purposed restart_lsn for the synced slot, oldest_segno: 4 available > 2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] DEBUG: > xmin required by slots: data 0, catalog 741 > 2024-02-16 06:18:18.443 UTC [241414][client backend][2/14:0] LOG:mote could > not sync slot information as reslot precedes local slot: remote slot "lsub1_slot": > LSN (0/4000168), catalog xmin (739) local slot: LSN (0/4000168), catalog xmin > (741) > > So, from the above LOG, it is clear that the remote slot's catalog xmin (739) > precedes the local catalog xmin (741) which makes the sync on standby to not > complete. > > Next, let's look at the LOG from the primary during the nearby time: > 2024-02-16 06:18:11.354 UTC [238037][autovacuum worker][5/17:0] DEBUG: > analyzing "pg_catalog.pg_depend" > 2024-02-16 06:18:11.360 UTC [238037][autovacuum worker][5/17:0] DEBUG: > "pg_depend": scanned 13 of 13 pages, containing 1709 live rows and 0 dead > rows; 1709 rows in sample, 1709 estimated total rows ... > 2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/0:0] DEBUG: > Autovacuum VacuumUpdateCosts(db=1, rel=14050, dobalance=yes, > cost_limit=200, cost_delay=2 active=yes failsafe=no) > 2024-02-16 06:18:11.372 UTC [238037][autovacuum worker][5/19:0] DEBUG: > analyzing "information_schema.sql_features" > 2024-02-16 06:18:11.377 UTC [238037][autovacuum worker][5/19:0] DEBUG: > "sql_features": scanned 8 of 8 pages, containing 756 live rows and 0 dead rows; > 756 rows in sample, 756 estimated total rows > > It shows us that autovacuum worker has analyzed catalog table and for updating > its statistics in pg_statistic table, it would have acquired a new transaction id. Now, > after the slot creation, a new transaction id that has updated the catalog is > generated on primary and would have been replication to standby. Due to this > catalog_xmin of primary's slot would precede standby's catalog_xmin and we see > this failure. > > As per this theory, we should disable autovacuum on primary to avoid updates to > catalog_xmin values. > Agreed. Here is the patch to disable autovacuum in the test. Best Regards, Hou zj
Attachment
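For reference, what the fix amounts to, expressed as SQL against a running primary, is simply the following (the patch itself sets the parameter in the test node's configuration instead):

ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();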
On Thu, Feb 15, 2024 at 10:48 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Looking at v88_0001, random comments: Thanks for the feedback. > > 1 === > > Commit message "Be enabling slot synchronization" > > Typo? s:Be/By Modified. > 2 === > > + It enables a physical standby to synchronize logical failover slots > + from the primary server so that logical subscribers are not blocked > + after failover. > > Not sure "not blocked" is the right wording. > "can be resumed from the new primary" maybe? (was discussed in [1]) Modified. > 3 === > > +#define SlotSyncWorkerAllowed() \ > + (sync_replication_slots && pmState == PM_HOT_STANDBY && \ > + SlotSyncWorkerCanRestart()) > > Maybe add a comment above the macro explaining the logic? Done. > 4 === > > +#include "replication/walreceiver.h" > #include "replication/slotsync.h" > > should be reverse order? Removed walreceiver.h inclusion as it was not needed. > 5 === > > + if (SlotSyncWorker->syncing) > { > - SpinLockRelease(&SlotSyncCtx->mutex); > + SpinLockRelease(&SlotSyncWorker->mutex); > ereport(ERROR, > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > errmsg("cannot synchronize replication slots concurrently")); > } > > worth to add a test in 040_standby_failover_slots_sync.pl for it? It will be very difficult to stabilize this test as we have to make sure that the concurrent users (SQL function(s) and/or worker(s)) are in that target function at the same time to hit it. > > 6 === > > +static void > +slotsync_reread_config(bool restart) > +{ > > worth to add test(s) in 040_standby_failover_slots_sync.pl for it? Added test. Please find v89 patch set. The other changes are: patch001: 1) Addressed some comments by Amit and Ajin given off-list. 2) Removed redundant header inclusions from slotsync.c. 3) Corrected the value returned by validate_remote_info(). 4) Restructured code around validate_remote_info. 5) Improved comments and commit msg. patch002: Rebased it. thanks Shveta
Attachment
On Fri, Feb 16, 2024 at 4:10 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Feb 15, 2024 at 10:48 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > 5 === > > > > + if (SlotSyncWorker->syncing) > > { > > - SpinLockRelease(&SlotSyncCtx->mutex); > > + SpinLockRelease(&SlotSyncWorker->mutex); > > ereport(ERROR, > > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > errmsg("cannot synchronize replication slots concurrently")); > > } > > > > worth to add a test in 040_standby_failover_slots_sync.pl for it? > > It will be very difficult to stabilize this test as we have to make > sure that the concurrent users (SQL function(s) and/or worker(s)) are > in that target function at the same time to hit it. > Yeah, I also think would be tricky to write a stable test, maybe one can explore using a new injection point facility but I don't think it is worth for this error check as this appears straightforward to be broken in the future by other changes. -- With Regards, Amit Kapila.
On Friday, February 16, 2024 6:41 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Feb 15, 2024 at 10:48 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > Looking at v88_0001, random comments: > > Thanks for the feedback. > > > > > 1 === > > > > Commit message "Be enabling slot synchronization" > > > > Typo? s:Be/By > > Modified. > > > 2 === > > > > + It enables a physical standby to synchronize logical failover slots > > + from the primary server so that logical subscribers are not blocked > > + after failover. > > > > Not sure "not blocked" is the right wording. > > "can be resumed from the new primary" maybe? (was discussed in [1]) > > Modified. > > > 3 === > > > > +#define SlotSyncWorkerAllowed() \ > > + (sync_replication_slots && pmState == PM_HOT_STANDBY && \ > > + SlotSyncWorkerCanRestart()) > > > > Maybe add a comment above the macro explaining the logic? > > Done. > > > 4 === > > > > +#include "replication/walreceiver.h" > > #include "replication/slotsync.h" > > > > should be reverse order? > > Removed walreceiver.h inclusion as it was not needed. > > > 5 === > > > > + if (SlotSyncWorker->syncing) > > { > > - SpinLockRelease(&SlotSyncCtx->mutex); > > + SpinLockRelease(&SlotSyncWorker->mutex); > > ereport(ERROR, > > > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > errmsg("cannot synchronize replication > slots concurrently")); > > } > > > > worth to add a test in 040_standby_failover_slots_sync.pl for it? > > It will be very difficult to stabilize this test as we have to make sure that the > concurrent users (SQL function(s) and/or worker(s)) are in that target function at > the same time to hit it. > > > > > 6 === > > > > +static void > > +slotsync_reread_config(bool restart) > > +{ > > > > worth to add test(s) in 040_standby_failover_slots_sync.pl for it? > > Added test. > > Please find v89 patch set. The other changes are: Thanks for the patch. Here are few comments: 1. +static char * +get_dbname_from_conninfo(const char *conninfo) +{ + static char *dbname; + + if (dbname) + return dbname; + else + dbname = walrcv_get_dbname_from_conninfo(conninfo); + + return dbname; +} I think it's not necessary to have a static variable here, because the guc check doesn't seem performance sensitive. Additionaly, it does not work well with slotsync SQL functions, because the dbname will be allocated in temp memory context when reaching here via SQL function, so it's not safe to access this static variable in next function call. 2. +static bool +validate_remote_info(WalReceiverConn *wrconn, int elevel) ... + + return (!remote_in_recovery && primary_slot_valid); The primary_slot_valid could be uninitialized in this function. Best Regards, Hou zj
On Sun, Feb 18, 2024 at 7:40 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, February 16, 2024 6:41 PM shveta malik <shveta.malik@gmail.com> wrote: > Thanks for the patch. Here are few comments: Thanks for the comments. > > 2. > > +static bool > +validate_remote_info(WalReceiverConn *wrconn, int elevel) > ... > + > + return (!remote_in_recovery && primary_slot_valid); > > The primary_slot_valid could be uninitialized in this function. return (!remote_in_recovery && primary_slot_valid); Here if remote_in_recovery is true, it will not even read primary_slot_valid. It will read primary_slot_valid only if remote_in_recovery is false and in such a case primary_slot_valid will always be initialized in the else block above, let me know if you still feel we shall initialize this to some default? thanks Shveta
On Monday, February 19, 2024 11:39 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Sun, Feb 18, 2024 at 7:40 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > On Friday, February 16, 2024 6:41 PM shveta malik <shveta.malik@gmail.com> > wrote: > > Thanks for the patch. Here are few comments: > > Thanks for the comments. > > > > > 2. > > > > +static bool > > +validate_remote_info(WalReceiverConn *wrconn, int elevel) > > ... > > + > > + return (!remote_in_recovery && primary_slot_valid); > > > > The primary_slot_valid could be uninitialized in this function. > > return (!remote_in_recovery && primary_slot_valid); > > Here if remote_in_recovery is true, it will not even read primary_slot_valid. It > will read primary_slot_valid only if remote_in_recovery is false and in such a > case primary_slot_valid will always be initialized in the else block above, let me > know if you still feel we shall initialize this to some default? I understand that it will not be used, but some complier could report WARNING for the un-initialized variable. The cfbot[1] seems complain about this as well. [1] https://cirrus-ci.com/task/5416851522453504 Best Regards, Hou zj
On Mon, Feb 19, 2024 at 9:32 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > I understand that it will not be used, but some complier could report WARNING > for the un-initialized variable. The cfbot[1] seems complain about this as well. > > [1] https://cirrus-ci.com/task/5416851522453504 Okay I see. Thanks for pointing it out. Here are the patches addressing your comments. Changes are in patch001, rest are rebased. thanks Shveta
Attachment
Hi, On Sat, Feb 17, 2024 at 10:10:18AM +0530, Amit Kapila wrote: > On Fri, Feb 16, 2024 at 4:10 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Thu, Feb 15, 2024 at 10:48 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > 5 === > > > > > > + if (SlotSyncWorker->syncing) > > > { > > > - SpinLockRelease(&SlotSyncCtx->mutex); > > > + SpinLockRelease(&SlotSyncWorker->mutex); > > > ereport(ERROR, > > > errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > > errmsg("cannot synchronize replication slots concurrently")); > > > } > > > > > > worth to add a test in 040_standby_failover_slots_sync.pl for it? > > > > It will be very difficult to stabilize this test as we have to make > > sure that the concurrent users (SQL function(s) and/or worker(s)) are > > in that target function at the same time to hit it. > > > > Yeah, I also think would be tricky to write a stable test, maybe one > can explore using a new injection point facility but I don't think it > is worth for this error check as this appears straightforward to be > broken in the future by other changes. Yeah, injection point would probably be the way to go. Agree that's probably not worth adding such a test (we can change our mind later on if needed anyway). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Feb 19, 2024 at 9:46 AM shveta malik <shveta.malik@gmail.com> wrote: > > Okay I see. Thanks for pointing it out. Here are the patches > addressing your comments. Changes are in patch001, rest are rebased. > Few comments on 0001 ==================== 1. I think it is better to error out when the valid GUC or option is not set in ensure_valid_slotsync_params() and ensure_valid_remote_info() instead of waiting. And we shouldn't start the worker in the first place if not all required GUCs are set. This will additionally simplify the code a bit. 2. +typedef struct SlotSyncWorkerCtxStruct { - /* prevents concurrent slot syncs to avoid slot overwrites */ + pid_t pid; + bool stopSignaled; bool syncing; + time_t last_start_time; slock_t mutex; -} SlotSyncCtxStruct; +} SlotSyncWorkerCtxStruct; I think we don't need to change the name of this struct as this can be used both by the worker and the backend. We can probably add the comment to indicate that all the fields except 'syncing' are used by slotsync worker. 3. Similar to above, function names like SlotSyncShmemInit() shouldn't be changed to SlotSyncWorkerShmemInit(). 4. +ReplSlotSyncWorkerMain(int argc, char *argv[]) { ... + on_shmem_exit(slotsync_worker_onexit, (Datum) 0); ... + before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn)); ... } Do we need two separate callbacks? Can't we have just one (say slotsync_failure_callback) that cleans additional things in case of slotsync worker? And, if we need both those callbacks then please add some comments for both and why one is before_shmem_exit() and the other is on_shmem_exit(). In addition to the above, I have made a few changes in the comments and code (cosmetic code changes). Please include those in the next version if you find them okay. You need to rename .txt file to remove .txt and apply atop v90-0001*. -- With Regards, Amit Kapila.
Attachment
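As a side note, the settings such validation is concerned with can be inspected on the standby with a query along these lines (the exact parameter set shown is an assumption based on the requirements discussed in this thread: sync_replication_slots enabled, wal_level = logical, hot_standby_feedback on, primary_slot_name set, and a dbname in primary_conninfo):

SELECT name, setting
FROM pg_settings
WHERE name IN ('sync_replication_slots', 'wal_level',
               'hot_standby_feedback', 'primary_slot_name',
               'primary_conninfo');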
On Mon, Feb 19, 2024 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Few comments on 0001 Thanks for the feedback. > ==================== > 1. I think it is better to error out when the valid GUC or option is > not set in ensure_valid_slotsync_params() and > ensure_valid_remote_info() instead of waiting. And we shouldn't start > the worker in the first place if not all required GUCs are set. This > will additionally simplify the code a bit. Sure, removed 'ensure' functions. Moved the validation checks to the postmaster before starting the slot sync worker. > 2. > +typedef struct SlotSyncWorkerCtxStruct > { > - /* prevents concurrent slot syncs to avoid slot overwrites */ > + pid_t pid; > + bool stopSignaled; > bool syncing; > + time_t last_start_time; > slock_t mutex; > -} SlotSyncCtxStruct; > +} SlotSyncWorkerCtxStruct; > > I think we don't need to change the name of this struct as this can be > used both by the worker and the backend. We can probably add the > comment to indicate that all the fields except 'syncing' are used by > slotsync worker. Modified. > 3. Similar to above, function names like SlotSyncShmemInit() shouldn't > be changed to SlotSyncWorkerShmemInit(). Modified. > 4. > +ReplSlotSyncWorkerMain(int argc, char *argv[]) > { > ... > + on_shmem_exit(slotsync_worker_onexit, (Datum) 0); > ... > + before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn)); > ... > } > > Do we need two separate callbacks? Can't we have just one (say > slotsync_failure_callback) that cleans additional things in case of > slotsync worker? And, if we need both those callbacks then please add > some comments for both and why one is before_shmem_exit() and the > other is on_shmem_exit(). I think we can merge these now. Earlier 'on_shmem_exit' was needed to avoid race-condition between startup and slot sync worker process to drop 'i' slots on promotion. Now we do not have any such scenario. But I need some time to analyze it well. Will do it in the next version. > In addition to the above, I have made a few changes in the comments > and code (cosmetic code changes). Please include those in the next > version if you find them okay. You need to rename .txt file to remove > .txt and apply atop v90-0001*. Sure, included these. Please find the patch001 attached. I will rebase the rest of the patches and post them tomorrow. thanks Shveta
Attachment
On Mon, Feb 19, 2024 at 9:59 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Feb 19, 2024 at 5:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > Few comments on 0001 > > Thanks for the feedback. > > > ==================== > > 1. I think it is better to error out when the valid GUC or option is > > not set in ensure_valid_slotsync_params() and > > ensure_valid_remote_info() instead of waiting. And we shouldn't start > > the worker in the first place if not all required GUCs are set. This > > will additionally simplify the code a bit. > > Sure, removed 'ensure' functions. Moved the validation checks to the > postmaster before starting the slot sync worker. > > > 2. > > +typedef struct SlotSyncWorkerCtxStruct > > { > > - /* prevents concurrent slot syncs to avoid slot overwrites */ > > + pid_t pid; > > + bool stopSignaled; > > bool syncing; > > + time_t last_start_time; > > slock_t mutex; > > -} SlotSyncCtxStruct; > > +} SlotSyncWorkerCtxStruct; > > > > I think we don't need to change the name of this struct as this can be > > used both by the worker and the backend. We can probably add the > > comment to indicate that all the fields except 'syncing' are used by > > slotsync worker. > > Modified. > > > 3. Similar to above, function names like SlotSyncShmemInit() shouldn't > > be changed to SlotSyncWorkerShmemInit(). > > Modified. > > > 4. > > +ReplSlotSyncWorkerMain(int argc, char *argv[]) > > { > > ... > > + on_shmem_exit(slotsync_worker_onexit, (Datum) 0); > > ... > > + before_shmem_exit(slotsync_failure_callback, PointerGetDatum(wrconn)); > > ... > > } > > > > Do we need two separate callbacks? Can't we have just one (say > > slotsync_failure_callback) that cleans additional things in case of > > slotsync worker? And, if we need both those callbacks then please add > > some comments for both and why one is before_shmem_exit() and the > > other is on_shmem_exit(). > > I think we can merge these now. Earlier 'on_shmem_exit' was needed to > avoid race-condition between startup and slot sync worker process to > drop 'i' slots on promotion. Now we do not have any such scenario. > But I need some time to analyze it well. Will do it in the next > version. > > > In addition to the above, I have made a few changes in the comments > > and code (cosmetic code changes). Please include those in the next > > version if you find them okay. You need to rename .txt file to remove > > .txt and apply atop v90-0001*. > > Sure, included these. > > Please find the patch001 attached. I've reviewed the v91 patch. Here are random comments: --- /* * Checks the remote server info. * - * We ensure that the 'primary_slot_name' exists on the remote server and the - * remote server is not a standby node. + * Check whether we are a cascading standby. For non-cascading standbys, it + * also ensures that the 'primary_slot_name' exists on the remote server. */ IIUC what the validate_remote_info() does doesn't not change by this patch, so the previous comment seems to be clearer to me. --- if (remote_in_recovery) + { + /* + * If we are a cascading standby, no need to check further for + * 'primary_slot_name'. 
+ */ ereport(ERROR, errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot synchronize replication slots from a standby server")); + } + else + { + bool primary_slot_valid; - primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); - Assert(!isnull); + primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); + Assert(!isnull); - if (!primary_slot_valid) - ereport(ERROR, - errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("bad configuration for slot synchronization"), - /* translator: second %s is a GUC variable name */ - errdetail("The replication slot \"%s\" specified by \"%s\" does not exist on the primary server.", - PrimarySlotName, "primary_slot_name")); + if (!primary_slot_valid) + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("bad configuration for slot synchronization"), + /* translator: second %s is a GUC variable name */ + errdetail("The replication slot \"%s\" specified by \"%s\" does not exist on the primary server.", + PrimarySlotName, "primary_slot_name")); + } I think it's a refactoring rather than changes required by the slotsync worker. We can do that in a separate patch but why do we need this change in the first place? --- + ValidateSlotSyncParams(ERROR); + /* Load the libpq-specific functions */ load_file("libpqwalreceiver", false); - ValidateSlotSyncParams(); + (void) CheckDbnameInConninfo(); Is there any reason why we move where to check the parameters? Some comments not related to the patch but to the existing code: --- It might have already been discussed but is the src/backend/replication/logical the right place for the slocsync.c? If it's independent of logical decoding/replication, is under src/backend/replication could be more appropriate? --- /* Construct query to fetch slots with failover enabled. */ appendStringInfo(&s, "SELECT slot_name, plugin, confirmed_flush_lsn," " restart_lsn, catalog_xmin, two_phase, failover," " database, conflict_reason" " FROM pg_catalog.pg_replication_slots" " WHERE failover and NOT temporary"); /* Execute the query */ res = walrcv_exec(wrconn, s.data, SLOTSYNC_COLUMN_COUNT, slotRow); pfree(s.data); We don't need 's' as the query is constant. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
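Incidentally, the query quoted above can be run verbatim on the primary, which is handy when checking by hand which slots the synchronization would pick up:

SELECT slot_name, plugin, confirmed_flush_lsn,
       restart_lsn, catalog_xmin, two_phase, failover,
       database, conflict_reason
FROM pg_catalog.pg_replication_slots
WHERE failover AND NOT temporary;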
On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > I've reviewed the v91 patch. Here are random comments: Thanks for the comments. > --- > /* > * Checks the remote server info. > * > - * We ensure that the 'primary_slot_name' exists on the remote server and the > - * remote server is not a standby node. > + * Check whether we are a cascading standby. For non-cascading standbys, it > + * also ensures that the 'primary_slot_name' exists on the remote server. > */ > > IIUC what the validate_remote_info() does doesn't not change by this > patch, so the previous comment seems to be clearer to me. > > --- > if (remote_in_recovery) > + { > + /* > + * If we are a cascading standby, no need to check further for > + * 'primary_slot_name'. > + */ > ereport(ERROR, > errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("cannot synchronize replication slots from a > standby server")); > + } > + else > + { > + bool primary_slot_valid; > > - primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > - Assert(!isnull); > + primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > + Assert(!isnull); > > - if (!primary_slot_valid) > - ereport(ERROR, > - errcode(ERRCODE_INVALID_PARAMETER_VALUE), > - errmsg("bad configuration for slot synchronization"), > - /* translator: second %s is a GUC variable name */ > - errdetail("The replication slot \"%s\" specified by > \"%s\" does not exist on the primary server.", > - PrimarySlotName, "primary_slot_name")); > + if (!primary_slot_valid) > + ereport(ERROR, > + errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("bad configuration for slot synchronization"), > + /* translator: second %s is a GUC variable name */ > + errdetail("The replication slot \"%s\" specified > by \"%s\" does not exist on the primary server.", > + PrimarySlotName, "primary_slot_name")); > + } > > I think it's a refactoring rather than changes required by the > slotsync worker. We can do that in a separate patch but why do we need > this change in the first place? In v90, this refactoring was made due to the fact that validate_remote_info() was supposed to behave differently for SQL function and slot-sync worker. SQL-function was supposed to ERROR out while the worker was supposed to LOG and become no-op. And thus the change was needed. In v91, we made this functionality same i.e. both sql function and worker will error out but missed to remove this refactoring. Thanks for catching this, I will revert it in the next version. To match the refactoring, I made the comment change too, will revert that as well. > --- > + ValidateSlotSyncParams(ERROR); > + > /* Load the libpq-specific functions */ > load_file("libpqwalreceiver", false); > > - ValidateSlotSyncParams(); > + (void) CheckDbnameInConninfo(); > > Is there any reason why we move where to check the parameters? Earlier DBname verification was done inside ValidateSlotSyncParams() and thus it was needed to load 'libpqwalreceiver' before we call this function. Now we have moved dbname verification in a separate call and thus the above order got changed. ValidateSlotSyncParams() is a common function used by SQL function and worker. Earlier slot sync worker was checking all the GUCs after starting up and was exiting each time any GUC was invalid. It was suggested that it would be better to check the GUCs before starting the slot sync worker in the postmaster itself, making the ValidateSlotSyncParams() move to postmaster (see SlotSyncWorkerAllowed). 
But it was not a good idea to load libpq in the postmaster, and thus we moved the libpq-related verification out of ValidateSlotSyncParams(). This resulted in the above change. I hope this answers your query. thanks Shveta
On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Some comments not related to the patch but to the existing code: > > --- > It might have already been discussed but is the > src/backend/replication/logical the right place for the slocsync.c? If > it's independent of logical decoding/replication, is under > src/backend/replication could be more appropriate? > This point has not been discussed, so thanks for raising it. I think the reasoning behind keeping it in logical is that this file contains the code for logical slot syncing and a worker doing that. As it is mostly about logical slot syncing, there is an argument to keep it under logical. In the future, we may need to extend this functionality to have a per-db slot sync worker as well, in which case it will probably again be somewhat related to logical slots. Having said that, there is an argument to keep it under replication as well because the functionality it provides is for physical standbys. -- With Regards, Amit Kapila.
On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I've reviewed the v91 patch. Here are random comments: > Thanks for the comments; they are addressed in v92. The slotsync.c file is still under 'logical'; I am waiting for that discussion to be concluded. v92 also addresses some off-list comments given by Amit and Hou-San. The changes are in patch001; the rest of the patches are rebased. thanks Shveta
Attachment
On Tue, Feb 20, 2024 at 12:33 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > I've reviewed the v91 patch. Here are random comments: > > Thanks for the comments. > > > --- > > /* > > * Checks the remote server info. > > * > > - * We ensure that the 'primary_slot_name' exists on the remote server and the > > - * remote server is not a standby node. > > + * Check whether we are a cascading standby. For non-cascading standbys, it > > + * also ensures that the 'primary_slot_name' exists on the remote server. > > */ > > > > IIUC what the validate_remote_info() does doesn't not change by this > > patch, so the previous comment seems to be clearer to me. > > > > --- > > if (remote_in_recovery) > > + { > > + /* > > + * If we are a cascading standby, no need to check further for > > + * 'primary_slot_name'. > > + */ > > ereport(ERROR, > > errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > > errmsg("cannot synchronize replication slots from a > > standby server")); > > + } > > + else > > + { > > + bool primary_slot_valid; > > > > - primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > > - Assert(!isnull); > > + primary_slot_valid = DatumGetBool(slot_getattr(tupslot, 2, &isnull)); > > + Assert(!isnull); > > > > - if (!primary_slot_valid) > > - ereport(ERROR, > > - errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > - errmsg("bad configuration for slot synchronization"), > > - /* translator: second %s is a GUC variable name */ > > - errdetail("The replication slot \"%s\" specified by > > \"%s\" does not exist on the primary server.", > > - PrimarySlotName, "primary_slot_name")); > > + if (!primary_slot_valid) > > + ereport(ERROR, > > + errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("bad configuration for slot synchronization"), > > + /* translator: second %s is a GUC variable name */ > > + errdetail("The replication slot \"%s\" specified > > by \"%s\" does not exist on the primary server.", > > + PrimarySlotName, "primary_slot_name")); > > + } > > > > I think it's a refactoring rather than changes required by the > > slotsync worker. We can do that in a separate patch but why do we need > > this change in the first place? > > In v90, this refactoring was made due to the fact that > validate_remote_info() was supposed to behave differently for SQL > function and slot-sync worker. SQL-function was supposed to ERROR out > while the worker was supposed to LOG and become no-op. And thus the > change was needed. In v91, we made this functionality same i.e. both > sql function and worker will error out but missed to remove this > refactoring. Thanks for catching this, I will revert it in the next > version. To match the refactoring, I made the comment change too, will > revert that as well. > > > --- > > + ValidateSlotSyncParams(ERROR); > > + > > /* Load the libpq-specific functions */ > > load_file("libpqwalreceiver", false); > > > > - ValidateSlotSyncParams(); > > + (void) CheckDbnameInConninfo(); > > > > Is there any reason why we move where to check the parameters? > > Earlier DBname verification was done inside ValidateSlotSyncParams() > and thus it was needed to load 'libpqwalreceiver' before we call this > function. Now we have moved dbname verification in a separate call and > thus the above order got changed. ValidateSlotSyncParams() is a common > function used by SQL function and worker. 
Earlier slot sync worker was > checking all the GUCs after starting up and was exiting each time any > GUC was invalid. It was suggested that it would be better to check the > GUCs before starting the slot sync worker in the postmaster itself, > making the ValidateSlotSyncParams() move to postmaster (see > SlotSyncWorkerAllowed). But it was not a good idea to load libpq in > postmaster and thus we moved libpq related verification out of > ValidateSlotSyncParams(). This resulted in the above change. I hope > it answers your query. Thank you for the explanation. It makes sense to me to move the check. As for ValidateSlotSyncParams() called by SlotSyncWorkerAllowed(), I have two comments: 1. The error messages are not very descriptive and do not seem to match the other messages the postmaster emits. When starting the standby server with a slotsync misconfiguration, I got the following messages from the postmaster: 2024-02-20 17:01:16.356 JST [456741] LOG: database system is ready to accept read-only connections 2024-02-20 17:01:16.358 JST [456741] LOG: bad configuration for slot synchronization 2024-02-20 17:01:16.358 JST [456741] HINT: "hot_standby_feedback" must be enabled. It says "bad configuration" but keeps working, and does not give further information, such as whether it skipped starting the slotsync worker. I think these messages could work for the slotsync worker, but we might want to have more descriptive messages for the postmaster. For example, "skipped starting slot sync worker because hot_standby_feedback is disabled". 2. If the wal_level is not logical, the server will need to restart anyway to change the wal_level and have the slotsync worker work. Does it make sense to have the postmaster exit if the wal_level is not logical and sync_replication_slots is enabled? For instance, we have similar checks in PostmasterMain(): if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL) ereport(ERROR, (errmsg("WAL cannot be summarized when wal_level is \"minimal\""))); Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
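For illustration, a minimal sketch of the startup-time check suggested above, modeled on the quoted summarize_wal example; the GUC name sync_replication_slots comes from this thread, while the exact placement, error code, and message wording are assumptions rather than the committed behavior:

    /*
     * Sketch only: error out at postmaster startup when slot synchronization
     * is requested but wal_level cannot support it.
     */
    if (sync_replication_slots && wal_level < WAL_LEVEL_LOGICAL)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("replication slot synchronization requires wal_level >= \"logical\"")));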
On Tue, Feb 20, 2024 at 12:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 20, 2024 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Some comments not related to the patch but to the existing code: > > > > --- > > It might have already been discussed but is the > > src/backend/replication/logical the right place for the slocsync.c? If > > it's independent of logical decoding/replication, is under > > src/backend/replication could be more appropriate? > > Thank you for the comment. > > This point has not been discussed, so thanks for raising it. I think > the reasoning behind keeping it in logical is that this file contains > a code for logical slot syncing and a worker doing that. As it is > mostly about logical slot syncing so there is an argument to keep it > under logical. In the future, we may need to extend this functionality > to have a per-db slot sync worker as well in which case it will > probably be again somewhat related to logical slots. That's a valid argument. > Having said that, > there is an argument to keep it under replication as well because the > functionality it provides is for physical standbys. Another argument to keep it under replication is that all files under the replication/logical directory are logical decoding and logical replication infrastructure, IOW functionality built on top of (logical) replication slots. On the other hand, the slotsync worker (and slotsync functionality) looks like a part of slot management functionality, which seems to be at the same layer as slot.c. BTW, the slotsync.c of the v91 patch includes "replication/logical.h", but it isn't actually necessary; #include'ing "replication/slot.h" is sufficient. Regards -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Feb 20, 2024 at 6:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Thank you for the explanation. It makes sense to me to move the check. > > As for ValidateSlotSyncParams() called by SlotSyncWorkerAllowed(), I > have two comments: > > 1. The error messages are not very descriptive and seem not to match > other messages the postmaster says. When starting the standby server > with misconfiguration about the slotsync, I got the following messages > from the postmaster: > > 2024-02-20 17:01:16.356 JST [456741] LOG: database system is ready to > accept read-only connections > 2024-02-20 17:01:16.358 JST [456741] LOG: bad configuration for slot > synchronization > 2024-02-20 17:01:16.358 JST [456741] HINT: "hot_standby_feedback" > must be enabled. > > It says "bad configuration" but is still working, and does not say > further information such as whether it skipped to start the slotsync > worker etc. I think these messages could work for the slotsync worker > but we might want to have more descriptive messages for the > postmaster. For example, "skipped starting slot sync worker because > hot_standby_feedback is disabled". > We are planning to change it to something like:"slot synchronization requires hot_standby_feedback to be enabled". See [1] > 2. If the wal_level is not logical, the server will need to restart > anyway to change the wal_level and have the slotsync worker work. Does > it make sense to have the postmaster exit if the wal_level is not > logical and sync_replication_slots is enabled? For instance, we have > similar checks in PostmsaterMain(): > > if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL) > ereport(ERROR, > (errmsg("WAL cannot be summarized when wal_level is > \"minimal\""))); > +1. I think giving an error in this case makes sense. Miscellaneous comments: ======================== 1. +void +ShutDownSlotSync(void) +{ + SpinLockAcquire(&SlotSyncCtx->mutex); + + SlotSyncCtx->stopSignaled = true; This flag is never reset back. I think we should reset this once the promotion is complete. Though offhand, I don't see any problem with this but it doesn't look clean and can be a source of bugs in the future. 2. +char * +CheckDbnameInConninfo(void) { char *dbname; Let's name this function as CheckAndGetDbnameFromConninfo(). Apart from the above, I have made cosmetic changes in the attached. [1] - https://www.postgresql.org/message-id/CAJpy0uBWomyAjP0zyFdzhGxn%2BXsAb2OdJA%2BKfNyZRv2nV6PD9g%40mail.gmail.com -- With Regards, Amit Kapila.
Attachment
On Tue, Feb 20, 2024 at 6:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Miscellaneous comments: Thanks for the comments. > ======================== > 1. > +void > +ShutDownSlotSync(void) > +{ > + SpinLockAcquire(&SlotSyncCtx->mutex); > + > + SlotSyncCtx->stopSignaled = true; > > This flag is never reset back. I think we should reset this once the > promotion is complete. Though offhand, I don't see any problem with > this but it doesn't look clean and can be a source of bugs in the > future. I reset the flag in MaybeStartSlotSyncWorker() when we attempt to start the worker after promotion completes and find that stopSignaled is true while pmState is PM_RUN. From that point onwards, we can rely on pmState to prevent the launch of the slot sync worker and thus can reset stopSignaled. > 2. > +char * > +CheckDbnameInConninfo(void) > { > char *dbname; > > Let's name this function as CheckAndGetDbnameFromConninfo(). Modified. > Apart from the above, I have made cosmetic changes in the attached. Included these changes. Thanks. Here are the v93 patches. They also address Sawada-san's comment about converting LOG to ERROR when wal_level < logical. I have also incorporated one more change wherein we check that 'Shutdown <= SmartShutdown' before launching the slot sync worker. Since we do not need the slot sync process to help with the rest of the shutdown process, it is better not to start it when a shutdown (immediate or fast) is in progress. I have done this based on the details in [1]. It is similar to the WalReceiver behaviour. Thoughts? [1]: https://www.postgresql.org/message-id/flat/CAJpy0uCeQm2aFJLkx-D0BeAEvSdViTZf4wD7zT9coDHfLv1NaA%40mail.gmail.com thanks Shveta
Attachment
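For readers following along, a rough sketch of the reset described above; MaybeStartSlotSyncWorker(), IsStopSignaledSet() and ResetStopSignaled() are names from the patch under discussion, and the body here only illustrates the idea, not the actual code:

    static void
    MaybeStartSlotSyncWorker(void)
    {
        /*
         * stopSignaled was set while the worker was shut down during
         * promotion.  Once the postmaster is back in normal running state,
         * pmState alone is enough to gate the worker, so clear the flag.
         */
        if (IsStopSignaledSet() && pmState == PM_RUN)
            ResetStopSignaled();

        /* ... existing checks and worker launch logic ... */
    }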
On Tue, Feb 20, 2024 at 6:19 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Feb 20, 2024 at 12:33 PM shveta malik <shveta.malik@gmail.com> wrote: > Thank you for the explanation. It makes sense to me to move the check. > > > 2. If the wal_level is not logical, the server will need to restart > anyway to change the wal_level and have the slotsync worker work. Does > it make sense to have the postmaster exit if the wal_level is not > logical and sync_replication_slots is enabled? For instance, we have > similar checks in PostmsaterMain(): > > if (summarize_wal && wal_level == WAL_LEVEL_MINIMAL) > ereport(ERROR, > (errmsg("WAL cannot be summarized when wal_level is > \"minimal\""))); Thanks for the feedback. I have addressed it in v93. thanks SHveta
On Wed, Feb 21, 2024 at 12:15 PM shveta malik <shveta.malik@gmail.com> wrote: > > > Thanks for the feedback. I have addressed it in v93. > A few minor comments: ================= 1. +/* + * Is stopSignaled set in SlotSyncCtx? + */ +bool +IsStopSignaledSet(void) +{ + bool signaled; + + SpinLockAcquire(&SlotSyncCtx->mutex); + signaled = SlotSyncCtx->stopSignaled; + SpinLockRelease(&SlotSyncCtx->mutex); + + return signaled; +} + +/* + * Reset stopSignaled in SlotSyncCtx. + */ +void +ResetStopSignaled(void) +{ + SpinLockAcquire(&SlotSyncCtx->mutex); + SlotSyncCtx->stopSignaled = false; + SpinLockRelease(&SlotSyncCtx->mutex); +} I think these newly introduced functions don't need spinlock to be acquired as these are just one-byte read-and-write. Additionally, when IsStopSignaledSet() is invoked, there shouldn't be any concurrent process to update that value. What do you think? 2. +REPL_SLOTSYNC_MAIN "Waiting in main loop of slot sync worker." +REPL_SLOTSYNC_SHUTDOWN "Waiting for slot sync worker to shut down." Let's use REPLICATION instead of REPL. I see other wait events using REPLICATION in their names. 3. - * In standalone mode and in autovacuum worker processes, we use a fixed - * ID, otherwise we figure it out from the authenticated user name. + * In standalone mode, autovacuum worker processes and slot sync worker + * process, we use a fixed ID, otherwise we figure it out from the + * authenticated user name. */ - if (bootstrap || IsAutoVacuumWorkerProcess()) + if (bootstrap || IsAutoVacuumWorkerProcess() || IsLogicalSlotSyncWorker()) { InitializeSessionUserIdStandalone(); am_superuser = true; IIRC, we discussed this previously and it is safe to make the local connection as superuser as we don't consult any user tables, so we can probably add a comment where we invoke InitPostgres in slotsync.c 4. $publisher->safe_psql('postgres', - "CREATE PUBLICATION regress_mypub FOR ALL TABLES;"); + "CREATE PUBLICATION regress_mypub FOR ALL TABLES;" +); Why this change is required in the patch? 5. +# Confirm that restart_lsn and of confirmed_flush_lsn lsub1_slot slot are synced +# to the standby /and of/; looks like a typo 6. +# Confirm that restart_lsn and of confirmed_flush_lsn lsub1_slot slot are synced +# to the standby +ok( $standby1->poll_query_until( + 'postgres', + "SELECT '$primary_restart_lsn' = restart_lsn AND '$primary_flush_lsn' = confirmed_flush_lsn from pg_replication_slots WHERE slot_name = 'lsub1_slot';"), + 'restart_lsn and confirmed_flush_lsn of slot lsub1_slot synced to standby'); + ... ... +# Confirm the synced slot 'lsub1_slot' is retained on the new primary +is($standby1->safe_psql('postgres', + q{SELECT slot_name FROM pg_replication_slots WHERE slot_name = 'lsub1_slot';}), + 'lsub1_slot', + 'synced slot retained on the new primary'); In both these checks, we should additionally check the 'synced' and 'temporary' flags to ensure that they are marked appropriately. -- With Regards, Amit Kapila.
On Wed, Feb 21, 2024 at 5:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > A few minor comments: Thanks for the feedback. > ================= > 1. > +/* > + * Is stopSignaled set in SlotSyncCtx? > + */ > +bool > +IsStopSignaledSet(void) > +{ > + bool signaled; > + > + SpinLockAcquire(&SlotSyncCtx->mutex); > + signaled = SlotSyncCtx->stopSignaled; > + SpinLockRelease(&SlotSyncCtx->mutex); > + > + return signaled; > +} > + > +/* > + * Reset stopSignaled in SlotSyncCtx. > + */ > +void > +ResetStopSignaled(void) > +{ > + SpinLockAcquire(&SlotSyncCtx->mutex); > + SlotSyncCtx->stopSignaled = false; > + SpinLockRelease(&SlotSyncCtx->mutex); > +} > > I think these newly introduced functions don't need spinlock to be > acquired as these are just one-byte read-and-write. Additionally, when > IsStopSignaledSet() is invoked, there shouldn't be any concurrent > process to update that value. What do you think? Yes, we can avoid taking spinlock here. These functions are invoked after checking that pmState is PM_RUN. And in that state we do not expect any other process writing this flag. > 2. > +REPL_SLOTSYNC_MAIN "Waiting in main loop of slot sync worker." > +REPL_SLOTSYNC_SHUTDOWN "Waiting for slot sync worker to shut down." > > Let's use REPLICATION instead of REPL. I see other wait events using > REPLICATION in their names. Modified. > 3. > - * In standalone mode and in autovacuum worker processes, we use a fixed > - * ID, otherwise we figure it out from the authenticated user name. > + * In standalone mode, autovacuum worker processes and slot sync worker > + * process, we use a fixed ID, otherwise we figure it out from the > + * authenticated user name. > */ > - if (bootstrap || IsAutoVacuumWorkerProcess()) > + if (bootstrap || IsAutoVacuumWorkerProcess() || IsLogicalSlotSyncWorker()) > { > InitializeSessionUserIdStandalone(); > am_superuser = true; > > IIRC, we discussed this previously and it is safe to make the local > connection as superuser as we don't consult any user tables, so we can > probably add a comment where we invoke InitPostgres in slotsync.c Added comment. Thanks Hou-San for the analysis here and providing comment. > 4. > $publisher->safe_psql('postgres', > - "CREATE PUBLICATION regress_mypub FOR ALL TABLES;"); > + "CREATE PUBLICATION regress_mypub FOR ALL TABLES;" > +); > > Why this change is required in the patch? Not needed, removed it. > 5. > +# Confirm that restart_lsn and of confirmed_flush_lsn lsub1_slot slot > are synced > +# to the standby > > /and of/; looks like a typo Modified. > 6. > +# Confirm that restart_lsn and of confirmed_flush_lsn lsub1_slot slot > are synced > +# to the standby > +ok( $standby1->poll_query_until( > + 'postgres', > + "SELECT '$primary_restart_lsn' = restart_lsn AND > '$primary_flush_lsn' = confirmed_flush_lsn from pg_replication_slots > WHERE slot_name = 'lsub1_slot';"), > + 'restart_lsn and confirmed_flush_lsn of slot lsub1_slot synced to standby'); > + > ... > ... > +# Confirm the synced slot 'lsub1_slot' is retained on the new primary > +is($standby1->safe_psql('postgres', > + q{SELECT slot_name FROM pg_replication_slots WHERE slot_name = > 'lsub1_slot';}), > + 'lsub1_slot', > + 'synced slot retained on the new primary'); > > In both these checks, we should additionally check the 'synced' and > 'temporary' flags to ensure that they are marked appropriately. Modified. Please find patch001 attached. There is a CFBot failure in patch002. The test added there needs some adjustment. 
We will rebase and post the rest of the patches once we fix that issue. thanks Shveta
Attachment
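To make the agreed change concrete, a sketch of the spinlock-free variants discussed above, under the assumption that stopSignaled is a single byte and has no concurrent writer at the time IsStopSignaledSet() runs; this is an illustration, not the committed code:

    bool
    IsStopSignaledSet(void)
    {
        /* Single-byte read; no concurrent writer expected once pmState is PM_RUN. */
        return SlotSyncCtx->stopSignaled;
    }

    void
    ResetStopSignaled(void)
    {
        /* Single-byte write; safe without the spinlock for the same reason. */
        SlotSyncCtx->stopSignaled = false;
    }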
On Thu, Feb 22, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote: > > Please find patch001 attached. There is a CFBot failure in patch002. > The test added there needs some adjustment. We will rebase and post > rest of the patches once we fix that issue. > There was a recent commit 801792e to improve error messaging in slotsync.c, which resulted in a conflict. Thus, I have rebased the patch. There is no new change in the attached patch. thanks Shveta
Attachment
Hi, On Thu, Feb 22, 2024 at 12:16:34PM +0530, shveta malik wrote: > On Thu, Feb 22, 2024 at 10:31 AM shveta malik <shveta.malik@gmail.com> wrote: > There was a recent commit 801792e to improve error messaging in > slotsync.c which resulted in conflict. Thus rebased the patch. There > is no new change in the patch attached Thanks! Some random comments about v92_001 (Sorry if it has already been discussed up-thread): 1 === + * We do not update the 'synced' column from true to false here Worth mentioning which system view the 'synced' column belongs to? 2 === (Nit) +#define MIN_WORKER_NAPTIME_MS 200 +#define MAX_WORKER_NAPTIME_MS 30000 /* 30s */ [MIN|MAX]_SLOTSYNC_WORKER_NAPTIME_MS instead? It is used only in slotsync.c, so more of a nit. 3 === res = walrcv_exec(wrconn, query, SLOTSYNC_COLUMN_COUNT, slotRow); - if (res->status != WALRCV_OK_TUPLES) Line removal intended? 4 === + if (wal_level < WAL_LEVEL_LOGICAL) + { + ereport(ERROR, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("slot synchronization requires wal_level >= \"logical\"")); + return false; + } I think the return is not needed here as it won't be reached due to the "ERROR". Or should we use "elevel" instead of "ERROR"? 5 === + * operate as a superuser. This is safe because the slot sync worker does + * not interact with user tables, eliminating the risk of executing + * arbitrary code within triggers. Right. I did not check, but if we are using operators in our remote SPI calls, then it would be worth ensuring they come from the pg_catalog schema, using something like "OPERATOR(pg_catalog.=)" with "=" as an example. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > Thanks! > > Some random comments about v92_001 (Sorry if it has already been discussed > up-thread): Thanks for the feedback. The patch is pushed 15 minutes back. I will prepare a top-up patch for your comments. > 1 === > > + * We do not update the 'synced' column from true to false here > > Worth to mention from which system view the 'synced' column belongs to? Sure, I will change it. > 2 === (Nit) > > +#define MIN_WORKER_NAPTIME_MS 200 > +#define MAX_WORKER_NAPTIME_MS 30000 /* 30s */ > > [MIN|MAX]_SLOTSYNC_WORKER_NAPTIME_MS instead? It is used only in slotsync.c so > more a Nit. Okay, will change it, > 3 === > > res = walrcv_exec(wrconn, query, SLOTSYNC_COLUMN_COUNT, slotRow); > - > if (res->status != WALRCV_OK_TUPLES) > > Line removal intended? I feel the current style is better, where we do not have space between the function call and return value checking. > 4 === > > + if (wal_level < WAL_LEVEL_LOGICAL) > + { > + ereport(ERROR, > + errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("slot synchronization requires wal_level >= \"logical\"")); > + return false; > + } > > I think the return is not needed here as it won't be reached due to the "ERROR". > Or should we use "elevel" instead of "ERROR"? It was suggested to raise ERROR for wal_level validation, please see [1]. But yes, I will remove the return value. Thanks for catching this. > 5 === > > + * operate as a superuser. This is safe because the slot sync worker does > + * not interact with user tables, eliminating the risk of executing > + * arbitrary code within triggers. > > Right. I did not check but if we are using operators in our remote SPI calls > then it would be worth to ensure they are coming from the pg_catalog schema? > Using something like "OPERATOR(pg_catalog.=)" using "=" as an example. Can you please elaborate this one, I am not sure if I understood it. [1]: https://www.postgresql.org/message-id/CAD21AoB2ipSzQb5-o5pEYKie4oTPJTsYR1ip9_wRVrF6HbBWDQ%40mail.gmail.com thanks Shveta
Hi, On Thu, Feb 22, 2024 at 04:01:34PM +0530, shveta malik wrote: > On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > Thanks! > > > > Some random comments about v92_001 (Sorry if it has already been discussed > > up-thread): > > Thanks for the feedback. The patch is pushed 15 minutes back. Yeah, saw that after I send the comments ;-) > I will > prepare a top-up patch for your comments. Thanks! > > 4 === > > > > + if (wal_level < WAL_LEVEL_LOGICAL) > > + { > > + ereport(ERROR, > > + errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("slot synchronization requires wal_level >= \"logical\"")); > > + return false; > > + } > > > > I think the return is not needed here as it won't be reached due to the "ERROR". > > Or should we use "elevel" instead of "ERROR"? > > It was suggested to raise ERROR for wal_level validation, please see > [1]. But yes, I will remove the return value. Yeah, thanks, ERROR makes sense here. > > 5 === > > > > + * operate as a superuser. This is safe because the slot sync worker does > > + * not interact with user tables, eliminating the risk of executing > > + * arbitrary code within triggers. > > > > Right. I did not check but if we are using operators in our remote SPI calls > > then it would be worth to ensure they are coming from the pg_catalog schema? > > Using something like "OPERATOR(pg_catalog.=)" using "=" as an example. > > Can you please elaborate this one, I am not sure if I understood it. Suppose that in synchronize_slots() the query would be: const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," " restart_lsn, catalog_xmin, two_phase, failover," " database, conflict_reason" " FROM pg_catalog.pg_replication_slots" " WHERE failover and NOT temporary and 1 = 1"; Then my comment is to rewrite it to: const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," " restart_lsn, catalog_xmin, two_phase, failover," " database, conflict_reason" " FROM pg_catalog.pg_replication_slots" " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 1"; to ensure the operator "=" is coming from the pg_catalog schema. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Thu, Feb 22, 2024 at 04:01:34PM +0530, shveta malik wrote: > > On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > Hi, > > > > > > Thanks! > > > > > > Some random comments about v92_001 (Sorry if it has already been discussed > > > up-thread): > > > > Thanks for the feedback. The patch is pushed 15 minutes back. > > Yeah, saw that after I send the comments ;-) > There is a BF failure. See https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2024-02-22%2010%3A13%3A03. The initial analysis suggests that for some reason, the primary went down after the slot sync worker was invoked the first time. See the below in the primary's LOG: 2024-02-22 10:59:56.896 UTC [2721639:29] standby1_slotsync worker LOG: 00000: statement: SELECT slot_name, plugin, confirmed_flush_lsn, restart_lsn, catalog_xmin, two_phase, failover, database, conflict_reason FROM pg_catalog.pg_replication_slots WHERE failover and NOT temporary 2024-02-22 10:59:56.896 UTC [2721639:30] standby1_slotsync worker LOCATION: exec_simple_query, postgres.c:1070 2024-02-22 11:00:26.967 UTC [2721639:31] standby1_slotsync worker LOG: 00000: statement: SELECT slot_name, plugin, confirmed_flush_lsn, restart_lsn, catalog_xmin, two_phase, failover, database, conflict_reason FROM pg_catalog.pg_replication_slots WHERE failover and NOT temporary 2024-02-22 11:00:26.967 UTC [2721639:32] standby1_slotsync worker LOCATION: exec_simple_query, postgres.c:1070 2024-02-22 11:00:35.908 UTC [2721435:309] LOG: 00000: received immediate shutdown request 2024-02-22 11:00:35.908 UTC [2721435:310] LOCATION: process_pm_shutdown_request, postmaster.c:2859 2024-02-22 11:00:35.911 UTC [2721435:311] LOG: 00000: database system is shut down 2024-02-22 11:00:35.911 UTC [2721435:312] LOCATION: UnlinkLockFiles, miscinit.c:1138 -- With Regards, Amit Kapila.
On Thu, Feb 22, 2024 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > On Thu, Feb 22, 2024 at 04:01:34PM +0530, shveta malik wrote: > > > On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot > > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > > Hi, > > > > > > > > Thanks! > > > > > > > > Some random comments about v92_001 (Sorry if it has already been discussed > > > > up-thread): > > > > > > Thanks for the feedback. The patch is pushed 15 minutes back. > > > > Yeah, saw that after I send the comments ;-) > > > There is a BF failure. See > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2024-02-22%2010%3A13%3A03. > > The initial analysis suggests that for some reason, the primary went > down after the slot sync worker was invoked the first time. See the > below in the primary's LOG: > The reason is that the test failed waiting on the below LOG: ### Reloading node "standby1" # Running: pg_ctl -D /home/ec2-user/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_040_standby_failover_slots_sync_standby1_data/pgdata reload server signaled timed out waiting for match: (?^:LOG: slot sync worker started) at t/040_standby_failover_slots_sync.pl line 376. Now, on the standby, we see a LOG line like 2024-02-22 10:57:35.432 UTC [2721638:1] LOG: 00000: slot sync worker started. Even then the test failed, and the reason is that there is an extra "00000" before the actual message, which is due to log_error_verbosity = verbose in the config. I think the test's log-line matching here needs to be more robust. -- With Regards, Amit Kapila.
On Thursday, February 22, 2024 8:41 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Feb 22, 2024 at 5:23 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > On Thu, Feb 22, 2024 at 04:01:34PM +0530, shveta malik wrote: > > > > On Thu, Feb 22, 2024 at 3:44 PM Bertrand Drouvot > > > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > > > > Hi, > > > > > > > > > > Thanks! > > > > > > > > > > Some random comments about v92_001 (Sorry if it has already been > > > > > discussed > > > > > up-thread): > > > > > > > > Thanks for the feedback. The patch is pushed 15 minutes back. > > > > > > Yeah, saw that after I send the comments ;-) > > > > > > > There is a BF failure. See > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2024-0 > 2-22%2010%3A13%3A03. > > > > The initial analysis suggests that for some reason, the primary went > > down after the slot sync worker was invoked the first time. See the > > below in the primary's LOG: > > > > The reason is that the test failed waiting on below LOG: > > ### Reloading node "standby1" > # Running: pg_ctl -D > /home/ec2-user/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_ > 040_standby_failover_slots_sync_standby1_data/pgdata > reload > server signaled > timed out waiting for match: (?^:LOG: slot sync worker started) at > t/040_standby_failover_slots_sync.pl line 376. > > Now, on standby, we see a LOG like 2024-02-22 10:57:35.432 UTC [2721638:1] > LOG: 00000: slot sync worker started. Even then the test failed and the reason is > that it has an extra 0000 before the actual message which is due to > log_error_verbosity = verbose in config. I think here the test's log matching > code needs to have a more robust log line matching code. Agreed. Here is a small patch to change the msg in wait_for_log so that it only search the message part. Best Regards, Hou zj
Attachment
Hi, Since the slotsync worker patch has been committed, I rebased the remaining patches. And here is the V95 patch set. Also, I fixed a bug in the current 0001 patch where the members of the standby slot names list pointed to freed memory after calling ProcessConfigFile(). Now, we will obtain a new list when we call ProcessConfigFile(). The optimization to only get the new list when the names actually change has been removed. I think this change is acceptable because ProcessConfigFile is not called frequently. Additionally, I reordered the tests in 040_standby_failover_slots_sync.pl. Now the new test will be conducted after the sync slot test to prevent the risk of the logical slot occasionally not catching up to the latest catalog_xmin and, as a result, not being able to be synced immediately. Best Regards, Hou zj
Attachment
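A rough sketch of the reload path being described, for illustration only; ProcessConfigFile() and ConfigReloadPending are existing backend facilities and GetStandbySlotList() is from this patch set, but the surrounding variable names and placement are assumptions:

    if (ConfigReloadPending)
    {
        ConfigReloadPending = false;
        ProcessConfigFile(PGC_SIGHUP);

        /*
         * The previously obtained list may point into GUC memory freed by the
         * reload, so unconditionally fetch a fresh copy instead of trying to
         * detect whether standby_slot_names actually changed.
         */
        list_free(standby_slots);
        standby_slots = GetStandbySlotList(true);
    }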
On Friday, February 23, 2024 10:02 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Hi, > > Since the slotsync worker patch has been committed, I rebased the remaining > patches. > And here is the V95 patch set. > > Also, I fixed a bug in the current 0001 patch where the member of the standby > slot names list pointed to the freed memory after calling ProcessConfigFile(). > Now, we will obtain a new list when we call ProcessConfigFile(). The > optimization to only get the new list when the names actually change has been > removed. I think this change is acceptable because ProcessConfigFile is not a > frequent occurrence. > > Additionally, I reordered the tests in 040_standby_failover_slots_sync.pl. Now > the new test will be conducted after the sync slot test to prevent the risk of the > logical slot occasionally not catching up to the latest catalog_xmin and, as a > result, not being able to be synced immediately. There is one unexpected change in the previous version, sorry for that. Here is the correct version. Best Regards, Hou zj
Attachment
On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Suppose that in synchronize_slots() the query would be: > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > " restart_lsn, catalog_xmin, two_phase, failover," > " database, conflict_reason" > " FROM pg_catalog.pg_replication_slots" > " WHERE failover and NOT temporary and 1 = 1"; > > Then my comment is to rewrite it to: > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > " restart_lsn, catalog_xmin, two_phase, failover," > " database, conflict_reason" > " FROM pg_catalog.pg_replication_slots" > " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 1"; > > to ensure the operator "=" is coming from the pg_catalog schema. > Thanks for the details, but slot-sync does not use SPI calls, it uses libpqrcv calls. So is this change needed? thanks Shveta
On Fri, Feb 23, 2024 at 8:35 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > Suppose that in synchronize_slots() the query would be: > > > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > > " restart_lsn, catalog_xmin, two_phase, failover," > > " database, conflict_reason" > > " FROM pg_catalog.pg_replication_slots" > > " WHERE failover and NOT temporary and 1 = 1"; > > > > Then my comment is to rewrite it to: > > > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > > " restart_lsn, catalog_xmin, two_phase, failover," > > " database, conflict_reason" > > " FROM pg_catalog.pg_replication_slots" > > " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 1"; > > > > to ensure the operator "=" is coming from the pg_catalog schema. > > > > Thanks for the details, but slot-sync does not use SPI calls, it uses > libpqrcv calls. So is this change needed? Additionally, I would like to have a better understanding of why it's necessary and whether it addresses any potential security risks. thanks Shveta
On Friday, February 23, 2024 10:18 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > > Hi, > > > > Since the slotsync worker patch has been committed, I rebased the > > remaining patches. > > And here is the V95 patch set. > > > > Also, I fixed a bug in the current 0001 patch where the member of the > > standby slot names list pointed to the freed memory after calling > ProcessConfigFile(). > > Now, we will obtain a new list when we call ProcessConfigFile(). The > > optimization to only get the new list when the names actually change > > has been removed. I think this change is acceptable because > > ProcessConfigFile is not a frequent occurrence. > > > > Additionally, I reordered the tests in > > 040_standby_failover_slots_sync.pl. Now the new test will be conducted > > after the sync slot test to prevent the risk of the logical slot > > occasionally not catching up to the latest catalog_xmin and, as a result, not > being able to be synced immediately. > > There is one unexpected change in the previous version, sorry for that. > Here is the correct version. I noticed one CFbot failure[1] which is because the tap-test doesn't wait for the standby to catch up before promoting, thus the data inserted after promotion could not be replicated to the subscriber. Add a wait_for_replay_catchup to fix it. Apart from this, I also adjusted some variable names in the tap-test to be consistent. And added back a mis-removed ProcessConfigFile call. [1] https://cirrus-ci.com/task/6126787437002752?logs=check_world#L312 Best Regards, Hou zj
Attachment
On Fri, Feb 23, 2024 at 10:06 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > I noticed one CFbot failure[1] which is because the tap-test doesn't wait for the > standby to catch up before promoting, thus the data inserted after promotion > could not be replicated to the subscriber. Add a wait_for_replay_catchup to fix it. > > Apart from this, I also adjusted some variable names in the tap-test to be > consistent. And added back a mis-removed ProcessConfigFile call. > > [1] https://cirrus-ci.com/task/6126787437002752?logs=check_world#L312 > Thanks for the patches. Had a quick look at v95_2, here are some trivial comments: slot.h: ----- 1) extern List *GetStandbySlotList(bool copy); extern void WaitForStandbyConfirmation(XLogRecPtr wait_for_lsn); extern void FilterStandbySlots(XLogRecPtr wait_for_lsn, List **standby_slots); The order is different from the one in slot.c slot.c: ----- 2) warningfmt = _("replication slot \"%s\" specified in parameter \"%s\" does not exist, ignoring"); GUC names should not have double quotes. Same in each warningfmt in this function 3) errmsg("replication slot \"%s\" specified in parameter \"%s\" does not have active_pid", Same here, double quotes around standby_slot_names should be removed walsender.c: ------ 4) * Used by logical decoding SQL functions that acquired slot with failover * enabled. To be consistent with other such comments in previous patches: slot with failover enabled --> failover enabled slot 5) Wake up the logical walsender processes with failover-enabled slots failover-enabled slots --> failover enabled slots postgresql.conf.sample: ---------- 6) streaming replication standby server slot names that logical walsender processes will wait for Is it better to say it like this? (I leave this to your preference) streaming replication standby server slot names for which logical walsender processes will wait. thanks Shveta
Hi, On Fri, Feb 23, 2024 at 08:35:44AM +0530, shveta malik wrote: > On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > Suppose that in synchronize_slots() the query would be: > > > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > > " restart_lsn, catalog_xmin, two_phase, failover," > > " database, conflict_reason" > > " FROM pg_catalog.pg_replication_slots" > > " WHERE failover and NOT temporary and 1 = 1"; > > > > Then my comment is to rewrite it to: > > > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > > " restart_lsn, catalog_xmin, two_phase, failover," > > " database, conflict_reason" > > " FROM pg_catalog.pg_replication_slots" > > " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 1"; > > > > to ensure the operator "=" is coming from the pg_catalog schema. > > > > Thanks for the details, but slot-sync does not use SPI calls, it uses > libpqrcv calls. Sorry for the confusion, I meant to say "remote SQL calls". > So is this change needed? The example I provided is a "fake" one (as currently the "=" operator is not used in the const char *query in synchronize_slots()). So there is currently nothing to change here. I just want to highlight that if we are using (or will use) operators in the remote SQL calls then we should ensure they are coming from the pg_catalog schema (as in the example provided above). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Fri, Feb 23, 2024 at 09:43:48AM +0530, shveta malik wrote: > On Fri, Feb 23, 2024 at 8:35 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Thu, Feb 22, 2024 at 4:35 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > Suppose that in synchronize_slots() the query would be: > > > > > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > > > " restart_lsn, catalog_xmin, two_phase, failover," > > > " database, conflict_reason" > > > " FROM pg_catalog.pg_replication_slots" > > > " WHERE failover and NOT temporary and 1 = 1"; > > > > > > Then my comment is to rewrite it to: > > > > > > const char *query = "SELECT slot_name, plugin, confirmed_flush_lsn," > > > " restart_lsn, catalog_xmin, two_phase, failover," > > > " database, conflict_reason" > > > " FROM pg_catalog.pg_replication_slots" > > > " WHERE failover and NOT temporary and 1 OPERATOR(pg_catalog.=) 1"; > > > > > > to ensure the operator "=" is coming from the pg_catalog schema. > > > > > > > Thanks for the details, but slot-sync does not use SPI calls, it uses > > libpqrcv calls. So is this change needed? > > Additionally, I would like to have a better understanding of why it's > necessary and whether it addresses any potential security risks. Because one could create say the "=" OPERATOR in their own schema, attach a function to it doing undesired stuff and change the search_path for the database the sync slot worker connects to. Then this new "=" operator would be used (instead of the pg_catalog.= one), triggering the "undesired" function as superuser. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Feb 23, 2024 at 1:28 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > Because one could create say the "=" OPERATOR in their own schema, attach a > function to it doing undesired stuff and change the search_path for the database > the sync slot worker connects to. > > Then this new "=" operator would be used (instead of the pg_catalog.= one), > triggering the "undesired" function as superuser. Thanks for the details. I understand it now. We do not use '=' in our main slots-fetch query, but we do use '=' in the remote-validation query. See validate_remote_info(). Do you think that, instead of doing the above, we can override search_path with an empty string in the slot-sync case, similar to the logical apply worker and autovacuum worker cases (see InitializeLogRepWorker(), AutoVacWorkerMain())? thanks Shveta
Hi, On Fri, Feb 23, 2024 at 02:15:11PM +0530, shveta malik wrote: > On Fri, Feb 23, 2024 at 1:28 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > Because one could create say the "=" OPERATOR in their own schema, attach a > > function to it doing undesired stuff and change the search_path for the database > > the sync slot worker connects to. > > > > Then this new "=" operator would be used (instead of the pg_catalog.= one), > > triggering the "undesired" function as superuser. > > Thanks for the details. I understand it now. We do not use '=' in our > main slots-fetch query but we do use '=' in remote-validation query. > See validate_remote_info(). Oh, right, I missed it during the review. > Do you think instead of doing the above, > we can override search-path with empty string in the slot-sync case. > SImilar to logical apply worker and autovacuum worker case (see > InitializeLogRepWorker(), AutoVacWorkerMain()). Yeah, we should definitively ensure that any operators being used in the query is coming from the pg_catalog schema (could be by setting the search path or using the up-thread proposal). Setting the search path would prevent any risks in case the query is changed later on, so I'd vote for changing the search path in validate_remote_info() and in synchronize_slots() to be on the safe side. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Friday, February 23, 2024 5:07 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Fri, Feb 23, 2024 at 02:15:11PM +0530, shveta malik wrote: > > On Fri, Feb 23, 2024 at 1:28 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > Hi, > > > > > > Because one could create say the "=" OPERATOR in their own schema, > > > attach a function to it doing undesired stuff and change the > > > search_path for the database the sync slot worker connects to. > > > > > > Then this new "=" operator would be used (instead of the > > > pg_catalog.= one), triggering the "undesired" function as superuser. > > > > Thanks for the details. I understand it now. We do not use '=' in our > > main slots-fetch query but we do use '=' in remote-validation query. > > See validate_remote_info(). > > Oh, right, I missed it during the review. > > > Do you think instead of doing the above, we can override search-path > > with empty string in the slot-sync case. > > SImilar to logical apply worker and autovacuum worker case (see > > InitializeLogRepWorker(), AutoVacWorkerMain()). > > Yeah, we should definitively ensure that any operators being used in the query > is coming from the pg_catalog schema (could be by setting the search path or > using the up-thread proposal). > > Setting the search path would prevent any risks in case the query is changed > later on, so I'd vote for changing the search path in validate_remote_info() and > in synchronize_slots() to be on the safe side. I think to set secure search path for remote connection, the standard approach could be to extend the code in libpqrcv_connect[1], so that we don't need to schema qualify all the operators in the queries. And for local connection, I agree it's also needed to add a SetConfigOption("search_path", "" call in the slotsync worker. [1] libpqrcv_connect ... if (logical) ... res = libpqrcv_PQexec(conn->streamConn, ALWAYS_SECURE_SEARCH_PATH_SQL); Best Regards, Hou zj
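For the local-connection side mentioned above, a one-line sketch mirroring what the logical replication apply worker already does at startup; applying it in the slot sync worker's initialization is the assumption here:

    /* Use an empty search_path so that unqualified names in any local lookups
     * cannot be captured by user schemas. */
    SetConfigOption("search_path", "", PGC_SUSET, PGC_S_OVERRIDE);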
On Friday, February 23, 2024 1:22 PM shveta malik <shveta.malik@gmail.com> wrote: > > Thanks for the patches. Had a quick look at v95_2, here are some > trivial comments: Thanks for the comments. > 6) streaming replication standby server slot names that logical walsender > processes will wait for > > Is it better to say it like this? (I leave this to your preference) > > streaming replication standby server slot names for which logical > walsender processes will wait. I feel the current one is better, so I didn't change it. The other comments have been addressed. Here is the V97 patch set, which addresses Shveta's comments. Besides, I'd like to clarify and discuss the behavior of standby_slot_names. As it stands in the patch, if the slots specified in standby_slot_names are dropped or invalidated, the logical walsender will issue a WARNING and continue to replicate the changes. Another option could be to have the walsender pause until the slot in standby_slot_names is re-created or becomes valid again. Does anyone else have an opinion on this matter? Best Regards, Hou zj
Attachment
Hi, On Fri, Feb 23, 2024 at 04:36:44AM +0000, Zhijie Hou (Fujitsu) wrote: > On Friday, February 23, 2024 10:18 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > > > > Hi, > > > > > > Since the slotsync worker patch has been committed, I rebased the > > > remaining patches. > > > And here is the V95 patch set. > > > > > > Also, I fixed a bug in the current 0001 patch where the member of the > > > standby slot names list pointed to the freed memory after calling > > ProcessConfigFile(). > > > Now, we will obtain a new list when we call ProcessConfigFile(). The > > > optimization to only get the new list when the names actually change > > > has been removed. I think this change is acceptable because > > > ProcessConfigFile is not a frequent occurrence. > > > > > > Additionally, I reordered the tests in > > > 040_standby_failover_slots_sync.pl. Now the new test will be conducted > > > after the sync slot test to prevent the risk of the logical slot > > > occasionally not catching up to the latest catalog_xmin and, as a result, not > > being able to be synced immediately. > > > > There is one unexpected change in the previous version, sorry for that. > > Here is the correct version. > > I noticed one CFbot failure[1] which is because the tap-test doesn't wait for the > standby to catch up before promoting, thus the data inserted after promotion > could not be replicated to the subscriber. Add a wait_for_replay_catchup to fix it. > > Apart from this, I also adjusted some variable names in the tap-test to be > consistent. And added back a mis-removed ProcessConfigFile call. Thanks! Here are some random comments: 1 === Commit message "Allow logical walsenders to wait for the physical" s/physical/physical standby/? 2 == +++ b/src/backend/replication/logical/logicalfuncs.c @@ -30,6 +30,7 @@ #include "replication/decode.h" #include "replication/logical.h" #include "replication/message.h" +#include "replication/walsender.h" Is this include needed? 3 === + * Slot sync is currently not supported on the cascading standby. This is s/on the/on a/? 4 === + if (!ok) + GUC_check_errdetail("List syntax is invalid."); + + /* + * If there is a syntax error in the name or if the replication slots' + * data is not initialized yet (i.e., we are in the startup process), skip + * the slot verification. + */ + if (!ok || !ReplicationSlotCtl) + { + pfree(rawname); + list_free(elemlist); + return ok; + } we are testing the "ok" value twice, what about using if...else if... instead and test it once? If so, it might be worth to put the: " + pfree(rawname); + list_free(elemlist); + return ok; " in a "goto". 5 === + * for which all standbys to wait for. Even if we have physical-slots s/physical-slots/physical slots/? 6 === * Switch to the same memory context under which GUC variables are s/to the same memory/to the memory/? 7 === + * Return a copy of standby_slot_names_list if the copy flag is set to true, Not sure, but would it be worth explaining why one would want to set to flag to true or false? (i.e why one would not want to receive the original list). 8 === + if (RecoveryInProgress()) + return NIL; The need is well documented just above, but are we not violating the fact that we return the original list or a copy of it? (that's what the comment above the GetStandbySlotList() function definition is saying). I think the comment above the GetStandbySlotList() function needs a bit of rewording to cover that case. 9 === + * harmless, a WARNING should be enough, no need to error-out. s/error-out/error out/? 
10 === + if (slot->data.invalidated != RS_INVAL_NONE) + { + /* + * Specified physical slot have been invalidated, so no point + * in waiting for it. We discovered in [1], that if the wal_status is "unreserved" then the slot is still serving the standby. I think we should handle this case differently, thoughts? 11 === + * Specified physical slot have been invalidated, so no point s/have been/has been/? 12 === +++ b/src/backend/replication/slotfuncs.c @@ -22,6 +22,7 @@ #include "replication/logical.h" #include "replication/slot.h" #include "replication/slotsync.h" +#include "replication/walsender.h" Is this include needed? [1]: https://www.postgresql.org/message-id/CALj2ACWE9asmvN1B18LqfHE8uBuWGsCEP7OO5trRCxPtTPeHVA%40mail.gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Fri, Feb 23, 2024 at 09:46:00AM +0000, Zhijie Hou (Fujitsu) wrote: > On Friday, February 23, 2024 1:22 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > Thanks for the patches. Had a quick look at v95_2, here are some > > trivial comments: > > Thanks for the comments. > > > 6) streaming replication standby server slot names that logical walsender > > processes will wait for > > > > Is it better to say it like this? (I leave this to your preference) > > > > streaming replication standby server slot names for which logical > > walsender processes will wait. > > I feel the current one seems better, so didn’t change. Other comments have been > addressed. Here is the V97 patch set which addressed Shveta's comments. > > > Besides, I'd like to clarify and discuss the behavior of standby_slot_names once. > > As it stands in the patch, If the slots specified in standby_slot_names are > dropped or invalidated, the logical walsender will issue a WARNING and continue > to replicate the changes. Another option for this could be to have the > walsender pause until the slot in standby_slot_names is re-created or becomes > valid again. Does anyone else have an opinion on this matter ? Good point. I'd vote for: the only reasons not to wait are that the slots mentioned in standby_slot_names exist, are valid, and have caught up, or that standby_slot_names is empty. The reason is that setting standby_slot_names to a non-empty value means that one wants the walsender to wait until the standby catches up. The way to remove this intentional behavior should be by changing the standby_slot_names value (not the existence or the state of the slot(s) it points to). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
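To make the proposed policy concrete, a pseudocode-style sketch of the decision Bertrand describes; the function shown here and the helper name StandbySlotsHaveCaughtUp() are illustrative only, not the patch's actual API:

    static bool
    WalSndStandbySlotsReady(XLogRecPtr wait_for_lsn)
    {
        /* Empty standby_slot_names: the user opted out, so do not wait. */
        if (standby_slot_names_list == NIL)
            return true;

        /*
         * Otherwise proceed only when every configured slot exists, is valid,
         * and has confirmed wait_for_lsn; a dropped or invalidated slot keeps
         * the walsender waiting instead of being silently skipped.
         */
        return StandbySlotsHaveCaughtUp(standby_slot_names_list, wait_for_lsn);
    }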
Hi, On Fri, Feb 23, 2024 at 09:30:58AM +0000, Zhijie Hou (Fujitsu) wrote: > On Friday, February 23, 2024 5:07 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Fri, Feb 23, 2024 at 02:15:11PM +0530, shveta malik wrote: > > > > > > Thanks for the details. I understand it now. We do not use '=' in our > > > main slots-fetch query but we do use '=' in remote-validation query. > > > See validate_remote_info(). > > > > Oh, right, I missed it during the review. > > > > > Do you think instead of doing the above, we can override search-path > > > with empty string in the slot-sync case. > > > SImilar to logical apply worker and autovacuum worker case (see > > > InitializeLogRepWorker(), AutoVacWorkerMain()). > > > > Yeah, we should definitively ensure that any operators being used in the query > > is coming from the pg_catalog schema (could be by setting the search path or > > using the up-thread proposal). > > > > Setting the search path would prevent any risks in case the query is changed > > later on, so I'd vote for changing the search path in validate_remote_info() and > > in synchronize_slots() to be on the safe side. > > I think to set secure search path for remote connection, the standard approach > could be to extend the code in libpqrcv_connect[1], so that we don't need to schema > qualify all the operators in the queries. > > And for local connection, I agree it's also needed to add a > SetConfigOption("search_path", "" call in the slotsync worker. > > [1] > libpqrcv_connect > ... > if (logical) > ... > res = libpqrcv_PQexec(conn->streamConn, > ALWAYS_SECURE_SEARCH_PATH_SQL); > Agree, something like in the attached? (it's .txt to not disturb the CF bot). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
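As a rough sketch of the two changes discussed in the message above (illustrative only, not the attached patch): the remote side reuses the secure-search-path query that libpqrcv_connect() already issues for logical connections, and the slotsync worker clears search_path for its local catalog lookups the way the autovacuum and logical apply workers do. The exact placement and GUC context values are assumptions based on those existing workers.

/* Remote side: run the secure-search-path query right after connecting,
 * as libpqrcv_connect() already does for logical connections. */
res = libpqrcv_PQexec(conn->streamConn, ALWAYS_SECURE_SEARCH_PATH_SQL);

/* Local side (slotsync worker startup): force an empty search_path so
 * that catalog queries only resolve objects in pg_catalog. */
SetConfigOption("search_path", "", PGC_SUSET, PGC_S_OVERRIDE);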
On Friday, February 23, 2024 6:12 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Here are some random comments: Thanks for the comments! > > 1 === > > Commit message "Allow logical walsenders to wait for the physical" > > s/physical/physical standby/? > > 2 == > > +++ b/src/backend/replication/logical/logicalfuncs.c > @@ -30,6 +30,7 @@ > #include "replication/decode.h" > #include "replication/logical.h" > #include "replication/message.h" > +#include "replication/walsender.h" > > Is this include needed? Removed. > > 3 === > > + * Slot sync is currently not supported on the cascading > + standby. This is > > s/on the/on a/? Changed. > > 4 === > > + if (!ok) > + GUC_check_errdetail("List syntax is invalid."); > + > + /* > + * If there is a syntax error in the name or if the replication slots' > + * data is not initialized yet (i.e., we are in the startup process), skip > + * the slot verification. > + */ > + if (!ok || !ReplicationSlotCtl) > + { > + pfree(rawname); > + list_free(elemlist); > + return ok; > + } > > we are testing the "ok" value twice, what about using if...else if... instead and > test it once? If so, it might be worth to put the: > > " > + pfree(rawname); > + list_free(elemlist); > + return ok; > " > > in a "goto". There were comments to remove the 'goto' statement and avoid duplicate free code, so I prefer the current style. > > 5 === > > + * for which all standbys to wait for. Even if we have > + physical-slots > > s/physical-slots/physical slots/? Changed. > > 6 === > > * Switch to the same memory context under which GUC variables are > > s/to the same memory/to the memory/? Changed. > > 7 === > > + * Return a copy of standby_slot_names_list if the copy flag is set to > + true, > > Not sure, but would it be worth explaining why one would want to set to flag to > true or false? (i.e why one would not want to receive the original list). I think the usage can be found from the caller's code, e.g we need to remove the slots that caught up from the list each time, so we cannot directly modify the global list. The GetStandbySlotList function is general function and I feel we can avoid adding more comments here. > > 8 === > > + if (RecoveryInProgress()) > + return NIL; > > The need is well documented just above, but are we not violating the fact that > we return the original list or a copy of it? (that's what the comment above the > GetStandbySlotList() function definition is saying). > > I think the comment above the GetStandbySlotList() function needs a bit of > rewording to cover that case. Adjusted. > > 9 === > > + * harmless, a WARNING should be enough, no need to > error-out. > > s/error-out/error out/? Changed. > > 10 === > > + if (slot->data.invalidated != RS_INVAL_NONE) > + { > + /* > + * Specified physical slot have been invalidated, > so no point > + * in waiting for it. > > We discovered in [1], that if the wal_status is "unreserved" then the slot is still > serving the standby. I think we should handle this case differently, thoughts? I think the 'invalidated' slot can still be used is a separate bug. Because once the slot is invalidated, it can neither protect WALs or ROWs from being removed even if the restart_lsn of the slot can be moved forward after being invalidated. If the standby can move restart_lsn forward for invalidated slots, then it should also set the 'invalidated' flag back to NONE, otherwise the slot cannot serve its purpose anymore. I also reported similar bug before[1]. 
> > 11 === > > + * Specified physical slot have been > + invalidated, so no point > > s/have been/has been/? Changed. > > 12 === > > +++ b/src/backend/replication/slotfuncs.c > @@ -22,6 +22,7 @@ > #include "replication/logical.h" > #include "replication/slot.h" > #include "replication/slotsync.h" > +#include "replication/walsender.h" > > Is this include needed? No, it's not needed. Removed. Attached is the V98 patch set, which addresses the above comments. I also adjusted a few comments based on off-list comments from Shveta. The discussion on the wait behavior is ongoing, so I didn't change the behavior in this version. [1] https://www.postgresql.org/message-id/flat/OS0PR01MB5716A626A4AF5814E057CEE39484A@OS0PR01MB5716.jpnprd01.prod.outlook.com Best Regards, Hou zj
Attachment
On Fri, Feb 23, 2024 at 7:41 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > I think to set secure search path for remote connection, the standard approach > > could be to extend the code in libpqrcv_connect[1], so that we don't need to schema > > qualify all the operators in the queries. > > > > And for local connection, I agree it's also needed to add a > > SetConfigOption("search_path", "" call in the slotsync worker. > > > > [1] > > libpqrcv_connect > > ... > > if (logical) > > ... > > res = libpqrcv_PQexec(conn->streamConn, > > ALWAYS_SECURE_SEARCH_PATH_SQL); > > > > Agree, something like in the attached? (it's .txt to not disturb the CF bot). Thanks for the patch, changes look good. I have incorporated it into the patch, which addresses the rest of your comments in [1]. I have attached the patch as .txt. [1]: https://www.postgresql.org/message-id/ZdcejBDCr%2BwlVGnO%40ip-10-97-1-34.eu-west-3.compute.internal thanks Shveta
Attachment
On Fri, Feb 23, 2024 at 4:45 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Fri, Feb 23, 2024 at 09:46:00AM +0000, Zhijie Hou (Fujitsu) wrote: > > > > Besides, I'd like to clarify and discuss the behavior of standby_slot_names once. > > > > As it stands in the patch, If the slots specified in standby_slot_names are > > dropped or invalidated, the logical walsender will issue a WARNING and continue > > to replicate the changes. Another option for this could be to have the > > walsender pause until the slot in standby_slot_names is re-created or becomes > > valid again. Does anyone else have an opinion on this matter ? > > Good point, I'd vote for: the only reasons not to wait are: > > - slots mentioned in standby_slot_names exist and valid and do catch up > or > - standby_slot_names is empty > > The reason is that setting standby_slot_names to a non empty value means that > one wants the walsender to wait until the standby catchup. The way to remove this > intentional behavior should be by changing the standby_slot_names value (not the > existence or the state of the slot(s) it points too). > It seems we already do wait for the case when there is an inactive slot as per the below code [1] in the patch. So, probably waiting in other cases is also okay and also as this parameter is a SIGHUP parameter, users should be easily able to change its value if required. Do you think it is a good idea to mention this in docs as well? I think it is important to raise WARNING as the patch is doing in all the cases where the slot is not being processed so that users can be notified and they can take the required action. [1] - else if (XLogRecPtrIsInvalid(slot->data.restart_lsn) || + slot->data.restart_lsn < wait_for_lsn) + { + bool inactive = (slot->active_pid == 0); + + SpinLockRelease(&slot->mutex); + + /* Log warning if no active_pid for this physical slot */ + if (inactive) + ereport(WARNING, + errmsg("replication slot \"%s\" specified in parameter %s does not have active_pid", + name, "standby_slot_names"), + errdetail("Logical replication is waiting on the standby associated with \"%s\".", + name), + errhint("Consider starting standby associated with \"%s\" or amend standby_slot_names.", + name)); + + /* Continue if the current slot hasn't caught up. */ + continue; -- With Regards, Amit Kapila.
Hi, On Mon, Feb 26, 2024 at 02:18:58AM +0000, Zhijie Hou (Fujitsu) wrote: > On Friday, February 23, 2024 6:12 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > + if (!ok) > > + GUC_check_errdetail("List syntax is invalid."); > > + > > + /* > > + * If there is a syntax error in the name or if the replication slots' > > + * data is not initialized yet (i.e., we are in the startup process), skip > > + * the slot verification. > > + */ > > + if (!ok || !ReplicationSlotCtl) > > + { > > + pfree(rawname); > > + list_free(elemlist); > > + return ok; > > + } > > > > we are testing the "ok" value twice, what about using if...else if... instead and > > test it once? If so, it might be worth to put the: > > > > " > > + pfree(rawname); > > + list_free(elemlist); > > + return ok; > > " > > > > in a "goto". > > There were comments to remove the 'goto' statement and avoid > duplicate free code, so I prefer the current style. The duplicate free code would come from the if...else if... rewrite but then the "goto" would remove it, so I'm not sure to understand your point. > > > > 7 === > > > > + * Return a copy of standby_slot_names_list if the copy flag is set to > > + true, > > > > Not sure, but would it be worth explaining why one would want to set to flag to > > true or false? (i.e why one would not want to receive the original list). > > I think the usage can be found from the caller's code, e.g we need to remove > the slots that caught up from the list each time, so we cannot directly modify > the global list. The GetStandbySlotList function is general function and I feel > we can avoid adding more comments here. Okay, yeah makes sense. > > > > 10 === > > > > + if (slot->data.invalidated != RS_INVAL_NONE) > > + { > > + /* > > + * Specified physical slot have been invalidated, > > so no point > > + * in waiting for it. > > > > We discovered in [1], that if the wal_status is "unreserved" then the slot is still > > serving the standby. I think we should handle this case differently, thoughts? > > I think the 'invalidated' slot can still be used is a separate bug. > Because > once the slot is invalidated, it can neither protect WALs or ROWs from being > removed even if the restart_lsn of the slot can be moved forward after being invalidated. > > If the standby can move restart_lsn forward for invalidated slots, then > it should also set the 'invalidated' flag back to NONE, otherwise the slot > cannot serve its purpose anymore. I also reported similar bug before[1]. I see. But should'nt we add a check on restart_lsn as this is done here in pg_get_replication_slots()? " case WALAVAIL_REMOVED: /* * If we read the restart_lsn long enough ago, maybe that file * has been removed by now. However, the walsender could have * moved forward enough that it jumped to another file after * we looked. If checkpointer signalled the process to * termination, then it's definitely lost; but if a process is * still alive, then "unreserved" seems more appropriate. if (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn)) " My point is that I think we should behave like it's not a bug and then adapt the code accordingly here (until the bug gets fixed). Currently we are not waiting for this slot while it's still serving the standby which does not seem good too, thoughts? > Attach the V98 patch set which addressed above comments. > I also adjusted few comments based on off-list comments from Shveta. Thanks! 
Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
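To make the restart_lsn suggestion above concrete, the extra test could look roughly like this. This is a sketch only, assuming the same slot fields used in the quoted patch code; whether this behavior is wanted at all is exactly what is being debated here.

/* An invalidated slot whose restart_lsn is still valid may merely be
 * "unreserved" and still serving the standby; only treat it as not worth
 * waiting for once its WAL is definitely gone. */
bool    definitely_lost = (slot->data.invalidated != RS_INVAL_NONE &&
                           XLogRecPtrIsInvalid(slot->data.restart_lsn));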
Hi, On Mon, Feb 26, 2024 at 09:13:05AM +0530, shveta malik wrote: > On Fri, Feb 23, 2024 at 7:41 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > I think to set secure search path for remote connection, the standard approach > > > could be to extend the code in libpqrcv_connect[1], so that we don't need to schema > > > qualify all the operators in the queries. > > > > > > And for local connection, I agree it's also needed to add a > > > SetConfigOption("search_path", "" call in the slotsync worker. > > > > > > [1] > > > libpqrcv_connect > > > ... > > > if (logical) > > > ... > > > res = libpqrcv_PQexec(conn->streamConn, > > > ALWAYS_SECURE_SEARCH_PATH_SQL); > > > > > > > Agree, something like in the attached? (it's .txt to not disturb the CF bot). > > Thanks for the patch, changes look good. I have corporated it in the > patch which addresses the rest of your comments in [1]. I have > attached the patch as .txt Thanks! LGTM. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Mon, Feb 26, 2024 at 10:48:38AM +0530, Amit Kapila wrote: > On Fri, Feb 23, 2024 at 4:45 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > On Fri, Feb 23, 2024 at 09:46:00AM +0000, Zhijie Hou (Fujitsu) wrote: > > > > > > Besides, I'd like to clarify and discuss the behavior of standby_slot_names once. > > > > > > As it stands in the patch, If the slots specified in standby_slot_names are > > > dropped or invalidated, the logical walsender will issue a WARNING and continue > > > to replicate the changes. Another option for this could be to have the > > > walsender pause until the slot in standby_slot_names is re-created or becomes > > > valid again. Does anyone else have an opinion on this matter ? > > > > Good point, I'd vote for: the only reasons not to wait are: > > > > - slots mentioned in standby_slot_names exist and valid and do catch up > > or > > - standby_slot_names is empty > > > > The reason is that setting standby_slot_names to a non empty value means that > > one wants the walsender to wait until the standby catchup. The way to remove this > > intentional behavior should be by changing the standby_slot_names value (not the > > existence or the state of the slot(s) it points too). > > > > It seems we already do wait for the case when there is an inactive > slot as per the below code [1] in the patch. So, probably waiting in > other cases is also okay and also as this parameter is a SIGHUP > parameter, users should be easily able to change its value if > required. Agree. > Do you think it is a good idea to mention this in docs as > well? Yeah, I think the more the better. > I think it is important to raise WARNING as the patch is doing in all > the cases where the slot is not being processed so that users can be > notified and they can take the required action. +1 Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Feb 26, 2024 at 12:59 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Mon, Feb 26, 2024 at 02:18:58AM +0000, Zhijie Hou (Fujitsu) wrote: > > On Friday, February 23, 2024 6:12 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > > + if (!ok) > > > + GUC_check_errdetail("List syntax is invalid."); > > > + > > > + /* > > > + * If there is a syntax error in the name or if the replication slots' > > > + * data is not initialized yet (i.e., we are in the startup process), skip > > > + * the slot verification. > > > + */ > > > + if (!ok || !ReplicationSlotCtl) > > > + { > > > + pfree(rawname); > > > + list_free(elemlist); > > > + return ok; > > > + } > > > > > > we are testing the "ok" value twice, what about using if...else if... instead and > > > test it once? If so, it might be worth to put the: > > > > > > " > > > + pfree(rawname); > > > + list_free(elemlist); > > > + return ok; > > > " > > > > > > in a "goto". > > > > There were comments to remove the 'goto' statement and avoid > > duplicate free code, so I prefer the current style. > > The duplicate free code would come from the if...else if... rewrite but then > the "goto" would remove it, so I'm not sure to understand your point. > I think Hou-San wants to say that there was previously a comment to remove goto and now you are saying to introduce it. But, I think we can avoid both code duplication and goto, if the first thing we check in the function is ReplicationSlotCtl and return false if the same is not set. Won't that be better? > > > > > > > 10 === > > > > > > + if (slot->data.invalidated != RS_INVAL_NONE) > > > + { > > > + /* > > > + * Specified physical slot have been invalidated, > > > so no point > > > + * in waiting for it. > > > > > > We discovered in [1], that if the wal_status is "unreserved" then the slot is still > > > serving the standby. I think we should handle this case differently, thoughts? > > > > I think the 'invalidated' slot can still be used is a separate bug. > > Because > > once the slot is invalidated, it can neither protect WALs or ROWs from being > > removed even if the restart_lsn of the slot can be moved forward after being invalidated. > > > > If the standby can move restart_lsn forward for invalidated slots, then > > it should also set the 'invalidated' flag back to NONE, otherwise the slot > > cannot serve its purpose anymore. I also reported similar bug before[1]. > ... > > My point is that I think we should behave like it's not a bug and then adapt the > code accordingly here (until the bug gets fixed). > oh, I think this doesn't sound like a good idea to me. We should fix that bug independently rather than adding code in new features to consider the bug as a valid behavior. It will add the burden on us to remember and remove the additional new check(s). -- With Regards, Amit Kapila.
On Mon, Feb 26, 2024 at 7:49 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Attach the V98 patch set which addressed above comments. > Few comments: ============= 1. WalSndWaitForWal(XLogRecPtr loc) { int wakeEvents; + bool wait_for_standby = false; + uint32 wait_event; + List *standby_slots = NIL; static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr; + if (MyReplicationSlot->data.failover && replication_active) + standby_slots = GetStandbySlotList(true); + /* - * Fast path to avoid acquiring the spinlock in case we already know we - * have enough WAL available. This is particularly interesting if we're - * far behind. + * Check if all the standby servers have confirmed receipt of WAL up to + * RecentFlushPtr even when we already know we have enough WAL available. + * + * Note that we cannot directly return without checking the status of + * standby servers because the standby_slot_names may have changed, which + * means there could be new standby slots in the list that have not yet + * caught up to the RecentFlushPtr. */ - if (RecentFlushPtr != InvalidXLogRecPtr && - loc <= RecentFlushPtr) - return RecentFlushPtr; + if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr) + { + FilterStandbySlots(RecentFlushPtr, &standby_slots); I think even if the slot list is not changed, we will always process each slot mentioned in standby_slot_names once. Can't we cache the previous list of slots for we have already waited for? In that case, we won't even need to copy the list via GetStandbySlotList() unless we need to wait. 2. WalSndWaitForWal(XLogRecPtr loc) { + /* + * Update the standby slots that have not yet caught up to the flushed + * position. It is good to wait up to RecentFlushPtr and then let it + * send the changes to logical subscribers one by one which are + * already covered in RecentFlushPtr without needing to wait on every + * change for standby confirmation. + */ + if (wait_for_standby) + FilterStandbySlots(RecentFlushPtr, &standby_slots); + /* Update our idea of the currently flushed position. */ - if (!RecoveryInProgress()) + else if (!RecoveryInProgress()) RecentFlushPtr = GetFlushRecPtr(NULL); else RecentFlushPtr = GetXLogReplayRecPtr(NULL); ... /* * If postmaster asked us to stop, don't wait anymore. * * It's important to do this check after the recomputation of * RecentFlushPtr, so we can send all remaining data before shutting * down. */ if (got_STOPPING) break; I think because 'wait_for_standby' may not be set in the first or consecutive cycles we may send the WAL to the logical subscriber before sending it to the physical subscriber during shutdown. -- With Regards, Amit Kapila.
On Mon, Feb 26, 2024 at 5:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > + if (!ok) > > > > + GUC_check_errdetail("List syntax is invalid."); > > > > + > > > > + /* > > > > + * If there is a syntax error in the name or if the replication slots' > > > > + * data is not initialized yet (i.e., we are in the startup process), skip > > > > + * the slot verification. > > > > + */ > > > > + if (!ok || !ReplicationSlotCtl) > > > > + { > > > > + pfree(rawname); > > > > + list_free(elemlist); > > > > + return ok; > > > > + } > > > > > > > > we are testing the "ok" value twice, what about using if...else if... instead and > > > > test it once? If so, it might be worth to put the: > > > > > > > > " > > > > + pfree(rawname); > > > > + list_free(elemlist); > > > > + return ok; > > > > " > > > > > > > > in a "goto". > > > > > > There were comments to remove the 'goto' statement and avoid > > > duplicate free code, so I prefer the current style. > > > > The duplicate free code would come from the if...else if... rewrite but then > > the "goto" would remove it, so I'm not sure to understand your point. > > > > I think Hou-San wants to say that there was previously a comment to > remove goto and now you are saying to introduce it. But, I think we > can avoid both code duplication and goto, if the first thing we check > in the function is ReplicationSlotCtl and return false if the same is > not set. Won't that be better?

I think we cannot do that, as we need to check at least the syntax before we return due to a NULL ReplicationSlotCtl. We get a NULL ReplicationSlotCtl during instance startup in check_standby_slot_names() as the postmaster first loads the GUC table and then initializes the shared memory for replication slots. See the calls of InitializeGUCOptions() and CreateSharedMemoryAndSemaphores() in PostmasterMain(). FWIW, I do not have any issue with the current code either, but if we have to change it, is [1] any better?

[1]:
check_standby_slot_names()
{
    ....
    if (!ok)
    {
        GUC_check_errdetail("List syntax is invalid.");
    }
    else if (ReplicationSlotCtl)
    {
        foreach-loop for slot validation
    }

    pfree(rawname);
    list_free(elemlist);
    return ok;
}

thanks
Shveta
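For concreteness, [1] filled in with the fragments quoted earlier in the thread could look roughly like the following. This is a sketch only; the body of the else-if branch (the per-slot validation) is left to the patch, and the exact function name/signature may differ.

static bool
validate_standby_slots(char **newval)
{
    char       *rawname;
    List       *elemlist;
    bool        ok;

    /* Need a modifiable copy of the GUC string. */
    rawname = pstrdup(*newval);

    /* Verify syntax and parse the string into a list of identifiers. */
    ok = SplitIdentifierString(rawname, ',', &elemlist);

    if (!ok)
        GUC_check_errdetail("List syntax is invalid.");
    else if (ReplicationSlotCtl)
    {
        /* Slot data is initialized: verify each listed slot here
         * (existence and slot type), setting ok = false on failure. */
    }

    pfree(rawname);
    list_free(elemlist);
    return ok;
}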
Hi, On Mon, Feb 26, 2024 at 05:18:25PM +0530, Amit Kapila wrote: > On Mon, Feb 26, 2024 at 12:59 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > 10 === > > > > > > > > + if (slot->data.invalidated != RS_INVAL_NONE) > > > > + { > > > > + /* > > > > + * Specified physical slot have been invalidated, > > > > so no point > > > > + * in waiting for it. > > > > > > > > We discovered in [1], that if the wal_status is "unreserved" then the slot is still > > > > serving the standby. I think we should handle this case differently, thoughts? > > > > > > I think the 'invalidated' slot can still be used is a separate bug. > > > Because > > > once the slot is invalidated, it can neither protect WALs or ROWs from being > > > removed even if the restart_lsn of the slot can be moved forward after being invalidated. > > > > > > If the standby can move restart_lsn forward for invalidated slots, then > > > it should also set the 'invalidated' flag back to NONE, otherwise the slot > > > cannot serve its purpose anymore. I also reported similar bug before[1]. > > > ... > > > > My point is that I think we should behave like it's not a bug and then adapt the > > code accordingly here (until the bug gets fixed). > > > > oh, I think this doesn't sound like a good idea to me. We should fix > that bug independently rather than adding code in new features to > consider the bug as a valid behavior. Agree, but it all depends if there is a consensus of the other thread being a bug or not. I also think it is but there is this part of the code in pg_get_replication_slots() that makes me think ones could think it is not. " case WALAVAIL_REMOVED: /* * If we read the restart_lsn long enough ago, maybe that file * has been removed by now. However, the walsender could have * moved forward enough that it jumped to another file after * we looked. If checkpointer signalled the process to * termination, then it's definitely lost; but if a process is * still alive, then "unreserved" seems more appropriate. * " Anyway, I also think it is a bug so agree to keep the check as it is currenlty ( and keep an eye on the other thread outcome too). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Mon, Feb 26, 2024 at 05:52:40PM +0530, shveta malik wrote: > On Mon, Feb 26, 2024 at 5:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > + if (!ok) > > > > > + GUC_check_errdetail("List syntax is invalid."); > > > > > + > > > > > + /* > > > > > + * If there is a syntax error in the name or if the replication slots' > > > > > + * data is not initialized yet (i.e., we are in the startup process), skip > > > > > + * the slot verification. > > > > > + */ > > > > > + if (!ok || !ReplicationSlotCtl) > > > > > + { > > > > > + pfree(rawname); > > > > > + list_free(elemlist); > > > > > + return ok; > > > > > + } > > > > > > > > > > we are testing the "ok" value twice, what about using if...else if... instead and > > > > > test it once? If so, it might be worth to put the: > > > > > > > > > > " > > > > > + pfree(rawname); > > > > > + list_free(elemlist); > > > > > + return ok; > > > > > " > > > > > > > > > > in a "goto". > > > > > > > > There were comments to remove the 'goto' statement and avoid > > > > duplicate free code, so I prefer the current style. > > > > > > The duplicate free code would come from the if...else if... rewrite but then > > > the "goto" would remove it, so I'm not sure to understand your point. > > > > > > > I think Hou-San wants to say that there was previously a comment to > > remove goto and now you are saying to introduce it. But, I think we > > can avoid both code duplication and goto, if the first thing we check > > in the function is ReplicationSlotCtl and return false if the same is > > not set. Won't that be better? > > I think we can not do that as we need to check atleast syntax before > we return due to NULL ReplicationSlotCtl. We get NULL > ReplicationSlotCtl during instance startup in > check_standby_slot_names() as postmaster first loads GUC-table and > then initializes shared-memory for replication slots. See calls of > InitializeGUCOptions() and CreateSharedMemoryAndSemaphores() in > PostmasterMain(). FWIW, I do not have any issue with current code as > well, but if we have to change it, is [1] any better? > > [1]: > check_standby_slot_names() > { > .... > if (!ok) > { > GUC_check_errdetail("List syntax is invalid."); > } > else if (ReplicationSlotCtl) > { > foreach-loop for slot validation > } > > pfree(rawname); > list_free(elemlist); > return ok; > } > Yeah thanks, it does not test the "ok" value twice and get rid of the goto while checking the syntax first: I'd vote for it. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Monday, February 26, 2024 1:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Feb 23, 2024 at 4:45 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > On Fri, Feb 23, 2024 at 09:46:00AM +0000, Zhijie Hou (Fujitsu) wrote: > > > > > > Besides, I'd like to clarify and discuss the behavior of standby_slot_names > once. > > > > > > As it stands in the patch, If the slots specified in > > > standby_slot_names are dropped or invalidated, the logical walsender > > > will issue a WARNING and continue to replicate the changes. Another > > > option for this could be to have the walsender pause until the slot > > > in standby_slot_names is re-created or becomes valid again. Does anyone > else have an opinion on this matter ? > > > > Good point, I'd vote for: the only reasons not to wait are: > > > > - slots mentioned in standby_slot_names exist and valid and do catch > > up or > > - standby_slot_names is empty > > > > The reason is that setting standby_slot_names to a non empty value > > means that one wants the walsender to wait until the standby catchup. > > The way to remove this intentional behavior should be by changing the > > standby_slot_names value (not the existence or the state of the slot(s) it > points too). > > > > It seems we already do wait for the case when there is an inactive slot as per the > below code [1] in the patch. So, probably waiting in other cases is also okay and > also as this parameter is a SIGHUP parameter, users should be easily able to > change its value if required. Do you think it is a good idea to mention this in > docs as well? > > I think it is important to raise WARNING as the patch is doing in all the cases > where the slot is not being processed so that users can be notified and they can > take the required action. Agreed. Here is the V99 patch which addressed the above. This version also includes: 1. list_free the slot list when reloading the list due to GUC change. 2. Refactored the validate_standby_slots based on Shveta's suggestion. 3. Added errcode for the warnings as most of existing have errcodes. Amit's latest comments[1] are pending, we will address that in next version. [1] https://www.postgresql.org/message-id/CAA4eK1LJdmGATWG%3DxOD1CB9cogukk2cLNBGH8h-n-ZDJuwBdJg%40mail.gmail.com Best Regards, Hou zj
Attachment
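To illustrate item 3 of the message above (attaching errcodes to the existing warnings), the inactive-slot warning quoted earlier in the thread might end up looking roughly like this. The particular errcode used here is an assumption for illustration, not taken from the patch.

ereport(WARNING,
        errcode(ERRCODE_INVALID_PARAMETER_VALUE),   /* assumed; the patch may pick a different code */
        errmsg("replication slot \"%s\" specified in parameter %s does not have active_pid",
               name, "standby_slot_names"),
        errdetail("Logical replication is waiting on the standby associated with \"%s\".",
                  name),
        errhint("Consider starting standby associated with \"%s\" or amend standby_slot_names.",
                name));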
Here are some review comments for v99-0001 ========== 0. GENERAL. +#standby_slot_names = '' # streaming replication standby server slot names that + # logical walsender processes will wait for IMO the GUC name is too generic. There is nothing in this name to suggest it has anything to do with logical slot synchronization; that meaning is only found in the accompanying comment -- it would be better if the GUC name itself were more self-explanatory. e.g. Maybe like 'wal_sender_sync_standby_slot_names' or 'wal_sender_standby_slot_names', 'wal_sender_wait_for_standby_slots', or ... (Of course, this has some impact across docs and comments and variables in the patch). ========== Commit Message 1. A new parameter named standby_slot_names is introduced. Maybe quote the GUC names here to make it more readable. ~~ 2. Additionally, The SQL functions pg_logical_slot_get_changes and pg_replication_slot_advance are modified to wait for the replication slots mentioned in standby_slot_names to catch up before returning the changes to the user. ~ 2a. "pg_replication_slot_advance" is a typo? Did you mean pg_logical_replication_slot_advance? ~ 2b. The "before returning the changes to the user" seems like it is referring only to the first function. Maybe needs slight rewording like: /before returning the changes to the user./ before returning./ ========== doc/src/sgml/config.sgml 3. standby_slot_names + <para> + List of physical slots guarantees that logical replication slots with + failover enabled do not consume changes until those changes are received + and flushed to corresponding physical standbys. If a logical replication + connection is meant to switch to a physical standby after the standby is + promoted, the physical replication slot for the standby should be listed + here. Note that logical replication will not proceed if the slots + specified in the standby_slot_names do not exist or are invalidated. + </para> The wording doesn't seem right. IMO this should be worded much like how this GUC is described in guc_tables.c e.g something a bit like: Lists the streaming replication standby server slot names that logical WAL sender processes will wait for. Logical WAL sender processes will send decoded changes to plugins only after the specified replication slots confirm receiving WAL. This guarantees that logical replication slots with failover enabled do not consume changes until those changes are received and flushed to corresponding physical standbys... ========== doc/src/sgml/logicaldecoding.sgml 4. Section 48.2.3 Replication Slot Synchronization + It's also highly recommended that the said physical replication slot + is named in + <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link> + list on the primary, to prevent the subscriber from consuming changes + faster than the hot standby. But once we configure it, then certain latency + is expected in sending changes to logical subscribers due to wait on + physical replication slots in + <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link> 4a. /It's also highly/It is highly/ ~ 4b. BEFORE But once we configure it, then certain latency is expected in sending changes to logical subscribers due to wait on physical replication slots in <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link> SUGGESTION Even when correctly configured, some latency is expected when sending changes to logical subscribers due to the waiting on slots named in standby_slot_names. 
========== .../replication/logical/logicalfuncs.c 5. pg_logical_slot_get_changes_guts + if (XLogRecPtrIsInvalid(upto_lsn)) + wait_for_wal_lsn = end_of_wal; + else + wait_for_wal_lsn = Min(upto_lsn, end_of_wal); + + /* + * Wait for specified streaming replication standby servers (if any) + * to confirm receipt of WAL up to wait_for_wal_lsn. + */ + WaitForStandbyConfirmation(wait_for_wal_lsn); Perhaps those statements all belong together with the comment up-front. e.g. + /* + * Wait for specified streaming replication standby servers (if any) + * to confirm receipt of WAL up to wait_for_wal_lsn. + */ + if (XLogRecPtrIsInvalid(upto_lsn)) + wait_for_wal_lsn = end_of_wal; + else + wait_for_wal_lsn = Min(upto_lsn, end_of_wal); + WaitForStandbyConfirmation(wait_for_wal_lsn); ========== src/backend/replication/logical/slotsync.c ========== src/backend/replication/slot.c 6. +static bool +validate_standby_slots(char **newval) +{ + char *rawname; + List *elemlist; + ListCell *lc; + bool ok; + + /* Need a modifiable copy of string */ + rawname = pstrdup(*newval); + + /* Verify syntax and parse string into a list of identifiers */ + ok = SplitIdentifierString(rawname, ',', &elemlist); + + if (!ok) + { + GUC_check_errdetail("List syntax is invalid."); + } + + /* + * If the replication slots' data have been initialized, verify if the + * specified slots exist and are logical slots. + */ + else if (ReplicationSlotCtl) + { + foreach(lc, elemlist) 6a. So, if the ReplicationSlotCtl is NULL, it is possible to return ok=true without ever checking if the slots exist or are of the correct kind. I am wondering what are the ramifications of that. -- e.g. assuming names are OK when maybe they aren't OK at all. AFAICT this works because it relies on getting subsequent WARNINGS when calling FilterStandbySlots(). If that is correct then maybe the comment here can be enhanced to say so. Indeed, if it works like that, now I am wondering do we need this for loop validation at all. e.g. it seems just a matter of timing whether we get ERRORs validating the GUC here, or WARNINGS later in the FilterStandbySlots. Maybe we don't need the double-checking and it is enough to check in FilterStandbySlots? ~ 6b. AFAIK there are alternative foreach macros available now, so you shouldn't need to declare the ListCell. ~~~ 7. check_standby_slot_names +bool +check_standby_slot_names(char **newval, void **extra, GucSource source) +{ + if (strcmp(*newval, "") == 0) + return true; Using strcmp seems like an overkill way to check for empty string. SUGGESTION if (*newval == '\0') return true; ~~~ 8. + if (strcmp(*newval, "*") == 0) + { + GUC_check_errdetail("\"%s\" is not accepted for standby_slot_names", + *newval); + return false; + } It seems overkill to use a format specifier when "*" is already the known value. SUGGESTION GUC_check_errdetail("Wildcard \"*\" is not accepted for standby_slot_names."); ~~~ 9. + /* Now verify if the specified slots really exist and have correct type */ + if (!validate_standby_slots(newval)) + return false; As in a prior comment, if ReplicationSlotCtl is NULL then it is not always going to do exactly what that comment says it is doing... ~~~ 10. assign_standby_slot_names + if (!SplitIdentifierString(standby_slot_names_cpy, ',', &standby_slots)) + { + /* This should not happen if GUC checked check_standby_slot_names. */ + elog(ERROR, "invalid list syntax"); + } I didn't see how it is possible to get here without having first executed check_standby_slot_names. 
But, if it can happen, then maybe describe the scenario in the comment. ~~~ 11. + * Note that since we do not support syncing slots to cascading standbys, we + * return NIL if we are running in a standby to indicate that no standby slots + * need to be waited for, regardless of the copy flag value. I didn't understand the relevance of even mentioning "regardless of the copy flag value". ~~~ 12. FilterStandbySlots + errhint("Consider starting standby associated with \"%s\" or amend standby_slot_names.", + name)); This errhint should use a format substitution for the GUC "standby_slot_names" for consistency with everything else. ~~~ 13. WaitForStandbyConfirmation + /* + * We wait for the slots in the standby_slot_names to catch up, but we + * use a timeout (1s) so we can also check the if the + * standby_slot_names has been changed. + */ + ConditionVariableTimedSleep(&WalSndCtl->wal_confirm_rcv_cv, 1000, + WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION); Typo "the if the" ========== src/backend/replication/slotfuncs.c 14. pg_physical_replication_slot_advance + + PhysicalWakeupLogicalWalSnd(); Should this have a comment to say what it is for? ========== src/backend/replication/walsender.c 15. +/* + * Wake up the logical walsender processes with failover enabled slots if the + * currently acquired physical slot is specified in standby_slot_names GUC. + */ +void +PhysicalWakeupLogicalWalSnd(void) +{ + ListCell *lc; + List *standby_slots; + + Assert(MyReplicationSlot && SlotIsPhysical(MyReplicationSlot)); + + standby_slots = GetStandbySlotList(false); + + foreach(lc, standby_slots) + { + char *name = lfirst(lc); + + if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0) + { + ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv); + return; + } + } +} 15a. There already exists another function called WalSndWakeup(bool physical, bool logical), so I think this new one should use a similar name pattern -- e.g. maybe like WalSndWakeupLogicalForSlotSync or ... ~ 15b. IIRC there are some new List macros you can use instead of needing to declare the ListCell? ========== .../utils/activity/wait_event_names.txt 16. +WAIT_FOR_STANDBY_CONFIRMATION "Waiting for the WAL to be received by physical standby." Moving the 'the' will make this more consistent with all other "Waiting for WAL..." names. SUGGESTION Waiting for WAL to be received by the physical standby. ========== src/backend/utils/misc/guc_tables.c 17. + { + {"standby_slot_names", PGC_SIGHUP, REPLICATION_PRIMARY, + gettext_noop("Lists streaming replication standby server slot " + "names that logical WAL sender processes will wait for."), + gettext_noop("Decoded changes are sent out to plugins by logical " + "WAL sender processes only after specified " + "replication slots confirm receiving WAL."), + GUC_LIST_INPUT | GUC_LIST_QUOTE + }, + &standby_slot_names, + "", + check_standby_slot_names, assign_standby_slot_names, NULL + }, The wording of the detail msg feels kind of backwards to me. BEFORE Decoded changes are sent out to plugins by logical WAL sender processes only after specified replication slots confirm receiving WAL. SUGGESTION Logical WAL sender processes will send decoded changes to plugins only after the specified replication slots confirm receiving WAL. ========== src/backend/utils/misc/postgresql.conf.sample 18. +#standby_slot_names = '' # streaming replication standby server slot names that + # logical walsender processes will wait for I'm not sure this is the best GUC name. See the general comment #0 above in this post. 
========== src/include/replication/slot.h ========== src/include/replication/walsender.h ========== src/include/replication/walsender_private.h ========== src/include/utils/guc_hooks.h ========== src/test/recovery/t/006_logical_decoding.pl 19. +# Pass failover=true (last-arg), it should not have any impact on advancing. SUGGESTION Passing failover=true (last arg) should not have any impact on advancing. ========== .../t/040_standby_failover_slots_sync.pl 20. +# +# | ----> standby1 (primary_slot_name = sb1_slot) +# | ----> standby2 (primary_slot_name = sb2_slot) +# primary ----- | +# | ----> subscriber1 (failover = true) +# | ----> subscriber2 (failover = false) In the diagram, the "--->" means a mixture of physical standbys and logical pub/sub replication. Maybe it can be a bit clearer? SUGGESTION # primary (publisher) # # (physical standbys) # | ----> standby1 (primary_slot_name = sb1_slot) # | ----> standby2 (primary_slot_name = sb2_slot) # # (logical replication) # | ----> subscriber1 (failover = true, slot_name = lsub1_slot) # | ----> subscriber2 (failover = false, slot_name = lsub2_slot) ~~~ 21. +# Set up is configured in such a way that the logical slot of subscriber1 is +# enabled failover, thus it will wait for the physical slot of +# standby1(sb1_slot) to catch up before sending decoded changes to subscriber1. /is enabled failover/is enabled for failover/ ~~~ 22. +# Create another subscriber node without enabling failover, wait for sync to +# complete +my $subscriber2 = PostgreSQL::Test::Cluster->new('subscriber2'); +$subscriber2->init; +$subscriber2->start; +$subscriber2->safe_psql( + 'postgres', qq[ + CREATE TABLE tab_int (a int PRIMARY KEY); + CREATE SUBSCRIPTION regress_mysub2 CONNECTION '$publisher_connstr' PUBLICATION regress_mypub WITH (slot_name = lsub2_slot); +]); + +$subscriber1->wait_for_subscription_sync; + Is this meant to wait for 'subscription2'? ~~~ 23. # Stop the standby associated with the specified physical replication slot so # that the logical replication slot won't receive changes until the standby # comes up. Maybe this can give the values for better understanding: SUGGESTION Stop the standby associated with the specified physical replication slot (sb1_slot) so that the logical replication slot (lsub1_slot) won't receive changes until the standby comes up. ~~~ 24. +# Wait for the standby that's up and running gets the data from primary SUGGESTION Wait until the standby2 that's still running gets the data from the primary. ~~~ 25. +# Wait for the subscription that's up and running and is not enabled for failover. +# It gets the data from primary without waiting for any standbys. SUGGESTION Wait for subscription2 to get the data from the primary. This subscription was not enabled for failover so it gets the data without waiting for any standbys. ~~~ 26. +# The subscription that's up and running and is enabled for failover +# doesn't get the data from primary and keeps waiting for the +# standby specified in standby_slot_names. SUGGESTION The subscription1 was enabled for failover so it doesn't get the data from primary and keeps waiting for the standby specified in standby_slot_names (sb1_slot aka standby1). ~~~ 27. +# Start the standby specified in standby_slot_names and wait for it to catch +# up with the primary. SUGGESTION Start the standby specified in standby_slot_names (sb1_slot aka standby1) and wait for it to catch up with the primary. ~~~ 28. 
+# Now that the standby specified in standby_slot_names is up and running, +# primary must send the decoded changes to subscription enabled for failover +# While the standby was down, this subscriber didn't receive any data from +# primary i.e. the primary didn't allow it to go ahead of standby. SUGGESTION Now that the standby specified in standby_slot_names is up and running, the primary can send the decoded changes to the subscription enabled for failover (i.e. subscription1). While the standby was down, subscription1 didn't receive any data from the primary. i.e. the primary didn't allow it to go ahead of standby. ~~~ 29. +# Stop the standby associated with the specified physical replication slot so +# that the logical replication slot won't receive changes until the standby +# slot's restart_lsn is advanced or the slot is removed from the +# standby_slot_names list. +$primary->safe_psql('postgres', "TRUNCATE tab_int;"); +$primary->wait_for_catchup('regress_mysub1'); +$standby1->stop; Isn't this fragment more like the first step of the *next* TEST instead of the last step of this one? ~~~ 30. +################################################## +# Verify that when using pg_logical_slot_get_changes to consume changes from a +# logical slot with failover enabled, it will also wait for the slots specified +# in standby_slot_names to catch up. +################################################## AFAICT this test is checking only that the function cannot return while waiting for the stopped standby, but it doesn't seem to check that it *does* return when the stopped standby comes alive again. ~~~ 31. +$result = + $subscriber1->safe_psql('postgres', "SELECT count(*) = 0 FROM tab_int;"); +is($result, 't', + "subscriber1 doesn't get data as the sb1_slot doesn't catch up"); Do you think this fragment should have a comment? ====== Kind Regards, Peter Smith. Fujitsu Australia
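Picking up comments #7 and #8 from the review above, here is a sketch of the suggested checks at the top of check_standby_slot_names(). Note that newval is a char **, so the empty-string test has to look at the first character of *newval rather than the pointer itself; apart from that, this just restates the review suggestions.

/* Sketch of the empty-string and wildcard checks from comments #7 and #8. */
if ((*newval)[0] == '\0')
    return true;

if (strcmp(*newval, "*") == 0)
{
    GUC_check_errdetail("Wildcard \"*\" is not accepted for standby_slot_names.");
    return false;
}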
Hi, On Tue, Feb 27, 2024 at 06:17:44PM +1100, Peter Smith wrote: > +static bool > +validate_standby_slots(char **newval) > +{ > + char *rawname; > + List *elemlist; > + ListCell *lc; > + bool ok; > + > + /* Need a modifiable copy of string */ > + rawname = pstrdup(*newval); > + > + /* Verify syntax and parse string into a list of identifiers */ > + ok = SplitIdentifierString(rawname, ',', &elemlist); > + > + if (!ok) > + { > + GUC_check_errdetail("List syntax is invalid."); > + } > + > + /* > + * If the replication slots' data have been initialized, verify if the > + * specified slots exist and are logical slots. > + */ > + else if (ReplicationSlotCtl) > + { > + foreach(lc, elemlist) > > 6a. > So, if the ReplicationSlotCtl is NULL, it is possible to return > ok=true without ever checking if the slots exist or are of the correct > kind. I am wondering what are the ramifications of that. -- e.g. > assuming names are OK when maybe they aren't OK at all. AFAICT this > works because it relies on getting subsequent WARNINGS when calling > FilterStandbySlots(). If that is correct then maybe the comment here > can be enhanced to say so. > > Indeed, if it works like that, now I am wondering do we need this for > loop validation at all. e.g. it seems just a matter of timing whether > we get ERRORs validating the GUC here, or WARNINGS later in the > FilterStandbySlots. Maybe we don't need the double-checking and it is > enough to check in FilterStandbySlots? Good point, I have the feeling that it is enough to check in FilterStandbySlots(). Indeed, if the value is syntactically correct, then I think that its actual value "really" matters when the logical decoding is starting/running, does it provide additional benefits "before" that? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Feb 27, 2024 at 4:07 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Tue, Feb 27, 2024 at 06:17:44PM +1100, Peter Smith wrote: > > +static bool > > +validate_standby_slots(char **newval) > > +{ > > + char *rawname; > > + List *elemlist; > > + ListCell *lc; > > + bool ok; > > + > > + /* Need a modifiable copy of string */ > > + rawname = pstrdup(*newval); > > + > > + /* Verify syntax and parse string into a list of identifiers */ > > + ok = SplitIdentifierString(rawname, ',', &elemlist); > > + > > + if (!ok) > > + { > > + GUC_check_errdetail("List syntax is invalid."); > > + } > > + > > + /* > > + * If the replication slots' data have been initialized, verify if the > > + * specified slots exist and are logical slots. > > + */ > > + else if (ReplicationSlotCtl) > > + { > > + foreach(lc, elemlist) > > > > 6a. > > So, if the ReplicationSlotCtl is NULL, it is possible to return > > ok=true without ever checking if the slots exist or are of the correct > > kind. I am wondering what are the ramifications of that. -- e.g. > > assuming names are OK when maybe they aren't OK at all. AFAICT this > > works because it relies on getting subsequent WARNINGS when calling > > FilterStandbySlots(). If that is correct then maybe the comment here > > can be enhanced to say so. > > > > Indeed, if it works like that, now I am wondering do we need this for > > loop validation at all. e.g. it seems just a matter of timing whether > > we get ERRORs validating the GUC here, or WARNINGS later in the > > FilterStandbySlots. Maybe we don't need the double-checking and it is > > enough to check in FilterStandbySlots? > > Good point, I have the feeling that it is enough to check in FilterStandbySlots(). > I think it is better if we get earlier in a case where the parameter is changed and performed SIGHUP instead of waiting till we get to logical decoding. So, there is merit in keeping these checks during initial validation. -- With Regards, Amit Kapila.
On Tue, Feb 27, 2024 at 12:48 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v99-0001 > > ========== > 0. GENERAL. > > +#standby_slot_names = '' # streaming replication standby server slot names that > + # logical walsender processes will wait for > > IMO the GUC name is too generic. There is nothing in this name to > suggest it has anything to do with logical slot synchronization; that > meaning is only found in the accompanying comment -- it would be > better if the GUC name itself were more self-explanatory. > > e.g. Maybe like 'wal_sender_sync_standby_slot_names' or > 'wal_sender_standby_slot_names', 'wal_sender_wait_for_standby_slots', > or ... > It would be wrong and/or misleading to append wal_sender to this GUC name, as it is also used by the SQL APIs. Also, adding "wait" makes it sound more like a boolean. So, I don't see the proposed names as any better than the current one. > ~~~ > > 9. > + /* Now verify if the specified slots really exist and have correct type */ > + if (!validate_standby_slots(newval)) > + return false; > > As in a prior comment, if ReplicationSlotCtl is NULL then it is not > always going to do exactly what that comment says it is doing... > It will do what the comment says when invoked as part of the SIGHUP signal. I think the current comment is okay. -- With Regards, Amit Kapila.
On Tuesday, February 27, 2024 3:18 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v99-0001 Thanks for the comments! > Commit Message > > 1. > A new parameter named standby_slot_names is introduced. > > Maybe quote the GUC names here to make it more readable. Added. > > ~~ > > 2. > Additionally, The SQL functions pg_logical_slot_get_changes and > pg_replication_slot_advance are modified to wait for the replication slots > mentioned in standby_slot_names to catch up before returning the changes to > the user. > > ~ > > 2a. > "pg_replication_slot_advance" is a typo? Did you mean > pg_logical_replication_slot_advance? pg_logical_replication_slot_advance is not a user visible function. So the pg_replication_slot_advance is correct. > > ~ > > 2b. > The "before returning the changes to the user" seems like it is referring only to > the first function. > > Maybe needs slight rewording like: > /before returning the changes to the user./ before returning./ Changed. > > ========== > doc/src/sgml/config.sgml > > 3. standby_slot_names > > + <para> > + List of physical slots guarantees that logical replication slots with > + failover enabled do not consume changes until those changes > are received > + and flushed to corresponding physical standbys. If a logical > replication > + connection is meant to switch to a physical standby after the > standby is > + promoted, the physical replication slot for the standby > should be listed > + here. Note that logical replication will not proceed if the slots > + specified in the standby_slot_names do not exist or are invalidated. > + </para> > > The wording doesn't seem right. IMO this should be worded much like how this > GUC is described in guc_tables.c > > e.g something a bit like: > > Lists the streaming replication standby server slot names that logical WAL > sender processes will wait for. Logical WAL sender processes will send > decoded changes to plugins only after the specified replication slots confirm > receiving WAL. This guarantees that logical replication slots with failover > enabled do not consume changes until those changes are received and flushed > to corresponding physical standbys... Changed. > > ========== > doc/src/sgml/logicaldecoding.sgml > > 4. Section 48.2.3 Replication Slot Synchronization > > + It's also highly recommended that the said physical replication slot > + is named in > + <link > linkend="guc-standby-slot-names"><varname>standby_slot_names</varna > me></link> > + list on the primary, to prevent the subscriber from consuming changes > + faster than the hot standby. But once we configure it, then > certain latency > + is expected in sending changes to logical subscribers due to wait on > + physical replication slots in > + <link > + > linkend="guc-standby-slot-names"><varname>standby_slot_names</varna > me> > + </link> > > 4a. > /It's also highly/It is highly/ > > ~ > > 4b. > > BEFORE > But once we configure it, then certain latency is expected in sending changes > to logical subscribers due to wait on physical replication slots in <link > linkend="guc-standby-slot-names"><varname>standby_slot_names</varna > me></link> > > SUGGESTION > Even when correctly configured, some latency is expected when sending > changes to logical subscribers due to the waiting on slots named in > standby_slot_names. Changed. > > ========== > .../replication/logical/logicalfuncs.c > > 5. 
pg_logical_slot_get_changes_guts > > + if (XLogRecPtrIsInvalid(upto_lsn)) > + wait_for_wal_lsn = end_of_wal; > + else > + wait_for_wal_lsn = Min(upto_lsn, end_of_wal); > + > + /* > + * Wait for specified streaming replication standby servers (if any) > + * to confirm receipt of WAL up to wait_for_wal_lsn. > + */ > + WaitForStandbyConfirmation(wait_for_wal_lsn); > > Perhaps those statements all belong together with the comment up-front. e.g. > > + /* > + * Wait for specified streaming replication standby servers (if any) > + * to confirm receipt of WAL up to wait_for_wal_lsn. > + */ > + if (XLogRecPtrIsInvalid(upto_lsn)) > + wait_for_wal_lsn = end_of_wal; > + else > + wait_for_wal_lsn = Min(upto_lsn, end_of_wal); > + WaitForStandbyConfirmation(wait_for_wal_lsn); Changed. > > ========== > src/backend/replication/logical/slotsync.c > > ========== > src/backend/replication/slot.c > > 6. > +static bool > +validate_standby_slots(char **newval) > +{ > + char *rawname; > + List *elemlist; > + ListCell *lc; > + bool ok; > + > + /* Need a modifiable copy of string */ rawname = pstrdup(*newval); > + > + /* Verify syntax and parse string into a list of identifiers */ ok = > + SplitIdentifierString(rawname, ',', &elemlist); > + > + if (!ok) > + { > + GUC_check_errdetail("List syntax is invalid."); } > + > + /* > + * If the replication slots' data have been initialized, verify if the > + * specified slots exist and are logical slots. > + */ > + else if (ReplicationSlotCtl) > + { > + foreach(lc, elemlist) > > 6a. > So, if the ReplicationSlotCtl is NULL, it is possible to return ok=true without > ever checking if the slots exist or are of the correct kind. I am wondering what > are the ramifications of that. -- e.g. > assuming names are OK when maybe they aren't OK at all. AFAICT this works > because it relies on getting subsequent WARNINGS when calling > FilterStandbySlots(). If that is correct then maybe the comment here can be > enhanced to say so. > > Indeed, if it works like that, now I am wondering do we need this for loop > validation at all. e.g. it seems just a matter of timing whether we get ERRORs > validating the GUC here, or WARNINGS later in the FilterStandbySlots. Maybe > we don't need the double-checking and it is enough to check in > FilterStandbySlots? I think the check is OK so didn’t change this. > > ~ > > 6b. > AFAIK there are alternative foreach macros available now, so you shouldn't > need to declare the ListCell. Changed. > > ~~~ > > 7. check_standby_slot_names > > +bool > +check_standby_slot_names(char **newval, void **extra, GucSource source) > +{ if (strcmp(*newval, "") == 0) return true; > > Using strcmp seems like an overkill way to check for empty string. > > SUGGESTION > > if (*newval == '\0') > return true; > Changed. > ~~~ > > 8. > + if (strcmp(*newval, "*") == 0) > + { > + GUC_check_errdetail("\"%s\" is not accepted for standby_slot_names", > + *newval); return false; } > > It seems overkill to use a format specifier when "*" is already the known value. > > SUGGESTION > GUC_check_errdetail("Wildcard \"*\" is not accepted for > standby_slot_names."); > Changed. > ~~~ > > 9. > + /* Now verify if the specified slots really exist and have correct > + type */ if (!validate_standby_slots(newval)) return false; > > As in a prior comment, if ReplicationSlotCtl is NULL then it is not always going > to do exactly what that comment says it is doing... I think the comment is OK, one can check the detail in the function definition if needed. > > ~~~ > > 10. 
assign_standby_slot_names > > + if (!SplitIdentifierString(standby_slot_names_cpy, ',', > + &standby_slots)) { > + /* This should not happen if GUC checked check_standby_slot_names. */ > + elog(ERROR, "invalid list syntax"); } > > I didn't see how it is possible to get here without having first executed > check_standby_slot_names. But, if it can happen, then maybe describe the > scenario in the comment. This is sanity check which we don't expect to happen, which follows similar style of preprocessNamespacePath. > > ~~~ > > 11. > + * Note that since we do not support syncing slots to cascading > + standbys, we > + * return NIL if we are running in a standby to indicate that no > + standby slots > + * need to be waited for, regardless of the copy flag value. > > I didn't understand the relevance of even mentioning "regardless of the copy > flag value". Removed. > > ~~~ > > 12. FilterStandbySlots > > + errhint("Consider starting standby associated with \"%s\" or amend > standby_slot_names.", > + name)); > > This errhint should use a format substitution for the GUC "standby_slot_names" > for consistency with everything else. Changed. > > ~~~ > > 13. WaitForStandbyConfirmation > > + /* > + * We wait for the slots in the standby_slot_names to catch up, but we > + * use a timeout (1s) so we can also check the if the > + * standby_slot_names has been changed. > + */ > + ConditionVariableTimedSleep(&WalSndCtl->wal_confirm_rcv_cv, 1000, > + WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION); > > Typo "the if the" > Changed. > ========== > src/backend/replication/slotfuncs.c > > 14. pg_physical_replication_slot_advance > + > + PhysicalWakeupLogicalWalSnd(); > > Should this have a comment to say what it is for? > Added. > ========== > src/backend/replication/walsender.c > > 15. > +/* > + * Wake up the logical walsender processes with failover enabled slots > +if the > + * currently acquired physical slot is specified in standby_slot_names GUC. > + */ > +void > +PhysicalWakeupLogicalWalSnd(void) > +{ > + ListCell *lc; > + List *standby_slots; > + > + Assert(MyReplicationSlot && SlotIsPhysical(MyReplicationSlot)); > + > + standby_slots = GetStandbySlotList(false); > + > + foreach(lc, standby_slots) > + { > + char *name = lfirst(lc); > + > + if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0) { > +ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv); > + return; > + } > + } > +} > > 15a. > There already exists another function called WalSndWakeup(bool physical, > bool logical), so I think this new one should use a similar name pattern -- e.g. > maybe like WalSndWakeupLogicalForSlotSync or ... WalSndWakeup is a general function for both physical and logical sender, but our new function is specific to physical sender which is more similar to PhysicalConfirmReceivedLocation/ PhysicalReplicationSlotNewXmin, so I think the current name is ok. > > ~ > > 15b. > IIRC there are some new List macros you can use instead of needing to declare > the ListCell? Changed. > > ========== > .../utils/activity/wait_event_names.txt > > 16. > +WAIT_FOR_STANDBY_CONFIRMATION "Waiting for the WAL to be received > by > physical standby." > > Moving the 'the' will make this more consistent with all other "Waiting for > WAL..." names. > > SUGGESTION > Waiting for WAL to be received by the physical standby. Changed. > > ========== > src/backend/utils/misc/guc_tables.c > > 17. 
> + { > + {"standby_slot_names", PGC_SIGHUP, REPLICATION_PRIMARY, > + gettext_noop("Lists streaming replication standby server slot " > + "names that logical WAL sender processes will wait for."), > + gettext_noop("Decoded changes are sent out to plugins by logical " > + "WAL sender processes only after specified " > + "replication slots confirm receiving WAL."), GUC_LIST_INPUT | > + GUC_LIST_QUOTE }, &standby_slot_names, "", check_standby_slot_names, > + assign_standby_slot_names, NULL }, > > The wording of the detail msg feels kind of backwards to me. > > BEFORE > Decoded changes are sent out to plugins by logical WAL sender processes > only after specified replication slots confirm receiving WAL. > > SUGGESTION > Logical WAL sender processes will send decoded changes to plugins only after > the specified replication slots confirm receiving WAL. Changed. > > ========== > src/backend/utils/misc/postgresql.conf.sample > > 18. > +#standby_slot_names = '' # streaming replication standby server slot > +names that # logical walsender processes will wait for > > I'm not sure this is the best GUC name. See the general comment #0 above in > this post. As discussed, I didn’t change this. > > ========== > src/include/replication/slot.h > > ========== > src/include/replication/walsender.h > > ========== > src/include/replication/walsender_private.h > > ========== > src/include/utils/guc_hooks.h > > ========== > src/test/recovery/t/006_logical_decoding.pl > > 19. > +# Pass failover=true (last-arg), it should not have any impact on advancing. > > SUGGESTION > Passing failover=true (last arg) should not have any impact on advancing. Changed. > > ========== > .../t/040_standby_failover_slots_sync.pl > > 20. > +# > +# | ----> standby1 (primary_slot_name = sb1_slot) # | ----> standby2 > +(primary_slot_name = sb2_slot) # primary ----- | # | ----> subscriber1 > +(failover = true) # | ----> subscriber2 (failover = false) > > In the diagram, the "--->" means a mixture of physical standbys and logical > pub/sub replication. Maybe it can be a bit clearer? > > SUGGESTION > > # primary (publisher) > # > # (physical standbys) > # | ----> standby1 (primary_slot_name = sb1_slot) > # | ----> standby2 (primary_slot_name = sb2_slot) > # > # (logical replication) > # | ----> subscriber1 (failover = true, slot_name = lsub1_slot) > # | ----> subscriber2 (failover = false, slot_name = lsub2_slot) > I think one can distinguish it based on the 'standby' and 'subscriber' as well, because 'standby' normally refer to physical standby while the other refer to logical. > ~~~ > > 21. > +# Set up is configured in such a way that the logical slot of > +subscriber1 is # enabled failover, thus it will wait for the physical > +slot of # standby1(sb1_slot) to catch up before sending decoded changes to > subscriber1. > > /is enabled failover/is enabled for failover/ Changed. > > ~~~ > > 22. > +# Create another subscriber node without enabling failover, wait for > +sync to # complete my $subscriber2 = > +PostgreSQL::Test::Cluster->new('subscriber2'); > +$subscriber2->init; > +$subscriber2->start; > +$subscriber2->safe_psql( > + 'postgres', qq[ > + CREATE TABLE tab_int (a int PRIMARY KEY); CREATE SUBSCRIPTION > +regress_mysub2 CONNECTION '$publisher_connstr' > PUBLICATION regress_mypub WITH (slot_name = lsub2_slot); > +]); > + > +$subscriber1->wait_for_subscription_sync; > + > > Is this meant to wait for 'subscription2'? Yes, fixed. > > ~~~ > > 23. 
> # Stop the standby associated with the specified physical replication slot so # > that the logical replication slot won't receive changes until the standby # > comes up. > > Maybe this can give the values for better understanding: > > SUGGESTION > Stop the standby associated with the specified physical replication slot > (sb1_slot) so that the logical replication slot (lsub1_slot) won't receive changes > until the standby comes up. Changed. > > ~~~ > > 24. > +# Wait for the standby that's up and running gets the data from primary > > SUGGESTION > Wait until the standby2 that's still running gets the data from the primary. > Changed. > ~~~ > > 25. > +# Wait for the subscription that's up and running and is not enabled > for failover. > +# It gets the data from primary without waiting for any standbys. > > SUGGESTION > Wait for subscription2 to get the data from the primary. This subscription was > not enabled for failover so it gets the data without waiting for any standbys. > Changed. > ~~~ > > 26. > +# The subscription that's up and running and is enabled for failover # > +doesn't get the data from primary and keeps waiting for the # standby > +specified in standby_slot_names. > > SUGGESTION > The subscription1 was enabled for failover so it doesn't get the data from > primary and keeps waiting for the standby specified in standby_slot_names > (sb1_slot aka standby1). > Changed. > ~~~ > > 27. > +# Start the standby specified in standby_slot_names and wait for it to > +catch # up with the primary. > > SUGGESTION > Start the standby specified in standby_slot_names (sb1_slot aka > standby1) and wait for it to catch up with the primary. > Changed. > ~~~ > > 28. > +# Now that the standby specified in standby_slot_names is up and > +running, # primary must send the decoded changes to subscription > +enabled for failover # While the standby was down, this subscriber > +didn't receive any data from # primary i.e. the primary didn't allow it to go > ahead of standby. > > SUGGESTION > Now that the standby specified in standby_slot_names is up and running, the > primary can send the decoded changes to the subscription enabled for failover > (i.e. subscription1). While the standby was down, > subscription1 didn't receive any data from the primary. i.e. the primary didn't > allow it to go ahead of standby. > Changed. > ~~~ > > 29. > +# Stop the standby associated with the specified physical replication > +slot so # that the logical replication slot won't receive changes until > +the standby # slot's restart_lsn is advanced or the slot is removed > +from the # standby_slot_names list. > +$primary->safe_psql('postgres', "TRUNCATE tab_int;"); > +$primary->wait_for_catchup('regress_mysub1'); > +$standby1->stop; > > Isn't this fragment more like the first step of the *next* TEST instead of the last > step of this one? > Changed. > ~~~ > > 30. > +################################################## > +# Verify that when using pg_logical_slot_get_changes to consume changes > +from a # logical slot with failover enabled, it will also wait for the > +slots specified # in standby_slot_names to catch up. > +################################################## > > AFAICT this test is checking only that the function cannot return while waiting > for the stopped standby, but it doesn't seem to check that it *does* return > when the stopped standby comes alive again. > Will think about this. > ~~~ > > 31. 
> +$result = > + $subscriber1->safe_psql('postgres', "SELECT count(*) = 0 FROM > +tab_int;"); is($result, 't', > + "subscriber1 doesn't get data as the sb1_slot doesn't catch up"); > > Do you think this fragment should have a comment? Added. Attach the V100 patch set which addressed above comments. Best Regards, Hou zj
Attachment
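To make the pieces discussed in review comments 6 through 9 above easier to follow as a whole, the check hook under review roughly takes the shape below once the empty-string fast path, the wildcard rejection, and the list validation are put together. This is a condensed illustration reconstructed from the quoted hunks rather than the exact patch text, and it assumes validate_standby_slots() sits in the same file (slot.c), as it does in the patch:

#include "postgres.h"
#include "replication/slot.h"
#include "utils/guc.h"
#include "utils/guc_hooks.h"

bool
check_standby_slot_names(char **newval, void **extra, GucSource source)
{
	/* The default (empty) value needs no validation. */
	if (**newval == '\0')
		return true;

	/*
	 * Unlike synchronous_standby_names, the wildcard form is rejected
	 * outright (review comment 8 above).
	 */
	if (strcmp(*newval, "*") == 0)
	{
		GUC_check_errdetail("Wildcard \"*\" is not accepted for standby_slot_names.");
		return false;
	}

	/*
	 * Check the list syntax and, once the shared slot data exists, verify
	 * that every named slot is an existing physical slot (review comment 6).
	 */
	return validate_standby_slots(newval);
}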
On Mon, Feb 26, 2024 at 9:13 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Feb 23, 2024 at 7:41 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > I think to set secure search path for remote connection, the standard approach > > > could be to extend the code in libpqrcv_connect[1], so that we don't need to schema > > > qualify all the operators in the queries. > > > > > > And for local connection, I agree it's also needed to add a > > > SetConfigOption("search_path", "" call in the slotsync worker. > > > > > > [1] > > > libpqrcv_connect > > > ... > > > if (logical) > > > ... > > > res = libpqrcv_PQexec(conn->streamConn, > > > ALWAYS_SECURE_SEARCH_PATH_SQL); > > > > > > > Agree, something like in the attached? (it's .txt to not disturb the CF bot). > > Thanks for the patch, changes look good. I have corporated it in the > patch which addresses the rest of your comments in [1]. I have > attached the patch as .txt > Few comments: =============== 1. - if (logical) + if (logical || !replication) { Can we add a comment about connection types that require ALWAYS_SECURE_SEARCH_PATH_SQL? 2. Can we add a test case to demonstrate that the '=' operator can be hijacked to do different things when the slotsync worker didn't use ALWAYS_SECURE_SEARCH_PATH_SQL? -- With Regards, Amit Kapila.
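On point 1, one possible wording for such a comment is sketched below. The condition and the ALWAYS_SECURE_SEARCH_PATH_SQL call are the ones from the hunk quoted earlier in the thread; the comment text itself is only a suggestion, and the existing result checking that follows the call is left out:

	/*
	 * Any connection that can run SQL must get a known, empty search_path:
	 * regular (non-replication) connections such as the one made by the
	 * slot synchronization worker, and logical replication connections,
	 * which can also execute SQL over the replication protocol.  Physical
	 * replication connections never run SQL commands, so they are exempt.
	 */
	if (logical || !replication)
	{
		PGresult   *res;

		res = libpqrcv_PQexec(conn->streamConn,
							  ALWAYS_SECURE_SEARCH_PATH_SQL);
		/* ... existing result checking and error reporting, unchanged ... */
	}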
Hi, On Wed, Feb 28, 2024 at 08:49:19AM +0530, Amit Kapila wrote: > On Mon, Feb 26, 2024 at 9:13 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Fri, Feb 23, 2024 at 7:41 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > Hi, > > > > I think to set secure search path for remote connection, the standard approach > > > > could be to extend the code in libpqrcv_connect[1], so that we don't need to schema > > > > qualify all the operators in the queries. > > > > > > > > And for local connection, I agree it's also needed to add a > > > > SetConfigOption("search_path", "" call in the slotsync worker. > > > > > > > > [1] > > > > libpqrcv_connect > > > > ... > > > > if (logical) > > > > ... > > > > res = libpqrcv_PQexec(conn->streamConn, > > > > ALWAYS_SECURE_SEARCH_PATH_SQL); > > > > > > > > > > Agree, something like in the attached? (it's .txt to not disturb the CF bot). > > > > Thanks for the patch, changes look good. I have corporated it in the > > patch which addresses the rest of your comments in [1]. I have > > attached the patch as .txt > > > > Few comments: > =============== > 1. > - if (logical) > + if (logical || !replication) > { > > Can we add a comment about connection types that require > ALWAYS_SECURE_SEARCH_PATH_SQL? Yeah, will do. > > 2. > Can we add a test case to demonstrate that the '=' operator can be > hijacked to do different things when the slotsync worker didn't use > ALWAYS_SECURE_SEARCH_PATH_SQL? I don't think that's good to create a test to show how to hijack an operator within a background worker. I had a quick look and did not find existing tests in this area around ALWAYS_SECURE_SEARCH_PATH_SQL / search_patch and background worker. Such a test would: - "just" ensure that search_path works as expected - show how to hijack an operator within a background worker Based on the above I don't think that such a test is worth it. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wednesday, February 28, 2024 2:38 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > On Wed, Feb 28, 2024 at 08:49:19AM +0530, Amit Kapila wrote: > > On Mon, Feb 26, 2024 at 9:13 AM shveta malik <shveta.malik@gmail.com> > wrote: > > > > > > On Fri, Feb 23, 2024 at 7:41 PM Bertrand Drouvot > > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > > Hi, > > > > > I think to set secure search path for remote connection, the > > > > > standard approach could be to extend the code in > > > > > libpqrcv_connect[1], so that we don't need to schema qualify all the > operators in the queries. > > > > > > > > > > And for local connection, I agree it's also needed to add a > > > > > SetConfigOption("search_path", "" call in the slotsync worker. > > > > > > > > > > [1] > > > > > libpqrcv_connect > > > > > ... > > > > > if (logical) > > > > > ... > > > > > res = libpqrcv_PQexec(conn->streamConn, > > > > > > > > > > ALWAYS_SECURE_SEARCH_PATH_SQL); > > > > > > > > > > > > > Agree, something like in the attached? (it's .txt to not disturb the CF bot). > > > > > > Thanks for the patch, changes look good. I have corporated it in the > > > patch which addresses the rest of your comments in [1]. I have > > > attached the patch as .txt > > > > > > > Few comments: > > =============== > > 1. > > - if (logical) > > + if (logical || !replication) > > { > > > > Can we add a comment about connection types that require > > ALWAYS_SECURE_SEARCH_PATH_SQL? > > Yeah, will do. > > > > > 2. > > Can we add a test case to demonstrate that the '=' operator can be > > hijacked to do different things when the slotsync worker didn't use > > ALWAYS_SECURE_SEARCH_PATH_SQL? > > I don't think that's good to create a test to show how to hijack an operator > within a background worker. > > I had a quick look and did not find existing tests in this area around > ALWAYS_SECURE_SEARCH_PATH_SQL / search_patch and background worker. I think a similar commit 11da970 has added a test for the search_path, e.g. # Create some preexisting content on publisher $node_publisher->safe_psql( 'postgres', "CREATE FUNCTION public.pg_get_replica_identity_index(int) RETURNS regclass LANGUAGE sql AS 'SELECT 1/0'"); # shall not call Best Regards, Hou zj
On Wed, Feb 28, 2024 at 8:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Few comments: Thanks for the feedback. > =============== > 1. > - if (logical) > + if (logical || !replication) > { > > Can we add a comment about connection types that require > ALWAYS_SECURE_SEARCH_PATH_SQL? > > 2. > Can we add a test case to demonstrate that the '=' operator can be > hijacked to do different things when the slotsync worker didn't use > ALWAYS_SECURE_SEARCH_PATH_SQL? > Here is the patch with new test added and improved comments. thanks Shveta
Attachment
Hi, On Wed, Feb 28, 2024 at 06:48:37AM +0000, Zhijie Hou (Fujitsu) wrote: > On Wednesday, February 28, 2024 2:38 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Wed, Feb 28, 2024 at 08:49:19AM +0530, Amit Kapila wrote: > > > On Mon, Feb 26, 2024 at 9:13 AM shveta malik <shveta.malik@gmail.com> > > wrote: > > > > > > > > On Fri, Feb 23, 2024 at 7:41 PM Bertrand Drouvot > > > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > > > > Hi, > > > > > > I think to set secure search path for remote connection, the > > > > > > standard approach could be to extend the code in > > > > > > libpqrcv_connect[1], so that we don't need to schema qualify all the > > operators in the queries. > > > > > > > > > > > > And for local connection, I agree it's also needed to add a > > > > > > SetConfigOption("search_path", "" call in the slotsync worker. > > > > > > > > > > > > [1] > > > > > > libpqrcv_connect > > > > > > ... > > > > > > if (logical) > > > > > > ... > > > > > > res = libpqrcv_PQexec(conn->streamConn, > > > > > > > > > > > > ALWAYS_SECURE_SEARCH_PATH_SQL); > > > > > > > > > > > > > > > > Agree, something like in the attached? (it's .txt to not disturb the CF bot). > > > > > > > > Thanks for the patch, changes look good. I have corporated it in the > > > > patch which addresses the rest of your comments in [1]. I have > > > > attached the patch as .txt > > > > > > > > > > Few comments: > > > =============== > > > 1. > > > - if (logical) > > > + if (logical || !replication) > > > { > > > > > > Can we add a comment about connection types that require > > > ALWAYS_SECURE_SEARCH_PATH_SQL? > > > > Yeah, will do. > > > > > > > > 2. > > > Can we add a test case to demonstrate that the '=' operator can be > > > hijacked to do different things when the slotsync worker didn't use > > > ALWAYS_SECURE_SEARCH_PATH_SQL? > > > > I don't think that's good to create a test to show how to hijack an operator > > within a background worker. > > > > I had a quick look and did not find existing tests in this area around > > ALWAYS_SECURE_SEARCH_PATH_SQL / search_patch and background worker. > > I think a similar commit 11da970 has added a test for the search_path, e.g. Oh right, thanks for sharing! But do we think it's worth to show how to hijack an operator within a background worker "just" to verify that the search_path works as expected? I don't think it's worth it but will do if others have different opinions. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Wed, Feb 28, 2024 at 12:29:01PM +0530, shveta malik wrote: > On Wed, Feb 28, 2024 at 8:49 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > Few comments: > > Thanks for the feedback. > > > =============== > > 1. > > - if (logical) > > + if (logical || !replication) > > { > > > > Can we add a comment about connection types that require > > ALWAYS_SECURE_SEARCH_PATH_SQL? > > > > 2. > > Can we add a test case to demonstrate that the '=' operator can be > > hijacked to do different things when the slotsync worker didn't use > > ALWAYS_SECURE_SEARCH_PATH_SQL? > > > > Here is the patch with new test added and improved comments. Thanks! A few comments: 1 === + * used to run normal SQL queries s/run normal SQL/run SQL/ ? As mentioned up-thread I don't like that much the idea of creating such a test but if we do then here are my comments: 2 === +CREATE FUNCTION myschema.myintne(bigint, int) Should we explain why 'bigint, int' is important here (instead of 'int, int')? 3 === +# stage of syncing newly created slots. If the worker was not prepared +# to handle such attacks, it would have failed during Worth to mention the underlying check / function that would get an "unexpected" result? Except for the above (nit) comments the patch looks good to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Feb 28, 2024 at 12:31 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Wed, Feb 28, 2024 at 06:48:37AM +0000, Zhijie Hou (Fujitsu) wrote: > > On Wednesday, February 28, 2024 2:38 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > > > 2. > > > > Can we add a test case to demonstrate that the '=' operator can be > > > > hijacked to do different things when the slotsync worker didn't use > > > > ALWAYS_SECURE_SEARCH_PATH_SQL? > > > > > > I don't think that's good to create a test to show how to hijack an operator > > > within a background worker. > > > > > > I had a quick look and did not find existing tests in this area around > > > ALWAYS_SECURE_SEARCH_PATH_SQL / search_patch and background worker. > > > > I think a similar commit 11da970 has added a test for the search_path, e.g. > > Oh right, thanks for sharing! > > But do we think it's worth to show how to hijack an operator within a background > worker "just" to verify that the search_path works as expected? > > I don't think it's worth it but will do if others have different opinions. > I think it is important to add this test because if we break this behavior for any reason it will be a security hazard. Now, if adding it increases the timing of the test too much then we should rethink but otherwise, I don't see any reason not to add this test. -- With Regards, Amit Kapila.
On Wed, Feb 28, 2024 at 1:33 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > A few comments: Thanks for reviewing. > > 1 === > > + * used to run normal SQL queries > > s/run normal SQL/run SQL/ ? > > As mentioned up-thread I don't like that much the idea of creating such a test > but if we do then here are my comments: > > 2 === > > +CREATE FUNCTION myschema.myintne(bigint, int) > > Should we explain why 'bigint, int' is important here (instead of > 'int, int')? > > 3 === > > +# stage of syncing newly created slots. If the worker was not prepared > +# to handle such attacks, it would have failed during > > Worth to mention the underlying check / function that would get an "unexpected" > result? > > Except for the above (nit) comments the patch looks good to me. Here is the patch which addresses the above comments. Also optimized the test a little bit. Now we use pg_sync_replication_slots() function instead of worker to test the operator-redirection using search_path. This has been done to simplify the test case and reduce the added time. thanks Shveta
Attachment
On Wed, Feb 28, 2024 at 3:26 PM shveta malik <shveta.malik@gmail.com> wrote: > > > Here is the patch which addresses the above comments. Also optimized > the test a little bit. Now we use pg_sync_replication_slots() function > instead of worker to test the operator-redirection using search_path. > This has been done to simplify the test case and reduce the added > time. > I have slightly adjusted the comments in the attached, otherwise, LGTM. -- With Regards, Amit Kapila.
Attachment
Hi, On Wed, Feb 28, 2024 at 02:23:27AM +0000, Zhijie Hou (Fujitsu) wrote: > Attach the V100 patch set which addressed above comments. Thanks! A few random comments: 1 === + if (!ok) + { + GUC_check_errdetail("List syntax is invalid."); + } What about to get rid of the brackets here? 2 === + + /* + * If the replication slots' data have been initialized, verify if the + * specified slots exist and are logical slots. + */ remove the empty line above the comment? 3 === +check_standby_slot_names(char **newval, void **extra, GucSource source) +{ + if ((*newval)[0] == '\0') + return true; I think "**newval == '\0'" is easier to read but that's a matter of taste and check_synchronous_standby_names() is already using the same so it's a nit. 4 === Regarding the test, what about adding one to test the "new" behavior discussed up-thread? (logical replication will wait if slot mentioned in standby_slot_names is dropped and/or does not exist when the engine starts?) Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Wed, Feb 28, 2024 at 04:50:55PM +0530, Amit Kapila wrote: > On Wed, Feb 28, 2024 at 3:26 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > Here is the patch which addresses the above comments. Also optimized > > the test a little bit. Now we use pg_sync_replication_slots() function > > instead of worker to test the operator-redirection using search-patch. > > This has been done to simplify the test case and reduce the added > > time. Thanks! > I have slightly adjusted the comments in the attached, otherwise, LGTM. Same here, LGTM. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Feb 28, 2024 at 10:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Feb 28, 2024 at 3:26 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > Here is the patch which addresses the above comments. Also optimized > > the test a little bit. Now we use pg_sync_replication_slots() function > > instead of worker to test the operator-redirection using search-patch. > > This has been done to simplify the test case and reduce the added > > time. > > > > I have slightly adjusted the comments in the attached, otherwise, LGTM. > > -- - if (logical) + /* + * Set always-secure search path for the cases where the connection is + * used to run SQL queries, so malicious users can't get control. + */ + if (logical || !replication) { PGresult *res; I found this condition a bit confusing. According to the libpqrcv_connect function comment: * This function can be used for both replication and regular connections. * If it is a replication connection, it could be either logical or physical * based on input argument 'logical'. IIUC that comment is saying the 'replication' flag is like the main categorization and the 'logical' flag is like a subcategory (for when 'replication' is true). Therefore, won't the modified check be better to be written the other way around? This will also be consistent with the way the Assert was written. SUGGESTION if (!replication || logical) { ... ====== Kind Regards, Peter Smith. Fujitsu Australia
On Monday, February 26, 2024 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Feb 26, 2024 at 7:49 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > Attach the V98 patch set which addressed above comments. > > > > Few comments: > ============= > 1. > WalSndWaitForWal(XLogRecPtr loc) > { > int wakeEvents; > + bool wait_for_standby = false; > + uint32 wait_event; > + List *standby_slots = NIL; > static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr; > > + if (MyReplicationSlot->data.failover && replication_active) > + standby_slots = GetStandbySlotList(true); > + > /* > - * Fast path to avoid acquiring the spinlock in case we already know we > - * have enough WAL available. This is particularly interesting if we're > - * far behind. > + * Check if all the standby servers have confirmed receipt of WAL up to > + * RecentFlushPtr even when we already know we have enough WAL available. > + * > + * Note that we cannot directly return without checking the status of > + * standby servers because the standby_slot_names may have changed, > + which > + * means there could be new standby slots in the list that have not yet > + * caught up to the RecentFlushPtr. > */ > - if (RecentFlushPtr != InvalidXLogRecPtr && > - loc <= RecentFlushPtr) > - return RecentFlushPtr; > + if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr) { > + FilterStandbySlots(RecentFlushPtr, &standby_slots); > > I think even if the slot list is not changed, we will always process each slot > mentioned in standby_slot_names once. Can't we cache the previous list of > slots for we have already waited for? In that case, we won't even need to copy > the list via GetStandbySlotList() unless we need to wait. > > 2. > WalSndWaitForWal(XLogRecPtr loc) > { > + /* > + * Update the standby slots that have not yet caught up to the flushed > + * position. It is good to wait up to RecentFlushPtr and then let it > + * send the changes to logical subscribers one by one which are > + * already covered in RecentFlushPtr without needing to wait on every > + * change for standby confirmation. > + */ > + if (wait_for_standby) > + FilterStandbySlots(RecentFlushPtr, &standby_slots); > + > /* Update our idea of the currently flushed position. */ > - if (!RecoveryInProgress()) > + else if (!RecoveryInProgress()) > RecentFlushPtr = GetFlushRecPtr(NULL); > else > RecentFlushPtr = GetXLogReplayRecPtr(NULL); ... > /* > * If postmaster asked us to stop, don't wait anymore. > * > * It's important to do this check after the recomputation of > * RecentFlushPtr, so we can send all remaining data before shutting > * down. > */ > if (got_STOPPING) > break; > > I think because 'wait_for_standby' may not be set in the first or consecutive > cycles we may send the WAL to the logical subscriber before sending it to the > physical subscriber during shutdown. Here is the v101 patch set which addressed above comments. This version will cache the oldest standby slot's LSN each time we waited for them to catch up. The cached LSN is invalidated when we reload the GUC config. In the WalSndWaitForWal function, instead of traversing the entire standby list each time, we can check the cached LSN to quickly determine if the standbys have caught up. When a shutdown signal is received, we continue to wait for the standby slots to catch up. When waiting for the standbys to catch up after receiving the shutdown signal, an ERROR is reported if any slots are dropped, invalidated, or inactive. 
This measure is taken to prevent the walsender from waiting indefinitely. Best Regards, Hou zj
Attachment
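To visualize the caching scheme described above (remember the oldest confirmed LSN among the standby_slot_names slots after a wait, invalidate it when the GUC is reloaded, and consult it before walking the slot list again in WalSndWaitForWal), a minimal sketch could look like this. Every name below is a placeholder chosen for illustration and does not match the identifiers used in the v101 patch:

#include "postgres.h"
#include "access/xlogdefs.h"

/*
 * Oldest confirmed-flush LSN among the standby_slot_names slots, or
 * InvalidXLogRecPtr when unknown (never computed, or the GUC changed).
 */
static XLogRecPtr cached_standby_flush_lsn = InvalidXLogRecPtr;

/*
 * Fast path for WalSndWaitForWal(): if the standbys were already known to
 * have confirmed an LSN at least as new as 'loc', there is nothing to wait
 * for and the slot list does not need to be traversed again.
 */
static bool
standbys_known_caught_up(XLogRecPtr loc)
{
	return !XLogRecPtrIsInvalid(cached_standby_flush_lsn) &&
		loc <= cached_standby_flush_lsn;
}

/* Called after a wait completes: remember how far the slowest standby got. */
static void
remember_oldest_standby_flush(XLogRecPtr oldest_flush)
{
	cached_standby_flush_lsn = oldest_flush;
}

/*
 * Called from the standby_slot_names assign hook: a new value may name
 * slots we have never waited for, so the cached position is meaningless.
 */
static void
invalidate_standby_flush_cache(void)
{
	cached_standby_flush_lsn = InvalidXLogRecPtr;
}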
On Wed, Feb 28, 2024 at 1:23 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Tuesday, February 27, 2024 3:18 PM Peter Smith <smithpb2250@gmail.com> wrote: ... > > 20. > > +# > > +# | ----> standby1 (primary_slot_name = sb1_slot) # | ----> standby2 > > +(primary_slot_name = sb2_slot) # primary ----- | # | ----> subscriber1 > > +(failover = true) # | ----> subscriber2 (failover = false) > > > > In the diagram, the "--->" means a mixture of physical standbys and logical > > pub/sub replication. Maybe it can be a bit clearer? > > > > SUGGESTION > > > > # primary (publisher) > > # > > # (physical standbys) > > # | ----> standby1 (primary_slot_name = sb1_slot) > > # | ----> standby2 (primary_slot_name = sb2_slot) > > # > > # (logical replication) > > # | ----> subscriber1 (failover = true, slot_name = lsub1_slot) > > # | ----> subscriber2 (failover = false, slot_name = lsub2_slot) > > > > I think one can distinguish it based on the 'standby' and 'subscriber' as well, because > 'standby' normally refer to physical standby while the other refer to logical. > Ok, but shouldn't it at least include info about the logical slot names associated with the subscribers (slot_name = lsub1_slot, slot_name = lsub2_slot) like suggested above? ====== Here are some more review comments for v100-0001 ====== doc/src/sgml/config.sgml 1. + <para> + Lists the streaming replication standby server slot names that logical + WAL sender processes will wait for. Logical WAL sender processes will + send decoded changes to plugins only after the specified replication + slots confirm receiving WAL. This guarantees that logical replication + slots with failover enabled do not consume changes until those changes + are received and flushed to corresponding physical standbys. If a + logical replication connection is meant to switch to a physical standby + after the standby is promoted, the physical replication slot for the + standby should be listed here. Note that logical replication will not + proceed if the slots specified in the standby_slot_names do not exist or + are invalidated. + </para> Is that note ("Note that logical replication will not proceed if the slots specified in the standby_slot_names do not exist or are invalidated") meant only for subscriptions marked for 'failover' or any subscription? Maybe wording can be modified to help clarify it? ====== src/backend/replication/slot.c 2. +/* + * A helper function to validate slots specified in GUC standby_slot_names. + */ +static bool +validate_standby_slots(char **newval) +{ + char *rawname; + List *elemlist; + bool ok; + + /* Need a modifiable copy of string */ + rawname = pstrdup(*newval); + + /* Verify syntax and parse string into a list of identifiers */ + ok = SplitIdentifierString(rawname, ',', &elemlist); + + if (!ok) + { + GUC_check_errdetail("List syntax is invalid."); + } + + /* + * If the replication slots' data have been initialized, verify if the + * specified slots exist and are logical slots. + */ + else if (ReplicationSlotCtl) + { + foreach_ptr(char, name, elemlist) + { + ReplicationSlot *slot; + + slot = SearchNamedReplicationSlot(name, true); + + if (!slot) + { + GUC_check_errdetail("replication slot \"%s\" does not exist", + name); + ok = false; + break; + } + + if (!SlotIsPhysical(slot)) + { + GUC_check_errdetail("\"%s\" is not a physical replication slot", + name); + ok = false; + break; + } + } + } + + pfree(rawname); + list_free(elemlist); + return ok; +} 2a. 
I didn't mention this previously because I thought this function was not going to change anymore, but since Bertrand suggested some changes [1], I will say IMO the { } are fine here for the single statement, but I think it will be better to rearrange this code to be like below. Having a 2nd NOP 'else' gives a much better place where you can put your ReplicationSlotCtl comment. if (!ok) { GUC_check_errdetail("List syntax is invalid."); } else if (!ReplicationSlotCtl) { <Insert big comment here about why it is OK to skip when ReplicationSlotCtl is NULL> } else { foreach_ptr ... } ~ 2b. In any case even if don't refactor anything I still think you need to extend that comment to explain how skipping ReplicationSlotCtl is NULL is only OK because there will be some later checking also done in the FilterStandbySlots() function to catch any config problems. ------ [1] https://www.postgresql.org/message-id/Zd8ahZXw82ieFxX/%40ip-10-97-1-34.eu-west-3.compute.internal Kind Regards, Peter Smith. Fujitsu Australia
On Wednesday, February 28, 2024 7:36 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Wed, Feb 28, 2024 at 02:23:27AM +0000, Zhijie Hou (Fujitsu) wrote: > > Attach the V100 patch set which addressed above comments. > > A few random comments: Thanks for the comments! > > 1 === > > + if (!ok) > + { > + GUC_check_errdetail("List syntax is invalid."); > + } > > What about to get rid of the brackets here? I personally prefer the current style. > > 2 === > > + > + /* > + * If the replication slots' data have been initialized, verify if the > + * specified slots exist and are logical slots. > + */ > > remove the empty line above the comment? I feel it is cleaner to have an empty line before the comment. > > 3 === > > +check_standby_slot_names(char **newval, void **extra, GucSource source) > +{ > + if ((*newval)[0] == '\0') > + return true; > > I think "**newval == '\0'" is easier to read but that's a matter of taste and > check_synchronous_standby_names() is already using the same so it's a nit. I don't have a strong opinion on this, so I will change it if others feel the same. > > 4 === > > Regarding the test, what about adding one to test the "new" behavior > discussed up-thread? (logical replication will wait if slot mentioned in > standby_slot_names is dropped and/or does not exist when the engine starts?) Will think about this. Best Regards, Hou zj
On Tue, Feb 27, 2024 at 11:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 27, 2024 at 12:48 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > Here are some review comments for v99-0001 > > > > ========== > > 0. GENERAL. > > > > +#standby_slot_names = '' # streaming replication standby server slot names that > > + # logical walsender processes will wait for > > > > IMO the GUC name is too generic. There is nothing in this name to > > suggest it has anything to do with logical slot synchronization; that > > meaning is only found in the accompanying comment -- it would be > > better if the GUC name itself were more self-explanatory. > > > > e.g. Maybe like 'wal_sender_sync_standby_slot_names' or > > 'wal_sender_standby_slot_names', 'wal_sender_wait_for_standby_slots', > > or ... > > > > It would be wrong and or misleading to append wal_sender to this GUC > name as this is used during SQL APIs as well. Fair enough, but the fact that some SQL functions might wait is also not mentioned in the config file comment, nor in the documentation for GUC 'standby_slot_names'. Seems like a docs omission? > Also, adding wait sounds > more like a boolean. So, I don't see the proposed names any better > than the current one. > Anyway, the point is that the current GUC name 'standby_slot_names' is not ideal IMO because it doesn't have enough meaning by itself -- e.g. you have to read the accompanying comment or documentation to have any idea of its purpose. My suggested GUC names were mostly just to get people thinking about it. Maybe others can come up with better names. ====== Kind Regards, Peter Smith. Fujitsu Australia
Hi, On Thu, Feb 29, 2024 at 10:43:07AM +1100, Peter Smith wrote: > - if (logical) > + /* > + * Set always-secure search path for the cases where the connection is > + * used to run SQL queries, so malicious users can't get control. > + */ > + if (logical || !replication) > { > PGresult *res; > > I found this condition a bit confusing. According to the > libpqrcv_connect function comment: > > * This function can be used for both replication and regular connections. > * If it is a replication connection, it could be either logical or physical > * based on input argument 'logical'. > > IIUC that comment is saying the 'replication' flag is like the main > categorization and the 'logical' flag is like a subcategory (for when > 'replication' is true). Therefore, won't the modified check be better > to be written the other way around? This will also be consistent with > the way the Assert was written. > > SUGGESTION > if (!replication || logical) > { > ... Thanks for the review! Yeah, that makes sense from a categorization point of view. Out of curiosity, I checked which condition returns true most of the time. Looking at the walrcv_connect calls: logical 6 times, !replication 2 times (only for sync-slot-related stuff). Looking at check-world coverage: logical 1006 times, !replication 16 times. So, according to the above, the initially proposed order, "if (logical || !replication)", provides the benefit of avoiding the second check on !replication most of the time (at least during check-world). Of course it also depends on whether the slot sync feature (the only one that makes use of !replication) is used or not. Based on the above, I did prefer the original proposal but I think we can keep what has been pushed (Peter's proposal). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Feb 29, 2024 at 8:29 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Wed, Feb 28, 2024 at 1:23 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Tuesday, February 27, 2024 3:18 PM Peter Smith <smithpb2250@gmail.com> wrote: > ... > > > 20. > > > +# > > > +# | ----> standby1 (primary_slot_name = sb1_slot) # | ----> standby2 > > > +(primary_slot_name = sb2_slot) # primary ----- | # | ----> subscriber1 > > > +(failover = true) # | ----> subscriber2 (failover = false) > > > > > > In the diagram, the "--->" means a mixture of physical standbys and logical > > > pub/sub replication. Maybe it can be a bit clearer? > > > > > > SUGGESTION > > > > > > # primary (publisher) > > > # > > > # (physical standbys) > > > # | ----> standby1 (primary_slot_name = sb1_slot) > > > # | ----> standby2 (primary_slot_name = sb2_slot) > > > # > > > # (logical replication) > > > # | ----> subscriber1 (failover = true, slot_name = lsub1_slot) > > > # | ----> subscriber2 (failover = false, slot_name = lsub2_slot) > > > > > > > I think one can distinguish it based on the 'standby' and 'subscriber' as well, because > > 'standby' normally refer to physical standby while the other refer to logical. > > I think Peter's suggestion will make the setup clear. > > Ok, but shouldn't it at least include info about the logical slot > names associated with the subscribers (slot_name = lsub1_slot, > slot_name = lsub2_slot) like suggested above? > > ====== > > Here are some more review comments for v100-0001 > > ====== > doc/src/sgml/config.sgml > > 1. > + <para> > + Lists the streaming replication standby server slot names that logical > + WAL sender processes will wait for. Logical WAL sender processes will > + send decoded changes to plugins only after the specified replication > + slots confirm receiving WAL. This guarantees that logical replication > + slots with failover enabled do not consume changes until those changes > + are received and flushed to corresponding physical standbys. If a > + logical replication connection is meant to switch to a physical standby > + after the standby is promoted, the physical replication slot for the > + standby should be listed here. Note that logical replication will not > + proceed if the slots specified in the standby_slot_names do > not exist or > + are invalidated. > + </para> > > Is that note ("Note that logical replication will not proceed if the > slots specified in the standby_slot_names do not exist or are > invalidated") meant only for subscriptions marked for 'failover' or > any subscription? Maybe wording can be modified to help clarify it? > I think it is implicit that here we are talking about failover slots. I think clarifying again the same could be repetitive considering the previous sentence: "This guarantees that logical replication slots with failover enabled do not consume .." have mentioned it. > ====== > src/backend/replication/slot.c > > 2. > +/* > + * A helper function to validate slots specified in GUC standby_slot_names. 
> + */ > +static bool > +validate_standby_slots(char **newval) > +{ > + char *rawname; > + List *elemlist; > + bool ok; > + > + /* Need a modifiable copy of string */ > + rawname = pstrdup(*newval); > + > + /* Verify syntax and parse string into a list of identifiers */ > + ok = SplitIdentifierString(rawname, ',', &elemlist); > + > + if (!ok) > + { > + GUC_check_errdetail("List syntax is invalid."); > + } > + > + /* > + * If the replication slots' data have been initialized, verify if the > + * specified slots exist and are logical slots. > + */ > + else if (ReplicationSlotCtl) > + { > + foreach_ptr(char, name, elemlist) > + { > + ReplicationSlot *slot; > + > + slot = SearchNamedReplicationSlot(name, true); > + > + if (!slot) > + { > + GUC_check_errdetail("replication slot \"%s\" does not exist", > + name); > + ok = false; > + break; > + } > + > + if (!SlotIsPhysical(slot)) > + { > + GUC_check_errdetail("\"%s\" is not a physical replication slot", > + name); > + ok = false; > + break; > + } > + } > + } > + > + pfree(rawname); > + list_free(elemlist); > + return ok; > +} > > 2a. > I didn't mention this previously because I thought this function was > not going to change anymore, but since Bertrand suggested some changes > [1], I will say IMO the { } are fine here for the single statement, > but I think it will be better to rearrange this code to be like below. > Having a 2nd NOP 'else' gives a much better place where you can put > your ReplicationSlotCtl comment. > > if (!ok) > { > GUC_check_errdetail("List syntax is invalid."); > } > else if (!ReplicationSlotCtl) > { > <Insert big comment here about why it is OK to skip when > ReplicationSlotCtl is NULL> > } > else > { > foreach_ptr ... > } > +1. This will make the code and reasoning to skip clear. Few additional comments on the latest patch: ================================= 1. static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc) { ... + if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr && + (!replication_active || StandbyConfirmedFlush(loc, WARNING))) + { + /* + * Fast path to avoid acquiring the spinlock in case we already know + * we have enough WAL available and all the standby servers have + * confirmed receipt of WAL up to RecentFlushPtr. This is particularly + * interesting if we're far behind. + */ return RecentFlushPtr; + } ... ... + * Wait for WALs to be flushed to disk only if the postmaster has not + * asked us to stop. + */ + if (loc > RecentFlushPtr && !got_STOPPING) + wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL; + + /* + * Check if the standby slots have caught up to the flushed position. + * It is good to wait up to RecentFlushPtr and then let it send the + * changes to logical subscribers one by one which are already covered + * in RecentFlushPtr without needing to wait on every change for + * standby confirmation. Note that after receiving the shutdown signal, + * an ERROR is reported if any slots are dropped, invalidated, or + * inactive. This measure is taken to prevent the walsender from + * waiting indefinitely. + */ + else if (replication_active && + !StandbyConfirmedFlush(RecentFlushPtr, WARNING)) + { + wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION; + wait_for_standby = true; + } + else + { + /* Already caught up and doesn't need to wait for standby_slots. */ break; + } ... } Can we try to move these checks into a separate function that returns a boolean and has an out parameter as wait_event? 2. How about naming StandbyConfirmedFlush() as StandbySlotsAreCaughtup? 
-- With Regards, Amit Kapila.
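For point 1 above, one possible shape for such a helper is sketched below: it reports via an out parameter which wait event applies and returns whether the walsender still has to wait. The function name is invented for illustration; replication_active, StandbyConfirmedFlush() and the two wait events are taken from the quoted v101 hunk, and the got_STOPPING special case is omitted for brevity:

/* Would live in walsender.c next to WalSndWaitForWal(); illustrative only. */
static bool
WalSndNeedsToWait(XLogRecPtr loc, XLogRecPtr flushed_lsn, uint32 *wait_event)
{
	/* The WAL we want to send is not flushed locally yet. */
	if (loc > flushed_lsn)
	{
		*wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL;
		return true;
	}

	/*
	 * The WAL exists locally, but some standby named in standby_slot_names
	 * has not confirmed receiving it yet.
	 */
	if (replication_active && !StandbyConfirmedFlush(flushed_lsn, WARNING))
	{
		*wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION;
		return true;
	}

	/* Caught up on both counts; the caller can stop waiting. */
	*wait_event = 0;
	return false;
}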
On Thu, Feb 29, 2024 at 9:13 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Tue, Feb 27, 2024 at 11:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > Also, adding wait sounds > > more like a boolean. So, I don't see the proposed names any better > > than the current one. > > > > Anyway, the point is that the current GUC name 'standby_slot_names' is > not ideal IMO because it doesn't have enough meaning by itself -- e.g. > you have to read the accompanying comment or documentation to have any > idea of its purpose. > Yeah, one has to read the description but that is true for other parameters like "temp_tablespaces". I don't have any better ideas but open to suggestions. -- With Regards, Amit Kapila.
On Thu, Feb 29, 2024 at 3:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Few additional comments on the latest patch: > ================================= > 1. > static XLogRecPtr > WalSndWaitForWal(XLogRecPtr loc) > { > ... > + if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr && > + (!replication_active || StandbyConfirmedFlush(loc, WARNING))) > + { > + /* > + * Fast path to avoid acquiring the spinlock in case we already know > + * we have enough WAL available and all the standby servers have > + * confirmed receipt of WAL up to RecentFlushPtr. This is particularly > + * interesting if we're far behind. > + */ > return RecentFlushPtr; > + } > ... > ... > + * Wait for WALs to be flushed to disk only if the postmaster has not > + * asked us to stop. > + */ > + if (loc > RecentFlushPtr && !got_STOPPING) > + wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL; > + > + /* > + * Check if the standby slots have caught up to the flushed position. > + * It is good to wait up to RecentFlushPtr and then let it send the > + * changes to logical subscribers one by one which are already covered > + * in RecentFlushPtr without needing to wait on every change for > + * standby confirmation. Note that after receiving the shutdown signal, > + * an ERROR is reported if any slots are dropped, invalidated, or > + * inactive. This measure is taken to prevent the walsender from > + * waiting indefinitely. > + */ > + else if (replication_active && > + !StandbyConfirmedFlush(RecentFlushPtr, WARNING)) > + { > + wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION; > + wait_for_standby = true; > + } > + else > + { > + /* Already caught up and doesn't need to wait for standby_slots. */ > break; > + } > ... > } > > Can we try to move these checks into a separate function that returns > a boolean and has an out parameter as wait_event? > > 2. How about naming StandbyConfirmedFlush() as StandbySlotsAreCaughtup? > Some more comments: ================== 1. + foreach_ptr(char, name, elemlist) + { + ReplicationSlot *slot; + + slot = SearchNamedReplicationSlot(name, true); + + if (!slot) + { + GUC_check_errdetail("replication slot \"%s\" does not exist", + name); + ok = false; + break; + } + + if (!SlotIsPhysical(slot)) + { + GUC_check_errdetail("\"%s\" is not a physical replication slot", + name); + ok = false; + break; + } + } Won't the second check need protection via ReplicationSlotControlLock? 2. In WaitForStandbyConfirmation(), we are anyway calling StandbyConfirmedFlush() before the actual sleep in the loop, so does calling it at the beginning of the function will serve any purpose? If so, it is better to add some comments explaining the same. 3. Also do we need to perform the validation checks done in StandbyConfirmedFlush() repeatedly when it is invoked in a loop? We can probably separate those checks and perform them just once. 4. + * + * XXX: If needed, this can be attempted in future. Remove this part of the comment. -- With Regards, Amit Kapila.
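On comment 1, one way to close the window would be to hold ReplicationSlotControlLock in shared mode across both the lookup and the SlotIsPhysical() check, so a concurrent DROP/CREATE cannot swap the slot out underneath the loop. The fragment below is only a sketch of that idea applied to the loop quoted above; ok and elemlist are the local variables of the quoted function, and need_lock is passed as false because the lock is already held:

	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

	foreach_ptr(char, name, elemlist)
	{
		ReplicationSlot *slot;

		/* need_lock = false: we already hold ReplicationSlotControlLock */
		slot = SearchNamedReplicationSlot(name, false);

		if (!slot)
		{
			GUC_check_errdetail("replication slot \"%s\" does not exist",
								name);
			ok = false;
			break;
		}

		if (!SlotIsPhysical(slot))
		{
			GUC_check_errdetail("\"%s\" is not a physical replication slot",
								name);
			ok = false;
			break;
		}
	}

	LWLockRelease(ReplicationSlotControlLock);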
On Thu, Feb 29, 2024 at 7:04 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Here is the v101 patch set which addressed above comments. > Thanks for the patch. Few comments: 1) Shall we mention in doc that shutdown will wait for standbys in standby_slot_names to confirm receiving WAL: Suggestion for logicaldecoding.sgml: When <varname>standby_slot_names</varname> is utilized, the primary server will not completely shut down until the corresponding standbys, associated with the physical replication slots specified in <varname>standby_slot_names</varname>, have confirmed receiving the WAL up to the latest flushed position on the primary server. slot.c 2) /* * If a logical slot name is provided in standby_slot_names, report * a message and skip it. Although logical slots are disallowed in * the GUC check_hook(validate_standby_slots), it is still * possible for a user to drop an existing physical slot and * recreate a logical slot with the same name. */ This is not completely true, we can still specify a logical slot during instance start and it will accept it. Suggestion: /* * If a logical slot name is provided in standby_slot_names, report * a message and skip it. It is possible for user to specify a * logical slot name in standby_slot_names just before the server * startup. The GUC check_hook(validate_standby_slots) can not * validate such a slot during startup as the ReplicationSlotCtl * shared memory is not initialized by that time. It is also * possible for user to drop an existing physical slot and * recreate a logical slot with the same name. */ 3. Wait for physical standby to confirm receiving the given lsn standby -->standbys 4. In StandbyConfirmedFlush(), is it better to have below errdetail in all problematic cases: Logical replication is waiting on the standby associated with \"%s\ We have it only for inactive pid case but we are waiting in all cases. thanks Shveta
On Thursday, February 29, 2024 7:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Feb 29, 2024 at 3:23 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > Few additional comments on the latest patch: > > ================================= > > 1. > > static XLogRecPtr > > WalSndWaitForWal(XLogRecPtr loc) > > { > > ... > > + if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr && > > + (!replication_active || StandbyConfirmedFlush(loc, WARNING))) { > > + /* > > + * Fast path to avoid acquiring the spinlock in case we already know > > + * we have enough WAL available and all the standby servers have > > + * confirmed receipt of WAL up to RecentFlushPtr. This is > > + particularly > > + * interesting if we're far behind. > > + */ > > return RecentFlushPtr; > > + } > > ... > > ... > > + * Wait for WALs to be flushed to disk only if the postmaster has not > > + * asked us to stop. > > + */ > > + if (loc > RecentFlushPtr && !got_STOPPING) wait_event = > > + WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL; > > + > > + /* > > + * Check if the standby slots have caught up to the flushed position. > > + * It is good to wait up to RecentFlushPtr and then let it send the > > + * changes to logical subscribers one by one which are already > > + covered > > + * in RecentFlushPtr without needing to wait on every change for > > + * standby confirmation. Note that after receiving the shutdown > > + signal, > > + * an ERROR is reported if any slots are dropped, invalidated, or > > + * inactive. This measure is taken to prevent the walsender from > > + * waiting indefinitely. > > + */ > > + else if (replication_active && > > + !StandbyConfirmedFlush(RecentFlushPtr, WARNING)) { wait_event = > > + WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION; > > + wait_for_standby = true; > > + } > > + else > > + { > > + /* Already caught up and doesn't need to wait for standby_slots. */ > > break; > > + } > > ... > > } > > > > Can we try to move these checks into a separate function that returns > > a boolean and has an out parameter as wait_event? > > > > 2. How about naming StandbyConfirmedFlush() as > StandbySlotsAreCaughtup? > > > > Some more comments: > ================== > 1. > + foreach_ptr(char, name, elemlist) > + { > + ReplicationSlot *slot; > + > + slot = SearchNamedReplicationSlot(name, true); > + > + if (!slot) > + { > + GUC_check_errdetail("replication slot \"%s\" does not exist", name); > + ok = false; break; } > + > + if (!SlotIsPhysical(slot)) > + { > + GUC_check_errdetail("\"%s\" is not a physical replication slot", > + name); ok = false; break; } } > > Won't the second check need protection via ReplicationSlotControlLock? Yes, added. > > 2. In WaitForStandbyConfirmation(), we are anyway calling > StandbyConfirmedFlush() before the actual sleep in the loop, so does calling it > at the beginning of the function will serve any purpose? If so, it is better to add > some comments explaining the same. It is used as a fast-path to avoid calling condition variable stuff, I think we can directly put failover check and list check in the beginning instead of calling that function. > > 3. Also do we need to perform the validation checks done in > StandbyConfirmedFlush() repeatedly when it is invoked in a loop? We can > probably separate those checks and perform them just once. I have removed slot.failover check from the StandbyConfirmedFlush function, so that we can do it when necessary. 
I didn’t change the check for standby_slot_names_list because the list can change inside the loop when the config is reloaded. > > 4. > + * > + * XXX: If needed, this can be attempted in future. > > Remove this part of the comment. Removed. Attached is the V102 patch set, which addresses Amit's and Shveta's comments. Thanks to Shveta for helping address the comments off-list. Best Regards, Hou zj
Attachment
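As background for why the standby_slot_names_list check stays inside the loop (unlike the other validations that were hoisted out), the waiting code conceptually has to re-read the GUC-derived list after every wakeup, roughly as sketched below. This is a simplified illustration rather than the v102 code; wait_for_lsn stands in for whatever target LSN the caller passes, while GetStandbySlotList(), StandbyConfirmedFlush(), WalSndCtl->wal_confirm_rcv_cv and the wait event are the names used in the quoted patches:

	for (;;)
	{
		List	   *standby_slots;

		/* A SIGHUP between sleeps may have changed standby_slot_names. */
		if (ConfigReloadPending)
		{
			ConfigReloadPending = false;
			ProcessConfigFile(PGC_SIGHUP);
		}

		/* Re-fetch the slot list, since the GUC may just have changed. */
		standby_slots = GetStandbySlotList(true);

		/* Nothing (left) to wait for. */
		if (standby_slots == NIL)
			break;

		/* All slots named in standby_slot_names have confirmed the LSN. */
		if (StandbyConfirmedFlush(wait_for_lsn, WARNING))
			break;

		/*
		 * Timed sleep, so that a changed standby_slot_names is noticed even
		 * if no standby ever reports progress.
		 */
		ConditionVariableTimedSleep(&WalSndCtl->wal_confirm_rcv_cv, 1000,
									WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION);
	}

	ConditionVariableCancelSleep();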
On Thursday, February 29, 2024 7:36 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Feb 29, 2024 at 7:04 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > Here is the v101 patch set which addressed above comments. > > > > Thanks for the patch. Few comments: > > 1) Shall we mention in doc that shutdown will wait for standbys in > standby_slot_names to confirm receiving WAL: > > Suggestion for logicaldecoding.sgml: > > When <varname>standby_slot_names</varname> is utilized, the primary > server will not completely shut down until the corresponding standbys, > associated with the physical replication slots specified in > <varname>standby_slot_names</varname>, have confirmed receiving the > WAL up to the latest flushed position on the primary server. > > slot.c > 2) > /* > * If a logical slot name is provided in standby_slot_names, report > * a message and skip it. Although logical slots are disallowed in > * the GUC check_hook(validate_standby_slots), it is still > * possible for a user to drop an existing physical slot and > * recreate a logical slot with the same name. > */ > > This is not completely true, we can still specify a logical slot during instance > start and it will accept it. > > Suggestion: > /* > * If a logical slot name is provided in standby_slot_names, report > * a message and skip it. It is possible for user to specify a > * logical slot name in standby_slot_names just before the server > * startup. The GUC check_hook(validate_standby_slots) can not > * validate such a slot during startup as the ReplicationSlotCtl > * shared memory is not initialized by that time. It is also > * possible for user to drop an existing physical slot and > * recreate a logical slot with the same name. > */ > > 3. Wait for physical standby to confirm receiving the given lsn > > standby -->standbys > > > 4. > In StandbyConfirmedFlush(), is it better to have below errdetail in all > problematic cases: > Logical replication is waiting on the standby associated with \"%s\ > > We have it only for inactive pid case but we are waiting in all cases. Thanks for the comments, I have addressed them. Best Regards, Hou zj
On Thursday, February 29, 2024 5:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Feb 29, 2024 at 8:29 AM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > On Wed, Feb 28, 2024 at 1:23 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Tuesday, February 27, 2024 3:18 PM Peter Smith > <smithpb2250@gmail.com> wrote: > > ... > > > > 20. > > > > +# > > > > +# | ----> standby1 (primary_slot_name = sb1_slot) # | ----> standby2 > > > > +(primary_slot_name = sb2_slot) # primary ----- | # | ----> subscriber1 > > > > +(failover = true) # | ----> subscriber2 (failover = false) > > > > > > > > In the diagram, the "--->" means a mixture of physical standbys and > logical > > > > pub/sub replication. Maybe it can be a bit clearer? > > > > > > > > SUGGESTION > > > > > > > > # primary (publisher) > > > > # > > > > # (physical standbys) > > > > # | ----> standby1 (primary_slot_name = sb1_slot) > > > > # | ----> standby2 (primary_slot_name = sb2_slot) > > > > # > > > > # (logical replication) > > > > # | ----> subscriber1 (failover = true, slot_name = lsub1_slot) > > > > # | ----> subscriber2 (failover = false, slot_name = lsub2_slot) > > > > > > > > > > I think one can distinguish it based on the 'standby' and 'subscriber' as well, > because > > > 'standby' normally refer to physical standby while the other refer to logical. > > > > > I think Peter's suggestion will make the setup clear. Changed. > > > > > Ok, but shouldn't it at least include info about the logical slot > > names associated with the subscribers (slot_name = lsub1_slot, > > slot_name = lsub2_slot) like suggested above? > > > > ====== > > > > Here are some more review comments for v100-0001 > > > > ====== > > doc/src/sgml/config.sgml > > > > 1. > > + <para> > > + Lists the streaming replication standby server slot names that > logical > > + WAL sender processes will wait for. Logical WAL sender processes > will > > + send decoded changes to plugins only after the specified > replication > > + slots confirm receiving WAL. This guarantees that logical replication > > + slots with failover enabled do not consume changes until those > changes > > + are received and flushed to corresponding physical standbys. If a > > + logical replication connection is meant to switch to a physical > standby > > + after the standby is promoted, the physical replication slot for the > > + standby should be listed here. Note that logical replication will not > > + proceed if the slots specified in the standby_slot_names do > > not exist or > > + are invalidated. > > + </para> > > > > Is that note ("Note that logical replication will not proceed if the > > slots specified in the standby_slot_names do not exist or are > > invalidated") meant only for subscriptions marked for 'failover' or > > any subscription? Maybe wording can be modified to help clarify it? > > > > I think it is implicit that here we are talking about failover slots. > I think clarifying again the same could be repetitive considering the > previous sentence: "This guarantees that logical replication slots > with failover enabled do not consume .." have mentioned it. > > > ====== > > src/backend/replication/slot.c > > > > 2. > > +/* > > + * A helper function to validate slots specified in GUC standby_slot_names. 
> > + */ > > +static bool > > +validate_standby_slots(char **newval) > > +{ > > + char *rawname; > > + List *elemlist; > > + bool ok; > > + > > + /* Need a modifiable copy of string */ > > + rawname = pstrdup(*newval); > > + > > + /* Verify syntax and parse string into a list of identifiers */ > > + ok = SplitIdentifierString(rawname, ',', &elemlist); > > + > > + if (!ok) > > + { > > + GUC_check_errdetail("List syntax is invalid."); > > + } > > + > > + /* > > + * If the replication slots' data have been initialized, verify if the > > + * specified slots exist and are logical slots. > > + */ > > + else if (ReplicationSlotCtl) > > + { > > + foreach_ptr(char, name, elemlist) > > + { > > + ReplicationSlot *slot; > > + > > + slot = SearchNamedReplicationSlot(name, true); > > + > > + if (!slot) > > + { > > + GUC_check_errdetail("replication slot \"%s\" does not exist", > > + name); > > + ok = false; > > + break; > > + } > > + > > + if (!SlotIsPhysical(slot)) > > + { > > + GUC_check_errdetail("\"%s\" is not a physical replication slot", > > + name); > > + ok = false; > > + break; > > + } > > + } > > + } > > + > > + pfree(rawname); > > + list_free(elemlist); > > + return ok; > > +} > > > > 2a. > > I didn't mention this previously because I thought this function was > > not going to change anymore, but since Bertrand suggested some changes > > [1], I will say IMO the { } are fine here for the single statement, > > but I think it will be better to rearrange this code to be like below. > > Having a 2nd NOP 'else' gives a much better place where you can put > > your ReplicationSlotCtl comment. > > > > if (!ok) > > { > > GUC_check_errdetail("List syntax is invalid."); > > } > > else if (!ReplicationSlotCtl) > > { > > <Insert big comment here about why it is OK to skip when > > ReplicationSlotCtl is NULL> > > } > > else > > { > > foreach_ptr ... > > } > > > > +1. This will make the code and reasoning to skip clear. Changed. > > Few additional comments on the latest patch: > ================================= > 1. > static XLogRecPtr > WalSndWaitForWal(XLogRecPtr loc) > { > ... > + if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr && > + (!replication_active || StandbyConfirmedFlush(loc, WARNING))) > + { > + /* > + * Fast path to avoid acquiring the spinlock in case we already know > + * we have enough WAL available and all the standby servers have > + * confirmed receipt of WAL up to RecentFlushPtr. This is particularly > + * interesting if we're far behind. > + */ > return RecentFlushPtr; > + } > ... > ... > + * Wait for WALs to be flushed to disk only if the postmaster has not > + * asked us to stop. > + */ > + if (loc > RecentFlushPtr && !got_STOPPING) > + wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL; > + > + /* > + * Check if the standby slots have caught up to the flushed position. > + * It is good to wait up to RecentFlushPtr and then let it send the > + * changes to logical subscribers one by one which are already covered > + * in RecentFlushPtr without needing to wait on every change for > + * standby confirmation. Note that after receiving the shutdown signal, > + * an ERROR is reported if any slots are dropped, invalidated, or > + * inactive. This measure is taken to prevent the walsender from > + * waiting indefinitely. 
> + */ > + else if (replication_active && > + !StandbyConfirmedFlush(RecentFlushPtr, WARNING)) > + { > + wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION; > + wait_for_standby = true; > + } > + else > + { > + /* Already caught up and doesn't need to wait for standby_slots. */ > break; > + } > ... > } > > Can we try to move these checks into a separate function that returns > a boolean and has an out parameter as wait_event? Refactored. > > 2. How about naming StandbyConfirmedFlush() as StandbySlotsAreCaughtup? I used a similar version based on the suggested name: StandbySlotsHaveCaughtup. Best Regards, Hou zj
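To make the refactoring discussed above easier to follow, here is a rough sketch of the shape such a helper could take: a boolean result plus a wait_event out parameter, checked with early returns. This is only an illustration based on the review comments and the quoted snippets (got_STOPPING, replication_active and the two wait events come from the quoted code); the actual v102 function may differ in detail, e.g. in how shutdown is prioritized.

/*
 * Sketch: report whether the walsender still needs to wait, and why,
 * via the wait_event out parameter.
 */
static bool
NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn,
                 uint32 *wait_event)
{
    /* Wait for more WAL to be flushed unless we were asked to stop. */
    if (target_lsn > flushed_lsn && !got_STOPPING)
    {
        *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL;
        return true;
    }

    /* Wait for the slots in standby_slot_names to confirm the flush. */
    if (replication_active && !StandbySlotsHaveCaughtup(flushed_lsn, WARNING))
    {
        *wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION;
        return true;
    }

    *wait_event = 0;
    return false;
}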
On Wed, Feb 28, 2024 at 4:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Feb 28, 2024 at 3:26 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > Here is the patch which addresses the above comments. Also optimized > > the test a little bit. Now we use pg_sync_replication_slots() function > > instead of worker to test the operator-redirection using search-patch. > > This has been done to simplify the test case and reduce the added > > time. > > > > I have slightly adjusted the comments in the attached, otherwise, LGTM. This patch was pushed (commit: b3f6b14) and it resulted in a BF failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-02-29%2012%3A49%3A27 The concerned log on standby1: 2024-02-29 14:23:16.738 UTC [3908:4] 040_standby_failover_slots_sync.pl LOG: statement: SELECT pg_sync_replication_slots(); The system cannot find the file specified. 2024-02-29 14:23:16.971 UTC [3908:5] 040_standby_failover_slots_sync.pl ERROR: could not connect to the primary server: connection to server at "127.0.0.1", port 65352 failed: FATAL: SSPI authentication failed for user "repl_role" 2024-02-29 14:23:16.971 UTC [3908:6] 040_standby_failover_slots_sync.pl STATEMENT: SELECT pg_sync_replication_slots(); It seems authentication is failing for the newly added role. We also see method=sspi used in the publisher log. We are analysing it further and will share the findings. thanks Shveta
On Friday, March 1, 2024 10:17 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > Attached is the V102 patch set, which addresses Amit's and Shveta's comments. > Thanks to Shveta for helping address the comments off-list. The cfbot reported a compile warning; here is a new version of the patch which fixes it. Also removed some outdated comments in this version. Best Regards, Hou zj
Attachment
Here are some review comments for v102-0001. ====== doc/src/sgml/config.sgml 1. + <para> + Lists the streaming replication standby server slot names that logical + WAL sender processes will wait for. Logical WAL sender processes will + send decoded changes to plugins only after the specified replication + slots confirm receiving WAL. This guarantees that logical replication + slots with failover enabled do not consume changes until those changes + are received and flushed to corresponding physical standbys. If a + logical replication connection is meant to switch to a physical standby + after the standby is promoted, the physical replication slot for the + standby should be listed here. Note that logical replication will not + proceed if the slots specified in the standby_slot_names do not exist or + are invalidated. + </para> Should this also mention the effect this GUC has on those 2 SQL functions? E.g. Commit message says: Additionally, The SQL functions pg_logical_slot_get_changes and pg_replication_slot_advance are modified to wait for the replication slots mentioned in 'standby_slot_names' to catch up before returning. ====== src/backend/replication/slot.c 2. validate_standby_slots + else if (!ReplicationSlotCtl) + { + /* + * We cannot validate the replication slot if the replication slots' + * data has not been initialized. This is ok as we will validate the + * specified slot when waiting for them to catch up. See + * StandbySlotsHaveCaughtup for details. + */ + } + else + { + /* + * If the replication slots' data have been initialized, verify if the + * specified slots exist and are logical slots. + */ + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); IMO that 2nd comment does not need to say "If the replication slots' data have been initialized," because that is implicit from the if/else. ~~~ 3. GetStandbySlotList +List * +GetStandbySlotList(void) +{ + if (RecoveryInProgress()) + return NIL; + else + return standby_slot_names_list; +} The 'else' is not needed. IMO is better without it, but YMMV. ~~~ 4. StandbySlotsHaveCaughtup +/* + * Return true if the slots specified in standby_slot_names have caught up to + * the given WAL location, false otherwise. + * + * The elevel parameter determines the error level used for logging messages + * related to slots that do not exist, are invalidated, or are inactive. + */ +bool +StandbySlotsHaveCaughtup(XLogRecPtr wait_for_lsn, int elevel) /determines/specifies/ ~ 5. + /* + * Don't need to wait for the standby to catch up if there is no value in + * standby_slot_names. + */ + if (!standby_slot_names_list) + return true; + + /* + * If we are on a standby server, we do not need to wait for the standby to + * catch up since we do not support syncing slots to cascading standbys. + */ + if (RecoveryInProgress()) + return true; + + /* + * Return true if all the standbys have already caught up to the passed in + * WAL localtion. + */ + if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) && + standby_slot_oldest_flush_lsn >= wait_for_lsn) + return true; 5a. I felt all these comments should be worded in a consistent way like "Don't need to wait ..." e.g. 1. Don't need to wait for the standbys to catch up if there is no value in 'standby_slot_names'. 2. Don't need to wait for the standbys to catch up if we are on a standby server, since we do not support syncing slots to cascading standbys. 3. Don't need to wait for the standbys to catch up if they are already beyond the specified WAL location. ~ 5b. typo /WAL localtion/WAL location/ ~~~ 6. 
+ if (!slot) + { + /* + * It may happen that the slot specified in standby_slot_names GUC + * value is dropped, so let's skip over it. + */ + ereport(elevel, + errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("replication slot \"%s\" specified in parameter %s does not exist", + name, "standby_slot_names")); + continue; + } Is "is dropped" strictly the only reason? IIUC another reason is the slot maybe just did not even exist in the first place but it was not detected before now because inititial validation was skipped. ~~~ 7. + /* + * Return false if not all the standbys have caught up to the specified WAL + * location. + */ + if (caught_up_slot_num != list_length(standby_slot_names_list)) + return false; Somehow it seems complicated to have a counter of the slots as you process then compare that counter to the list_length to determine if one of them was skipped. Probably simpler just to set a 'skipped' flag set whenever you do 'continue'... ====== src/backend/replication/walsender.c 8. +/* + * Returns true if there are not enough WALs to be read, or if not all standbys + * have caught up to the flushed position when failover_slot is true; + * otherwise, returns false. + * + * Set prioritize_stop to true to skip waiting for WALs if the shutdown signal + * is received. + * + * Set failover_slot to true if the current acquired slot is a failover enabled + * slot and we are streaming. + * + * If returning true, the function sets the appropriate wait event in + * wait_event; otherwise, wait_event is set to 0. + */ +static bool +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, + bool prioritize_stop, bool failover_slot, + uint32 *wait_event) 8a. /Set prioritize_stop to true/Specify prioritize_stop=true/ /Set failover_slot to true/Specify failover_slot=true/ ~ 8b. Aren't the static function names typically snake_case? ~~~ 9. + /* + * Check if we need to wait for WALs to be flushed to disk. We don't need + * to wait for WALs after receiving the shutdown signal unless the + * wait_for_wal_on_stop is true. + */ + if (target_lsn > flushed_lsn && !(prioritize_stop && got_STOPPING)) + *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL; The comment says 'wait_for_wal_on_stop' but the code says 'prioritize_stop' (??) ~~~ 10. + /* + * Check if we need to wait for WALs to be flushed to disk. We don't need + * to wait for WALs after receiving the shutdown signal unless the + * wait_for_wal_on_stop is true. + */ + if (target_lsn > flushed_lsn && !(prioritize_stop && got_STOPPING)) + *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL; + + /* + * Check if the standby slots have caught up to the flushed position. It is + * good to wait up to RecentFlushPtr and then let it send the changes to + * logical subscribers one by one which are already covered in + * RecentFlushPtr without needing to wait on every change for standby + * confirmation. Note that after receiving the shutdown signal, an ERROR is + * reported if any slots are dropped, invalidated, or inactive. This + * measure is taken to prevent the walsender from waiting indefinitely. + */ + else if (failover_slot && !StandbySlotsHaveCaughtup(flushed_lsn, elevel)) + *wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION; + else + return false; + + return true; This if/else/else seems overly difficult to read. 
IMO it will be easier if written like: SUGGESTION <comment> if (target_lsn > flushed_lsn && !(prioritize_stop && got_STOPPING)) { *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL; return true; } <comment> if (failover_slot && !StandbySlotsHaveCaughtup(flushed_lsn, elevel)) { *wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION; return true; } return false; ---------- Kind Regards, Peter Smith. Fujitsu Australia
On Fri, Mar 1, 2024 at 8:58 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Wed, Feb 28, 2024 at 4:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Feb 28, 2024 at 3:26 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > > Here is the patch which addresses the above comments. Also optimized > > > the test a little bit. Now we use pg_sync_replication_slots() function > > > instead of worker to test the operator-redirection using search-patch. > > > This has been done to simplify the test case and reduce the added > > > time. > > > > > > > I have slightly adjusted the comments in the attached, otherwise, LGTM. > > This patch was pushed (commit: b3f6b14) and it resulted in BF failure: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-02-29%2012%3A49%3A27 > Yeah, we forgot to allow proper authentication on Windows for non-superusers used in the test. We need to use: "auth_extra => [ '--create-role', 'repl_role' ])" in the test. See attached. I'll do some more testing with this and then push it. -- With Regards, Amit Kapila.
Attachment
On Fri, Mar 1, 2024 at 12:42 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, March 1, 2024 10:17 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > > > > Attach the V102 patch set which addressed Amit and Shveta's comments. > > Thanks Shveta for helping addressing the comments off-list. > > The cfbot reported a compile warning, here is the new version patch which fixed it, > Also removed some outdate comments in this version. > Thank you for updating the patch! I've reviewed the v102-0001 patch. Here are some comments: --- I got a compiler warning: walsender.c:1829:6: warning: variable 'wait_event' is used uninitialized whenever '&&' condition is false [-Wsometimes-uninitialized] if (!XLogRecPtrIsInvalid(RecentFlushPtr) && ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ walsender.c:1871:7: note: uninitialized use occurs here if (wait_event == WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL) ^~~~~~~~~~ walsender.c:1829:6: note: remove the '&&' if its condition is always true if (!XLogRecPtrIsInvalid(RecentFlushPtr) && ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ walsender.c:1818:20: note: initialize the variable 'wait_event' to silence this warning uint32 wait_event; ^ = 0 1 warning generated. --- +void +assign_standby_slot_names(const char *newval, void *extra) +{ + List *standby_slots; + MemoryContext oldcxt; + char *standby_slot_names_cpy = extra; + Given that the newval and extra have the same data (standby_slot_names value), why do we not use newval instead? I think that if we use newval, we don't need to guc_strdup() in check_standby_slot_names(), we might need to do list_copy_deep() instead, though. It's not clear to me as there is no comment. --- + + standby_slot_oldest_flush_lsn = min_restart_lsn; + IIUC we expect that standby_slot_oldest_flush_lsn never moves backward. If so, I think it's better to have an assertion here. --- Resetting standby_slot_names doesn't work: =# alter system set standby_slot_names to ''; ERROR: invalid value for parameter "standby_slot_names": """" DETAIL: replication slot "" does not exist --- + /* + * Switch to the memory context under which GUC variables are allocated + * (GUCMemoryContext). + */ + oldcxt = MemoryContextSwitchTo(GetMemoryChunkContext(standby_slot_names_cpy)); + standby_slot_names_list = list_copy(standby_slots); + MemoryContextSwitchTo(oldcxt); Why do we not explicitly switch to GUCMemoryContext? --- + if (!standby_slot_names_list) + return true; + Probably "standby_slot_names_list == NIL" is more consistent with other places. The same can be applied in WaitForStandbyConfirmation(). --- + /* + * Return true if all the standbys have already caught up to the passed in + * WAL localtion. + */ + s/localtion/location/ --- I was a bit surprised by the fact that standby_slot_names value is handled in a different way than a similar parameter synchronous_standby_names. For example, the following command doesn't work unless there is a replication slot 'slot1, slot2': =# alter system set standby_slot_names to 'slot1, slot2'; ERROR: invalid value for parameter "standby_slot_names": ""slot1, slot2"" DETAIL: replication slot "slot1, slot2" does not exist Whereas "alter system set synchronous_standby_names to stb1, stb2;" can correctly separate the string into 'stb1' and 'stb2'. Probably it would be okay since this behavior of standby_slot_names is the same as other GUC parameters that accept a comma-separated string. But I was confused a bit the first time I used it. 
--- + /* + * "*" is not accepted as in that case primary will not be able to know + * for which all standbys to wait for. Even if we have physical slots + * info, there is no way to confirm whether there is any standby + * configured for the known physical slots. + */ + if (strcmp(*newval, "*") == 0) + { + GUC_check_errdetail("\"*\" is not accepted for standby_slot_names"); + return false; + } Why only '*' is checked aside from validate_standby_slots()? I think that the doc doesn't mention anything about '*' and '*' cannot be used as a replication slot name. So even if we don't have this check, it might be no problem. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Mar 1, 2024 at 5:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > ... > + /* > + * "*" is not accepted as in that case primary will not be able to know > + * for which all standbys to wait for. Even if we have physical slots > + * info, there is no way to confirm whether there is any standby > + * configured for the known physical slots. > + */ > + if (strcmp(*newval, "*") == 0) > + { > + GUC_check_errdetail("\"*\" is not accepted for > standby_slot_names"); > + return false; > + } > > Why only '*' is checked aside from validate_standby_slots()? I think > that the doc doesn't mention anything about '*' and '*' cannot be used > as a replication slot name. So even if we don't have this check, it > might be no problem. > Hi, a while ago I asked this same question. See [1 #28] for the response.. ---------- [1] https://www.postgresql.org/message-id/OS0PR01MB571646B8186F6A06404BD50194BDA%40OS0PR01MB5716.jpnprd01.prod.outlook.com Kind Regards, Peter Smith. Fujitsu Australia
On Fri, Mar 1, 2024 at 9:53 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v102-0001. > > > 7. > + /* > + * Return false if not all the standbys have caught up to the specified WAL > + * location. > + */ > + if (caught_up_slot_num != list_length(standby_slot_names_list)) > + return false; > > Somehow it seems complicated to have a counter of the slots as you > process then compare that counter to the list_length to determine if > one of them was skipped. > > Probably simpler just to set a 'skipped' flag set whenever you do 'continue'... > The other thing is do we need to continue when we find some slot that can't be processed? If not, then we can simply set the boolean flag, break the loop, and return false if the boolean is set after releasing the LWLock. The other way is we simply release lock whenever we need to skip the slot and just return false. -- With Regards, Amit Kapila.
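To illustrate the simplification being suggested here, the loop could bail out on the first slot that cannot be waited on, releasing the lock before returning, instead of counting the slots that did catch up. A rough sketch follows; the exact per-slot conditions (in_use, restart_lsn) are assumptions modelled on the quoted patch rather than a copy of it.

/*
 * Sketch of an early-exit variant of StandbySlotsHaveCaughtup(): return
 * false as soon as one slot in standby_slot_names cannot be waited on or
 * has not yet reached wait_for_lsn.
 */
static bool
StandbySlotsHaveCaughtupSketch(XLogRecPtr wait_for_lsn)
{
    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

    foreach_ptr(char, name, standby_slot_names_list)
    {
        ReplicationSlot *slot = SearchNamedReplicationSlot(name, false);
        XLogRecPtr  restart_lsn;

        /* dropped, recreated as logical, or otherwise unusable slot */
        if (!slot || !slot->in_use || SlotIsLogical(slot))
        {
            LWLockRelease(ReplicationSlotControlLock);
            return false;
        }

        SpinLockAcquire(&slot->mutex);
        restart_lsn = slot->data.restart_lsn;
        SpinLockRelease(&slot->mutex);

        if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn < wait_for_lsn)
        {
            LWLockRelease(ReplicationSlotControlLock);
            return false;
        }
    }

    LWLockRelease(ReplicationSlotControlLock);
    return true;
}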
Hi, On Fri, Mar 01, 2024 at 03:22:55PM +1100, Peter Smith wrote: > Here are some review comments for v102-0001. > > ====== > doc/src/sgml/config.sgml > > 1. > + <para> > + Lists the streaming replication standby server slot names that logical > + WAL sender processes will wait for. Logical WAL sender processes will > + send decoded changes to plugins only after the specified replication > + slots confirm receiving WAL. This guarantees that logical replication > + slots with failover enabled do not consume changes until those changes > + are received and flushed to corresponding physical standbys. If a > + logical replication connection is meant to switch to a physical standby > + after the standby is promoted, the physical replication slot for the > + standby should be listed here. Note that logical replication will not > + proceed if the slots specified in the standby_slot_names do > not exist or > + are invalidated. > + </para> > > Should this also mention the effect this GUC has on those 2 SQL > functions? E.g. Commit message says: > > Additionally, The SQL functions pg_logical_slot_get_changes and > pg_replication_slot_advance are modified to wait for the replication > slots mentioned in 'standby_slot_names' to catch up before returning. I think that's also true for all the ones that rely on pg_logical_slot_get_changes_guts(), means: - pg_logical_slot_get_changes - pg_logical_slot_peek_changes - pg_logical_slot_get_binary_changes - pg_logical_slot_peek_binary_changes Not sure it's worth to mention the "binary" ones though as their doc mention they behave as their "non binary" counterpart. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Friday, March 1, 2024 2:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Mar 1, 2024 at 12:42 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > On Friday, March 1, 2024 10:17 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > Attach the V102 patch set which addressed Amit and Shveta's comments. > > > Thanks Shveta for helping addressing the comments off-list. > > > > The cfbot reported a compile warning, here is the new version patch > > which fixed it, Also removed some outdate comments in this version. > > > > I've reviewed the v102-0001 patch. Here are some comments: Thanks for the comments ! > > --- > I got a compiler warning: > > walsender.c:1829:6: warning: variable 'wait_event' is used uninitialized > whenever '&&' condition is false [-Wsometimes-uninitialized] > if (!XLogRecPtrIsInvalid(RecentFlushPtr) && > ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > walsender.c:1871:7: note: uninitialized use occurs here > if (wait_event == > WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL) > ^~~~~~~~~~ > walsender.c:1829:6: note: remove the '&&' if its condition is always true > if (!XLogRecPtrIsInvalid(RecentFlushPtr) && > ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > walsender.c:1818:20: note: initialize the variable 'wait_event' to silence this > warning > uint32 wait_event; > ^ > = 0 > 1 warning generated. Thanks for reporting, it was fixed in V102_2. > > --- > +void > +assign_standby_slot_names(const char *newval, void *extra) { > + List *standby_slots; > + MemoryContext oldcxt; > + char *standby_slot_names_cpy = extra; > + > > Given that the newval and extra have the same data (standby_slot_names > value), why do we not use newval instead? I think that if we use > newval, we don't need to guc_strdup() in check_standby_slot_names(), > we might need to do list_copy_deep() instead, though. It's not clear > to me as there is no comment. I think SplitIdentifierString will modify the passed in string, so we'd better not pass the newval to it, otherwise the stored guc string(standby_slot_names) will be changed. I can see we are doing similar thing in other GUC check/assign function as well. (check_wal_consistency_checking/ assign_wal_consistency_checking, check_createrole_self_grant/ assign_createrole_self_grant ...). > --- > + /* > + * Switch to the memory context under which GUC variables are > allocated > + * (GUCMemoryContext). > + */ > + oldcxt = > MemoryContextSwitchTo(GetMemoryChunkContext(standby_slot_names_cpy > )); > + standby_slot_names_list = list_copy(standby_slots); > + MemoryContextSwitchTo(oldcxt); > > Why do we not explicitly switch to GUCMemoryContext? I think it's because the GUCMemoryContext is not exposed. Best Regards, Hou zj
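For anyone not familiar with the GUC hook pattern being discussed: SplitIdentifierString() writes into the string it is given (it truncates it at each separator), which is why the check hook works on a pstrdup() copy rather than on *newval, and hands a separate guc_strdup() copy to the assign hook via *extra. A stripped-down sketch is below; the per-slot validation is omitted and check_slot_names_sketch is a made-up name.

/* Sketch of the check-hook pattern described above. */
static bool
check_slot_names_sketch(char **newval, void **extra, GucSource source)
{
    char       *rawname = pstrdup(*newval); /* modifiable copy */
    List       *elemlist;
    bool        ok;

    /* SplitIdentifierString() scribbles on rawname, not on *newval */
    ok = SplitIdentifierString(rawname, ',', &elemlist);
    if (!ok)
        GUC_check_errdetail("List syntax is invalid.");

    /* ... per-slot validation of elemlist would go here ... */

    /* give the assign hook its own long-lived copy of the raw value */
    if (ok)
        *extra = guc_strdup(ERROR, *newval);

    pfree(rawname);
    list_free(elemlist);
    return ok;
}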
On Thu, Feb 29, 2024 at 12:34 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
On Monday, February 26, 2024 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 26, 2024 at 7:49 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com>
> wrote:
> >
> > Attach the V98 patch set which addressed above comments.
> >
>
> Few comments:
> =============
> 1.
> WalSndWaitForWal(XLogRecPtr loc)
> {
> int wakeEvents;
> + bool wait_for_standby = false;
> + uint32 wait_event;
> + List *standby_slots = NIL;
> static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr;
>
> + if (MyReplicationSlot->data.failover && replication_active)
> + standby_slots = GetStandbySlotList(true);
> +
> /*
> - * Fast path to avoid acquiring the spinlock in case we already know we
> - * have enough WAL available. This is particularly interesting if we're
> - * far behind.
> + * Check if all the standby servers have confirmed receipt of WAL up to
> + * RecentFlushPtr even when we already know we have enough WAL available.
> + *
> + * Note that we cannot directly return without checking the status of
> + * standby servers because the standby_slot_names may have changed,
> + which
> + * means there could be new standby slots in the list that have not yet
> + * caught up to the RecentFlushPtr.
> */
> - if (RecentFlushPtr != InvalidXLogRecPtr &&
> - loc <= RecentFlushPtr)
> - return RecentFlushPtr;
> + if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr) {
> + FilterStandbySlots(RecentFlushPtr, &standby_slots);
>
> I think even if the slot list is not changed, we will always process each slot
> mentioned in standby_slot_names once. Can't we cache the previous list of
> slots for we have already waited for? In that case, we won't even need to copy
> the list via GetStandbySlotList() unless we need to wait.
>
> 2.
> WalSndWaitForWal(XLogRecPtr loc)
> {
> + /*
> + * Update the standby slots that have not yet caught up to the flushed
> + * position. It is good to wait up to RecentFlushPtr and then let it
> + * send the changes to logical subscribers one by one which are
> + * already covered in RecentFlushPtr without needing to wait on every
> + * change for standby confirmation.
> + */
> + if (wait_for_standby)
> + FilterStandbySlots(RecentFlushPtr, &standby_slots);
> +
> /* Update our idea of the currently flushed position. */
> - if (!RecoveryInProgress())
> + else if (!RecoveryInProgress())
> RecentFlushPtr = GetFlushRecPtr(NULL);
> else
> RecentFlushPtr = GetXLogReplayRecPtr(NULL); ...
> /*
> * If postmaster asked us to stop, don't wait anymore.
> *
> * It's important to do this check after the recomputation of
> * RecentFlushPtr, so we can send all remaining data before shutting
> * down.
> */
> if (got_STOPPING)
> break;
>
> I think because 'wait_for_standby' may not be set in the first or consecutive
> cycles we may send the WAL to the logical subscriber before sending it to the
> physical subscriber during shutdown.
Here is the v101 patch set which addresses the above comments.
This version caches the oldest standby slot's LSN each time we wait for
them to catch up. The cached LSN is invalidated when the GUC config is reloaded.
In the WalSndWaitForWal function, instead of traversing the entire standby list
each time, we can check the cached LSN to quickly determine if the standbys
have caught up. When a shutdown signal is received, we continue to wait for the
standby slots to catch up. When waiting for the standbys to catch up after
receiving the shutdown signal, an ERROR is reported if any slots are dropped,
invalidated, or inactive. This measure is taken to prevent the walsender from
waiting indefinitely.
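For readers skimming the quoted description, the caching it describes boils down to roughly the sketch below. The variable name standby_slot_oldest_flush_lsn matches the one discussed elsewhere in this thread; the helper name is made up, and in later versions of the patch this check sits inside StandbySlotsHaveCaughtup().

/*
 * Sketch of the caching described above: remember the oldest confirmed
 * LSN among the slots in standby_slot_names, and reset it to
 * InvalidXLogRecPtr whenever the GUC is reloaded.  WalSndWaitForWal()
 * can then skip walking the slot list when the cached value already
 * covers the location it is waiting for.
 */
static XLogRecPtr standby_slot_oldest_flush_lsn = InvalidXLogRecPtr;

static bool
StandbysKnownCaughtUp(XLogRecPtr wait_for_lsn)
{
    return !XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) &&
        standby_slot_oldest_flush_lsn >= wait_for_lsn;
}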
Thanks for the patch.
I ran some performance tests on patch v101 with synchronous_commit turned on to check how much the logical replication changes affect transaction speed on the primary compared to HEAD. In all configurations there is a primary, a logical subscriber and a physical standby, and the logical subscriber is listed in synchronous_standby_names. This means transactions on the primary will not be committed until the logical subscriber has confirmed receipt of the transaction.
Machine details:
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 800GB RAM
My additional configuration on each instance is:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
All tests are done using pgbench running for 15 minutes:
Creating tables: pgbench -p 6972 postgres -qis 2
Running benchmark: pgbench postgres -p 6972 -c 10 -j 3 -T 900 -P 5
HEAD code-
Primary with Synchronous_commit=on, physical standby with hot_standby_feedback=off
RUN1 (TPS) RUN2 (TPS) AVERAGE (TPS)
8.226658 8.17815 8.202404
HEAD code-
Primary with Synchronous_commit=on, physical standby with hot_standby_feedback=on
RUN1 (TPS) RUN2 (TPS) AVERAGE (TPS)
8.134901 8.229066 8.1819835 -- degradation from first config -0.25%
PATCHED code - (v101-0001)
Primary with synchronous_commit=on, physical standby with hot_standby_feedback=on, standby_slot_names not configured, logical subscriber not failover enabled, physical standby not configured for sync
RUN1 (TPS) RUN2 (TPS) AVERAGE (TPS)
8.18839 8.18839 8.18839 -- degradation from first config -0.17%
PATCHED code - (v98-0001)
Synchronous_commit=on, hot_standby_feedback=on, standby_slot_names configured to physical standby, logical subscriber failover enabled, physical standby configured for sync
RUN1 (TPS) RUN2 (TPS) AVERAGE (TPS)
8.173062 8.068536 8.120799 -- degradation from first config -0.99%
Overall, I do not see any significant performance degradation with the patch and sync-slot enabled with one logical subscriber and one physical standby.
Attaching script for my final test configuration for reference.
regards,
Ajin Cherian
Fujitsu Australia
Attachment
On Fri, Mar 1, 2024 at 11:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Mar 1, 2024 at 12:42 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > --- > I was a bit surprised by the fact that standby_slot_names value is > handled in a different way than a similar parameter > synchronous_standby_names. For example, the following command doesn't > work unless there is a replication slot 'slot1, slot2': > > =# alter system set standby_slot_names to 'slot1, slot2'; > ERROR: invalid value for parameter "standby_slot_names": ""slot1, slot2"" > DETAIL: replication slot "slot1, slot2" does not exist > > Whereas "alter system set synchronous_standby_names to stb1, stb2;" > can correctly separate the string into 'stb1' and 'stb2'. > > Probably it would be okay since this behavior of standby_slot_names is > the same as other GUC parameters that accept a comma-separated string. > But I was confused a bit the first time I used it. > I think it is better to keep the behavior in this regard the same as 'synchronous_standby_names' because both have similarities w.r.t replication. -- With Regards, Amit Kapila.
On Friday, March 1, 2024 12:23 PM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v102-0001. > > ====== > doc/src/sgml/config.sgml > > 1. > + <para> > + Lists the streaming replication standby server slot names that logical > + WAL sender processes will wait for. Logical WAL sender processes will > + send decoded changes to plugins only after the specified replication > + slots confirm receiving WAL. This guarantees that logical replication > + slots with failover enabled do not consume changes until those > changes > + are received and flushed to corresponding physical standbys. If a > + logical replication connection is meant to switch to a physical standby > + after the standby is promoted, the physical replication slot for the > + standby should be listed here. Note that logical replication will not > + proceed if the slots specified in the standby_slot_names do > not exist or > + are invalidated. > + </para> > > Should this also mention the effect this GUC has on those 2 SQL functions? E.g. > Commit message says: > > Additionally, The SQL functions pg_logical_slot_get_changes and > pg_replication_slot_advance are modified to wait for the replication slots > mentioned in 'standby_slot_names' to catch up before returning. Added. > > ====== > src/backend/replication/slot.c > > 2. validate_standby_slots > > + else if (!ReplicationSlotCtl) > + { > + /* > + * We cannot validate the replication slot if the replication slots' > + * data has not been initialized. This is ok as we will validate the > + * specified slot when waiting for them to catch up. See > + * StandbySlotsHaveCaughtup for details. > + */ > + } > + else > + { > + /* > + * If the replication slots' data have been initialized, verify if the > + * specified slots exist and are logical slots. > + */ > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > > IMO that 2nd comment does not need to say "If the replication slots' > data have been initialized," because that is implicit from the if/else. Removed. > > ~~~ > > 3. GetStandbySlotList > > +List * > +GetStandbySlotList(void) > +{ > + if (RecoveryInProgress()) > + return NIL; > + else > + return standby_slot_names_list; > +} > > The 'else' is not needed. IMO is better without it, but YMMV. Removed. > > ~~~ > > 4. StandbySlotsHaveCaughtup > > +/* > + * Return true if the slots specified in standby_slot_names have caught > +up to > + * the given WAL location, false otherwise. > + * > + * The elevel parameter determines the error level used for logging > +messages > + * related to slots that do not exist, are invalidated, or are inactive. > + */ > +bool > +StandbySlotsHaveCaughtup(XLogRecPtr wait_for_lsn, int elevel) > > /determines/specifies/ Changed. > > ~ > > 5. > + /* > + * Don't need to wait for the standby to catch up if there is no value > + in > + * standby_slot_names. > + */ > + if (!standby_slot_names_list) > + return true; > + > + /* > + * If we are on a standby server, we do not need to wait for the > + standby to > + * catch up since we do not support syncing slots to cascading standbys. > + */ > + if (RecoveryInProgress()) > + return true; > + > + /* > + * Return true if all the standbys have already caught up to the passed > + in > + * WAL localtion. > + */ > + if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) && > + standby_slot_oldest_flush_lsn >= wait_for_lsn) return true; > > > 5a. > I felt all these comments should be worded in a consistent way like "Don't need > to wait ..." > > e.g. > 1. 
Don't need to wait for the standbys to catch up if there is no value in > 'standby_slot_names'. > 2. Don't need to wait for the standbys to catch up if we are on a standby server, > since we do not support syncing slots to cascading standbys. > 3. Don't need to wait for the standbys to catch up if they are already beyond > the specified WAL location. Changed. > > ~ > > 5b. > typo > /WAL localtion/WAL location/ > Fixed. > ~~~ > > 6. > + if (!slot) > + { > + /* > + * It may happen that the slot specified in standby_slot_names GUC > + * value is dropped, so let's skip over it. > + */ > + ereport(elevel, > + errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("replication slot \"%s\" specified in parameter %s does not exist", > + name, "standby_slot_names")); > + continue; > + } > > Is "is dropped" strictly the only reason? IIUC another reason is the slot maybe > just did not even exist in the first place but it was not detected before now > because inititial validation was skipped. Changed the comment. > > ~~~ > > 7. > + /* > + * Return false if not all the standbys have caught up to the specified > + WAL > + * location. > + */ > + if (caught_up_slot_num != list_length(standby_slot_names_list)) > + return false; > > Somehow it seems complicated to have a counter of the slots as you process > then compare that counter to the list_length to determine if one of them was > skipped. > > Probably simpler just to set a 'skipped' flag set whenever you do 'continue'... > I prefer the current style because we need to set skipped =true in multiple places which doesn't seem better to me. > ====== > src/backend/replication/walsender.c > > 8. > +/* > + * Returns true if there are not enough WALs to be read, or if not all > +standbys > + * have caught up to the flushed position when failover_slot is true; > + * otherwise, returns false. > + * > + * Set prioritize_stop to true to skip waiting for WALs if the shutdown > +signal > + * is received. > + * > + * Set failover_slot to true if the current acquired slot is a failover > +enabled > + * slot and we are streaming. > + * > + * If returning true, the function sets the appropriate wait event in > + * wait_event; otherwise, wait_event is set to 0. > + */ > +static bool > +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, bool > +prioritize_stop, bool failover_slot, > + uint32 *wait_event) > > 8a. > /Set prioritize_stop to true/Specify prioritize_stop=true/ > > /Set failover_slot to true/Specify failover_slot=true/ This function has been refactored a bit. > > ~ > > 8b. > Aren't the static function names typically snake_case? I think the current name style is more consistent with the other functions in walsender.c. > > ~~~ > > 9. > + /* > + * Check if we need to wait for WALs to be flushed to disk. We don't > + need > + * to wait for WALs after receiving the shutdown signal unless the > + * wait_for_wal_on_stop is true. > + */ > + if (target_lsn > flushed_lsn && !(prioritize_stop && got_STOPPING)) > + *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL; > > The comment says 'wait_for_wal_on_stop' but the code says 'prioritize_stop' > (??) This has been removed in last version. > ~~~ > > 10. > + /* > + * Check if we need to wait for WALs to be flushed to disk. We don't > + need > + * to wait for WALs after receiving the shutdown signal unless the > + * wait_for_wal_on_stop is true. 
> + */ > + if (target_lsn > flushed_lsn && !(prioritize_stop && got_STOPPING)) > + *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL; > + > + /* > + * Check if the standby slots have caught up to the flushed position. > + It is > + * good to wait up to RecentFlushPtr and then let it send the changes > + to > + * logical subscribers one by one which are already covered in > + * RecentFlushPtr without needing to wait on every change for standby > + * confirmation. Note that after receiving the shutdown signal, an > + ERROR is > + * reported if any slots are dropped, invalidated, or inactive. This > + * measure is taken to prevent the walsender from waiting indefinitely. > + */ > + else if (failover_slot && !StandbySlotsHaveCaughtup(flushed_lsn, > + elevel)) *wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION; > + else > + return false; > + > + return true; > > This if/else/else seems overly difficult to read. IMO it will be easier if written > like: > > SUGGESTION > <comment> > if (target_lsn > flushed_lsn && !(prioritize_stop && got_STOPPING)) { > *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL; > return true; > } > > <comment> > if (failover_slot && !StandbySlotsHaveCaughtup(flushed_lsn, elevel)) { > *wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION; > return true; > } > > return false; Changed. Attach the V103 patch set which addressed above comments and Sawada-san's comment[1]. Apart from the comments, the code in WalSndWaitForWal was refactored a bit to make it neater. Thanks Shveta for helping writing the code and doc. [1] https://www.postgresql.org/message-id/CAD21AoBhty79zHgXYMNHH1KqO2VtmjRi22QPmYP2yaHC9WFVdw%40mail.gmail.com Best Regards, Hou zj
Attachment
On Thursday, February 29, 2024 11:16 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Wednesday, February 28, 2024 7:36 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > 4 === > > > > Regarding the test, what about adding one to test the "new" behavior > > discussed up-thread? (logical replication will wait if slot mentioned > > in standby_slot_names is dropped and/or does not exist when the engine > > starts?) > > Will think about this. I think we could add a test to check the warning message for a dropped slot, but since similar wait/warning functionality is already tested, I prefer to leave this for now and consider it again after the main patch and tests have stabilized, given the previous experience of buildfarm (BF) instability with this feature. Best Regards, Hou zj
On Sat, Mar 2, 2024 at 9:21 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Apart from the comments, the code in WalSndWaitForWal was refactored > a bit to make it neater. Thanks Shveta for helping writing the code and doc. > A few more comments: ================== 1. +# Wait until the primary server logs a warning indicating that it is waiting +# for the sb1_slot to catch up. +$primary->wait_for_log( + qr/replication slot \"sb1_slot\" specified in parameter standby_slot_names does not have active_pid/, + $offset); Shouldn't we wait for such a LOG even in the first test as well which involves two standbys and two logical subscribers? 2. +################################################## +# Test that logical replication will wait for the user-created inactive +# physical slot to catch up until we remove the slot from standby_slot_names. +################################################## I don't see anything different tested in this test from what we already tested in the first test involving two standbys and two logical subscribers. Can you please clarify if I am missing something? 3. Note that after receiving the shutdown signal, an ERROR + * is reported if any slots are dropped, invalidated, or inactive. This + * measure is taken to prevent the walsender from waiting indefinitely. + */ + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) Isn't this part of the comment should be moved inside NeedToWaitForStandby()? 4. + /* + * Update our idea of the currently flushed position only if we are + * not waiting for standbys to catch up, otherwise the standby would + * have to catch up to a newer WAL location in each cycle. + */ + if (wait_event != WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION) + { This functionality (in function WalSndWaitForWal()) seems to ensure that we first wait for the required WAL to be flushed and then wait for standbys. If true, we should cover that point in the comments here or somewhere in the function WalSndWaitForWal(). Apart from this, I have made a few modifications in the comments. -- With Regards, Amit Kapila.
Attachment
On Sat, Mar 2, 2024 at 2:51 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, March 1, 2024 12:23 PM Peter Smith <smithpb2250@gmail.com> wrote: > > ... > > ====== > > src/backend/replication/slot.c > > > > 2. validate_standby_slots > > > > + else if (!ReplicationSlotCtl) > > + { > > + /* > > + * We cannot validate the replication slot if the replication slots' > > + * data has not been initialized. This is ok as we will validate the > > + * specified slot when waiting for them to catch up. See > > + * StandbySlotsHaveCaughtup for details. > > + */ > > + } > > + else > > + { > > + /* > > + * If the replication slots' data have been initialized, verify if the > > + * specified slots exist and are logical slots. > > + */ > > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > > > > IMO that 2nd comment does not need to say "If the replication slots' > > data have been initialized," because that is implicit from the if/else. > > Removed. I only meant to suggest removing the 1st part, not the entire comment. I thought it is still useful to have a comment like: /* Check that the specified slots exist and are logical slots.*/ ====== And, here are some review comments for v103-0001. ====== Commit message 1. Additionally, The SQL functions pg_logical_slot_get_changes, pg_logical_slot_peek_changes and pg_replication_slot_advance are modified to wait for the replication slots mentioned in 'standby_slot_names' to catch up before returning. ~ (use the same wording as previous in this message) /mentioned in/specified in/ ====== doc/src/sgml/config.sgml 2. + Additionally, when using the replication management functions + <link linkend="pg-replication-slot-advance"> + <function>pg_replication_slot_advance</function></link>, + <link linkend="pg-logical-slot-get-changes"> + <function>pg_logical_slot_get_changes</function></link>, and + <link linkend="pg-logical-slot-peek-changes"> + <function>pg_logical_slot_peek_changes</function></link>, + with failover enabled logical slots, the functions will wait for the + physical slots specified in <varname>standby_slot_names</varname> to + confirm WAL receipt before proceeding. + </para> It says "the ... functions" twice so maybe reword it slightly. BEFORE Additionally, when using the replication management functions pg_replication_slot_advance, pg_logical_slot_get_changes, and pg_logical_slot_peek_changes, with failover enabled logical slots, the functions will wait for the physical slots specified in standby_slot_names to confirm WAL receipt before proceeding. SUGGESTION Additionally, the replication management functions pg_replication_slot_advance, pg_logical_slot_get_changes, and pg_logical_slot_peek_changes, when used with failover enabled logical slots, will wait for the physical slots specified in standby_slot_names to confirm WAL receipt before proceeding. (Actually the "will wait ... before proceeding" is also a bit tricky, so below is another possible rewording) SUGGESTION #2 Additionally, the replication management functions pg_replication_slot_advance, pg_logical_slot_get_changes, and pg_logical_slot_peek_changes, when used with failover enabled logical slots, will block until all physical slots specified in standby_slot_names have confirmed WAL receipt. ~~~ 3. + <note> + <para> + Value <literal>*</literal> is not accepted as it is inappropriate to + block logical replication for physical slots that either lack + associated standbys or have standbys associated that are not enabled + for replication slot synchronization. 
(see + <xref linkend="logicaldecoding-replication-slots-synchronization"/>). + </para> + </note> Why does the document need to provide an excuse/reason for the rule? You could just say something like: SUGGESTION The slots must be named explicitly. For example, specifying wildcard values like <literal>*</literal> is not permitted. ====== doc/src/sgml/func.sgml 4. @@ -28150,7 +28150,7 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset </row> <row> - <entry role="func_table_entry"><para role="func_signature"> + <entry id="pg-logical-slot-get-changes" role="func_table_entry"><para role="func_signature"> <indexterm> <primary>pg_logical_slot_get_changes</primary> </indexterm> @@ -28177,7 +28177,7 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset </row> <row> - <entry role="func_table_entry"><para role="func_signature"> + <entry id="pg-logical-slot-peek-changes" role="func_table_entry"><para role="func_signature"> <indexterm> <primary>pg_logical_slot_peek_changes</primary> </indexterm> @@ -28232,7 +28232,7 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset </row> <row> - <entry role="func_table_entry"><para role="func_signature"> + <entry id="pg-replication-slot-advance" role="func_table_entry"><para role="func_signature"> <indexterm> <primary>pg_replication_slot_advance</primary> </indexterm> Should these 3 functions say something about how their behaviour is affected by 'standby_slot_names' and give a link back to the GUC 'standby_slot_names' docs? ====== src/backend/replication/slot.c 5. StandbySlotsHaveCaughtup + if (!slot) + { + /* + * If the provided slot does not exist, report a message and exit + * the loop. It is possible for a user to specify a slot name in + * standby_slot_names that does not exist just before the server + * startup. The GUC check_hook(validate_standby_slots) cannot + * validate such a slot during startup as the ReplicationSlotCtl + * shared memory is not initialized at that time. It is also + * possible for a user to drop the slot in standby_slot_names + * afterwards. + */ 5a. Minor rewording to make this code comment more similar to the next one: SUGGESTION If a slot name provided in standby_slot_names does not exist, report a message and exit the loop. A user can specify a slot name that does not exist just before the server startup... ~ 5b. + /* + * If a logical slot name is provided in standby_slot_names, + * report a message and exit the loop. Similar to the non-existent + * case, it is possible for a user to specify a logical slot name + * in standby_slot_names before the server startup, or drop an + * existing physical slot and recreate a logical slot with the + * same name. + */ /it is possible for a user to specify/a user can specify/ ~~~ 6. WaitForStandbyConfirmation + /* + * We wait for the slots in the standby_slot_names to catch up, but we + * use a timeout (1s) so we can also check if the standby_slot_names + * has been changed. + */ Remove some of the "we". SUGGESTION Wait for the slots in the standby_slot_names to catch up, but use a timeout (1s) so we can also check if the standby_slot_names has been changed. ====== src/backend/replication/walsender.c 7. NeedToWaitForStandby +/* + * Returns true if not all standbys have caught up to the flushed position + * (flushed_lsn) when failover_slot is true; otherwise, returns false. + * + * If returning true, the function sets the appropriate wait event in + * wait_event; otherwise, wait_event is set to 0. 
+ */ The function comment refers to 'failover_slot' but IMO needs to be worded differently because 'failover_slot' is not a parameter anymore. ~~~ 8. NeedToWaitForWal +/* + * Returns true if we need to wait for WALs to be flushed to disk, or if not + * all standbys have caught up to the flushed position (flushed_lsn) when + * failover_slot is true; otherwise, returns false. + * + * If returning true, the function sets the appropriate wait event in + * wait_event; otherwise, wait_event is set to 0. + */ +static bool +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, + uint32 *wait_event) Same as above -- Now that 'failover_slot' is not a function parameter, I thought this should be reworded. ~~~ 9. NeedToWaitForWal + /* + * Check if the standby slots have caught up to the flushed position. It + * is good to wait up to flushed position and then let it send the changes + * to logical subscribers one by one which are already covered in flushed + * position without needing to wait on every change for standby + * confirmation. Note that after receiving the shutdown signal, an ERROR + * is reported if any slots are dropped, invalidated, or inactive. This + * measure is taken to prevent the walsender from waiting indefinitely. + */ + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) + return true; I felt it was confusing things for this function to also call to the other one -- it seems an overlapping/muddling of the purpose of these. I think it will be easier to understand if the calling code just calls to one or both of these functions as required. ~~~ 10. - WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL); + WalSndWait(wakeEvents, sleeptime, wait_event); Tracing the assignments of the 'wait_event' is a bit tricky... IIUC it should not be 0 when we got to this point, so maybe it is better to put Assert(wait_event) before this call? ---------- Kind Regards, Peter Smith. Fujitsu Australia
On Sun, Mar 3, 2024 at 5:17 AM Peter Smith <smithpb2250@gmail.com> wrote: > > ~~~ > > 3. > + <note> > + <para> > + Value <literal>*</literal> is not accepted as it is inappropriate to > + block logical replication for physical slots that either lack > + associated standbys or have standbys associated that are not enabled > + for replication slot synchronization. (see > + <xref linkend="logicaldecoding-replication-slots-synchronization"/>). > + </para> > + </note> > > Why does the document need to provide an excuse/reason for the rule? > You could just say something like: > > SUGGESTION > The slots must be named explicitly. For example, specifying wildcard > values like <literal>*</literal> is not permitted. > I would like to document the reason somewhere, if not in docs, then let's write a comment for the same in code. > ====== > > ~~~ > > 9. NeedToWaitForWal > > + /* > + * Check if the standby slots have caught up to the flushed position. It > + * is good to wait up to flushed position and then let it send the changes > + * to logical subscribers one by one which are already covered in flushed > + * position without needing to wait on every change for standby > + * confirmation. Note that after receiving the shutdown signal, an ERROR > + * is reported if any slots are dropped, invalidated, or inactive. This > + * measure is taken to prevent the walsender from waiting indefinitely. > + */ > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) > + return true; > > I felt it was confusing things for this function to also call to the > other one -- it seems an overlapping/muddling of the purpose of these. > I think it will be easier to understand if the calling code just calls > to one or both of these functions as required. > I felt otherwise because the caller has to call these functions at more than one place which makes the caller's code difficult to follow. It is better to encapsulate the computation of wait_event. -- With Regards, Amit Kapila.
On Sunday, March 3, 2024 7:47 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Sat, Mar 2, 2024 at 2:51 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > On Friday, March 1, 2024 12:23 PM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > ... > > > ====== > > > src/backend/replication/slot.c > > > > > > 2. validate_standby_slots > > > > > > + else if (!ReplicationSlotCtl) > > > + { > > > + /* > > > + * We cannot validate the replication slot if the replication slots' > > > + * data has not been initialized. This is ok as we will validate > > > + the > > > + * specified slot when waiting for them to catch up. See > > > + * StandbySlotsHaveCaughtup for details. > > > + */ > > > + } > > > + else > > > + { > > > + /* > > > + * If the replication slots' data have been initialized, verify if > > > + the > > > + * specified slots exist and are logical slots. > > > + */ > > > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > > > > > > IMO that 2nd comment does not need to say "If the replication slots' > > > data have been initialized," because that is implicit from the if/else. > > > > Removed. > > I only meant to suggest removing the 1st part, not the entire comment. > I thought it is still useful to have a comment like: > > /* Check that the specified slots exist and are logical slots.*/ OK, I misunderstood. Fixed. > > ====== > > And, here are some review comments for v103-0001. Thanks for the comments. > > ====== > Commit message > > 1. > Additionally, The SQL functions pg_logical_slot_get_changes, > pg_logical_slot_peek_changes and pg_replication_slot_advance are modified > to wait for the replication slots mentioned in 'standby_slot_names' to catch up > before returning. > > ~ > > (use the same wording as previous in this message) > > /mentioned in/specified in/ Changed. > > ====== > doc/src/sgml/config.sgml > > 2. > + Additionally, when using the replication management functions > + <link linkend="pg-replication-slot-advance"> > + <function>pg_replication_slot_advance</function></link>, > + <link linkend="pg-logical-slot-get-changes"> > + <function>pg_logical_slot_get_changes</function></link>, and > + <link linkend="pg-logical-slot-peek-changes"> > + <function>pg_logical_slot_peek_changes</function></link>, > + with failover enabled logical slots, the functions will wait for the > + physical slots specified in > <varname>standby_slot_names</varname> to > + confirm WAL receipt before proceeding. > + </para> > > It says "the ... functions" twice so maybe reword it slightly. > > BEFORE > Additionally, when using the replication management functions > pg_replication_slot_advance, pg_logical_slot_get_changes, and > pg_logical_slot_peek_changes, with failover enabled logical slots, the functions > will wait for the physical slots specified in standby_slot_names to confirm WAL > receipt before proceeding. > > SUGGESTION > Additionally, the replication management functions > pg_replication_slot_advance, pg_logical_slot_get_changes, and > pg_logical_slot_peek_changes, when used with failover enabled logical slots, > will wait for the physical slots specified in standby_slot_names to confirm WAL > receipt before proceeding. > > (Actually the "will wait ... 
before proceeding" is also a bit tricky, so below is > another possible rewording) > > SUGGESTION #2 > Additionally, the replication management functions > pg_replication_slot_advance, pg_logical_slot_get_changes, and > pg_logical_slot_peek_changes, when used with failover enabled logical slots, > will block until all physical slots specified in standby_slot_names have > confirmed WAL receipt. I prefer the #2 version. > > ~~~ > > 3. > + <note> > + <para> > + Value <literal>*</literal> is not accepted as it is inappropriate to > + block logical replication for physical slots that either lack > + associated standbys or have standbys associated that are not > enabled > + for replication slot synchronization. (see > + <xref > linkend="logicaldecoding-replication-slots-synchronization"/>). > + </para> > + </note> > > Why does the document need to provide an excuse/reason for the rule? > You could just say something like: > > SUGGESTION > The slots must be named explicitly. For example, specifying wildcard values like > <literal>*</literal> is not permitted. As suggested by Amit, I moved this to code comments. > > ====== > doc/src/sgml/func.sgml > > 4. > @@ -28150,7 +28150,7 @@ postgres=# SELECT '0/0'::pg_lsn + > pd.segment_number * ps.setting::int + :offset > </row> > > <row> > - <entry role="func_table_entry"><para role="func_signature"> > + <entry id="pg-logical-slot-get-changes" > role="func_table_entry"><para role="func_signature"> > <indexterm> > <primary>pg_logical_slot_get_changes</primary> > </indexterm> > @@ -28177,7 +28177,7 @@ postgres=# SELECT '0/0'::pg_lsn + > pd.segment_number * ps.setting::int + :offset > </row> > > <row> > - <entry role="func_table_entry"><para role="func_signature"> > + <entry id="pg-logical-slot-peek-changes" > role="func_table_entry"><para role="func_signature"> > <indexterm> > <primary>pg_logical_slot_peek_changes</primary> > </indexterm> > @@ -28232,7 +28232,7 @@ postgres=# SELECT '0/0'::pg_lsn + > pd.segment_number * ps.setting::int + :offset > </row> > > <row> > - <entry role="func_table_entry"><para role="func_signature"> > + <entry id="pg-replication-slot-advance" > role="func_table_entry"><para role="func_signature"> > <indexterm> > <primary>pg_replication_slot_advance</primary> > </indexterm> > > Should these 3 functions say something about how their behaviour is affected > by 'standby_slot_names' and give a link back to the GUC 'standby_slot_names' > docs? I added the info for pg_logical_slot_get_changes() and pg_replication_slot_advance(). The pg_logical_slot_peek_changes() function clarifies that it behaves just like pg_logical_slot_get_changes(), so I didn’t touch it. > > ====== > src/backend/replication/slot.c > > 5. StandbySlotsHaveCaughtup > > + if (!slot) > + { > + /* > + * If the provided slot does not exist, report a message and exit > + * the loop. It is possible for a user to specify a slot name in > + * standby_slot_names that does not exist just before the server > + * startup. The GUC check_hook(validate_standby_slots) cannot > + * validate such a slot during startup as the ReplicationSlotCtl > + * shared memory is not initialized at that time. It is also > + * possible for a user to drop the slot in standby_slot_names > + * afterwards. > + */ > > 5a. > Minor rewording to make this code comment more similar to the next one: > > SUGGESTION > If a slot name provided in standby_slot_names does not exist, report a message > and exit the loop. A user can specify a slot name that does not exist just before > the server startup... Changed. 
> > ~ > > 5b. > + /* > + * If a logical slot name is provided in standby_slot_names, > + * report a message and exit the loop. Similar to the non-existent > + * case, it is possible for a user to specify a logical slot name > + * in standby_slot_names before the server startup, or drop an > + * existing physical slot and recreate a logical slot with the > + * same name. > + */ > > /it is possible for a user to specify/a user can specify/ Changed. > > ~~~ > > 6. WaitForStandbyConfirmation > > + /* > + * We wait for the slots in the standby_slot_names to catch up, but we > + * use a timeout (1s) so we can also check if the standby_slot_names > + * has been changed. > + */ > > Remove some of the "we". > > SUGGESTION > Wait for the slots in the standby_slot_names to catch up, but use a timeout (1s) > so we can also check if the standby_slot_names has been changed. Changed. > > ====== > src/backend/replication/walsender.c > > 7. NeedToWaitForStandby > > +/* > + * Returns true if not all standbys have caught up to the flushed > +position > + * (flushed_lsn) when failover_slot is true; otherwise, returns false. > + * > + * If returning true, the function sets the appropriate wait event in > + * wait_event; otherwise, wait_event is set to 0. > + */ > > The function comment refers to 'failover_slot' but IMO needs to be worded > differently because 'failover_slot' is not a parameter anymore. Changed. > > ~~~ > > 8. NeedToWaitForWal > > +/* > + * Returns true if we need to wait for WALs to be flushed to disk, or > +if not > + * all standbys have caught up to the flushed position (flushed_lsn) > +when > + * failover_slot is true; otherwise, returns false. > + * > + * If returning true, the function sets the appropriate wait event in > + * wait_event; otherwise, wait_event is set to 0. > + */ > +static bool > +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, > + uint32 *wait_event) > > Same as above -- Now that 'failover_slot' is not a function parameter, I thought > this should be reworded. Changed. > > ~~~ > > 9. NeedToWaitForWal > > + /* > + * Check if the standby slots have caught up to the flushed position. > + It > + * is good to wait up to flushed position and then let it send the > + changes > + * to logical subscribers one by one which are already covered in > + flushed > + * position without needing to wait on every change for standby > + * confirmation. Note that after receiving the shutdown signal, an > + ERROR > + * is reported if any slots are dropped, invalidated, or inactive. This > + * measure is taken to prevent the walsender from waiting indefinitely. > + */ > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) return > + true; > > I felt it was confusing things for this function to also call to the other one -- it > seems an overlapping/muddling of the purpose of these. > I think it will be easier to understand if the calling code just calls to one or both > of these functions as required. Same as Amit, I didn't change this. > > ~~~ > > 10. > > - WalSndWait(wakeEvents, sleeptime, > WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL); > + WalSndWait(wakeEvents, sleeptime, wait_event); > > Tracing the assignments of the 'wait_event' is a bit tricky... IIUC it should not be > 0 when we got to this point, so maybe it is better to put Assert(wait_event) > before this call? Added. Best Regards, Hou zj
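To make the behaviour being documented above concrete, here is a minimal usage sketch. The standby_slot_names GUC and the failover slot option are taken from the patch under discussion; the slot names, the use of the contrib test_decoding plugin, and the ALTER SYSTEM/reload steps are illustrative assumptions rather than part of the patch.

-- Primary: list the physical slot(s) whose standbys must confirm WAL
-- receipt before logical changes are handed out (slot name is made up).
ALTER SYSTEM SET standby_slot_names = 'sb1_slot';
SELECT pg_reload_conf();

-- Physical slot used by the standby, plus a failover-enabled logical slot
-- (test_decoding is used only so the SQL functions can print changes).
SELECT pg_create_physical_replication_slot('sb1_slot');
SELECT pg_create_logical_replication_slot('failover_slot', 'test_decoding',
                                           false,  -- temporary
                                           false,  -- two-phase
                                           true);  -- failover

-- With standby_slot_names configured, these calls do not return until the
-- standby using 'sb1_slot' has confirmed receipt of WAL up to the position
-- being returned or advanced to.
SELECT * FROM pg_logical_slot_get_changes('failover_slot', NULL, NULL);
SELECT pg_replication_slot_advance('failover_slot', pg_current_wal_lsn());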
On Saturday, March 2, 2024 6:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Mar 2, 2024 at 9:21 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > Apart from the comments, the code in WalSndWaitForWal was refactored a > > bit to make it neater. Thanks Shveta for helping writing the code and doc. > > > > A few more comments: Thanks for the comments. > ================== > 1. > +# Wait until the primary server logs a warning indicating that it is > +waiting # for the sb1_slot to catch up. > +$primary->wait_for_log( > + qr/replication slot \"sb1_slot\" specified in parameter > standby_slot_names does not have active_pid/, > + $offset); > > Shouldn't we wait for such a LOG even in the first test as well which involves two > standbys and two logical subscribers? Yes, we should. Added. > > 2. > +################################################## > +# Test that logical replication will wait for the user-created inactive > +# physical slot to catch up until we remove the slot from standby_slot_names. > +################################################## > > > I don't see anything different tested in this test from what we already tested in > the first test involving two standbys and two logical subscribers. Can you > please clarify if I am missing something? I think the intention was to test that the wait loop is ended due to GUC config reload, while the first test is for the case when the loop is ended due to restart_lsn movement. But it seems we tested the config reload with xx_get_changes() as well, so I can remove it if you agree. > > 3. > Note that after receiving the shutdown signal, an ERROR > + * is reported if any slots are dropped, invalidated, or inactive. This > + * measure is taken to prevent the walsender from waiting indefinitely. > + */ > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) > > Isn't this part of the comment should be moved inside > NeedToWaitForStandby()? Moved. > > 4. > + /* > + * Update our idea of the currently flushed position only if we are > + * not waiting for standbys to catch up, otherwise the standby would > + * have to catch up to a newer WAL location in each cycle. > + */ > + if (wait_event != WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION) > + { > > This functionality (in function WalSndWaitForWal()) seems to ensure that we > first wait for the required WAL to be flushed and then wait for standbys. If true, > we should cover that point in the comments here or somewhere in the function > WalSndWaitForWal(). > > Apart from this, I have made a few modifications in the comments. Thanks. I have reviewed and merged them. Here is the V104 patch which addressed above and Peter's comments. Best Regards, Hou zj
Attachment
On Sun, Mar 3, 2024 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Mar 3, 2024 at 5:17 AM Peter Smith <smithpb2250@gmail.com> wrote: > > ... > > 9. NeedToWaitForWal > > > > + /* > > + * Check if the standby slots have caught up to the flushed position. It > > + * is good to wait up to flushed position and then let it send the changes > > + * to logical subscribers one by one which are already covered in flushed > > + * position without needing to wait on every change for standby > > + * confirmation. Note that after receiving the shutdown signal, an ERROR > > + * is reported if any slots are dropped, invalidated, or inactive. This > > + * measure is taken to prevent the walsender from waiting indefinitely. > > + */ > > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) > > + return true; > > > > I felt it was confusing things for this function to also call to the > > other one -- it seems an overlapping/muddling of the purpose of these. > > I think it will be easier to understand if the calling code just calls > > to one or both of these functions as required. > > > > I felt otherwise because the caller has to call these functions at > more than one place which makes the caller's code difficult to follow. > It is better to encapsulate the computation of wait_event. > You may have misinterpreted my review comment because I didn't say anything about changing the encapsulation of the computation of the wait_event. I only wrote it is better IMO for the functions to stick to just one job each according to their function name. E.g.: - NeedToWaitForStandby should *only* check if we need to wait for standbys - NeedToWaitForWal should *only* check if we need to wait for WAL flush; i.e. this shouldn't be also calling NeedToWaitForStandby(). Also, AFAICT the caller changes should not be difficult. Indeed, these changes will make the code aligned properly with what the comments are saying: BEFORE /* * Fast path to avoid acquiring the spinlock in case we already know we * have enough WAL available and all the standby servers have confirmed * receipt of WAL up to RecentFlushPtr. This is particularly interesting * if we're far behind. */ if (!XLogRecPtrIsInvalid(RecentFlushPtr) && !NeedToWaitForWal(loc, RecentFlushPtr, &wait_event)) return RecentFlushPtr; SUGGESTED ... if (!XLogRecPtrIsInvalid(RecentFlushPtr) && !NeedToWaitForWal(loc, RecentFlushPtr, &wait_event) && !NeedToWaitForStandby(loc, RecentFlushPtr, &wait_event) return RecentFlushPtr; ~~~ BEFORE /* * Exit the loop if already caught up and doesn't need to wait for * standby slots. */ if (!wait_for_standby_at_stop && !NeedToWaitForWal(loc, RecentFlushPtr, &wait_event)) break; SUGGESTED ... if (!wait_for_standby_at_stop && !NeedToWaitForWal(loc, RecentFlushPtr, &wait_event) && !NeedToWaitForStandby(loc, RecentFlushPtr, &wait_event)) break; ---------- Kind Regards, Peter Smith. Fujitsu Australia
On Sun, Mar 3, 2024 at 6:51 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Sunday, March 3, 2024 7:47 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > 3. > > + <note> > > + <para> > > + Value <literal>*</literal> is not accepted as it is inappropriate to > > + block logical replication for physical slots that either lack > > + associated standbys or have standbys associated that are not > > enabled > > + for replication slot synchronization. (see > > + <xref > > linkend="logicaldecoding-replication-slots-synchronization"/>). > > + </para> > > + </note> > > > > Why does the document need to provide an excuse/reason for the rule? > > You could just say something like: > > > > SUGGESTION > > The slots must be named explicitly. For example, specifying wildcard values like > > <literal>*</literal> is not permitted. > > As suggested by Amit, I moved this to code comments. Was the total removal of this note deliberate? I only suggested removing the *reason* for the rule, not the entire rule. Otherwise, the user won't know to avoid doing this until they try it and get an error. > > > > 9. NeedToWaitForWal > > > > + /* > > + * Check if the standby slots have caught up to the flushed position. > > + It > > + * is good to wait up to flushed position and then let it send the > > + changes > > + * to logical subscribers one by one which are already covered in > > + flushed > > + * position without needing to wait on every change for standby > > + * confirmation. Note that after receiving the shutdown signal, an > > + ERROR > > + * is reported if any slots are dropped, invalidated, or inactive. This > > + * measure is taken to prevent the walsender from waiting indefinitely. > > + */ > > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) return > > + true; > > > > I felt it was confusing things for this function to also call to the other one -- it > > seems an overlapping/muddling of the purpose of these. > > I think it will be easier to understand if the calling code just calls to one or both > > of these functions as required. > > Same as Amit, I didn't change this. AFAICT my previous review comment was misinterpreted. Please see [1] for more details. ~~~~ Here are some more review comments for v104-00001 ====== Commit message 1. Additionally, The SQL functions pg_logical_slot_get_changes, pg_logical_slot_peek_changes and pg_replication_slot_advance are modified to wait for the replication slots specified in 'standby_slot_names' to catch up before returning. ~ Maybe that should be expressed using similar wording as the docs... SUGGESTION Additionally, The SQL functions ... are modified. Now, when used with failover-enabled logical slots, these functions will block until all physical slots specified in 'standby_slot_names' have confirmed WAL receipt. ====== doc/src/sgml/config.sgml 2. + <function>pg_logical_slot_peek_changes</function></link>, + when used with failover enabled logical slots, will block until all + physical slots specified in <varname>standby_slot_names</varname> have + confirmed WAL receipt. /failover enabled logical slots/failover-enabled logical slots/ ====== doc/src/sgml/func.sgml 3. + The function may be blocked if the specified slot is a failover enabled + slot and <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link> + is configured. </para></entry> /a failover enabled slot//a failover-enabled slot/ ~~~ 4. + slot may return to an earlier position. 
The function may be blocked if + the specified slot is a failover enabled slot and + <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link> + is configured. /a failover enabled slot//a failover-enabled slot/ ====== src/backend/replication/slot.c 5. +/* + * Wait for physical standbys to confirm receiving the given lsn. + * + * Used by logical decoding SQL functions that acquired failover enabled slot. + * It waits for physical standbys corresponding to the physical slots specified + * in the standby_slot_names GUC. + */ +void +WaitForStandbyConfirmation(XLogRecPtr wait_for_lsn) /failover enabled slot/failover-enabled slot/ ~~~ 6. + /* + * Don't need to wait for the standby to catch up if the current acquired + * slot is not a failover enabled slot, or there is no value in + * standby_slot_names. + */ /failover enabled slot/failover-enabled slot/ ====== src/backend/replication/slotfuncs.c 7. + + /* + * Wake up logical walsenders holding failover enabled slots after + * updating the restart_lsn of the physical slot. + */ + PhysicalWakeupLogicalWalSnd(); /failover enabled slots/failover-enabled slots/ ====== src/backend/replication/walsender.c 8. +/* + * Wake up the logical walsender processes with failover enabled slots if the + * currently acquired physical slot is specified in standby_slot_names GUC. + */ +void +PhysicalWakeupLogicalWalSnd(void) /failover enabled slots/failover-enabled slots/ 9. +/* + * Returns true if not all standbys have caught up to the flushed position + * (flushed_lsn) when the current acquired slot is a failover enabled logical + * slot and we are streaming; otherwise, returns false. + * + * If returning true, the function sets the appropriate wait event in + * wait_event; otherwise, wait_event is set to 0. + */ +static bool +NeedToWaitForStandby(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, + uint32 *wait_event) 9a. /failover enabled logical slot/failover-enabled logical slot/ ~ 9b. Probably that function name should be plural. /NeedToWaitForStandby/NeedToWaitForStandbys/ ~~~ 10. +/* + * Returns true if we need to wait for WALs to be flushed to disk, or if not + * all standbys have caught up to the flushed position (flushed_lsn) when the + * current acquired slot is a failover enabled logical slot and we are + * streaming; otherwise, returns false. + * + * If returning true, the function sets the appropriate wait event in + * wait_event; otherwise, wait_event is set to 0. + */ +static bool +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, + uint32 *wait_event) /failover enabled logical slot/failover-enabled logical slot/ ~~~ 11. WalSndWaitForWal + /* + * Within the loop, we wait for the necessary WALs to be flushed to + * disk first, followed by waiting for standbys to catch up if there + * are enought WALs or upon receiving the shutdown signal. To avoid + * the scenario where standbys need to catch up to a newer WAL + * location in each iteration, we update our idea of the currently + * flushed position only if we are not waiting for standbys to catch + * up. + */ typo /enought/enough/ ---------- [1] https://www.postgresql.org/message-id/CAHut%2BPteoyDki-XdygDgoaZJLmasutzRquQepYx0raNs0RSMvg%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Austalia
On Mon, Mar 4, 2024 at 6:57 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Sun, Mar 3, 2024 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sun, Mar 3, 2024 at 5:17 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > ... > > > 9. NeedToWaitForWal > > > > > > + /* > > > + * Check if the standby slots have caught up to the flushed position. It > > > + * is good to wait up to flushed position and then let it send the changes > > > + * to logical subscribers one by one which are already covered in flushed > > > + * position without needing to wait on every change for standby > > > + * confirmation. Note that after receiving the shutdown signal, an ERROR > > > + * is reported if any slots are dropped, invalidated, or inactive. This > > > + * measure is taken to prevent the walsender from waiting indefinitely. > > > + */ > > > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) > > > + return true; > > > > > > I felt it was confusing things for this function to also call to the > > > other one -- it seems an overlapping/muddling of the purpose of these. > > > I think it will be easier to understand if the calling code just calls > > > to one or both of these functions as required. > > > > > > > I felt otherwise because the caller has to call these functions at > > more than one place which makes the caller's code difficult to follow. > > It is better to encapsulate the computation of wait_event. > > > > You may have misinterpreted my review comment because I didn't say > anything about changing the encapsulation of the computation of the > wait_event. > No, I have understood it in the same way as you have outlined in this email and liked the way the current patch has it. -- With Regards, Amit Kapila.
On Mon, Mar 4, 2024 at 7:25 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > ====== > doc/src/sgml/config.sgml > > 2. > + <function>pg_logical_slot_peek_changes</function></link>, > + when used with failover enabled logical slots, will block until all > + physical slots specified in <varname>standby_slot_names</varname> have > + confirmed WAL receipt. > > /failover enabled logical slots/failover-enabled logical slots/ > How about just saying logical failover slots at this and other places? -- With Regards, Amit Kapila.
On Mon, Mar 4, 2024 at 2:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Mar 4, 2024 at 7:25 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > ====== > > doc/src/sgml/config.sgml > > > > 2. > > + <function>pg_logical_slot_peek_changes</function></link>, > > + when used with failover enabled logical slots, will block until all > > + physical slots specified in <varname>standby_slot_names</varname> have > > + confirmed WAL receipt. > > > > /failover enabled logical slots/failover-enabled logical slots/ > > > > How about just saying logical failover slots at this and other places? > Yes, that wording works too. ---------- Kind Regards, Peter Smith. Fujitsu Australia.
On Mon, Mar 4, 2024 at 2:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Mar 4, 2024 at 6:57 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Sun, Mar 3, 2024 at 2:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Sun, Mar 3, 2024 at 5:17 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > ... > > > > 9. NeedToWaitForWal > > > > > > > > + /* > > > > + * Check if the standby slots have caught up to the flushed position. It > > > > + * is good to wait up to flushed position and then let it send the changes > > > > + * to logical subscribers one by one which are already covered in flushed > > > > + * position without needing to wait on every change for standby > > > > + * confirmation. Note that after receiving the shutdown signal, an ERROR > > > > + * is reported if any slots are dropped, invalidated, or inactive. This > > > > + * measure is taken to prevent the walsender from waiting indefinitely. > > > > + */ > > > > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) > > > > + return true; > > > > > > > > I felt it was confusing things for this function to also call to the > > > > other one -- it seems an overlapping/muddling of the purpose of these. > > > > I think it will be easier to understand if the calling code just calls > > > > to one or both of these functions as required. > > > > > > > > > > I felt otherwise because the caller has to call these functions at > > > more than one place which makes the caller's code difficult to follow. > > > It is better to encapsulate the computation of wait_event. > > > > > > > You may have misinterpreted my review comment because I didn't say > > anything about changing the encapsulation of the computation of the > > wait_event. > > > > No, I have understood it in the same way as you have outlined in this > email and liked the way the current patch has it. > OK, if the code will remain as-is wouldn't it be better to anyway change the function name to indicate what it really does? e.g. NeedToWaitForWal --> NeedToWaitForWalFlushOrStandbys ---------- Kind Regards, Peter Smith. Fujitsu Australia
Hi, On Thu, Feb 29, 2024 at 03:38:59PM +0530, Amit Kapila wrote: > On Thu, Feb 29, 2024 at 9:13 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Tue, Feb 27, 2024 at 11:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > Also, adding wait sounds > > > more like a boolean. So, I don't see the proposed names any better > > > than the current one. > > > > > > > Anyway, the point is that the current GUC name 'standby_slot_names' is > > not ideal IMO because it doesn't have enough meaning by itself -- e.g. > > you have to read the accompanying comment or documentation to have any > > idea of its purpose. > > > > Yeah, one has to read the description but that is true for other > parameters like "temp_tablespaces". I don't have any better ideas but > open to suggestions. What about "non_lagging_standby_slots"? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Mar 4, 2024 at 9:35 AM Peter Smith <smithpb2250@gmail.com> wrote: > > OK, if the code will remain as-is wouldn't it be better to anyway > change the function name to indicate what it really does? > > e.g. NeedToWaitForWal --> NeedToWaitForWalFlushOrStandbys > This seems too long. I would prefer the current name NeedToWaitForWal as waiting for WAL means waiting to flush the WAL and waiting to replicate it to standby. On similar lines, the variable name standby_slot_oldest_flush_lsn looks too long. How about ss_oldest_flush_lsn (where ss indicates standby_slots)? Apart from this, I have made minor modifications in the attached. -- With Regards, Amit Kapila.
Attachment
Hi, On Sun, Mar 03, 2024 at 07:56:32AM +0000, Zhijie Hou (Fujitsu) wrote: > Here is the V104 patch which addressed above and Peter's comments. Thanks! A few more random comments: 1 === + The function may be blocked if the specified slot is a failover enabled s/blocked/waiting/ ? 2 === + * specified slot when waiting for them to catch up. See + * StandbySlotsHaveCaughtup for details. s/StandbySlotsHaveCaughtup/StandbySlotsHaveCaughtup()/ ? 3 === + /* Now verify if the specified slots really exist and have correct type */ remove "really"? 4 === + /* + * Don't need to wait for the standbys to catch up if there is no value in + * standby_slot_names. + */ + if (standby_slot_names_list == NIL) + return true; + + /* + * Don't need to wait for the standbys to catch up if we are on a standby + * server, since we do not support syncing slots to cascading standbys. + */ + if (RecoveryInProgress()) + return true; + + /* + * Don't need to wait for the standbys to catch up if they are already + * beyond the specified WAL location. + */ + if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) && + standby_slot_oldest_flush_lsn >= wait_for_lsn) + return true; What about using OR conditions instead? 5 === +static bool +NeedToWaitForStandby(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, + uint32 *wait_event) Not a big deal but does it need to return a bool? (I mean it all depends of the *wait_event value). Is it for better code readability in the caller? 6 === +static bool +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, + uint32 *wait_event) Same questions as for NeedToWaitForStandby(). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Mar 4, 2024 at 4:52 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Sun, Mar 03, 2024 at 07:56:32AM +0000, Zhijie Hou (Fujitsu) wrote: > > Here is the V104 patch which addressed above and Peter's comments. > > Thanks! > > > 4 === > > + /* > + * Don't need to wait for the standbys to catch up if there is no value in > + * standby_slot_names. > + */ > + if (standby_slot_names_list == NIL) > + return true; > + > + /* > + * Don't need to wait for the standbys to catch up if we are on a standby > + * server, since we do not support syncing slots to cascading standbys. > + */ > + if (RecoveryInProgress()) > + return true; > + > + /* > + * Don't need to wait for the standbys to catch up if they are already > + * beyond the specified WAL location. > + */ > + if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) && > + standby_slot_oldest_flush_lsn >= wait_for_lsn) > + return true; > > What about using OR conditions instead? > I think we can use but it seems code is easier to follow this way but this is just a matter of personal choice. > 5 === > > +static bool > +NeedToWaitForStandby(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, > + uint32 *wait_event) > > Not a big deal but does it need to return a bool? (I mean it all depends of > the *wait_event value). Is it for better code readability in the caller? > Yes, I think so. Adding checks based on wait_events sounds a bit awkward. -- With Regards, Amit Kapila.
On Monday, March 4, 2024 7:22 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Sun, Mar 03, 2024 at 07:56:32AM +0000, Zhijie Hou (Fujitsu) wrote: > > Here is the V104 patch which addressed above and Peter's comments. > > Thanks! > > A few more random comments: Thanks for the comments! > > 1 === > > + The function may be blocked if the specified slot is a failover > + enabled > > s/blocked/waiting/ ? Changed. > > 2 === > > + * specified slot when waiting for them to catch up. See > + * StandbySlotsHaveCaughtup for details. > > s/StandbySlotsHaveCaughtup/StandbySlotsHaveCaughtup()/ ? Changed. > > 3 === > > + /* Now verify if the specified slots really exist and have > + correct type */ > > remove "really"? Changed. > > 4 === > > + /* > + * Don't need to wait for the standbys to catch up if there is no value in > + * standby_slot_names. > + */ > + if (standby_slot_names_list == NIL) > + return true; > + > + /* > + * Don't need to wait for the standbys to catch up if we are on a > standby > + * server, since we do not support syncing slots to cascading standbys. > + */ > + if (RecoveryInProgress()) > + return true; > + > + /* > + * Don't need to wait for the standbys to catch up if they are already > + * beyond the specified WAL location. > + */ > + if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) && > + standby_slot_oldest_flush_lsn >= wait_for_lsn) > + return true; > > What about using OR conditions instead? > > 5 === > > +static bool > +NeedToWaitForStandby(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, > + uint32 *wait_event) > > Not a big deal but does it need to return a bool? (I mean it all depends of the > *wait_event value). Is it for better code readability in the caller? > > 6 === > > +static bool > +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, > + uint32 *wait_event) > > Same questions as for NeedToWaitForStandby(). I also feel the current style looks a bit cleaner, so didn’t change these. Best Regards, Hou zj
On Monday, March 4, 2024 9:55 AM Peter Smith <smithpb2250@gmail.com> wrote: > > On Sun, Mar 3, 2024 at 6:51 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > On Sunday, March 3, 2024 7:47 AM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > > > > 3. > > > + <note> > > > + <para> > > > + Value <literal>*</literal> is not accepted as it is inappropriate to > > > + block logical replication for physical slots that either lack > > > + associated standbys or have standbys associated that are > > > + not > > > enabled > > > + for replication slot synchronization. (see > > > + <xref > > > linkend="logicaldecoding-replication-slots-synchronization"/>). > > > + </para> > > > + </note> > > > > > > Why does the document need to provide an excuse/reason for the rule? > > > You could just say something like: > > > > > > SUGGESTION > > > The slots must be named explicitly. For example, specifying wildcard > > > values like <literal>*</literal> is not permitted. > > > > As suggested by Amit, I moved this to code comments. > > Was the total removal of this note deliberate? I only suggested removing the > *reason* for the rule, not the entire rule. Otherwise, the user won't know to > avoid doing this until they try it and get an error. OK, Added. > > > > > > > > 9. NeedToWaitForWal > > > > > > + /* > > > + * Check if the standby slots have caught up to the flushed position. > > > + It > > > + * is good to wait up to flushed position and then let it send the > > > + changes > > > + * to logical subscribers one by one which are already covered in > > > + flushed > > > + * position without needing to wait on every change for standby > > > + * confirmation. Note that after receiving the shutdown signal, an > > > + ERROR > > > + * is reported if any slots are dropped, invalidated, or inactive. > > > + This > > > + * measure is taken to prevent the walsender from waiting indefinitely. > > > + */ > > > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) > > > + return true; > > > > > > I felt it was confusing things for this function to also call to the > > > other one -- it seems an overlapping/muddling of the purpose of these. > > > I think it will be easier to understand if the calling code just > > > calls to one or both of these functions as required. > > > > Same as Amit, I didn't change this. > > AFAICT my previous review comment was misinterpreted. Please see [1] for > more details. > > ~~~~ > > Here are some more review comments for v104-00001 Thanks! > > ====== > Commit message > > 1. > Additionally, The SQL functions pg_logical_slot_get_changes, > pg_logical_slot_peek_changes and pg_replication_slot_advance are modified > to wait for the replication slots specified in 'standby_slot_names' to catch up > before returning. > > ~ > > Maybe that should be expressed using similar wording as the docs... > > SUGGESTION > Additionally, The SQL functions ... are modified. Now, when used with > failover-enabled logical slots, these functions will block until all physical slots > specified in 'standby_slot_names' have confirmed WAL receipt. Changed. > > ====== > doc/src/sgml/config.sgml > > 2. > + <function>pg_logical_slot_peek_changes</function></link>, > + when used with failover enabled logical slots, will block until all > + physical slots specified in > <varname>standby_slot_names</varname> have > + confirmed WAL receipt. > > /failover enabled logical slots/failover-enabled logical slots/ Changed. 
Note that for this comment and remaining comments, I used the later version we agreed(logical failover slot). > > ====== > doc/src/sgml/func.sgml > > 3. > + The function may be blocked if the specified slot is a failover enabled > + slot and <link > linkend="guc-standby-slot-names"><varname>standby_slot_names</varna > me></link> > + is configured. > </para></entry> > > /a failover enabled slot//a failover-enabled slot/ Changed. > > ~~~ > > 4. > + slot may return to an earlier position. The function may be blocked if > + the specified slot is a failover enabled slot and > + <link > linkend="guc-standby-slot-names"><varname>standby_slot_names</varna > me></link> > + is configured. > > /a failover enabled slot//a failover-enabled slot/ Changed. > > ====== > src/backend/replication/slot.c > > 5. > +/* > + * Wait for physical standbys to confirm receiving the given lsn. > + * > + * Used by logical decoding SQL functions that acquired failover enabled slot. > + * It waits for physical standbys corresponding to the physical slots > +specified > + * in the standby_slot_names GUC. > + */ > +void > +WaitForStandbyConfirmation(XLogRecPtr wait_for_lsn) > > /failover enabled slot/failover-enabled slot/ Changed. > > ~~~ > > 6. > + /* > + * Don't need to wait for the standby to catch up if the current > + acquired > + * slot is not a failover enabled slot, or there is no value in > + * standby_slot_names. > + */ > > /failover enabled slot/failover-enabled slot/ Changed. > > ====== > src/backend/replication/slotfuncs.c > > 7. > + > + /* > + * Wake up logical walsenders holding failover enabled slots after > + * updating the restart_lsn of the physical slot. > + */ > + PhysicalWakeupLogicalWalSnd(); > > /failover enabled slots/failover-enabled slots/ Changed. > > ====== > src/backend/replication/walsender.c > > 8. > +/* > + * Wake up the logical walsender processes with failover enabled slots > +if the > + * currently acquired physical slot is specified in standby_slot_names GUC. > + */ > +void > +PhysicalWakeupLogicalWalSnd(void) > > /failover enabled slots/failover-enabled slots/ Changed. > > 9. > +/* > + * Returns true if not all standbys have caught up to the flushed > +position > + * (flushed_lsn) when the current acquired slot is a failover enabled > +logical > + * slot and we are streaming; otherwise, returns false. > + * > + * If returning true, the function sets the appropriate wait event in > + * wait_event; otherwise, wait_event is set to 0. > + */ > +static bool > +NeedToWaitForStandby(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, > + uint32 *wait_event) > > 9a. > /failover enabled logical slot/failover-enabled logical slot/ Changed. > > ~ > > 9b. > Probably that function name should be plural. > > /NeedToWaitForStandby/NeedToWaitForStandbys/ > > ~~~ > > 10. > +/* > + * Returns true if we need to wait for WALs to be flushed to disk, or > +if not > + * all standbys have caught up to the flushed position (flushed_lsn) > +when the > + * current acquired slot is a failover enabled logical slot and we are > + * streaming; otherwise, returns false. > + * > + * If returning true, the function sets the appropriate wait event in > + * wait_event; otherwise, wait_event is set to 0. > + */ > +static bool > +NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn, > + uint32 *wait_event) > > /failover enabled logical slot/failover-enabled logical slot/ Changed. > > ~~~ > > 11. 
WalSndWaitForWal > > + /* > + * Within the loop, we wait for the necessary WALs to be flushed to > + * disk first, followed by waiting for standbys to catch up if there > + * are enought WALs or upon receiving the shutdown signal. To avoid > + * the scenario where standbys need to catch up to a newer WAL > + * location in each iteration, we update our idea of the currently > + * flushed position only if we are not waiting for standbys to catch > + * up. > + */ > > typo > > /enought/enough/ Fixed. Best Regards, Hou zj
On Monday, March 4, 2024 5:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Mar 4, 2024 at 9:35 AM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > OK, if the code will remain as-is wouldn't it be better to anyway > > change the function name to indicate what it really does? > > > > e.g. NeedToWaitForWal --> NeedToWaitForWalFlushOrStandbys > > > > This seems too long. I would prefer the current name NeedToWaitForWal as > waiting for WAL means waiting to flush the WAL and waiting to replicate it to > standby. On similar lines, the variable name standby_slot_oldest_flush_lsn looks > too long. How about ss_oldest_flush_lsn (where ss indicates standby_slots)? > > Apart from this, I have made minor modifications in the attached. Thanks, I have merged it. Attach the V105 patch set which addressed Peter, Amit and Bertrand's comments. This version also includes the following changes: * We found a string matching issue for query_until() and fixed it. * Removed one unused parameter from NeedToWaitForStandbys. * Disabled the subscription before testing pg_logical_slot_get_changes in 040.pl, to prevent this test from catching the warning from another walsender. * Ran pgindent. Best Regards, Hou zj
Attachment
Hi, On Mon, Mar 04, 2024 at 01:28:04PM +0000, Zhijie Hou (Fujitsu) wrote: > Attach the V105 patch set Thanks! Sorry I missed those during the previous review: 1 === Commit message: "these functions will block until" s/block/wait/ ? 2 === + when used with logical failover slots, will block until all s/block/wait/ ? It seems those are the 2 remaining "block" that could deserve the proposed above change. 3 === + invalidated = slot->data.invalidated != RS_INVAL_NONE; + inactive = slot->active_pid == 0; invalidated = (slot->data.invalidated != RS_INVAL_NONE); inactive = (slot->active_pid == 0); instead? I think it's easier to read and it looks like this is the way it's written in other places (at least the few I checked). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Here are some review comments for v105-0001 ========== doc/src/sgml/config.sgml 1. + <para> + The standbys corresponding to the physical replication slots in + <varname>standby_slot_names</varname> must configure + <literal>sync_replication_slots = true</literal> so they can receive + logical failover slots changes from the primary. + </para> /slots changes/slot changes/ ====== doc/src/sgml/func.sgml 2. + The function may be waiting if the specified slot is a logical failover + slot and <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link> + is configured. I know this has been through multiple versions already, but this latest wording "may be waiting..." doesn't seem very good to me. How about one of these? * The function may not be able to return immediately if the specified slot is a logical failover slot and standby_slot_names is configured. * The function return might be blocked if the specified slot is a logical failover slot and standby_slot_names is configured. * If the specified slot is a logical failover slot then the function will block until all physical slots specified in standby_slot_names have confirmed WAL receipt. * If the specified slot is a logical failover slot then the function will not return until all physical slots specified in standby_slot_names have confirmed WAL receipt. ~~~ 3. + slot may return to an earlier position. The function may be waiting if + the specified slot is a logical failover slot and + <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link> Same as previous review comment #2 ====== src/backend/replication/slot.c 4. WaitForStandbyConfirmation + * Used by logical decoding SQL functions that acquired logical failover slot. IIUC it doesn't work like that. pg_logical_slot_get_changes_guts() calls here unconditionally (i.e. the SQL functions don't even check if they are failover slots before calling this) so the comment seems misleading/redundant. ====== src/backend/replication/walsender.c 5. NeedToWaitForWal + /* + * Check if the standby slots have caught up to the flushed position. It + * is good to wait up to the flushed position and then let the WalSender + * send the changes to logical subscribers one by one which are already + * covered by the flushed position without needing to wait on every change + * for standby confirmation. + */ + if (NeedToWaitForStandbys(flushed_lsn, wait_event)) + return true; + + *wait_event = 0; + return false; +} + 5a. The comment (or part of it?) seems misplaced because it is talking WalSender sending changes, but that is not happening in this function. Also, isn't what this is saying already described by the other comment in the caller? e.g.: + /* + * Within the loop, we wait for the necessary WALs to be flushed to + * disk first, followed by waiting for standbys to catch up if there + * are enough WALs or upon receiving the shutdown signal. To avoid the + * scenario where standbys need to catch up to a newer WAL location in + * each iteration, we update our idea of the currently flushed + * position only if we are not waiting for standbys to catch up. + */ ~ 5b. Most of the code is unnecessary. AFAICT all this is exactly same as just 1 line: return NeedToWaitForStandbys(flushed_lsn, wait_event); ~~~ 6. WalSndWaitForWal + /* + * Within the loop, we wait for the necessary WALs to be flushed to + * disk first, followed by waiting for standbys to catch up if there + * are enough WALs or upon receiving the shutdown signal. 
To avoid the + * scenario where standbys need to catch up to a newer WAL location in + * each iteration, we update our idea of the currently flushed + * position only if we are not waiting for standbys to catch up. + */ Regarding that 1st sentence: maybe this logic used to be done explicitly "within the loop" but IIUC this logic is now hidden inside NeedToWaitForWal() so the comment should mention that. ---------- Kind Regards, Peter Smith. Fujitsu Australia
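As a companion to the sync_replication_slots requirement mentioned in the doc text above, a rough sketch of the standby-side settings this workflow appears to assume is shown below. Apart from being pointed at by the primary's standby_slot_names, these parameters come from the existing slot-synchronization feature, and the connection-string values are made up.

-- On each physical standby whose slot is listed in the primary's
-- standby_slot_names (all values below are illustrative).
ALTER SYSTEM SET primary_conninfo = 'host=primary port=5432 user=repl dbname=postgres';
ALTER SYSTEM SET primary_slot_name = 'sb1_slot';
ALTER SYSTEM SET hot_standby_feedback = on;
ALTER SYSTEM SET sync_replication_slots = on;
SELECT pg_reload_conf();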
On Mon, Mar 4, 2024 at 2:27 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Thu, Feb 29, 2024 at 03:38:59PM +0530, Amit Kapila wrote: > > On Thu, Feb 29, 2024 at 9:13 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > On Tue, Feb 27, 2024 at 11:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > > Also, adding wait sounds > > > > more like a boolean. So, I don't see the proposed names any better > > > > than the current one. > > > > > > > > > > Anyway, the point is that the current GUC name 'standby_slot_names' is > > > not ideal IMO because it doesn't have enough meaning by itself -- e.g. > > > you have to read the accompanying comment or documentation to have any > > > idea of its purpose. > > > > > > > Yeah, one has to read the description but that is true for other > > parameters like "temp_tablespaces". I don't have any better ideas but > > open to suggestions. > > What about "non_lagging_standby_slots"? > I still prefer the current one as that at least resembles with existing synchronous_standby_names. I think we can change the GUC name if we get an agreement on a better name before release. At this stage, let's move with the current one. -- With Regards, Amit Kapila.
On Tue, Mar 5, 2024 at 6:10 AM Peter Smith <smithpb2250@gmail.com> wrote: > > ====== > src/backend/replication/walsender.c > > 5. NeedToWaitForWal > > + /* > + * Check if the standby slots have caught up to the flushed position. It > + * is good to wait up to the flushed position and then let the WalSender > + * send the changes to logical subscribers one by one which are already > + * covered by the flushed position without needing to wait on every change > + * for standby confirmation. > + */ > + if (NeedToWaitForStandbys(flushed_lsn, wait_event)) > + return true; > + > + *wait_event = 0; > + return false; > +} > + > > 5a. > The comment (or part of it?) seems misplaced because it is talking > WalSender sending changes, but that is not happening in this function. > I don't think so. This is invoked only by walsender and a static function. I don't see any other better place to mention this. > Also, isn't what this is saying already described by the other comment > in the caller? e.g.: > Oh no, here we are explaining the wait order. -- With Regards, Amit Kapila.
On Tue, Mar 5, 2024 at 9:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Mar 5, 2024 at 6:10 AM Peter Smith <smithpb2250@gmail.com> wrote: > > > > ====== > > src/backend/replication/walsender.c > > > > 5. NeedToWaitForWal > > > > + /* > > + * Check if the standby slots have caught up to the flushed position. It > > + * is good to wait up to the flushed position and then let the WalSender > > + * send the changes to logical subscribers one by one which are already > > + * covered by the flushed position without needing to wait on every change > > + * for standby confirmation. > > + */ > > + if (NeedToWaitForStandbys(flushed_lsn, wait_event)) > > + return true; > > + > > + *wait_event = 0; > > + return false; > > +} > > + > > > > 5a. > > The comment (or part of it?) seems misplaced because it is talking > > WalSender sending changes, but that is not happening in this function. > > > > I don't think so. This is invoked only by walsender and a static > function. I don't see any other better place to mention this. > > > Also, isn't what this is saying already described by the other comment > > in the caller? e.g.: > > > > Oh no, here we are explaining the wait order. I think there is a scope of improvement here. The comment inside NeedToWaitForWal() which states that we need to wait here for standbys on flush-position(and not on each change) should be outside of this function. It is too embedded. And the comment which states the order of wait (first flush and then standbys confirmation) should be outside the for-loop in WalSndWaitForWal(), but yes we do need both the comments. Attached a patch (.txt) for comments improvement, please merge if appropriate. thanks Shveta
Attachment
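Since the ordering described in these comment discussions (wait for the local WAL flush first, then for the standbys in standby_slot_names to confirm the flushed position) keeps coming up, here is a deliberately simplified, standalone C model of that control flow. It is only a sketch: the types, wait-event values, and stub variables are stand-ins, not the walsender code from the patch.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's XLogRecPtr */

#define WAIT_EVENT_WAL_FLUSH		1	/* stand-in wait-event values */
#define WAIT_EVENT_STANDBY_CONFIRM	2

/* Oldest confirmed flush LSN among the slots in standby_slot_names (stub). */
static XLogRecPtr ss_oldest_flush_lsn = 90;

/* True if some standby listed in standby_slot_names is still behind. */
static bool
NeedToWaitForStandbys(XLogRecPtr flushed_lsn, uint32_t *wait_event)
{
	if (ss_oldest_flush_lsn < flushed_lsn)
	{
		*wait_event = WAIT_EVENT_STANDBY_CONFIRM;
		return true;
	}
	*wait_event = 0;
	return false;
}

/*
 * True if the WAL up to target_lsn is not yet flushed locally, or if the
 * standbys have not yet confirmed the flushed position.  Waiting only up to
 * the flushed position (rather than on every change) is the point discussed
 * in the comments above.
 */
static bool
NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn,
				 uint32_t *wait_event)
{
	if (target_lsn > flushed_lsn)
	{
		*wait_event = WAIT_EVENT_WAL_FLUSH;
		return true;
	}
	return NeedToWaitForStandbys(flushed_lsn, wait_event);
}

int
main(void)
{
	XLogRecPtr	target = 100;	/* WAL location the logical walsender needs */
	XLogRecPtr	flushed = 80;	/* currently flushed position (stub) */
	uint32_t	wait_event = 0;

	/* Loose analogue of the loop in WalSndWaitForWal(). */
	while (NeedToWaitForWal(target, flushed, &wait_event))
	{
		/* The real code sleeps in WalSndWait(); here we just simulate progress. */
		if (wait_event == WAIT_EVENT_WAL_FLUSH)
			flushed = target;				/* WAL got flushed */
		else
			ss_oldest_flush_lsn = flushed;	/* standbys caught up */
	}

	printf("flushed=%llu, standbys confirmed up to %llu\n",
		   (unsigned long long) flushed,
		   (unsigned long long) ss_oldest_flush_lsn);
	return 0;
}

The real functions also check whether the acquired slot is a logical failover slot and whether standby_slot_names is set at all before deciding to wait; those early exits are omitted here to keep the sketch short.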
I did performance tests for the v99 patch w.r.t. wait time analysis. As this patch introduces a wait for the standby before sending changes to a subscriber, at the primary node I logged the time at the start and end of the XLogSendLogical() call (which eventually calls WalSndWaitForWal()) and calculated the total time taken by this function during the load run. For the load, I ran pgbench for 15 minutes: Creating tables: pgbench -p 5833 postgres -qis 2 Running benchmark: pgbench postgres -p 5833 -c 10 -j 3 -T 900 -P 20 Machine details: 11th Gen Intel(R) Core(TM) i9-11950H @ 2.60GHz 32GB RAM OS - Windows 10 Enterprise Test setup: a primary node, with one physical standby node and one subscriber node having only one subscription with failover=true -- the slot-sync relevant parameters are set to default (OFF) for all the tests, i.e. hot_standby_feedback = off sync_replication_slots = false -- additional configuration on each instance is: shared_buffers = 6GB max_worker_processes = 32 max_parallel_maintenance_workers = 24 max_parallel_workers = 32 synchronous_commit = off checkpoint_timeout = 1d max_wal_size = 24GB min_wal_size = 15GB autovacuum = off To review the wait time impact with and without the patch, I compared three cases (two runs for each case): (1) HEAD code: time taken in run 1 = 103.935631 seconds, time taken in run 2 = 104.832186 seconds (2) HEAD code + v99 patch ('standby_slot_names' is not set): time taken in run 1 = 104.076343 seconds, time taken in run 2 = 103.116226 seconds (3) HEAD code + v99 patch + a valid 'standby_slot_names' is set: time taken in run 1 = 103.871012 seconds, time taken in run 2 = 103.793524 seconds The time consumption of XLogSendLogical() is almost the same in all the cases and no performance degradation is observed. -- Thanks, Nisha
On Monday, March 4, 2024 11:44 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Mon, Mar 04, 2024 at 01:28:04PM +0000, Zhijie Hou (Fujitsu) wrote: > > Attach the V105 patch set > > Thanks! > > Sorry I missed those during the previous review: No problem, thanks for the comments! > > 1 === > > Commit message: "these functions will block until" > > s/block/wait/ ? > > 2 === > > + when used with logical failover slots, will block until all > > s/block/wait/ ? > > It seems those are the 2 remaining "block" that could deserve the proposed > above change. I prefer using 'block' here, and it seems others have also suggested changing the 'wait' wording [1]. > > 3 === > > + invalidated = slot->data.invalidated != RS_INVAL_NONE; > + inactive = slot->active_pid == 0; > > invalidated = (slot->data.invalidated != RS_INVAL_NONE); inactive = > (slot->active_pid == 0); > > instead? > > I think it's easier to read and it looks like this is the way it's written in other > places (at least the few I checked). I think the current code is consistent with other similar code in slot.c. (grep "data.invalidated != RS_INVAL_NONE"). [1] https://www.postgresql.org/message-id/CAHut%2BPsATK8z1TEcfFE8zWoS1hagqsvaWYCgom_zYtScfwO7uQ%40mail.gmail.com Best Regards, Hou zj
On Tuesday, March 5, 2024 8:40 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v105-0001 > > ========== > doc/src/sgml/config.sgml > > 1. > + <para> > + The standbys corresponding to the physical replication slots in > + <varname>standby_slot_names</varname> must configure > + <literal>sync_replication_slots = true</literal> so they can receive > + logical failover slots changes from the primary. > + </para> > > /slots changes/slot changes/ Changed. > > ====== > doc/src/sgml/func.sgml > > 2. > + The function may be waiting if the specified slot is a logical failover > + slot and <link > linkend="guc-standby-slot-names"><varname>standby_slot_names</varna > me></link> > + is configured. > > I know this has been through multiple versions already, but this > latest wording "may be waiting..." doesn't seem very good to me. > > How about one of these? > > * The function may not be able to return immediately if the specified > slot is a logical failover slot and standby_slot_names is configured. > > * The function return might be blocked if the specified slot is a > logical failover slot and standby_slot_names is configured. > > * If the specified slot is a logical failover slot then the function > will block until all physical slots specified in standby_slot_names > have confirmed WAL receipt. > > * If the specified slot is a logical failover slot then the function > will not return until all physical slots specified in > standby_slot_names have confirmed WAL receipt. I prefer the last one. > > ~~~ > > 3. > + slot may return to an earlier position. The function may be waiting if > + the specified slot is a logical failover slot and > + <link > linkend="guc-standby-slot-names"><varname>standby_slot_names</varna > me></link> > > > Same as previous review comment #2 Changed. > > ====== > src/backend/replication/slot.c > > 4. WaitForStandbyConfirmation > > + * Used by logical decoding SQL functions that acquired logical failover slot. > > IIUC it doesn't work like that. pg_logical_slot_get_changes_guts() > calls here unconditionally (i.e. the SQL functions don't even check if > they are failover slots before calling this) so the comment seems > misleading/redundant. I removed the "acquired logical failover slot.". > > ====== > src/backend/replication/walsender.c > > 5. NeedToWaitForWal > > + /* > + * Check if the standby slots have caught up to the flushed position. It > + * is good to wait up to the flushed position and then let the WalSender > + * send the changes to logical subscribers one by one which are already > + * covered by the flushed position without needing to wait on every change > + * for standby confirmation. > + */ > + if (NeedToWaitForStandbys(flushed_lsn, wait_event)) > + return true; > + > + *wait_event = 0; > + return false; > +} > + > > 5a. > The comment (or part of it?) seems misplaced because it is talking > WalSender sending changes, but that is not happening in this function. > > Also, isn't what this is saying already described by the other comment > in the caller? e.g.: > > + /* > + * Within the loop, we wait for the necessary WALs to be flushed to > + * disk first, followed by waiting for standbys to catch up if there > + * are enough WALs or upon receiving the shutdown signal. To avoid the > + * scenario where standbys need to catch up to a newer WAL location in > + * each iteration, we update our idea of the currently flushed > + * position only if we are not waiting for standbys to catch up. 
> + */ > I moved these comments based on Shveta's suggestion. > ~ > > 5b. > Most of the code is unnecessary. AFAICT all this is exactly same as just 1 line: > > return NeedToWaitForStandbys(flushed_lsn, wait_event); Changed. > > ~~~ > > 6. WalSndWaitForWal > > + /* > + * Within the loop, we wait for the necessary WALs to be flushed to > + * disk first, followed by waiting for standbys to catch up if there > + * are enough WALs or upon receiving the shutdown signal. To avoid the > + * scenario where standbys need to catch up to a newer WAL location in > + * each iteration, we update our idea of the currently flushed > + * position only if we are not waiting for standbys to catch up. > + */ > > Regarding that 1st sentence: maybe this logic used to be done > explicitly "within the loop" but IIUC this logic is now hidden inside > NeedToWaitForWal() so the comment should mention that. Changed. Best Regards, Hou zj
On Tuesday, March 5, 2024 2:35 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Tue, Mar 5, 2024 at 9:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Mar 5, 2024 at 6:10 AM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > > > ====== > > > src/backend/replication/walsender.c > > > > > > 5. NeedToWaitForWal > > > > > > + /* > > > + * Check if the standby slots have caught up to the flushed > > > + position. It > > > + * is good to wait up to the flushed position and then let the > > > + WalSender > > > + * send the changes to logical subscribers one by one which are > > > + already > > > + * covered by the flushed position without needing to wait on every > > > + change > > > + * for standby confirmation. > > > + */ > > > + if (NeedToWaitForStandbys(flushed_lsn, wait_event)) return true; > > > + > > > + *wait_event = 0; > > > + return false; > > > +} > > > + > > > > > > 5a. > > > The comment (or part of it?) seems misplaced because it is talking > > > WalSender sending changes, but that is not happening in this function. > > > > > > > I don't think so. This is invoked only by walsender and a static > > function. I don't see any other better place to mention this. > > > > > Also, isn't what this is saying already described by the other > > > comment in the caller? e.g.: > > > > > > > Oh no, here we are explaining the wait order. > > I think there is a scope of improvement here. The comment inside > NeedToWaitForWal() which states that we need to wait here for standbys on > flush-position(and not on each change) should be outside of this function. It is > too embedded. And the comment which states the order of wait (first flush and > then standbys confirmation) should be outside the for-loop in > WalSndWaitForWal(), but yes we do need both the comments. Attached a > patch (.txt) for comments improvement, please merge if appropriate. Thanks, I have slightly modified the top-up patch and merged it. Attach the V106 patch which addressed above and Peter's comments[1]. [1] https://www.postgresql.org/message-id/CAHut%2BPsATK8z1TEcfFE8zWoS1hagqsvaWYCgom_zYtScfwO7uQ%40mail.gmail.com Best Regards, Hou zj
Attachment
On Fri, Mar 1, 2024 at 4:21 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, March 1, 2024 2:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > --- > > +void > > +assign_standby_slot_names(const char *newval, void *extra) { > > + List *standby_slots; > > + MemoryContext oldcxt; > > + char *standby_slot_names_cpy = extra; > > + > > > > Given that the newval and extra have the same data (standby_slot_names > > value), why do we not use newval instead? I think that if we use > > newval, we don't need to guc_strdup() in check_standby_slot_names(), > > we might need to do list_copy_deep() instead, though. It's not clear > > to me as there is no comment. > > I think SplitIdentifierString will modify the passed in string, so we'd better > not pass the newval to it, otherwise the stored guc string(standby_slot_names) > will be changed. I can see we are doing similar thing in other GUC check/assign > function as well. (check_wal_consistency_checking/ > assign_wal_consistency_checking, check_createrole_self_grant/ > assign_createrole_self_grant ...). Why does it have to be a List in the first place? In earlier version patches, we used to copy the list and delete the element until it became empty, while waiting for physical wal senders. But we now just refer to each slot name in the list. The current code assumes that stnadby_slot_names_cpy is allocated in GUCMemoryContext but once it changes, it will silently get broken. I think we can check and assign standby_slot_names in a similar way to check/assign_temp_tablespaces and check/assign_synchronous_standby_names. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Tue, Mar 5, 2024 at 4:21 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Tuesday, March 5, 2024 2:35 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Tue, Mar 5, 2024 at 9:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Mar 5, 2024 at 6:10 AM Peter Smith <smithpb2250@gmail.com> > > wrote: > > > > > > > > ====== > > > > src/backend/replication/walsender.c > > > > > > > > 5. NeedToWaitForWal > > > > > > > > + /* > > > > + * Check if the standby slots have caught up to the flushed > > > > + position. It > > > > + * is good to wait up to the flushed position and then let the > > > > + WalSender > > > > + * send the changes to logical subscribers one by one which are > > > > + already > > > > + * covered by the flushed position without needing to wait on every > > > > + change > > > > + * for standby confirmation. > > > > + */ > > > > + if (NeedToWaitForStandbys(flushed_lsn, wait_event)) return true; > > > > + > > > > + *wait_event = 0; > > > > + return false; > > > > +} > > > > + > > > > > > > > 5a. > > > > The comment (or part of it?) seems misplaced because it is talking > > > > WalSender sending changes, but that is not happening in this function. > > > > > > > > > > I don't think so. This is invoked only by walsender and a static > > > function. I don't see any other better place to mention this. > > > > > > > Also, isn't what this is saying already described by the other > > > > comment in the caller? e.g.: > > > > > > > > > > Oh no, here we are explaining the wait order. > > > > I think there is a scope of improvement here. The comment inside > > NeedToWaitForWal() which states that we need to wait here for standbys on > > flush-position(and not on each change) should be outside of this function. It is > > too embedded. And the comment which states the order of wait (first flush and > > then standbys confirmation) should be outside the for-loop in > > WalSndWaitForWal(), but yes we do need both the comments. Attached a > > patch (.txt) for comments improvement, please merge if appropriate. > > Thanks, I have slightly modified the top-up patch and merged it. > > Attach the V106 patch which addressed above and Peter's comments[1]. > I have one question about PhysicalWakeupLogicalWalSnd(): +/* + * Wake up the logical walsender processes with logical failover slots if the + * currently acquired physical slot is specified in standby_slot_names GUC. + */ +void +PhysicalWakeupLogicalWalSnd(void) +{ + List *standby_slots; + + Assert(MyReplicationSlot && SlotIsPhysical(MyReplicationSlot)); + + standby_slots = GetStandbySlotList(); + + foreach_ptr(char, name, standby_slots) + { + if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0) + { + ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv); + return; + } + } +} IIUC walsender calls this function every time after updating the slot's restart_lsn, which could be very frequently. I'm concerned that it could be expensive to do a linear search on the standby_slot_names list every time. Is it possible to cache the information in walsender local somehow? For example, the walsender sets a flag in WalSnd after processing the config file if its slot name is present in standby_slot_names. That way, they can wake up logical walsenders if eligible after updating the slot's restart_lsn, without checking the standby_slot_names value. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
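As an illustration of the suggested caching, a minimal sketch could look like the following (the flag and helper names are hypothetical and not part of the patch). The membership check runs only when the configuration is (re)loaded, so the per-update wakeup path reduces to a boolean test:

/*
 * Hypothetical sketch of the suggested optimization; the flag and helper
 * are invented names.  The membership test is redone only when the
 * configuration is reloaded, not on every restart_lsn update.
 */
static bool am_in_standby_slot_names = false;

static void
RecomputeStandbySlotMembership(void)
{
    List       *standby_slots = GetStandbySlotList();

    am_in_standby_slot_names = false;

    foreach_ptr(char, name, standby_slots)
    {
        if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0)
        {
            am_in_standby_slot_names = true;
            break;
        }
    }
}

/* The hot path then avoids the list walk entirely. */
void
PhysicalWakeupLogicalWalSnd(void)
{
    Assert(MyReplicationSlot && SlotIsPhysical(MyReplicationSlot));

    if (am_in_standby_slot_names)
        ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv);
}

(As the follow-up messages note, the list is expected to stay short, so this remained a possible future optimization rather than part of the committed patch.)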
On Wednesday, March 6, 2024 9:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: Hi, > On Fri, Mar 1, 2024 at 4:21 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > On Friday, March 1, 2024 2:11 PM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > > > > > > > > --- > > > +void > > > +assign_standby_slot_names(const char *newval, void *extra) { > > > + List *standby_slots; > > > + MemoryContext oldcxt; > > > + char *standby_slot_names_cpy = extra; > > > + > > > > > > Given that the newval and extra have the same data > > > (standby_slot_names value), why do we not use newval instead? I > > > think that if we use newval, we don't need to guc_strdup() in > > > check_standby_slot_names(), we might need to do list_copy_deep() > > > instead, though. It's not clear to me as there is no comment. > > > > I think SplitIdentifierString will modify the passed in string, so > > we'd better not pass the newval to it, otherwise the stored guc > > string(standby_slot_names) will be changed. I can see we are doing > > similar thing in other GUC check/assign function as well. > > (check_wal_consistency_checking/ assign_wal_consistency_checking, > > check_createrole_self_grant/ assign_createrole_self_grant ...). > > Why does it have to be a List in the first place? I thought the List type is convenient to use here, as we have existing list build function(SplitIdentifierString), and have convenient list macro to loop the list(foreach_ptr) which can save some codes. > In earlier version patches, we > used to copy the list and delete the element until it became empty, while > waiting for physical wal senders. But we now just refer to each slot name in the > list. The current code assumes that stnadby_slot_names_cpy is allocated in > GUCMemoryContext but once it changes, it will silently get broken. I think we > can check and assign standby_slot_names in a similar way to > check/assign_temp_tablespaces and > check/assign_synchronous_standby_names. Yes, we could do follow it by allocating an array and copy each slot name into it, but it also requires some codes to build and scan the array. So, is it possible to expose the GucMemorycontext or have an API like guc_copy_list instead ? If we don't want to touch the guc api, I am ok with using an array as well. Best Regards, Hou zj
On Wed, Mar 6, 2024 at 7:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Mar 5, 2024 at 4:21 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > I have one question about PhysicalWakeupLogicalWalSnd(): > > +/* > + * Wake up the logical walsender processes with logical failover slots if the > + * currently acquired physical slot is specified in standby_slot_names GUC. > + */ > +void > +PhysicalWakeupLogicalWalSnd(void) > +{ > + List *standby_slots; > + > + Assert(MyReplicationSlot && SlotIsPhysical(MyReplicationSlot)); > + > + standby_slots = GetStandbySlotList(); > + > + foreach_ptr(char, name, standby_slots) > + { > + if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0) > + { > + > ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv); > + return; > + } > + } > +} > > IIUC walsender calls this function every time after updating the > slot's restart_lsn, which could be very frequently. I'm concerned that > it could be expensive to do a linear search on the standby_slot_names > list every time. Is it possible to cache the information in walsender > local somehow? > We can cache this information for WalSender but not for the case where users use pg_physical_replication_slot_advance(). We don't expect this list to be long enough to matter, so we can leave this optimization for the future especially if we encounter any such case unless you think otherwise. -- With Regards, Amit Kapila.
On Wed, Mar 6, 2024 at 12:47 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 6, 2024 at 7:36 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Mar 5, 2024 at 4:21 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > I have one question about PhysicalWakeupLogicalWalSnd(): > > > > +/* > > + * Wake up the logical walsender processes with logical failover slots if the > > + * currently acquired physical slot is specified in standby_slot_names GUC. > > + */ > > +void > > +PhysicalWakeupLogicalWalSnd(void) > > +{ > > + List *standby_slots; > > + > > + Assert(MyReplicationSlot && SlotIsPhysical(MyReplicationSlot)); > > + > > + standby_slots = GetStandbySlotList(); > > + > > + foreach_ptr(char, name, standby_slots) > > + { > > + if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0) > > + { > > + > > ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv); > > + return; > > + } > > + } > > +} > > > > IIUC walsender calls this function every time after updating the > > slot's restart_lsn, which could be very frequently. I'm concerned that > > it could be expensive to do a linear search on the standby_slot_names > > list every time. Is it possible to cache the information in walsender > > local somehow? > > > > We can cache this information for WalSender but not for the case where > users use pg_physical_replication_slot_advance(). We don't expect this > list to be long enough to matter, so we can leave this optimization > for the future especially if we encounter any such case unless you > think otherwise. Okay, agreed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Mar 1, 2024 at 3:22 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Fri, Mar 1, 2024 at 5:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > ... > > + /* > > + * "*" is not accepted as in that case primary will not be able to know > > + * for which all standbys to wait for. Even if we have physical slots > > + * info, there is no way to confirm whether there is any standby > > + * configured for the known physical slots. > > + */ > > + if (strcmp(*newval, "*") == 0) > > + { > > + GUC_check_errdetail("\"*\" is not accepted for > > standby_slot_names"); > > + return false; > > + } > > > > Why only '*' is checked aside from validate_standby_slots()? I think > > that the doc doesn't mention anything about '*' and '*' cannot be used > > as a replication slot name. So even if we don't have this check, it > > might be no problem. > > > > Hi, a while ago I asked this same question. See [1 #28] for the response.. Thanks. Quoting the response from the email: SplitIdentifierString() does not give error for '*' and '*' can be considered as valid value which if accepted can mislead user that all the standbys's slots are now considered, which is not the case here. So we want to explicitly call out this case i.e. '*' is not accepted as valid value for standby_slot_names. IIUC we're concerned with a case like where the user confused standby_slot_names values with synchronous_standby_names values. Which means we would need to keep thath check consistent with available values of synchronous_standby_names. For example, if we support a regexp for synchronous_standby_names, we will have to update the check so we disallow other special characters. Also, if we add a new replication-related parameter that accepts other special characters as the value in the future, will we want to raise an error also for such values in check_standby_slot_names()? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Mar 6, 2024 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Mar 1, 2024 at 3:22 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > On Fri, Mar 1, 2024 at 5:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > ... > > > + /* > > > + * "*" is not accepted as in that case primary will not be able to know > > > + * for which all standbys to wait for. Even if we have physical slots > > > + * info, there is no way to confirm whether there is any standby > > > + * configured for the known physical slots. > > > + */ > > > + if (strcmp(*newval, "*") == 0) > > > + { > > > + GUC_check_errdetail("\"*\" is not accepted for > > > standby_slot_names"); > > > + return false; > > > + } > > > > > > Why only '*' is checked aside from validate_standby_slots()? I think > > > that the doc doesn't mention anything about '*' and '*' cannot be used > > > as a replication slot name. So even if we don't have this check, it > > > might be no problem. > > > > > > > Hi, a while ago I asked this same question. See [1 #28] for the response.. > > Thanks. Quoting the response from the email: > > SplitIdentifierString() does not give error for '*' and '*' can be considered > as valid value which if accepted can mislead user that all the standbys's slots > are now considered, which is not the case here. So we want to explicitly call > out this case i.e. '*' is not accepted as valid value for standby_slot_names. > > IIUC we're concerned with a case like where the user confused > standby_slot_names values with synchronous_standby_names values. Which > means we would need to keep thath check consistent with available > values of synchronous_standby_names. > Both have different formats to specify. For example, for synchronous_standby_names we have the following kind of syntax to specify: [FIRST] num_sync ( standby_name [, ...] ) ANY num_sync ( standby_name [, ...] ) standby_name [, ...] I don't think we can have a common check for both of them as the specifications are different. In fact, I don't think we need a special check for '*'. The user will anyway get a WARNING at a later point that the replication slot with that name doesn't exist. -- With Regards, Amit Kapila.
On Wednesday, March 6, 2024 11:04 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Wednesday, March 6, 2024 9:30 AM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > Hi, > > > On Fri, Mar 1, 2024 at 4:21 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> > > wrote: > > > > > > On Friday, March 1, 2024 2:11 PM Masahiko Sawada > > <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > --- > > > > +void > > > > +assign_standby_slot_names(const char *newval, void *extra) { > > > > + List *standby_slots; > > > > + MemoryContext oldcxt; > > > > + char *standby_slot_names_cpy = extra; > > > > + > > > > > > > > Given that the newval and extra have the same data > > > > (standby_slot_names value), why do we not use newval instead? I > > > > think that if we use newval, we don't need to guc_strdup() in > > > > check_standby_slot_names(), we might need to do list_copy_deep() > > > > instead, though. It's not clear to me as there is no comment. > > > > > > I think SplitIdentifierString will modify the passed in string, so > > > we'd better not pass the newval to it, otherwise the stored guc > > > string(standby_slot_names) will be changed. I can see we are doing > > > similar thing in other GUC check/assign function as well. > > > (check_wal_consistency_checking/ assign_wal_consistency_checking, > > > check_createrole_self_grant/ assign_createrole_self_grant ...). > > > > Why does it have to be a List in the first place? > > I thought the List type is convenient to use here, as we have existing list build > function(SplitIdentifierString), and have convenient list macro to loop the > list(foreach_ptr) which can save some codes. > > > In earlier version patches, we > > used to copy the list and delete the element until it became empty, > > while waiting for physical wal senders. But we now just refer to each > > slot name in the list. The current code assumes that > > stnadby_slot_names_cpy is allocated in GUCMemoryContext but once it > > changes, it will silently get broken. I think we can check and assign > > standby_slot_names in a similar way to check/assign_temp_tablespaces > > and check/assign_synchronous_standby_names. > > Yes, we could do follow it by allocating an array and copy each slot name into it, > but it also requires some codes to build and scan the array. So, is it possible to > expose the GucMemorycontext or have an API like guc_copy_list instead ? > If we don't want to touch the guc api, I am ok with using an array as well. I rethink about this and realize that it's not good to do the memory allocation in assign hook function. As the "src/backend/utils/misc/README" said, we'd better do that in check hook function and pass it via extra to assign hook function. And thus array is a good choice in this case rather than a List which cannot be passed to *extra. Here is the V107 patch set which parse and cache the standby slot names in an array instead of a List. Best Regards, Hou zj
Attachment
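To make the check-hook/assign-hook split described above concrete, here is a rough sketch of the pattern (simplified: error handling for duplicates and non-existent slots is omitted, and the names follow the naming settled on later in this thread). The check hook parses the value, flattens the names into a single guc_malloc'd chunk returned via *extra, and the assign hook merely adopts that chunk without allocating anything:

/* Flat representation that fits in one guc_malloc'd chunk. */
typedef struct
{
    int         nslotnames;     /* number of names that follow */
    char        slot_names[FLEXIBLE_ARRAY_MEMBER];  /* consecutive NUL-terminated names */
} StandbySlotNamesConfigData;

static StandbySlotNamesConfigData *standby_slot_names_config;

bool
check_standby_slot_names(char **newval, void **extra, GucSource source)
{
    char       *rawname = pstrdup(*newval); /* SplitIdentifierString scribbles on its input */
    List       *elemlist = NIL;
    StandbySlotNamesConfigData *config;
    char       *ptr;

    if (!SplitIdentifierString(rawname, ',', &elemlist))
    {
        GUC_check_errdetail("List syntax is invalid.");
        pfree(rawname);
        list_free(elemlist);
        return false;
    }

    /*
     * Flatten the parsed names into one chunk handed back via *extra.
     * Splitting only ever shortens the individual names, so strlen(*newval)
     * + 1 bytes are enough to hold all of them.
     */
    config = (StandbySlotNamesConfigData *)
        guc_malloc(LOG, offsetof(StandbySlotNamesConfigData, slot_names) +
                   strlen(*newval) + 1);
    if (config == NULL)
    {
        pfree(rawname);
        list_free(elemlist);
        return false;
    }

    config->nslotnames = list_length(elemlist);
    ptr = config->slot_names;
    foreach_ptr(char, name, elemlist)
    {
        strcpy(ptr, name);
        ptr += strlen(name) + 1;
    }

    pfree(rawname);
    list_free(elemlist);
    *extra = config;
    return true;
}

void
assign_standby_slot_names(const char *newval, void *extra)
{
    /* No allocation here; just adopt the chunk built by the check hook. */
    standby_slot_names_config = (StandbySlotNamesConfigData *) extra;
}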
On Wednesday, March 6, 2024 9:13 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Wednesday, March 6, 2024 11:04 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Wednesday, March 6, 2024 9:30 AM Masahiko Sawada > > <sawada.mshk@gmail.com> wrote: > > > > Hi, > > > > > On Fri, Mar 1, 2024 at 4:21 PM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> > > > wrote: > > > > > > > > On Friday, March 1, 2024 2:11 PM Masahiko Sawada > > > <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > > > > --- > > > > > +void > > > > > +assign_standby_slot_names(const char *newval, void *extra) { > > > > > + List *standby_slots; > > > > > + MemoryContext oldcxt; > > > > > + char *standby_slot_names_cpy = extra; > > > > > + > > > > > > > > > > Given that the newval and extra have the same data > > > > > (standby_slot_names value), why do we not use newval instead? I > > > > > think that if we use newval, we don't need to guc_strdup() in > > > > > check_standby_slot_names(), we might need to do list_copy_deep() > > > > > instead, though. It's not clear to me as there is no comment. > > > > > > > > I think SplitIdentifierString will modify the passed in string, so > > > > we'd better not pass the newval to it, otherwise the stored guc > > > > string(standby_slot_names) will be changed. I can see we are doing > > > > similar thing in other GUC check/assign function as well. > > > > (check_wal_consistency_checking/ assign_wal_consistency_checking, > > > > check_createrole_self_grant/ assign_createrole_self_grant ...). > > > > > > Why does it have to be a List in the first place? > > > > I thought the List type is convenient to use here, as we have existing > > list build function(SplitIdentifierString), and have convenient list > > macro to loop the > > list(foreach_ptr) which can save some codes. > > > > > In earlier version patches, we > > > used to copy the list and delete the element until it became empty, > > > while waiting for physical wal senders. But we now just refer to > > > each slot name in the list. The current code assumes that > > > stnadby_slot_names_cpy is allocated in GUCMemoryContext but once it > > > changes, it will silently get broken. I think we can check and > > > assign standby_slot_names in a similar way to > > > check/assign_temp_tablespaces and > check/assign_synchronous_standby_names. > > > > Yes, we could do follow it by allocating an array and copy each slot > > name into it, but it also requires some codes to build and scan the > > array. So, is it possible to expose the GucMemorycontext or have an API like > guc_copy_list instead ? > > If we don't want to touch the guc api, I am ok with using an array as well. > > I rethink about this and realize that it's not good to do the memory allocation in > assign hook function. As the "src/backend/utils/misc/README" said, we'd > better do that in check hook function and pass it via extra to assign hook > function. And thus array is a good choice in this case rather than a List which > cannot be passed to *extra. > > Here is the V107 patch set which parse and cache the standby slot names in an > array instead of a List. The patch needs to be rebased due to recent commit. Attach the V107_2 path set. There are no code changes in this version. Best Regards, Hou zj
Attachment
Here are some review comments for v107-0001 ====== src/backend/replication/slot.c 1. +/* + * Struct for the configuration of standby_slot_names. + * + * Note: this must be a flat representation that can be held in a single chunk + * of guc_malloc'd memory, so that it can be stored as the "extra" data for the + * standby_slot_names GUC. + */ +typedef struct +{ + int slot_num; + + /* slot_names contains nmembers consecutive nul-terminated C strings */ + char slot_names[FLEXIBLE_ARRAY_MEMBER]; +} StandbySlotConfigData; + 1a. To avoid any ambiguity this 1st field is somehow a slot ID number, I felt a better name would be 'nslotnames' or even just 'n' or 'count', ~ 1b. (fix typo) SUGGESTION for the 2nd field comment slot_names is a chunk of 'n' X consecutive null-terminated C strings ~ 1c. A more explanatory name for this typedef maybe is 'StandbySlotNamesConfigData' ? ~~~ 2. +/* This is parsed and cached configuration for standby_slot_names */ +static StandbySlotConfigData *standby_slot_config; 2a. /This is parsed and cached configuration for .../This is the parsed and cached configuration for .../ ~ 2b. Similar to above -- since this only has name information maybe it is more correct to call it 'standby_slot_names_config'? ~~~ 3. +/* + * A helper function to validate slots specified in GUC standby_slot_names. + * + * The rawname will be parsed, and the parsed result will be saved into + * *elemlist. + */ +static bool +validate_standby_slots(char *rawname, List **elemlist) /and the parsed result/and the result/ ~~~ 4. check_standby_slot_names + /* Need a modifiable copy of string */ + rawname = pstrdup(*newval); /copy of string/copy of the GUC string/ ~~~ 5. +assign_standby_slot_names(const char *newval, void *extra) +{ + /* + * The standby slots may have changed, so we must recompute the oldest + * LSN. + */ + ss_oldest_flush_lsn = InvalidXLogRecPtr; + + standby_slot_config = (StandbySlotConfigData *) extra; +} To avoid leaking don't we need to somewhere take care to free any memory used by a previous value (if any) of this 'standby_slot_config'? ~~~ 6. AcquiredStandbySlot +/* + * Return true if the currently acquired slot is specified in + * standby_slot_names GUC; otherwise, return false. + */ +bool +AcquiredStandbySlot(void) +{ + const char *name; + + /* Return false if there is no value in standby_slot_names */ + if (standby_slot_config == NULL) + return false; + + name = standby_slot_config->slot_names; + for (int i = 0; i < standby_slot_config->slot_num; i++) + { + if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0) + return true; + + name += strlen(name) + 1; + } + + return false; +} 6a. Just checking "(standby_slot_config == NULL)" doesn't seem enough to me, because IIUIC it is possible when 'standby_slot_names' has no value then maybe standby_slot_config is not NULL but standby_slot_config->slot_num is 0. ~ 6b. IMO this function would be tidier written such that the MyReplicationSlot->data.name is passed as a parameter. Then you can name the function more naturally like: IsSlotInStandbySlotNames(const char *slot_name) ~ 6c. IMO the body of the function will be tidier if written so there are only 2 returns instead of 3 like SUGGESTION: if (...) { for (...) { ... return true; } } return false; ~~~ 7. + /* + * Don't need to wait for the standbys to catch up if there is no value in + * standby_slot_names. 
+ */ + if (standby_slot_config == NULL) + return true; (similar to a previous review comment) This check doesn't seem enough because IIUIC it is possible when 'standby_slot_names' has no value then maybe standby_slot_config is not NULL but standby_slot_config->slot_num is 0. ~~~ 8. WaitForStandbyConfirmation + /* + * Don't need to wait for the standby to catch up if the current acquired + * slot is not a logical failover slot, or there is no value in + * standby_slot_names. + */ + if (!MyReplicationSlot->data.failover || !standby_slot_config) + return; (similar to a previous review comment) IIUIC it is possible that when 'standby_slot_names' has no value, then standby_slot_config is not NULL but standby_slot_config->slot_num is 0. So shouldn't that be checked too? Perhaps it is convenient to encapsulate this check using some macro: #define StandbySlotNamesHasNoValue() (standby_slot_config = NULL || standby_slot_config->slot_num == 0) ---------- Kind Regards, Peter Smith. Fujitsu Australia
On Wed, Mar 6, 2024 at 6:54 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Wednesday, March 6, 2024 9:13 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > > On Wednesday, March 6, 2024 11:04 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Wednesday, March 6, 2024 9:30 AM Masahiko Sawada > > > <sawada.mshk@gmail.com> wrote: > > > > > > Hi, > > > > > > > On Fri, Mar 1, 2024 at 4:21 PM Zhijie Hou (Fujitsu) > > > > <houzj.fnst@fujitsu.com> > > > > wrote: > > > > > > > > > > On Friday, March 1, 2024 2:11 PM Masahiko Sawada > > > > <sawada.mshk@gmail.com> wrote: > > > > > > > > > > > > > > > > > > --- > > > > > > +void > > > > > > +assign_standby_slot_names(const char *newval, void *extra) { > > > > > > + List *standby_slots; > > > > > > + MemoryContext oldcxt; > > > > > > + char *standby_slot_names_cpy = extra; > > > > > > + > > > > > > > > > > > > Given that the newval and extra have the same data > > > > > > (standby_slot_names value), why do we not use newval instead? I > > > > > > think that if we use newval, we don't need to guc_strdup() in > > > > > > check_standby_slot_names(), we might need to do list_copy_deep() > > > > > > instead, though. It's not clear to me as there is no comment. > > > > > > > > > > I think SplitIdentifierString will modify the passed in string, so > > > > > we'd better not pass the newval to it, otherwise the stored guc > > > > > string(standby_slot_names) will be changed. I can see we are doing > > > > > similar thing in other GUC check/assign function as well. > > > > > (check_wal_consistency_checking/ assign_wal_consistency_checking, > > > > > check_createrole_self_grant/ assign_createrole_self_grant ...). > > > > > > > > Why does it have to be a List in the first place? > > > > > > I thought the List type is convenient to use here, as we have existing > > > list build function(SplitIdentifierString), and have convenient list > > > macro to loop the > > > list(foreach_ptr) which can save some codes. > > > > > > > In earlier version patches, we > > > > used to copy the list and delete the element until it became empty, > > > > while waiting for physical wal senders. But we now just refer to > > > > each slot name in the list. The current code assumes that > > > > stnadby_slot_names_cpy is allocated in GUCMemoryContext but once it > > > > changes, it will silently get broken. I think we can check and > > > > assign standby_slot_names in a similar way to > > > > check/assign_temp_tablespaces and > > check/assign_synchronous_standby_names. > > > > > > Yes, we could do follow it by allocating an array and copy each slot > > > name into it, but it also requires some codes to build and scan the > > > array. So, is it possible to expose the GucMemorycontext or have an API like > > guc_copy_list instead ? > > > If we don't want to touch the guc api, I am ok with using an array as well. > > > > I rethink about this and realize that it's not good to do the memory allocation in > > assign hook function. As the "src/backend/utils/misc/README" said, we'd > > better do that in check hook function and pass it via extra to assign hook > > function. And thus array is a good choice in this case rather than a List which > > cannot be passed to *extra. > > > > Here is the V107 patch set which parse and cache the standby slot names in an > > array instead of a List. > > The patch needs to be rebased due to recent commit. > > Attach the V107_2 path set. There are no code changes in this version. 
The patch needed to be rebased due to a recent commit. Attached v107_3, there are no code changes in this version. thanks Shveta
Attachment
On Wed, Mar 6, 2024 at 5:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 6, 2024 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Mar 1, 2024 at 3:22 PM Peter Smith <smithpb2250@gmail.com> wrote: > > > > > > On Fri, Mar 1, 2024 at 5:11 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > ... > > > > + /* > > > > + * "*" is not accepted as in that case primary will not be able to know > > > > + * for which all standbys to wait for. Even if we have physical slots > > > > + * info, there is no way to confirm whether there is any standby > > > > + * configured for the known physical slots. > > > > + */ > > > > + if (strcmp(*newval, "*") == 0) > > > > + { > > > > + GUC_check_errdetail("\"*\" is not accepted for > > > > standby_slot_names"); > > > > + return false; > > > > + } > > > > > > > > Why only '*' is checked aside from validate_standby_slots()? I think > > > > that the doc doesn't mention anything about '*' and '*' cannot be used > > > > as a replication slot name. So even if we don't have this check, it > > > > might be no problem. > > > > > > > > > > Hi, a while ago I asked this same question. See [1 #28] for the response.. > > > > Thanks. Quoting the response from the email: > > > > SplitIdentifierString() does not give error for '*' and '*' can be considered > > as valid value which if accepted can mislead user that all the standbys's slots > > are now considered, which is not the case here. So we want to explicitly call > > out this case i.e. '*' is not accepted as valid value for standby_slot_names. > > > > IIUC we're concerned with a case like where the user confused > > standby_slot_names values with synchronous_standby_names values. Which > > means we would need to keep thath check consistent with available > > values of synchronous_standby_names. > > > > Both have different formats to specify. For example, for > synchronous_standby_names we have the following kind of syntax to > specify: > [FIRST] num_sync ( standby_name [, ...] ) > ANY num_sync ( standby_name [, ...] ) > standby_name [, ...] > > I don't think we can have a common check for both of them as the > specifications are different. In fact, I don't think we need a special > check for '*'. I think so too. > The user will anyway get a WARNING at a later point > that the replication slot with that name doesn't exist. Right. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Mar 7, 2024 at 7:35 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v107-0001 > > ====== > src/backend/replication/slot.c > > 1. > +/* > + * Struct for the configuration of standby_slot_names. > + * > + * Note: this must be a flat representation that can be held in a single chunk > + * of guc_malloc'd memory, so that it can be stored as the "extra" data for the > + * standby_slot_names GUC. > + */ > +typedef struct > +{ > + int slot_num; > + > + /* slot_names contains nmembers consecutive nul-terminated C strings */ > + char slot_names[FLEXIBLE_ARRAY_MEMBER]; > +} StandbySlotConfigData; > + > > 1a. > To avoid any ambiguity this 1st field is somehow a slot ID number, I > felt a better name would be 'nslotnames' or even just 'n' or 'count', > We can probably just add a comment above slot_num and that should be sufficient but I am fine with 'nslotnames' as well, in anycase let's add a comment for the same. > > 6b. > IMO this function would be tidier written such that the > MyReplicationSlot->data.name is passed as a parameter. Then you can > name the function more naturally like: > > IsSlotInStandbySlotNames(const char *slot_name) > +1. How about naming it as SlotExistsinStandbySlotNames(char *slot_name) and pass the slot_name from MyReplicationSlot? Otherwise, we need an Assert for MyReplicationSlot in this function. Also, can we add a comment like below before the loop: + /* + * XXX: We are not expecting this list to be long so a linear search + * shouldn't hurt but if that turns out not to be true then we can cache + * this information for each WalSender as well. + */ -- With Regards, Amit Kapila.
On Thu, Mar 7, 2024 at 8:37 AM shveta malik <shveta.malik@gmail.com> wrote: > I thought about whether we can make standby_slot_names USERSET instead of SIGHUP, and it doesn't sound like a good idea, as that can lead to inconsistent standby replicas even after configuring the correct value of standby_slot_names. One could set a different or '' (empty) value for a particular session and consume all changes from the slot without waiting for the standby to acknowledge the change. Also, it would be difficult for users to ensure that the standby is always ahead of the subscribers. Does anyone think differently? -- With Regards, Amit Kapila.
On Thursday, March 7, 2024 10:05 AM Peter Smith <smithpb2250@gmail.com> wrote: > > Here are some review comments for v107-0001 Thanks for the comments. > > ====== > src/backend/replication/slot.c > > 1. > +/* > + * Struct for the configuration of standby_slot_names. > + * > + * Note: this must be a flat representation that can be held in a > +single chunk > + * of guc_malloc'd memory, so that it can be stored as the "extra" data > +for the > + * standby_slot_names GUC. > + */ > +typedef struct > +{ > + int slot_num; > + > + /* slot_names contains nmembers consecutive nul-terminated C strings > +*/ char slot_names[FLEXIBLE_ARRAY_MEMBER]; > +} StandbySlotConfigData; > + > > 1a. > To avoid any ambiguity this 1st field is somehow a slot ID number, I felt a better > name would be 'nslotnames' or even just 'n' or 'count', Changed to 'nslotnames'. > > ~ > > 1b. > (fix typo) > > SUGGESTION for the 2nd field comment > slot_names is a chunk of 'n' X consecutive null-terminated C strings Changed. > > ~ > > 1c. > A more explanatory name for this typedef maybe is > 'StandbySlotNamesConfigData' ? Changed. > > ~~~ > > > 2. > +/* This is parsed and cached configuration for standby_slot_names */ > +static StandbySlotConfigData *standby_slot_config; > > > 2a. > /This is parsed and cached configuration for .../This is the parsed and cached > configuration for .../ Changed. > > ~ > > 2b. > Similar to above -- since this only has name information maybe it is more > correct to call it 'standby_slot_names_config'? > Changed. > ~~~ > > 3. > +/* > + * A helper function to validate slots specified in GUC standby_slot_names. > + * > + * The rawname will be parsed, and the parsed result will be saved into > + * *elemlist. > + */ > +static bool > +validate_standby_slots(char *rawname, List **elemlist) > > /and the parsed result/and the result/ > Changed. > ~~~ > > 4. check_standby_slot_names > > + /* Need a modifiable copy of string */ rawname = pstrdup(*newval); > > /copy of string/copy of the GUC string/ > Changed. > ~~~ > > 5. > +assign_standby_slot_names(const char *newval, void *extra) { > + /* > + * The standby slots may have changed, so we must recompute the oldest > + * LSN. > + */ > + ss_oldest_flush_lsn = InvalidXLogRecPtr; > + > + standby_slot_config = (StandbySlotConfigData *) extra; } > > To avoid leaking don't we need to somewhere take care to free any memory > used by a previous value (if any) of this 'standby_slot_config'? > The memory of extra is maintained by the GUC mechanism. It will be automatically freed when the associated GUC setting is no longer of interest. See src/backend/utils/misc/README for details. > ~~~ > > 6. AcquiredStandbySlot > > +/* > + * Return true if the currently acquired slot is specified in > + * standby_slot_names GUC; otherwise, return false. > + */ > +bool > +AcquiredStandbySlot(void) > +{ > + const char *name; > + > + /* Return false if there is no value in standby_slot_names */ if > + (standby_slot_config == NULL) return false; > + > + name = standby_slot_config->slot_names; for (int i = 0; i < > + standby_slot_config->slot_num; i++) { if (strcmp(name, > + NameStr(MyReplicationSlot->data.name)) == 0) return true; > + > + name += strlen(name) + 1; > + } > + > + return false; > +} > > 6a. > Just checking "(standby_slot_config == NULL)" doesn't seem enough to me, > because IIUIC it is possible when 'standby_slot_names' has no value then > maybe standby_slot_config is not NULL but standby_slot_config->slot_num is > 0. 
The standby_slot_config will always be NULL if there is no value in it. While checking, I did find a rare case that if there are only some white space in the standby_slot_names, then slot_num will be 0, and have fixed it so that standby_slot_config will always be NULL if there is no meaning value in guc. > > ~ > > 6b. > IMO this function would be tidier written such that the > MyReplicationSlot->data.name is passed as a parameter. Then you can > name the function more naturally like: > > IsSlotInStandbySlotNames(const char *slot_name) Changed it to SlotExistsInStandbySlotNames. > > ~ > > 6c. > IMO the body of the function will be tidier if written so there are only 2 returns > instead of 3 like > > SUGGESTION: > if (...) > { > for (...) > { > ... > return true; > } > } > return false; I personally prefer the current style. > > ~~~ > > 7. > + /* > + * Don't need to wait for the standbys to catch up if there is no value > + in > + * standby_slot_names. > + */ > + if (standby_slot_config == NULL) > + return true; > > (similar to a previous review comment) > > This check doesn't seem enough because IIUIC it is possible when > 'standby_slot_names' has no value then maybe standby_slot_config is not NULL > but standby_slot_config->slot_num is 0. Same as above. > > ~~~ > > 8. WaitForStandbyConfirmation > > + /* > + * Don't need to wait for the standby to catch up if the current > + acquired > + * slot is not a logical failover slot, or there is no value in > + * standby_slot_names. > + */ > + if (!MyReplicationSlot->data.failover || !standby_slot_config) return; > > (similar to a previous review comment) > > IIUIC it is possible that when 'standby_slot_names' has no value, then > standby_slot_config is not NULL but standby_slot_config->slot_num is 0. So > shouldn't that be checked too? > > Perhaps it is convenient to encapsulate this check using some macro: > #define StandbySlotNamesHasNoValue() (standby_slot_config = NULL || > standby_slot_config->slot_num == 0) Same as above, I think we can avoid checking slot_num. Best Regards, Hou zj
On Thursday, March 7, 2024 12:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Mar 7, 2024 at 7:35 AM Peter Smith <smithpb2250@gmail.com> > wrote: > > > > Here are some review comments for v107-0001 > > > > ====== > > src/backend/replication/slot.c > > > > 1. > > +/* > > + * Struct for the configuration of standby_slot_names. > > + * > > + * Note: this must be a flat representation that can be held in a single chunk > > + * of guc_malloc'd memory, so that it can be stored as the "extra" data for > the > > + * standby_slot_names GUC. > > + */ > > +typedef struct > > +{ > > + int slot_num; > > + > > + /* slot_names contains nmembers consecutive nul-terminated C strings */ > > + char slot_names[FLEXIBLE_ARRAY_MEMBER]; > > +} StandbySlotConfigData; > > + > > > > 1a. > > To avoid any ambiguity this 1st field is somehow a slot ID number, I > > felt a better name would be 'nslotnames' or even just 'n' or 'count', > > > > We can probably just add a comment above slot_num and that should be > sufficient but I am fine with 'nslotnames' as well, in anycase let's > add a comment for the same. Added. > > > > > 6b. > > IMO this function would be tidier written such that the > > MyReplicationSlot->data.name is passed as a parameter. Then you can > > name the function more naturally like: > > > > IsSlotInStandbySlotNames(const char *slot_name) > > > > +1. How about naming it as SlotExistsinStandbySlotNames(char > *slot_name) and pass the slot_name from MyReplicationSlot? Otherwise, > we need an Assert for MyReplicationSlot in this function. Changed as suggested. > > Also, can we add a comment like below before the loop: > + /* > + * XXX: We are not expecting this list to be long so a linear search > + * shouldn't hurt but if that turns out not to be true then we can cache > + * this information for each WalSender as well. > + */ Added. Attach the V108 patch set which addressed above and Peter's comments. I also removed the check for "*" in guc check hook. Best Regards, Hou zj
Attachment
On Thu, Mar 7, 2024 at 12:00 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > Attach the V108 patch set which addressed above and Peter's comments. > I also removed the check for "*" in guc check hook. > Pushed with minor modifications. I'll keep an eye on the BF. BTW, one thing that we should try to evaluate a bit more is the traversal of slots in StandbySlotsHaveCaughtup(), where we verify whether all the slots mentioned in standby_slot_names have received the required WAL. Even if the standby_slot_names list is short, the total number of slots can be much larger, which can lead to an increase in CPU usage during the traversal. There is an optimization that caches ss_oldest_flush_lsn and ensures that we don't need to traverse the slots each time, so this path may not be hit frequently, but there is still a chance. I see that it is possible to further optimize this area by caching the position of each slot mentioned in standby_slot_names in the replication_slots array, but I am not sure whether it is worth it. -- With Regards, Amit Kapila.
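For reference, the cached-LSN fast path mentioned here has roughly the following shape; this is a simplified fragment, not the verbatim committed code, and the variable names are approximate:

    /*
     * Fast path sketch: ss_oldest_flush_lsn caches the oldest flushed LSN
     * previously confirmed by every slot listed in standby_slot_names, so
     * the per-slot traversal can be skipped entirely when the requested LSN
     * is not beyond it.
     */
    if (!XLogRecPtrIsInvalid(ss_oldest_flush_lsn) &&
        wait_for_lsn <= ss_oldest_flush_lsn)
        return true;

    /* Otherwise fall through to the per-slot traversal and refresh the cache. */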
On Fri, Mar 8, 2024 at 2:33 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Mar 7, 2024 at 12:00 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
>
> Attach the V108 patch set which addressed above and Peter's comments.
> I also removed the check for "*" in guc check hook.
>
Pushed with minor modifications. I'll keep an eye on BF.
BTW, one thing that we should try to evaluate a bit more is the
traversal of slots in StandbySlotsHaveCaughtup() where we verify if
all the slots mentioned in standby_slot_names have received the
required WAL. Even if the standby_slot_names list is short the total
number of slots can be much larger which can lead to an increase in
CPU usage during traversal. There is an optimization that allows to
cache ss_oldest_flush_lsn and ensures that we don't need to traverse
the slots each time so it may not hit frequently but still there is a
chance. I see it is possible to further optimize this area by caching
the position of each slot mentioned in standby_slot_names in
replication_slots array but not sure whether it is worth.
I tried to test this by configuring a large number of logical slots while making sure the standby slots are at the end of the array and checking if there was any performance hit in logical replication from these searches.
Setup:
1. 1 primary server configured with 3 slots in standby_slot_names, 1 extra logical slot (not configured for failover) + 1 logical subscriber configured as failover + 3 physical standbys (all configured to sync logical slots)
2. 1 primary server configured with 3 slots in standby_slot_names, 100 extra logical slots (not configured for failover) + 1 logical subscriber configured as failover + 3 physical standbys (all configured to sync logical slots)
3. 1 primary server configured with 3 slots in standby_slot_names, 500 extra logical slots (not configured for failover) + 1 logical subscriber configured as failover + 3 physical standbys (all configured to sync logical slots)
In the three setups, the 3 slots in standby_slot_names are compared with lists of 2, 101, and 501 slots respectively.
I ran pgbench for 15 minutes for all 3 setups:
Case 1: Average TPS - 8.143399
Case 2: Average TPS - 8.187462
Case 3: Average TPS - 8.190611
I see no degradation in performance; the differences are well within the run-to-run variation seen.
Nisha also did some performance tests to record the lag introduced by traversing a large number of slots in StandbySlotsHaveCaughtup(). The tests logged the time at the start and end of the XLogSendLogical() call (which eventually calls WalSndWaitForWal() --> StandbySlotsHaveCaughtup()) and calculated the total time taken by this function during the load run for different total slot counts.
Setup:
--one primary with 3 standbys and one subscriber with one active subscription
--hot_standby_feedback=off and sync_replication_slots=false
--made sure the standby slots remain at the end of the ReplicationSlotCtl->replication_slots array, to measure the worst-case performance of the standby slot search in StandbySlotsHaveCaughtup()
pgbench for 15 min was run. Here is the data:
Case1 : with 1 logical slot, standby_slot_names having 3 slots
Run1: 626.141642 secs
Run2: 631.930254 secs
Case2 : with 100 logical slots, standby_slot_names having 3 slots
Run1: 629.38332 secs
Run2: 630.548432 secs
Case3 : with 500 logical slots, standby_slot_names having 3 slots
Run1: 629.910829 secs
Run2: 627.924183 secs
There was no degradation in performance seen.
Thanks Nisha for helping with the testing.
regards,
Ajin Cherian
Fujitsu Australia
On Fri, Mar 8, 2024 at 9:56 AM Ajin Cherian <itsajin@gmail.com> wrote: > >> Pushed with minor modifications. I'll keep an eye on BF. >> >> BTW, one thing that we should try to evaluate a bit more is the >> traversal of slots in StandbySlotsHaveCaughtup() where we verify if >> all the slots mentioned in standby_slot_names have received the >> required WAL. Even if the standby_slot_names list is short the total >> number of slots can be much larger which can lead to an increase in >> CPU usage during traversal. There is an optimization that allows to >> cache ss_oldest_flush_lsn and ensures that we don't need to traverse >> the slots each time so it may not hit frequently but still there is a >> chance. I see it is possible to further optimize this area by caching >> the position of each slot mentioned in standby_slot_names in >> replication_slots array but not sure whether it is worth. >> >> > > I tried to test this by configuring a large number of logical slots while making sure the standby slots are at the endof the array and checking if there was any performance hit in logical replication from these searches. > Thanks Ajin and Nisha. We also plan: 1) Redoing XLogSendLogical time-log related test with 'sync_replication_slots' enabled. 2) pg_recvlogical test to monitor lag in StandbySlotsHaveCaughtup() for a large number of slots. 3) Profiling to see if StandbySlotsHaveCaughtup() is noticeable in the report when there are a large number of slots to traverse. thanks Shveta
On Friday, March 8, 2024 1:09 PM shveta malik <shveta.malik@gmail.com> wrote: > On Fri, Mar 8, 2024 at 9:56 AM Ajin Cherian <itsajin@gmail.com> wrote: > > > >> Pushed with minor modifications. I'll keep an eye on BF. > >> > >> BTW, one thing that we should try to evaluate a bit more is the > >> traversal of slots in StandbySlotsHaveCaughtup() where we verify if > >> all the slots mentioned in standby_slot_names have received the > >> required WAL. Even if the standby_slot_names list is short the total > >> number of slots can be much larger which can lead to an increase in > >> CPU usage during traversal. There is an optimization that allows to > >> cache ss_oldest_flush_lsn and ensures that we don't need to traverse > >> the slots each time so it may not hit frequently but still there is a > >> chance. I see it is possible to further optimize this area by caching > >> the position of each slot mentioned in standby_slot_names in > >> replication_slots array but not sure whether it is worth. > >> > >> > > > > I tried to test this by configuring a large number of logical slots while making > sure the standby slots are at the end of the array and checking if there was any > performance hit in logical replication from these searches. > > >

Thanks Nisha for conducting some additional tests and discussing with me internally. We have collected the performance data on HEAD. Basically, we don't see a noticeable difference in the performance data, and StandbySlotsHaveCaughtup also does not stand out in the profile. Here are the details:

> 1) Redoing XLogSendLogical time-log related test with
> 'sync_replication_slots' enabled.

Setup:
- one primary + 3 standbys + one subscriber with one active subscription
- ran 15 min pgbench for all cases
- hot_standby_feedback=ON and sync_replication_slots=TRUE
(To maximize the impact of SearchNamedReplicationSlot, the standby slot is at the end of the ReplicationSlotCtl->replication_slots array in each test)

Case1 - 1 slot: 895.305565 secs
Case2 - 100 slots: 894.936039 secs
Case3 - 500 slots: 895.256412 secs

> 2) pg_recvlogical test to monitor lag in StandbySlotsHaveCaughtup() for a
> large number of slots.

We reran the XLogSendLogical() wait time analysis tests.

Setup:
- One primary node and 3 standby nodes
- Created logical slots using "test_decoding" and activated one walsender by running pg_recvlogical on one slot.
- hot_standby_feedback=ON and sync_replication_slots=TRUE
- Did one run for each case with pgbench for 15 min
(To maximize the impact of SearchNamedReplicationSlot, the standby slot is at the end of the ReplicationSlotCtl->replication_slots array in each test)

Case1 - 1 slot: 894.83775 secs
Case2 - 100 slots: 894.449356 secs
Case3 - 500 slots: 894.98479 secs

There is no noticeable regression when the number of replication slots increases.

> 3) Profiling to see if StandbySlotsHaveCaughtup() is noticeable in the report
> when there are a large number of slots to traverse.

The setup is the same as 2). To maximize the impact of SearchNamedReplicationSlot, the standby slot is at the end of the ReplicationSlotCtl->replication_slots array. StandbySlotsHaveCaughtup is not noticeable in the profile:

0.03% 0.00% postgres postgres [.] StandbySlotsHaveCaughtup

After some investigation, it appears that the cached 'ss_oldest_flush_lsn' plays a crucial role in optimizing this workload, effectively reducing the need for frequent strcmp operations within the loop.
To test the impact of frequent strcmp calls, we conducted a test by removing the 'ss_oldest_flush_lsn' check and re-evaluating the profile. This time, although the profile indicated a small increase in the StandbySlotsHaveCaughtup metric, it still does not raise significant concerns. --1.47%--NeedToWaitForWal | NeedToWaitForStandbys | StandbySlotsHaveCaughtup | | | --0.96%--SearchNamedReplicationSlot The scripts that were used to setup the test environment for all above tests are attached. The machine configuration for above tests is as follows: CPU : E7-4890v2(2.8Ghz/15core)×4 MEM : 768GB HDD : 600GB×2 OS : RHEL 7.9 While no noticeable overhead was observed in the SearchNamedReplicationSlot operation, we explored a strategy to enhance efficiency by minimizing the search for standby slots within the loop. The idea is to cache the position of each standby slot within ReplicationSlotCtl->replication_slots. We will reference the slot directly through ReplicationSlotCtl->replication_slots[index]. If the slot name matches, we will perform other checks including the restart_lsn; otherwise, SearchNamedReplicationSlot is invoked to update the index cache accordingly. This optimization can reduce the cost from O(n*m) to O(n). Note that since we didn't see the overhead in the test, I am not proposing to push this patch now. But just share the idea and a small patch in case anyone came across a workload where performance impact of SearchNamedReplicationSlot becomes noticeable. Best Regards, Hou zj
Attachment
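For readers who want to approximate this kind of setup, a minimal sketch is given below. It is not taken from the attached scripts: the slot names (filler_slot_N, sb1_slot) are made up, max_replication_slots has to be raised accordingly, and it assumes no slots have been dropped so that creation order roughly matches the order of the ReplicationSlotCtl->replication_slots array.

-- create many logical slots first, so the standby's slot lands late in the array
SELECT pg_create_logical_replication_slot('filler_slot_' || g, 'test_decoding')
FROM generate_series(1, 499) AS g;

-- create the physical slot listed in standby_slot_names last
SELECT pg_create_physical_replication_slot('sb1_slot');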
Hi, Since the standby_slot_names patch has been committed, I am attaching the last doc patch for review. Best Regards, Hou zj
Attachment
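The attached doc patch is not reproduced in the thread, but the failover-readiness procedure it describes boils down to two queries along the following lines (a sketch only; 'sub1' stands in for an actual subscription slot name). First, on the subscriber, collect the replication slot names of the failover-enabled subscriptions; then, on the standby, check that each of those slots has been synchronized and is usable:

-- on the subscriber
SELECT array_agg(quote_literal(s.subslotname)) AS slots
FROM pg_subscription s
WHERE s.subfailover;

-- on the standby, for each slot name returned above
SELECT slot_name,
       (synced AND NOT temporary AND NOT conflicting) AS failover_ready
FROM pg_replication_slots
WHERE slot_name IN ('sub1');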
Hi, On Thu, Mar 14, 2024 at 02:22:44AM +0000, Zhijie Hou (Fujitsu) wrote: > Hi, > > Since the standby_slot_names patch has been committed, I am attaching the last > doc patch for review. > Thanks! 1 === + continue subscribing to publications now on the new primary server without + any data loss. I think "without any data loss" should be re-worded in this context. Data loss in the sense of "data committed on the primary and not visible on the subscriber in case of failover" can still occur (in case synchronous replication is not used). 2 === + If the result (<literal>failover_ready</literal>) of both above steps is + true, existing subscriptions will be able to continue without data loss. + </para> I don't think that's true if synchronous replication is not used. Say, - synchronous replication is not used - primary is not able to reach the standby anymore and standby_slot_names is set - new data is inserted into the primary - then not replicated to the subscriber (due to standby_slot_names) Then I think both the above steps will return true but data would be lost in case of failover. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, When analyzing one BF error[1], we found an issue with slotsync: Since we don't perform logical decoding for the synced slots when syncing the lsn/xmin of a slot, no logical snapshots will be serialized to disk. So, when a user starts to use these synced slots after promotion, the consistent snapshot needs to be re-built from the restart_lsn if the WAL (xl_running_xacts) at the restart_lsn position indicates that there are running transactions. This however could cause data from before the consistent point to be missed[2]. This issue doesn't exist on the primary because the snapshot at restart_lsn should have been serialized to disk (SnapBuildProcessRunningXacts -> SnapBuildSerialize), so even if the logical decoding restarts, it can find a consistent snapshot immediately at restart_lsn. To fix this, we could use fast-forward logical decoding to advance the synced slot's lsn/xmin when syncing these values instead of directly updating the slot's info. This way, the snapshot will be serialized to disk when decoding. If we cannot reach the consistent point at the remote restart_lsn, the slot is marked as temporary and will be persisted once it reaches the consistent point. I am still analyzing the fix and will share once ready. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2024-03-19%2010%3A03%3A06 [2] The steps to reproduce the data-miss issue on a primary->standby setup: Note, we need to set LOG_SNAPSHOT_INTERVAL_MS to a bigger number (1500000) to prevent a concurrent LogStandbySnapshot() call, and enable sync_replication_slots on the standby. 1. Create a failover logical slot on the primary. SELECT 'init' FROM pg_create_logical_replication_slot('logicalslot', 'test_decoding', false, false, true); 2. Use the following steps to advance the restart_lsn of the failover slot to a position where the xl_running_xacts record at that position indicates that there is a running transaction. TXN1 BEGIN; create table dummy1(a int); TXN2 SELECT pg_log_standby_snapshot(); TXN1 COMMIT; TXN1 BEGIN; create table dummy2(a int); TXN2 SELECT pg_log_standby_snapshot(); TXN1 COMMIT; -- the restart_lsn will be advanced to a position where there was 1 running transaction. And we need to wait for the restart_lsn to be synced to the standby. SELECT pg_replication_slot_advance('logicalslot', pg_current_wal_lsn()); -- insert some data here before calling the next pg_log_standby_snapshot(). INSERT INTO reptable VALUES(999); 3. Promote the standby and try to consume the change (999) from the synced slot on the standby. We will find that no change is returned. select * from pg_logical_slot_get_changes('logicalslot', NULL, NULL); Best Regards, Hou zj
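One way to observe the asymmetry described above is to check, on each node, whether a serialized snapshot exists for a slot's restart_lsn. The query below is only an illustrative check: it relies on the implementation detail that snapshots are serialized under pg_logical/snapshots with file names derived from the LSN, which is what pg_ls_logicalsnapdir() lists. On the primary the failover slot's restart_lsn typically has a matching file, while on the standby (before a fix along the lines above) the synced slot does not.

SELECT s.slot_name,
       s.restart_lsn,
       EXISTS (SELECT 1
               FROM pg_ls_logicalsnapdir() f
               WHERE f.name = replace(s.restart_lsn::text, '/', '-') || '.snap') AS snapshot_on_disk
FROM pg_replication_slots s
WHERE s.slot_type = 'logical';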
Hi, On Thu, Mar 28, 2024 at 04:38:19AM +0000, Zhijie Hou (Fujitsu) wrote: > Hi, > > When analyzing one BF error[1], we find an issue of slotsync: Since we don't > perform logical decoding for the synced slots when syncing the lsn/xmin of > slot, no logical snapshots will be serialized to disk. So, when user starts to > use these synced slots after promotion, it needs to re-build the consistent > snapshot from the restart_lsn if the WAL(xl_running_xacts) at restart_lsn > position indicates that there are running transactions. This however could > cause the data that before the consistent point to be missed[2]. I see, nice catch and explanation, thanks! > This issue doesn't exist on the primary because the snapshot at restart_lsn > should have been serialized to disk (SnapBuildProcessRunningXacts -> > SnapBuildSerialize), so even if the logical decoding restarts, it can find > consistent snapshot immediately at restart_lsn. Right. > To fix this, we could use the fast forward logical decoding to advance the synced > slot's lsn/xmin when syncing these values instead of directly updating the > slot's info. This way, the snapshot will be serialized to disk when decoding. > If we could not reach to the consistent point at the remote restart_lsn, the > slot is marked as temp and will be persisted once it reaches the consistent > point. I am still analyzing the fix and will share once ready. Thanks! I'm wondering about the performance impact (even in fast_forward mode), might be worth to keep an eye on it. Should we create a 17 open item [1]? [1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > When analyzing one BF error[1], we find an issue of slotsync: Since we don't > perform logical decoding for the synced slots when syncing the lsn/xmin of > slot, no logical snapshots will be serialized to disk. So, when user starts to > use these synced slots after promotion, it needs to re-build the consistent > snapshot from the restart_lsn if the WAL(xl_running_xacts) at restart_lsn > position indicates that there are running transactions. This however could > cause the data that before the consistent point to be missed[2]. > > This issue doesn't exist on the primary because the snapshot at restart_lsn > should have been serialized to disk (SnapBuildProcessRunningXacts -> > SnapBuildSerialize), so even if the logical decoding restarts, it can find > consistent snapshot immediately at restart_lsn. > > To fix this, we could use the fast forward logical decoding to advance the synced > slot's lsn/xmin when syncing these values instead of directly updating the > slot's info. This way, the snapshot will be serialized to disk when decoding. > If we could not reach to the consistent point at the remote restart_lsn, the > slot is marked as temp and will be persisted once it reaches the consistent > point. I am still analyzing the fix and will share once ready. > Yes, we can use this but one thing to note is that CreateDecodingContext() will expect that the slot's and current database are the same. I think the reason for that is we need to check system tables of the current database while decoding and sending data to the output_plugin which won't be a requirement for the fast_forward case. So, we need to skip that check in fast_forward mode. Next, I was thinking about the case of the first time updating the restart and confirmed_flush LSN while syncing the slots. I think we can keep the current logic as it is based on the following analysis. For each logical slot, cases possible on the primary: 1. The restart_lsn doesn't have a serialized snapshot and hasn't yet reached the consistent point. 2. The restart_lsn doesn't have a serialized snapshot but has reached a consistent point. 3. The restart_lsn has a serialized snapshot which means it has reached a consistent point as well. Considering we keep the logic to reserve initial WAL positions the same as the current (Reserve WAL for the currently active local slot using the specified WAL location (restart_lsn). If the given WAL location has been removed, reserve WAL using the oldest existing WAL segment.), I could think of the below scenarios: A. For 1, we shouldn't sync the slot as it still wouldn't have been marked persistent on the primary. B. For 2, we would sync the slot B1. If remote_restart_lsn >= local_resart_lsn, then advance the slot by calling pg_logical_replication_slot_advance(). B11. If we reach consistent point, then it should be okay because after promotion as well we should reach consistent point. B111. But again is it possible that there is some xact that comes before consistent_point on primary and the same is after consistent_point on standby? This shouldn't matter as we will start decoding transactions after confirmed_flush_lsn which would be the same on primary and standby. B22. If we haven't reached consistent_point, then we won't mark the slot as persistent, and at the next sync we will do the same till it reaches consistent_point. At that time, the situation will be similar to B11. B2. 
If remote_restart_lsn < local_restart_lsn, then we will wait for the next sync cycle and keep the slot as temporary. Once in the next or some consecutive sync cycle, we reach the condition remote_restart_lsn >= local_restart_lsn, we will proceed to advance the slot and we should have the same behavior as B1. C. For 3, we would sync the slot, but the behavior should be the same as B. Thoughts? -- With Regards, Amit Kapila.
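The temporary-until-consistent behaviour described in B22/B2 can be observed from SQL while synchronization is in progress; a rough check on the standby (using the synced and temporary columns of pg_replication_slots) is:

-- synced slots that are still temporary have not yet reached a consistent
-- point and are not yet persisted
SELECT slot_name, temporary, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE synced;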
On Thu, Mar 28, 2024 at 3:34 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Thu, Mar 28, 2024 at 04:38:19AM +0000, Zhijie Hou (Fujitsu) wrote: > > > To fix this, we could use the fast forward logical decoding to advance the synced > > slot's lsn/xmin when syncing these values instead of directly updating the > > slot's info. This way, the snapshot will be serialized to disk when decoding. > > If we could not reach to the consistent point at the remote restart_lsn, the > > slot is marked as temp and will be persisted once it reaches the consistent > > point. I am still analyzing the fix and will share once ready. > > Thanks! I'm wondering about the performance impact (even in fast_forward mode), > might be worth to keep an eye on it. > True, we can consider performance but correctness should be a priority, and can we think of a better way to fix this issue? > Should we create a 17 open item [1]? > > [1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items > Yes, we can do that. -- With Regards, Amit Kapila.
Hi, On Thu, Mar 28, 2024 at 05:05:35PM +0530, Amit Kapila wrote: > On Thu, Mar 28, 2024 at 3:34 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > On Thu, Mar 28, 2024 at 04:38:19AM +0000, Zhijie Hou (Fujitsu) wrote: > > > > > To fix this, we could use the fast forward logical decoding to advance the synced > > > slot's lsn/xmin when syncing these values instead of directly updating the > > > slot's info. This way, the snapshot will be serialized to disk when decoding. > > > If we could not reach to the consistent point at the remote restart_lsn, the > > > slot is marked as temp and will be persisted once it reaches the consistent > > > point. I am still analyzing the fix and will share once ready. > > > > Thanks! I'm wondering about the performance impact (even in fast_forward mode), > > might be worth to keep an eye on it. > > > > True, we can consider performance but correctness should be a > priority, Yeah of course. > and can we think of a better way to fix this issue? I'll keep you posted if there is one that I can think of. > > Should we create a 17 open item [1]? > > > > [1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items > > > > Yes, we can do that. done. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thursday, March 28, 2024 7:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > When analyzing one BF error[1], we find an issue of slotsync: Since we > > don't perform logical decoding for the synced slots when syncing the > > lsn/xmin of slot, no logical snapshots will be serialized to disk. So, > > when user starts to use these synced slots after promotion, it needs > > to re-build the consistent snapshot from the restart_lsn if the > > WAL(xl_running_xacts) at restart_lsn position indicates that there are > > running transactions. This however could cause the data that before the > consistent point to be missed[2]. > > > > This issue doesn't exist on the primary because the snapshot at > > restart_lsn should have been serialized to disk > > (SnapBuildProcessRunningXacts -> SnapBuildSerialize), so even if the > > logical decoding restarts, it can find consistent snapshot immediately at > restart_lsn. > > > > To fix this, we could use the fast forward logical decoding to advance > > the synced slot's lsn/xmin when syncing these values instead of > > directly updating the slot's info. This way, the snapshot will be serialized to > disk when decoding. > > If we could not reach to the consistent point at the remote > > restart_lsn, the slot is marked as temp and will be persisted once it > > reaches the consistent point. I am still analyzing the fix and will share once > ready. > > > > Yes, we can use this but one thing to note is that > CreateDecodingContext() will expect that the slot's and current database are > the same. I think the reason for that is we need to check system tables of the > current database while decoding and sending data to the output_plugin which > won't be a requirement for the fast_forward case. So, we need to skip that > check in fast_forward mode. Agreed. > > Next, I was thinking about the case of the first time updating the restart and > confirmed_flush LSN while syncing the slots. I think we can keep the current > logic as it is based on the following analysis. > > For each logical slot, cases possible on the primary: > 1. The restart_lsn doesn't have a serialized snapshot and hasn't yet reached the > consistent point. > 2. The restart_lsn doesn't have a serialized snapshot but has reached a > consistent point. > 3. The restart_lsn has a serialized snapshot which means it has reached a > consistent point as well. > > Considering we keep the logic to reserve initial WAL positions the same as the > current (Reserve WAL for the currently active local slot using the specified WAL > location (restart_lsn). If the given WAL location has been removed, reserve > WAL using the oldest existing WAL segment.), I could think of the below > scenarios: > A. For 1, we shouldn't sync the slot as it still wouldn't have been marked > persistent on the primary. > B. For 2, we would sync the slot > B1. If remote_restart_lsn >= local_resart_lsn, then advance the slot by calling > pg_logical_replication_slot_advance(). > B11. If we reach consistent point, then it should be okay because after > promotion as well we should reach consistent point. > B111. But again is it possible that there is some xact that comes > before consistent_point on primary and the same is after consistent_point on > standby? This shouldn't matter as we will start decoding transactions after > confirmed_flush_lsn which would be the same on primary and standby. > B22. 
If we haven't reached consistent_point, then we won't mark the slot > as persistent, and at the next sync we will do the same till it reaches > consistent_point. At that time, the situation will be similar to B11. > B2. If remote_restart_lsn < local_restart_lsn, then we will wait for the next > sync cycle and keep the slot as temporary. Once in the next or some > consecutive sync cycle, we reach the condition remote_restart_lsn >= > local_restart_lsn, we will proceed to advance the slot and we should have the > same behavior as B1. > C. For 3, we would sync the slot, but the behavior should be the same as B. > > Thoughts? Looks reasonable to me. Here is the patch based on above lines. I am also testing and verifying the patch locally. Best Regards, Hou zj
Attachment
On Thursday, March 28, 2024 10:02 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thursday, March 28, 2024 7:32 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> > > wrote: > > > > > > When analyzing one BF error[1], we find an issue of slotsync: Since > > > we don't perform logical decoding for the synced slots when syncing > > > the lsn/xmin of slot, no logical snapshots will be serialized to > > > disk. So, when user starts to use these synced slots after > > > promotion, it needs to re-build the consistent snapshot from the > > > restart_lsn if the > > > WAL(xl_running_xacts) at restart_lsn position indicates that there > > > are running transactions. This however could cause the data that > > > before the > > consistent point to be missed[2]. > > > > > > This issue doesn't exist on the primary because the snapshot at > > > restart_lsn should have been serialized to disk > > > (SnapBuildProcessRunningXacts -> SnapBuildSerialize), so even if the > > > logical decoding restarts, it can find consistent snapshot > > > immediately at > > restart_lsn. > > > > > > To fix this, we could use the fast forward logical decoding to > > > advance the synced slot's lsn/xmin when syncing these values instead > > > of directly updating the slot's info. This way, the snapshot will be > > > serialized to > > disk when decoding. > > > If we could not reach to the consistent point at the remote > > > restart_lsn, the slot is marked as temp and will be persisted once > > > it reaches the consistent point. I am still analyzing the fix and > > > will share once > > ready. > > > > > > > Yes, we can use this but one thing to note is that > > CreateDecodingContext() will expect that the slot's and current > > database are the same. I think the reason for that is we need to check > > system tables of the current database while decoding and sending data > > to the output_plugin which won't be a requirement for the fast_forward > > case. So, we need to skip that check in fast_forward mode. > > Agreed. > > > > > Next, I was thinking about the case of the first time updating the > > restart and confirmed_flush LSN while syncing the slots. I think we > > can keep the current logic as it is based on the following analysis. > > > > For each logical slot, cases possible on the primary: > > 1. The restart_lsn doesn't have a serialized snapshot and hasn't yet > > reached the consistent point. > > 2. The restart_lsn doesn't have a serialized snapshot but has reached > > a consistent point. > > 3. The restart_lsn has a serialized snapshot which means it has > > reached a consistent point as well. > > > > Considering we keep the logic to reserve initial WAL positions the > > same as the current (Reserve WAL for the currently active local slot > > using the specified WAL location (restart_lsn). If the given WAL > > location has been removed, reserve WAL using the oldest existing WAL > > segment.), I could think of the below > > scenarios: > > A. For 1, we shouldn't sync the slot as it still wouldn't have been > > marked persistent on the primary. > > B. For 2, we would sync the slot > > B1. If remote_restart_lsn >= local_resart_lsn, then advance the > > slot by calling pg_logical_replication_slot_advance(). > > B11. If we reach consistent point, then it should be okay > > because after promotion as well we should reach consistent point. > > B111. 
But again is it possible that there is some xact > > that comes before consistent_point on primary and the same is after > > consistent_point on standby? This shouldn't matter as we will start > > decoding transactions after confirmed_flush_lsn which would be the same on > primary and standby. > > B22. If we haven't reached consistent_point, then we won't mark > > the slot as persistent, and at the next sync we will do the same till > > it reaches consistent_point. At that time, the situation will be similar to B11. > > B2. If remote_restart_lsn < local_restart_lsn, then we will wait > > for the next sync cycle and keep the slot as temporary. Once in the > > next or some consecutive sync cycle, we reach the condition > > remote_restart_lsn >= local_restart_lsn, we will proceed to advance > > the slot and we should have the same behavior as B1. > > C. For 3, we would sync the slot, but the behavior should be the same as B. > > > > Thoughts? > > Looks reasonable to me. > > Here is the patch based on above lines. > I am also testing and verifying the patch locally. Attach a new version patch which fixed an un-initialized variable issue and added some comments. Also, temporarily enable DEBUG2 for the 040 tap-test so that we can analyze the possible CFbot failures easily. Best Regards, Hou zj
Attachment
On Fri, Mar 29, 2024 at 6:36 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Attach a new version patch which fixed an un-initialized variable issue and > added some comments. Also, temporarily enable DEBUG2 for the 040 tap-test so that > we can analyze the possible CFbot failures easily. As suggested by Amit in [1], for the fix being discussed where we need to advance the synced slot on standby, we need to skip the dbid check in fast_forward mode in CreateDecodingContext(). We tried few tests to make sure that there was no table-access done during fast-forward mode 1) Initially we tried avoiding database-id check in CreateDecodingContext() only when called by pg_logical_replication_slot_advance(). 'make check-world' passed on HEAD for the same. 2) But the more generic solution was to skip the database check if "fast_forward" is true. It was tried and 'make check-world' passed on HEAD for that as well. 3) Another thing tried by Hou-San was to run pgbench after skipping db check in the fast_forward logical decoding case. pgbench was run to generate some changes and then the logical slot was advanced to the latest position in another database. A LOG was added in relation_open to catch table access. It was found that there was no table-access in fast forward logical decoding i.e. no LOGS for table-open were generated during the test. Steps given at [2] [1]: https://www.postgresql.org/message-id/CAA4eK1KMiKangJa4NH_K1oFc87Y01n3rnpuwYagT59Y%3DADW8Dw%40mail.gmail.com [2]: -------------- 1. apply the DEBUG patch (attached as .txt) which will log the relation open and table cache access. 2. create a slot: SELECT 'init' FROM pg_create_logical_replication_slot('logicalslot', 'test_decoding', false, false, true); 3. run pgbench to generate some data. pgbench -i postgres pgbench --aggregate-interval=5 --time=5 --client=10 --log --rate=1000 --latency-limit=10 --failures-detailed --max-tries=10 postgres 4. start a fresh session in a different db and advance the slot to the latest position. There should be no relation open or CatCache log between the LOG "starting logical decoding for slot .." and LOG "decoding over". SELECT pg_replication_slot_advance('logicalslot', pg_current_wal_lsn()); -------------- thanks Shveta
Attachment
Dear Hou, Thanks for updating the patch! Here is a comment for it. ``` + /* + * By advancing the restart_lsn, confirmed_lsn, and xmin using + * fast-forward logical decoding, we can verify whether a consistent + * snapshot can be built. This process also involves saving necessary + * snapshots to disk during decoding, ensuring that logical decoding + * efficiently reaches a consistent point at the restart_lsn without + * the potential loss of data during snapshot creation. + */ + pg_logical_replication_slot_advance(remote_slot->confirmed_lsn, + found_consistent_point); + ReplicationSlotsComputeRequiredLSN(); + updated_lsn = true; ``` You added these calls like pg_replication_slot_advance() does, but that function also calls ReplicationSlotsComputeRequiredXmin(false) at that time. According to the related commit b48df81 and discussions [1], I know it is needed only for physical slots, but it would be more consistent to call requiredXmin() as well, per [2]: ``` This may be a waste if no advancing is done, but it could also be an advantage to enforce a recalculation of the thresholds for each function call. And that's more consistent with the slot copy, drop and creation. ``` What do you think? [1]: https://www.postgresql.org/message-id/20200609171904.kpltxxvjzislidks%40alap3.anarazel.de [2]: https://www.postgresql.org/message-id/20200616072727.GA2361%40paquier.xyz Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
Hi, On Fri, Mar 29, 2024 at 01:06:15AM +0000, Zhijie Hou (Fujitsu) wrote: > Attach a new version patch which fixed an un-initialized variable issue and > added some comments. Also, temporarily enable DEBUG2 for the 040 tap-test so that > we can analyze the possible CFbot failures easily. > Thanks! + if (remote_slot->confirmed_lsn != slot->data.confirmed_flush) + { + /* + * By advancing the restart_lsn, confirmed_lsn, and xmin using + * fast-forward logical decoding, we ensure that the required snapshots + * are saved to disk. This enables logical decoding to quickly reach a + * consistent point at the restart_lsn, eliminating the risk of missing + * data during snapshot creation. + */ + pg_logical_replication_slot_advance(remote_slot->confirmed_lsn, + found_consistent_point); + ReplicationSlotsComputeRequiredLSN(); + updated_lsn = true; + } Instead of using pg_logical_replication_slot_advance() for each synced slot and during sync cycles what about?: - keep sync slot synchronization as it is currently (not using pg_logical_replication_slot_advance()) - create "an hidden" logical slot if sync slot feature is on - at the time of promotion use pg_logical_replication_slot_advance() on this hidden slot only to advance to the max lsn of the synced slots I'm not sure that would be enough, just asking your thoughts on this (benefits would be to avoid calling pg_logical_replication_slot_advance() on each sync slots and during the sync cycles). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Mar 29, 2024 at 6:36 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > Attach a new version patch which fixed an un-initialized variable issue and > added some comments. > The other approach to fix this issue could be that the slotsync worker get the serialized snapshot using pg_read_binary_file() corresponding to restart_lsn and writes those at standby. But there are cases when we won't have such a file like (a) when we initially create the slot and reach the consistent_point, or (b) also by the time the slotsync worker starts to read the remote snapshot file, the snapshot file could have been removed by the checkpointer on the primary (if the restart_lsn of the remote has been advanced in this window). So, in such cases, we anyway need to advance the slot. I think these could be optimizations that we could do in the future. Few comments: ============= 1. - if (slot->data.database != MyDatabaseId) + if (slot->data.database != MyDatabaseId && !fast_forward) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("replication slot \"%s\" was not created in this database", @@ -526,7 +527,7 @@ CreateDecodingContext(XLogRecPtr start_lsn, * Do not allow consumption of a "synchronized" slot until the standby * gets promoted. */ - if (RecoveryInProgress() && slot->data.synced) + if (RecoveryInProgress() && slot->data.synced && !IsSyncingReplicationSlots()) Add comments at both of the above places. 2. +extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto, + bool *found_consistent_point); + This API looks a bit awkward as the functionality doesn't match the name. How about having a function with name LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto, ready_for_decoding) with the same functionality as your patch has for pg_logical_replication_slot_advance() and then invoke it both from pg_logical_replication_slot_advance and slotsync.c. The function name is too big, we can think of a shorter name. Any ideas? -- With Regards, Amit Kapila.
On Friday, March 29, 2024 2:48 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Fri, Mar 29, 2024 at 01:06:15AM +0000, Zhijie Hou (Fujitsu) wrote: > > Attach a new version patch which fixed an un-initialized variable > > issue and added some comments. Also, temporarily enable DEBUG2 for the > > 040 tap-test so that we can analyze the possible CFbot failures easily. > > > > Thanks! > > + if (remote_slot->confirmed_lsn != slot->data.confirmed_flush) > + { > + /* > + * By advancing the restart_lsn, confirmed_lsn, and xmin using > + * fast-forward logical decoding, we ensure that the required > snapshots > + * are saved to disk. This enables logical decoding to quickly > reach a > + * consistent point at the restart_lsn, eliminating the risk of > missing > + * data during snapshot creation. > + */ > + > pg_logical_replication_slot_advance(remote_slot->confirmed_lsn, > + > found_consistent_point); > + ReplicationSlotsComputeRequiredLSN(); > + updated_lsn = true; > + } > > Instead of using pg_logical_replication_slot_advance() for each synced slot and > during sync cycles what about?: > > - keep sync slot synchronization as it is currently (not using > pg_logical_replication_slot_advance()) > - create "an hidden" logical slot if sync slot feature is on > - at the time of promotion use pg_logical_replication_slot_advance() on this > hidden slot only to advance to the max lsn of the synced slots > > I'm not sure that would be enough, just asking your thoughts on this (benefits > would be to avoid calling pg_logical_replication_slot_advance() on each sync > slots and during the sync cycles). Thanks for the idea! I considered this. I think advancing the "hidden" slot on promotion may be a bit late, because if we cannot reach the consistent point after advancing the "hidden" slot, then it means we may need to remove all the synced slots, as we are not sure if they are usable (will not lose data) after promotion. And it may confuse users a bit, as they have seen these slots as sync-ready. The current approach is to mark such inconsistent slots as temporary and persist them once they reach the consistent point, so that users can be sure a slot can be used after promotion once it is persisted. Another optimization idea is to check for the snapshot file's existence before calling the slot_advance(). If the file already exists, we skip the decoding and directly update the restart_lsn. This way, we could also avoid some duplicate decoding work. Best Regards, Hou zj
Hi, On Fri, Mar 29, 2024 at 07:23:11AM +0000, Zhijie Hou (Fujitsu) wrote: > On Friday, March 29, 2024 2:48 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On Fri, Mar 29, 2024 at 01:06:15AM +0000, Zhijie Hou (Fujitsu) wrote: > > > Attach a new version patch which fixed an un-initialized variable > > > issue and added some comments. Also, temporarily enable DEBUG2 for the > > > 040 tap-test so that we can analyze the possible CFbot failures easily. > > > > > > > Thanks! > > > > + if (remote_slot->confirmed_lsn != slot->data.confirmed_flush) > > + { > > + /* > > + * By advancing the restart_lsn, confirmed_lsn, and xmin using > > + * fast-forward logical decoding, we ensure that the required > > snapshots > > + * are saved to disk. This enables logical decoding to quickly > > reach a > > + * consistent point at the restart_lsn, eliminating the risk of > > missing > > + * data during snapshot creation. > > + */ > > + > > pg_logical_replication_slot_advance(remote_slot->confirmed_lsn, > > + > > found_consistent_point); > > + ReplicationSlotsComputeRequiredLSN(); > > + updated_lsn = true; > > + } > > > > Instead of using pg_logical_replication_slot_advance() for each synced slot and > > during sync cycles what about?: > > > > - keep sync slot synchronization as it is currently (not using > > pg_logical_replication_slot_advance()) > > - create "an hidden" logical slot if sync slot feature is on > > - at the time of promotion use pg_logical_replication_slot_advance() on this > > hidden slot only to advance to the max lsn of the synced slots > > > > I'm not sure that would be enough, just asking your thoughts on this (benefits > > would be to avoid calling pg_logical_replication_slot_advance() on each sync > > slots and during the sync cycles). > > Thanks for the idea ! > > I considered about this. I think advancing the "hidden" slot on promotion may be a > bit late, because if we cannot reach the consistent point after advancing the > "hidden" slot, then it means we may need to remove all the synced slots as we > are not sure if they are usable(will not loss data) after promotion. What about advancing the hidden slot during the sync cycles then? > The current approach is to mark such un-consistent slot as temp and persist > them once it reaches consistent point, so that user can ensure the slot can be > used after promotion once persisted. Right, but do we need to do so for all the sync slots? Would a single hidden slot be enough? > Another optimization idea is to check the snapshot file existence before calling the > slot_advance(). If the file already exists, we skip the decoding and directly > update the restart_lsn. This way, we could also avoid some duplicate decoding > work. Yeah, I think it's a good idea (even better if we can do this check without performing any I/O). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Mar 29, 2024 at 9:34 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Thanks for updating the patch! Here is a comment for it. > > ``` > + /* > + * By advancing the restart_lsn, confirmed_lsn, and xmin using > + * fast-forward logical decoding, we can verify whether a consistent > + * snapshot can be built. This process also involves saving necessary > + * snapshots to disk during decoding, ensuring that logical decoding > + * efficiently reaches a consistent point at the restart_lsn without > + * the potential loss of data during snapshot creation. > + */ > + pg_logical_replication_slot_advance(remote_slot->confirmed_lsn, > + found_consistent_point); > + ReplicationSlotsComputeRequiredLSN(); > + updated_lsn = true; > ``` > > You added them like pg_replication_slot_advance(), but the function also calls > ReplicationSlotsComputeRequiredXmin(false) at that time. According to the related > commit b48df81 and discussions [1], I know it is needed only for physical slots, > but it makes more consistent to call requiredXmin() as well, per [2]: > Yeah, I also think it is okay to call for the sake of consistency with pg_replication_slot_advance(). -- With Regards, Amit Kapila.
On Fri, Mar 29, 2024 at 1:08 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Fri, Mar 29, 2024 at 07:23:11AM +0000, Zhijie Hou (Fujitsu) wrote: > > On Friday, March 29, 2024 2:48 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > > > > > Hi, > > > > > > On Fri, Mar 29, 2024 at 01:06:15AM +0000, Zhijie Hou (Fujitsu) wrote: > > > > Attach a new version patch which fixed an un-initialized variable > > > > issue and added some comments. Also, temporarily enable DEBUG2 for the > > > > 040 tap-test so that we can analyze the possible CFbot failures easily. > > > > > > > > > > Thanks! > > > > > > + if (remote_slot->confirmed_lsn != slot->data.confirmed_flush) > > > + { > > > + /* > > > + * By advancing the restart_lsn, confirmed_lsn, and xmin using > > > + * fast-forward logical decoding, we ensure that the required > > > snapshots > > > + * are saved to disk. This enables logical decoding to quickly > > > reach a > > > + * consistent point at the restart_lsn, eliminating the risk of > > > missing > > > + * data during snapshot creation. > > > + */ > > > + > > > pg_logical_replication_slot_advance(remote_slot->confirmed_lsn, > > > + > > > found_consistent_point); > > > + ReplicationSlotsComputeRequiredLSN(); > > > + updated_lsn = true; > > > + } > > > > > > Instead of using pg_logical_replication_slot_advance() for each synced slot and > > > during sync cycles what about?: > > > > > > - keep sync slot synchronization as it is currently (not using > > > pg_logical_replication_slot_advance()) > > > - create "an hidden" logical slot if sync slot feature is on > > > - at the time of promotion use pg_logical_replication_slot_advance() on this > > > hidden slot only to advance to the max lsn of the synced slots > > > > > > I'm not sure that would be enough, just asking your thoughts on this (benefits > > > would be to avoid calling pg_logical_replication_slot_advance() on each sync > > > slots and during the sync cycles). > > > > Thanks for the idea ! > > > > I considered about this. I think advancing the "hidden" slot on promotion may be a > > bit late, because if we cannot reach the consistent point after advancing the > > "hidden" slot, then it means we may need to remove all the synced slots as we > > are not sure if they are usable(will not loss data) after promotion. > > What about advancing the hidden slot during the sync cycles then? > > > The current approach is to mark such un-consistent slot as temp and persist > > them once it reaches consistent point, so that user can ensure the slot can be > > used after promotion once persisted. > > Right, but do we need to do so for all the sync slots? Would a single hidden > slot be enough? > Even if we mark one of the synced slots as persistent without reaching a consistent state, it could create a problem after promotion. And, how a single hidden slot would serve the purpose, different synced slots will have different restart/confirmed_flush LSN and we won't be able to perform advancing for those using a single slot. For example, say for first synced slot, it has not reached a consistent state and then how can it try for the second slot? This sounds quite tricky to make work. We should go with something simple where the chances of introducing bugs are lesser. -- With Regards, Amit Kapila.
Hi, On Fri, Mar 29, 2024 at 02:35:22PM +0530, Amit Kapila wrote: > On Fri, Mar 29, 2024 at 1:08 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > On Fri, Mar 29, 2024 at 07:23:11AM +0000, Zhijie Hou (Fujitsu) wrote: > > > On Friday, March 29, 2024 2:48 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > > Hi, > > > > > > > > On Fri, Mar 29, 2024 at 01:06:15AM +0000, Zhijie Hou (Fujitsu) wrote: > > > > > Attach a new version patch which fixed an un-initialized variable > > > > > issue and added some comments. Also, temporarily enable DEBUG2 for the > > > > > 040 tap-test so that we can analyze the possible CFbot failures easily. > > > > > > > > > > > > > Thanks! > > > > > > > > + if (remote_slot->confirmed_lsn != slot->data.confirmed_flush) > > > > + { > > > > + /* > > > > + * By advancing the restart_lsn, confirmed_lsn, and xmin using > > > > + * fast-forward logical decoding, we ensure that the required > > > > snapshots > > > > + * are saved to disk. This enables logical decoding to quickly > > > > reach a > > > > + * consistent point at the restart_lsn, eliminating the risk of > > > > missing > > > > + * data during snapshot creation. > > > > + */ > > > > + > > > > pg_logical_replication_slot_advance(remote_slot->confirmed_lsn, > > > > + > > > > found_consistent_point); > > > > + ReplicationSlotsComputeRequiredLSN(); > > > > + updated_lsn = true; > > > > + } > > > > > > > > Instead of using pg_logical_replication_slot_advance() for each synced slot and > > > > during sync cycles what about?: > > > > > > > > - keep sync slot synchronization as it is currently (not using > > > > pg_logical_replication_slot_advance()) > > > > - create "an hidden" logical slot if sync slot feature is on > > > > - at the time of promotion use pg_logical_replication_slot_advance() on this > > > > hidden slot only to advance to the max lsn of the synced slots > > > > > > > > I'm not sure that would be enough, just asking your thoughts on this (benefits > > > > would be to avoid calling pg_logical_replication_slot_advance() on each sync > > > > slots and during the sync cycles). > > > > > > Thanks for the idea ! > > > > > > I considered about this. I think advancing the "hidden" slot on promotion may be a > > > bit late, because if we cannot reach the consistent point after advancing the > > > "hidden" slot, then it means we may need to remove all the synced slots as we > > > are not sure if they are usable(will not loss data) after promotion. > > > > What about advancing the hidden slot during the sync cycles then? > > > > > The current approach is to mark such un-consistent slot as temp and persist > > > them once it reaches consistent point, so that user can ensure the slot can be > > > used after promotion once persisted. > > > > Right, but do we need to do so for all the sync slots? Would a single hidden > > slot be enough? > > > > Even if we mark one of the synced slots as persistent without reaching > a consistent state, it could create a problem after promotion. And, > how a single hidden slot would serve the purpose, different synced > slots will have different restart/confirmed_flush LSN and we won't be > able to perform advancing for those using a single slot. For example, > say for first synced slot, it has not reached a consistent state and > then how can it try for the second slot? This sounds quite tricky to > make work. We should go with something simple where the chances of > introducing bugs are lesser. Yeah, better to go with something simple. 
+ if (remote_slot->confirmed_lsn != slot->data.confirmed_flush) + { + /* + * By advancing the restart_lsn, confirmed_lsn, and xmin using + * fast-forward logical decoding, we ensure that the required snapshots + * are saved to disk. This enables logical decoding to quickly reach a + * consistent point at the restart_lsn, eliminating the risk of missing + * data during snapshot creation. + */ + pg_logical_replication_slot_advance(remote_slot->confirmed_lsn, + found_consistent_point); In our case, what about skipping WaitForStandbyConfirmation() in pg_logical_replication_slot_advance()? (It could go until the RecoveryInProgress() check in StandbySlotsHaveCaughtup() if we don't skip it). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Friday, March 29, 2024 2:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Mar 29, 2024 at 6:36 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > > > Attach a new version patch which fixed an un-initialized variable > > issue and added some comments. > > > > The other approach to fix this issue could be that the slotsync worker get the > serialized snapshot using pg_read_binary_file() corresponding to restart_lsn > and writes those at standby. But there are cases when we won't have such a file > like (a) when we initially create the slot and reach the consistent_point, or (b) > also by the time the slotsync worker starts to read the remote snapshot file, the > snapshot file could have been removed by the checkpointer on the primary (if > the restart_lsn of the remote has been advanced in this window). So, in such > cases, we anyway need to advance the slot. I think these could be optimizations > that we could do in the future. > > Few comments: Thanks for the comments. > ============= > 1. > - if (slot->data.database != MyDatabaseId) > + if (slot->data.database != MyDatabaseId && !fast_forward) > ereport(ERROR, > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > errmsg("replication slot \"%s\" was not created in this database", @@ -526,7 > +527,7 @@ CreateDecodingContext(XLogRecPtr start_lsn, > * Do not allow consumption of a "synchronized" slot until the standby > * gets promoted. > */ > - if (RecoveryInProgress() && slot->data.synced) > + if (RecoveryInProgress() && slot->data.synced && > + !IsSyncingReplicationSlots()) > > > Add comments at both of the above places. Added. > > > 2. > +extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto, > + bool *found_consistent_point); > + > > This API looks a bit awkward as the functionality doesn't match the name. How > about having a function with name > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto, > ready_for_decoding) with the same functionality as your patch has for > pg_logical_replication_slot_advance() and then invoke it both from > pg_logical_replication_slot_advance and slotsync.c. The function name is too > big, we can think of a shorter name. Any ideas? How about LogicalSlotAdvanceAndCheckDecodingState() Or just LogicalSlotAdvanceAndCheckDecoding()? (I used the suggested LogicalSlotAdvanceAndCheckReadynessForDecoding in this version, It can be renamed in next version if we agree). Attach the V3 patch which addressed above comments and Kuroda-san's comments[1]. I also adjusted the tap-test to only check the confirmed_flush_lsn after syncing, as the restart_lsn could be different from the remote one due to the new slot_advance() call. I am also testing some optimization idea locally and will share if ready. [1] https://www.postgresql.org/message-id/TYCPR01MB1207757BB2A32B6815CE1CCE7F53A2%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hou zj
Attachment
On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > [2] The steps to reproduce the data miss issue on a primary->standby setup: I'm trying to reproduce the problem with [1], but I can see the changes after the standby is promoted. Am I missing anything here? ubuntu:~/postgres/pg17/bin$ ./psql -d postgres -p 5433 -c "select * from pg_logical_slot_get_changes('lrep_sync_slot', NULL, NULL);" lsn | xid | data -----------+-----+--------------------------------------------- 0/30000B0 | 738 | BEGIN 738 0/3017FC8 | 738 | COMMIT 738 0/3017FF8 | 739 | BEGIN 739 0/3019A38 | 739 | COMMIT 739 0/3019A38 | 740 | BEGIN 740 0/3019A38 | 740 | table public.dummy1: INSERT: a[integer]:999 0/3019AA8 | 740 | COMMIT 740 (7 rows) [1] -#define LOG_SNAPSHOT_INTERVAL_MS 15000 +#define LOG_SNAPSHOT_INTERVAL_MS 1500000 ./initdb -D db17 echo "archive_mode = on archive_command='cp %p /home/ubuntu/postgres/pg17/bin/archived_wal/%f' wal_level='logical' autovacuum = off checkpoint_timeout='1h'" | tee -a db17/postgresql.conf ./pg_ctl -D db17 -l logfile17 start rm -rf sbdata logfilesbdata ./pg_basebackup -D sbdata ./psql -d postgres -p 5432 -c "SELECT pg_create_logical_replication_slot('lrep_sync_slot', 'test_decoding', false, false, true);" ./psql -d postgres -p 5432 -c "SELECT pg_create_physical_replication_slot('phy_repl_slot', true, false);" echo "port=5433 primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu' primary_slot_name='phy_repl_slot' restore_command='cp /home/ubuntu/postgres/pg17/bin/archived_wal/%f %p' hot_standby_feedback=on sync_replication_slots=on" | tee -a sbdata/postgresql.conf touch sbdata/standby.signal ./pg_ctl -D sbdata -l logfilesbdata start ./psql -d postgres -p 5433 -c "SELECT pg_is_in_recovery();" ./psql -d postgres SESSION1, TXN1 BEGIN; create table dummy1(a int); SESSION2, TXN2 SELECT pg_log_standby_snapshot(); SESSION1, TXN1 COMMIT; SESSION1, TXN1 BEGIN; create table dummy2(a int); SESSION2, TXN2 SELECT pg_log_standby_snapshot(); SESSION1, TXN1 COMMIT; ./psql -d postgres -p 5432 -c "SELECT pg_replication_slot_advance('lrep_sync_slot', pg_current_wal_lsn());" ./psql -d postgres -p 5432 -c "INSERT INTO dummy1 VALUES(999);" ./psql -d postgres -p 5433 -c "SELECT pg_promote();" ./psql -d postgres -p 5433 -c "SELECT pg_is_in_recovery();" ./psql -d postgres -p 5433 -c "select * from pg_logical_slot_get_changes('lrep_sync_slot', NULL, NULL);" -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Apr 1, 2024 at 10:01 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > [2] The steps to reproduce the data miss issue on a primary->standby setup: > > I'm trying to reproduce the problem with [1], but I can see the > changes after the standby is promoted. Am I missing anything here? > > ubuntu:~/postgres/pg17/bin$ ./psql -d postgres -p 5433 -c "select * > from pg_logical_slot_get_changes('lrep_sync_slot', NULL, NULL);" > lsn | xid | data > -----------+-----+--------------------------------------------- > 0/30000B0 | 738 | BEGIN 738 > 0/3017FC8 | 738 | COMMIT 738 > 0/3017FF8 | 739 | BEGIN 739 > 0/3019A38 | 739 | COMMIT 739 > 0/3019A38 | 740 | BEGIN 740 > 0/3019A38 | 740 | table public.dummy1: INSERT: a[integer]:999 > 0/3019AA8 | 740 | COMMIT 740 > (7 rows) > > [1] > -#define LOG_SNAPSHOT_INTERVAL_MS 15000 > +#define LOG_SNAPSHOT_INTERVAL_MS 1500000 > > ./initdb -D db17 > echo "archive_mode = on > archive_command='cp %p /home/ubuntu/postgres/pg17/bin/archived_wal/%f' > wal_level='logical' > autovacuum = off > checkpoint_timeout='1h'" | tee -a db17/postgresql.conf > > ./pg_ctl -D db17 -l logfile17 start > > rm -rf sbdata logfilesbdata > ./pg_basebackup -D sbdata > > ./psql -d postgres -p 5432 -c "SELECT > pg_create_logical_replication_slot('lrep_sync_slot', 'test_decoding', > false, false, true);" > ./psql -d postgres -p 5432 -c "SELECT > pg_create_physical_replication_slot('phy_repl_slot', true, false);" > > echo "port=5433 > primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu' > primary_slot_name='phy_repl_slot' > restore_command='cp /home/ubuntu/postgres/pg17/bin/archived_wal/%f %p' > hot_standby_feedback=on > sync_replication_slots=on" | tee -a sbdata/postgresql.conf > > touch sbdata/standby.signal > > ./pg_ctl -D sbdata -l logfilesbdata start > ./psql -d postgres -p 5433 -c "SELECT pg_is_in_recovery();" > > ./psql -d postgres > > SESSION1, TXN1 > BEGIN; > create table dummy1(a int); > > SESSION2, TXN2 > SELECT pg_log_standby_snapshot(); > > SESSION1, TXN1 > COMMIT; > > SESSION1, TXN1 > BEGIN; > create table dummy2(a int); > > SESSION2, TXN2 > SELECT pg_log_standby_snapshot(); > > SESSION1, TXN1 > COMMIT; > > ./psql -d postgres -p 5432 -c "SELECT > pg_replication_slot_advance('lrep_sync_slot', pg_current_wal_lsn());" > After this step and before the next, did you ensure that the slot sync has synced the latest confirmed_flush/restart LSNs? You can query: "select slot_name,restart_lsn, confirmed_flush_lsn from pg_replication_slots;" to ensure the same on both the primary and standby. -- With Regards, Amit Kapila.
On Monday, April 1, 2024 8:56 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, March 29, 2024 2:50 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Fri, Mar 29, 2024 at 6:36 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> > > wrote: > > > > > > > > > Attach a new version patch which fixed an un-initialized variable > > > issue and added some comments. > > > > > > > The other approach to fix this issue could be that the slotsync worker > > get the serialized snapshot using pg_read_binary_file() corresponding > > to restart_lsn and writes those at standby. But there are cases when > > we won't have such a file like (a) when we initially create the slot > > and reach the consistent_point, or (b) also by the time the slotsync > > worker starts to read the remote snapshot file, the snapshot file > > could have been removed by the checkpointer on the primary (if the > > restart_lsn of the remote has been advanced in this window). So, in > > such cases, we anyway need to advance the slot. I think these could be > optimizations that we could do in the future. > > > > Few comments: > > Thanks for the comments. > > > ============= > > 1. > > - if (slot->data.database != MyDatabaseId) > > + if (slot->data.database != MyDatabaseId && !fast_forward) > > ereport(ERROR, > > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > errmsg("replication slot \"%s\" was not created in this database", > > @@ -526,7 > > +527,7 @@ CreateDecodingContext(XLogRecPtr start_lsn, > > * Do not allow consumption of a "synchronized" slot until the standby > > * gets promoted. > > */ > > - if (RecoveryInProgress() && slot->data.synced) > > + if (RecoveryInProgress() && slot->data.synced && > > + !IsSyncingReplicationSlots()) > > > > > > Add comments at both of the above places. > > Added. > > > > > > > 2. > > +extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr > moveto, > > + bool *found_consistent_point); > > + > > > > This API looks a bit awkward as the functionality doesn't match the > > name. How about having a function with name > > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto, > > ready_for_decoding) with the same functionality as your patch has for > > pg_logical_replication_slot_advance() and then invoke it both from > > pg_logical_replication_slot_advance and slotsync.c. The function name > > is too big, we can think of a shorter name. Any ideas? > > How about LogicalSlotAdvanceAndCheckDecodingState() Or just > LogicalSlotAdvanceAndCheckDecoding()? (I used the suggested > LogicalSlotAdvanceAndCheckReadynessForDecoding in this version, It can be > renamed in next version if we agree). > > Attach the V3 patch which addressed above comments and Kuroda-san's > comments[1]. I also adjusted the tap-test to only check the confirmed_flush_lsn > after syncing, as the restart_lsn could be different from the remote one due to > the new slot_advance() call. I am also testing some optimization idea locally and > will share if ready. Attach the V4 patch which includes the optimization to skip the decoding if the snapshot at the syncing restart_lsn is already serialized. It can avoid most of the duplicate decoding in my test, and I am doing some more tests locally. Best Regards, Hou zj
Attachment
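A crude way to see that the sync path is now serializing snapshots on the standby (and that the skip-if-already-serialized optimization has something to reuse) is to watch the logical decoding snapshot directory, e.g.:

-- on the standby
SELECT count(*) AS snapshot_files,
       pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_logicalsnapdir();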
On Mon, Apr 1, 2024 at 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 1, 2024 at 10:01 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > [2] The steps to reproduce the data miss issue on a primary->standby setup: > > > > I'm trying to reproduce the problem with [1], but I can see the > > changes after the standby is promoted. Am I missing anything here? > > > > ubuntu:~/postgres/pg17/bin$ ./psql -d postgres -p 5433 -c "select * > > from pg_logical_slot_get_changes('lrep_sync_slot', NULL, NULL);" > > lsn | xid | data > > -----------+-----+--------------------------------------------- > > 0/30000B0 | 738 | BEGIN 738 > > 0/3017FC8 | 738 | COMMIT 738 > > 0/3017FF8 | 739 | BEGIN 739 > > 0/3019A38 | 739 | COMMIT 739 > > 0/3019A38 | 740 | BEGIN 740 > > 0/3019A38 | 740 | table public.dummy1: INSERT: a[integer]:999 > > 0/3019AA8 | 740 | COMMIT 740 > > (7 rows) > > > > [1] > > -#define LOG_SNAPSHOT_INTERVAL_MS 15000 > > +#define LOG_SNAPSHOT_INTERVAL_MS 1500000 > > > > ./initdb -D db17 > > echo "archive_mode = on > > archive_command='cp %p /home/ubuntu/postgres/pg17/bin/archived_wal/%f' > > wal_level='logical' > > autovacuum = off > > checkpoint_timeout='1h'" | tee -a db17/postgresql.conf > > > > ./pg_ctl -D db17 -l logfile17 start > > > > rm -rf sbdata logfilesbdata > > ./pg_basebackup -D sbdata > > > > ./psql -d postgres -p 5432 -c "SELECT > > pg_create_logical_replication_slot('lrep_sync_slot', 'test_decoding', > > false, false, true);" > > ./psql -d postgres -p 5432 -c "SELECT > > pg_create_physical_replication_slot('phy_repl_slot', true, false);" > > > > echo "port=5433 > > primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu' > > primary_slot_name='phy_repl_slot' > > restore_command='cp /home/ubuntu/postgres/pg17/bin/archived_wal/%f %p' > > hot_standby_feedback=on > > sync_replication_slots=on" | tee -a sbdata/postgresql.conf > > > > touch sbdata/standby.signal > > > > ./pg_ctl -D sbdata -l logfilesbdata start > > ./psql -d postgres -p 5433 -c "SELECT pg_is_in_recovery();" > > > > ./psql -d postgres > > > > SESSION1, TXN1 > > BEGIN; > > create table dummy1(a int); > > > > SESSION2, TXN2 > > SELECT pg_log_standby_snapshot(); > > > > SESSION1, TXN1 > > COMMIT; > > > > SESSION1, TXN1 > > BEGIN; > > create table dummy2(a int); > > > > SESSION2, TXN2 > > SELECT pg_log_standby_snapshot(); > > > > SESSION1, TXN1 > > COMMIT; > > > > ./psql -d postgres -p 5432 -c "SELECT > > pg_replication_slot_advance('lrep_sync_slot', pg_current_wal_lsn());" > > > > After this step and before the next, did you ensure that the slot sync > has synced the latest confirmed_flush/restart LSNs? You can query: > "select slot_name,restart_lsn, confirmed_flush_lsn from > pg_replication_slots;" to ensure the same on both the primary and > standby. +1. To ensure last sync, one can run this manually on standby just before promotion : SELECT pg_sync_replication_slots(); thanks Shveta
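Put together, a pre-promotion check on the standby along the lines suggested above (using the slot name 'lrep_sync_slot' from the reproduction steps) could look like:

SELECT pg_sync_replication_slots();   -- force one final sync cycle

SELECT slot_name, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'lrep_sync_slot';
-- the LSNs reported here should match the same query on the primary
-- before pg_promote() is called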
Hi, On Mon, Apr 01, 2024 at 06:05:34AM +0000, Zhijie Hou (Fujitsu) wrote: > On Monday, April 1, 2024 8:56 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > Attach the V4 patch which includes the optimization to skip the decoding if > the snapshot at the syncing restart_lsn is already serialized. It can avoid most > of the duplicate decoding in my test, and I am doing some more tests locally. > Thanks! 1 === Same comment as in [1]. In LogicalSlotAdvanceAndCheckReadynessForDecoding(), if we are synchronizing slots then I think that we can skip: + /* + * Wait for specified streaming replication standby servers (if any) + * to confirm receipt of WAL up to moveto lsn. + */ + WaitForStandbyConfirmation(moveto); Indeed if we are dealing with synced slot then we know we're in RecoveryInProgress(). Then there is no need to call WaitForStandbyConfirmation() as it could go until the RecoveryInProgress() in StandbySlotsHaveCaughtup() for nothing (as we already know it). 2 === + { + if (SnapBuildSnapshotExists(remote_slot->restart_lsn)) + { That could call SnapBuildSnapshotExists() multiple times for the same "restart_lsn" (for example in case of multiple remote slots to sync). What if the sync worker records the last lsn it asks for serialization (and serialized ? Then we could check that value first before deciding to call (or not) SnapBuildSnapshotExists() on it? It's not ideal because it would record "only the last one" but that would be simple enough for now (currently there is only one sync worker so that scenario is likely to happen). Maybe an idea for future improvement (not for now) could be that SnapBuildSerialize() maintains a "small list" of "already serialized" snapshots. [1]: https://www.postgresql.org/message-id/ZgayTFIhLfzhpHci%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
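For what it's worth, the "record the last LSN" idea in comment 2 above could be pictured roughly as below. This is a hypothetical sketch only: the static variable and helper name are invented for illustration, and SnapBuildSnapshotExists() is the helper added by the patch under discussion.

```c
/*
 * Hypothetical sketch of the suggestion: remember the last restart_lsn
 * for which a serialized snapshot was found, so repeated
 * SnapBuildSnapshotExists() calls for the same LSN can be skipped
 * within one sync worker.
 */
static XLogRecPtr last_serialized_restart_lsn = InvalidXLogRecPtr;

static bool
snapshot_exists_cached(XLogRecPtr restart_lsn)
{
	/* Fast path: we already found a snapshot at this LSN earlier. */
	if (restart_lsn == last_serialized_restart_lsn)
		return true;

	if (SnapBuildSnapshotExists(restart_lsn))
	{
		last_serialized_restart_lsn = restart_lsn;
		return true;
	}

	return false;
}
```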
Did a performance test on the optimization patch (v2-0001-optimize-the-slot-advancement.patch). Please find the results:

Setup:
- One primary node with 100 failover-enabled logical slots
- 20 DBs, each having 5 failover-enabled logical replication slots
- One physical standby node with 'sync_replication_slots' off but the other parameters required by slot sync enabled

Node configurations: please see config.txt

Test plan:
1) Create 20 databases on the primary node, each with 5 failover slots using "pg_create_logical_replication_slot()". Overall 100 failover slots.
2) Use pg_sync_replication_slot() to sync them to the standby. Note the execution time of the sync and the LSN values.
3) On the primary node, run pgbench for 15 mins on the postgres db.
4) Advance the LSNs of all 100 slots on the primary using pg_replication_slot_advance().
5) Use pg_sync_replication_slot() to sync the slots to the standby. Note the execution time of the sync and the LSN values.

Executed the above test plan for three cases and compared the elapsed times:

(1) HEAD
Time taken by pg_sync_replication_slot() on the standby node:
a) The initial sync (step 2) = 140.208 ms
b) Sync after pgbench run on primary (step 5) = 66.994 ms

(2) HEAD + v3-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch
a) The initial sync (step 2) = 163.885 ms
b) Sync after pgbench run on primary (step 5) = 837901.290 ms (13:57.901)
>> With the v3 patch, pg_sync_replication_slot() takes a significant amount of time to sync the slots.

(3) HEAD + v3-0001-advance-the-restart_lsn-of-synced-slots-using-log.patch + v2-0001-optimize-the-slot-advancement.patch
a) The initial sync (step 2) = 165.554 ms
b) Sync after pgbench run on primary (step 5) = 7991.718 ms (00:07.992)
>> With the optimization patch, the time taken by pg_sync_replication_slot() is reduced significantly, to ~7 seconds.

We did the same test with a single DB too, by creating all 100 failover slots in the postgres DB, and the results were almost similar.

Attached the scripts used for the test ("v3_perf_test_scripts.tar.gz"), which includes these files:
setup_multidb.sh : setup primary and standby nodes
createdb20.sql : create 20 DBs
createslot20.sql : create total 100 logical slots, 5 on each DB
run_sync.sql : call pg_replication_slot_advance() with timing
advance20.sql : advance lsn of all slots on Primary node to current lsn
advance20_perdb.sql : use on HEAD to advance lsn on Primary node
get_synced_data.sql : get details of the
config.txt : configuration used for nodes
Attachment
On Mon, Apr 1, 2024 at 6:26 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, March 29, 2024 2:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > 2. > > +extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto, > > + bool *found_consistent_point); > > + > > > > This API looks a bit awkward as the functionality doesn't match the name. How > > about having a function with name > > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto, > > ready_for_decoding) with the same functionality as your patch has for > > pg_logical_replication_slot_advance() and then invoke it both from > > pg_logical_replication_slot_advance and slotsync.c. The function name is too > > big, we can think of a shorter name. Any ideas? > > How about LogicalSlotAdvanceAndCheckDecodingState() Or just > LogicalSlotAdvanceAndCheckDecoding()? > It is about snapbuild state, so how about naming the function as LogicalSlotAdvanceAndCheckSnapState()? I have made quite a few cosmetic changes in comments and code. See attached. This is atop your latest patch. Can you please review and include these changes in the next version? -- With Regards, Amit Kapila.
Attachment
On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Mon, Apr 01, 2024 at 06:05:34AM +0000, Zhijie Hou (Fujitsu) wrote: > > On Monday, April 1, 2024 8:56 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Attach the V4 patch which includes the optimization to skip the decoding if > > the snapshot at the syncing restart_lsn is already serialized. It can avoid most > > of the duplicate decoding in my test, and I am doing some more tests locally. > > > > Thanks! > > 1 === > > Same comment as in [1]. > > In LogicalSlotAdvanceAndCheckReadynessForDecoding(), if we are synchronizing slots > then I think that we can skip: > > + /* > + * Wait for specified streaming replication standby servers (if any) > + * to confirm receipt of WAL up to moveto lsn. > + */ > + WaitForStandbyConfirmation(moveto); > > Indeed if we are dealing with synced slot then we know we're in RecoveryInProgress(). > > Then there is no need to call WaitForStandbyConfirmation() as it could go until > the RecoveryInProgress() in StandbySlotsHaveCaughtup() for nothing (as we already > know it). > Won't it will normally return from the first check in WaitForStandbyConfirmation() because standby_slot_names_config is not set on standby? > 2 === > > + { > + if (SnapBuildSnapshotExists(remote_slot->restart_lsn)) > + { > > That could call SnapBuildSnapshotExists() multiple times for the same > "restart_lsn" (for example in case of multiple remote slots to sync). > > What if the sync worker records the last lsn it asks for serialization (and > serialized ? Then we could check that value first before deciding to call (or not) > SnapBuildSnapshotExists() on it? > > It's not ideal because it would record "only the last one" but that would be > simple enough for now (currently there is only one sync worker so that scenario > is likely to happen). > Yeah, we could do that but I am not sure how much it can help. I guess we could do some tests to see if it helps. -- With Regards, Amit Kapila.
On Mon, Apr 1, 2024 at 10:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > After this step and before the next, did you ensure that the slot sync > has synced the latest confirmed_flush/restart LSNs? You can query: > "select slot_name,restart_lsn, confirmed_flush_lsn from > pg_replication_slots;" to ensure the same on both the primary and > standby. Yes, after ensuring the slot is synced on the standby, the problem is reproduced for me, and the proposed patch fixes it (i.e. I am able to see the changes even after the promotion). I'm thinking about adding a TAP test for this issue, but one key aspect of this reproducer is to prevent anything from writing a RUNNING_XACTS WAL record on the primary before the standby promotion. Setting bgwriter_delay to max isn't helping me. I think we can use an injection point to add a delay in LogStandbySnapshot() so that this problem is reproduced consistently in a TAP test. Perhaps we can add this later, after the fix is shipped. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Mon, Apr 01, 2024 at 05:04:53PM +0530, Amit Kapila wrote: > On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > Then there is no need to call WaitForStandbyConfirmation() as it could go until > > the RecoveryInProgress() in StandbySlotsHaveCaughtup() for nothing (as we already > > know it). > > > > Won't it will normally return from the first check in > WaitForStandbyConfirmation() because standby_slot_names_config is not > set on standby? I think standby_slot_names can be set on a standby. One could want to set it in a cascading standby env (though it won't have any real effects until the standby is promoted). > > > 2 === > > > > + { > > + if (SnapBuildSnapshotExists(remote_slot->restart_lsn)) > > + { > > > > That could call SnapBuildSnapshotExists() multiple times for the same > > "restart_lsn" (for example in case of multiple remote slots to sync). > > > > What if the sync worker records the last lsn it asks for serialization (and > > serialized ? Then we could check that value first before deciding to call (or not) > > SnapBuildSnapshotExists() on it? > > > > It's not ideal because it would record "only the last one" but that would be > > simple enough for now (currently there is only one sync worker so that scenario > > is likely to happen). > > > > Yeah, we could do that but I am not sure how much it can help. I guess > we could do some tests to see if it helps. Yeah not sure either. I just think it can only help and shouldn't make things worst (but could avoid extra SnapBuildSnapshotExists() calls). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Monday, April 1, 2024 7:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 1, 2024 at 6:26 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > On Friday, March 29, 2024 2:50 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > > > > > > > > > 2. > > > +extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr > moveto, > > > + bool *found_consistent_point); > > > + > > > > > > This API looks a bit awkward as the functionality doesn't match the > > > name. How about having a function with name > > > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto, > > > ready_for_decoding) with the same functionality as your patch has > > > for > > > pg_logical_replication_slot_advance() and then invoke it both from > > > pg_logical_replication_slot_advance and slotsync.c. The function > > > name is too big, we can think of a shorter name. Any ideas? > > > > How about LogicalSlotAdvanceAndCheckDecodingState() Or just > > LogicalSlotAdvanceAndCheckDecoding()? > > > > It is about snapbuild state, so how about naming the function as > LogicalSlotAdvanceAndCheckSnapState()? It looks better to me, so changed. > > I have made quite a few cosmetic changes in comments and code. See > attached. This is atop your latest patch. Can you please review and include > these changes in the next version? Thanks, I have reviewed and merged them. Attach the V5 patch set which addressed above comments and ran pgindent. I will think and test the improvement suggested by Bertrand[1] and reply after that. [1] https://www.postgresql.org/message-id/Zgp8n9QD5nYSESnM%40ip-10-97-1-34.eu-west-3.compute.internal Best Regards, Hou zj
Attachment
On Mon, Apr 1, 2024 at 11:36 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Attach the V4 patch which includes the optimization to skip the decoding if > the snapshot at the syncing restart_lsn is already serialized. It can avoid most > of the duplicate decoding in my test, and I am doing some more tests locally. Thanks for the patch. I'm thinking if we can reduce the amount of work that we do for synced slots in each sync worker cycle. With that context in mind, why do we need to create decoding context every time? Can't we create it once, store it in an in-memory structure and use it for each sync worker cycle? Is there any problem with it? What do you think? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Apr 1, 2024 at 6:58 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Mon, Apr 01, 2024 at 05:04:53PM +0530, Amit Kapila wrote: > > On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > Then there is no need to call WaitForStandbyConfirmation() as it could go until > > > the RecoveryInProgress() in StandbySlotsHaveCaughtup() for nothing (as we already > > > know it). > > > > > > > Won't it will normally return from the first check in > > WaitForStandbyConfirmation() because standby_slot_names_config is not > > set on standby? > > I think standby_slot_names can be set on a standby. One could want to set it in > a cascading standby env (though it won't have any real effects until the standby > is promoted). > Yeah, it is possible but doesn't seem worth additional checks for this micro-optimization. -- With Regards, Amit Kapila.
On Tuesday, April 2, 2024 8:43 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Mon, Apr 1, 2024 at 11:36 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > Attach the V4 patch which includes the optimization to skip the > > decoding if the snapshot at the syncing restart_lsn is already > > serialized. It can avoid most of the duplicate decoding in my test, and I am > doing some more tests locally. > > Thanks for the patch. I'm thinking if we can reduce the amount of work that we > do for synced slots in each sync worker cycle. With that context in mind, why do > we need to create decoding context every time? > Can't we create it once, store it in an in-memory structure and use it for each > sync worker cycle? Is there any problem with it? What do you think? Thanks for the idea. I think the cost of creating the decoding context is relatively minor compared to the IO cost. After generating profiles for the tests shared by Nisha[1], it appears that StartupDecodingContext is not an issue. While the suggested refactoring is an option, I think we can consider it a future improvement and address it only if we encounter scenarios where creating the decoding context becomes a bottleneck. [1] https://www.postgresql.org/message-id/CALj2ACUeij5tFzJ1-cuoUh%2Bmhj33v%2BYgqD_gHYUpRdXSCSBbhw%40mail.gmail.com Best Regards, Hou zj
Attachment
On Mon, Apr 1, 2024 at 5:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > 2 === > > > > + { > > + if (SnapBuildSnapshotExists(remote_slot->restart_lsn)) > > + { > > > > That could call SnapBuildSnapshotExists() multiple times for the same > > "restart_lsn" (for example in case of multiple remote slots to sync). > > > > What if the sync worker records the last lsn it asks for serialization (and > > serialized ? Then we could check that value first before deciding to call (or not) > > SnapBuildSnapshotExists() on it? > > > > It's not ideal because it would record "only the last one" but that would be > > simple enough for now (currently there is only one sync worker so that scenario > > is likely to happen). > > > > Yeah, we could do that but I am not sure how much it can help. I guess > we could do some tests to see if it helps. I had a look at the test results from Nisha's run and did not find any repetitive restart_lsn from the primary being synced to the standby for that particular 100-slot test. Unless we have some concrete test in mind (one having repetitive restart_lsn values), I do not think the given tests can establish the benefit of the suggested optimization. Attached are the log files of the all-slots test for reference, thanks Shveta
Attachment
On Monday, April 1, 2024 9:28 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Mon, Apr 01, 2024 at 05:04:53PM +0530, Amit Kapila wrote: > > On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot > > > > > > 2 === > > > > > > + { > > > + if (SnapBuildSnapshotExists(remote_slot->restart_lsn)) > > > + { > > > > > > That could call SnapBuildSnapshotExists() multiple times for the > > > same "restart_lsn" (for example in case of multiple remote slots to sync). > > > > > > What if the sync worker records the last lsn it asks for > > > serialization (and serialized ? Then we could check that value first > > > before deciding to call (or not) > > > SnapBuildSnapshotExists() on it? > > > > > > It's not ideal because it would record "only the last one" but that > > > would be simple enough for now (currently there is only one sync > > > worker so that scenario is likely to happen). > > > > > > > Yeah, we could do that but I am not sure how much it can help. I guess > > we could do some tests to see if it helps. > > Yeah not sure either. I just think it can only help and shouldn't make things > worst (but could avoid extra SnapBuildSnapshotExists() calls). Thanks for the idea. I tried some tests based on Nisha's setup[1]. I tried to advance the slots on the primary to the same restart_lsn before calling sync_replication_slots(), and reduced the data generated by pgbench. The SnapBuildSnapshotExists is still not noticeable in the profile. So, I feel we could leave this as a further improvement once we encounter scenarios where the duplicate SnapBuildSnapshotExists call becomes noticeable. [1] https://www.postgresql.org/message-id/CALj2ACUeij5tFzJ1-cuoUh%2Bmhj33v%2BYgqD_gHYUpRdXSCSBbhw%40mail.gmail.com Best Regards, Hou zj
Attachment
Hi, On Tue, Apr 02, 2024 at 04:24:49AM +0000, Zhijie Hou (Fujitsu) wrote: > On Monday, April 1, 2024 9:28 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > > > On Mon, Apr 01, 2024 at 05:04:53PM +0530, Amit Kapila wrote: > > > On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot > > > > > > > > > 2 === > > > > > > > > + { > > > > + if (SnapBuildSnapshotExists(remote_slot->restart_lsn)) > > > > + { > > > > > > > > That could call SnapBuildSnapshotExists() multiple times for the > > > > same "restart_lsn" (for example in case of multiple remote slots to sync). > > > > > > > > What if the sync worker records the last lsn it asks for > > > > serialization (and serialized ? Then we could check that value first > > > > before deciding to call (or not) > > > > SnapBuildSnapshotExists() on it? > > > > > > > > It's not ideal because it would record "only the last one" but that > > > > would be simple enough for now (currently there is only one sync > > > > worker so that scenario is likely to happen). > > > > > > > > > > Yeah, we could do that but I am not sure how much it can help. I guess > > > we could do some tests to see if it helps. > > > > Yeah not sure either. I just think it can only help and shouldn't make things > > worst (but could avoid extra SnapBuildSnapshotExists() calls). > > Thanks for the idea. I tried some tests based on Nisha's setup[1]. Thank you and Nisha and Shveta for the testing! > I tried to > advance the slots on the primary to the same restart_lsn before calling > sync_replication_slots(), and reduced the data generated by pgbench. Agree that this scenario makes sense to try to see the impact of SnapBuildSnapshotExists(). > The SnapBuildSnapshotExists is still not noticeable in the profile. SnapBuildSnapshotExists() number of calls are probably negligeable when compared to the IO calls generated by the fast forward logical decoding in this scenario. > So, I feel we > could leave this as a further improvement once we encounter scenarios where > the duplicate SnapBuildSnapshotExists call becomes noticeable. Sounds reasonable to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tuesday, April 2, 2024 8:35 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Monday, April 1, 2024 7:30 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Mon, Apr 1, 2024 at 6:26 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> > > wrote: > > > > > > On Friday, March 29, 2024 2:50 PM Amit Kapila > > > <amit.kapila16@gmail.com> > > wrote: > > > > > > > > > > > > > > > > > > > 2. > > > > +extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr > > moveto, > > > > + bool *found_consistent_point); > > > > + > > > > > > > > This API looks a bit awkward as the functionality doesn't match > > > > the name. How about having a function with name > > > > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto, > > > > ready_for_decoding) with the same functionality as your patch has > > > > for > > > > pg_logical_replication_slot_advance() and then invoke it both from > > > > pg_logical_replication_slot_advance and slotsync.c. The function > > > > name is too big, we can think of a shorter name. Any ideas? > > > > > > How about LogicalSlotAdvanceAndCheckDecodingState() Or just > > > LogicalSlotAdvanceAndCheckDecoding()? > > > > > > > It is about snapbuild state, so how about naming the function as > > LogicalSlotAdvanceAndCheckSnapState()? > > It looks better to me, so changed. > > > > > I have made quite a few cosmetic changes in comments and code. See > > attached. This is atop your latest patch. Can you please review and > > include these changes in the next version? > > Thanks, I have reviewed and merged them. > Attach the V5 patch set which addressed above comments and ran pgindent. I added one test in 040_standby_failover_slots_sync.pl in 0002 patch, which can reproduce the data loss issue consistently on my machine. It may not reproduce in some rare cases if concurrent xl_running_xacts are written by bgwriter, but I think it's still valuable if it can verify the fix in most cases. The test will fail if directly applied on HEAD, and will pass after applying atop of 0001. Best Regards, Hou zj
Attachment
Hi, On Tue, Apr 02, 2024 at 07:20:46AM +0000, Zhijie Hou (Fujitsu) wrote: > I added one test in 040_standby_failover_slots_sync.pl in 0002 patch, which can > reproduce the data loss issue consistently on my machine. Thanks! > It may not reproduce > in some rare cases if concurrent xl_running_xacts are written by bgwriter, but > I think it's still valuable if it can verify the fix in most cases. What about adding a "wait" injection point in LogStandbySnapshot() to prevent checkpointer/bgwriter to log a standby snapshot? Something among those lines: if (AmCheckpointerProcess() || AmBackgroundWriterProcess()) INJECTION_POINT("bgw-log-standby-snapshot"); And make use of it in the test, something like: $node_primary->safe_psql('postgres', "SELECT injection_points_attach('bgw-log-standby-snapshot', 'wait');"); Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tuesday, April 2, 2024 3:21 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > On Tuesday, April 2, 2024 8:35 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Monday, April 1, 2024 7:30 PM Amit Kapila <amit.kapila16@gmail.com> > > wrote: > > > > > > On Mon, Apr 1, 2024 at 6:26 AM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> > > > wrote: > > > > > > > > On Friday, March 29, 2024 2:50 PM Amit Kapila > > > > <amit.kapila16@gmail.com> > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > 2. > > > > > +extern XLogRecPtr > > > > > +pg_logical_replication_slot_advance(XLogRecPtr > > > moveto, > > > > > + bool *found_consistent_point); > > > > > + > > > > > > > > > > This API looks a bit awkward as the functionality doesn't match > > > > > the name. How about having a function with name > > > > > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto, > > > > > ready_for_decoding) with the same functionality as your patch > > > > > has for > > > > > pg_logical_replication_slot_advance() and then invoke it both > > > > > from pg_logical_replication_slot_advance and slotsync.c. The > > > > > function name is too big, we can think of a shorter name. Any ideas? > > > > > > > > How about LogicalSlotAdvanceAndCheckDecodingState() Or just > > > > LogicalSlotAdvanceAndCheckDecoding()? > > > > > > > > > > It is about snapbuild state, so how about naming the function as > > > LogicalSlotAdvanceAndCheckSnapState()? > > > > It looks better to me, so changed. > > > > > > > > I have made quite a few cosmetic changes in comments and code. See > > > attached. This is atop your latest patch. Can you please review and > > > include these changes in the next version? > > > > Thanks, I have reviewed and merged them. > > Attach the V5 patch set which addressed above comments and ran pgindent. > > I added one test in 040_standby_failover_slots_sync.pl in 0002 patch, which can > reproduce the data loss issue consistently on my machine. It may not > reproduce in some rare cases if concurrent xl_running_xacts are written by > bgwriter, but I think it's still valuable if it can verify the fix in most cases. The test > will fail if directly applied on HEAD, and will pass after applying atop of 0001. CFbot[1] complained about one query result's order in the tap-test, so I am attaching a V7 patch set which fixed this. There are no changes in 0001. [1] https://cirrus-ci.com/task/6375962162495488 Best Regards, Hou zj
Attachment
On Tue, Apr 2, 2024 at 1:54 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Tue, Apr 02, 2024 at 07:20:46AM +0000, Zhijie Hou (Fujitsu) wrote: > > I added one test in 040_standby_failover_slots_sync.pl in 0002 patch, which can > > reproduce the data loss issue consistently on my machine. > > Thanks! > > > It may not reproduce > > in some rare cases if concurrent xl_running_xacts are written by bgwriter, but > > I think it's still valuable if it can verify the fix in most cases. > > What about adding a "wait" injection point in LogStandbySnapshot() to prevent > checkpointer/bgwriter to log a standby snapshot? Something among those lines: > > if (AmCheckpointerProcess() || AmBackgroundWriterProcess()) > INJECTION_POINT("bgw-log-standby-snapshot"); > > And make use of it in the test, something like: > > $node_primary->safe_psql('postgres', > "SELECT injection_points_attach('bgw-log-standby-snapshot', 'wait');"); > Sometimes we want the checkpoint to log the standby snapshot as we need it at a predictable time, maybe one can use pg_log_standby_snapshot() instead of that. Can we add an injection point as a separate patch/commit after a bit more discussion? I want to discuss this in a separate thread so that later we should not get an objection to adding an injection_point at this location. One other idea to make such tests predictable is to add a developer-specific GUC say debug_bg_log_standby_snapshot or something like that but injection point sounds like a better idea. -- With Regards, Amit Kapila.
Hi, On Tue, Apr 02, 2024 at 02:19:30PM +0530, Amit Kapila wrote: > On Tue, Apr 2, 2024 at 1:54 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > What about adding a "wait" injection point in LogStandbySnapshot() to prevent > > checkpointer/bgwriter to log a standby snapshot? Something among those lines: > > > > if (AmCheckpointerProcess() || AmBackgroundWriterProcess()) > > INJECTION_POINT("bgw-log-standby-snapshot"); > > > > And make use of it in the test, something like: > > > > $node_primary->safe_psql('postgres', > > "SELECT injection_points_attach('bgw-log-standby-snapshot', 'wait');"); > > > > Sometimes we want the checkpoint to log the standby snapshot as we > need it at a predictable time, maybe one can use > pg_log_standby_snapshot() instead of that. Can we add an injection > point as a separate patch/commit after a bit more discussion? Sure, let's come back to this injection point discussion after the feature freeze. BTW, I think it could also be useful to make use of injection point for the test that has been added in 7f13ac8123. I'll open a new thread for this at that time. >. One other > idea to make such tests predictable is to add a developer-specific GUC > say debug_bg_log_standby_snapshot or something like that but injection > point sounds like a better idea. Agree. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tue, Apr 2, 2024 at 2:11 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > CFbot[1] complained about one query result's order in the tap-test, so I am > attaching a V7 patch set which fixed this. There are no changes in 0001. > > [1] https://cirrus-ci.com/task/6375962162495488 Thanks. Here are some comments: 1. Can we just remove pg_logical_replication_slot_advance and use LogicalSlotAdvanceAndCheckSnapState instead? If worried about the function naming, LogicalSlotAdvanceAndCheckSnapState can be renamed to pg_logical_replication_slot_advance? + * Advance our logical replication slot forward. See + * LogicalSlotAdvanceAndCheckSnapState for details. */ static XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto) { 2. + if (!ready_for_decoding) + { + elog(DEBUG1, "could not find consistent point for synced slot; restart_lsn = %X/%X", + LSN_FORMAT_ARGS(slot->data.restart_lsn)); Can we specify the slot name in the message? 3. Also, can the "could not find consistent point for synced slot; restart_lsn = %X/%X" be emitted at LOG level just like other messages in update_and_persist_local_synced_slot. Although, I see "XXX should this be changed to elog(DEBUG1) perhaps?", these messages need to be at LOG level as they help debug issues if at all they are hit. 4. How about using found_consistent_snapshot instead of ready_for_decoding? A general understanding is that the synced slots are not allowed for decoding (although with this fix, we do that for internal purposes), ready_for_decoding looks a bit misleading. 5. As far as the test case for this issue is concerned, I'm fine with adding one using an INJECTION point because we seem to be having no consistent way to control postgres writing current snapshot to WAL. 6. A nit: can we use "fast_forward mode" instead of "fast-forward mode" just to be consistent? + * logical changes unless we are in fast-forward mode where no changes are 7. + /* + * We need to access the system tables during decoding to build the + * logical changes unless we are in fast-forward mode where no changes are + * generated. + */ + if (slot->data.database != MyDatabaseId && !fast_forward) May I know if we need this change for this fix? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Tuesday, April 2, 2024 8:49 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Tue, Apr 2, 2024 at 2:11 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > CFbot[1] complained about one query result's order in the tap-test, so I am > > attaching a V7 patch set which fixed this. There are no changes in 0001. > > > > [1] https://cirrus-ci.com/task/6375962162495488 > > Thanks. Here are some comments: Thanks for the comments. > > 1. Can we just remove pg_logical_replication_slot_advance and use > LogicalSlotAdvanceAndCheckSnapState instead? If worried about the > function naming, LogicalSlotAdvanceAndCheckSnapState can be renamed to > pg_logical_replication_slot_advance? > > + * Advance our logical replication slot forward. See > + * LogicalSlotAdvanceAndCheckSnapState for details. > */ > static XLogRecPtr > pg_logical_replication_slot_advance(XLogRecPtr moveto) > { It was commented[1] that it's not appropriate for the pg_logical_replication_slot_advance to have an out parameter 'ready_for_decoding' which looks bit awkward as the functionality doesn't match the name, and is also not consistent with the style of pg_physical_replication_slot_advance(). So, we added a new function. > > 2. > + if (!ready_for_decoding) > + { > + elog(DEBUG1, "could not find consistent point for synced > slot; restart_lsn = %X/%X", > + LSN_FORMAT_ARGS(slot->data.restart_lsn)); > > Can we specify the slot name in the message? Added. > > 3. Also, can the "could not find consistent point for synced slot; > restart_lsn = %X/%X" be emitted at LOG level just like other messages > in update_and_persist_local_synced_slot. Although, I see "XXX should > this be changed to elog(DEBUG1) perhaps?", these messages need to be > at LOG level as they help debug issues if at all they are hit. Changed to LOG and reworded the message. > > 4. How about using found_consistent_snapshot instead of > ready_for_decoding? A general understanding is that the synced slots > are not allowed for decoding (although with this fix, we do that for > internal purposes), ready_for_decoding looks a bit misleading. Agreed and renamed. > > 5. As far as the test case for this issue is concerned, I'm fine with > adding one using an INJECTION point because we seem to be having no > consistent way to control postgres writing current snapshot to WAL. Since me and my colleagues can reproduce the issue consistently after applying 0002 and it could be rare for concurrent xl_running_xacts to happen, we discussed[2] to consider adding the INJECTION point after pushing the main fix. > > 6. A nit: can we use "fast_forward mode" instead of "fast-forward > mode" just to be consistent? > + * logical changes unless we are in fast-forward mode where no changes > are > > 7. > + /* > + * We need to access the system tables during decoding to build the > + * logical changes unless we are in fast-forward mode where no changes > are > + * generated. > + */ > + if (slot->data.database != MyDatabaseId && !fast_forward) > > May I know if we need this change for this fix? The slotsync worker needs to advance the slots from different databases in fast_forward. So, we need to skip this check in fast_forward mode. The analysis can be found in [3]. Attach the V8 patch which addressed above comments. 
[1] https://www.postgresql.org/message-id/CAA4eK1%2BwkaRi2BrLLC_0gKbHN68Awc9dRp811G3An6A6fuqdOg%40mail.gmail.com [2] https://www.postgresql.org/message-id/ZgvI9iAUWCZ17z5V%40ip-10-97-1-34.eu-west-3.compute.internal [3] https://www.postgresql.org/message-id/CAJpy0uCQ2PDCAqcnbdOz6q_ZqmBfMyBpVqKDqL_XZBP%3DeK-1yw%40mail.gmail.com Best Regards, Hou zj
Attachment
On Tue, Apr 2, 2024 at 7:25 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > 1. Can we just remove pg_logical_replication_slot_advance and use > > LogicalSlotAdvanceAndCheckSnapState instead? If worried about the > > function naming, LogicalSlotAdvanceAndCheckSnapState can be renamed to > > pg_logical_replication_slot_advance? > > > > + * Advance our logical replication slot forward. See > > + * LogicalSlotAdvanceAndCheckSnapState for details. > > */ > > static XLogRecPtr > > pg_logical_replication_slot_advance(XLogRecPtr moveto) > > { > > It was commented[1] that it's not appropriate for the > pg_logical_replication_slot_advance to have an out parameter > 'ready_for_decoding' which looks bit awkward as the functionality doesn't match > the name, and is also not consistent with the style of > pg_physical_replication_slot_advance(). So, we added a new function. I disagree here. A new function just for a parameter is not that great IMHO. I'd just rename LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto, bool *found_consistent_snapshot) to pg_logical_replication_slot_advance(XLogRecPtr moveto, bool *found_consistent_snapshot) and use it. If others don't like this, I'd at least turn pg_logical_replication_slot_advance(XLogRecPtr moveto) a static inline function. > > 5. As far as the test case for this issue is concerned, I'm fine with > > adding one using an INJECTION point because we seem to be having no > > consistent way to control postgres writing current snapshot to WAL. > > Since me and my colleagues can reproduce the issue consistently after applying > 0002 and it could be rare for concurrent xl_running_xacts to happen, we discussed[2] to > consider adding the INJECTION point after pushing the main fix. Right. > > 7. > > + /* > > + * We need to access the system tables during decoding to build the > > + * logical changes unless we are in fast-forward mode where no changes > > are > > + * generated. > > + */ > > + if (slot->data.database != MyDatabaseId && !fast_forward) > > > > May I know if we need this change for this fix? > > The slotsync worker needs to advance the slots from different databases in > fast_forward. So, we need to skip this check in fast_forward mode. The analysis can > be found in [3]. - if (slot->data.database != MyDatabaseId) + /* + * We need to access the system tables during decoding to build the + * logical changes unless we are in fast_forward mode where no changes are + * generated. + */ + if (slot->data.database != MyDatabaseId && !fast_forward) ereport(ERROR, It's not clear from the comment that we need it for a slotsync worker to advance the slots from different databases. Can this be put into the comment? Also, specify in the comment, why this is safe? Also, if this change is needed for only slotsync workers, why not protect it with IsSyncingReplicationSlots()? Otherwise, it might impact non-slotsync callers, no? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
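To make the last question above concrete, the reviewer appears to be asking whether the relaxed check could be scoped roughly as in the sketch below, rather than applying to every fast_forward caller. This is only an illustration of the question, not proposed code; it reuses the error text and the IsSyncingReplicationSlots() helper quoted earlier in the thread.

```c
/*
 * Illustration of the reviewer's question: restrict the cross-database
 * exemption to the slot sync path instead of all fast_forward callers.
 */
if (slot->data.database != MyDatabaseId &&
	!(fast_forward && IsSyncingReplicationSlots()))
	ereport(ERROR,
			(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
			 errmsg("replication slot \"%s\" was not created in this database",
					NameStr(slot->data.name))));
```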
On Tue, Apr 2, 2024 at 7:42 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Tue, Apr 2, 2024 at 7:25 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > > 1. Can we just remove pg_logical_replication_slot_advance and use > > > LogicalSlotAdvanceAndCheckSnapState instead? If worried about the > > > function naming, LogicalSlotAdvanceAndCheckSnapState can be renamed to > > > pg_logical_replication_slot_advance? > > > > > > + * Advance our logical replication slot forward. See > > > + * LogicalSlotAdvanceAndCheckSnapState for details. > > > */ > > > static XLogRecPtr > > > pg_logical_replication_slot_advance(XLogRecPtr moveto) > > > { > > > > It was commented[1] that it's not appropriate for the > > pg_logical_replication_slot_advance to have an out parameter > > 'ready_for_decoding' which looks bit awkward as the functionality doesn't match > > the name, and is also not consistent with the style of > > pg_physical_replication_slot_advance(). So, we added a new function. > > I disagree here. A new function just for a parameter is not that great > IMHO. > It is not for the parameter but primarily for the functionality it provides. The additional functionality of whether we reached a consistent point while advancing the slot doesn't sound to suit the current function. Also, we want to keep the signature similar to the existing function pg_physical_replication_slot_advance(). > > I'd just rename LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr > moveto, bool *found_consistent_snapshot) to > pg_logical_replication_slot_advance(XLogRecPtr moveto, bool > *found_consistent_snapshot) and use it. If others don't like this, I'd > at least turn pg_logical_replication_slot_advance(XLogRecPtr moveto) a > static inline function. > Yeah, we can do that but it is not a performance-sensitive routine so don't know if it is worth it. > > > 5. As far as the test case for this issue is concerned, I'm fine with > > > adding one using an INJECTION point because we seem to be having no > > > consistent way to control postgres writing current snapshot to WAL. > > > > Since me and my colleagues can reproduce the issue consistently after applying > > 0002 and it could be rare for concurrent xl_running_xacts to happen, we discussed[2] to > > consider adding the INJECTION point after pushing the main fix. > > Right. > > > > 7. > > > + /* > > > + * We need to access the system tables during decoding to build the > > > + * logical changes unless we are in fast-forward mode where no changes > > > are > > > + * generated. > > > + */ > > > + if (slot->data.database != MyDatabaseId && !fast_forward) > > > > > > May I know if we need this change for this fix? > > > > The slotsync worker needs to advance the slots from different databases in > > fast_forward. So, we need to skip this check in fast_forward mode. The analysis can > > be found in [3]. > - if (slot->data.database != MyDatabaseId) > + /* > + * We need to access the system tables during decoding to build the > + * logical changes unless we are in fast_forward mode where no changes are > + * generated. > + */ > + if (slot->data.database != MyDatabaseId && !fast_forward) > ereport(ERROR, > > It's not clear from the comment that we need it for a slotsync worker > to advance the slots from different databases. Can this be put into > the comment? Also, specify in the comment, why this is safe? > It is not specific to slot sync worker but specific to fast_forward mode. 
There is already a comment "We need to access the system tables during decoding to build the logical changes unless we are in fast_forward mode where no changes are generated." telling why it is safe. The point is that we need database access to read system tables while generating the logical changes, and in fast-forward mode we don't generate logical changes, so this check is not required. Do let me know if you have a different understanding or if my understanding is incorrect. -- With Regards, Amit Kapila.
On Wed, Apr 3, 2024 at 9:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I'd just rename LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr > > moveto, bool *found_consistent_snapshot) to > > pg_logical_replication_slot_advance(XLogRecPtr moveto, bool > > *found_consistent_snapshot) and use it. If others don't like this, I'd > > at least turn pg_logical_replication_slot_advance(XLogRecPtr moveto) a > > static inline function. > > > Yeah, we can do that but it is not a performance-sensitive routine so > don't know if it is worth it. Okay for what the patch has right now. No more bikeshedding from me on this. > > > The slotsync worker needs to advance the slots from different databases in > > > fast_forward. So, we need to skip this check in fast_forward mode. The analysis can > > > be found in [3]. > > - if (slot->data.database != MyDatabaseId) > > + /* > > + * We need to access the system tables during decoding to build the > > + * logical changes unless we are in fast_forward mode where no changes are > > + * generated. > > + */ > > + if (slot->data.database != MyDatabaseId && !fast_forward) > > ereport(ERROR, > > > > It's not clear from the comment that we need it for a slotsync worker > > to advance the slots from different databases. Can this be put into > > the comment? Also, specify in the comment, why this is safe? > > > It is not specific to slot sync worker but specific to fast_forward > mode. There is already a comment "We need to access the system tables > during decoding to build the logical changes unless we are in > fast_forward mode where no changes are generated." telling why it is > safe. The point is we need database access to access system tables > while generating the logical changes and in fast-forward mode, we > don't generate logical changes so this check is not required. Do let > me if you have a different understanding or if my understanding is > incorrect. Understood. Thanks. Just curious, why isn't a problem for the existing fast_forward mode callers pg_replication_slot_advance and LogicalReplicationSlotHasPendingWal? I quickly looked at v8, and have a nit, rest all looks good. + if (DecodingContextReady(ctx) && found_consistent_snapshot) + *found_consistent_snapshot = true; Can the found_consistent_snapshot be checked first to help avoid the function call DecodingContextReady() for pg_replication_slot_advance callers? -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Apr 3, 2024 at 9:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > I'd just rename LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr > > > moveto, bool *found_consistent_snapshot) to > > > pg_logical_replication_slot_advance(XLogRecPtr moveto, bool > > > *found_consistent_snapshot) and use it. If others don't like this, I'd > > > at least turn pg_logical_replication_slot_advance(XLogRecPtr moveto) a > > > static inline function. > > > > > Yeah, we can do that but it is not a performance-sensitive routine so > > don't know if it is worth it. > > Okay for what the patch has right now. No more bikeshedding from me on this. > > > > > The slotsync worker needs to advance the slots from different databases in > > > > fast_forward. So, we need to skip this check in fast_forward mode. The analysis can > > > > be found in [3]. > > > - if (slot->data.database != MyDatabaseId) > > > + /* > > > + * We need to access the system tables during decoding to build the > > > + * logical changes unless we are in fast_forward mode where no changes are > > > + * generated. > > > + */ > > > + if (slot->data.database != MyDatabaseId && !fast_forward) > > > ereport(ERROR, > > > > > > It's not clear from the comment that we need it for a slotsync worker > > > to advance the slots from different databases. Can this be put into > > > the comment? Also, specify in the comment, why this is safe? > > > > > It is not specific to slot sync worker but specific to fast_forward > > mode. There is already a comment "We need to access the system tables > > during decoding to build the logical changes unless we are in > > fast_forward mode where no changes are generated." telling why it is > > safe. The point is we need database access to access system tables > > while generating the logical changes and in fast-forward mode, we > > don't generate logical changes so this check is not required. Do let > > me if you have a different understanding or if my understanding is > > incorrect. > > Understood. Thanks. Just curious, why isn't a problem for the existing > fast_forward mode callers pg_replication_slot_advance and > LogicalReplicationSlotHasPendingWal? > We call those after connecting to the database and the slot also belongs to that database whereas during synchronization of slots standby. the slots could be from different databases. > I quickly looked at v8, and have a nit, rest all looks good. > > + if (DecodingContextReady(ctx) && found_consistent_snapshot) > + *found_consistent_snapshot = true; > > Can the found_consistent_snapshot be checked first to help avoid the > function call DecodingContextReady() for pg_replication_slot_advance > callers? > Okay, changed. Additionally, I have updated the comments and commit message. I'll push this patch after some more testing. -- With Regards, Amit Kapila.
Attachment
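The nit being addressed amounts to a short-circuit reordering, roughly as sketched below (illustrative only; ctx is the decoding context built inside the advance function):

```c
/* Before: DecodingContextReady() runs even when the caller passed NULL. */
if (DecodingContextReady(ctx) && found_consistent_snapshot)
	*found_consistent_snapshot = true;

/*
 * After: checking the out parameter first lets callers that pass NULL
 * (such as pg_replication_slot_advance) skip the DecodingContextReady()
 * call entirely.
 */
if (found_consistent_snapshot && DecodingContextReady(ctx))
	*found_consistent_snapshot = true;
```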
On Wed, Apr 3, 2024 at 11:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > I quickly looked at v8, and have a nit, rest all looks good. > > > > + if (DecodingContextReady(ctx) && found_consistent_snapshot) > > + *found_consistent_snapshot = true; > > > > Can the found_consistent_snapshot be checked first to help avoid the > > function call DecodingContextReady() for pg_replication_slot_advance > > callers? > > > > Okay, changed. Additionally, I have updated the comments and commit > message. I'll push this patch after some more testing. > Pushed! -- With Regards, Amit Kapila.
On Wed, Apr 3, 2024 at 7:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Apr 3, 2024 at 11:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > I quickly looked at v8, and have a nit, rest all looks good. > > > > > > + if (DecodingContextReady(ctx) && found_consistent_snapshot) > > > + *found_consistent_snapshot = true; > > > > > > Can the found_consistent_snapshot be checked first to help avoid the > > > function call DecodingContextReady() for pg_replication_slot_advance > > > callers? > > > > > > > Okay, changed. Additionally, I have updated the comments and commit > > message. I'll push this patch after some more testing. > > > > Pushed! While testing this change, I realized that it could happen that the server logs are flooded with the following logical decoding logs that are written every 200 ms: 2024-04-04 16:15:19.270 JST [3838739] LOG: starting logical decoding for slot "test_sub" 2024-04-04 16:15:19.270 JST [3838739] DETAIL: Streaming transactions committing after 0/50006F48, reading WAL from 0/50006F10. 2024-04-04 16:15:19.270 JST [3838739] LOG: logical decoding found consistent point at 0/50006F10 2024-04-04 16:15:19.270 JST [3838739] DETAIL: There are no running transactions. 2024-04-04 16:15:19.477 JST [3838739] LOG: starting logical decoding for slot "test_sub" 2024-04-04 16:15:19.477 JST [3838739] DETAIL: Streaming transactions committing after 0/50006F48, reading WAL from 0/50006F10. 2024-04-04 16:15:19.477 JST [3838739] LOG: logical decoding found consistent point at 0/50006F10 2024-04-04 16:15:19.477 JST [3838739] DETAIL: There are no running transactions. For example, I could reproduce it with the following steps: 1. create the primary and start. 2. run "pgbench -i -s 100" on the primary. 3. run pg_basebackup to create the standby. 4. configure slotsync setup on the standby and start. 5. create a publication for all tables on the primary. 6. create the subscriber and start. 7. run "pgbench -i -Idtpf" on the subscriber. 8. create a subscription on the subscriber (initial data copy will start). The logical decoding logs were written every 200 ms during the initial data synchronization. Looking at the new changes for update_local_synced_slot(): if (remote_slot->confirmed_lsn != slot->data.confirmed_flush || remote_slot->restart_lsn != slot->data.restart_lsn || remote_slot->catalog_xmin != slot->data.catalog_xmin) { /* * We can't directly copy the remote slot's LSN or xmin unless there * exists a consistent snapshot at that point. Otherwise, after * promotion, the slots may not reach a consistent point before the * confirmed_flush_lsn which can lead to a data loss. To avoid data * loss, we let slot machinery advance the slot which ensures that * snapbuilder/slot statuses are updated properly. */ if (SnapBuildSnapshotExists(remote_slot->restart_lsn)) { /* * Update the slot info directly if there is a serialized snapshot * at the restart_lsn, as the slot can quickly reach consistency * at restart_lsn by restoring the snapshot. 
*/ SpinLockAcquire(&slot->mutex); slot->data.restart_lsn = remote_slot->restart_lsn; slot->data.confirmed_flush = remote_slot->confirmed_lsn; slot->data.catalog_xmin = remote_slot->catalog_xmin; slot->effective_catalog_xmin = remote_slot->catalog_xmin; SpinLockRelease(&slot->mutex); if (found_consistent_snapshot) *found_consistent_snapshot = true; } else { LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn, found_consistent_snapshot); } ReplicationSlotsComputeRequiredXmin(false); ReplicationSlotsComputeRequiredLSN(); slot_updated = true; We call LogicalSlotAdvanceAndCheckSnapState() if one of confirmed_lsn, restart_lsn, and catalog_xmin is different between the remote slot and the local slot. In my test case, during the initial sync performing, only catalog_xmin was different and there was no serialized snapshot at restart_lsn, and the slotsync worker called LogicalSlotAdvanceAndCheckSnapState(). However no slot properties were changed even after the function and it set slot_updated = true. So it starts the next slot synchronization after 200ms. It seems to me that we can skip calling LogicalSlotAdvanceAndCheckSnapState() at least when the remote and local have the same restart_lsn and confirmed_lsn. I'm not sure there are other scenarios but is it worth fixing this symptom? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
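One possible reading of that suggestion, expressed as a sketch against the snippet quoted above (this is not a proposed patch, and how a catalog_xmin-only change should be handled is exactly the open question):

```c
/*
 * Sketch: only take the copy/advance path when an LSN actually moved;
 * if only catalog_xmin differs, update it directly instead of starting
 * logical decoding on every 200 ms sync cycle.
 */
if (remote_slot->confirmed_lsn != slot->data.confirmed_flush ||
	remote_slot->restart_lsn != slot->data.restart_lsn)
{
	/* existing logic: copy from a serialized snapshot, or advance */
}
else if (remote_slot->catalog_xmin != slot->data.catalog_xmin)
{
	/* only xmin changed: no decoding needed, just carry it over */
	SpinLockAcquire(&slot->mutex);
	slot->data.catalog_xmin = remote_slot->catalog_xmin;
	slot->effective_catalog_xmin = remote_slot->catalog_xmin;
	SpinLockRelease(&slot->mutex);
}
```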
On Wed, Apr 3, 2024 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Apr 3, 2024 at 11:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > I quickly looked at v8, and have a nit, rest all looks good. > > > > > > + if (DecodingContextReady(ctx) && found_consistent_snapshot) > > > + *found_consistent_snapshot = true; > > > > > > Can the found_consistent_snapshot be checked first to help avoid the > > > function call DecodingContextReady() for pg_replication_slot_advance > > > callers? > > > > > > > Okay, changed. Additionally, I have updated the comments and commit > > message. I'll push this patch after some more testing. > > > > Pushed! There is an intermittent BF failure observed at [1] after this commit (2ec005b). Analysis: We see in BF logs, that during the first call of the sync function, restart_lsn of the synced slot is advanced to a lsn which is > than remote slot's restart_lsn. And when sync call is done second time without any further change on primary, it hits the error: ERROR: cannot synchronize local slot "lsub1_slot" LSN(0/3000060) to remote slot's LSN(0/3000028) as synchronization would move it backwards Relevant BF logs are given at [2]. This reproduces intermittently depending upon if bgwriter logs running txn record when the test is running. We were able to mimic the test case to reproduce the failure. Please see attached bf-test.txt for steps. Issue: Issue is that we are having a wrong sanity check based on 'restart_lsn' in synchronize_one_slot(): if (remote_slot->restart_lsn < slot->data.restart_lsn) elog(ERROR, ...); Prior to commit 2ec005b, this check was okay, as we did not expect restart_lsn of the synced slot to be ahead of remote since we were directly copying the lsns. But now when we use 'advance' to do logical decoding on standby, there is a possibility that restart lsn of the synced slot is ahead of remote slot, if there are running txns records found after reaching consistent-point while consuming WALs from restart_lsn till confirmed_lsn. In such a case, slot-sync's advance may end up serializing snapshots and setting restart_lsn to the serialized snapshot point, ahead of remote one. Fix: The sanity check needs to be corrected. Attached a patch to address the issue. a) The sanity check is corrected to compare confirmed_lsn rather than restart_lsn. Additional changes: b) A log has been added after LogicalSlotAdvanceAndCheckSnapState() to log the case when the local and remote slots' confirmed-lsn were not found to be the same after sync (if at all). c) Now we attempt to sync in update_local_synced_slot() if one of confirmed_lsn, restart_lsn, and catalog_xmin for remote slot is ahead of local slot instead of them just being unequal. [1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2017%3A57%3A28 [2]: 2024-04-03 18:00:41.896 UTC [3239617][client backend][0/2:0] LOG: statement: SELECT pg_sync_replication_slots(); LOG: starting logical decoding for slot "lsub1_slot" DETAIL: Streaming transactions committing after 0/0, reading WAL from 0/3000028. 
LOG: logical decoding found consistent point at 0/3000028 DEBUG: serializing snapshot to pg_logical/snapshots/0-3000060.snap DEBUG: got new restart lsn 0/3000060 at 0/3000060 LOG: newly created slot "lsub1_slot" is sync-ready now 2024-04-03 18:00:45.218 UTC [3243487][client backend][2/2:0] LOG: statement: SELECT pg_sync_replication_slots(); ERROR: cannot synchronize local slot "lsub1_slot" LSN(0/3000060) to remote slot's LSN(0/3000028) as synchronization would move it backwards thanks Shveta
Attachment
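For readers skimming the thread, the sanity-check correction described in the preceding message is roughly the following. This is a sketch based on that description, not the attached patch; the error text is abbreviated.

```c
/*
 * Sketch of the corrected sanity check in synchronize_one_slot().
 * After commit 2ec005b the synced slot's restart_lsn can legitimately
 * move ahead of the remote one (snapshot serialization during advance),
 * so the backward-movement check is based on confirmed_flush instead.
 */

/* old check: can now fire spuriously */
if (remote_slot->restart_lsn < slot->data.restart_lsn)
	elog(ERROR,
		 "cannot synchronize local slot \"%s\"",
		 remote_slot->name);

/* proposed check */
if (remote_slot->confirmed_lsn < slot->data.confirmed_flush)
	elog(ERROR,
		 "cannot synchronize local slot \"%s\"",
		 remote_slot->name);
```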
On Thu, Apr 4, 2024 at 1:55 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > While testing this change, I realized that it could happen that the > server logs are flooded with the following logical decoding logs that > are written every 200 ms: > > 2024-04-04 16:15:19.270 JST [3838739] LOG: starting logical decoding > for slot "test_sub" ... ... > > For example, I could reproduce it with the following steps: > > 1. create the primary and start. > 2. run "pgbench -i -s 100" on the primary. > 3. run pg_basebackup to create the standby. > 4. configure slotsync setup on the standby and start. > 5. create a publication for all tables on the primary. > 6. create the subscriber and start. > 7. run "pgbench -i -Idtpf" on the subscriber. > 8. create a subscription on the subscriber (initial data copy will start). > > The logical decoding logs were written every 200 ms during the initial > data synchronization. > > Looking at the new changes for update_local_synced_slot(): > > if (remote_slot->confirmed_lsn != slot->data.confirmed_flush || > remote_slot->restart_lsn != slot->data.restart_lsn || > remote_slot->catalog_xmin != slot->data.catalog_xmin) > { > /* > * We can't directly copy the remote slot's LSN or xmin unless there > * exists a consistent snapshot at that point. Otherwise, after > * promotion, the slots may not reach a consistent point before the > * confirmed_flush_lsn which can lead to a data loss. To avoid data > * loss, we let slot machinery advance the slot which ensures that > * snapbuilder/slot statuses are updated properly. > */ > if (SnapBuildSnapshotExists(remote_slot->restart_lsn)) > { > /* > * Update the slot info directly if there is a serialized snapshot > * at the restart_lsn, as the slot can quickly reach consistency > * at restart_lsn by restoring the snapshot. > */ > SpinLockAcquire(&slot->mutex); > slot->data.restart_lsn = remote_slot->restart_lsn; > slot->data.confirmed_flush = remote_slot->confirmed_lsn; > slot->data.catalog_xmin = remote_slot->catalog_xmin; > slot->effective_catalog_xmin = remote_slot->catalog_xmin; > SpinLockRelease(&slot->mutex); > > if (found_consistent_snapshot) > *found_consistent_snapshot = true; > } > else > { > LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn, > found_consistent_snapshot); > } > > ReplicationSlotsComputeRequiredXmin(false); > ReplicationSlotsComputeRequiredLSN(); > > slot_updated = true; > > We call LogicalSlotAdvanceAndCheckSnapState() if one of confirmed_lsn, > restart_lsn, and catalog_xmin is different between the remote slot and > the local slot. In my test case, during the initial sync performing, > only catalog_xmin was different and there was no serialized snapshot > at restart_lsn, and the slotsync worker called > LogicalSlotAdvanceAndCheckSnapState(). However no slot properties were > changed even after the function and it set slot_updated = true. So it > starts the next slot synchronization after 200ms. > > It seems to me that we can skip calling > LogicalSlotAdvanceAndCheckSnapState() at least when the remote and > local have the same restart_lsn and confirmed_lsn. > I think we can do that but do we know what caused catalog_xmin to be updated regularly without any change in restart/confirmed_flush LSN? I think the LSNs are not updated during the initial sync (copy) time but how catalog_xmin is getting updated for the same slot? 
BTW, if we see, we will probably accept this xmin anyway as it is due to the following code in LogicalIncreaseXminForSlot() LogicalIncreaseXminForSlot() { /* * If the client has already confirmed up to this lsn, we directly can * mark this as accepted. This can happen if we restart decoding in a * slot. */ else if (current_lsn <= slot->data.confirmed_flush) { slot->candidate_catalog_xmin = xmin; slot->candidate_xmin_lsn = current_lsn; /* our candidate can directly be used */ updated_xmin = true; } > I'm not sure there are other scenarios but is it worth fixing this symptom? > I think so but let's investigate this a bit more. BTW, while thinking on this one, I noticed that in the function LogicalConfirmReceivedLocation(), we first update the disk copy (see comment [1]) and then the in-memory value, whereas the same is not true in update_local_synced_slot() for the case when a snapshot exists. Now, do we have the same risk here in case of standby? Because I think we will use these xmins while sending the feedback message (in XLogWalRcvSendHSFeedback()). [1] /* * We have to write the changed xmin to disk *before* we change * the in-memory value, otherwise after a crash we wouldn't know * that some catalog tuples might have been removed already. -- With Regards, Amit Kapila.
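To make the skip suggested above concrete, a minimal sketch of such a guard inside update_local_synced_slot() could look like the following. The variable and function names are taken from the excerpt quoted earlier in the thread; how a lone catalog_xmin change should then be handled is exactly what is still being debated here, so treat this as an illustration rather than the eventual fix.

    /*
     * Sketch only (not a committed change): skip the decoding-based advance
     * when neither LSN has moved, so the slotsync worker does not restart
     * logical decoding every cycle just because catalog_xmin differs.
     */
    bool        lsns_changed;

    lsns_changed = (remote_slot->confirmed_lsn != slot->data.confirmed_flush ||
                    remote_slot->restart_lsn != slot->data.restart_lsn);

    if (lsns_changed)
    {
        if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
        {
            /* ... copy the LSNs and xmins directly, as in the excerpt above ... */
        }
        else
            LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
                                                found_consistent_snapshot);

        ReplicationSlotsComputeRequiredXmin(false);
        ReplicationSlotsComputeRequiredLSN();
        slot_updated = true;
    }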
On Thu, Apr 4, 2024 at 2:59 PM shveta malik <shveta.malik@gmail.com> wrote: > > > Prior to commit 2ec005b, this check was okay, as we did not expect > restart_lsn of the synced slot to be ahead of remote since we were > directly copying the lsns. But now when we use 'advance' to do logical > decoding on standby, there is a possibility that restart lsn of the > synced slot is ahead of remote slot, if there are running txns records > found after reaching consistent-point while consuming WALs from > restart_lsn till confirmed_lsn. In such a case, slot-sync's advance > may end up serializing snapshots and setting restart_lsn to the > serialized snapshot point, ahead of remote one. > > Fix: > The sanity check needs to be corrected. Attached a patch to address the issue. Please find v2 which has detailed commit-msg and some more comments in code. thanks Shveta
Attachment
Hi, On Thu, Apr 04, 2024 at 05:31:45PM +0530, shveta malik wrote: > On Thu, Apr 4, 2024 at 2:59 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > Prior to commit 2ec005b, this check was okay, as we did not expect > > restart_lsn of the synced slot to be ahead of remote since we were > > directly copying the lsns. But now when we use 'advance' to do logical > > decoding on standby, there is a possibility that restart lsn of the > > synced slot is ahead of remote slot, if there are running txns records > > found after reaching consistent-point while consuming WALs from > > restart_lsn till confirmed_lsn. In such a case, slot-sync's advance > > may end up serializing snapshots and setting restart_lsn to the > > serialized snapshot point, ahead of remote one. > > > > Fix: > > The sanity check needs to be corrected. Attached a patch to address the issue. > Thanks for reporting, explaining the issue and providing a patch. Regarding the patch: 1 === + * Attempt to sync lsns and xmins only if remote slot is ahead of local s/lsns/LSNs/? 2 === + if (slot->data.confirmed_flush != remote_slot->confirmed_lsn) + elog(LOG, + "could not synchronize local slot \"%s\" LSN(%X/%X)" + " to remote slot's LSN(%X/%X) ", + remote_slot->name, + LSN_FORMAT_ARGS(slot->data.confirmed_flush), + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn)); I don't think that the message is correct here. Unless I am missing something there is nothing in the following code path that would prevent the slot to be sync during this cycle. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Apr 5, 2024 at 9:22 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Thu, Apr 04, 2024 at 05:31:45PM +0530, shveta malik wrote: > > On Thu, Apr 4, 2024 at 2:59 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > > > > Prior to commit 2ec005b, this check was okay, as we did not expect > > > restart_lsn of the synced slot to be ahead of remote since we were > > > directly copying the lsns. But now when we use 'advance' to do logical > > > decoding on standby, there is a possibility that restart lsn of the > > > synced slot is ahead of remote slot, if there are running txns records > > > found after reaching consistent-point while consuming WALs from > > > restart_lsn till confirmed_lsn. In such a case, slot-sync's advance > > > may end up serializing snapshots and setting restart_lsn to the > > > serialized snapshot point, ahead of remote one. > > > > > > Fix: > > > The sanity check needs to be corrected. Attached a patch to address the issue. > > > > Thanks for reporting, explaining the issue and providing a patch. > > Regarding the patch: > > 1 === > > + * Attempt to sync lsns and xmins only if remote slot is ahead of local > > s/lsns/LSNs/? > > 2 === > > + if (slot->data.confirmed_flush != remote_slot->confirmed_lsn) > + elog(LOG, > + "could not synchronize local slot \"%s\" LSN(%X/%X)" > + " to remote slot's LSN(%X/%X) ", > + remote_slot->name, > + LSN_FORMAT_ARGS(slot->data.confirmed_flush), > + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn)); > > I don't think that the message is correct here. Unless I am missing something > there is nothing in the following code path that would prevent the slot to be > sync during this cycle. This is a sanity check, I will put a comment to indicate the same. We want to ensure if anything changes in future, we get correct logs to indicate that. If required, the LOG msg can be changed. Kindly suggest if you have anything better in mind. thanks Shveta
Hi, On Fri, Apr 05, 2024 at 09:43:35AM +0530, shveta malik wrote: > On Fri, Apr 5, 2024 at 9:22 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On Thu, Apr 04, 2024 at 05:31:45PM +0530, shveta malik wrote: > > > On Thu, Apr 4, 2024 at 2:59 PM shveta malik <shveta.malik@gmail.com> wrote: > > 2 === > > > > + if (slot->data.confirmed_flush != remote_slot->confirmed_lsn) > > + elog(LOG, > > + "could not synchronize local slot \"%s\" LSN(%X/%X)" > > + " to remote slot's LSN(%X/%X) ", > > + remote_slot->name, > > + LSN_FORMAT_ARGS(slot->data.confirmed_flush), > > + LSN_FORMAT_ARGS(remote_slot->confirmed_lsn)); > > > > I don't think that the message is correct here. Unless I am missing something > > there is nothing in the following code path that would prevent the slot to be > > sync during this cycle. > > This is a sanity check, I will put a comment to indicate the same. Thanks! > We > want to ensure if anything changes in future, we get correct logs to > indicate that. Right, understood that way. > If required, the LOG msg can be changed. Kindly suggest if you have > anything better in mind. > What about something like? ereport(LOG, errmsg("synchronized confirmed_flush_lsn for slot \"%s\" differs from remote slot", remote_slot->name), errdetail("Remote slot has LSN %X/%X but local slot has LSN %X/%X.", LSN_FORMAT_ARGS(remote_slot->restart_lsn), LSN_FORMAT_ARGS(slot->data.restart_lsn)); Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Apr 5, 2024 at 10:09 AM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > What about something like? > > ereport(LOG, > errmsg("synchronized confirmed_flush_lsn for slot \"%s\" differs from remote slot", > remote_slot->name), > errdetail("Remote slot has LSN %X/%X but local slot has LSN %X/%X.", > LSN_FORMAT_ARGS(remote_slot->restart_lsn), > LSN_FORMAT_ARGS(slot->data.restart_lsn)); > > Regards, +1. Better than earlier. I will update and post the patch. thanks Shveta
Hi, On Fri, Apr 05, 2024 at 04:09:01PM +0530, shveta malik wrote: > On Fri, Apr 5, 2024 at 10:09 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > What about something like? > > > > ereport(LOG, > > errmsg("synchronized confirmed_flush_lsn for slot \"%s\" differs from remote slot", > > remote_slot->name), > > errdetail("Remote slot has LSN %X/%X but local slot has LSN %X/%X.", > > LSN_FORMAT_ARGS(remote_slot->restart_lsn), > > LSN_FORMAT_ARGS(slot->data.restart_lsn)); > > > > Regards, > > +1. Better than earlier. I will update and post the patch. > Thanks! BTW, I just realized that the LSN I used in my example in the LSN_FORMAT_ARGS() are not the right ones. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Apr 5, 2024 at 4:31 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > BTW, I just realized that the LSN I used in my example in the LSN_FORMAT_ARGS() > are not the right ones. Noted. Thanks. Please find v3 with the comments addressed. thanks Shveta
Attachment
On Thu, Apr 4, 2024 at 2:59 PM shveta malik <shveta.malik@gmail.com> wrote: > > There is an intermittent BF failure observed at [1] after this commit (2ec005b). > Thanks for analyzing and providing the patch. I'll look into it. There is another BF failure [1] which I have analyzed. The main reason for failure is the following test: # Failed test 'logical slots have synced as true on standby' # at /home/bf/bf-build/serinus/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 198. # got: 'f' # expected: 't' Here, we are expecting that the two logical slots (lsub1_slot, and lsub2_slot), one created via subscription and another one via API pg_create_logical_replication_slot() are synced. The standby LOGs which are as follows show that the one created by API 'lsub2_slot' is synced but not the other one 'lsub1_slot': LOG for lsub1_slot: ================ 2024-04-05 04:37:07.421 UTC [3867682][client backend][0/2:0] DETAIL: Streaming transactions committing after 0/0, reading WAL from 0/3000060. 2024-04-05 04:37:07.421 UTC [3867682][client backend][0/2:0] STATEMENT: SELECT pg_sync_replication_slots(); 2024-04-05 04:37:07.422 UTC [3867682][client backend][0/2:0] DEBUG: xmin required by slots: data 0, catalog 740 2024-04-05 04:37:07.422 UTC [3867682][client backend][0/2:0] LOG: could not sync slot "lsub1_slot" LOG for lsub2_slot: ================ 2024-04-05 04:37:08.518 UTC [3867682][client backend][0/2:0] DEBUG: xmin required by slots: data 0, catalog 740 2024-04-05 04:37:08.769 UTC [3867682][client backend][0/2:0] LOG: newly created slot "lsub2_slot" is sync-ready now 2024-04-05 04:37:08.769 UTC [3867682][client backend][0/2:0] STATEMENT: SELECT pg_sync_replication_slots(); We can see from the log of lsub1_slot that its restart_lsn is 0/3000060 which means it will start reading from the WAL from that location. Now, if we check the publisher log, we have a running_xacts record at that location. See following LOGs: 2024-04-05 04:36:57.830 UTC [3860839][client backend][8/2:0] LOG: statement: SELECT pg_create_logical_replication_slot('lsub2_slot', 'test_decoding', false, false, true); 2024-04-05 04:36:58.718 UTC [3860839][client backend][8/2:0] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/3000060 oldest xid 740 latest complete 739 next xid 740) .... .... 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/3000098 oldest xid 740 latest complete 739 next xid 740) The first running_xact record ends at 3000060 and the second one at 3000098. So, the start location of the second running_xact is 3000060, the same can be confirmed by the following LOG line of walsender: 2024-04-05 04:37:05.144 UTC [3857385][walsender][25/0:0] DEBUG: serializing snapshot to pg_logical/snapshots/0-3000060.snap This shows that while processing running_xact at location 3000060, we have serialized the snapshot. As there is no running transaction in WAL at 3000060, ideally we should have reached a consistent state after processing that record on standby. But the reason the standby didn't process that record is that the confirmed_flush LSN is also at the same location so the function LogicalSlotAdvanceAndCheckSnapState() exits without reading the WAL at that location. 
Now, this can be confirmed by the below walsender-specific LOG in publisher: 2024-04-05 04:36:59.155 UTC [3857385][walsender][25/0:0] DEBUG: write 0/3000060 flush 0/3000060 apply 0/3000060 reply_time 2024-04-05 04:36:59.155181+00 We update the confirmed_flush location with the flush location after receiving the above feedback. You can notice that we didn't receive the feedback for the 3000098 location and hence both the confirmed_flush and restart_lsn are at the same location 0/3000060. Now, the test is waiting for the subscriber to send feedback of the last WAL write location by $primary->wait_for_catchup('regress_mysub1'); As noticed from the publisher LOGs, the query we used for wait is: SELECT '0/3000060' <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name IN ('regress_mysub1', 'walreceiver') Here, instead of '0/3000060' it should have used ''0/3000098' which is the last write location. This position we get via function pg_current_wal_lsn()->GetXLogWriteRecPtr()->LogwrtResult.Write. And this variable seems to be touched by commit c9920a9068eac2e6c8fb34988d18c0b42b9bf811. Though unlikely could c9920a9068eac2e6c8fb34988d18c0b42b9bf811 be a reason for failure? At this stage, I am not sure so just sharing with others to see if what I am saying sounds logical. I'll think more about this. [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2024-04-05%2004%3A34%3A27 -- With Regards, Amit Kapila.
On Fri, Apr 5, 2024 at 5:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Apr 4, 2024 at 2:59 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > There is an intermittent BF failure observed at [1] after this commit (2ec005b). > > > > Thanks for analyzing and providing the patch. I'll look into it. There > is another BF failure [1] which I have analyzed. The main reason for > failure is the following test: > > # Failed test 'logical slots have synced as true on standby' > # at /home/bf/bf-build/serinus/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl > line 198. > # got: 'f' > # expected: 't' > > Here, we are expecting that the two logical slots (lsub1_slot, and > lsub2_slot), one created via subscription and another one via API > pg_create_logical_replication_slot() are synced. The standby LOGs > which are as follows show that the one created by API 'lsub2_slot' is > synced but the other one 'lsub1_slot': > > LOG for lsub1_slot: > ================ > 2024-04-05 04:37:07.421 UTC [3867682][client backend][0/2:0] DETAIL: > Streaming transactions committing after 0/0, reading WAL from > 0/3000060. > 2024-04-05 04:37:07.421 UTC [3867682][client backend][0/2:0] > STATEMENT: SELECT pg_sync_replication_slots(); > 2024-04-05 04:37:07.422 UTC [3867682][client backend][0/2:0] DEBUG: > xmin required by slots: data 0, catalog 740 > 2024-04-05 04:37:07.422 UTC [3867682][client backend][0/2:0] LOG: > could not sync slot "lsub1_slot" > > LOG for lsub2_slot: > ================ > 2024-04-05 04:37:08.518 UTC [3867682][client backend][0/2:0] DEBUG: > xmin required by slots: data 0, catalog 740 > 2024-04-05 04:37:08.769 UTC [3867682][client backend][0/2:0] LOG: > newly created slot "lsub2_slot" is sync-ready now > 2024-04-05 04:37:08.769 UTC [3867682][client backend][0/2:0] > STATEMENT: SELECT pg_sync_replication_slots(); > > We can see from the log of lsub1_slot that its restart_lsn is > 0/3000060 which means it will start reading from the WAL from that > location. Now, if we check the publisher log, we have running_xacts > record at that location. See following LOGs: > > 2024-04-05 04:36:57.830 UTC [3860839][client backend][8/2:0] LOG: > statement: SELECT pg_create_logical_replication_slot('lsub2_slot', > 'test_decoding', false, false, true); > 2024-04-05 04:36:58.718 UTC [3860839][client backend][8/2:0] DEBUG: > snapshot of 0+0 running transaction ids (lsn 0/3000060 oldest xid 740 > latest complete 739 next xid 740) > .... > .... > 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG: > snapshot of 0+0 running transaction ids (lsn 0/3000098 oldest xid 740 > latest complete 739 next xid 740) > > The first running_xact record ends at 3000060 and the second one at > 3000098. So, the start location of the second running_xact is 3000060, > the same can be confirmed by the following LOG line of walsender: > > 2024-04-05 04:37:05.144 UTC [3857385][walsender][25/0:0] DEBUG: > serializing snapshot to pg_logical/snapshots/0-3000060.snap > > This shows that while processing running_xact at location 3000060, we > have serialized the snapshot. As there is no running transaction in > WAL at 3000060 so ideally we should have reached a consistent state > after processing that record on standby. But the reason standby didn't > process that LOG is that the confirmed_flush LSN is also at the same > location so the function LogicalSlotAdvanceAndCheckSnapState() exits > without reading the WAL at that location. 
Now, this can be confirmed > by the below walsender-specific LOG in publisher: > > 2024-04-05 04:36:59.155 UTC [3857385][walsender][25/0:0] DEBUG: write > 0/3000060 flush 0/3000060 apply 0/3000060 reply_time 2024-04-05 > 04:36:59.155181+00 > > We update the confirmed_flush location with the flush location after > receiving the above feedback. You can notice that we didn't receive > the feedback for the 3000098 location and hence both the > confirmed_flush and restart_lsn are at the same location 0/3000060. > Now, the test is waiting for the subscriber to send feedback of the > last WAL write location by > $primary->wait_for_catchup('regress_mysub1'); As noticed from the > publisher LOGs, the query we used for wait is: > > SELECT '0/3000060' <= replay_lsn AND state = 'streaming' > FROM pg_catalog.pg_stat_replication > WHERE application_name IN ('regress_mysub1', 'walreceiver') > > Here, instead of '0/3000060' it should have used ''0/3000098' which is > the last write location. This position we get via function > pg_current_wal_lsn()->GetXLogWriteRecPtr()->LogwrtResult.Write. And > this variable seems to be touched by commit > c9920a9068eac2e6c8fb34988d18c0b42b9bf811. Though unlikely could > c9920a9068eac2e6c8fb34988d18c0b42b9bf811 be a reason for failure? At > this stage, I am not sure so just sharing with others to see if what I > am saying sounds logical. I'll think more about this. > Thinking more on this, it doesn't seem related to c9920a9068eac2e6c8fb34988d18c0b42b9bf811 as that commit doesn't change any locking or something like that which impacts write positions. I think what has happened here is that running_xact record written by the background writer [1] is not written to the kernel or disk (see LogStandbySnapshot()), before pg_current_wal_lsn() checks the current_lsn to be compared with replayed LSN. Note that the reason why walsender has picked the running_xact written by background writer is because it has checked after pg_current_wal_lsn() query, see LOGs [2]. I think we can probably try to reproduce manually via debugger. If this theory is correct then I think we will need to use injection points to control the behavior of bgwriter or use the slots created via SQL API for syncing in tests. Thoughts? [1] - 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/3000098 oldest xid 740 latest complete 739 next xid 740) [2] - 2024-04-05 04:37:05.134 UTC [3866413][client backend][1/4:0] LOG: statement: SELECT pg_current_wal_lsn() 2024-04-05 04:37:05.144 UTC [3866413][client backend][:0] LOG: disconnection: session time: 0:00:00.021 user=bf database=postgres host=[local] 2024-04-05 04:37:05.144 UTC [3857385][walsender][25/0:0] DEBUG: serializing snapshot to pg_logical/snapshots/0-3000060.snap -- With Regards, Amit Kapila.
Hi, On Fri, Apr 05, 2024 at 06:23:10PM +0530, Amit Kapila wrote: > On Fri, Apr 5, 2024 at 5:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Thinking more on this, it doesn't seem related to > c9920a9068eac2e6c8fb34988d18c0b42b9bf811 as that commit doesn't change > any locking or something like that which impacts write positions. Agree. > I think what has happened here is that running_xact record written by > the background writer [1] is not written to the kernel or disk (see > LogStandbySnapshot()), before pg_current_wal_lsn() checks the > current_lsn to be compared with replayed LSN. Agree, I think it's not visible through pg_current_wal_lsn() yet. Also I think that the DEBUG message in LogCurrentRunningXacts() " elog(DEBUG2, "snapshot of %d+%d running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)", CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt, LSN_FORMAT_ARGS(recptr), CurrRunningXacts->oldestRunningXid, CurrRunningXacts->latestCompletedXid, CurrRunningXacts->nextXid); " should be located after the XLogSetAsyncXactLSN() call. Indeed, the new LSN is visible after the spinlock (XLogCtl->info_lck) in XLogSetAsyncXactLSN() is released, see: \watch on Session 1 provides: pg_current_wal_lsn -------------------- 0/87D110 (1 row) Until: Breakpoint 2, XLogSetAsyncXactLSN (asyncXactLSN=8900936) at xlog.c:2579 2579 XLogRecPtr WriteRqstPtr = asyncXactLSN; (gdb) n 2581 bool wakeup = false; (gdb) 2584 SpinLockAcquire(&XLogCtl->info_lck); (gdb) 2585 RefreshXLogWriteResult(LogwrtResult); (gdb) 2586 sleeping = XLogCtl->WalWriterSleeping; (gdb) 2587 prevAsyncXactLSN = XLogCtl->asyncXactLSN; (gdb) 2588 if (XLogCtl->asyncXactLSN < asyncXactLSN) (gdb) 2589 XLogCtl->asyncXactLSN = asyncXactLSN; (gdb) 2590 SpinLockRelease(&XLogCtl->info_lck); (gdb) p p/x (uint32) XLogCtl->asyncXactLSN $1 = 0x87d148 Then session 1 provides: pg_current_wal_lsn -------------------- 0/87D148 (1 row) So, when we see in the log: 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/3000098oldest xid 740 latest complete 739 next xid 740) 2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG: statement: SELECT '0/3000060' <= replay_lsn AND state= 'streaming' It's indeed possible that the new LSN was not visible yet (spinlock not released?) before the query began (because we can not rely on the time the DEBUG message has been emitted). > Note that the reason why > walsender has picked the running_xact written by background writer is > because it has checked after pg_current_wal_lsn() query, see LOGs [2]. > I think we can probably try to reproduce manually via debugger. > > If this theory is correct It think it is. > then I think we will need to use injection > points to control the behavior of bgwriter or use the slots created > via SQL API for syncing in tests. > > Thoughts? I think that maybe as a first step we should move the "elog(DEBUG2," message as proposed above to help debugging (that could help to confirm the above theory). If the theory is proven then I'm not sure we need the extra complexity of injection point here, maybe just relying on the slots created via SQL API could be enough. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On Fri, Apr 05, 2024 at 02:35:42PM +0000, Bertrand Drouvot wrote: > I think that maybe as a first step we should move the "elog(DEBUG2," message as > proposed above to help debugging (that could help to confirm the above theory). If you agree and think that makes sense, please find attached a tiny patch doing so. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
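For readers following along, the reordering being proposed for LogCurrentRunningXacts() in standby.c is roughly as below. This is a heavily simplified sketch: the real function also builds the record with the usual WAL-insertion calls and handles the subxid_overflow case; only the position of the DEBUG2 message relative to XLogSetAsyncXactLSN() (both of which appear in the excerpts above) is the point here.

    recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS);

    /* Record the new LSN for the asynchronous-flush machinery first ... */
    XLogSetAsyncXactLSN(recptr);

    /*
     * ... and only then emit the DEBUG2 message, so that by the time it
     * shows up in the log XLogSetAsyncXactLSN() has already run, which in
     * the gdb experiment above is when pg_current_wal_lsn() starts
     * reporting the new LSN.  That keeps the log timestamps from being
     * misleading when correlating them with test queries.
     */
    elog(DEBUG2,
         "snapshot of %d+%d running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)",
         CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
         LSN_FORMAT_ARGS(recptr),
         CurrRunningXacts->oldestRunningXid,
         CurrRunningXacts->latestCompletedXid,
         CurrRunningXacts->nextXid);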
On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > On Fri, Apr 05, 2024 at 06:23:10PM +0530, Amit Kapila wrote: > > On Fri, Apr 5, 2024 at 5:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Thinking more on this, it doesn't seem related to > > c9920a9068eac2e6c8fb34988d18c0b42b9bf811 as that commit doesn't change > > any locking or something like that which impacts write positions. > > Agree. > > > I think what has happened here is that running_xact record written by > > the background writer [1] is not written to the kernel or disk (see > > LogStandbySnapshot()), before pg_current_wal_lsn() checks the > > current_lsn to be compared with replayed LSN. > > Agree, I think it's not visible through pg_current_wal_lsn() yet. > > Also I think that the DEBUG message in LogCurrentRunningXacts() > > " > elog(DEBUG2, > "snapshot of %d+%d running transaction ids (lsn %X/%X oldest xid %u latest complete %u next xid %u)", > CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt, > LSN_FORMAT_ARGS(recptr), > CurrRunningXacts->oldestRunningXid, > CurrRunningXacts->latestCompletedXid, > CurrRunningXacts->nextXid); > " > > should be located after the XLogSetAsyncXactLSN() call. Indeed, the new LSN is > visible after the spinlock (XLogCtl->info_lck) in XLogSetAsyncXactLSN() is > released, > I think the new LSN can be visible only when the corresponding WAL is written by XLogWrite(). I don't know what in XLogSetAsyncXactLSN() can make it visible. In your experiment below, isn't it possible that in the meantime WAL writer has written that WAL due to which you are seeing an updated location? >see: > > \watch on Session 1 provides: > > pg_current_wal_lsn > -------------------- > 0/87D110 > (1 row) > > Until: > > Breakpoint 2, XLogSetAsyncXactLSN (asyncXactLSN=8900936) at xlog.c:2579 > 2579 XLogRecPtr WriteRqstPtr = asyncXactLSN; > (gdb) n > 2581 bool wakeup = false; > (gdb) > 2584 SpinLockAcquire(&XLogCtl->info_lck); > (gdb) > 2585 RefreshXLogWriteResult(LogwrtResult); > (gdb) > 2586 sleeping = XLogCtl->WalWriterSleeping; > (gdb) > 2587 prevAsyncXactLSN = XLogCtl->asyncXactLSN; > (gdb) > 2588 if (XLogCtl->asyncXactLSN < asyncXactLSN) > (gdb) > 2589 XLogCtl->asyncXactLSN = asyncXactLSN; > (gdb) > 2590 SpinLockRelease(&XLogCtl->info_lck); > (gdb) p p/x (uint32) XLogCtl->asyncXactLSN > $1 = 0x87d148 > > Then session 1 provides: > > pg_current_wal_lsn > -------------------- > 0/87D148 > (1 row) > > So, when we see in the log: > > 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/3000098oldest xid 740 latest complete 739 next xid 740) > 2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG: statement: SELECT '0/3000060' <= replay_lsn AND state= 'streaming' > > It's indeed possible that the new LSN was not visible yet (spinlock not released?) > before the query began (because we can not rely on the time the DEBUG message has > been emitted). > > > Note that the reason why > > walsender has picked the running_xact written by background writer is > > because it has checked after pg_current_wal_lsn() query, see LOGs [2]. > > I think we can probably try to reproduce manually via debugger. > > > > If this theory is correct > > It think it is. > > > then I think we will need to use injection > > points to control the behavior of bgwriter or use the slots created > > via SQL API for syncing in tests. > > > > Thoughts? 
> > I think that maybe as a first step we should move the "elog(DEBUG2," message as > proposed above to help debugging (that could help to confirm the above theory). > I think I am missing how exactly moving DEBUG2 can confirm the above theory. > If the theory is proven then I'm not sure we need the extra complexity of > injection point here, maybe just relying on the slots created via SQL API could > be enough. > Yeah, that could be the first step. We can probably add an injection point to control the bgwrite behavior and then add tests involving walsender performing the decoding. But I think it is important to have sufficient tests in this area as I see they are quite helpful in uncovering the issues. -- With Regards, Amit Kapila.
On Sat, Apr 6, 2024 at 10:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > There are still a few pending issues to be fixed in this feature but otherwise, we have committed all the main patches, so I marked the CF entry corresponding to this work as committed. -- With Regards, Amit Kapila.
Hi, On Sat, Apr 06, 2024 at 10:13:00AM +0530, Amit Kapila wrote: > On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > I think the new LSN can be visible only when the corresponding WAL is > written by XLogWrite(). I don't know what in XLogSetAsyncXactLSN() can > make it visible. In your experiment below, isn't it possible that in > the meantime WAL writer has written that WAL due to which you are > seeing an updated location? What I did is: session 1: select pg_current_wal_lsn();\watch 1 session 2: select pg_backend_pid(); terminal 1: tail -f logfile | grep -i snap terminal 2: gdb -p <backend pid of session 2> -ex 'b LogCurrentRunningXacts' + "continue" once in gdb session 2: SELECT pg_log_standby_snapshot(); That produces a break in the gdb session, then: Breakpoint 1, LogCurrentRunningXacts (CurrRunningXacts=0x5774f92f8da0 <CurrentRunningXactsData.13>) at standby.c:1346 1346 { (gdb) n 1350 Then next, next until the DEBUG message is emitted (confirmed in terminal 1). At this stage the DEBUG message shows the new LSN while session 1 still displays the previous LSN. Then once XLogSetAsyncXactLSN() is done in the gdb session (terminal 2), session 1 displays the new LSN. This is reproducible as desired. With more debugging I can see that when the spinlock is released in XLogSetAsyncXactLSN() then XLogWrite() is doing its job and then session 1 does see the new value (that happens in this order, and as you said that's expected). My point is that while the DEBUG message is emitted, session 1 still sees the old LSN (until the new LSN is visible). I think that we should emit the DEBUG message once session 1 can see the new value (if not, I think the timestamp of the DEBUG message can be misleading for debugging purposes). > I think I am missing how exactly moving DEBUG2 can confirm the above theory. I meant to say that instead of seeing: 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/3000098 oldest xid 740 latest complete 739 next xid 740) 2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG: statement: SELECT '0/3000060' <= replay_lsn AND state = 'streaming' We would probably see something like: 2024-04-05 04:37:05.<something> UTC [3866475][client backend][2/4:0] LOG: statement: SELECT '0/3000060' <= replay_lsn AND state = 'streaming' 2024-04-05 04:37:05.<something>+xx UTC [3854278][background writer][:0] DEBUG: snapshot of 0+0 running transaction ids (lsn 0/3000098 oldest xid 740 latest complete 739 next xid 740) And then it would be clear that the query has run before the new LSN is visible. > > If the theory is proven then I'm not sure we need the extra complexity of > > injection point here, maybe just relying on the slots created via SQL API could > > be enough. > > > > Yeah, that could be the first step. We can probably add an injection > point to control the bgwrite behavior and then add tests involving > walsender performing the decoding. But I think it is important to have > sufficient tests in this area as I see they are quite helpful in > uncovering the issues. > Yeah agree. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 2024-04-06 10:58:32 +0530, Amit Kapila wrote: > On Sat, Apr 6, 2024 at 10:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > There are still a few pending issues to be fixed in this feature but > otherwise, we have committed all the main patches, so I marked the CF > entry corresponding to this work as committed. There are a a fair number of failures of 040_standby_failover_slots_sync in the buildfarm. It'd be nice to get those fixed soon-ish. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-04-06%2020%3A58%3A50 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-04-06%2015%3A18%3A08 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-04-06%2010%3A13%3A58 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-04-05%2016%3A04%3A10 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-04-05%2014%3A59%3A40 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-05%2014%3A59%3A07 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-04-05%2014%3A18%3A07 The symptoms are similar, but not entirely identical across all of them, I think. I've also seen a bunch of failures of this test locally. Greetings, Andres Freund
On Sun, Apr 7, 2024 at 3:06 AM Andres Freund <andres@anarazel.de> wrote: > > On 2024-04-06 10:58:32 +0530, Amit Kapila wrote: > > On Sat, Apr 6, 2024 at 10:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > There are still a few pending issues to be fixed in this feature but > > otherwise, we have committed all the main patches, so I marked the CF > > entry corresponding to this work as committed. > > There are a a fair number of failures of 040_standby_failover_slots_sync in > the buildfarm. It'd be nice to get those fixed soon-ish. > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-04-06%2020%3A58%3A50 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-04-06%2015%3A18%3A08 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-04-06%2010%3A13%3A58 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-04-05%2016%3A04%3A10 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-04-05%2014%3A59%3A40 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-05%2014%3A59%3A07 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-04-05%2014%3A18%3A07 > > The symptoms are similar, but not entirely identical across all of them, I think. > I have analyzed these failures and there are two different tests that are failing but the underlying reason is the same as being discussed with Bertrand. We are working on the fix. -- With Regards, Amit Kapila.
On Saturday, April 6, 2024 12:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > On Fri, Apr 05, 2024 at 06:23:10PM +0530, Amit Kapila wrote: > > > On Fri, Apr 5, 2024 at 5:17 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > Thinking more on this, it doesn't seem related to > > > c9920a9068eac2e6c8fb34988d18c0b42b9bf811 as that commit doesn't > > > change any locking or something like that which impacts write positions. > > > > Agree. > > > > > I think what has happened here is that running_xact record written > > > by the background writer [1] is not written to the kernel or disk > > > (see LogStandbySnapshot()), before pg_current_wal_lsn() checks the > > > current_lsn to be compared with replayed LSN. > > > > Agree, I think it's not visible through pg_current_wal_lsn() yet. > > > > Also I think that the DEBUG message in LogCurrentRunningXacts() > > > > " > > elog(DEBUG2, > > "snapshot of %d+%d running transaction ids (lsn %X/%X oldest > xid %u latest complete %u next xid %u)", > > CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt, > > LSN_FORMAT_ARGS(recptr), > > CurrRunningXacts->oldestRunningXid, > > CurrRunningXacts->latestCompletedXid, > > CurrRunningXacts->nextXid); " > > > > should be located after the XLogSetAsyncXactLSN() call. Indeed, the > > new LSN is visible after the spinlock (XLogCtl->info_lck) in > > XLogSetAsyncXactLSN() is released, > > > > I think the new LSN can be visible only when the corresponding WAL is written > by XLogWrite(). I don't know what in XLogSetAsyncXactLSN() can make it > visible. In your experiment below, isn't it possible that in the meantime WAL > writer has written that WAL due to which you are seeing an updated location? > > >see: > > > > \watch on Session 1 provides: > > > > pg_current_wal_lsn > > -------------------- > > 0/87D110 > > (1 row) > > > > Until: > > > > Breakpoint 2, XLogSetAsyncXactLSN (asyncXactLSN=8900936) at > xlog.c:2579 > > 2579 XLogRecPtr WriteRqstPtr = asyncXactLSN; > > (gdb) n > > 2581 bool wakeup = false; > > (gdb) > > 2584 SpinLockAcquire(&XLogCtl->info_lck); > > (gdb) > > 2585 RefreshXLogWriteResult(LogwrtResult); > > (gdb) > > 2586 sleeping = XLogCtl->WalWriterSleeping; > > (gdb) > > 2587 prevAsyncXactLSN = XLogCtl->asyncXactLSN; > > (gdb) > > 2588 if (XLogCtl->asyncXactLSN < asyncXactLSN) > > (gdb) > > 2589 XLogCtl->asyncXactLSN = asyncXactLSN; > > (gdb) > > 2590 SpinLockRelease(&XLogCtl->info_lck); > > (gdb) p p/x (uint32) XLogCtl->asyncXactLSN > > $1 = 0x87d148 > > > > Then session 1 provides: > > > > pg_current_wal_lsn > > -------------------- > > 0/87D148 > > (1 row) > > > > So, when we see in the log: > > > > 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG: > > snapshot of 0+0 running transaction ids (lsn 0/3000098 oldest xid 740 > > latest complete 739 next xid 740) > > 2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG: > statement: SELECT '0/3000060' <= replay_lsn AND state = 'streaming' > > > > It's indeed possible that the new LSN was not visible yet (spinlock > > not released?) before the query began (because we can not rely on the > > time the DEBUG message has been emitted). > > > > > Note that the reason why > > > walsender has picked the running_xact written by background writer > > > is because it has checked after pg_current_wal_lsn() query, see LOGs [2]. > > > I think we can probably try to reproduce manually via debugger. 
> > > > > > If this theory is correct > > > > It think it is. > > > > > then I think we will need to use injection points to control the > > > behavior of bgwriter or use the slots created via SQL API for > > > syncing in tests. > > > > > > Thoughts? > > > > I think that maybe as a first step we should move the "elog(DEBUG2," > > message as proposed above to help debugging (that could help to confirm > the above theory). > > > > I think I am missing how exactly moving DEBUG2 can confirm the above theory. > > > If the theory is proven then I'm not sure we need the extra complexity > > of injection point here, maybe just relying on the slots created via > > SQL API could be enough. > > > > Yeah, that could be the first step. We can probably add an injection point to > control the bgwrite behavior and then add tests involving walsender > performing the decoding. But I think it is important to have sufficient tests in > this area as I see they are quite helpful in uncovering the issues. Here is the patch to drop the subscription in the beginning so that the restart_lsn of the lsub1_slot won't be advanced due to concurrent xl_running_xacts from bgwriter. The subscription will be re-created after all the slots are sync-ready. I think maybe we can use this to stabilize the test as a first step and then think about how to make use of injection point to add more tests if it's worth it. Best Regards, Hou zj
Attachment
On Mon, Apr 8, 2024 at 12:19 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Saturday, April 6, 2024 12:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > Yeah, that could be the first step. We can probably add an injection point to > > control the bgwrite behavior and then add tests involving walsender > > performing the decoding. But I think it is important to have sufficient tests in > > this area as I see they are quite helpful in uncovering the issues. > > Here is the patch to drop the subscription in the beginning so that the > restart_lsn of the lsub1_slot won't be advanced due to concurrent > xl_running_xacts from bgwriter. The subscription will be re-created after all > the slots are sync-ready. I think maybe we can use this to stabilize the test > as a first step and then think about how to make use of injection point to add > more tests if it's worth it. > Pushed. -- With Regards, Amit Kapila.
On Monday, April 8, 2024 6:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 8, 2024 at 12:19 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > On Saturday, April 6, 2024 12:43 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot > > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > Yeah, that could be the first step. We can probably add an injection > > > point to control the bgwrite behavior and then add tests involving > > > walsender performing the decoding. But I think it is important to > > > have sufficient tests in this area as I see they are quite helpful in uncovering > the issues. > > > > Here is the patch to drop the subscription in the beginning so that > > the restart_lsn of the lsub1_slot won't be advanced due to concurrent > > xl_running_xacts from bgwriter. The subscription will be re-created > > after all the slots are sync-ready. I think maybe we can use this to > > stabilize the test as a first step and then think about how to make > > use of injection point to add more tests if it's worth it. > > > > Pushed. Thanks for pushing. I checked the BF status, and noticed one BF failure, which I think is related to a miss in the test code. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-08%2012%3A04%3A27 From the following log, I can see the sync failed because the standby is lagging behind of the failover slot. ----- # No postmaster PID for node "cascading_standby" error running SQL: 'psql:<stdin>:1: ERROR: skipping slot synchronization as the received slot sync LSN 0/4000148 for slot"snap_test_slot" is ahead of the standby position 0/4000114' while running 'psql -XAtq -d port=50074 host=/tmp/t4HQFlrDmI dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'SELECTpg_sync_replication_slots();' at /home/bf/bf-build/adder/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line2042. # Postmaster PID for node "publisher" is 3715298 ----- I think it's because we missed to call wait_for_replay_catchup before syncing slots. ----- $primary->safe_psql('postgres', "SELECT pg_create_logical_replication_slot('snap_test_slot', 'test_decoding', false, false, true);" ); # ? missed to wait here $standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();"); ----- While testing, I noticed another place where we were calling wait_for_replay_catchup before doing pg_replication_slot_advance, which also has a small possibility to cause the failover slot to be ahead of the standby if some logs are written in between these two steps. So, I adjusted them together. Here is a small patch to improve the test. Best Regards, Hou zj
Attachment
Hi, On 2024-04-08 16:01:41 +0530, Amit Kapila wrote: > Pushed. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-08%2012%3A04%3A27 This unfortunately is a commit after commit 6f3d8d5e7cc Author: Amit Kapila <akapila@postgresql.org> Date: 2024-04-08 13:21:55 +0530 Fix the intermittent buildfarm failures in 040_standby_failover_slots_sync. Greetings, Andres
On Mon, Apr 8, 2024 at 9:49 PM Andres Freund <andres@anarazel.de> wrote: > > On 2024-04-08 16:01:41 +0530, Amit Kapila wrote: > > Pushed. > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-08%2012%3A04%3A27 > > This unfortunately is a commit after > Right, and thanks for the report. Hou-San has analyzed and shared the patch [1] for this yesterday. I'll review it today. [1] - https://www.postgresql.org/message-id/OS0PR01MB571665359F2F5DCD3ADABC9F94002%40OS0PR01MB5716.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
On Mon, Apr 8, 2024 at 7:01 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Thanks for pushing. > > I checked the BF status, and noticed one BF failure, which I think is related to > a miss in the test code. > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-08%2012%3A04%3A27 > > From the following log, I can see the sync failed because the standby is > lagging behind of the failover slot. > > ----- > # No postmaster PID for node "cascading_standby" > error running SQL: 'psql:<stdin>:1: ERROR: skipping slot synchronization as the received slot sync LSN 0/4000148 for slot"snap_test_slot" is ahead of the standby position 0/4000114' > while running 'psql -XAtq -d port=50074 host=/tmp/t4HQFlrDmI dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'SELECTpg_sync_replication_slots();' at /home/bf/bf-build/adder/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line2042. > # Postmaster PID for node "publisher" is 3715298 > ----- > > I think it's because we missed to call wait_for_replay_catchup before syncing > slots. > > ----- > $primary->safe_psql('postgres', > "SELECT pg_create_logical_replication_slot('snap_test_slot', 'test_decoding', false, false, true);" > ); > # ? missed to wait here > $standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();"); > ----- > > While testing, I noticed another place where we were calling > wait_for_replay_catchup before doing pg_replication_slot_advance, which also has > a small possibility to cause the failover slot to be ahead of the standby if > some logs are written in between these two steps. So, I adjusted them together. > > Here is a small patch to improve the test. > LGTM. I'll push this tomorrow morning unless there are any more comments or suggestions. -- With Regards, Amit Kapila.
On Thursday, April 4, 2024 4:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: Hi, > On Wed, Apr 3, 2024 at 7:06 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Wed, Apr 3, 2024 at 11:13 AM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy > > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > > I quickly looked at v8, and have a nit, rest all looks good. > > > > > > > > + if (DecodingContextReady(ctx) && > found_consistent_snapshot) > > > > + *found_consistent_snapshot = true; > > > > > > > > Can the found_consistent_snapshot be checked first to help avoid > > > > the function call DecodingContextReady() for > > > > pg_replication_slot_advance callers? > > > > > > > > > > Okay, changed. Additionally, I have updated the comments and commit > > > message. I'll push this patch after some more testing. > > > > > > > Pushed! > > While testing this change, I realized that it could happen that the server logs > are flooded with the following logical decoding logs that are written every 200 > ms: Thanks for reporting! > > 2024-04-04 16:15:19.270 JST [3838739] LOG: starting logical decoding for slot > "test_sub" > 2024-04-04 16:15:19.270 JST [3838739] DETAIL: Streaming transactions > committing after 0/50006F48, reading WAL from 0/50006F10. > 2024-04-04 16:15:19.270 JST [3838739] LOG: logical decoding found > consistent point at 0/50006F10 > 2024-04-04 16:15:19.270 JST [3838739] DETAIL: There are no running > transactions. > 2024-04-04 16:15:19.477 JST [3838739] LOG: starting logical decoding for slot > "test_sub" > 2024-04-04 16:15:19.477 JST [3838739] DETAIL: Streaming transactions > committing after 0/50006F48, reading WAL from 0/50006F10. > 2024-04-04 16:15:19.477 JST [3838739] LOG: logical decoding found > consistent point at 0/50006F10 > 2024-04-04 16:15:19.477 JST [3838739] DETAIL: There are no running > transactions. > > For example, I could reproduce it with the following steps: > > 1. create the primary and start. > 2. run "pgbench -i -s 100" on the primary. > 3. run pg_basebackup to create the standby. > 4. configure slotsync setup on the standby and start. > 5. create a publication for all tables on the primary. > 6. create the subscriber and start. > 7. run "pgbench -i -Idtpf" on the subscriber. > 8. create a subscription on the subscriber (initial data copy will start). > > The logical decoding logs were written every 200 ms during the initial data > synchronization. > > Looking at the new changes for update_local_synced_slot(): ... > We call LogicalSlotAdvanceAndCheckSnapState() if one of confirmed_lsn, > restart_lsn, and catalog_xmin is different between the remote slot and the local > slot. In my test case, during the initial sync performing, only catalog_xmin was > different and there was no serialized snapshot at restart_lsn, and the slotsync > worker called LogicalSlotAdvanceAndCheckSnapState(). However no slot > properties were changed even after the function and it set slot_updated = true. > So it starts the next slot synchronization after 200ms. I was trying to reproduce this and check why the catalog_xmin is different among synced slot and remote slot, but I was not able to reproduce the case where there are lots of logical decoding logs. The script I used is attached. Would it be possible for you to share the script you used to reproduce this issue? 
Alternatively, could you please share the log files from both the primary and standby servers after reproducing the problem (it would be greatly helpful if you could set the log level to DEBUG2). Best Regards, Hou zj
Attachment
On Thursday, April 4, 2024 5:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > BTW, while thinking on this one, I > noticed that in the function LogicalConfirmReceivedLocation(), we first update > the disk copy, see comment [1] and then in-memory whereas the same is not > true in > update_local_synced_slot() for the case when snapshot exists. Now, do we have > the same risk here in case of standby? Because I think we will use these xmins > while sending the feedback message (in XLogWalRcvSendHSFeedback()). > > * We have to write the changed xmin to disk *before* we change > * the in-memory value, otherwise after a crash we wouldn't know > * that some catalog tuples might have been removed already. Yes, I think we have the risk on the standby, I can reproduce the case that if the server crashes after updating the in-memory value and before saving them to disk, the synced slot could be invalidated after restarting from crash, because the necessary rows have been removed on the primary. The steps can be found in [1]. I think we'd better fix the order in update_local_synced_slot() as well. I tried to make the fix in 0002, 0001 is Shveta's patch to fix another issue in this thread. Since they are touching the same function, so attach them together for review. [1] -- Primary: SELECT 'init' FROM pg_create_logical_replication_slot('logicalslot', 'test_decoding', false, false, true); -- Standby: SELECT 'init' FROM pg_create_logical_replication_slot('standbylogicalslot', 'test_decoding', false, false, false); SELECT pg_sync_replication_slots(); -- Primary: CREATE TABLE test (a int); INSERT INTO test VALUES(1); DROP TABLE test; SELECT txid_current(); SELECT txid_current(); SELECT txid_current(); SELECT pg_log_standby_snapshot(); SELECT pg_replication_slot_advance('logicalslot', pg_current_wal_lsn()); -- Standby: - wait for standby to replay all the changes on the primary. - this is to serialize snapshots. SELECT pg_replication_slot_advance('standbylogicalslot', pg_last_wal_replay_lsn()); - Use gdb to stop at the place after calling ReplicationSlotsComputexx() functions and before calling ReplicationSlotSave(). SELECT pg_sync_replication_slots(); -- Primary: - First, wait for the primary slot(the physical slot)'s catalog xmin to be updated to the same as the failover slot. VACUUM FULL; - Wait for VACUMM FULL to be replayed on standby. -- Standby: - For the process which is blocked by gdb, let the process crash (elog(PANIC, ...)). After restarting the standby from crash, we can see the synced slot is invalidated. LOG: invalidating obsolete replication slot "logicalslot" DETAIL: The slot conflicted with xid horizon 741. CONTEXT: WAL redo at 0/3059B90 for Heap2/PRUNE_ON_ACCESS: snapshotConflictHorizon: 741, isCatalogRel: T, nplans: 0, nredirected:0, ndead: 7, nunused: 0, dead: [22, 23, 24, 25, 26, 27, 28]; blkref #0: rel 1663/5/1249, blk 16 Best Regards, Hou zj
Attachment
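A rough sketch of the ordering being argued for in the snapshot-exists branch of update_local_synced_slot() is shown below. The field names come from the excerpts quoted earlier, and the dirty/save pair follows the pattern LogicalConfirmReceivedLocation() already uses (assuming, as in the existing code, that the synced slot is the currently acquired one); read it as an illustration of the intended ordering, not as the exact patch.

    /* Stage the new values in the slot's persistent data first. */
    SpinLockAcquire(&slot->mutex);
    slot->data.restart_lsn = remote_slot->restart_lsn;
    slot->data.confirmed_flush = remote_slot->confirmed_lsn;
    slot->data.catalog_xmin = remote_slot->catalog_xmin;
    SpinLockRelease(&slot->mutex);

    /*
     * Write the changed xmin to disk *before* letting it take effect in
     * memory; otherwise a crash between the two steps could leave the
     * synced slot depending on catalog rows the primary has already
     * allowed to be removed (the scenario reproduced above).
     */
    ReplicationSlotMarkDirty();
    ReplicationSlotSave();

    /* Now it is safe to expose the new horizon. */
    SpinLockAcquire(&slot->mutex);
    slot->effective_catalog_xmin = remote_slot->catalog_xmin;
    SpinLockRelease(&slot->mutex);

    ReplicationSlotsComputeRequiredXmin(false);
    ReplicationSlotsComputeRequiredLSN();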
On Wed, Apr 10, 2024 at 5:28 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thursday, April 4, 2024 5:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > BTW, while thinking on this one, I > > noticed that in the function LogicalConfirmReceivedLocation(), we first update > > the disk copy, see comment [1] and then in-memory whereas the same is not > > true in > > update_local_synced_slot() for the case when snapshot exists. Now, do we have > > the same risk here in case of standby? Because I think we will use these xmins > > while sending the feedback message (in XLogWalRcvSendHSFeedback()). > > > > * We have to write the changed xmin to disk *before* we change > > * the in-memory value, otherwise after a crash we wouldn't know > > * that some catalog tuples might have been removed already. > > Yes, I think we have the risk on the standby, I can reproduce the case that if > the server crashes after updating the in-memory value and before saving them to > disk, the synced slot could be invalidated after restarting from crash, because > the necessary rows have been removed on the primary. The steps can be found in > [1]. > > I think we'd better fix the order in update_local_synced_slot() as well. I > tried to make the fix in 0002, 0001 is Shveta's patch to fix another issue in this thread. Since > they are touching the same function, so attach them together for review. > Few comments: =============== 1. + + /* Sanity check */ + if (slot->data.confirmed_flush != remote_slot->confirmed_lsn) + ereport(LOG, + errmsg("synchronized confirmed_flush for slot \"%s\" differs from remote slot", + remote_slot->name), Is there a reason to use elevel as LOG instead of ERROR? I think it should be elog(ERROR, .. as this is an unexpected case. 2. - if (remote_slot->restart_lsn < slot->data.restart_lsn) + if (remote_slot->confirmed_lsn < slot->data.confirmed_flush) elog(ERROR, "cannot synchronize local slot \"%s\" LSN(%X/%X)" Can we be more specific in this message? How about splitting it into error_message as "cannot synchronize local slot \"%s\"" and then errdetail as "Local slot's start streaming location LSN(%X/%X) is ahead of remote slot's LSN(%X/%X)"? -- With Regards, Amit Kapila.
On Thursday, April 11, 2024 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Apr 10, 2024 at 5:28 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > On Thursday, April 4, 2024 5:37 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > BTW, while thinking on this one, I > > > noticed that in the function LogicalConfirmReceivedLocation(), we > > > first update the disk copy, see comment [1] and then in-memory > > > whereas the same is not true in > > > update_local_synced_slot() for the case when snapshot exists. Now, > > > do we have the same risk here in case of standby? Because I think we > > > will use these xmins while sending the feedback message (in > XLogWalRcvSendHSFeedback()). > > > > > > * We have to write the changed xmin to disk *before* we change > > > * the in-memory value, otherwise after a crash we wouldn't know > > > * that some catalog tuples might have been removed already. > > > > Yes, I think we have the risk on the standby, I can reproduce the case > > that if the server crashes after updating the in-memory value and > > before saving them to disk, the synced slot could be invalidated after > > restarting from crash, because the necessary rows have been removed on > > the primary. The steps can be found in [1]. > > > > I think we'd better fix the order in update_local_synced_slot() as > > well. I tried to make the fix in 0002, 0001 is Shveta's patch to fix > > another issue in this thread. Since they are touching the same function, so > attach them together for review. > > > > Few comments: > =============== > 1. > + > + /* Sanity check */ > + if (slot->data.confirmed_flush != remote_slot->confirmed_lsn) > + ereport(LOG, errmsg("synchronized confirmed_flush for slot \"%s\" > + differs from > remote slot", > + remote_slot->name), > > Is there a reason to use elevel as LOG instead of ERROR? I think it should be > elog(ERROR, .. as this is an unexpected case. Agreed. > > 2. > - if (remote_slot->restart_lsn < slot->data.restart_lsn) > + if (remote_slot->confirmed_lsn < slot->data.confirmed_flush) > elog(ERROR, > "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > Can we be more specific in this message? How about splitting it into > error_message as "cannot synchronize local slot \"%s\"" and then errdetail as > "Local slot's start streaming location LSN(%X/%X) is ahead of remote slot's > LSN(%X/%X)"? Your version looks better. Since the above two messages all have errdetail, I used the style of ereport(ERROR, errmsg_internal(), errdetail_internal()... in the patch which is equal to the elog(ERROR but has an additional detail message. Here is V5 patch set. Best Regards, Hou zj
Attachment
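To make the agreed wording concrete, here is a minimal sketch of the errmsg/errdetail style being adopted; the field names are taken from the snippets quoted above, and the committed code may differ in detail:

    if (remote_slot->confirmed_lsn < slot->data.confirmed_flush)
        ereport(ERROR,
                errmsg_internal("cannot synchronize local slot \"%s\"",
                                remote_slot->name),
                errdetail_internal("Local slot's start streaming location LSN(%X/%X) is ahead of remote slot's LSN(%X/%X).",
                                   LSN_FORMAT_ARGS(slot->data.confirmed_flush),
                                   LSN_FORMAT_ARGS(remote_slot->confirmed_lsn)));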
On Thu, Apr 11, 2024 at 5:04 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thursday, April 11, 2024 12:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > 2. > > - if (remote_slot->restart_lsn < slot->data.restart_lsn) > > + if (remote_slot->confirmed_lsn < slot->data.confirmed_flush) > > elog(ERROR, > > "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > > > Can we be more specific in this message? How about splitting it into > > error_message as "cannot synchronize local slot \"%s\"" and then errdetail as > > "Local slot's start streaming location LSN(%X/%X) is ahead of remote slot's > > LSN(%X/%X)"? > > Your version looks better. Since the above two messages all have errdetail, I > used the style of ereport(ERROR, errmsg_internal(), errdetail_internal()... in > the patch which is equal to the elog(ERROR but has an additional detail message. > makes sense. > Here is V5 patch set. > I think we should move the check to not advance slot when one of remote_slot's restart_lsn or catalog_xmin is lesser than the local slot inside update_local_synced_slot() as we want to prevent updating slot in those cases even during slot synchronization. -- With Regards, Amit Kapila.
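A rough sketch of the guard being suggested inside update_local_synced_slot(); the field names follow the snippets quoted in this thread, and whether the function should skip the update or raise an error in this case is part of what the patch has to settle:

    /*
     * Do not let slot synchronization move the local slot backwards: bail
     * out if the remote slot's restart_lsn or catalog_xmin is behind the
     * local slot's values.
     */
    if (remote_slot->restart_lsn < slot->data.restart_lsn ||
        TransactionIdPrecedes(remote_slot->catalog_xmin,
                              slot->data.catalog_xmin))
        return false;   /* nothing to update */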
On Friday, April 12, 2024 11:31 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Apr 11, 2024 at 5:04 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > On Thursday, April 11, 2024 12:11 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > > > 2. > > > - if (remote_slot->restart_lsn < slot->data.restart_lsn) > > > + if (remote_slot->confirmed_lsn < slot->data.confirmed_flush) > > > elog(ERROR, > > > "cannot synchronize local slot \"%s\" LSN(%X/%X)" > > > > > > Can we be more specific in this message? How about splitting it into > > > error_message as "cannot synchronize local slot \"%s\"" and then > > > errdetail as "Local slot's start streaming location LSN(%X/%X) is > > > ahead of remote slot's LSN(%X/%X)"? > > > > Your version looks better. Since the above two messages all have > > errdetail, I used the style of ereport(ERROR, errmsg_internal(), > > errdetail_internal()... in the patch which is equal to the elog(ERROR but has an > additional detail message. > > > > makes sense. > > > Here is V5 patch set. > > > > I think we should move the check to not advance slot when one of > remote_slot's restart_lsn or catalog_xmin is lesser than the local slot inside > update_local_synced_slot() as we want to prevent updating slot in those cases > even during slot synchronization. Agreed. Here is the V6 patch which addressed this. I have merged the two patches into one. Best Regards, Hou zj
Attachment
On Friday, March 15, 2024 10:45 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Thu, Mar 14, 2024 at 02:22:44AM +0000, Zhijie Hou (Fujitsu) wrote: > > Hi, > > > > Since the standby_slot_names patch has been committed, I am attaching > > the last doc patch for review. > > > > Thanks! > > 1 === > > + continue subscribing to publications now on the new primary server > without > + any data loss. > > I think "without any data loss" should be re-worded in this context. Data loss in > the sense "data committed on the primary and not visible on the subscriber in > case of failover" can still occurs (in case synchronous replication is not used). > > 2 === > > + If the result (<literal>failover_ready</literal>) of both above steps is > + true, existing subscriptions will be able to continue without data loss. > + </para> > > I don't think that's true if synchronous replication is not used. Say, > > - synchronous replication is not used > - primary is not able to reach the standby anymore and standby_slot_names is > set > - new data is inserted into the primary > - then not replicated to subscriber (due to standby_slot_names) > > Then I think the both above steps will return true but data would be lost in case > of failover. Thanks for the comments, attach the new version patch which reworded the above places. Best Regards, Hou zj
Attachment
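As background for the synchronous-replication caveat discussed above, the window described can be closed by making the physical standby a synchronous standby on the primary. A minimal, illustrative primary-side configuration; the application_name 'standby1' is an assumed example, not something from the patch:

    # postgresql.conf on the primary -- illustrative only
    synchronous_commit = on
    synchronous_standby_names = 'standby1'   # application_name used by the physical standby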
On Mon, Apr 29, 2024 at 10:57 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Friday, March 15, 2024 10:45 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > > > Hi, > > > > On Thu, Mar 14, 2024 at 02:22:44AM +0000, Zhijie Hou (Fujitsu) wrote: > > > Hi, > > > > > > Since the standby_slot_names patch has been committed, I am attaching > > > the last doc patch for review. > > > > > > > Thanks! > > > > 1 === > > > > + continue subscribing to publications now on the new primary server > without > + any data loss. > > > > I think "without any data loss" should be re-worded in this context. Data loss in > > the sense "data committed on the primary and not visible on the subscriber in > > case of failover" can still occurs (in case synchronous replication is not used). > > > > 2 === > > > > + If the result (<literal>failover_ready</literal>) of both above steps is > > + true, existing subscriptions will be able to continue without data loss. > > + </para> > > > > I don't think that's true if synchronous replication is not used. Say, > > > > - synchronous replication is not used > > - primary is not able to reach the standby anymore and standby_slot_names is > > set > > - new data is inserted into the primary > > - then not replicated to subscriber (due to standby_slot_names) > > > > Then I think the both above steps will return true but data would be lost in case > > of failover. > > Thanks for the comments, attach the new version patch which reworded the > above places. Thanks for the patch. A few comments: 1) Tested the steps; one of the queries still refers to 'conflict_reason'. I think it should refer to 'conflicting'. 2) Will it be good to mention that, in the case of a planned promotion, it is recommended to run pg_sync_replication_slots() as a last sync attempt before we run the failover-ready validation steps? This can be mentioned in high-availability.sgml of the current patch. thanks Shveta
On Mon, Apr 29, 2024 at 11:38 AM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Apr 29, 2024 at 10:57 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Friday, March 15, 2024 10:45 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > > > > > Hi, > > > > > > On Thu, Mar 14, 2024 at 02:22:44AM +0000, Zhijie Hou (Fujitsu) wrote: > > > > Hi, > > > > > > > > Since the standby_slot_names patch has been committed, I am attaching > > > > the last doc patch for review. > > > > > > > > > > Thanks! > > > > > > 1 === > > > > > > + continue subscribing to publications now on the new primary server > > > without > > > + any data loss. > > > > > > I think "without any data loss" should be re-worded in this context. Data loss in > > > the sense "data committed on the primary and not visible on the subscriber in > > > case of failover" can still occurs (in case synchronous replication is not used). > > > > > > 2 === > > > > > > + If the result (<literal>failover_ready</literal>) of both above steps is > > > + true, existing subscriptions will be able to continue without data loss. > > > + </para> > > > > > > I don't think that's true if synchronous replication is not used. Say, > > > > > > - synchronous replication is not used > > > - primary is not able to reach the standby anymore and standby_slot_names is > > > set > > > - new data is inserted into the primary > > > - then not replicated to subscriber (due to standby_slot_names) > > > > > > Then I think the both above steps will return true but data would be lost in case > > > of failover. > > > > Thanks for the comments, attach the new version patch which reworded the > > above places. > > Thanks for the patch. > > Few comments: > > 1) Tested the steps, one of the queries still refers to > 'conflict_reason'. I think it should refer 'conflicting'. > > 2) Will it be good to mention that in case of planned promotion, it is > recommended to run pg_sync_replication_slots() as last sync attempt > before we run failvoer-ready validation steps? This can be mentioned > in high-availaibility.sgml of current patch I recall now that with the latest fix, we cannot run pg_sync_replication_slots() unless we disable the slot-sync worker. Considering that, I think it will be too many steps just to run the SQL function at the end without much value added. Thus we can skip this point, we can rely on slot sync worker completely. thanks Shveta
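For reference, the manual sync being discussed is the SQL function below; as noted above it can only be run once the slot sync worker is disabled. A sketch, assuming sync_replication_slots has been set to off on the standby:

    -- On the standby, with the slot sync worker disabled (sync_replication_slots = off):
    SELECT pg_sync_replication_slots();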
On Monday, April 29, 2024 5:11 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Mon, Apr 29, 2024 at 11:38 AM shveta malik <shveta.malik@gmail.com> > wrote: > > > > On Mon, Apr 29, 2024 at 10:57 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Friday, March 15, 2024 10:45 PM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > > Hi, > > > > > > > > On Thu, Mar 14, 2024 at 02:22:44AM +0000, Zhijie Hou (Fujitsu) wrote: > > > > > Hi, > > > > > > > > > > Since the standby_slot_names patch has been committed, I am > > > > > attaching the last doc patch for review. > > > > > > > > > > > > > Thanks! > > > > > > > > 1 === > > > > > > > > + continue subscribing to publications now on the new primary > > > > + server > > > > without > > > > + any data loss. > > > > > > > > I think "without any data loss" should be re-worded in this > > > > context. Data loss in the sense "data committed on the primary and > > > > not visible on the subscriber in case of failover" can still occurs (in case > synchronous replication is not used). > > > > > > > > 2 === > > > > > > > > + If the result (<literal>failover_ready</literal>) of both above steps is > > > > + true, existing subscriptions will be able to continue without data > loss. > > > > + </para> > > > > > > > > I don't think that's true if synchronous replication is not used. > > > > Say, > > > > > > > > - synchronous replication is not used > > > > - primary is not able to reach the standby anymore and > > > > standby_slot_names is set > > > > - new data is inserted into the primary > > > > - then not replicated to subscriber (due to standby_slot_names) > > > > > > > > Then I think the both above steps will return true but data would > > > > be lost in case of failover. > > > > > > Thanks for the comments, attach the new version patch which reworded > > > the above places. > > > > Thanks for the patch. > > > > Few comments: > > > > 1) Tested the steps, one of the queries still refers to > > 'conflict_reason'. I think it should refer 'conflicting'. Thanks for catching this. Fixed. > > > > 2) Will it be good to mention that in case of planned promotion, it is > > recommended to run pg_sync_replication_slots() as last sync attempt > > before we run failvoer-ready validation steps? This can be mentioned > > in high-availaibility.sgml of current patch > > I recall now that with the latest fix, we cannot run > pg_sync_replication_slots() unless we disable the slot-sync worker. > Considering that, I think it will be too many steps just to run the SQL function at > the end without much value added. Thus we can skip this point, we can rely on > slot sync worker completely. Agreed. I didn't change this. Here is the V3 doc patch. Best Regards, Hou zj
Attachment
On Mon, Apr 29, 2024 at 5:28 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Monday, April 29, 2024 5:11 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Mon, Apr 29, 2024 at 11:38 AM shveta malik <shveta.malik@gmail.com> > > wrote: > > > > > > On Mon, Apr 29, 2024 at 10:57 AM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > On Friday, March 15, 2024 10:45 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > > > > Hi, > > > > > > > > > > On Thu, Mar 14, 2024 at 02:22:44AM +0000, Zhijie Hou (Fujitsu) wrote: > > > > > > Hi, > > > > > > > > > > > > Since the standby_slot_names patch has been committed, I am > > > > > > attaching the last doc patch for review. > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > 1 === > > > > > > > > > > + continue subscribing to publications now on the new primary > > > > > + server > > > > > without > > > > > + any data loss. > > > > > > > > > > I think "without any data loss" should be re-worded in this > > > > > context. Data loss in the sense "data committed on the primary and > > > > > not visible on the subscriber in case of failover" can still occurs (in case > > synchronous replication is not used). > > > > > > > > > > 2 === > > > > > > > > > > + If the result (<literal>failover_ready</literal>) of both above steps is > > > > > + true, existing subscriptions will be able to continue without data > > loss. > > > > > + </para> > > > > > > > > > > I don't think that's true if synchronous replication is not used. > > > > > Say, > > > > > > > > > > - synchronous replication is not used > > > > > - primary is not able to reach the standby anymore and > > > > > standby_slot_names is set > > > > > - new data is inserted into the primary > > > > > - then not replicated to subscriber (due to standby_slot_names) > > > > > > > > > > Then I think the both above steps will return true but data would > > > > > be lost in case of failover. > > > > > > > > Thanks for the comments, attach the new version patch which reworded > > > > the above places. > > > > > > Thanks for the patch. > > > > > > Few comments: > > > > > > 1) Tested the steps, one of the queries still refers to > > > 'conflict_reason'. I think it should refer 'conflicting'. > > Thanks for catching this. Fixed. > > > > > > > 2) Will it be good to mention that in case of planned promotion, it is > > > recommended to run pg_sync_replication_slots() as last sync attempt > > > before we run failvoer-ready validation steps? This can be mentioned > > > in high-availaibility.sgml of current patch > > > > I recall now that with the latest fix, we cannot run > > pg_sync_replication_slots() unless we disable the slot-sync worker. > > Considering that, I think it will be too many steps just to run the SQL function at > > the end without much value added. Thus we can skip this point, we can rely on > > slot sync worker completely. > > Agreed. I didn't change this. > > Here is the V3 doc patch. Thanks for the patch. It will be good if 1a can produce quoted slot-names list as output, which can be used directly in step 1b's query; otherwise, it is little inconvenient to give input to 1b if the number of slots are huge. User needs to manually quote each slot-name. Other than this, the patch looks good to me. thanks Shveta
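One way to do what is being asked here, sketched below and not part of the posted patch: have the step-1a query aggregate names that are already quoted (via quote_literal()), so the output can be pasted directly into the IN (...) list of step 1b. The query is modelled on the one in the documentation patch, and the sample output reuses its example slot names:

test_sub=# SELECT string_agg(quote_literal(slot_name), ', ') AS quoted_slots
           FROM (
               SELECT CONCAT('pg_', r.srsubid, '_sync_', r.srrelid, '_', ctl.system_identifier) AS slot_name
               FROM pg_control_system() ctl, pg_subscription_rel r, pg_subscription s
               WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND s.subfailover
               UNION
               SELECT s.subslotname AS slot_name
               FROM pg_subscription s
               WHERE s.subfailover
           ) AS failover_slots;
      quoted_slots
------------------------
 'sub1', 'sub2', 'sub3'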
Hi, On Mon, Apr 29, 2024 at 11:58:09AM +0000, Zhijie Hou (Fujitsu) wrote: > On Monday, April 29, 2024 5:11 PM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Mon, Apr 29, 2024 at 11:38 AM shveta malik <shveta.malik@gmail.com> > > wrote: > > > > > > On Mon, Apr 29, 2024 at 10:57 AM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > On Friday, March 15, 2024 10:45 PM Bertrand Drouvot > > <bertranddrouvot.pg@gmail.com> wrote: > > > > > > > > > > Hi, > > > > > > > > > > On Thu, Mar 14, 2024 at 02:22:44AM +0000, Zhijie Hou (Fujitsu) wrote: > > > > > > Hi, > > > > > > > > > > > > Since the standby_slot_names patch has been committed, I am > > > > > > attaching the last doc patch for review. > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > 1 === > > > > > > > > > > + continue subscribing to publications now on the new primary > > > > > + server > > > > > without > > > > > + any data loss. > > > > > > > > > > I think "without any data loss" should be re-worded in this > > > > > context. Data loss in the sense "data committed on the primary and > > > > > not visible on the subscriber in case of failover" can still occurs (in case > > synchronous replication is not used). > > > > > > > > > > 2 === > > > > > > > > > > + If the result (<literal>failover_ready</literal>) of both above steps is > > > > > + true, existing subscriptions will be able to continue without data > > loss. > > > > > + </para> > > > > > > > > > > I don't think that's true if synchronous replication is not used. > > > > > Say, > > > > > > > > > > - synchronous replication is not used > > > > > - primary is not able to reach the standby anymore and > > > > > standby_slot_names is set > > > > > - new data is inserted into the primary > > > > > - then not replicated to subscriber (due to standby_slot_names) > > > > > > > > > > Then I think the both above steps will return true but data would > > > > > be lost in case of failover. > > > > > > > > Thanks for the comments, attach the new version patch which reworded > > > > the above places. Thanks! > Here is the V3 doc patch. Thanks! A few comments: 1 === + losing any data that has been flushed to the new primary server. Worth to add a few words about possible data loss, something like? Please note that in case synchronous replication is not used and standby_slot_names is set correctly, it might be possible to lose data that would have been committed on the old primary server (in case the standby was not reachable during that time for example). 2 === +test_sub=# SELECT + array_agg(slotname) AS slots + FROM + (( + SELECT r.srsubid AS subid, CONCAT('pg_', srsubid, '_sync_', srrelid, '_', ctl.system_identifier) AS slotname + FROM pg_control_system() ctl, pg_subscription_rel r, pg_subscription s + WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND s.subfailover + ) UNION ( I guess this format comes from ReplicationSlotNameForTablesync(). What about creating a SQL callable function on top of it and make use of it in the query above? (that would ensure to keep the doc up to date even if the format changes in ReplicationSlotNameForTablesync()). 
3 === +test_sub=# SELECT + MAX(remote_lsn) AS remote_lsn_on_subscriber + FROM + (( + SELECT (CASE WHEN r.srsubstate = 'f' THEN pg_replication_origin_progress(CONCAT('pg_', r.srsubid, '_', r.srrelid),false) + WHEN r.srsubstate IN ('s', 'r') THEN r.srsublsn END) AS remote_lsn + FROM pg_subscription_rel r, pg_subscription s + WHERE r.srsubstate IN ('f', 's', 'r') AND s.oid = r.srsubid AND s.subfailover + ) UNION ( + SELECT pg_replication_origin_progress(CONCAT('pg_', s.oid), false) AS remote_lsn + FROM pg_subscription s + WHERE s.subfailover + )); What about adding a join to pg_replication_origin to get rid of the "hardcoded" format "CONCAT('pg_', r.srsubid, '_', r.srrelid)" and "CONCAT('pg_', s.oid)"? Idea behind 2 === and 3 === is to have the queries as generic as possible and not rely on a hardcoded format (that would be more difficult to maintain should those formats change in the future). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Here are some review comments for the docs patch v3-0001. ====== Commit message 1. This patch adds detailed documentation for the slot sync feature including examples to guide users on how to verify that all slots have been successfully synchronized to the standby server and how to confirm whether the subscription can continue subscribing to publications on the promoted standby server. ~ This may be easier to read if you put it in bullet form like below: SUGGESTION It includes examples to guide the user: * How to verify that all slots have been successfully synchronized to the standby server * How to confirm whether the subscription can continue subscribing to publications on the promoted standby server ====== doc/src/sgml/high-availability.sgml 2. + <para> + If you have opted for synchronization of logical slots (see + <xref linkend="logicaldecoding-replication-slots-synchronization"/>), + then before switching to the standby server, it is recommended to check + if the logical slots synchronized on the standby server are ready + for failover. This can be done by following the steps described in + <xref linkend="logical-replication-failover"/>. + </para> + Maybe it is better to call this feature "logical replication slot synchronization" to be more consistent with the title of section 47.2.3. SUGGESTION If you have opted for logical replication slot synchronization (see ... ====== doc/src/sgml/logical-replication.sgml 3. + <para> + When the publisher server is the primary server of a streaming replication, + the logical slots on that primary server can be synchronized to the standby + server by specifying <literal>failover = true</literal> when creating + subscriptions for those publications. Enabling failover ensures a seamless + transition of those subscriptions after the standby is promoted. They can + continue subscribing to publications now on the new primary server without + losing any data that has been flushed to the new primary server. + </para> + 3a. BEFORE When the publisher server is the primary server of... SUGGESTION When publications are defined on the primary server of... ~ 3b. Enabling failover... Maybe say "Enabling the failover parameter..." and IMO there should also be a link to the CREATE SUBSCRIPTION failover parameter so the user can easily navigate there to read more about it. ~ 3c. BEFORE They can continue subscribing to publications now on the new primary server without losing any data that has been flushed to the new primary server. SUGGESTION (removes some extra info I did not think was needed) They can continue subscribing to publications now on the new primary server without any loss of data. ~~~ 4. + <para> + Because the slot synchronization logic copies asynchronously, it is + necessary to confirm that replication slots have been synced to the standby + server before the failover happens. Furthermore, to ensure a successful + failover, the standby server must not be lagging behind the subscriber. It + is highly recommended to use + <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link> + to prevent the subscriber from consuming changes faster than the hot standby. + To confirm that the standby server is indeed ready for failover, follow + these 2 steps: + </para> IMO the last sentence "To confirm..." should be a new paragraph. ~~~ 5. + <para> + Firstly, on the subscriber node, use the following SQL to identify + which slots should be synced to the standby that we plan to promote. 
AND + <para> + Next, check that the logical replication slots identified above exist on + the standby server and are ready for failover. ~~ 5a. I don't think you need to say "Firstly," and "Next," because the order to do these steps is already self-evident. ~ 5b. Patch says "on the subscriber node", but isn't that the simplest case? e.g. maybe there are multiple nodes having subscriptions for these publications. Maybe the sentence needs to account for case of subscribers on >1 nodes. Is there no way to discover this information by querying the publisher? ~~~ 6. +<programlisting> +test_sub=# SELECT + array_agg(slotname) AS slots + FROM ... +<programlisting> +test_standby=# SELECT slot_name, (synced AND NOT temporary AND NOT conflicting) AS failover_ready + FROM pg_replication_slots + WHERE slot_name IN ('sub1','sub2','sub3'); The example SQL for "1a" refers to 'slotname', but the example SQL for "1b" refers to "slot_name" (i.e. with underscore). It might be better if those are consistently called 'slot_name'. ~~~ 7. + <step performance="required"> + <para> + Confirm that the standby server is not lagging behind the subscribers. + This step can be skipped if + <link linkend="guc-standby-slot-names"><varname>standby_slot_names</varname></link> + has been correctly configured. If standby_slot_names is not configured + correctly, it is highly recommended to run this step after the primary + server is down, otherwise the results of the query may vary at different + points of time due to the ongoing replication on the logical subscribers + from the primary server. + </para> 7a. I felt that the step should just say "Confirm that the standby server is not lagging behind the subscribers.". So the text "This step can be skipped..." should be a separate paragraph. ~ 7b. The 2nd standby_slot_names should use a varname font. ~ 7c. /may vary at different points in time due to/can vary due to/ ~~~~ 8. + <para> + Firstly, on the subscriber node check the last replayed WAL. + This step needs to be run on the database(s) that includes the failover + enabled subscription(s), to find the last replayed WAL on each database. 8a. Don't need to say "Firstly," ~ 8b. The text "This step..." can be simplified as below: BEFORE This step needs to be run on the database(s) that includes the failover enabled subscription(s), to find the last replayed WAL on each database. SUGGESTION This step needs to be run on any database that includes failover-enabled subscriptions. ~~~ 9. + <para> + Next, on the standby server check that the last-received WAL location + is ahead of the replayed WAL location(s) on the subscriber identified + above. If the above SQL result was NULL, it means the subscriber has not + yet replayed any WAL, so the standby server must be ahead of the + subscriber, and this step can be skipped. Don't need to say "Next," ~~~ 10. + <para> + If the result (<literal>failover_ready</literal>) of both above steps is + true, existing subscriptions will be able to continue without losing any data + that has been flushed to the new primary server. + </para> Let's word this more like the same sentence top of the page. See review comment #3c SUGGESTION If the result (<literal>failover_ready</literal>) of both steps above is true, then existing subscriptions can continue subscribing to publications now on the new primary server without any loss of data. ====== Kind Regards, Peter Smith. Fujitsu Australia
On Wednesday, May 8, 2024 5:21 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > A few comments: Thanks for the comments! > 2 === > > +test_sub=# SELECT > + array_agg(slotname) AS slots > + FROM > + (( > + SELECT r.srsubid AS subid, CONCAT('pg_', srsubid, '_sync_', > srrelid, '_', ctl.system_identifier) AS slotname > + FROM pg_control_system() ctl, pg_subscription_rel r, > pg_subscription s > + WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND > s.subfailover > + ) UNION ( > > I guess this format comes from ReplicationSlotNameForTablesync(). What > about creating a SQL callable function on top of it and make use of it in the > query above? (that would ensure to keep the doc up to date even if the format > changes in ReplicationSlotNameForTablesync()). We could add a new function as suggested but I think it's not the right time(beta1) to add this function because new function will bring catversion bump which I think may not be worth at this stage. I think we can consider this after releasing and maybe gather more use cases for the new function you suggested. > > 3 === > > +test_sub=# SELECT > + MAX(remote_lsn) AS remote_lsn_on_subscriber > + FROM > + (( > + SELECT (CASE WHEN r.srsubstate = 'f' THEN > pg_replication_origin_progress(CONCAT('pg_', r.srsubid, '_', r.srrelid), false) > + WHEN r.srsubstate IN ('s', 'r') THEN r.srsublsn > END) AS remote_lsn > + FROM pg_subscription_rel r, pg_subscription s > + WHERE r.srsubstate IN ('f', 's', 'r') AND s.oid = r.srsubid AND > s.subfailover > + ) UNION ( > + SELECT pg_replication_origin_progress(CONCAT('pg_', > s.oid), false) AS remote_lsn > + FROM pg_subscription s > + WHERE s.subfailover > + )); > > What about adding a join to pg_replication_origin to get rid of the "hardcoded" > format "CONCAT('pg_', r.srsubid, '_', r.srrelid)" and "CONCAT('pg_', s.oid)"? I tried a bit, but it doesn't seem feasible to get the relationship between subscription and origin by querying pg_subscription and pg_replication_origin. Best Regards, Hou zj
On Thursday, May 23, 2024 1:34 PM Peter Smith <smithpb2250@gmail.com> wrote: Thanks for the comments. I addressed most of the comments, except the following one, which I am not sure about: > 5b. > Patch says "on the subscriber node", but isn't that the simplest case? > e.g. maybe there are multiple nodes having subscriptions for these > publications. Maybe the sentence needs to account for case of subscribers on > >1 nodes. I think it's not necessary to mention the multiple-node case, as in that case the user can just perform the same steps on each node that has failover-enabled subscriptions. > Is there no way to discover this information by querying the publisher? I am not aware of a way for the user to get the necessary information, such as the replication origin progress, on the publisher, because such information is only available on the subscriber. Attached is the V4 doc patch, which addresses Peter's and Bertrand's comments. Best Regards, Hou zj
Attachment
On Wed, Jun 5, 2024 at 7:52 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Attach the V4 doc patch which addressed Peter and Bertrand's comments. > Few comments: 1. + On the subscriber node, use the following SQL to identify + which slots should be synced to the standby that we plan to promote. +<programlisting> +test_sub=# SELECT + array_agg(slot_name) AS slots + FROM + (( + SELECT r.srsubid AS subid, CONCAT('pg_', srsubid, '_sync_', srrelid, '_', ctl.system_identifier) AS slot_name + FROM pg_control_system() ctl, pg_subscription_rel r, pg_subscription s + WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND s.subfailover + ) UNION ( + SELECT s.oid AS subid, s.subslotname as slot_name + FROM pg_subscription s + WHERE s.subfailover + )); + slots +------- + {sub1,sub2,sub3} This should additionally say what exactly this SQL is doing to fetch the required slots. 2. If <varname>standby_slot_names</varname> is + not configured correctly, it is highly recommended to run this step after + the primary server is down, otherwise the results of the query can vary + due to the ongoing replication on the logical subscribers from the primary + server. + </para> + <substeps> + <step performance="required"> + <para> + On the subscriber node, check the last replayed WAL. + This step needs to be run on any database that includes failover enabled + subscriptions. +<programlisting> +test_sub=# SELECT + MAX(remote_lsn) AS remote_lsn_on_subscriber + FROM If the 'standby_slot_names' is not configured then we can't ensure that standby is always ahead because what if immediately after running this query the additional WAL got synced to the subscriber before standby? Now, as you mentioned users can first shutdown primary to ensure that no additional WAL is sent to the subscriber. After that, it is possible that one can use these complex queries to ensure that the subscriber is behind the standby but it is better to encourage users to use standby_slot_names to ensure the same. If at all we get such use cases and or requirements then we can add such additional steps after understanding the user's requirements. For now, we should remove these additional steps. -- With Regards, Amit Kapila.
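To make the recommendation concrete, a minimal sketch of the setting in question; the slot name 'standby1' is an assumed example for the physical replication slot used by the standby that is the failover candidate:

    # postgresql.conf on the primary -- illustrative only
    standby_slot_names = 'standby1'

With this set, logical walsenders for failover-enabled slots do not send changes to subscribers until the named physical standby has confirmed the corresponding WAL, which is what allows the manual lag check discussed above to be skipped.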
Hi. Here are some minor review comments for the docs patch v4-0001. ====== doc/src/sgml/logical-replication.sgml 1. General The SGML file wrapping can be fixed to fill up to 80 cols for some of the paragraphs. ~~~ 2. + standby is promoted. They can continue subscribing to publications now on the + new primary server without any loss of data. But please note that in case of + asynchronous replication, there remains a risk of data loss for transactions + that have been committed on the former primary server but have yet to be + replicated to the new primary server. + </para> /in case/in the case/ /But please note that.../Note that.../ ====== Kind Regards, Peter Smith. Fujitsu Australia
On Wednesday, June 5, 2024 2:32 PM Peter Smith <smithpb2250@gmail.com> wrote: > Hi. Here are some minor review comments for the docs patch v4-0001. Thanks for the comments! > The SGML file wrapping can be fixed to fill up to 80 cols for some of the > paragraphs. Unlike comments in C code, I think we don't enforce the 80-column limit in doc files unless a line is too long to read. I checked the doc once and think it's OK. Here is the V5 patch, which addresses Peter's comments and Amit's comments[1]. [1] https://www.postgresql.org/message-id/CAA4eK1%2Bq1MYGgF3-LZCj6Xd0idujnjbTsfk-RqU%2BC51wYGaD5g%40mail.gmail.com Best Regards, Hou zj
Attachment
Hi, here are some review comments for the docs patch v5-0001. Apart from these it LGTM. ====== doc/src/sgml/logical-replication.sgml 1. + <para> + On the subscriber node, use the following SQL to identify which slots + should be synced to the standby that we plan to promote. This query will + return the relevant replication slots, including the main slots and table + synchronization slots associated with the failover enabled subscriptions. /failover enabled/failover-enabled/ ~~~ 2. + <para> + If all the slots are present on the standby server and result + (<literal>failover_ready</literal>) of is true, then existing subscriptions + can continue subscribing to publications now on the new primary server + without any loss of data. + </para> Hmm. It looks like there is some typo or missing words here: "of is true". Did you mean something like: "of the above SQL query is true"? ====== Kind Regards, Peter Smith. Fujitsu Australia
On Thursday, June 6, 2024 12:21 PM Peter Smith <smithpb2250@gmail.com> > > Hi, here are some review comments for the docs patch v5-0001. Thanks for the comments! Here is the V6 patch that addresses these. Best Regards, Hou zj
Attachment
On Fri, Jun 7, 2024 at 7:57 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Thanks for the comments! Here is the V6 patch that addressed the these. > I have pushed this after making minor changes in the wording. I have also changed one of the queries in docs to ignore the NULL slot_name values. -- With Regards, Amit Kapila.