Thread: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?
How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?
From
Bharath Rupireddy
Date:
Hi,

I'm wondering whether there's a way in core postgres to achieve $subject. In reality, sync/async standbys can be closer to or farther from the primary (which means they can receive WAL at different times), especially in cloud HA environments with the primary in one Availability Zone (AZ)/Region and the standbys in different AZs/Regions. $subject may not be possible on dev systems (say, for testing some HA features) unless we can inject a delay in WAL senders before sending WAL.

How about having two developer-only GUCs, {async, sync}_wal_sender_delay? When set, the async and sync WAL senders would delay sending WAL by {async, sync}_wal_sender_delay milliseconds/seconds. Although I can't think of any immediate use, it will be useful someday IMO, say for features like [1], if it gets in. With this set of GUCs, one can even add core regression tests for HA features.

Thoughts?

[1] https://www.postgresql.org/message-id/CALj2ACWCj60g6TzYMbEO07ZhnBGbdCveCrD413udqbRM0O59RA%40mail.gmail.com

Regards,
Bharath Rupireddy.
Re: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?
From
Ashutosh Bapat
Date:
On Tue, Apr 5, 2022 at 9:23 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> I'm thinking if there's a way in core postgres to achieve $subject. In
> reality, the sync/async standbys can either be closer/farther (which
> means sync/async standbys can receive WAL at different times) to
> primary, especially in cloud HA environments with primary in one
> Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> $subject may not be possible on dev systems (say, for testing some HA
> features) unless we can inject a delay in WAL senders before sending
> WAL.
>
> How about having two developer-only GUCs {async,
> sync}_wal_sender_delay? When set, the async and sync WAL senders will
> delay sending WAL by {async, sync}_wal_sender_delay
> milliseconds/seconds? Although, I can't think of any immediate use, it
> will be useful someday IMO, say for features like [1], if it gets in.
> With this set of GUCs, one can even add core regression tests for HA
> features.
>
> Thoughts?

I think this is a common problem that people run into. One way to simulate network delay is what you suggest, yes. But I was wondering if there are tools/libraries that can help us do that. Googling turns up OS-specific tools but nothing like a C or Perl library that could be used for this purpose.

--
Best Wishes,
Ashutosh Bapat
Re: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?
From
Bharath Rupireddy
Date:
On Wed, Apr 6, 2022 at 4:30 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Apr 5, 2022 at 9:23 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > Hi,
> >
> > I'm thinking if there's a way in core postgres to achieve $subject. In
> > reality, the sync/async standbys can either be closer/farther (which
> > means sync/async standbys can receive WAL at different times) to
> > primary, especially in cloud HA environments with primary in one
> > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> > $subject may not be possible on dev systems (say, for testing some HA
> > features) unless we can inject a delay in WAL senders before sending
> > WAL.
> >
> > How about having two developer-only GUCs {async,
> > sync}_wal_sender_delay? When set, the async and sync WAL senders will
> > delay sending WAL by {async, sync}_wal_sender_delay
> > milliseconds/seconds? Although, I can't think of any immediate use, it
> > will be useful someday IMO, say for features like [1], if it gets in.
> > With this set of GUCs, one can even add core regression tests for HA
> > features.
> >
> > Thoughts?
>
> I think this is a common problem, people run into. Once way to
> simulate network delay is what you suggest, yes. But I was wondering
> if there are tools/libraries that can help us to do that. Googling
> gives OS specific tools but nothing like a C or perl library which can
> be used for this purpose.

Thanks. IMO, non-postgres tools for simulating network delays (if they exist at all) may not be reliable or easy to use, say, for adding TAP tests for HA features. Especially in the cloud world, using those external tools may not even be possible. With the developer-only GUCs proposed in this thread, it's easy to simulate what we want; the only extra caution needed is to not let others (probably non-superusers) set and misuse these developer-only GUCs. I think that's true for all the existing developer-only GUCs as well.

Thoughts?

Regards,
Bharath Rupireddy.
Re: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?
From
SATYANARAYANA NARLAPURAM
Date:
On Fri, Apr 8, 2022 at 6:44 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, Apr 6, 2022 at 4:30 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Apr 5, 2022 at 9:23 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > Hi,
> >
> > I'm thinking if there's a way in core postgres to achieve $subject. In
> > reality, the sync/async standbys can either be closer/farther (which
> > means sync/async standbys can receive WAL at different times) to
> > primary, especially in cloud HA environments with primary in one
> > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> > $subject may not be possible on dev systems (say, for testing some HA
> > features) unless we can inject a delay in WAL senders before sending
> > WAL.
Simulation will be helpful even for end customers to simulate faults in the production environments during availability zone/disaster recovery drills.
> >
> > How about having two developer-only GUCs {async,
> > sync}_wal_sender_delay? When set, the async and sync WAL senders will
> > delay sending WAL by {async, sync}_wal_sender_delay
> > milliseconds/seconds? Although, I can't think of any immediate use, it
> > will be useful someday IMO, say for features like [1], if it gets in.
> > With this set of GUCs, one can even add core regression tests for HA
> > features.
I would suggest doing this at the slot level, instead of two GUCs that control the behavior of all the slots (physical/logical). Something like "pg_suspend_replication_slot and pg_Resume_replication_slot"?
Alternatively a GUC on the standby side instead of primary so that the wal receiver stops responding to the wal sender? This helps achieve the same as above but the granularity is now at individual replica level.
Thanks,
Satya
Re: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?
From
Bharath Rupireddy
Date:
On Fri, Apr 8, 2022 at 10:22 PM SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com> wrote:
>
>> > <bharath.rupireddyforpostgres@gmail.com> wrote:
>> > >
>> > > Hi,
>> > >
>> > > I'm thinking if there's a way in core postgres to achieve $subject. In
>> > > reality, the sync/async standbys can either be closer/farther (which
>> > > means sync/async standbys can receive WAL at different times) to
>> > > primary, especially in cloud HA environments with primary in one
>> > > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
>> > > $subject may not be possible on dev systems (say, for testing some HA
>> > > features) unless we can inject a delay in WAL senders before sending
>> > > WAL.
>
> Simulation will be helpful even for end customers to simulate faults in the production environments during availability zone/disaster recovery drills.

Right.

>> > > How about having two developer-only GUCs {async,
>> > > sync}_wal_sender_delay? When set, the async and sync WAL senders will
>> > > delay sending WAL by {async, sync}_wal_sender_delay
>> > > milliseconds/seconds? Although, I can't think of any immediate use, it
>> > > will be useful someday IMO, say for features like [1], if it gets in.
>> > > With this set of GUCs, one can even add core regression tests for HA
>> > > features.
>
> I would suggest doing this at the slot level, instead of two GUCs that control the behavior of all the slots (physical/logical). Something like "pg_suspend_replication_slot and pg_Resume_replication_slot"?

Having the control at the replication slot level rather than at the WAL sender level seems reasonable. As there can be many slots on the primary, we must have a way to specify which slots need to be delayed and by how much before sending WAL. If these were GUCs, they would have to be list-typed, and I'm not sure that would come out well.

Instead, we could have two functions (superuser-only, or for users with the replication role) such as pg_replication_slot_set_delay(slot_name, delay_in_milliseconds) and pg_replication_slot_unset_delay(slot_name). pg_replication_slot_set_delay would set ReplicationSlot->delay, and the WAL sender would check MyReplicationSlot->delay > 0 and wait before sending WAL. pg_replication_slot_unset_delay would set ReplicationSlot->delay to 0; alternatively, pg_replication_slot_set_delay(slot_name, 0) could be used instead, so that only a single function is needed. If users want a standby to receive WAL with a delay, they can call pg_replication_slot_set_delay after creating the replication slot.

Thoughts?

> Alternatively a GUC on the standby side instead of primary so that the wal receiver stops responding to the wal sender?

I think we have the wal_receiver_status_interval GUC on the WAL receiver, which achieves the above, i.e. not responding to the primary at all; one can set wal_receiver_status_interval to, say, 1day.

[1]
    {
        {"wal_receiver_status_interval", PGC_SIGHUP, REPLICATION_STANDBY,
            gettext_noop("Sets the maximum interval between WAL receiver status reports to the sending server."),
            NULL,
            GUC_UNIT_S
        },
        &wal_receiver_status_interval,
        10, 0, INT_MAX / 1000,
        NULL, NULL, NULL
    },

Regards,
Bharath Rupireddy.
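For readers following the slot-level proposal in the message above, a minimal standalone C sketch of the intended semantics might look like the following. It is not PostgreSQL code: SlotSketch, slot_create, slot_set_delay and sender_delay_before_send are hypothetical stand-ins for ReplicationSlot, slot creation, the proposed pg_replication_slot_set_delay, and the check a WAL sender would make before sending. The sketch only computes the wait; a real WAL sender would have to sleep or wait on a latch for that long.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_SLOTS 8
#define SLOT_NAME_LEN 64

/* Hypothetical, simplified stand-in for PostgreSQL's ReplicationSlot. */
typedef struct SlotSketch
{
    char name[SLOT_NAME_LEN];
    int  in_use;
    int  delay_ms;              /* 0 means "send WAL immediately" */
} SlotSketch;

static SlotSketch slots[MAX_SLOTS];

/* Look up a slot by name; returns NULL if no such slot exists. */
static SlotSketch *
slot_find(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < MAX_SLOTS; i++)
    {
        if (slots[i].in_use && strcmp(slots[i].name, name) == 0)
            return &slots[i];
    }
    return NULL;
}

/*
 * Create a slot with no delay. Returns 0 on success, -1 if the name is
 * too long, already taken, or the slot table is full.
 */
int
slot_create(const char *name)
{
    assert(name != NULL);
    if (strlen(name) >= SLOT_NAME_LEN || slot_find(name) != NULL)
        return -1;
    for (int i = 0; i < MAX_SLOTS; i++)
    {
        if (!slots[i].in_use)
        {
            snprintf(slots[i].name, SLOT_NAME_LEN, "%s", name);
            slots[i].in_use = 1;
            slots[i].delay_ms = 0;
            return 0;
        }
    }
    return -1;
}

/*
 * Models the proposed pg_replication_slot_set_delay(slot_name, delay).
 * Passing 0 clears the delay, so no separate "unset" function is needed.
 * Returns 0 on success, -1 for an unknown slot or a negative delay.
 */
int
slot_set_delay(const char *name, int delay_ms)
{
    SlotSketch *slot = slot_find(name);

    if (slot == NULL || delay_ms < 0)
        return -1;
    slot->delay_ms = delay_ms;
    return 0;
}

/*
 * Models the WAL sender's check before sending: returns how many
 * milliseconds it should wait (0 for none), or -1 for an unknown slot.
 */
int
sender_delay_before_send(const char *name)
{
    SlotSketch *slot = slot_find(name);

    if (slot == NULL)
        return -1;
    return slot->delay_ms > 0 ? slot->delay_ms : 0;
}
```

With this shape, a test could give a "far" standby's slot a delay of a few hundred milliseconds and leave a "near" standby's slot at 0, which is the per-slot granularity asked for earlier in the thread.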
Re: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?
From
Julien Rouhaud
Date:
On Sat, Apr 09, 2022 at 02:38:50PM +0530, Bharath Rupireddy wrote:
> On Fri, Apr 8, 2022 at 10:22 PM SATYANARAYANA NARLAPURAM
> <satyanarlapuram@gmail.com> wrote:
> >
> >> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> >> > >
> >> > > Hi,
> >> > >
> >> > > I'm thinking if there's a way in core postgres to achieve $subject. In
> >> > > reality, the sync/async standbys can either be closer/farther (which
> >> > > means sync/async standbys can receive WAL at different times) to
> >> > > primary, especially in cloud HA environments with primary in one
> >> > > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> >> > > $subject may not be possible on dev systems (say, for testing some HA
> >> > > features) unless we can inject a delay in WAL senders before sending
> >> > > WAL.
> >
> > Simulation will be helpful even for end customers to simulate faults in the
> > production environments during availability zone/disaster recovery drills.
>
> Right.

I'm not sure that's actually helpful. If you want to do some realistic testing, you need to fully simulate various network incidents, and only delaying postgres replication is never going to be close to that. You should instead rely on a tool like tc, which can do much more than what $subject could ever do, and do that for your whole HA stack. At the very least, you don't want to validate that your setup is working as expected by just simulating a faulty postgres replication connection while all your clients and HA agents have no network issues at all.
Re: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?
From
Bharath Rupireddy
Date:
On Sat, Apr 9, 2022 at 6:38 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Sat, Apr 09, 2022 at 02:38:50PM +0530, Bharath Rupireddy wrote:
> > On Fri, Apr 8, 2022 at 10:22 PM SATYANARAYANA NARLAPURAM
> > <satyanarlapuram@gmail.com> wrote:
> > >
> > >> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >> > >
> > >> > > Hi,
> > >> > >
> > >> > > I'm thinking if there's a way in core postgres to achieve $subject. In
> > >> > > reality, the sync/async standbys can either be closer/farther (which
> > >> > > means sync/async standbys can receive WAL at different times) to
> > >> > > primary, especially in cloud HA environments with primary in one
> > >> > > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> > >> > > $subject may not be possible on dev systems (say, for testing some HA
> > >> > > features) unless we can inject a delay in WAL senders before sending
> > >> > > WAL.
> > >
> > > Simulation will be helpful even for end customers to simulate faults in the
> > > production environments during availability zone/disaster recovery drills.
> >
> > Right.
>
> I'm not sure that's actually helpful. If you want to do some realistic testing
> you need to fully simulate various network incidents and only delaying postgres
> replication is never going to be close to that. You should instead rely on
> tool like tc, which can do much more than what $subject could ever do, and do
> that for all your HA stack. At the very least you don't want to validate that
> your setup is working as excpected by just simulating a faulty postgres
> replication connection but still having all your clients and HA agent not
> having any network issue at all.

Agreed that external networking tools and commands can be used. IMHO, not everyone is familiar with those tools, and they may not be portable or reliable all the time. And developers may not be able to use those tools to test some of the HA-related features (which may require sync and async standbys being closer to or farther from the primary) that I or other postgres HA solution providers may develop. Having a reliable way within core would actually help.

Upon thinking further, how about we have hooks in the WAL sender code (perhaps passing info about the replication slot it manages, plus some other info), so that one can implement an extension of their choice (similar to auth_delay and ClientAuthentication_hook)?

Regards,
Bharath Rupireddy.
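The hook idea in the message above follows the usual PostgreSQL pattern of a global function pointer that core code calls when it is set, which is how auth_delay plugs into ClientAuthentication_hook. Below is a minimal standalone C sketch of that pattern applied to a WAL sender. Everything named here is an assumption made for illustration: wal_send_delay_hook, SendContextSketch, delay_before_send and example_delay_hook do not exist in PostgreSQL, and the delay values are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context the WAL sender would pass to the hook. */
typedef struct SendContextSketch
{
    const char *slot_name;      /* replication slot being served */
    int         is_sync;        /* nonzero for a synchronous standby */
} SendContextSketch;

/* Hook type: returns the delay in milliseconds to apply before sending. */
typedef int (*wal_send_delay_hook_type) (const SendContextSketch *ctx);

/* Hypothetical hook variable; NULL means no extension is installed. */
wal_send_delay_hook_type wal_send_delay_hook = NULL;

/*
 * What the core side could look like: ask the hook, if any, how long to
 * wait before sending WAL. Without a hook there is no delay, and a
 * negative answer from the hook is treated as "no delay".
 */
int
delay_before_send(const SendContextSketch *ctx)
{
    assert(ctx != NULL);
    if (wal_send_delay_hook != NULL)
    {
        int delay_ms = wal_send_delay_hook(ctx);

        return delay_ms > 0 ? delay_ms : 0;
    }
    return 0;
}

/*
 * What an extension could provide: pretend sync standbys are close to the
 * primary (5 ms) and async standbys are far away (200 ms).
 */
int
example_delay_hook(const SendContextSketch *ctx)
{
    return ctx->is_sync ? 5 : 200;
}
```

An extension would assign wal_send_delay_hook at load time (in PostgreSQL, from its _PG_init function), so the delay policy stays outside core and can differ per test.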