Thread: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?

Hi,

I'm wondering whether there's a way in core postgres to achieve $subject.
In reality, sync/async standbys can be closer to or farther from the
primary (which means they can receive WAL at different times),
especially in cloud HA environments with the primary in one
Availability Zone (AZ)/Region and standbys in different AZs/Regions.
$subject may not be possible on dev systems (say, for testing some HA
features) unless we can inject a delay in WAL senders before sending
WAL.

How about having two developer-only GUCs, {async,
sync}_wal_sender_delay? When set, the async and sync WAL senders would
delay sending WAL by {async, sync}_wal_sender_delay
milliseconds/seconds. Although I can't think of any immediate use, it
will be useful someday IMO, say for features like [1], if it gets in.
With this set of GUCs, one could even add core regression tests for HA
features.
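
To make the idea concrete, here is a minimal standalone sketch of the
selection logic, assuming millisecond units. It is not PostgreSQL code:
the two variables only stand in for the proposed GUCs, and
wal_sender_delay_ms() and earliest_send_time_ms() are hypothetical names
used for illustration. A real WAL sender would sleep until the computed
time; this model only computes it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the proposed developer-only GUCs, in ms. */
int async_wal_sender_delay = 0;
int sync_wal_sender_delay = 0;

/* Delay for one sender, chosen by whether its standby is synchronous. */
int
wal_sender_delay_ms(bool standby_is_sync)
{
    return standby_is_sync ? sync_wal_sender_delay : async_wal_sender_delay;
}

/*
 * Earliest time (in ms) at which WAL that became ready at ready_at_ms
 * may be sent. This model only computes the time; it does not sleep.
 */
int64_t
earliest_send_time_ms(int64_t ready_at_ms, bool standby_is_sync)
{
    int delay_ms = wal_sender_delay_ms(standby_is_sync);

    assert(delay_ms >= 0);      /* a negative delay makes no sense */
    return ready_at_ms + delay_ms;
}
```

With sync_wal_sender_delay = 5 and async_wal_sender_delay = 200, the
sync standby would look "close" and the async standby "far".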

Thoughts?

[1] https://www.postgresql.org/message-id/CALj2ACWCj60g6TzYMbEO07ZhnBGbdCveCrD413udqbRM0O59RA%40mail.gmail.com

Regards,
Bharath Rupireddy.



On Tue, Apr 5, 2022 at 9:23 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> I'm thinking if there's a way in core postgres to achieve $subject. In
> reality, the sync/async standbys can either be closer/farther (which
> means sync/async standbys can receive WAL at different times) to
> primary, especially in cloud HA environments with primary in one
> Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> $subject may not be possible on dev systems (say, for testing some HA
> features) unless we can inject a delay in WAL senders before sending
> WAL.
>
> How about having two developer-only GUCs {async,
> sync}_wal_sender_delay? When set, the async and sync WAL senders will
> delay sending WAL by {async, sync}_wal_sender_delay
> milliseconds/seconds? Although, I can't think of any immediate use, it
> will be useful someday IMO, say for features like [1], if it gets in.
> With this set of GUCs, one can even add core regression tests for HA
> features.
>
> Thoughts?

I think this is a common problem that people run into. One way to
simulate network delay is what you suggest, yes. But I was wondering
if there are tools/libraries that can help us do that. Googling
turns up OS-specific tools, but nothing like a C or Perl library that
can be used for this purpose.


-- 
Best Wishes,
Ashutosh Bapat



On Wed, Apr 6, 2022 at 4:30 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Apr 5, 2022 at 9:23 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > Hi,
> >
> > I'm thinking if there's a way in core postgres to achieve $subject. In
> > reality, the sync/async standbys can either be closer/farther (which
> > means sync/async standbys can receive WAL at different times) to
> > primary, especially in cloud HA environments with primary in one
> > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> > $subject may not be possible on dev systems (say, for testing some HA
> > features) unless we can inject a delay in WAL senders before sending
> > WAL.
> >
> > How about having two developer-only GUCs {async,
> > sync}_wal_sender_delay? When set, the async and sync WAL senders will
> > delay sending WAL by {async, sync}_wal_sender_delay
> > milliseconds/seconds? Although, I can't think of any immediate use, it
> > will be useful someday IMO, say for features like [1], if it gets in.
> > With this set of GUCs, one can even add core regression tests for HA
> > features.
> >
> > Thoughts?
>
> I think this is a common problem that people run into. One way to
> simulate network delay is what you suggest, yes. But I was wondering
> if there are tools/libraries that can help us to do that. Googling
> gives OS specific tools but nothing like a C or perl library which can
> be used for this purpose.

Thanks. IMO, non-postgres tools for simulating network delays (if they
exist at all) may not be reliable or easy to use, say, for adding TAP
tests for HA features. Especially in the cloud world, using those
external tools may not even be possible. With the developer-only GUCs
proposed in this thread, it's easy to simulate what we want. The only
extra caution needed is to not let others (probably non-superusers)
set and misuse these developer-only GUCs. I think that's true for all
the existing developer-only GUCs as well.

Thoughts?

Regards,
Bharath Rupireddy.





On Fri, Apr 8, 2022 at 6:44 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
On Wed, Apr 6, 2022 at 4:30 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Tue, Apr 5, 2022 at 9:23 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > Hi,
> >
> > I'm thinking if there's a way in core postgres to achieve $subject. In
> > reality, the sync/async standbys can either be closer/farther (which
> > means sync/async standbys can receive WAL at different times) to
> > primary, especially in cloud HA environments with primary in one
> > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> > $subject may not be possible on dev systems (say, for testing some HA
> > features) unless we can inject a delay in WAL senders before sending
> > WAL.

Simulation will be helpful even for end customers to simulate faults in the production environments during availability zone/disaster recovery drills.

 
> >
> > How about having two developer-only GUCs {async,
> > sync}_wal_sender_delay? When set, the async and sync WAL senders will
> > delay sending WAL by {async, sync}_wal_sender_delay
> > milliseconds/seconds? Although, I can't think of any immediate use, it
> > will be useful someday IMO, say for features like [1], if it gets in.
> > With this set of GUCs, one can even add core regression tests for HA
> > features.

I would suggest doing this at the slot level, instead of two GUCs that control the behavior of all the slots (physical/logical). Something like pg_suspend_replication_slot and pg_Resume_replication_slot?
Alternatively, a GUC on the standby side instead of the primary, so that the WAL receiver stops responding to the WAL sender? This achieves the same as above, but the granularity is now at the individual replica level.
 
Thanks,
Satya



On Fri, Apr 8, 2022 at 10:22 PM SATYANARAYANA NARLAPURAM
<satyanarlapuram@gmail.com> wrote:
>
>> > <bharath.rupireddyforpostgres@gmail.com> wrote:
>> > >
>> > > Hi,
>> > >
>> > > I'm thinking if there's a way in core postgres to achieve $subject. In
>> > > reality, the sync/async standbys can either be closer/farther (which
>> > > means sync/async standbys can receive WAL at different times) to
>> > > primary, especially in cloud HA environments with primary in one
>> > > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
>> > > $subject may not be possible on dev systems (say, for testing some HA
>> > > features) unless we can inject a delay in WAL senders before sending
>> > > WAL.
>
> Simulation will be helpful even for end customers to simulate faults in the
> production environments during availability zone/disaster recovery drills.

Right.

>> > > How about having two developer-only GUCs {async,
>> > > sync}_wal_sender_delay? When set, the async and sync WAL senders will
>> > > delay sending WAL by {async, sync}_wal_sender_delay
>> > > milliseconds/seconds? Although, I can't think of any immediate use, it
>> > > will be useful someday IMO, say for features like [1], if it gets in.
>> > > With this set of GUCs, one can even add core regression tests for HA
>> > > features.
>
> I would suggest doing this at the slot level, instead of two GUCs that
> control the behavior of all the slots (physical/logical). Something like
> "pg_suspend_replication_slot and pg_Resume_replication_slot"?

Having the control at the replication slot level, instead of at the
WAL sender level, seems reasonable. As there can be many slots on the
primary, we must have a way to specify which slots need to be delayed
and by how much before sending WAL. If we used GUCs, they would have to
be of list type and I'm not sure that would come out well.

Instead, we could have two functions (superuser-only, or for users
with the replication role) such as
pg_replication_slot_set_delay(slot_name, delay_in_milliseconds) and
pg_replication_slot_unset_delay(slot_name).
pg_replication_slot_set_delay would set ReplicationSlot->delay, and the
WAL sender would check MyReplicationSlot->delay > 0 and wait before
sending WAL. pg_replication_slot_unset_delay would set
ReplicationSlot->delay to 0. Alternatively, instead of
pg_replication_slot_unset_delay,
pg_replication_slot_set_delay(slot_name, 0) could be used, so that only
a single function is needed.

If the users want a standby to receive WAL with a delay, they can use
pg_replication_slot_set_delay after creating the replication slot.
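
A rough standalone model of the single-function variant, only to show
the intended semantics. Nothing here is existing PostgreSQL code:
SlotModel, slot_create(), slot_set_delay() and slot_send_delay() are
hypothetical names, and the fixed-size table merely stands in for the
shared-memory slot array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SLOTS     8
#define SLOT_NAME_LEN 64

/* Hypothetical, simplified slot; not PostgreSQL's ReplicationSlot. */
typedef struct SlotModel
{
    char name[SLOT_NAME_LEN];
    int  delay_ms;              /* 0 means "send without delay" */
    bool in_use;
} SlotModel;

SlotModel slots[MAX_SLOTS];

/* Look up an in-use slot by name; returns NULL if there is none. */
SlotModel *
slot_find(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < MAX_SLOTS; i++)
        if (slots[i].in_use && strcmp(slots[i].name, name) == 0)
            return &slots[i];
    return NULL;
}

/* Create a slot with no delay; false if the name is too long or the table is full. */
bool
slot_create(const char *name)
{
    assert(name != NULL);
    if (strlen(name) >= SLOT_NAME_LEN)
        return false;
    for (int i = 0; i < MAX_SLOTS; i++)
    {
        if (!slots[i].in_use)
        {
            strcpy(slots[i].name, name);
            slots[i].delay_ms = 0;
            slots[i].in_use = true;
            return true;
        }
    }
    return false;
}

/* Models pg_replication_slot_set_delay(slot_name, ms); ms = 0 unsets the delay. */
bool
slot_set_delay(const char *name, int delay_ms)
{
    SlotModel *slot = slot_find(name);

    if (slot == NULL || delay_ms < 0)
        return false;
    slot->delay_ms = delay_ms;
    return true;
}

/* What a sender serving this slot would wait before sending; 0 if unknown. */
int
slot_send_delay(const char *name)
{
    SlotModel *slot = slot_find(name);

    return slot != NULL ? slot->delay_ms : 0;
}
```

For example, slot_set_delay("far_standby", 250) would make only the
sender serving that slot wait, and slot_set_delay("far_standby", 0)
would undo it.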

Thoughts?

> Alternatively a GUC on the standby side instead of primary so that the wal
> receiver stops responding to the wal sender?

I think we have the wal_receiver_status_interval GUC [1] on the WAL
receiver that achieves the above, i.e. not responding to the primary at
all: one can set wal_receiver_status_interval to, say, 1 day.

[1]
    {
        {"wal_receiver_status_interval", PGC_SIGHUP, REPLICATION_STANDBY,
            gettext_noop("Sets the maximum interval between WAL receiver status reports to the sending server."),
            NULL,
            GUC_UNIT_S
        },
        &wal_receiver_status_interval,
        10, 0, INT_MAX / 1000,
        NULL, NULL, NULL
    },

Regards,
Bharath Rupireddy.



On Sat, Apr 09, 2022 at 02:38:50PM +0530, Bharath Rupireddy wrote:
> On Fri, Apr 8, 2022 at 10:22 PM SATYANARAYANA NARLAPURAM
> <satyanarlapuram@gmail.com> wrote:
> >
> >> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> >> > >
> >> > > Hi,
> >> > >
> >> > > I'm thinking if there's a way in core postgres to achieve $subject. In
> >> > > reality, the sync/async standbys can either be closer/farther (which
> >> > > means sync/async standbys can receive WAL at different times) to
> >> > > primary, especially in cloud HA environments with primary in one
> >> > > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> >> > > $subject may not be possible on dev systems (say, for testing some HA
> >> > > features) unless we can inject a delay in WAL senders before sending
> >> > > WAL.
> >
> > Simulation will be helpful even for end customers to simulate faults in the
> > production environments during availability zone/disaster recovery drills.
>
> Right.

I'm not sure that's actually helpful.  If you want to do some realistic testing
you need to fully simulate various network incidents, and only delaying postgres
replication is never going to be close to that.  You should instead rely on
a tool like tc, which can do much more than what $subject could ever do, and do
that for all your HA stack.  At the very least you don't want to validate that
your setup is working as expected by just simulating a faulty postgres
replication connection while all your clients and HA agents still have
no network issue at all.



On Sat, Apr 9, 2022 at 6:38 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> On Sat, Apr 09, 2022 at 02:38:50PM +0530, Bharath Rupireddy wrote:
> > On Fri, Apr 8, 2022 at 10:22 PM SATYANARAYANA NARLAPURAM
> > <satyanarlapuram@gmail.com> wrote:
> > >
> > >> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >> > >
> > >> > > Hi,
> > >> > >
> > >> > > I'm thinking if there's a way in core postgres to achieve $subject. In
> > >> > > reality, the sync/async standbys can either be closer/farther (which
> > >> > > means sync/async standbys can receive WAL at different times) to
> > >> > > primary, especially in cloud HA environments with primary in one
> > >> > > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> > >> > > $subject may not be possible on dev systems (say, for testing some HA
> > >> > > features) unless we can inject a delay in WAL senders before sending
> > >> > > WAL.
> > >
> > > Simulation will be helpful even for end customers to simulate faults in the
> > > production environments during availability zone/disaster recovery drills.
> >
> > Right.
>
> I'm not sure that's actually helpful.  If you want to do some realistic testing
> you need to fully simulate various network incidents and only delaying postgres
> replication is never going to be close to that.  You should instead rely on
> tool like tc, which can do much more than what $subject could ever do, and do
> that for all your HA stack.  At the very least you don't want to validate that
> your setup is working as expected by just simulating a faulty postgres
> replication connection but still having all your clients and HA agent not
> having any network issue at all.

Agreed that external networking tools and commands can be used.
IMHO, not everyone is familiar with those tools, and the tools may not
be portable and reliable all the time. Developers may also not be able
to use those tools to test some HA-related features (which may
require sync and async standbys being closer to or farther from the
primary) that I or other postgres HA solution providers may develop.
Having a reliable way within core would actually help.

Upon thinking further, how about we have hooks in the WAL sender code
(perhaps passing info about the replication slot it manages and some
other info), so that one can implement an extension of their choice
(similar to auth_delay and ClientAuthentication_hook)?
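
A minimal standalone sketch of the hook pattern I have in mind. The
names walsender_delay_hook, walsender_delay_before_send() and
far_standby_delay() are all hypothetical, as is the choice of passing
just the slot name; no such hook exists in core today. The shape only
follows the usual convention of a global function pointer that is NULL
unless an extension sets it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hook: given a slot name, return ms to wait before sending. */
typedef int (*walsender_delay_hook_type) (const char *slot_name);

/* NULL unless an extension installs a callback at load time. */
walsender_delay_hook_type walsender_delay_hook = NULL;

/* Called by the modelled WAL sender before each send; returns the wait in ms. */
int
walsender_delay_before_send(const char *slot_name)
{
    int delay_ms = 0;

    assert(slot_name != NULL);
    if (walsender_delay_hook != NULL)
        delay_ms = walsender_delay_hook(slot_name);

    /* Treat a negative answer from the extension as "no delay". */
    return delay_ms > 0 ? delay_ms : 0;
}

/* Example extension callback: only the slot named "far_standby" is delayed. */
int
far_standby_delay(const char *slot_name)
{
    return strcmp(slot_name, "far_standby") == 0 ? 250 : 0;
}
```

An extension would set walsender_delay_hook = far_standby_delay when it
is loaded; with no extension loaded, the sender behaves as it does today.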

Regards,
Bharath Rupireddy.