Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication - Mailing list pgsql-hackers
| From | Ashutosh Sharma |
|---|---|
| Subject | Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication |
| Date | |
| Msg-id | CAE9k0Pn2CCw8jXqbaJqwrXwBUfdVW8rRg6aQY6XJjM7cn-Cp-Q@mail.gmail.com |
| In response to | Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication (shveta malik <shveta.malik@gmail.com>) |
| Responses | Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication |
| List | pgsql-hackers |
On Fri, Apr 3, 2026 at 2:21 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, Apr 3, 2026 at 9:46 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Thu, Apr 2, 2026 at 3:55 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > >
> > > Hi Shveta,
> > >
> > > On Wed, Apr 1, 2026 at 12:06 PM shveta malik <shveta.malik@gmail.com> wrote:
> > > >
> > > > On Thu, Mar 26, 2026 at 5:23 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > > >
> > > > >
> > > > > PFA patch addressing all the comments above and let me know for any
> > > > > further comments.
> > > > >
> > > >
> > > > Thank You Ashutosh. Doc looks good to me. Few comments:
> > > >
> > > > 3)
> > > > What is the execution time for this new test?
> > > > I ran it on my VM (which is slightly on the slower side), and the
> > > > runtime varies between ~60 seconds and ~140 seconds. I executed it
> > > > around 10–15 times. Most runs completed in about 65 seconds (which is
> > > > still more), but a few were significantly longer (100+ seconds).
> > > > During the longer runs, I noticed the following entry in pub.log
> > > > (possibly related to Test Scenario E taking more time?). Could you
> > > > please try running this on your end as well?
> > > >
> > > > 2026-03-31 19:45:45.557 IST client backend[145705]
> > > > 053_synchronized_standby_slots_quorum.pl LOG: statement: SELECT
> > > > active_pid IS NOT NULL
> > > > AND restart_lsn IS NOT NULL
> > > > AND restart_lsn < '0/03000450'::pg_lsn
> > > > FROM pg_replication_slots
> > > > WHERE slot_name = 'sb1_slot';
> > > >
> > > > Just for reference, the complete failover test
> > > > (t/040_standby_failover_slots_sync.pl) takes somewhere between 7 to
> > > > 10sec on my VM.
> > > >
> > >
> > > My concern with this new test is that it's both slow to run and prone
> > > to flakiness, which makes me question whether it's worth keeping.
> > >
> >
> > will review and share my thoughts.
> >
>
> I gave it more thought, another idea for a shorter and quicker
> testcase could be to check wait_event for that particular
> application_name in pg_stat_activity. A lagging standby will result in
> wait_event=WaitForStandbyConfirmation with backend_type=walsender.
>
> I have attached sample-code to do the same in the attached txt file,
> please have a look. I discussed with Hou-San offline, he is okay with
> this idea. Please see if it works and change it as needed.
>

More than the execution time, I'm concerned about whether the test case
effectively validates what we want. With the setup below, here is what I
observe:

Setup:

Primary  : psql -p 5555 (synchronous_standby_names = 'ANY 1 (standby1, standby2)';
           synchronized_standby_slots = 'FIRST 1 (sb1_slot, sb2_slot)')
Standby1 : psql -p 5556 (wal_receiver_status_interval=0)
Standby2 : psql -p 5557 (wal_receiver_status_interval=10s)

--

Observations:

[local]:5555 ashu@postgres=# SELECT pg_logical_emit_message(true, 'qtest', 'first_1_lagging_blocks_1');
 pg_logical_emit_message
-------------------------
 0/04000220
(1 row)

Time: 14.378 ms

[local]:5555 ashu@postgres=# select slot_name, active_pid, restart_lsn from pg_replication_slots where slot_type = 'physical';
 slot_name | active_pid | restart_lsn
-----------+------------+-------------
 sb1_slot  |     105328 | 0/04000250
 sb2_slot  |     105381 | 0/04000250
(2 rows)

Time: 1.370 ms

--

[local]:5555 ashu@postgres=# SELECT pg_logical_emit_message(true, 'qtest', 'first_1_lagging_blocks_2');
 pg_logical_emit_message
-------------------------
 0/040002A0
(1 row)

Time: 13.533 ms

[local]:5555 ashu@postgres=# select slot_name, active_pid, restart_lsn from pg_replication_slots where slot_type = 'physical';
 slot_name | active_pid | restart_lsn
-----------+------------+-------------
 sb1_slot  |     105328 | 0/040002D0
 sb2_slot  |     105381 | 0/040002D0
(2 rows)

--

Takeaways:

1) In both cases, even though wal_receiver_status_interval = 0 on
standby1, the restart_lsn of sb1_slot quickly moved past the LSN of the
emitted logical message. This suggests that wal_receiver_status_interval = 0
disables only the periodic status packets; the receiver and walsender still
exchange feedback on other events, so the slot's restart_lsn can move
quickly.

2) On a fast local setup, both sb1_slot and sb2_slot can advance past the
emitted LSN before we query pg_replication_slots, which makes the test case
time-sensitive and therefore flaky/nondeterministic.

--
With Regards,
Ashutosh Sharma.
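For anyone who wants to double-check the LSN arithmetic behind the two observations above, here is a minimal sketch in Python. It is not part of the patch or the TAP test; the helper names parse_lsn and lagging_slots are made up for illustration. It relies only on the fact that a pg_lsn is printed as two hexadecimal numbers (high and low 32 bits) separated by a slash, and it applies the same "restart_lsn < emitted LSN" comparison that the logged test query uses.

```python
def parse_lsn(text: str) -> int:
    """Convert a pg_lsn string such as '0/04000220' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lagging_slots(emitted_lsn: str, slots: dict[str, str]) -> list[str]:
    """Return the slots whose restart_lsn is still behind emitted_lsn.

    Hypothetical helper: mirrors the 'restart_lsn < <lsn>' predicate only.
    """
    target = parse_lsn(emitted_lsn)
    return [name for name, lsn in slots.items() if parse_lsn(lsn) < target]


# Values copied from the two psql observations above.
first = lagging_slots(
    "0/04000220", {"sb1_slot": "0/04000250", "sb2_slot": "0/04000250"}
)
second = lagging_slots(
    "0/040002A0", {"sb1_slot": "0/040002D0", "sb2_slot": "0/040002D0"}
)
print(first, second)
```

With the values from this run, neither slot is reported as lagging in either case, which is why a check that expects sb1_slot to stay behind the emitted LSN depends on timing.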