Re: Allow logical failover slots to wait on synchronous replication - Mailing list pgsql-hackers
From: Amit Kapila
Subject: Re: Allow logical failover slots to wait on synchronous replication
Msg-id: CAA4eK1KXhK3QeNv9h5=Nrf8o0bx0y1ZwJYr6O+KBRmdkcRZr1A@mail.gmail.com
In response to: Allow logical failover slots to wait on synchronous replication (John H <johnhyvr@gmail.com>)
List: pgsql-hackers
On Tue, Aug 27, 2024 at 12:58 AM John H <johnhyvr@gmail.com> wrote:
>
> For instance, in Shveta's suggestion of
>
> > > > We can perform this test with both of the below settings and say make
> > > > D and E slow in sending responses:
> > > > 1) synchronous_standby_names = 'ANY 3 (A,B,C,D,E)'
> > > > 2) standby_slot_names = A_slot, B_slot, C_slot, D_slot, E_slot.
>
> if the server associated with E_slot is just down or undergoing
> some sort of maintenance, then all logical consumers would start lagging
> until the server is back up. I could also mimic a network lag of 20
> seconds, and it's guaranteed that this patch will perform better.
>

I wanted a simple test where in the first case you use
synchronous_standby_names = 'ANY 3 (A,B,C,D,E)' and in the second case you
use standby_slot_names = A_slot, B_slot, C_slot, D_slot, E_slot. You can
try some variations of it as well. The idea is that even if the performance
is worse for the synchronous_standby_names configuration, we should be able
to document it. This will help users decide what is best for them.

> I re-ran the benchmarks with a longer run time of 3 hours, and tested
> a new shared cache for walsenders to check the value before obtaining
> SyncRepLock.
>
> I also saw I was being throttled on storage in my previous benchmarks,
> so I moved to a new setup.
>
> I benchmarked a new test case with an additional shared cache between
> all the walsenders to reduce potential contention on SyncRepLock, and
> have attached said patch.
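For concreteness, the two configurations being compared might look like this in postgresql.conf (server and slot names are illustrative, taken from Shveta's example; this is a sketch of the proposed test, not a recommended production setup):

```ini
# Case 1: quorum-based synchronous replication; the patch makes
# logical failover slots wait on this quorum.
synchronous_standby_names = 'ANY 3 (A,B,C,D,E)'

# Case 2: logical failover slots gated on specific physical slots.
standby_slot_names = 'A_slot, B_slot, C_slot, D_slot, E_slot'
```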
>
> Database: writer on its own disk, 5 RRs on the other disk together
> Client: 10 logical clients, pgbench running from here as well
>
> 'pgbench -c 32 -j 4 -T 10800 -U "ec2-user" -d postgres -r -P 1'
>
> # Test failover_slots with synchronized_standby_slots = 'rr_1, rr_2,
> rr_3, rr_4, rr_5'
> latency average = 10.683 ms
> latency stddev = 11.851 ms
> initial connection time = 145.876 ms
> tps = 2994.595673 (without initial connection time)
>
> # Test failover_slots waiting on sync_rep, no new shared cache
> latency average = 10.684 ms
> latency stddev = 12.247 ms
> initial connection time = 142.561 ms
> tps = 2994.160136 (without initial connection time)
> statement latencies in milliseconds and failures:
>
> # Test failover slots with additional shared cache
> latency average = 10.674 ms
> latency stddev = 11.917 ms
> initial connection time = 142.486 ms
> tps = 2997.315874 (without initial connection time)
>
> The tps improvement between no cache and shared_cache seems marginal,
> but we do see a slight improvement in stddev, which makes sense from a
> contention perspective.

What is the difference between "Test failover_slots with
synchronized_standby_slots = 'rr_1, rr_2, rr_3, rr_4, rr_5'" and "Test
failover_slots waiting on sync_rep, no new shared cache"? I want to know
what configuration you used for synchronous_standby_names in the latter
case.

> I think the cache would demonstrate a lot more improvement if we had,
> say, 1000 logical slots all trying to obtain SyncRepLock to update
> their values.
>
> I've attached the patch but don't feel particularly strongly about the
> new shared LSN values.

I am also not sure, especially as the test results didn't show much
improvement and the code also becomes a bit complicated.
BTW, in the 0003 version, in the below code:

+ /* Cache values to reduce contention */
+ LWLockAcquire(SyncRepLock, LW_SHARED);
+ memcpy((XLogRecPtr *) walsndctl->cached_lsn, (XLogRecPtr *) walsndctl->lsn,
+        sizeof(lsn));
+ LWLockRelease(SyncRepLock);

which mode's LSN is being copied? I am not sure I understood this part of
the code. In the 0002 version, in the following code [1], you are referring
to the LSN mode which is enabled for the logical walsender, irrespective of
the mode used by the physical walsender. It is possible that they are always
the same, but that is not evident from the code or comments in the patch.

[1]:
+ /* Cache values to reduce contention on lock */
+ for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
+ {
+     lsn[i] = walsndctl->lsn[i];
+ }
- ss_oldest_flush_lsn = min_restart_lsn;
+ LWLockRelease(SyncRepLock);
- return true;
+ if (lsn[mode] >= wait_for_lsn)
+     return true;

--
With Regards,
Amit Kapila.