RE: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers
From | Hayato Kuroda (Fujitsu) |
---|---|
Subject | RE: How can end users know the cause of LR slot sync delays? |
Date | |
Msg-id | OSCPR01MB14966F3E00A6E34625CE35AD2F5F5A@OSCPR01MB14966.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: How can end users know the cause of LR slot sync delays? (Shlok Kyal <shlok.kyal.oss@gmail.com>) |
List | pgsql-hackers |
Dear Shlok, > > 01. > > ``` > > + /* Update the slot sync reason */ > > + SpinLockAcquire(&slot->mutex); > > + if (slot->slot_sync_skip_reason != skip_reason) > > + slot->slot_sync_skip_reason = skip_reason; > > + SpinLockRelease(&slot->mutex); > > ``` > > > > Per my understanding, spinlock is acquired when the attribute on the shared > memory > > is updating. Can you check other parts and follow the rukle? > > > > 02. > > ``` > > + SpinLockAcquire(&slot->mutex); > > + synced = slot->data.synced; > > + SpinLockRelease(&slot->mutex); > > ``` > > > > Same as 1. > > > I checked and found the following comment: > * - Individual fields are protected by mutex where only the backend owning > * the slot is authorized to update the fields from its own slot. The > * backend owning the slot does not need to take this lock when reading its > * own fields, while concurrent backends not owning this slot should take the > * lock when reading this slot's data. > > So for the above two cases we are updating the > 'slot->slot_sync_skip_reason' and reading 'slot->data.synced' and this > can happen before the slot sync worker acquires the slot or owns the > slot. > Also in the same code at a later stage we are again checking the > synced flag and we do that while holding a spin lock. Based on these > observations I think we should take Spinlock in both cases. Hmm, regarding the update_slot_sync_skip_stats(), the replication slot has already been acquired except synchronize_one_slot() case. Can we avoid acquiring the spinlock as much as possible by adding an argument? Or it just introduces additional complexity? > > 09. > > ``` > > +my $connstr_1 = $primary->connstr; > > ``` > > > > Since this is an only connection string in the test, suffix _1 is not needed. > > > Fixed Same as the comment, can you replace "standby1" to "stanby"? > > > 10. > > ``` > > +# Simulate standby connection failure by modifying pg_hba.conf > > +unlink($primary->data_dir . '/pg_hba.conf'); > > +$primary->append_conf('pg_hba.conf', > > + qq{local all all > trust} > > +); > > ``` > > > > What if the system does not have Unix domain socket? I'm afraid all connections > > could be brocked in this case. > > > I have used an injection point to simulate this scenario instead of > changing the contents of pg_hba.conf files. Can you clarify the reason why you used the injection point? I'm not sure the injection point is beneficial here. I feel the point can be added when we handle the timing issue, race condition etc, but walreceiver may not have strong reasons to stop exact at that point. Regarding the content of pg_hba.conf, I felt below lines might be enough: ``` local all all trust host all all 127.0.0.1/32 trust ``` Also, here are comments for v5. ``` + <para> + Reason of the last slot synchronization skip. + </para></entry> ``` Possible values must be clarified. This was posted in [1] but seemed to be missed. ``` + /* Update the slot sync reason */ ``` It is better to clarify updating the *skip* reason ``` - ReplicationSlot *slot; + ReplicationSlot *slot = NULL; ``` No need to initialize as NULL. ``` +#include "utils/injection_point.h" ... + INJECTION_POINT("walreceiver", NULL); ``` As I told above, I have a concern to add the injection point. I want to hear other's opinion as well. ``` + else + { + /* Update the slot sync stats */ + Assert(!found_consistent_snapshot || + *found_consistent_snapshot); + update_slot_sync_skip_stats(slot, SS_SKIP_NONE); + } ``` Your patch may have another issue; if both confirmed_flush_lsn are the same but we do not have the consistent snapshot yet, we would get the assertion failure. (Again, not sure it can really happen) Can we use the condition as another if part? At that time we must clarify why it is OK to pass in case of found_consistent_snapshot == NULL. ``` +# Attach injection point to simulate wait +$standby_psql->query_safe( + q(select injection_points_attach('slot-sync-skip','wait'))); ``` I have been considering whether we can remove the injection point here or not. I think the point is used because the being synchronized slot is still temporary one; they would be cleaned up by ReplicationSlotCleanup(). Can we reproduce the skip event for the permanent slot? I cannot come up with, but if possible no need to introduce the injection point. [1]: https://www.postgresql.org/message-id/OSCPR01MB14966A618A8C61EC3DEE486A4F517A%40OSCPR01MB14966.jpnprd01.prod.outlook.com Best regards, Hayato Kuroda FUJITSU LIMITED
pgsql-hackers by date: