Re: How can end users know the cause of LR slot sync delays? - Mailing list pgsql-hackers
From | Shlok Kyal |
---|---|
Subject | Re: How can end users know the cause of LR slot sync delays? |
Date | |
Msg-id | CANhcyEU3hgjCL+N3Bu73RTe5K6XziyjAi7ucwec8ZcfVdd+o6w@mail.gmail.com Whole thread Raw |
In response to | RE: How can end users know the cause of LR slot sync delays? ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>) |
List | pgsql-hackers |
On Tue, 30 Sept 2025 at 18:22, Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > Dear Shlok, > > Thanks for updating the patch. Here are my comments. > > 01. > ``` > + /* Update the slot sync reason */ > + SpinLockAcquire(&slot->mutex); > + if (slot->slot_sync_skip_reason != skip_reason) > + slot->slot_sync_skip_reason = skip_reason; > + SpinLockRelease(&slot->mutex); > ``` > > Per my understanding, spinlock is acquired when the attribute on the shared memory > is updating. Can you check other parts and follow the rukle? > > 02. > ``` > + SpinLockAcquire(&slot->mutex); > + synced = slot->data.synced; > + SpinLockRelease(&slot->mutex); > ``` > > Same as 1. > I checked and found the following comment: * - Individual fields are protected by mutex where only the backend owning * the slot is authorized to update the fields from its own slot. The * backend owning the slot does not need to take this lock when reading its * own fields, while concurrent backends not owning this slot should take the * lock when reading this slot's data. So for the above two cases we are updating the 'slot->slot_sync_skip_reason' and reading 'slot->data.synced' and this can happen before the slot sync worker acquires the slot or owns the slot. Also in the same code at a later stage we are again checking the synced flag and we do that while holding a spin lock. Based on these observations I think we should take Spinlock in both cases. > 03. > ``` > ``` > > > 04. > ``` > +#ifdef USE_INJECTION_POINTS > + INJECTION_POINT("slot-sync-skip", NULL); > +#endif > ``` > > No need do do #ifdef here. INJECTION_POINT() itself checks internally. > Fixed > 05. > ``` > +# Initialize the primary cluster > +my $primary = PostgreSQL::Test::Cluster->new('publisher'); > +$primary->init( > + allows_streaming => 'logical', > + auth_extra => [ '--create-role' => 'repl_role' ]); > ``` > > Do we have to create repl_role? I'm not sure where it's used. > It is not needed, I have removed it. > 06. > ``` > +$primary->append_conf( > + 'postgresql.conf', qq{ > +autovacuum = off > +max_prepared_transactions = 1 > +}); > ``` > > Do we have to set max_prepared_transactions? PREPARE command is not used. > It is not needed. Removed it. > 07. > ``` > +# Load the injection_points extension > +$primary->safe_psql('postgres', q(CREATE EXTENSION injection_points)) > ``` > > We must check whether the injection_points is available or not. See 047_checkpoint_physical_slot.pl. > Added the check > 08. > ``` > +# Initialize standby from primary backup > +my $standby1 = PostgreSQL::Test::Cluster->new('standby1'); > +$standby1->init_from_backup( > + $primary, $backup_name, > + has_streaming => 1, > + has_restoring => 1); > ``` > > To clarify, is there a reason why we set has_restoring? Can we remove it? > It is not needed. Removed it. > 09. > ``` > +my $connstr_1 = $primary->connstr; > ``` > > Since this is an only connection string in the test, suffix _1 is not needed. > Fixed > 10. > ``` > +# Simulate standby connection failure by modifying pg_hba.conf > +unlink($primary->data_dir . '/pg_hba.conf'); > +$primary->append_conf('pg_hba.conf', > + qq{local all all trust} > +); > ``` > > What if the system does not have Unix domain socket? I'm afraid all connections > could be brocked in this case. > I have used an injection point to simulate this scenario instead of changing the contents of pg_hba.conf files. > 11. > ``` > +# Attempt to sync replication slots while standby is behind > +$standby1->psql('postgres', "SELECT pg_sync_replication_slots();"); > ``` > > When the command can be failed, can you set a error message from the command to > a variable? Otherwise an ERROR is output to the result file, which is suprising. > > ``` > psql:<stdin>:1: ERROR: skipping slot synchronization because the received slot sync LSN 0/03019E58 for slot "slot_sync"is ahead of the standby position 0/030000D8 > [17:23:21.495](0.384s) ok 2 - slot sync skip when standby is behind > ``` > In the existing tests I find similar cases where ERROR is output to the result file. For example in recovery/001_stream file: [10:33:55.443](0.099s) ok 5 - pg_sequence_last_value() on unlogged sequence on standby 1 psql:<stdin>:1: ERROR: cannot execute INSERT in a read-only transaction [10:33:55.468](0.025s) ok 6 - read-only queries on standby 1 psql:<stdin>:1: ERROR: cannot execute INSERT in a read-only transaction In 006_logical_decoding [10:34:19.233](0.050s) ok 8 - pg_recvlogical acknowledged changes psql:<stdin>:1: ERROR: replication slot "test_slot" was not created in this database [10:34:19.432](0.199s) ok 9 - replaying logical slot from another database fails psql:<stdin>:1: ERROR: database "otherdb" is used by an active logical replication slot DETAIL: There is 1 active slot. But I think it is a good idea to set the error message to variable to avoid ERROR messages in the regress log file. Added the change for the same. > 12. > ``` > +# Restore connectivity between primary and standby > +unlink($primary->data_dir . '/pg_hba.conf'); > +$primary->append_conf( > + 'pg_hba.conf', > + qq{ > +local all all trust > +local replication all trust > +}); > ``` > > Same as 10. Also, no need to do unlink() here. > I have used an injection point to simulate this scenario instead of changing the contents of pg_hba.conf files. > 13. > ``` > +# Create a new logical slot for testing injection point > +$primary->safe_psql('postgres', > + "SELECT pg_create_logical_replication_slot('slot_sync', 'test_decoding', false, false, true)" > +); > ``` > > Before here can you add a description what you would test? > Added > 14. > ``` > # Create a physical replication slot on primary for standby > $primary->psql('postgres', > q{SELECT pg_create_physical_replication_slot('sb1_slot');}); > ``` > Use safe_psql instead of psql(). > Fixed > 15. > ``` > - if ((slot = SearchNamedReplicationSlot(remote_slot->name, true))) > + if (slot || (slot = SearchNamedReplicationSlot(remote_slot->name, true))) > ``` > > Is there a possibility that slot is not NULL? It is used when confirmed_flush exceeds > the latestFlushPtr, but in this case the function cannot reach here. > This change is not required. I have reverted this change. Thanks, Shlok Kyal
Attachment
pgsql-hackers by date: