From: Shlok Kyal
Subject: Re: How can end users know the cause of LR slot sync delays?
Msg-id: CANhcyEU3hgjCL+N3Bu73RTe5K6XziyjAi7ucwec8ZcfVdd+o6w@mail.gmail.com
In response to: RE: How can end users know the cause of LR slot sync delays?  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
List: pgsql-hackers
On Tue, 30 Sept 2025 at 18:22, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Shlok,
>
> Thanks for updating the patch. Here are my comments.
>
> 01.
> ```
> +       /* Update the slot sync reason */
> +       SpinLockAcquire(&slot->mutex);
> +       if (slot->slot_sync_skip_reason != skip_reason)
> +               slot->slot_sync_skip_reason = skip_reason;
> +       SpinLockRelease(&slot->mutex);
> ```
>
> Per my understanding, the spinlock is acquired when an attribute in shared
> memory is being updated. Can you check other parts and follow the rule?
>
> 02.
> ```
> +                       SpinLockAcquire(&slot->mutex);
> +                       synced = slot->data.synced;
> +                       SpinLockRelease(&slot->mutex);
> ```
>
> Same as 1.
>
I checked and found the following comment:
 * - Individual fields are protected by mutex where only the backend owning
 * the slot is authorized to update the fields from its own slot.  The
 * backend owning the slot does not need to take this lock when reading its
 * own fields, while concurrent backends not owning this slot should take the
 * lock when reading this slot's data.

So in the above two cases we are updating 'slot->slot_sync_skip_reason' and
reading 'slot->data.synced', and this can happen before the slot sync worker
acquires (owns) the slot.
Also, later in the same code we check the synced flag again, and we do that
while holding the spinlock. Based on these observations, I think we should
take the spinlock in both cases.

> 03.
> ```
> ```
>
>
> 04.
> ```
> +#ifdef USE_INJECTION_POINTS
> +       INJECTION_POINT("slot-sync-skip", NULL);
> +#endif
> ```
>
> No need to do #ifdef here. INJECTION_POINT() itself checks internally.
>
Fixed

> 05.
> ```
> +# Initialize the primary cluster
> +my $primary = PostgreSQL::Test::Cluster->new('publisher');
> +$primary->init(
> +       allows_streaming => 'logical',
> +       auth_extra => [ '--create-role' => 'repl_role' ]);
> ```
>
> Do we have to create repl_role? I'm not sure where it's used.
>
It is not needed, I have removed it.

> 06.
> ```
> +$primary->append_conf(
> +       'postgresql.conf', qq{
> +autovacuum = off
> +max_prepared_transactions = 1
> +});
> ```
>
> Do we have to set max_prepared_transactions? PREPARE command is not used.
>
It is not needed. Removed it.

> 07.
> ```
> +# Load the injection_points extension
> +$primary->safe_psql('postgres', q(CREATE EXTENSION injection_points))
> ```
>
> We must check whether the injection_points is available or not. See 047_checkpoint_physical_slot.pl.
>
Added the check
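
For reference, a minimal sketch of the check, following the pattern of
existing tests such as 047_checkpoint_physical_slot.pl (the node name and
setup are taken from the snippets quoted above; the attached patch has the
real code):

```perl
use strict;
use warnings FATAL => 'all';
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

my $primary = PostgreSQL::Test::Cluster->new('publisher');
$primary->init(allows_streaming => 'logical');
$primary->start;

# Skip the test if the injection_points extension is not installed,
# e.g. for an installcheck run against a build without the module.
if (!$primary->check_extension('injection_points'))
{
	plan skip_all => 'Extension injection_points not installed';
}

$primary->safe_psql('postgres', 'CREATE EXTENSION injection_points');
```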

> 08.
> ```
> +# Initialize standby from primary backup
> +my $standby1 = PostgreSQL::Test::Cluster->new('standby1');
> +$standby1->init_from_backup(
> +       $primary, $backup_name,
> +       has_streaming => 1,
> +       has_restoring => 1);
> ```
>
> To clarify, is there a reason why we set has_restoring? Can we remove it?
>
It is not needed. Removed it.

> 09.
> ```
> +my $connstr_1 = $primary->connstr;
> ```
>
> Since this is the only connection string in the test, the suffix _1 is not needed.
>
Fixed

> 10.
> ```
> +# Simulate standby connection failure by modifying pg_hba.conf
> +unlink($primary->data_dir . '/pg_hba.conf');
> +$primary->append_conf('pg_hba.conf',
> +       qq{local   all             all                                     trust}
> +);
> ```
>
> What if the system does not have Unix domain sockets? I'm afraid all
> connections could be blocked in this case.
>
I have used an injection point to simulate this scenario instead of
changing the contents of pg_hba.conf files.
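
As a rough illustration only (the real change is in the attached patch), the
injection_points SQL interface lets the test force a failure at a named point
and then restore normal behaviour. The point name and the 'error' action
below are placeholders rather than exactly what the patch does, and $standby1
is the node from the setup quoted in comment 08:

```perl
# Illustrative placeholders: error out when execution reaches the named
# injection point, then detach it so later calls behave normally again.
$standby1->safe_psql('postgres',
	"SELECT injection_points_attach('slot-sync-skip', 'error');");

# ... run the statement that is expected to fail here ...

$standby1->safe_psql('postgres',
	"SELECT injection_points_detach('slot-sync-skip');");
```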

> 11.
> ```
> +# Attempt to sync replication slots while standby is behind
> +$standby1->psql('postgres', "SELECT pg_sync_replication_slots();");
> ```
>
> When the command can fail, can you set the error message from the command to
> a variable? Otherwise an ERROR is output to the result file, which is surprising.
>
> ```
> psql:<stdin>:1: ERROR:  skipping slot synchronization because the received slot sync LSN 0/03019E58 for slot "slot_sync" is ahead of the standby position 0/030000D8
> [17:23:21.495](0.384s) ok 2 - slot sync skip when standby is behind
> ```
>

In the existing tests I found similar cases where an ERROR is output to
the result file.
For example, in the recovery/001_stream file:
[10:33:55.443](0.099s) ok 5 - pg_sequence_last_value() on unlogged sequence on standby 1
psql:<stdin>:1: ERROR:  cannot execute INSERT in a read-only transaction
[10:33:55.468](0.025s) ok 6 - read-only queries on standby 1
psql:<stdin>:1: ERROR:  cannot execute INSERT in a read-only transaction

In 006_logical_decoding:
[10:34:19.233](0.050s) ok 8 - pg_recvlogical acknowledged changes
psql:<stdin>:1: ERROR:  replication slot "test_slot" was not created in this database
[10:34:19.432](0.199s) ok 9 - replaying logical slot from another database fails
psql:<stdin>:1: ERROR:  database "otherdb" is used by an active logical replication slot
DETAIL:  There is 1 active slot.

But I think it is a good idea to capture the error message in a variable to
avoid ERROR messages in the regress log file.
I have added that change.
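
For what it's worth, a minimal sketch of that pattern, matching the error
text from the output quoted in comment 11 ($standby1 is the node from the
setup quoted in comment 08):

```perl
# Capture psql's exit code, stdout and stderr instead of letting the
# ERROR leak into the regress log, then check the message explicitly.
my ($ret, $stdout, $stderr) =
  $standby1->psql('postgres', 'SELECT pg_sync_replication_slots();');
like(
	$stderr,
	qr/skipping slot synchronization/,
	'slot sync skip when standby is behind');
```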

> 12.
> ```
> +# Restore connectivity between primary and standby
> +unlink($primary->data_dir . '/pg_hba.conf');
> +$primary->append_conf(
> +       'pg_hba.conf',
> +       qq{
> +local   all             all                                     trust
> +local   replication     all                                     trust
> +});
> ```
>
> Same as 10. Also, no need to do unlink() here.
>
I have used an injection point to simulate this scenario instead of
changing the contents of pg_hba.conf files.

> 13.
> ```
> +# Create a new logical slot for testing injection point
> +$primary->safe_psql('postgres',
> +       "SELECT pg_create_logical_replication_slot('slot_sync', 'test_decoding', false, false, true)"
> +);
> ```
>
> Before here, can you add a description of what you would test?
>
Added

> 14.
> ```
> # Create a physical replication slot on primary for standby
> $primary->psql('postgres',
>         q{SELECT pg_create_physical_replication_slot('sb1_slot');});
> ```
> Use safe_psql instead of psql().
>
Fixed

> 15.
> ```
> -       if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
> +       if (slot || (slot = SearchNamedReplicationSlot(remote_slot->name, true)))
> ```
>
> Is there a possibility that slot is not NULL? It is used when confirmed_flush exceeds
> the latestFlushPtr, but in this case the function cannot reach here.
>
This change is not required. I have reverted this change.

Thanks,
Shlok Kyal

Attachment
