Re: speed up a logical replica setup - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: speed up a logical replica setup |
| Msg-id | 47524383-6fd1-4b13-8c4e-fb918b1d35ba@enterprisedb.com |
| In response to | Re: speed up a logical replica setup ("Euler Taveira" <euler@eulerto.com>) |
| List | pgsql-hackers |
On 3/26/24 03:53, Euler Taveira wrote:
> On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
>> ## Analysis for failure 1
>>
>> The failure is caused by a time lag between the walreceiver finishing and
>> pg_is_in_recovery() returning false.
>>
>> According to the output [1], the tool failed at wait_for_end_recovery()
>> with the message "standby server disconnected from the primary". Also, the lines
>> "redo done at..." and "terminating walreceiver process due to administrator command"
>> indicate that the walreceiver was requested to shut down by XLogShutdownWalRcv().
>>
>> Looking at the source, we can confirm that the walreceiver is shut down in
>> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also, SharedRecoveryState
>> is changed to RECOVERY_STATE_DONE (which means pg_is_in_recovery() returns false)
>> in the latter part of StartupXLOG().
>>
>> So, if there is a delay between FinishWalRecovery() and the state change, the
>> check in wait_for_end_recovery() can fail during that window. Since we allow the
>> walreceiver to be missing only 10 times, checked once per second, the failure
>> occurs if the time lag is longer than 10 seconds.
>>
>> I do not have a good way to fix it. One approach is to make NUM_CONN_ATTEMPTS
>> larger, but that is not a fundamental solution.
>
> I was expecting that slow hosts might have issues in wait_for_end_recovery().
> As you said, a lot of steps happen between FinishWalRecovery() (where the
> walreceiver is shut down -- XLogShutdownWalRcv) and SharedRecoveryState being set
> to RECOVERY_STATE_DONE. If this window takes longer than NUM_CONN_ATTEMPTS *
> WAIT_INTERVAL (10 seconds), the tool aborts the execution. That's a bad decision
> because the promotion has already finished and the server is just doing the final
> preparation to become a primary.
>
> /*
>  * If it is still in recovery, make sure the target server is
>  * connected to the primary so it can receive the required WAL to
>  * finish the recovery process. If it is disconnected, try
>  * NUM_CONN_ATTEMPTS in a row and bail out if it does not succeed.
>  */
> res = PQexec(conn,
>              "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
> if (PQntuples(res) == 0)
> {
>     if (++count > NUM_CONN_ATTEMPTS)
>     {
>         stop_standby_server(subscriber_dir);
>         pg_log_error("standby server disconnected from the primary");
>         break;
>     }
> }
> else
>     count = 0;      /* reset counter if it connects again */
>
> This code was added to defend against the death/crash of the target server. There
> are at least 3 options:
>
> (1) increase NUM_CONN_ATTEMPTS * WAIT_INTERVAL. We discussed this constant
> and I decided to use 10 seconds because even on some slow hosts this time
> wasn't reached during my tests. It seems I forgot to test the combination of a
> slow host, asserts enabled, and ubsan. I didn't notice that pg_promote() uses 60
> seconds as its default wait. Maybe that's a reasonable value. I checked the
> 004_timeline_switch test and the last runs took: 39.2s (serinus), 33.1s
> (culicidae), 18.31s (calliphoridae) and 27.52s (olingo).
> (2) check if the primary is not running when the walreceiver is not available on
> the target server, and increase the connection attempts only if the primary is not
> running. Hence, the described case doesn't increment the count variable.
>
> (3) set the recovery_timeout default to != 0 and remove the pg_stat_wal_receiver
> check that protects against the death/crash of the target server. I explained in a
> previous message that a timeout may occur in cases where WAL replay takes more than
> recovery-timeout seconds to reach a consistent state.
>
> Option (1) is the easiest fix; however, we can hit the same issue again if a
> slow host decides to be even slower, and then we have to adjust this value again.
> Option (2) interprets the walreceiver's absence as the end of recovery, and if the
> primary server is running it can indicate that the target server is on the verge
> of finishing recovery. Option (3) is not as resilient as the other options.
>
> The first patch implements a combination of (1) and (2).
>
>> ## Analysis for failure 2
>>
>> According to [2], the physical replication slot specified as primary_slot_name
>> was not being used by the walsender process. At that time, the walsender did not
>> exist yet.
>>
>> ```
>> ...
>> pg_createsubscriber: publisher: current wal senders: 0
>> pg_createsubscriber: command is: SELECT 1 FROM pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
>> pg_createsubscriber: error: could not obtain replication slot information: got 0 rows, expected 1 row
>> ...
>> ```
>>
>> Currently the standby must be stopped before the command runs, and the current
>> code does not block until replication has started. So there is a possibility that
>> the check runs before the walsender is launched.
>>
>> One possible approach is to wait until replication starts. An alternative is to
>> relax the condition.
>
> That's my suggestion too. I reused NUM_CONN_ATTEMPTS (renamed to
> NUM_ATTEMPTS in the first patch). See the second patch.
>
Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
waited long enough?
IMHO the test should simply pass PG_TEST_DEFAULT_TIMEOUT when calling
pg_createsubscriber, and that should do the trick.
Increasing PG_TEST_DEFAULT_TIMEOUT is how buildfarm animals running
things like ubsan/valgrind already deal with exactly this kind of
timeout problem.
Or is there a deeper problem with deciding if the system is in recovery?
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company