restore_command on high-throughput cluster never switches to streaming replication - Mailing list pgsql-admin

From	Kasper Føns
Subject	restore_command on high-throughput cluster never switches to streaming replication
Date	November 24 16:46:26
Msg-id	CANOng2i1G_57nvZ4ip4uKKU87jtt+fzqWUFV_ou6L8N3bteSXQ@mail.gmail.com
List	pgsql-admin

Hi PostgreSQL community.

I debugged an instance where a PostgreSQL standby would not switch to streaming replication when the `restore_command` fails.

Expectation

I expect PostgreSQL to try switching to streaming replication if the `restore_command` fails.

What happens

PostgreSQL first restores the previously restored WAL segment again and then retries the failed segment. However, because the primary produces WAL at a high rate, the failed segment has reached the archive by the time of the retry, so the restore succeeds and PostgreSQL never tries to switch to streaming replication.

Context

Running PostgreSQL 15.7 in Kubernetes using CloudNative PostgreSQL Operator.

Logs

I configured PostgreSQL to emit DEBUG3 level logs. Newest logs first, oldest last.

got WAL segment from archive
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json 000000410000A7BA00000058 pg_wal/RECOVERYXLOG"
got WAL segment from archive
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json 000000410000A7BA00000057 pg_wal/RECOVERYXLOG"
could not open file "pg_wal/000000410000A7BA00000058": No such file or directory
could not restore file "000000410000A7BA00000058" from archive: child process exited with exit code 1
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json 000000410000A7BA00000058 pg_wal/RECOVERYXLOG"
got WAL segment from archive
executing restore command "/controller/manager wal-restore --log-destination /controller/log/postgres.json 000000410000A7BA00000057 pg_wal/RECOVERYXLOG"

Notice that when the restore of 000000410000A7BA00000058 failed, PostgreSQL asked for 000000410000A7BA00000057, which it had already restored. Afterwards, it asked for 000000410000A7BA00000058 once again.
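
The sequence in the log can be reproduced with a small toy model. This is only a sketch of the behaviour described in this message, not PostgreSQL's actual recovery code: `BusyArchive`, `replay`, the rate of one new archived segment per two restore calls, and the rule that streaming is reached only when the retry also fails are all assumptions made for illustration.

```python
from typing import Callable, List


class BusyArchive:
    """Hypothetical archive that gains one new segment every two restore calls."""

    def __init__(self, newest: int) -> None:
        self.newest = newest  # highest segment number currently archived
        self.calls = 0

    def __call__(self, segment: int) -> bool:
        found = segment <= self.newest
        self.calls += 1
        if self.calls % 2 == 0:
            self.newest += 1  # the primary archived another segment meanwhile
        return found


def replay(restore: Callable[[int], bool], start: int, max_events: int) -> List[str]:
    """Toy model of the standby loop; stops after roughly max_events events."""
    events: List[str] = []
    segment = start
    while len(events) < max_events:
        if restore(segment):
            events.append(f"restored {segment:02X}")
            segment += 1
            continue
        events.append(f"failed {segment:02X}")
        # As seen in the log: the previous segment is requested again,
        # then the failed segment is retried.
        if restore(segment - 1):
            events.append(f"restored {segment - 1:02X}")
        if restore(segment):
            events.append(f"restored {segment:02X}")
            segment += 1
        else:
            # Assumption: streaming is only attempted when the retry fails too.
            events.append("streaming")
            break
    return events


# A busy archive always catches up before the retry, so streaming is never reached.
print(replay(BusyArchive(newest=0x57), start=0x57, max_events=8))
# A restore command that always fails (like /bin/false) reaches streaming at once.
print(replay(lambda segment: False, start=0x57, max_events=8))
```

In this model the first four events of the busy-archive run match the log above when read oldest first: 57 restored, 58 failed, 57 restored again, 58 restored.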

Problem

This is problematic because the standby will never switch to streaming replication.

Workaround

We can get the PostgreSQL replica in sync by changing the `restore_command` to `/bin/false` once the standby is within `wal_keep_size` of the primary.
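
One way to check the "within `wal_keep_size`" condition is to compare WAL positions as byte offsets. The following is a minimal sketch; the helper names and the LSN values are hypothetical, and it assumes `wal_keep_size` is given in megabytes (PostgreSQL's default unit for that setting, where 1 MB is 1024 * 1024 bytes).

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN such as 'A7BA/58000000' to an absolute byte position."""
    high, low = lsn.split("/")
    # An LSN is two hexadecimal numbers: the high and low 32 bits of the position.
    return (int(high, 16) << 32) | int(low, 16)


def within_wal_keep_size(primary_lsn: str, standby_lsn: str, wal_keep_size_mb: int) -> bool:
    """Return True if the standby's lag in bytes fits inside wal_keep_size."""
    lag = lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_lsn)
    return lag <= wal_keep_size_mb * 1024 * 1024


# Hypothetical positions two 16 MB segments apart, checked against 64 MB.
print(within_wal_keep_size("A7BA/5A000000", "A7BA/58000000", 64))
```

The two positions could be read with `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on the standby.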

Question

Is this the expected behaviour?

I expected the function `WaitForWALToBecomeAvailable` to switch to streaming replication as soon as a single `restore_command` invocation fails. That switch does happen when `/bin/false` is used as the command instead.

Any help would be greatly appreciated.

/Kasper Føns
