Re: Primary and standby setting cross-checks - Mailing list pgsql-hackers
| From | James Pang |
|---|---|
| Subject | Re: Primary and standby setting cross-checks |
| Date | |
| Msg-id | CAHgTRfdF8d3n8CV6fbFRiU2ecDJSagWV+jjpHGAAJbXYPZuBaQ@mail.gmail.com |
| In response to | Re: Primary and standby setting cross-checks (Kirill Reshke <reshkekirill@gmail.com>) |
| List | pgsql-hackers |
May I know the progress on this? Sometimes a standby only offloads part of the query load and runs with less CPU and RAM than the primary, so could this check be a flexible warning instead of a fatal error? We set track_commit_timestamp = on, raised max_worker_processes, changed wal_level to logical, and restarted the primary. That caused the standby to fail and recover from the last restart point, which led to another issue: for transactions committed before track_commit_timestamp was turned on, looking up the commit timestamp failed.
FATAL: recovery aborted because of insufficient parameter settings
DETAIL: max_worker_processes = 10 is a lower setting than on the primary server, where its value was 30.
HINT: You can restart the server after making the necessary configuration changes.
CONTEXT: WAL redo at 1F4F/xxxxxxxx for XLOG/PARAMETER_CHANGE: max_connections=xxx max_worker_processes=30 max_wal_sen
LOG: startup process (PID xxx) exited with exit code 1
LOG: terminating any other active server processes
LOG: shutting down due to startup process failure
LOG: database system is shut down
LOG: starting PostgreSQL 14.18 on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712
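For reference, the parameters covered by this cross-check are the ones carried in the XLOG/PARAMETER_CHANGE record shown in the CONTEXT line above; on a hot standby, each must be set at least as high as on the primary. A sketch of a conforming standby postgresql.conf, with illustrative values only:

```ini
# On the standby, each of these must be >= the primary's value,
# or recovery aborts with "insufficient parameter settings".
max_connections = 100
max_worker_processes = 30        # primary had 30; 10 triggered the FATAL above
max_wal_senders = 10
max_prepared_transactions = 0
max_locks_per_transaction = 64
```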
Thanks,
James
Kirill Reshke <reshkekirill@gmail.com> wrote on Thursday, March 13, 2025 at 2:09 AM:
On Thu, 29 Aug 2024 at 23:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Currently, if you configure a hot standby server with a smaller
> max_connections setting than the primary, the server refuses to start up:
>
> LOG: entering standby mode
> FATAL: recovery aborted because of insufficient parameter settings
> DETAIL: max_connections = 10 is a lower setting than on the primary
> server, where its value was 100.
> HINT: You can restart the server after making the necessary
> configuration changes.
>
> Or if you change the setting in the primary while the standby is
> running, replay pauses:
>
> WARNING: hot standby is not possible because of insufficient parameter
> settings
> DETAIL: max_connections = 100 is a lower setting than on the primary
> server, where its value was 200.
> CONTEXT: WAL redo at 2/E10000D8 for XLOG/PARAMETER_CHANGE:
> max_connections=200 max_worker_processes=8 max_wal_senders=10
> max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical
> wal_log_hints=off track_commit_timestamp=off
> LOG: recovery has paused
> DETAIL: If recovery is unpaused, the server will shut down.
> HINT: You can then restart the server after making the necessary
> configuration changes.
> CONTEXT: WAL redo at 2/E10000D8 for XLOG/PARAMETER_CHANGE:
> max_connections=200 max_worker_processes=8 max_wal_senders=10
> max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical
> wal_log_hints=off track_commit_timestamp=off
>
> Both of these are rather unpleasant behavior.
>
> I thought I could get rid of that limitation with my CSN snapshot patch
> [1], because it gets rid of the fixed-size known-assigned XIDs array,
> but there's a second reason for these limitations. It's also used to
> ensure that the standby has enough space in the lock manager to hold
> possible AccessExclusiveLocks taken by transactions in the primary.
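The lock-table sizing at issue here can be sketched roughly as follows. This is a simplification of PostgreSQL's actual NLOCKENTS() computation in the lock manager: the real formula uses MaxBackends, which also counts autovacuum workers, background workers, and WAL senders, not just max_connections. All names below are illustrative:

```python
# Rough sketch: the shared lock table is sized proportionally to
# max_locks_per_transaction times the number of possible lock holders.
# (Simplified; PostgreSQL's MaxBackends includes more than max_connections.)
def approx_lock_slots(max_locks_per_transaction: int,
                      max_connections: int,
                      max_prepared_transactions: int) -> int:
    return max_locks_per_transaction * (
        max_connections + max_prepared_transactions)

primary_capacity = approx_lock_slots(64, 200, 0)
standby_capacity = approx_lock_slots(64, 100, 0)

# The cross-check insists standby_capacity >= primary_capacity, yet
# standby-side read-only queries draw from the same pool, so the check
# can neither be necessary nor sufficient in practice.
```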
>
> So firstly, I think that's a bad tradeoff. In vast majority of cases,
> you would not run out of lock space anyway, if you just started up the
> system. Secondly, that cross-check of settings doesn't fully prevent the
> problem. It ensures that the lock tables are large enough to accommodate
> all the locks you could possibly hold in the primary, but that doesn't
> take into account any additional locks held by read-only queries in the
> hot standby. So if you have queries running in the standby that take a
> lot of locks, this can happen anyway:
>
> 2024-08-29 21:44:32.634 EEST [668327] FATAL: out of shared memory
> 2024-08-29 21:44:32.634 EEST [668327] HINT: You might need to increase
> "max_locks_per_transaction".
> 2024-08-29 21:44:32.634 EEST [668327] CONTEXT: WAL redo at 2/FD40FCC8
> for Standby/LOCK: xid 996 db 5 rel 154045
> 2024-08-29 21:44:32.634 EEST [668327] WARNING: you don't own a lock of
> type AccessExclusiveLock
> 2024-08-29 21:44:32.634 EEST [668327] LOG: RecoveryLockHash contains
> entry for lock no longer recorded by lock manager: xid 996 database 5
> relation 154045
> TRAP: failed Assert("false"), File:
> "../src/backend/storage/ipc/standby.c", Line: 1053, PID: 668327
> postgres: startup recovering
> 0000000100000002000000FD(ExceptionalCondition+0x6e)[0x556a4588396e]
> postgres: startup recovering
> 0000000100000002000000FD(+0x44156e)[0x556a4571356e]
> postgres: startup recovering
> 0000000100000002000000FD(StandbyReleaseAllLocks+0x78)[0x556a45712738]
> postgres: startup recovering
> 0000000100000002000000FD(ShutdownRecoveryTransactionEnvironment+0x15)[0x556a45712685]
> postgres: startup recovering
> 0000000100000002000000FD(shmem_exit+0x111)[0x556a457062e1]
> postgres: startup recovering
> 0000000100000002000000FD(+0x434132)[0x556a45706132]
> postgres: startup recovering
> 0000000100000002000000FD(proc_exit+0x59)[0x556a45706079]
> postgres: startup recovering
> 0000000100000002000000FD(errfinish+0x278)[0x556a45884708]
> postgres: startup recovering
> 0000000100000002000000FD(LockAcquireExtended+0xa46)[0x556a45719386]
> postgres: startup recovering
> 0000000100000002000000FD(StandbyAcquireAccessExclusiveLock+0x11d)[0x556a4571330d]
> postgres: startup recovering
> 0000000100000002000000FD(standby_redo+0x70)[0x556a45713690]
> postgres: startup recovering
> 0000000100000002000000FD(PerformWalRecovery+0x7b3)[0x556a4547d313]
> postgres: startup recovering
> 0000000100000002000000FD(StartupXLOG+0xac3)[0x556a4546dae3]
> postgres: startup recovering
> 0000000100000002000000FD(StartupProcessMain+0xe8)[0x556a45693558]
> postgres: startup recovering
> 0000000100000002000000FD(+0x3ba95d)[0x556a4568c95d]
> postgres: startup recovering
> 0000000100000002000000FD(+0x3bce41)[0x556a4568ee41]
> postgres: startup recovering
> 0000000100000002000000FD(PostmasterMain+0x116e)[0x556a4568eaae]
> postgres: startup recovering
> 0000000100000002000000FD(+0x2f960e)[0x556a455cb60e]
> /lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f10ef042c8a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f10ef042d45]
> postgres: startup recovering
> 0000000100000002000000FD(_start+0x21)[0x556a453af011]
> 2024-08-29 21:44:32.641 EEST [668324] LOG: startup process (PID 668327)
> was terminated by signal 6: Aborted
> 2024-08-29 21:44:32.641 EEST [668324] LOG: terminating any other active
> server processes
> 2024-08-29 21:44:32.654 EEST [668324] LOG: shutting down due to startup
> process failure
> 2024-08-29 21:44:32.729 EEST [668324] LOG: database system is shut down
>
> Granted, if you restart the server, it will probably succeed because
> restarting the server will kill all the other queries that were holding
> locks. But yuck. With assertions disabled, it looks a little less scary,
> but not nice anyway.
>
> So how to improve this? I see a few options:
>
> a) Downgrade the error at startup to a warning, and allow starting the
> standby with smaller settings in standby. At least with a smaller
> max_locks_per_transactions. The other settings also affect the size of
> known-assigned XIDs array, but if the CSN snapshots get committed, that
> will get fixed. In most cases there is enough lock memory anyway, and it
> will be fine. Just fix the assertion failure so that the error message
> is a little nicer.
>
> b) If you run out of lock space, kill running queries, and prevent new
> ones from starting. Track the locks in startup process' private memory
> until there is enough space in the lock manager, and then re-open for
> queries. In essence, go from hot standby mode to warm standby, until
> it's possible to go back to hot standby mode again.
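Option (b) could be sketched as a small state machine. This is hypothetical pseudocode illustrating the proposal, not any actual patch; every name here is invented:

```python
# Hypothetical sketch of option (b): drop from hot to warm standby while
# the shared lock table is full, buffering the primary's
# AccessExclusiveLocks in startup-process-private memory until slots free up.
class StandbyLockState:
    def __init__(self, shared_capacity: int):
        self.capacity = shared_capacity
        self.shared = set()    # locks held in the shared lock table
        self.pending = []      # locks tracked only in startup-local memory
        self.hot = True        # whether read-only queries are allowed

    def acquire(self, lock):
        if len(self.shared) < self.capacity:
            self.shared.add(lock)
        else:
            # Out of shared space: kill running queries, refuse new ones,
            # and remember the lock privately (warm standby mode).
            self.hot = False
            self.pending.append(lock)

    def release(self, lock):
        self.shared.discard(lock)
        # Drain buffered locks into freed slots; reopen for queries
        # once nothing is buffered any more.
        while self.pending and len(self.shared) < self.capacity:
            self.shared.add(self.pending.pop(0))
        if not self.pending:
            self.hot = True
```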
>
> Thoughts, better ideas?
>
> [1] https://commitfest.postgresql.org/49/4912/
>
> --
> Heikki Linnakangas
> Neon (https://neon.tech)
>
>
Hello! Do you intend to pursue this further?
--
Best regards,
Kirill Reshke