Re: Primary and standby setting cross-checks - Mailing list pgsql-hackers

From: Kirill Reshke
Subject: Re: Primary and standby setting cross-checks
Msg-id: CALdSSPgyAmTJTRWDXXVZV8OA1LS9PURq8mk74R7RnuqyxGZTow@mail.gmail.com
In response to: Primary and standby setting cross-checks (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
On Thu, 29 Aug 2024 at 23:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Currently, if you configure a hot standby server with a smaller
> max_connections setting than the primary, the server refuses to start up:
>
> LOG: entering standby mode
> FATAL: recovery aborted because of insufficient parameter settings
> DETAIL: max_connections = 10 is a lower setting than on the primary
> server, where its value was 100.
> HINT: You can restart the server after making the necessary
> configuration changes.
>
> Or if you change the setting in the primary while the standby is
> running, replay pauses:
>
> WARNING: hot standby is not possible because of insufficient parameter
> settings
> DETAIL: max_connections = 100 is a lower setting than on the primary
> server, where its value was 200.
> CONTEXT: WAL redo at 2/E10000D8 for XLOG/PARAMETER_CHANGE:
> max_connections=200 max_worker_processes=8 max_wal_senders=10
> max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical
> wal_log_hints=off track_commit_timestamp=off
> LOG: recovery has paused
> DETAIL: If recovery is unpaused, the server will shut down.
> HINT: You can then restart the server after making the necessary
> configuration changes.
> CONTEXT: WAL redo at 2/E10000D8 for XLOG/PARAMETER_CHANGE:
> max_connections=200 max_worker_processes=8 max_wal_senders=10
> max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical
> wal_log_hints=off track_commit_timestamp=off
>
> Both of these are rather unpleasant behavior.
>
> I thought I could get rid of that limitation with my CSN snapshot patch
> [1], because it gets rid of the fixed-size known-assigned XIDs array,
> but there's a second reason for these limitations. It's also used to
> ensure that the standby has enough space in the lock manager to hold
> possible AccessExclusiveLocks taken by transactions in the primary.
>
> So firstly, I think that's a bad tradeoff.
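As context for the sizing this cross-check protects, here is a rough sketch of the arithmetic in plain C. The capacity formula follows the documented sizing of the shared lock table (roughly max_locks_per_transaction slots per backend or prepared transaction); the function names are invented for illustration only, and the real server compares each setting individually rather than the products, as the FATAL message above shows.

```c
#include <stdbool.h>

/* Illustrative only, not PostgreSQL source.  The shared lock table has
 * room for roughly max_locks_per_transaction lockable objects per
 * backend or prepared transaction. */
int
lock_table_capacity(int max_connections, int max_prepared_xacts,
                    int max_locks_per_xact)
{
    return max_locks_per_xact * (max_connections + max_prepared_xacts);
}

/* Simplified form of what the startup cross-check guarantees: every
 * lock the primary could possibly hold fits into the standby's table.
 * Note it says nothing about locks taken by the standby's own
 * read-only queries. */
bool
standby_lock_capacity_ok(int p_conns, int p_prepared, int p_locks_per_xact,
                         int s_conns, int s_prepared, int s_locks_per_xact)
{
    return lock_table_capacity(s_conns, s_prepared, s_locks_per_xact)
        >= lock_table_capacity(p_conns, p_prepared, p_locks_per_xact);
}
```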
> In vast majority of cases,
> you would not run out of lock space anyway, if you just started up the
> system. Secondly, that cross-check of settings doesn't fully prevent the
> problem. It ensures that the lock tables are large enough to accommodate
> all the locks you could possibly hold in the primary, but that doesn't
> take into account any additional locks held by read-only queries in the
> hot standby. So if you have queries running in the standby that take a
> lot of locks, this can happen anyway:
>
> 2024-08-29 21:44:32.634 EEST [668327] FATAL: out of shared memory
> 2024-08-29 21:44:32.634 EEST [668327] HINT: You might need to increase
> "max_locks_per_transaction".
> 2024-08-29 21:44:32.634 EEST [668327] CONTEXT: WAL redo at 2/FD40FCC8
> for Standby/LOCK: xid 996 db 5 rel 154045
> 2024-08-29 21:44:32.634 EEST [668327] WARNING: you don't own a lock of
> type AccessExclusiveLock
> 2024-08-29 21:44:32.634 EEST [668327] LOG: RecoveryLockHash contains
> entry for lock no longer recorded by lock manager: xid 996 database 5
> relation 154045
> TRAP: failed Assert("false"), File:
> "../src/backend/storage/ipc/standby.c", Line: 1053, PID: 668327
> postgres: startup recovering
> 0000000100000002000000FD(ExceptionalCondition+0x6e)[0x556a4588396e]
> postgres: startup recovering
> 0000000100000002000000FD(+0x44156e)[0x556a4571356e]
> postgres: startup recovering
> 0000000100000002000000FD(StandbyReleaseAllLocks+0x78)[0x556a45712738]
> postgres: startup recovering
> 0000000100000002000000FD(ShutdownRecoveryTransactionEnvironment+0x15)[0x556a45712685]
> postgres: startup recovering
> 0000000100000002000000FD(shmem_exit+0x111)[0x556a457062e1]
> postgres: startup recovering
> 0000000100000002000000FD(+0x434132)[0x556a45706132]
> postgres: startup recovering
> 0000000100000002000000FD(proc_exit+0x59)[0x556a45706079]
> postgres: startup recovering
> 0000000100000002000000FD(errfinish+0x278)[0x556a45884708]
> postgres: startup recovering
> 0000000100000002000000FD(LockAcquireExtended+0xa46)[0x556a45719386]
> postgres: startup recovering
> 0000000100000002000000FD(StandbyAcquireAccessExclusiveLock+0x11d)[0x556a4571330d]
> postgres: startup recovering
> 0000000100000002000000FD(standby_redo+0x70)[0x556a45713690]
> postgres: startup recovering
> 0000000100000002000000FD(PerformWalRecovery+0x7b3)[0x556a4547d313]
> postgres: startup recovering
> 0000000100000002000000FD(StartupXLOG+0xac3)[0x556a4546dae3]
> postgres: startup recovering
> 0000000100000002000000FD(StartupProcessMain+0xe8)[0x556a45693558]
> postgres: startup recovering
> 0000000100000002000000FD(+0x3ba95d)[0x556a4568c95d]
> postgres: startup recovering
> 0000000100000002000000FD(+0x3bce41)[0x556a4568ee41]
> postgres: startup recovering
> 0000000100000002000000FD(PostmasterMain+0x116e)[0x556a4568eaae]
> postgres: startup recovering
> 0000000100000002000000FD(+0x2f960e)[0x556a455cb60e]
> /lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f10ef042c8a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f10ef042d45]
> postgres: startup recovering
> 0000000100000002000000FD(_start+0x21)[0x556a453af011]
> 2024-08-29 21:44:32.641 EEST [668324] LOG: startup process (PID 668327)
> was terminated by signal 6: Aborted
> 2024-08-29 21:44:32.641 EEST [668324] LOG: terminating any other active
> server processes
> 2024-08-29 21:44:32.654 EEST [668324] LOG: shutting down due to startup
> process failure
> 2024-08-29 21:44:32.729 EEST [668324] LOG: database system is shut down
>
> Granted, if you restart the server, it will probably succeed because
> restarting the server will kill all the other queries that were holding
> locks. But yuck. With assertions disabled, it looks a little less scary,
> but not nice anyway.
>
> So how to improve this? I see a few options:
>
> a) Downgrade the error at startup to a warning, and allow starting the
> standby with smaller settings in standby. At least with a smaller
> max_locks_per_transactions.
> The other settings also affect the size of
> known-assigned XIDs array, but if the CSN snapshots get committed, that
> will get fixed. In most cases there is enough lock memory anyway, and it
> will be fine. Just fix the assertion failure so that the error message
> is a little nicer.
>
> b) If you run out of lock space, kill running queries, and prevent new
> ones from starting. Track the locks in startup process' private memory
> until there is enough space in the lock manager, and then re-open for
> queries. In essence, go from hot standby mode to warm standby, until
> it's possible to go back to hot standby mode again.
>
> Thoughts, better ideas?
>
> [1] https://commitfest.postgresql.org/49/4912/
>
> --
> Heikki Linnakangas
> Neon (https://neon.tech)

Hello! Do you intend to pursue this further?

--
Best regards,
Kirill Reshke
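If it helps the discussion, idea (b) above could be sketched as a small state machine. This is a toy model under my own assumptions, not PostgreSQL code; every name is invented, and killing queries / fencing off new sessions is not modelled, only the bookkeeping: on lock-table exhaustion during redo, drop from hot to warm standby, park the lock in startup-private memory, and drain the backlog back into the shared table as space frees up.

```c
#include <stdbool.h>

#define MAX_DEFERRED 1024

enum standby_mode { HOT_STANDBY, WARM_STANDBY };

typedef struct
{
    int shared_capacity;         /* slots in the shared lock table */
    int shared_used;
    int deferred[MAX_DEFERRED];  /* lock tags parked in startup-private memory */
    int ndeferred;
    enum standby_mode mode;
} standby_state;

/* Redo of a Standby/LOCK record: take the AccessExclusiveLock in the
 * shared table if possible, otherwise go warm and defer it. */
void
redo_access_exclusive_lock(standby_state *st, int locktag)
{
    if (st->mode == HOT_STANDBY && st->shared_used < st->shared_capacity)
    {
        st->shared_used++;
        return;
    }
    /* Out of shared lock space: here read-only sessions would be killed
     * and new ones refused (not modelled); track the lock privately so
     * replay never has to pause. */
    st->mode = WARM_STANDBY;
    st->deferred[st->ndeferred++] = locktag;
}

/* Called whenever shared lock space is released: move deferred locks
 * into the shared table, and reopen for queries once the backlog is
 * empty. */
void
try_resume_hot_standby(standby_state *st)
{
    while (st->ndeferred > 0 && st->shared_used < st->shared_capacity)
    {
        st->shared_used++;
        st->ndeferred--;
    }
    if (st->ndeferred == 0)
        st->mode = HOT_STANDBY;
}
```

The drain step is the crux of the idea as I read it: the startup process keeps applying WAL while the backlog lives in its private memory, and hot standby resumes as soon as the shared table catches up, instead of replay pausing or the server shutting down.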