Re: Primary and standby setting cross-checks - Mailing list pgsql-hackers

From Kirill Reshke
Subject Re: Primary and standby setting cross-checks
Date
Msg-id CALdSSPgyAmTJTRWDXXVZV8OA1LS9PURq8mk74R7RnuqyxGZTow@mail.gmail.com
Whole thread Raw
In response to Primary and standby setting cross-checks  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On Thu, 29 Aug 2024 at 23:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Currently, if you configure a hot standby server with a smaller
> max_connections setting than the primary, the server refuses to start up:
>
> LOG:  entering standby mode
> FATAL:  recovery aborted because of insufficient parameter settings
> DETAIL:  max_connections = 10 is a lower setting than on the primary
> server, where its value was 100.
> HINT:  You can restart the server after making the necessary
> configuration changes.
>
> Or if you change the setting in the primary while the standby is
> running, replay pauses:
>
> WARNING:  hot standby is not possible because of insufficient parameter
> settings
> DETAIL:  max_connections = 100 is a lower setting than on the primary
> server, where its value was 200.
> CONTEXT:  WAL redo at 2/E10000D8 for XLOG/PARAMETER_CHANGE:
> max_connections=200 max_worker_processes=8 max_wal_senders=10
> max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical
> wal_log_hints=off track_commit_timestamp=off
> LOG:  recovery has paused
> DETAIL:  If recovery is unpaused, the server will shut down.
> HINT:  You can then restart the server after making the necessary
> configuration changes.
> CONTEXT:  WAL redo at 2/E10000D8 for XLOG/PARAMETER_CHANGE:
> max_connections=200 max_worker_processes=8 max_wal_senders=10
> max_prepared_xacts=0 max_locks_per_xact=64 wal_level=logical
> wal_log_hints=off track_commit_timestamp=off
>
> Both of these are rather unpleasant behavior.
>
> I thought I could get rid of that limitation with my CSN snapshot patch
> [1], because it gets rid of the fixed-size known-assigned XIDs array,
> but there's a second reason for these limitations. It's also used to
> ensure that the standby has enough space in the lock manager to hold
> possible AccessExclusiveLocks taken by transactions in the primary.
>
> So firstly, I think that's a bad tradeoff. In vast majority of cases,
> you would not run out of lock space anyway, if you just started up the
> system. Secondly, that cross-check of settings doesn't fully prevent the
> problem. It ensures that the lock tables are large enough to accommodate
> all the locks you could possibly hold in the primary, but that doesn't
> take into account any additional locks held by read-only queries in the
> hot standby. So if you have queries running in the standby that take a
> lot of locks, this can happen anyway:
>
> 2024-08-29 21:44:32.634 EEST [668327] FATAL:  out of shared memory
> 2024-08-29 21:44:32.634 EEST [668327] HINT:  You might need to increase
> "max_locks_per_transaction".
> 2024-08-29 21:44:32.634 EEST [668327] CONTEXT:  WAL redo at 2/FD40FCC8
> for Standby/LOCK: xid 996 db 5 rel 154045
> 2024-08-29 21:44:32.634 EEST [668327] WARNING:  you don't own a lock of
> type AccessExclusiveLock
> 2024-08-29 21:44:32.634 EEST [668327] LOG:  RecoveryLockHash contains
> entry for lock no longer recorded by lock manager: xid 996 database 5
> relation 154045
> TRAP: failed Assert("false"), File:
> "../src/backend/storage/ipc/standby.c", Line: 1053, PID: 668327
> postgres: startup recovering
> 0000000100000002000000FD(ExceptionalCondition+0x6e)[0x556a4588396e]
> postgres: startup recovering
> 0000000100000002000000FD(+0x44156e)[0x556a4571356e]
> postgres: startup recovering
> 0000000100000002000000FD(StandbyReleaseAllLocks+0x78)[0x556a45712738]
> postgres: startup recovering
> 0000000100000002000000FD(ShutdownRecoveryTransactionEnvironment+0x15)[0x556a45712685]
> postgres: startup recovering
> 0000000100000002000000FD(shmem_exit+0x111)[0x556a457062e1]
> postgres: startup recovering
> 0000000100000002000000FD(+0x434132)[0x556a45706132]
> postgres: startup recovering
> 0000000100000002000000FD(proc_exit+0x59)[0x556a45706079]
> postgres: startup recovering
> 0000000100000002000000FD(errfinish+0x278)[0x556a45884708]
> postgres: startup recovering
> 0000000100000002000000FD(LockAcquireExtended+0xa46)[0x556a45719386]
> postgres: startup recovering
> 0000000100000002000000FD(StandbyAcquireAccessExclusiveLock+0x11d)[0x556a4571330d]
> postgres: startup recovering
> 0000000100000002000000FD(standby_redo+0x70)[0x556a45713690]
> postgres: startup recovering
> 0000000100000002000000FD(PerformWalRecovery+0x7b3)[0x556a4547d313]
> postgres: startup recovering
> 0000000100000002000000FD(StartupXLOG+0xac3)[0x556a4546dae3]
> postgres: startup recovering
> 0000000100000002000000FD(StartupProcessMain+0xe8)[0x556a45693558]
> postgres: startup recovering
> 0000000100000002000000FD(+0x3ba95d)[0x556a4568c95d]
> postgres: startup recovering
> 0000000100000002000000FD(+0x3bce41)[0x556a4568ee41]
> postgres: startup recovering
> 0000000100000002000000FD(PostmasterMain+0x116e)[0x556a4568eaae]
> postgres: startup recovering
> 0000000100000002000000FD(+0x2f960e)[0x556a455cb60e]
> /lib/x86_64-linux-gnu/libc.so.6(+0x27c8a)[0x7f10ef042c8a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f10ef042d45]
> postgres: startup recovering
> 0000000100000002000000FD(_start+0x21)[0x556a453af011]
> 2024-08-29 21:44:32.641 EEST [668324] LOG:  startup process (PID 668327)
> was terminated by signal 6: Aborted
> 2024-08-29 21:44:32.641 EEST [668324] LOG:  terminating any other active
> server processes
> 2024-08-29 21:44:32.654 EEST [668324] LOG:  shutting down due to startup
> process failure
> 2024-08-29 21:44:32.729 EEST [668324] LOG:  database system is shut down
>
> Granted, if you restart the server, it will probably succeed because
> restarting the server will kill all the other queries that were holding
> locks. But yuck. With assertions disabled, it looks a little less scary,
> but not nice anyway.
>
> So how to improve this? I see a few options:
>
> a) Downgrade the error at startup to a warning, and allow starting the
> standby with smaller settings in standby. At least with a smaller
> max_locks_per_transactions. The other settings also affect the size of
> known-assigned XIDs array, but if the CSN snapshots get committed, that
> will get fixed. In most cases there is enough lock memory anyway, and it
> will be fine. Just fix the assertion failure so that the error message
> is a little nicer.
>
> b) If you run out of lock space, kill running queries, and prevent new
> ones from starting. Track the locks in startup process' private memory
> until there is enough space in the lock manager, and then re-open for
> queries. In essence, go from hot standby mode to warm standby, until
> it's possible to go back to hot standby mode again.
>
> Thoughts, better ideas?
>
> [1] https://commitfest.postgresql.org/49/4912/
>
> --
> Heikki Linnakangas
> Neon (https://neon.tech)
>
>

Hello! Do you intend to pursue this further?

-- 
Best regards,
Kirill Reshke



pgsql-hackers by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Rename functions to alloc/free things in reorderbuffer.c
Next
From: Alena Rybakina
Date:
Subject: Re: Replace IN VALUES with ANY in WHERE clauses during optimization