On Thu, Aug 29, 2024 at 09:52:06PM +0300, Heikki Linnakangas wrote:
> Currently, if you configure a hot standby server with a smaller
> max_connections setting than the primary, the server refuses to start up:
>
> LOG: entering standby mode
> FATAL: recovery aborted because of insufficient parameter settings
> DETAIL: max_connections = 10 is a lower setting than on the primary server,
> where its value was 100.
> However, the problem this check guards against can happen anyway:
>
> 2024-08-29 21:44:32.634 EEST [668327] FATAL: out of shared memory
> 2024-08-29 21:44:32.634 EEST [668327] HINT: You might need to increase
> "max_locks_per_transaction".
> 2024-08-29 21:44:32.634 EEST [668327] CONTEXT: WAL redo at 2/FD40FCC8 for
> Standby/LOCK: xid 996 db 5 rel 154045
> 2024-08-29 21:44:32.634 EEST [668327] WARNING: you don't own a lock of type
> AccessExclusiveLock
> 2024-08-29 21:44:32.634 EEST [668327] LOG: RecoveryLockHash contains entry
> for lock no longer recorded by lock manager: xid 996 database 5 relation
> 154045
> TRAP: failed Assert("false"), File: "../src/backend/storage/ipc/standby.c",
> Granted, if you restart the server, it will probably succeed because
> restarting the server will kill all the other queries that were holding
> locks. But yuck.
Agreed.
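To put rough numbers on it: if I have the sizing right, the shared lock
table holds about max_locks_per_transaction * (max_connections +
max_prepared_transactions) entries, so with stock settings on both sides
that is

    64 * (100 + 0) = 6400 lockable objects,

shared between redo of Standby/LOCK records and whatever the read-only
backends happen to hold. Matching the GUCs across primary and standby
never guaranteed that redo can't hit "out of shared memory"; it just made
it less likely.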
> So how to improve this? I see a few options:
>
> a) Downgrade the error at startup to a warning, and allow starting the
> standby with smaller settings than the primary, at least with a smaller
> max_locks_per_transaction. The other settings also affect the size of the
> known-assigned XIDs array, but if the CSN snapshots get committed, that
> will get fixed. In most cases there is enough lock memory anyway, and it
> will be fine. Just fix the assertion failure so that the error message is
> a little nicer.
>
> b) If you run out of lock space, kill running queries, and prevent new ones
> from starting. Track the locks in the startup process's private memory until
> there is enough space in the lock manager, and then re-open for queries. In
> essence, go from hot standby mode to warm standby, until it's possible to go
> back to hot standby mode again.
Either seems fine. Having never encountered actual lock exhaustion from this,
I'd lean toward (a) for simplicity.
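For (a), the startup-time part could be as small as turning the existing
cross-check's ereport(FATAL) into a warning. A rough sketch of the idea,
with the function name and message wording made up for illustration
rather than copied from the tree:

#include "postgres.h"

/*
 * Hypothetical replacement for the FATAL path quoted above: note the
 * discrepancy and keep going, leaving it to the lock manager to complain
 * if redo actually runs out of space later.
 */
static void
WarnAboutSmallerRecoveryParameter(const char *param_name,
                                  int currValue, int primaryValue)
{
    if (currValue < primaryValue)
        ereport(WARNING,
                (errmsg("%s = %d is a lower setting than on the primary server, where its value was %d",
                        param_name, currValue, primaryValue),
                 errhint("Hot standby queries may be canceled if the lock table fills up during WAL replay.")));
}

The assertion failure in standby.c would of course still need the
separate fix you mention, whichever way we go.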
> Thoughts, better ideas?
I worry about future code assuming a MaxBackends-sized array suffices for
something. That could work almost all the time, breaking only when a standby
replays WAL from a server having a larger array. What could we do now to
catch that future mistake promptly? As a start, 027_stream_regress.pl could
use low settings on its standby.
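Concretely, once (a) is in, the standby in that test could get overrides
along these lines (values invented here, just low enough to be smaller
than whatever the primary runs with while still letting the read-only
checks connect):

# hypothetical standby-only overrides for 027_stream_regress.pl
max_connections = 10
max_locks_per_transaction = 16

Anything that wrongly assumes the standby's arrays are at least as large
as the primary's should then start failing in that test rather than in
the field.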