Re: Primary and standby setting cross-checks - Mailing list pgsql-hackers

From: Noah Misch
Subject: Re: Primary and standby setting cross-checks
Date: 2024-09-25 03:03:45
Msg-id: 20240925030345.67.nmisch@google.com
In response to: Primary and standby setting cross-checks (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
On Thu, Aug 29, 2024 at 09:52:06PM +0300, Heikki Linnakangas wrote:
> Currently, if you configure a hot standby server with a smaller
> max_connections setting than the primary, the server refuses to start up:
> 
> LOG:  entering standby mode
> FATAL:  recovery aborted because of insufficient parameter settings
> DETAIL:  max_connections = 10 is a lower setting than on the primary server,
> where its value was 100.

> happen anyway:
> 
> 2024-08-29 21:44:32.634 EEST [668327] FATAL:  out of shared memory
> 2024-08-29 21:44:32.634 EEST [668327] HINT:  You might need to increase
> "max_locks_per_transaction".
> 2024-08-29 21:44:32.634 EEST [668327] CONTEXT:  WAL redo at 2/FD40FCC8 for
> Standby/LOCK: xid 996 db 5 rel 154045
> 2024-08-29 21:44:32.634 EEST [668327] WARNING:  you don't own a lock of type
> AccessExclusiveLock
> 2024-08-29 21:44:32.634 EEST [668327] LOG:  RecoveryLockHash contains entry
> for lock no longer recorded by lock manager: xid 996 database 5 relation
> 154045
> TRAP: failed Assert("false"), File: "../src/backend/storage/ipc/standby.c",

> Granted, if you restart the server, it will probably succeed because
> restarting the server will kill all the other queries that were holding
> locks. But yuck.

Agreed.

> So how to improve this? I see a few options:
> 
> a) Downgrade the error at startup to a warning, and allow starting the
> standby with smaller settings, at least with a smaller
> max_locks_per_transaction. The other settings also affect the size of the
> known-assigned XIDs array, but if the CSN snapshots get committed, that will
> get fixed. In most cases there is enough lock memory anyway, and it will be
> fine. Just fix the assertion failure so that the error message is a little
> nicer.
> 
> b) If you run out of lock space, kill running queries, and prevent new ones
> from starting. Track the locks in the startup process's private memory until
> there is enough space in the lock manager, and then re-open for queries. In
> essence, go from hot standby mode to warm standby, until it's possible to go
> back to hot standby mode again.

Either seems fine.  Having never encountered actual lock exhaustion from this,
I'd lean toward (a) for simplicity.
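
For concreteness, a rough and untested sketch of the shape (a) might take,
assuming the cross-check still funnels through something like
RecoveryRequiresIntParameter() in xlogrecovery.c.  The special-casing and
message wording are illustrative only, not proposed text, and this is a
fragment for that file, not standalone code:

/*
 * Illustrative only: downgrade the startup cross-check to a WARNING for the
 * one setting whose shortfall is survivable; keep FATAL for the rest.
 */
static void
RecoveryRequiresIntParameter(const char *param_name, int currValue, int minValue)
{
	if (currValue >= minValue)
		return;

	if (strcmp(param_name, "max_locks_per_transaction") == 0)
	{
		/* Survivable: warn and let recovery continue. */
		ereport(WARNING,
				(errmsg("%s = %d is a lower setting than on the primary server, where its value was %d",
						param_name, currValue, minValue),
				 errhint("Standby queries may be canceled if the lock table fills up.")));
		return;
	}

	/* Everything else keeps today's behavior. */
	ereport(FATAL,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("recovery aborted because of insufficient parameter settings"),
			 errdetail("%s = %d is a lower setting than on the primary server, where its value was %d.",
					   param_name, currValue, minValue)));
}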

> Thoughts, better ideas?

I worry about future code assuming a MaxBackends-sized array suffices for
something.  That could work almost all the time, breaking only when a standby
replays WAL from a server having a larger array.  What could we do now to
catch that future mistake promptly?  As a start, 027_stream_regress.pl could
use low settings on its standby.
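
To make that concrete, here is a hypothetical fragment of the kind of future
mistake I mean; none of these names exist in the tree, and the only point is
the sizing assumption:

/* Hypothetical: per-backend replay state sized by this standby's MaxBackends. */
typedef struct ReplayBackendState
{
	TransactionId xid;
	int			nheld_locks;
} ReplayBackendState;

static ReplayBackendState *replay_backend_state;	/* MaxBackends entries */

static void
replay_note_lock(int primary_backend_slot, TransactionId xid)
{
	/*
	 * Fine whenever primary_backend_slot < local MaxBackends, i.e. almost
	 * always -- but overruns the array as soon as we replay WAL from a
	 * primary configured with more backends than this standby.
	 */
	replay_backend_state[primary_backend_slot].xid = xid;
	replay_backend_state[primary_backend_slot].nheld_locks++;
}

A 027_stream_regress.pl standby configured with deliberately lower settings
than its primary, to whatever extent the cross-check (or a relaxed version of
it) permits, would give that kind of code a chance to fail in the buildfarm
rather than in the field.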


