Thread: Fix assertion in autovacuum worker
Hi hackers, PostgreSQL hit the following assertion during error cleanup, after being OOM in dsa_allocate0(): void dshash_detach(dshash_table *hash_table) { ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table); called from pgstat_shutdown_hook(), called from shmem_exit(), called from proc_exit(), called from the exception handler. The partition locks got previously acquired by AutoVacWorkerMain() pgstat_report_autovac() pgstat_get_entry_ref_locked() pgstat_get_entry_ref() dshash_find_or_insert() resize() resize() locks all partitions so the hash table can safely be resized. Then it calls dsa_allocate0(). If dsa_allocate0() fails to allocate, it errors out. The exception handler calls proc_exit() which normally calls LWLockReleaseAll() via AbortTransaction() but only if there's an active transaction. However, pgstat_report_autovac() runs before a transaction got started and hence LWLockReleaseAll() doesn't run before pgstat_shutdown_hook() is called. See attached patch for an attempt to fix this issue. -- David Geier (ServiceNow)
Attachment
On Tue, Nov 28, 2023 at 07:00:16PM +0100, David Geier wrote: > PostgreSQL hit the following assertion during error cleanup, after being OOM > in dsa_allocate0(): > > void dshash_detach(dshash_table *hash_table) { > ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table); > > called from pgstat_shutdown_hook(), called from shmem_exit(), called from > proc_exit(), called from the exception handler. Nice find. > AutoVacWorkerMain() pgstat_report_autovac() pgstat_get_entry_ref_locked() > pgstat_get_entry_ref() dshash_find_or_insert() resize() resize() locks all > partitions so the hash table can safely be resized. Then it calls > dsa_allocate0(). If dsa_allocate0() fails to allocate, it errors out. The > exception handler calls proc_exit() which normally calls LWLockReleaseAll() > via AbortTransaction() but only if there's an active transaction. However, > pgstat_report_autovac() runs before a transaction got started and hence > LWLockReleaseAll() doesn't run before pgstat_shutdown_hook() is called. From a glance, it looks to me like the problem is that pgstat_shutdown_hook is registered as a before_shmem_exit callback, while ProcKill is registered as an on_shmem_exit callback. However, IIUC even moving them to the same list wouldn't be sufficient because the pg_stat_shutdown_hook is registered after ProcKill, and the code that calls the callbacks walks backwards through the list. I would expect your patch to fix this particular issue, but I'm wondering whether there's a bigger problem here. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2023-11-28 16:05:16 -0600, Nathan Bossart wrote: > On Tue, Nov 28, 2023 at 07:00:16PM +0100, David Geier wrote: > > PostgreSQL hit the following assertion during error cleanup, after being OOM > > in dsa_allocate0(): > > > > void dshash_detach(dshash_table *hash_table) { > > ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table); > > > > called from pgstat_shutdown_hook(), called from shmem_exit(), called from > > proc_exit(), called from the exception handler. > > Nice find. +1 > > AutoVacWorkerMain() pgstat_report_autovac() pgstat_get_entry_ref_locked() > > pgstat_get_entry_ref() dshash_find_or_insert() resize() resize() locks all > > partitions so the hash table can safely be resized. Then it calls > > dsa_allocate0(). If dsa_allocate0() fails to allocate, it errors out. The > > exception handler calls proc_exit() which normally calls LWLockReleaseAll() > > via AbortTransaction() but only if there's an active transaction. However, > > pgstat_report_autovac() runs before a transaction got started and hence > > LWLockReleaseAll() doesn't run before pgstat_shutdown_hook() is called. > > From a glance, it looks to me like the problem is that pgstat_shutdown_hook > is registered as a before_shmem_exit callback, while ProcKill is registered > as an on_shmem_exit callback. That's required, as pgstat_shutdown_hook() needs to acquire lwlocks, which you can't after ProcKill(). It's also not unique to pgstat, several other before_shmem_exit() callbacks acquire lwlocks (e.g. AtProcExit_Twophase(), ParallelWorkerShutdown(), do_pg_abort_backup(), the first three uses of before_shmem_exit when git grepping) - which makes sense, they are presumably before_shmem_exit() because they need to manage shared state, which often needs locks. In normal backends this is fine-ish, because ShutdownPostgres() is registered very late (and thus is called early in the shutdown sequence), and the AbortOutOfAnyTransaction() contained therein indirectly calls LWLockReleaseAll() and very little happens outside of the transaction context. > I would expect your patch to fix this particular issue, but I'm wondering > whether there's a bigger problem here. Yes, there is - our subsystem initialization, shutdown, error recovery infrastructure is a mess. We've interwoven transaction handling far too tightly with error handling, the order of subystem initialization is basically random and differs between operating systems (due to EXEC_BACKEND) and "mode" of execution (the order differs when using single user mode) and we've distributed error recovery into ~10 places (all the sigsetjmp()s in backend code, xact.c and and a few other places like WalSndErrorCleanup()). Greetings, Andres Freund
On Tue, Nov 28, 2023 at 04:03:49PM -0800, Andres Freund wrote: > On 2023-11-28 16:05:16 -0600, Nathan Bossart wrote: >> From a glance, it looks to me like the problem is that pgstat_shutdown_hook >> is registered as a before_shmem_exit callback, while ProcKill is registered >> as an on_shmem_exit callback. > > That's required, as pgstat_shutdown_hook() needs to acquire lwlocks, which you > can't after ProcKill(). It's also not unique to pgstat, several other > before_shmem_exit() callbacks acquire lwlocks (e.g. AtProcExit_Twophase(), > ParallelWorkerShutdown(), do_pg_abort_backup(), the first three uses of > before_shmem_exit when git grepping) - which makes sense, they are presumably > before_shmem_exit() because they need to manage shared state, which often > needs locks. > > In normal backends this is fine-ish, because ShutdownPostgres() is registered > very late (and thus is called early in the shutdown sequence), and the > AbortOutOfAnyTransaction() contained therein indirectly calls > LWLockReleaseAll() and very little happens outside of the transaction context. Right. Perhaps we could add a LWLockReleaseAll() to pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that is still just a hack. >> I would expect your patch to fix this particular issue, but I'm wondering >> whether there's a bigger problem here. > > Yes, there is - our subsystem initialization, shutdown, error recovery > infrastructure is a mess. We've interwoven transaction handling far too > tightly with error handling, the order of subystem initialization is basically > random and differs between operating systems (due to EXEC_BACKEND) and "mode" > of execution (the order differs when using single user mode) and we've > distributed error recovery into ~10 places (all the sigsetjmp()s in backend > code, xact.c and and a few other places like WalSndErrorCleanup()). :( I do remember looking into uniting all the various sigsetjmp() calls before. That could be worth another try. The rest will probably require additional thought... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2023-11-28 20:42:47 -0600, Nathan Bossart wrote: > On Tue, Nov 28, 2023 at 04:03:49PM -0800, Andres Freund wrote: > > On 2023-11-28 16:05:16 -0600, Nathan Bossart wrote: > >> From a glance, it looks to me like the problem is that pgstat_shutdown_hook > >> is registered as a before_shmem_exit callback, while ProcKill is registered > >> as an on_shmem_exit callback. > > > > That's required, as pgstat_shutdown_hook() needs to acquire lwlocks, which you > > can't after ProcKill(). It's also not unique to pgstat, several other > > before_shmem_exit() callbacks acquire lwlocks (e.g. AtProcExit_Twophase(), > > ParallelWorkerShutdown(), do_pg_abort_backup(), the first three uses of > > before_shmem_exit when git grepping) - which makes sense, they are presumably > > before_shmem_exit() because they need to manage shared state, which often > > needs locks. > > > > In normal backends this is fine-ish, because ShutdownPostgres() is registered > > very late (and thus is called early in the shutdown sequence), and the > > AbortOutOfAnyTransaction() contained therein indirectly calls > > LWLockReleaseAll() and very little happens outside of the transaction context. > > Right. Perhaps we could add a LWLockReleaseAll() to > pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that > is still just a hack. Yea, we'd need that in just about all before_shmem_exit() callbacks. I could see an argument for doing it in proc_exit_prepare(). While that'd be a fairly gross layering violation, we already do reset a number a bunch of stuff in there: /* * Forget any pending cancel or die requests; we're doing our best to * close up shop already. Note that the signal handlers will not set * these flags again, now that proc_exit_inprogress is set. */ InterruptPending = false; ProcDiePending = false; QueryCancelPending = false; InterruptHoldoffCount = 1; CritSectionCount = 0; > >> I would expect your patch to fix this particular issue, but I'm wondering > >> whether there's a bigger problem here. > > > > Yes, there is - our subsystem initialization, shutdown, error recovery > > infrastructure is a mess. We've interwoven transaction handling far too > > tightly with error handling, the order of subystem initialization is basically > > random and differs between operating systems (due to EXEC_BACKEND) and "mode" > > of execution (the order differs when using single user mode) and we've > > distributed error recovery into ~10 places (all the sigsetjmp()s in backend > > code, xact.c and and a few other places like WalSndErrorCleanup()). > > :( > > I do remember looking into uniting all the various sigsetjmp() calls > before. That could be worth another try. The rest will probably require > additional thought... It'd definitely be worth some effort. I'm quite sure that we have a number of hard to find bugs around this. Greetings, Andres Freund
On Tue, Nov 28, 2023 at 06:48:59PM -0800, Andres Freund wrote: > On 2023-11-28 20:42:47 -0600, Nathan Bossart wrote: >> Right. Perhaps we could add a LWLockReleaseAll() to >> pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that >> is still just a hack. > > Yea, we'd need that in just about all before_shmem_exit() callbacks. I could > see an argument for doing it in proc_exit_prepare(). While that'd be a fairly > gross layering violation, we already do reset a number a bunch of stuff in > there: Gross layering violations aside, that at least seems more future-proof against other sigsetjmp() blocks that proc_exit() without doing any preliminary cleanup. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Hi, On 2023-11-29 11:52:01 -0600, Nathan Bossart wrote: > On Tue, Nov 28, 2023 at 06:48:59PM -0800, Andres Freund wrote: > > On 2023-11-28 20:42:47 -0600, Nathan Bossart wrote: > >> Right. Perhaps we could add a LWLockReleaseAll() to > >> pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that > >> is still just a hack. > > > > Yea, we'd need that in just about all before_shmem_exit() callbacks. I could > > see an argument for doing it in proc_exit_prepare(). While that'd be a fairly > > gross layering violation, we already do reset a number a bunch of stuff in > > there: > > Gross layering violations aside, that at least seems more future-proof > against other sigsetjmp() blocks that proc_exit() without doing any > preliminary cleanup. It's not just sigsetjmp() blocks not doing cleanup that are problematic - consider what happens if there's either no sigsetjmp() block established (and thus ERROR is promoted to FATAL), or if a FATAL error is raised. Then cleanup in the sigsetjmp() site doesn't matter. So we really need a better answer here. If we don't want to add LWLockReleaseAll() to proc_exit_prepare(), ISTM we should all at least add some assertion infrastructure verifying it's being called in relevant paths, triggering an assertion if there was no LWLockReleaseAll() before reaching important before_shmem_exit() routines, even if we don't actually hold an lwlock at the time of the error. Otherwise problematic paths are way too hard to find. Greetings, Andres Freund