Thread: Fix assertion in autovacuum worker

Fix assertion in autovacuum worker

From

David Geier

Date:

28 November 2023, 18:00:16

Hi hackers,

PostgreSQL hit the following assertion during error cleanup, after being 
OOM in dsa_allocate0():

void dshash_detach(dshash_table *hash_table) { 
ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table);

called from pgstat_shutdown_hook(), called from shmem_exit(), called 
from proc_exit(), called from the exception handler.

The partition locks got previously acquired by

AutoVacWorkerMain() pgstat_report_autovac() 
pgstat_get_entry_ref_locked() pgstat_get_entry_ref() 
dshash_find_or_insert() resize() resize() locks all partitions so the 
hash table can safely be resized. Then it calls dsa_allocate0(). If 
dsa_allocate0() fails to allocate, it errors out. The exception handler 
calls proc_exit() which normally calls LWLockReleaseAll() via 
AbortTransaction() but only if there's an active transaction. However, 
pgstat_report_autovac() runs before a transaction got started and hence 
LWLockReleaseAll() doesn't run before pgstat_shutdown_hook() is called.

See attached patch for an attempt to fix this issue.

-- 
David Geier
(ServiceNow)

Attachment

0001-Fix-autovacuum-cleanup-on-error.patch

Re: Fix assertion in autovacuum worker

From

Nathan Bossart

Date:

28 November 2023, 22:05:16

On Tue, Nov 28, 2023 at 07:00:16PM +0100, David Geier wrote:
> PostgreSQL hit the following assertion during error cleanup, after being OOM
> in dsa_allocate0():
> 
> void dshash_detach(dshash_table *hash_table) {
> ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table);
> 
> called from pgstat_shutdown_hook(), called from shmem_exit(), called from
> proc_exit(), called from the exception handler.

Nice find.

> AutoVacWorkerMain() pgstat_report_autovac() pgstat_get_entry_ref_locked()
> pgstat_get_entry_ref() dshash_find_or_insert() resize() resize() locks all
> partitions so the hash table can safely be resized. Then it calls
> dsa_allocate0(). If dsa_allocate0() fails to allocate, it errors out. The
> exception handler calls proc_exit() which normally calls LWLockReleaseAll()
> via AbortTransaction() but only if there's an active transaction. However,
> pgstat_report_autovac() runs before a transaction got started and hence
> LWLockReleaseAll() doesn't run before pgstat_shutdown_hook() is called.

From a glance, it looks to me like the problem is that pgstat_shutdown_hook
is registered as a before_shmem_exit callback, while ProcKill is registered
as an on_shmem_exit callback.  However, IIUC even moving them to the same
list wouldn't be sufficient because the pg_stat_shutdown_hook is registered
after ProcKill, and the code that calls the callbacks walks backwards
through the list.

I would expect your patch to fix this particular issue, but I'm wondering
whether there's a bigger problem here.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Fix assertion in autovacuum worker

From

Andres Freund

Date:

29 November 2023, 00:03:49

Hi,

On 2023-11-28 16:05:16 -0600, Nathan Bossart wrote:
> On Tue, Nov 28, 2023 at 07:00:16PM +0100, David Geier wrote:
> > PostgreSQL hit the following assertion during error cleanup, after being OOM
> > in dsa_allocate0():
> > 
> > void dshash_detach(dshash_table *hash_table) {
> > ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table);
> > 
> > called from pgstat_shutdown_hook(), called from shmem_exit(), called from
> > proc_exit(), called from the exception handler.
> 
> Nice find.

+1

> > AutoVacWorkerMain() pgstat_report_autovac() pgstat_get_entry_ref_locked()
> > pgstat_get_entry_ref() dshash_find_or_insert() resize() resize() locks all
> > partitions so the hash table can safely be resized. Then it calls
> > dsa_allocate0(). If dsa_allocate0() fails to allocate, it errors out. The
> > exception handler calls proc_exit() which normally calls LWLockReleaseAll()
> > via AbortTransaction() but only if there's an active transaction. However,
> > pgstat_report_autovac() runs before a transaction got started and hence
> > LWLockReleaseAll() doesn't run before pgstat_shutdown_hook() is called.
> 
> From a glance, it looks to me like the problem is that pgstat_shutdown_hook
> is registered as a before_shmem_exit callback, while ProcKill is registered
> as an on_shmem_exit callback.

That's required, as pgstat_shutdown_hook() needs to acquire lwlocks, which you
can't after ProcKill(). It's also not unique to pgstat, several other
before_shmem_exit() callbacks acquire lwlocks (e.g. AtProcExit_Twophase(),
ParallelWorkerShutdown(), do_pg_abort_backup(), the first three uses of
before_shmem_exit when git grepping) - which makes sense, they are presumably
before_shmem_exit() because they need to manage shared state, which often
needs locks.

In normal backends this is fine-ish, because ShutdownPostgres() is registered
very late (and thus is called early in the shutdown sequence), and the
AbortOutOfAnyTransaction() contained therein indirectly calls
LWLockReleaseAll() and very little happens outside of the transaction context.

> I would expect your patch to fix this particular issue, but I'm wondering
> whether there's a bigger problem here.

Yes, there is - our subsystem initialization, shutdown, error recovery
infrastructure is a mess.  We've interwoven transaction handling far too
tightly with error handling, the order of subystem initialization is basically
random and differs between operating systems (due to EXEC_BACKEND) and "mode"
of execution (the order differs when using single user mode) and we've
distributed error recovery into ~10 places (all the sigsetjmp()s in backend
code, xact.c and and a few other places like WalSndErrorCleanup()).

Greetings,

Andres Freund

Re: Fix assertion in autovacuum worker

From

Nathan Bossart

Date:

29 November 2023, 02:42:47

On Tue, Nov 28, 2023 at 04:03:49PM -0800, Andres Freund wrote:
> On 2023-11-28 16:05:16 -0600, Nathan Bossart wrote:
>> From a glance, it looks to me like the problem is that pgstat_shutdown_hook
>> is registered as a before_shmem_exit callback, while ProcKill is registered
>> as an on_shmem_exit callback.
> 
> That's required, as pgstat_shutdown_hook() needs to acquire lwlocks, which you
> can't after ProcKill(). It's also not unique to pgstat, several other
> before_shmem_exit() callbacks acquire lwlocks (e.g. AtProcExit_Twophase(),
> ParallelWorkerShutdown(), do_pg_abort_backup(), the first three uses of
> before_shmem_exit when git grepping) - which makes sense, they are presumably
> before_shmem_exit() because they need to manage shared state, which often
> needs locks.
> 
> In normal backends this is fine-ish, because ShutdownPostgres() is registered
> very late (and thus is called early in the shutdown sequence), and the
> AbortOutOfAnyTransaction() contained therein indirectly calls
> LWLockReleaseAll() and very little happens outside of the transaction context.

Right.  Perhaps we could add a LWLockReleaseAll() to
pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that
is still just a hack.

>> I would expect your patch to fix this particular issue, but I'm wondering
>> whether there's a bigger problem here.
> 
> Yes, there is - our subsystem initialization, shutdown, error recovery
> infrastructure is a mess.  We've interwoven transaction handling far too
> tightly with error handling, the order of subystem initialization is basically
> random and differs between operating systems (due to EXEC_BACKEND) and "mode"
> of execution (the order differs when using single user mode) and we've
> distributed error recovery into ~10 places (all the sigsetjmp()s in backend
> code, xact.c and and a few other places like WalSndErrorCleanup()).

:(

I do remember looking into uniting all the various sigsetjmp() calls
before.  That could be worth another try.  The rest will probably require
additional thought...

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Fix assertion in autovacuum worker

From

Andres Freund

Date:

29 November 2023, 02:48:59

Hi,

On 2023-11-28 20:42:47 -0600, Nathan Bossart wrote:
> On Tue, Nov 28, 2023 at 04:03:49PM -0800, Andres Freund wrote:
> > On 2023-11-28 16:05:16 -0600, Nathan Bossart wrote:
> >> From a glance, it looks to me like the problem is that pgstat_shutdown_hook
> >> is registered as a before_shmem_exit callback, while ProcKill is registered
> >> as an on_shmem_exit callback.
> > 
> > That's required, as pgstat_shutdown_hook() needs to acquire lwlocks, which you
> > can't after ProcKill(). It's also not unique to pgstat, several other
> > before_shmem_exit() callbacks acquire lwlocks (e.g. AtProcExit_Twophase(),
> > ParallelWorkerShutdown(), do_pg_abort_backup(), the first three uses of
> > before_shmem_exit when git grepping) - which makes sense, they are presumably
> > before_shmem_exit() because they need to manage shared state, which often
> > needs locks.
> > 
> > In normal backends this is fine-ish, because ShutdownPostgres() is registered
> > very late (and thus is called early in the shutdown sequence), and the
> > AbortOutOfAnyTransaction() contained therein indirectly calls
> > LWLockReleaseAll() and very little happens outside of the transaction context.
> 
> Right.  Perhaps we could add a LWLockReleaseAll() to
> pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that
> is still just a hack.

Yea, we'd need that in just about all before_shmem_exit() callbacks.  I could
see an argument for doing it in proc_exit_prepare(). While that'd be a fairly
gross layering violation, we already do reset a number a bunch of stuff in
there:
    /*
     * Forget any pending cancel or die requests; we're doing our best to
     * close up shop already.  Note that the signal handlers will not set
     * these flags again, now that proc_exit_inprogress is set.
     */
    InterruptPending = false;
    ProcDiePending = false;
    QueryCancelPending = false;
    InterruptHoldoffCount = 1;
    CritSectionCount = 0;


> >> I would expect your patch to fix this particular issue, but I'm wondering
> >> whether there's a bigger problem here.
> > 
> > Yes, there is - our subsystem initialization, shutdown, error recovery
> > infrastructure is a mess.  We've interwoven transaction handling far too
> > tightly with error handling, the order of subystem initialization is basically
> > random and differs between operating systems (due to EXEC_BACKEND) and "mode"
> > of execution (the order differs when using single user mode) and we've
> > distributed error recovery into ~10 places (all the sigsetjmp()s in backend
> > code, xact.c and and a few other places like WalSndErrorCleanup()).
> 
> :(
> 
> I do remember looking into uniting all the various sigsetjmp() calls
> before.  That could be worth another try.  The rest will probably require
> additional thought...

It'd definitely be worth some effort. I'm quite sure that we have a number of
hard to find bugs around this.

Greetings,

Andres Freund

Re: Fix assertion in autovacuum worker

From

Nathan Bossart

Date:

29 November 2023, 17:52:01

On Tue, Nov 28, 2023 at 06:48:59PM -0800, Andres Freund wrote:
> On 2023-11-28 20:42:47 -0600, Nathan Bossart wrote:
>> Right.  Perhaps we could add a LWLockReleaseAll() to
>> pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that
>> is still just a hack.
> 
> Yea, we'd need that in just about all before_shmem_exit() callbacks.  I could
> see an argument for doing it in proc_exit_prepare(). While that'd be a fairly
> gross layering violation, we already do reset a number a bunch of stuff in
> there:

Gross layering violations aside, that at least seems more future-proof
against other sigsetjmp() blocks that proc_exit() without doing any
preliminary cleanup.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: Fix assertion in autovacuum worker

From

Andres Freund

Date:

29 November 2023, 18:55:55

Hi,

On 2023-11-29 11:52:01 -0600, Nathan Bossart wrote:
> On Tue, Nov 28, 2023 at 06:48:59PM -0800, Andres Freund wrote:
> > On 2023-11-28 20:42:47 -0600, Nathan Bossart wrote:
> >> Right.  Perhaps we could add a LWLockReleaseAll() to
> >> pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that
> >> is still just a hack.
> >
> > Yea, we'd need that in just about all before_shmem_exit() callbacks.  I could
> > see an argument for doing it in proc_exit_prepare(). While that'd be a fairly
> > gross layering violation, we already do reset a number a bunch of stuff in
> > there:
>
> Gross layering violations aside, that at least seems more future-proof
> against other sigsetjmp() blocks that proc_exit() without doing any
> preliminary cleanup.

It's not just sigsetjmp() blocks not doing cleanup that are problematic -
consider what happens if there's either no sigsetjmp() block established (and
thus ERROR is promoted to FATAL), or if a FATAL error is raised. Then cleanup
in the sigsetjmp() site doesn't matter.  So we really need a better answer
here.

If we don't want to add LWLockReleaseAll() to proc_exit_prepare(), ISTM we
should all at least add some assertion infrastructure verifying it's being
called in relevant paths, triggering an assertion if there was no
LWLockReleaseAll() before reaching important before_shmem_exit() routines,
even if we don't actually hold an lwlock at the time of the error. Otherwise
problematic paths are way too hard to find.

Greetings,

Andres Freund