Re: Autovacuum daemon terminated by signal 11 - Mailing list pgsql-general

From Tom Lane
Subject Re: Autovacuum daemon terminated by signal 11
Msg-id 15221.1232149389@sss.pgh.pa.us
In response to Re: Autovacuum daemon terminated by signal 11  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Autovacuum daemon terminated by signal 11  (Alvaro Herrera <alvherre@commandprompt.com>)
List pgsql-general
I wrote:
> ... and you've seemingly not managed to install the debug symbols where
> gdb can find them.

But never mind that --- it turns out to be trivial to reproduce the
crash.  Just create a database, set its datfrozenxid and datvacuumxid
far in the past (via a manual update of pg_database), enable autovacuum,
and wait a bit.

What is happening is that autovacuum_do_vac_analyze contains

    old_cxt = MemoryContextSwitchTo(AutovacMemCxt);
    ...
    vacuum(vacstmt, relids);
    ...
    MemoryContextSwitchTo(old_cxt);

and at the time it is called by process_whole_db, CurrentMemoryContext
points at TopTransactionContext, which gets destroyed because vacuum()
internally finishes that transaction and starts a new one.  When we
come out of vacuum(), CurrentMemoryContext again points at
TopTransactionContext, but *it's not the same one*.  The closing
MemoryContextSwitchTo installs the stale pointer saved in old_cxt, which
then remains active into CommitTransaction.  It's a wonder this code
ever works.
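
To make the failure mode concrete, here is a minimal toy model of it.
It is not PostgreSQL source: ToyContext, toy_switch_to, toy_vacuum and
toy_buggy_do_vac_analyze are hypothetical names, and "destroying" a
context only clears a flag so the stale pointer can be inspected safely.
The only property borrowed from the real code is that the switch
function installs the new context and returns the previous one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a memory context; NOT PostgreSQL's real struct. */
typedef struct ToyContext
{
    const char *name;
    bool        alive;      /* false once the context has been "destroyed" */
} ToyContext;

static ToyContext txn_slots[2] = {
    {"TopTransactionContext (first transaction)", true},
    {"TopTransactionContext (second transaction)", false},
};
static ToyContext autovac_cxt = {"AutovacMemCxt", true};

static ToyContext *top_transaction_context = &txn_slots[0];
static ToyContext *current_context = &txn_slots[0];

/* Install cxt as the current context and return the previous one. */
static ToyContext *
toy_switch_to(ToyContext *cxt)
{
    ToyContext *old = current_context;

    assert(cxt != NULL);
    current_context = cxt;
    return old;
}

/* Models vacuum() finishing the caller's transaction and starting a new one. */
static void
toy_vacuum(void)
{
    txn_slots[0].alive = false;     /* old transaction context goes away */
    txn_slots[1].alive = true;      /* a new instance takes its place */
    top_transaction_context = &txn_slots[1];
    current_context = top_transaction_context;
}

/* Same save/restore shape as quoted above; returns whether the context
 * left selected at exit is still alive. */
static bool
toy_buggy_do_vac_analyze(void)
{
    ToyContext *old_cxt = toy_switch_to(&autovac_cxt);

    toy_vacuum();
    toy_switch_to(old_cxt);         /* reinstalls the destroyed context */
    return current_context->alive;
}
```

In this model toy_buggy_do_vac_analyze() returns false: the context
left selected is the first transaction's, while
top_transaction_context already names the second.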

The other path through do_autovacuum() escapes this fate because it
enters autovacuum_do_vac_analyze with CurrentMemoryContext pointing
at AutovacMemCxt, which isn't going to go away.

I argue that autovacuum_do_vac_analyze shouldn't attempt to restore the
caller's memory context at all.  One possible approach is to make it
re-select AutovacMemCxt at exit, but I wonder if we shouldn't define
its entry and exit conditions as current context being
(the current instance of) TopTransactionContext.
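
The two exit conventions can be sketched in a toy model as follows.
This is an illustration under assumptions, not a patch: SketchContext
and every sk_* name are hypothetical, and sk_vacuum merely imitates
vacuum() replacing the transaction context.  The point is that neither
variant saves the caller's context pointer across the call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a memory context; NOT PostgreSQL's real struct. */
typedef struct SketchContext
{
    const char *name;
    bool        alive;
} SketchContext;

static SketchContext sk_txn[2] = {
    {"TopTransactionContext (instance A)", true},
    {"TopTransactionContext (instance B)", false},
};
static SketchContext sk_autovac = {"AutovacMemCxt", true};

static SketchContext *sk_top_txn = &sk_txn[0];
static SketchContext *sk_current = &sk_txn[0];

/* Select a context, refusing one that has already been destroyed. */
static void
sk_switch_to(SketchContext *cxt)
{
    assert(cxt != NULL && cxt->alive);
    sk_current = cxt;
}

/* Imitates vacuum(): ends the current transaction and starts a new one,
 * so the transaction context is replaced by a fresh instance. */
static void
sk_vacuum(void)
{
    int next = (sk_top_txn == &sk_txn[0]) ? 1 : 0;

    sk_top_txn->alive = false;
    sk_txn[next].alive = true;
    sk_top_txn = &sk_txn[next];
    sk_current = sk_top_txn;
}

/* First approach: re-select the long-lived context at exit. */
static void
sk_do_vac_analyze_reselect(void)
{
    sk_switch_to(&sk_autovac);
    sk_vacuum();
    sk_switch_to(&sk_autovac);
}

/* Second approach: exit with the current instance of the transaction
 * context selected, looked up after vacuum rather than saved before. */
static void
sk_do_vac_analyze_top_txn(void)
{
    sk_switch_to(&sk_autovac);
    sk_vacuum();
    sk_switch_to(sk_top_txn);
}
```

Either way the context selected on return is one that still exists; the
variants differ only in what the caller is promised at exit.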

It looks like 8.3 and HEAD take the latter approach and are therefore
safe from this bug.  8.2 seems to escape it also because it doesn't have
process_whole_db anymore, but it's certainly not
autovacuum_do_vac_analyze's fault that it's not broken, because it's still
trying to restore a context that it has no right to assume still exists.

Alvaro, you want to take charge of fixing this?

            regards, tom lane
