Re: Autovacuum worker doesn't immediately exit on postmaster death - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Autovacuum worker doesn't immediately exit on postmaster death
Date
Msg-id 20201030150707.GP16415@tamriel.snowman.net
Whole thread Raw
In response to Re: Autovacuum worker doesn't immediately exit on postmaster death  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On 2020-Oct-29, Stephen Frost wrote:
> >> I do think it'd be good to find a way to check every once in a while
> >> even when we aren't going to delay though.  Not sure what the best
> >> answer there is.
>
> > Maybe instead of thinking specifically in terms of vacuum, we could
> > count buffer accesses (read from kernel) and check the latch once every
> > 1000th such, or something like that.  Then a very long query doesn't
> > have to wait until it's run to completion.  The cost is one integer
> > addition per syscall, which should be bearable.
>
> I'm kind of unwilling to add any syscalls at all to normal execution
> code paths for this purpose.  People shouldn't be sig-kill'ing the
> postmaster, or if they do, cleaning up the mess is their responsibility.
> I'd also suggest that adding nearly-untestable code paths for this
> purpose is a fine way to add bugs we'll never catch.

Not sure if either is at all viable, but I had a couple of thoughts
about other ways to possibly address this.

The first simplistic idea is this- we have lots of processes that pick
up pretty quickly on the postmaster going away due to checking if it's
still around while waiting for something else to happen anyway (like the
autovacuum launcher...), and we have CFI's in a lot of places where it's
reasonable to do a CFI but isn't alright to check for postmaster death.
While it'd be better if there were more platforms where parent death
would send a signal to the children, that doesn't seem to be coming any
time soon- so why don't we do it ourselves?  That is, when we discover
that the postmaster has died, scan through the proc array (carefully,
since it could be garbage, but all we're looking for are the PIDs of
anything that might still be around) and try sending a signal to any
processes that are left?  Those signals would hopefully get delivered
and the other backends would discover the signal through CFI and exit
reasonably quickly.

The other thought I had was around trying to check for postmaster death
when we're about to do some I/O, which would probably catch a large
number of these cases too though technically some process might stick
around for a while if it's only dealing with things that are already in
shared buffers, I suppose.  Also seems complicated and expensive to do.

> The if-we're-going-to-delay-anyway path in vacuum_delay_point seems
> OK to add a touch more overhead to, though.

Yeah, this certainly seems reasonable to do too and on a well run system
would likely be enough 90+% of the time.

Thanks,

Stephen

Attachment

pgsql-hackers by date:

Previous
From: Georgios Kokolatos
Date:
Subject: Re: shared-memory based stats collector
Next
From: John Naylor
Date:
Subject: Re: cutting down the TODO list thread