Re: Autovacuum worker doesn't immediately exit on postmaster death - Mailing list pgsql-hackers

From	Stephen Frost
Subject	Re: Autovacuum worker doesn't immediately exit on postmaster death
Date	October 30, 2020 15:07:07
Msg-id	20201030150707.GP16415@tamriel.snowman.net
In response to	Re: Autovacuum worker doesn't immediately exit on postmaster death (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers

Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > On 2020-Oct-29, Stephen Frost wrote:
> >> I do think it'd be good to find a way to check every once in a while
> >> even when we aren't going to delay though.  Not sure what the best
> >> answer there is.
>
> > Maybe instead of thinking specifically in terms of vacuum, we could
> > count buffer accesses (read from kernel) and check the latch once every
> > 1000th such, or something like that.  Then a very long query doesn't
> > have to wait until it's run to completion.  The cost is one integer
> > addition per syscall, which should be bearable.
>
> I'm kind of unwilling to add any syscalls at all to normal execution
> code paths for this purpose.  People shouldn't be sig-kill'ing the
> postmaster, or if they do, cleaning up the mess is their responsibility.
> I'd also suggest that adding nearly-untestable code paths for this
> purpose is a fine way to add bugs we'll never catch.

Not sure if either is at all viable, but I had a couple of thoughts
about other ways to possibly address this.

The first, simplistic idea is this: we have lots of processes that
notice pretty quickly when the postmaster goes away, because they check
whether it's still around while waiting for something else to happen
anyway (like the autovacuum launcher...), and we have CFIs
(CHECK_FOR_INTERRUPTS) in a lot of places where it's reasonable to do a
CFI but isn't alright to check for postmaster death.  While it'd be
better if there were more platforms where parent death would send a
signal to the children, that doesn't seem to be coming any time soon, so
why don't we do it ourselves?  That is, when a process discovers that
the postmaster has died, it scans through the proc array (carefully,
since it could be garbage, but all we're looking for are the PIDs of
anything that might still be around) and tries sending a signal to any
processes that are left.  Those signals would hopefully get delivered,
and the other backends would discover the signal through CFI and exit
reasonably quickly.
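
The scan-and-signal idea could be sketched roughly as below.  This is a
minimal standalone C illustration under stated assumptions, not
PostgreSQL code: pid_is_plausible() and signal_surviving_siblings() are
hypothetical names, and the plain pid_t array stands in for whatever
PIDs would be read out of the proc array.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical filter for PIDs read from memory that may be garbage.
 * kill() treats 0 and negative values as "a process group" or "everything
 * we may signal", and PID 1 is init, so none of those may be passed on.
 */
static bool
pid_is_plausible(pid_t pid, pid_t self)
{
    return pid > 1 && pid != self;
}

/*
 * Hypothetical helper: send signo to every plausible PID in the array.
 * Returns how many kill() calls succeeded.  Failures such as ESRCH
 * (process already gone) or EPERM (not our process) are ignored, since
 * the array contents are not trusted.
 */
static int
signal_surviving_siblings(const pid_t *pids, size_t npids, int signo)
{
    pid_t   self = getpid();
    int     sent = 0;

    assert(pids != NULL || npids == 0);

    for (size_t i = 0; i < npids; i++)
    {
        if (!pid_is_plausible(pids[i], self))
            continue;
        if (kill(pids[i], signo) == 0)
            sent++;
    }
    return sent;
}
```

A process that notices the parent is gone would call something like
signal_surviving_siblings(pids, npids, SIGTERM) once and then exit
itself.  The filter only guards against the dangerous special values; a
stale but valid-looking PID could still belong to an unrelated process,
which is part of why the scan has to be done carefully.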

The other thought I had was to check for postmaster death when we're
about to do some I/O.  That would probably catch a large number of these
cases too, though technically a process might stick around for a while
if it's only dealing with things that are already in shared buffers, I
suppose.  It also seems complicated and expensive to do.
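
One way to picture the check-before-I/O idea, combined with the
once-every-1000th counting suggested upthread to keep the cost down, is
the sketch below.  Everything in it is an assumption made for
illustration: the names are hypothetical, and it detects parent death by
comparing getppid() against a remembered value rather than by any
mechanism PostgreSQL actually uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical interval: one parent check per this many I/O requests. */
#define PARENT_CHECK_INTERVAL 1000

static pid_t        original_parent;
static bool         parent_recorded = false;
static unsigned int io_count = 0;

/* Call once at process start, while the parent is known to be alive. */
static void
remember_parent(void)
{
    original_parent = getppid();
    parent_recorded = true;
    io_count = 0;
}

/* After the parent dies, the child is reparented and getppid() changes. */
static bool
parent_is_alive(void)
{
    assert(parent_recorded);
    return getppid() == original_parent;
}

/*
 * Call before each I/O request.  Costs one integer increment on most
 * calls and one getppid() syscall on every PARENT_CHECK_INTERVAL-th call.
 * Returns false when the caller should stop and exit.
 */
static bool
io_may_proceed(void)
{
    if (++io_count < PARENT_CHECK_INTERVAL)
        return true;
    io_count = 0;
    return parent_is_alive();
}
```

As noted above, a process that never leaves shared buffers would never
reach this check, so it narrows the window rather than closing it.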

> The if-we're-going-to-delay-anyway path in vacuum_delay_point seems
> OK to add a touch more overhead to, though.

Yeah, this certainly seems reasonable to do too, and on a well-run
system it would likely be enough 90+% of the time.

Thanks,

Stephen
