Re: VM corruption on standby - Mailing list pgsql-hackers

From	Andres Freund
Subject	Re: VM corruption on standby
Date	August 19, 2025 18:24:27
Msg-id	o22gkxevs5c3ilid7czbo3idnmtv6aljczod37s2pi7gcnrbe4@bggjoejg6gy2
In response to	Re: VM corruption on standby (Thomas Munro <thomas.munro@gmail.com>)
Responses	Re: VM corruption on standby
List	pgsql-hackers

Hi,

On 2025-08-20 03:19:38 +1200, Thomas Munro wrote:
> On Wed, Aug 20, 2025 at 2:57 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote:
> > > > On Linux - the primary OS with OOM killer troubles - I'm pretty sure lwlock
> > > > waiters would get killed due to the postmaster death signal we've configured
> > > > (cf. PostmasterDeathSignalInit()).
> > >
> > > No, that has a handler that just sets a global variable.  That was
> > > done because recovery used to try to read() from the postmaster pipe
> > > after replaying every record.  Also we currently have some places that
> > > don't want to be summarily killed (off the top of my head, syncrep
> > > wants to send a special error message, and the logger wants to survive
> > > longer than everyone else to catch as much output as possible, things
> > > I've been thinking about in the context of threads).
> >
> > That makes no sense. We should just _exit(). If postmaster has been killed,
> > trying to stay up longer just makes everything more fragile. Waiting for the
> > logger is *exactly* what we should *not* do - what if the logger also crashed?
> > There's no postmaster around to start it.
> 
> Nobody is waiting for the logger.

Error messages that we might be printing will wait for the logger if the pipe
is full, no?
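
A minimal sketch of why a full pipe matters, assuming an ordinary POSIX pipe whose reader has stopped draining it. The function name fill_unread_pipe() is hypothetical and this is not PostgreSQL code. The write end is set to O_NONBLOCK only so the sketch can observe the point at which a write could no longer proceed; a blocking writer would simply stall at that same point.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write into a pipe that nobody reads until the kernel
 * refuses more data.  Returns the number of bytes accepted, or -1 on setup
 * failure.  *would_block is set to 1 if the final write failed with EAGAIN,
 * i.e. the point where a blocking writer would have been put to sleep.
 */
static long
fill_unread_pipe(int *would_block)
{
    int     fds[2];
    int     flags;
    char    chunk[4096];
    long    total = 0;

    *would_block = 0;
    if (pipe(fds) != 0)
        return -1;

    /* Non-blocking only so that we can detect "full" instead of hanging. */
    flags = fcntl(fds[1], F_GETFL);
    if (flags == -1 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) == -1)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    memset(chunk, 'x', sizeof(chunk));
    for (;;)
    {
        ssize_t n = write(fds[1], chunk, sizeof(chunk));

        if (n > 0)
        {
            total += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted, just retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            *would_block = 1;   /* pipe is full: nobody is draining it */
        break;
    }

    close(fds[0]);
    close(fds[1]);
    return total;
}
```

The number of bytes accepted depends on the platform's pipe capacity, so the sketch reports it rather than assuming a value.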


> The logger waits for everyone else to exit first to collect forensics:
> 
>      * Unlike all other postmaster child processes, we'll ignore postmaster
>      * death because we want to collect final log output from all backends and
>      * then exit last.  We'll do that by running until we see EOF on the
>      * syslog pipe, which implies that all other backends have exited
>      * (including the postmaster).
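
The "run until EOF" behaviour described in that comment can be sketched with a plain pipe and a few forked writers. This is only an illustration under assumed names (FINAL_MSG and collect_until_eof() are hypothetical), not the syslogger's actual code: the collector sees read() return 0 only after every process holding the write end has exited or closed it.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical "last words" each writer sends before exiting. */
static const char FINAL_MSG[] = "final log line\n";

/*
 * Fork nwriters children that each write FINAL_MSG into a shared pipe and
 * exit.  The parent plays the collector: it reads until EOF, which arrives
 * only once all write ends are closed.  Returns total bytes collected, or
 * -1 on error.
 */
static long
collect_until_eof(int nwriters)
{
    int     fds[2];
    char    buf[256];
    long    total = 0;
    int     i;

    if (pipe(fds) != 0)
        return -1;

    for (i = 0; i < nwriters; i++)
    {
        pid_t   pid = fork();

        if (pid < 0)
        {
            total = -1;
            break;
        }
        if (pid == 0)
        {
            ssize_t w;

            close(fds[0]);
            w = write(fds[1], FINAL_MSG, strlen(FINAL_MSG));
            _exit(w == (ssize_t) strlen(FINAL_MSG) ? 0 : 1);
        }
    }

    /* The collector must drop its own write end, or EOF never arrives. */
    close(fds[1]);

    while (total >= 0)
    {
        ssize_t n = read(fds[0], buf, sizeof(buf));

        if (n == 0)
            break;              /* EOF: every writer is gone */
        if (n < 0)
        {
            if (errno == EINTR)
                continue;
            total = -1;
            break;
        }
        total += n;
    }
    close(fds[0]);

    /* Reap the children so no zombies are left behind. */
    while (wait(NULL) > 0)
        ;

    return total;
}
```

Calling collect_until_eof(3) should return three times the message length once all three writers have exited.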

> The syncrep case is a bit weirder: it wants to tell the user that
> syncrep is broken, so its own WaitEventSetWait() has
> WL_POSTMASTER_DEATH, but that's basically bogus because the backend
> can reach WaitEventSetWait(WL_EXIT_ON_PM_DEATH) in many other code
> paths.  I've proposed nuking that before.

Yea, that's just bogus.


I think this is one more instance of "let's try hard to continue limping
along" making things way more fragile than the simpler "let's just do
crash-restart in the most normal way possible".

Greetings,

Andres Freund
