Re: Weird failure with latches in curculio on v15 - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Weird failure with latches in curculio on v15
Msg-id CA+hUKGJmM+3BWUrF=CXQ8p7gaS9JcUBB3VZ_R5CT6H6iY+t92g@mail.gmail.com
In response to Re: Weird failure with latches in curculio on v15  (Nathan Bossart <nathandbossart@gmail.com>)
Responses Re: Weird failure with latches in curculio on v15
List pgsql-hackers
On Tue, Feb 21, 2023 at 5:50 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Tue, Feb 21, 2023 at 09:03:27AM +0900, Michael Paquier wrote:
> > Perhaps beginning a new thread with a patch and a summary would be
> > better at this stage?  Another thing I am wondering is if it could be
> > possible to test that rather reliably.  I have been playing with a few
> > scenarios like holding the system() call for a bit with hardcoded
> > sleep()s, without much success.  I'll try harder on that part..  It's
> > been mentioned as well that we could just move away from system() in
> > the long-term.
>
> I'm happy to create a new thread if needed, but I can't tell if there is
> any interest in this stopgap/back-branch fix.  Perhaps we should just jump
> straight to the long-term fix that Thomas is looking into.

Unfortunately the latch-friendly subprocess module proposal I was
talking about would be for 17.  I may post a thread fairly soon with
design ideas + list of problems and decision points as I see them, and
hopefully some sketch code.  It won't be a proposal for [/me checks
calendar] next week's commitfest, though, and probably wouldn't be
appropriate in a final commitfest anyway.  I also have some other
existing stuff to clear first, so please do continue with the stopgap
ideas.

BTW, here's an idea (untested) about how to reproduce the problem.  You
could copy the source from a system() implementation, call it
doomed_system(), and insert kill(-getppid(), SIGQUIT) in between
sigprocmask(SIG_SETMASK, &omask, NULL) and exec*().  The parent and the
child itself will then both handle the signal and both reach
proc_exit().

The systems that failed are running code like this:

https://github.com/openbsd/src/blob/master/lib/libc/stdlib/system.c
https://github.com/DragonFlyBSD/DragonFlyBSD/blob/master/lib/libc/stdlib/system.c

I'm pretty sure these other implementations could fail in just the
same way (they restore the handler before unblocking, so can run it
just before exec() replaces the image):

https://github.com/freebsd/freebsd-src/blob/main/lib/libc/stdlib/system.c
https://github.com/lattera/glibc/blob/master/sysdeps/posix/system.c

The glibc one is a bit busier and, huh, has a lock (I guess maybe
deadlockable if proc_exit() also calls system(), but hopefully it
doesn't), and uses fork() instead of vfork(), but I don't think that's
a material difference here (with fork(), parent and child run
concurrently, while with vfork() the parent is suspended until the
child exits or execs, and then processes its pending signals).


