Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects - Mailing list pgsql-hackers

From Bryan Green
Subject Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects
Date
Msg-id 5e80ef94-fa47-4110-9e3a-7007a4ecef14@gmail.com
In response to Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On 11/6/2025 8:39 PM, Thomas Munro wrote:
> On Fri, Nov 7, 2025 at 3:13 AM Bryan Green <dbryan.green@gmail.com> wrote:
>> The reason to still do this patch and clean up the handle inheritance
>> mess is that there are states (suspended state, infinite loop, spinlock
>> hold, whatever) that a process can be in that keeps it from processing
>> the event.  We don't need to wait on the children to voluntarily exit
>> when postmaster crashes.
> 
> Agreed on all points.  We'd recently come to the same conclusion on this thread:
> 
> https://www.postgresql.org/message-id/flat/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru
> 

Thanks for the link - I'll review that thread. It's reassuring to see
independent analysis reaching the same conclusions.

> I think there might arguably be a sort of weak forward progress
> guarantee in the existing design and it's been a while since we've had
> problem reports AFAIR*: locks were released (which turns out to be
> fundamentally unsafe at least while in a critical section as analysed
> in that thread, but it does allow progress in blocked backends, so
> that they can learn of the postmaster's demise), and no one should
> enter WaitEventSet() while holding a spinlock, and infinite loops are
> against the law, and it's previously been considered acceptable-ish
> that a backend might continue to run a long query until completion
> before exiting (without supporting auxiliary or worker backends, which
> sounds potentially suspect, but at least you can't wait for another
> backend without learning of the postmaster's demise assuming the only
> possible waits are LWLocks or latches).  But clearly it's not good
> enough.
> 
> The fact that Windows backends are born in suspended state until the
> postmaster resumes them is indeed a new and significant hole in that
> theory.  Preemptive termination is the only thing that makes sense.
> 

Exactly. The suspended state is the key issue: even with perfect handle
inheritance hygiene, a backend that hasn't been resumed yet cannot
receive or process any event notification. The postmaster could crash
after CreateProcess() but before ResumeThread(), leaving an orphaned
process that will never wake up.

The handle inheritance problems I've been working on (socket handles,
O_CLOEXEC) are somewhat orthogonal - they cause *delayed* termination
rather than indefinite orphaning. But Job Objects solve both classes
of problems: they handle the suspended state case immediately, and
provide guaranteed cleanup regardless of handle state.
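
For illustration, here is a minimal sketch of the general Job Object
technique, not the code from the patch. The function names
create_kill_on_close_job() and spawn_child_in_job() are hypothetical;
the Win32 calls are the documented ones. It assumes the parent holds the
only handle to the job and that the handle is not inheritable, so the
kernel terminates every member (suspended or not) when the parent exits
or crashes.

```c
#include <stdbool.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>

/*
 * Create an anonymous job whose member processes are terminated by the
 * kernel when the last handle to the job is closed.  Passing NULL
 * security attributes makes the handle non-inheritable, so only the
 * creating process keeps the job alive.
 */
HANDLE
create_kill_on_close_job(void)
{
    JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits;
    HANDLE      job = CreateJobObject(NULL, NULL);

    if (job == NULL)
        return NULL;

    memset(&limits, 0, sizeof(limits));
    limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;
    if (!SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                 &limits, sizeof(limits)))
    {
        CloseHandle(job);
        return NULL;
    }
    return job;
}

/*
 * Start a child suspended, put it in the job, and only then resume it.
 * cmdline must be a writable buffer, as CreateProcess requires.
 */
bool
spawn_child_in_job(HANDLE job, char *cmdline, PROCESS_INFORMATION *pi)
{
    STARTUPINFOA si;

    memset(&si, 0, sizeof(si));
    si.cb = sizeof(si);

    if (!CreateProcessA(NULL, cmdline, NULL, NULL, FALSE,
                        CREATE_SUSPENDED, NULL, NULL, &si, pi))
        return false;

    if (!AssignProcessToJobObject(job, pi->hProcess) ||
        ResumeThread(pi->hThread) == (DWORD) -1)
    {
        /* Do not leave a suspended child behind on failure. */
        TerminateProcess(pi->hProcess, 1);
        CloseHandle(pi->hThread);
        CloseHandle(pi->hProcess);
        return false;
    }
    return true;
}
#else
/* Job Objects are Windows-only; this stub only lets the file compile elsewhere. */
void *
create_kill_on_close_job(void)
{
    return NULL;
}
#endif
```

Note that this sketch still has a short window between CreateProcess()
and AssignProcessToJobObject() in which a crash would leave a suspended
child outside the job. Placing the parent itself in the job, so that
children become members at creation, is one way to close that window.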

> *We used to have places that waited but forgot to handle PM exit, and
> I don't recall "manual orphan cleanup needed" reports since we
> enforced a central handler.  But see also my earlier note about
> systemd potentially hiding problems these days, if using "mixed" mode
> to SIGKILL the whole cgroup.

That's an interesting observation about systemd potentially masking
fragility. If Linux deployments are really just nuking the cgroup on
postmaster death, then the "voluntary exit" approach isn't actually
being tested in production at scale.

I'm looking forward to your broader subprocess management patchset.

-- 
Bryan Green
EDB: https://www.enterprisedb.com
