Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows - Mailing list pgsql-hackers

From Bryan Green
Subject Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
Date
Msg-id e4025275-0f97-4a3e-b107-a85e60ccf0f7@gmail.com
Whole thread Raw
In response to Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows  (vignesh C <vignesh21@gmail.com>)
Responses Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows
List pgsql-hackers
On 3/26/2025 3:18 AM, vignesh C wrote:
> On Sat, 20 Jul 2024 at 00:03, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Tue, Jul 16, 2024 at 8:04 AM Yasir <yasir.hussain.shah@gmail.com> wrote:
>>> Please ignore the above 4 lines in my review. See my comments in blue.
>>
>> OK, so I think it's unclear what the next steps are for this patch.
>>
>> 1. On June 3rd, Michael Paquier said that Tom Lane proposed that,
>> after doing what the patch currently does, we could simplify some
>> other stuff. The details are unclear, and Tom hasn't commented.
>>
>> 2. On June 29th, Noah Misch proposed a platform-independent way of
>> solving the problem.
>>
>> 3. On July 12th, Sutou Kouhei proposed using CreateProcess() to start
>> the postmaster instead of cmd.exe.
>>
>> 4. On July 16th, Yasir Shah said that he tested the patch and found
>> that the problem only exists in v17, not any prior release, which is
>> contrary to my understanding of the situation. He also proposed a
>> minor tweak to the patch itself.
>>
>> So, as I see it, we have three possible ways forward here. First, we
>> could stick with the current patch, possibly with further work as per
>> [1] or adjustments as per [4]. Second, we could abandon the current
>> approach and adopt Noah's proposal in [2]. Third, we could possibly
>> abandon the current approach and adopt Sutou's proposal in [3]. I say
>> "possibly" because I can't personally assess whether this approach is
>> feasible.
> 
> Thank you very much, Robert, for summarizing this. If anyone has
> suggestions on which approach might work best, please share them to
> help move this discussion forward.
> 
> Regards,
> Vignesh
> 
> 
Hello everyone,

I've spent the last few days implementing Sutou-san's CreateProcess() 
proposal as a proof-of-concept. With all three approaches now available 
for comparison, I wanted to share my assessment and recommendation.

The core issue, as we've established, is that cmd.exe intermediation 
prevents pg_ctl from getting the real postgres.exe PID, creating race 
conditions where we can't distinguish a newly-started postmaster from a 
pre-existing one. This causes pg_ctl start to incorrectly report success 
when attempting to start an already-running cluster.

Horiguchi-san's process tree walking patch works and is ready to ship. 
It has real merit as a minimal-change solution. However, it keeps 
cmd.exe and adds process enumeration complexity to work around that 
decision. We're fixing symptoms rather than the root cause.

Noah's token proposal is architecturally elegant and 
platform-independent. My concern is that it solves a Windows-specific 
problem by adding complexity to all platforms. Making other platforms 
adopt token-passing infrastructure for Windows's problems feels wrong. 
It also requires postmaster changes for what is fundamentally a pg_ctl 
issue, raising compatibility questions.

The CreateProcess() approach eliminates cmd.exe entirely and gives us 
the actual postgres.exe PID directly. I've implemented this in the 
attached patch (about 250 lines of Windows-specific code). The 
implementation uses CreateProcessAsUser() with a restricted security 
token to handle the case where pg_ctl runs with elevated privileges—this 
drops Administrator group membership and dangerous privileges before 
launching the postmaster, matching the behavior of the existing 
CreateRestrictedProcess() function.

For I/O redirection, we use Windows's native handle-based APIs. When a 
log file is specified, we open it with CreateFile() and pass it through 
STARTUPINFO. When there's no log file (interactive use and postgres -C 
queries), we inherit standard handles. One detail worth noting: we wait 
2 seconds for no-log-file launches to distinguish postgres -C (which 
exits immediately) from actual server startup. This is long enough to 
catch quick exits even on loaded CI systems without impacting test suite 
performance.

The implementation passes all regression tests on Windows, including CI 
environments with elevated privileges. It handles all the problem 
scenarios correctly: normal startup, detecting already-running clusters, 
postgres -C queries, pg_upgrade with multiple instances, and concurrent 
start attempts.

I prefer the CreateProcess() approach. It fixes the root cause, not the 
symptoms. We get the real PID immediately with no process tree walking 
or PID verification complexity. It's a Windows-specific solution to a 
Windows-specific problem.

If the CreateProcess() approach is deemed unacceptable, I would support 
Horiguchi-san's process tree walking patch as a pragmatic fallback. It 
fixes the bug and is tested. However, I believe we'd be choosing a 
workaround over a proper fix and adding maintenance burden we don't need.

I'm opposed to the token approach for the reasons stated above—it's 
architecturally appealing but touches all platforms for a Windows-only 
problem.

With this implementation complete, we now have all three options on the 
table for evaluation. I believe the CreateProcess() approach is 
technically sound and the right path forward, but I'm open to discussion 
and happy to address any concerns about the implementation.

Regardless of which way we choose to go, I'll happily help review and 
test the solution.

Best regards,
BG
Attachment

pgsql-hackers by date:

Previous
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: Logical Replication of sequences
Next
From: "Matt Smith (matts3)"
Date:
Subject: Re: Meson install warnings when running postgres build from a sandbox