Thread: BUG #16154: pg_ctl restart with a logfile fails sometimes (on Windows)

BUG #16154: pg_ctl restart with a logfile fails sometimes (on Windows)

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16154
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 12.1
Operating system:   Windows
Description:

When performing regression tests on Windows intermittent failures are
observed, e.g. in src/bin/pg_basebackup test:
vcregress taptest src/bin/pg_basebackup
...
t/010_pg_basebackup.pl ... 10/106 Bailout called.  Further testing stopped:
system pg_ctl failed
FAILED--Further testing stopped: system pg_ctl failed

regress_log_010_pg_basebackup.log contains:
...
ok 13 - backup directory was created and left behind
### Restarting node "main"
# Running: pg_ctl -D
C:/src/postgresql/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_main_data/pgdata
-l
C:/src/postgresql/src/bin/pg_basebackup/tmp_check/log/010_pg_basebackup_main.log
restart
waiting for server to shut down.... done
server stopped
waiting for server to start....The process cannot access the file because it
is being used by another process.
 stopped waiting
pg_ctl: could not start server
Examine the log output.
Bail out!  system pg_ctl failed

The issue is caused by sporadic "pg_ctl ... restart -l logfile" failures.


Re: BUG #16154: pg_ctl restart with a logfile fails sometimes (onWindows)

From
Alexander Lakhin
Date:
06.12.2019 10:00, PG Bug reporting form wrote:
When performing regression tests on Windows intermittent failures are
observed, e.g. in src/bin/pg_basebackup test:
vcregress taptest src/bin/pg_basebackup
...
t/010_pg_basebackup.pl ... 10/106 Bailout called.  Further testing stopped: 
system pg_ctl failed
FAILED--Further testing stopped: system pg_ctl failed


waiting for server to start....The process cannot access the file because it
is being used by another process.stopped waiting
pg_ctl: could not start server

The issue is caused by sporadic "pg_ctl ... restart -l logfile" failures.

To reproduce this issue reliably I propose the simple demo patch (delay_after_unlink_pid p, li { white-space: pre-wrap;).
With the delay added the "pg_ctl ... restart -l logfile" command (and "vcregress taptest src/bin/pg_basebackup") fails always.
Error message is not very informational, but debugging shows that the file in question is the log file, specified when running the command:
"C:\Windows\system32\cmd.exe" /C ""C:/src/postgresql/tmp_install/bin/postgres.exe" -D "C:/src/postgresql/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_main_data/pgdata" --cluster-name=main < "nul" >> "C:/src/postgresql/src/bin/pg_basebackup/tmp_check/log/010_pg_basebackup_main.log" 2>&1"

If this file is still opened by the previous server shell (it can happen when the previous server instance has unlinked it's pid file, but it's CMD shell is still running), the next CMD start fails with the aforementioned error message.

To fix this issue I propose the attached patch (fix_logfile_sharing_violation p, li { white-space: pre-wrap; }).
With the patch, pg_ctl will wait for the log file to become available (for 30 seconds). And if the file still could not be opened (it can be reproduced with a larger delay in the demo patch), you'll get more meaningful message:
pg_ctl: could not access log file "C:/src/postgresql/src/bin/pg_basebackup/tmp_check/log/010_pg_basebackup_main.log": Permission denied

Best regards,
Alexander
Attachment

Re: BUG #16154: pg_ctl restart with a logfile fails sometimes (on Windows)

From
Tom Lane
Date:
Alexander Lakhin <exclusion@gmail.com> writes:
> If this file is still opened by the previous server shell (it can happen
> when the previous server instance has unlinked it's pid file, but it's
> CMD shell is still running), the next CMD start fails with the
> aforementioned error message.

Interesting.  I wonder whether this explains all of the remaining
buildfarm failures of this sort that we've been seeing even after
0ba06e0bf.

> To fix this issue I propose the attached patch
> (fix_logfile_sharing_violation ).

This seems like a pretty ugly hack ... please at least make it
#ifdef WIN32, so that the rest of us don't have to deal with it.
Also, if I read it correctly, it causes a pre-existing logfile
to get truncated, which has never happened before.  Mode "a"
would be a better choice.

            regards, tom lane



Re: BUG #16154: pg_ctl restart with a logfile fails sometimes (onWindows)

From
Alexander Lakhin
Date:
06.12.2019 18:14, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
>> If this file is still opened by the previous server shell (it can happen
>> when the previous server instance has unlinked it's pid file, but it's
>> CMD shell is still running), the next CMD start fails with the
>> aforementioned error message.
> Interesting.  I wonder whether this explains all of the remaining
> buildfarm failures of this sort that we've been seeing even after
> 0ba06e0bf.
>
>> To fix this issue I propose the attached patch
>> (fix_logfile_sharing_violation ).
> This seems like a pretty ugly hack ... please at least make it
> #ifdef WIN32, so that the rest of us don't have to deal with it.
> Also, if I read it correctly, it causes a pre-existing logfile
> to get truncated, which has never happened before.  Mode "a"
> would be a better choice.
This change should go under the following code:
#else                            /* WIN32 */

    /*
     * As with the Unix case, it's easiest to use the shell (CMD.EXE) to
     * handle redirection etc.  Unfortunately CMD.EXE lacks any
equivalent of
     * "exec", so we don't get to find out the postmaster's PID immediately.
     */
    PROCESS_INFORMATION pi;
    const char *comspec;

    /* Find CMD.EXE location using COMSPEC, if it's set */
    comspec = getenv("COMSPEC");
    if (comspec == NULL)
        comspec = "CMD";

So it should affect only Windows.
I've fixed the mode. Thanks for your review!

Best regards,
Alexander


Attachment

Re: BUG #16154: pg_ctl restart with a logfile fails sometimes (on Windows)

From
Amit Kapila
Date:
On Fri, Dec 6, 2019 at 1:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 06.12.2019 10:00, PG Bug reporting form wrote:
>
> When performing regression tests on Windows intermittent failures are
> observed, e.g. in src/bin/pg_basebackup test:
> vcregress taptest src/bin/pg_basebackup
> ...
> t/010_pg_basebackup.pl ... 10/106 Bailout called.  Further testing stopped:
> system pg_ctl failed
> FAILED--Further testing stopped: system pg_ctl failed
>
>
> waiting for server to start....The process cannot access the file because it
> is being used by another process.
>  stopped waiting
> pg_ctl: could not start server
>
> The issue is caused by sporadic "pg_ctl ... restart -l logfile" failures.
>
>
> To reproduce this issue reliably I propose the simple demo patch (delay_after_unlink_pid p, li { white-space:
pre-wrap;).
> With the delay added the "pg_ctl ... restart -l logfile" command (and "vcregress taptest src/bin/pg_basebackup")
failsalways. 
> Error message is not very informational, but debugging shows that the file in question is the log file, specified
whenrunning the command: 
> "C:\Windows\system32\cmd.exe" /C ""C:/src/postgresql/tmp_install/bin/postgres.exe" -D
"C:/src/postgresql/src/bin/pg_basebackup/tmp_check/t_010_pg_basebackup_main_data/pgdata"--cluster-name=main < "nul" >>
"C:/src/postgresql/src/bin/pg_basebackup/tmp_check/log/010_pg_basebackup_main.log"2>&1" 
>
> If this file is still opened by the previous server shell (it can happen when the previous server instance has
unlinkedit's pid file, but it's CMD shell is still running), the next CMD start fails with the aforementioned error
message.
>

What is the reason for the previous shell still accessing logfile?
Is the reason that we don't close it before unlinking the pid file?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: BUG #16154: pg_ctl restart with a logfile fails sometimes (on Windows)

From
Alexander Law
Date:
07.12.2019 15:36 Amit Kapila <amit.kapila16@gmail.com>:
On Fri, Dec 6, 2019 at 1:30 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>

> If this file is still opened by the previous server shell (it can happen when the previous server instance has unlinked it's pid file, but it's CMD shell is still running), the next CMD start fails with the aforementioned error message.
>

What is the reason for the previous shell still accessing logfile?
Is the reason that we don't close it before unlinking the pid file?
We can't close the logfile, as it is opened by CMD, not by the postgres process. (Postmaster is started by the command "CMD.EXE /C postgres ... >logfile".) So it is closed by the CMD shell after the postgres process exits and the postgres can't control this (as a child process).

Best regards,
Alexander

Re: BUG #16154: pg_ctl restart with a logfile fails sometimes (on Windows)

From
Tom Lane
Date:
Alexander Law <exclusion@gmail.com> writes:
> What is the reason for the previous shell still accessing logfile?
> Is the reason that we don't close it before unlinking the pid file?

> We can't close the logfile, as it is opened by CMD, not by the postgres
> process. (Postmaster is started by the command "CMD.EXE /C postgres ...
>> logfile".) So it is closed by the CMD shell after the postgres process
> exits and the postgres can't control this (as a child process).

Hmm ... so if we were talking about Unix, a reasonable solution would
be to avoid having a separate shell process, either by telling the shell
to "exec" the postmaster or by using fork/exec directly with no shell.
A 99% solution would be to use "&" so that the shell doesn't wait around
for the postmaster to die.

Are any of those ideas tenable on Windows?

            regards, tom lane



Re: BUG #16154: pg_ctl restart with a logfile fails sometimes (on Windows)

From
Alexander Law
Date:

07.12.2019 19:54 Tom Lane <tgl@sss.pgh.pa.us>:
Alexander Law <exclusion@gmail.com> writes:
> What is the reason for the previous shell still accessing logfile?
> Is the reason that we don't close it before unlinking the pid file?

> We can't close the logfile, as it is opened by CMD, not by the postgres
> process. (Postmaster is started by the command "CMD.EXE /C postgres ...
>> logfile".) So it is closed by the CMD shell after the postgres process
> exits and the postgres can't control this (as a child process).

Hmm ... so if we were talking about Unix, a reasonable solution would
be to avoid having a separate shell process, either by telling the shell
to "exec" the postmaster or by using fork/exec directly with no shell.
A 99% solution would be to use "&" so that the shell doesn't wait around
for the postmaster to die.

Are any of those ideas tenable on Windows?
I believe that we can't implement a similar solution on Windows, as the comment in pg_ctl says:

 508     /*
 509      * As with the Unix case, it's easiest to use the shell (CMD.EXE) to
 510      * handle redirection etc.  Unfortunately CMD.EXE lacks any equivalent of
 511      * "exec", so we don't get to find out the postmaster's PID immediately.
 512      */

The possible solution is to rid of CMD and to call CreateProcess with postgres.exe as an executable, but then we have to handle it's stdout, stderr pipes like this:
I'm afraid this would complicate things (and break an alignment with the Unix solution), but if the uglyness of the proposed fix overweights complexity concerns, I can try to develop an alternative patch.

Best regards,
Alexander 

Re: BUG #16154: pg_ctl restart with a logfile fails sometimes (on Windows)

From
Tom Lane
Date:
Alexander Law <exclusion@gmail.com> writes:
> 07.12.2019 19:54 Tom Lane <tgl@sss.pgh.pa.us>:
>> Hmm ... so if we were talking about Unix, a reasonable solution would
>> be to avoid having a separate shell process, either by telling the shell
>> to "exec" the postmaster or by using fork/exec directly with no shell.
>> A 99% solution would be to use "&" so that the shell doesn't wait around
>> for the postmaster to die.
>> Are any of those ideas tenable on Windows?

> The possible solution is to rid of CMD and to call CreateProcess with
> postgres.exe as an executable, but then we have to handle it's stdout,
> stderr pipes like this:
> https://docs.microsoft.com/en-us/windows/win32/procthread/creating-a-child-process-with-redirected-input-and-output
> I'm afraid this would complicate things (and break an alignment with the
> Unix solution), but if the uglyness of the proposed fix overweights
> complexity concerns, I can try to develop an alternative patch.

If it takes that much code to handle redirection on Windows, then
I agree we don't really want to do it :-(.

Your solution is a little bit messy but it seems expedient.  I feel
that it doesn't really fix the problem, if we define the problem as
"somebody opened the postmaster logfile with non-sharing-friendly
parameters".  This patch works around that for the specific case of
a series of pg_ctl start commands with identical -l arguments, but
the file's still not really sharable.  Still, maybe that's good
enough.  For our own purposes in buildfarm testing, the main residual
problem is that we can't safely inspect the postmaster logfile
immediately after pg_ctl stop; but there's a separate proposal to
fix that by hacking slurp_file.

It seems like it'd be cleaner to fix both of these things at once
by opening the logfile ourselves (rather than letting CMD.EXE do it)
and using sharing-friendly parameters at that point.  But maybe it's
just not worth the trouble compared to the individual hacks.

            regards, tom lane



Re: BUG #16154: pg_ctl restart with a logfile fails sometimes (on Windows)

From
Alexander Law
Date:

07.12.2019 22:26 Tom Lane <tgl@sss.pgh.pa.us>:
Alexander Law <exclusion@gmail.com> writes:


> The possible solution is to rid of CMD and to call CreateProcess with
> postgres.exe as an executable, but then we have to handle it's stdout,
> stderr pipes like this:
> https://docs.microsoft.com/en-us/windows/win32/procthread/creating-a-child-process-with-redirected-input-and-output
> I'm afraid this would complicate things (and break an alignment with the
> Unix solution), but if the uglyness of the proposed fix overweights
> complexity concerns, I can try to develop an alternative patch.

If it takes that much code to handle redirection on Windows, then
I agree we don't really want to do it :-(.

Your solution is a little bit messy but it seems expedient.  I feel
that it doesn't really fix the problem, if we define the problem as
"somebody opened the postmaster logfile with non-sharing-friendly
parameters". 
I see two problems here:
1) "pg_ctl restart ... -l logfile" can fail due to timing issues
2) when it (or "pg_ctl start ... -l logfile") fails, it prints the non-informative message:
The process cannot access the file because it
is being used by another process.
It doesn't tell the user what file is inaccessible. I think that this issue remains unresolved for so long partly due to this unclearance. The proposed patch fixes both, unless I miss something.

This patch works around that for the specific case of
a series of pg_ctl start commands with identical -l arguments, but
the file's still not really sharable. 
Yes, even if someone would manage to lock the logfile by a separate program or by starting another postgres cluster with the same logfile, he will get a more meaningful message (after a timeout).
Still, maybe that's good
enough.  For our own purposes in buildfarm testing, the main residual
problem is that we can't safely inspect the postmaster logfile
immediately after pg_ctl stop; but there's a separate proposal to
fix that by hacking slurp_file.
As to slurp_file, maybe it's read-only nature should help to avoid the issues with the logfile opening. I can test it next week.

It seems like it'd be cleaner to fix both of these things at once
by opening the logfile ourselves (rather than letting CMD.EXE do it)
and using sharing-friendly parameters at that point.  But maybe it's
just not worth the trouble compared to the individual hacks.
If we couldn't find more advantages of rewriting create process (with logfile), I would restrict changes to one hack (or two, if really needed).

Best regards,
Alexander

Re: BUG #16154: pg_ctl restart with a logfile fails sometimes (on Windows)

From
Tom Lane
Date:
Alexander Law <exclusion@gmail.com> writes:
> 07.12.2019 22:26 Tom Lane <tgl@sss.pgh.pa.us>:
>> It seems like it'd be cleaner to fix both of these things at once
>> by opening the logfile ourselves (rather than letting CMD.EXE do it)
>> and using sharing-friendly parameters at that point.  But maybe it's
>> just not worth the trouble compared to the individual hacks.

> If we couldn't find more advantages of rewriting create process (with
> logfile), I would restrict changes to one hack (or two, if really needed).

Yeah, I agree that it doesn't yet seem worthwhile to do that sort of
rewrite.

Also, I realized that blaming CMD.EXE for holding the logfile open is
wrong: it's actually the postmaster itself that still has the file
open after it's removed postmaster.pid.  (Well, maybe CMD.EXE still
has a handle too, but that would be weird; and it's not necessary
to explain the problem.)  Short of doing our own redirection, the
only way to fix that would be for the postmaster to close its own
stdout and stderr before removing the pidfile, which is obviously
a bad idea.

Hence, I pushed this patch, with a revised comment.  We'll see
how it affects the buildfarm.

            regards, tom lane



Re: BUG #16154: pg_ctl restart with a logfile fails sometimes (onWindows)

From
Alexander Lakhin
Date:
10.12.2019 21:24, Tom Lane wrote:
> Hence, I pushed this patch, with a revised comment.  We'll see
> how it affects the buildfarm.
Thank you! I will continue researching the remaining failures.

Best regards,
Alexander