Thread: Intermittent failure with t/003_logical_slots.pl test on windows

Intermittent failure with t/003_logical_slots.pl test on windows

From
Nisha Moond
Date:
Hi hackers,

There is a failure with 't/003_logical_slots.pl' test during the
upgrade. The failure is intermittent and observed in the Windows
environment.

Details-
Test - pg_upgrade/t/003_logical_slots.pl
Result -
t/003_logical_slots.pl .. 5/?
#   Failed test 'run of pg_upgrade of old cluster'
#   at t/003_logical_slots.pl line 165.
t/003_logical_slots.pl .. 10/?
#   Failed test 'check the slot exists on new cluster'
#   at t/003_logical_slots.pl line 171.
#          got: ''
#     expected: 'regress_sub|t'
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 25 just after 11.
t/003_logical_slots.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
Failed 2/11 subtests

Test Summary Report
-------------------
t/003_logical_slots.pl (Wstat: 6400 (exited 25) Tests: 11 Failed: 2)
  Failed tests:  10-11
  Non-zero exit status: 25
  Parse errors: No plan found in TAP output
Files=1, Tests=11, 32 wallclock secs ( 0.03 usr +  0.01 sys =  0.05 CPU)
Result: FAIL

log attached - 'regress_log_003_logical_slots'.

The failure cause is  -
no data was returned by command
""D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V"
check for "D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal"
failed: cannot execute

Failure, exiting
[16:24:21.144](6.275s) not ok 10 - run of pg_upgrade of old cluster

If the same command is run manually, it succeeds -

>"D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V
pg_resetwal (PostgreSQL) 17devel

The same test failure (intermittent) is also seen with different
commands like pg_ctl  and pg_dump as failure cause while retrieving
version -
Ex -
no data was returned by command
""D:/Project/pg1/postgres/tmp_install/bin/pg_dump" -V"
check for "D:/Project/pg1/postgres/tmp_install/bin/pg_dump" failed:
cannot execute

Failure, exiting
[16:08:50.444](7.434s) not ok 10 - run of pg_upgrade of old cluster

Has anyone come across this issue? I am not sure what is the issue here.
Any thoughts?

Thanks,
Nisha Moond

Attachment

Re: Intermittent failure with t/003_logical_slots.pl test on windows

From
Amit Kapila
Date:
On Tue, Oct 31, 2023 at 4:53 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> There is a failure with 't/003_logical_slots.pl' test during the
> upgrade. The failure is intermittent and observed in the Windows
> environment.
>

How did you reach the conclusion that it is only for
't/003_logical_slots.pl'? I see that the failure is while pg_upgrade
internally running pg_resetwal -V command to check the version which
doesn't seem to be directly related to the newly added test or code.

> Details-
> Test - pg_upgrade/t/003_logical_slots.pl
> Result -
> t/003_logical_slots.pl .. 5/?
> #   Failed test 'run of pg_upgrade of old cluster'
> #   at t/003_logical_slots.pl line 165.
> t/003_logical_slots.pl .. 10/?
> #   Failed test 'check the slot exists on new cluster'
> #   at t/003_logical_slots.pl line 171.
> #          got: ''
> #     expected: 'regress_sub|t'
> # Tests were run but no plan was declared and done_testing() was not seen.
> # Looks like your test exited with 25 just after 11.
> t/003_logical_slots.pl .. Dubious, test returned 25 (wstat 6400, 0x1900)
> Failed 2/11 subtests
>
> Test Summary Report
> -------------------
> t/003_logical_slots.pl (Wstat: 6400 (exited 25) Tests: 11 Failed: 2)
>   Failed tests:  10-11
>   Non-zero exit status: 25
>   Parse errors: No plan found in TAP output
> Files=1, Tests=11, 32 wallclock secs ( 0.03 usr +  0.01 sys =  0.05 CPU)
> Result: FAIL
>
> log attached - 'regress_log_003_logical_slots'.
>
> The failure cause is  -
> no data was returned by command
> ""D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V"
> check for "D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal"
> failed: cannot execute
>
> Failure, exiting
> [16:24:21.144](6.275s) not ok 10 - run of pg_upgrade of old cluster
>
> If the same command is run manually, it succeeds -
>

Can you add some LOGs in pg_resetwal to find out if the command has
performed appropriately?

--
With Regards,
Amit Kapila.



Re: Intermittent failure with t/003_logical_slots.pl test on windows

From
vignesh C
Date:
On Tue, 31 Oct 2023 at 17:51, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Oct 31, 2023 at 4:53 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
> >
> > There is a failure with 't/003_logical_slots.pl' test during the
> > upgrade. The failure is intermittent and observed in the Windows
> > environment.
> >
>
> How did you reach the conclusion that it is only for
> 't/003_logical_slots.pl'? I see that the failure is while pg_upgrade
> internally running pg_resetwal -V command to check the version which
> doesn't seem to be directly related to the newly added test or code.

I also felt it is not related to the 003_logical_slots test, I felt
the problem might be because of the pipe_read_line function:
....
pipe_read_line(char *cmd, char *line, int maxsize)
{
FILE *pgver;

fflush(NULL);

errno = 0;
if ((pgver = popen(cmd, "r")) == NULL)
{
perror("popen failure");
return NULL;
}

errno = 0;
if (fgets(line, maxsize, pgver) == NULL)
...

Few others are also facing this problem with similar code like in:
https://stackoverflow.com/questions/15882799/fgets-returning-error-for-file-returned-by-popen

Regards,
Vignesh



RE: Intermittent failure with t/003_logical_slots.pl test on windows

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Nisha,

> 
> The failure cause is  -
> no data was returned by command
> ""D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V"
> check for "D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal"
> failed: cannot execute
> 
> Failure, exiting
> [16:24:21.144](6.275s) not ok 10 - run of pg_upgrade of old cluster

I thought it was not related with the feature. I doubt the pg_upgrade read the
command result before it was really executed.

First of all, The stack trace until the system call _popen() is as follows. 

```
check_exec()
pipe_read_line()
popen()
pgwin32_popen()
_popen() // process was forked and command would be executed
```

I read MSdocs and said that _popen executes specified commands asynchronously [1].
 
> The _popen function creates a pipe. It then asynchronously executes a spawned
> copy of the command processor, and uses command as the command line.
 

Your failure meant that the binary was found but its output was not found by fgets().
So I thought that the forked process has not executed the command yet at that time. Thought?

[1]: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/popen-wpopen?view=msvc-170

Best Regards,
Hayato Kuroda
FUJITSU LIMITED


Re: Intermittent failure with t/003_logical_slots.pl test on windows

From
Kyotaro Horiguchi
Date:
At Tue, 31 Oct 2023 18:11:48 +0530, vignesh C <vignesh21@gmail.com> wrote in 
> Few others are also facing this problem with similar code like in:
> https://stackoverflow.com/questions/15882799/fgets-returning-error-for-file-returned-by-popen

I'm inclined to believe that the pipe won't enter the EOF state until
the target command terminates (then the top-level cmd.exe). The target
command likely terminated prematurely due to an error before priting
any output.

If we append "2>&1" to the command line, we can capture the error
message through the returned pipe if any. Such error messages will
cause the subsequent code to fail with an error such as "unexpected
string: 'the output'". I'm not sure, but if this is permissive, the
returned error messages could potentially provide insight into the
underlying issue, paving the way for a potential solution.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Intermittent failure with t/003_logical_slots.pl test on windows

From
Nisha Moond
Date:
On Thu, Nov 2, 2023 at 11:52 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Tue, 31 Oct 2023 18:11:48 +0530, vignesh C <vignesh21@gmail.com> wrote in
> > Few others are also facing this problem with similar code like in:
> > https://stackoverflow.com/questions/15882799/fgets-returning-error-for-file-returned-by-popen
>
> I'm inclined to believe that the pipe won't enter the EOF state until
> the target command terminates (then the top-level cmd.exe). The target
> command likely terminated prematurely due to an error before priting
> any output.
>
> If we append "2>&1" to the command line, we can capture the error
> message through the returned pipe if any. Such error messages will
> cause the subsequent code to fail with an error such as "unexpected
> string: 'the output'". I'm not sure, but if this is permissive, the
> returned error messages could potentially provide insight into the
> underlying issue, paving the way for a potential solution.
>

Appending '2>&1 test:
The command still results in NULL and ends up failing as no data is
returned. Which means even no error message is returned. The error log
with appended '2>$1' is -

no data was returned by command
""D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V 2>&1"

check for "D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal"
failed: cannot execute
Failure, exiting

Further observations:
1. To make sure the forked process completes before fgets(), I tested
with Sleep(100) before fgets() call.
...
...
if ((pgver = popen(cmd, "r")) == NULL)
{
perror("popen failure");
return NULL;
}

errno = 0;
Sleep(100);
if (fgets(line, maxsize, pgver) == NULL)
{
if (feof(pgver))
fprintf(stderr, "no data was returned by command \"%s\"\n", cmd);
...
...

This also doesn't resolve the issue, the error is still seen intermittently.

2.  Even though fgets() fails, the output is still getting captured in
'line' string.
Tested with printing the 'line' in case of failure:
...
...
if ((pgver = popen(cmd, "r")) == NULL)
{
perror("popen failure");
return NULL;
}

errno = 0;
if (fgets(line, maxsize, pgver) == NULL)
{
    if (line)
      fprintf(stderr, "cmd output - %s\n", line);

    if (feof(pgver))
      fprintf(stderr, "no data was returned by command \"%s\"\n", cmd);
…
…
And the log looks like -
cmd output - postgres (PostgreSQL) 17devel
no data was returned by command
""D:/Project/pg1/postgres/tmp_install/bin/pg_controldata" -V"

check for "D:/Project/pg1/postgres/tmp_install/bin/pg_controldata"
failed: cannot execute
Failure, exiting

Attached test result log for the same - "regress_log_003_logical_slots".

Thanks,
Nisha Moond

Attachment

Re: Intermittent failure with t/003_logical_slots.pl test on windows

From
Nisha Moond
Date:
On Fri, Nov 3, 2023 at 5:02 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> On Thu, Nov 2, 2023 at 11:52 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > At Tue, 31 Oct 2023 18:11:48 +0530, vignesh C <vignesh21@gmail.com> wrote in
> > > Few others are also facing this problem with similar code like in:
> > > https://stackoverflow.com/questions/15882799/fgets-returning-error-for-file-returned-by-popen
> >
> > I'm inclined to believe that the pipe won't enter the EOF state until
> > the target command terminates (then the top-level cmd.exe). The target
> > command likely terminated prematurely due to an error before priting
> > any output.
> >
> > If we append "2>&1" to the command line, we can capture the error
> > message through the returned pipe if any. Such error messages will
> > cause the subsequent code to fail with an error such as "unexpected
> > string: 'the output'". I'm not sure, but if this is permissive, the
> > returned error messages could potentially provide insight into the
> > underlying issue, paving the way for a potential solution.
> >
>
> Appending '2>&1 test:
> The command still results in NULL and ends up failing as no data is
> returned. Which means even no error message is returned. The error log
> with appended '2>$1' is -
>
> no data was returned by command
> ""D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V 2>&1"
>
> check for "D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal"
> failed: cannot execute
> Failure, exiting
>
> Further observations:
> 1. To make sure the forked process completes before fgets(), I tested
> with Sleep(100) before fgets() call.
> ...
> ...
> if ((pgver = popen(cmd, "r")) == NULL)
> {
> perror("popen failure");
> return NULL;
> }
>
> errno = 0;
> Sleep(100);
> if (fgets(line, maxsize, pgver) == NULL)
> {
> if (feof(pgver))
> fprintf(stderr, "no data was returned by command \"%s\"\n", cmd);
> ...
> ...
>
> This also doesn't resolve the issue, the error is still seen intermittently.
>
> 2.  Even though fgets() fails, the output is still getting captured in
> 'line' string.
> Tested with printing the 'line' in case of failure:
> ...
> ...
> if ((pgver = popen(cmd, "r")) == NULL)
> {
> perror("popen failure");
> return NULL;
> }
>
> errno = 0;
> if (fgets(line, maxsize, pgver) == NULL)
> {
>     if (line)
>       fprintf(stderr, "cmd output - %s\n", line);
>
>     if (feof(pgver))
>       fprintf(stderr, "no data was returned by command \"%s\"\n", cmd);
> …
> …
> And the log looks like -
> cmd output - postgres (PostgreSQL) 17devel
> no data was returned by command
> ""D:/Project/pg1/postgres/tmp_install/bin/pg_controldata" -V"
>
> check for "D:/Project/pg1/postgres/tmp_install/bin/pg_controldata"
> failed: cannot execute
> Failure, exiting
>
> Attached test result log for the same - "regress_log_003_logical_slots".

The same failure is observed with test 't\002_pg_upgrade.pl' too
(intermittently). So, it is not limited to "t/003_logical_slots.pl"
test alone. It is more closely associated with the pg_upgrade command
run.

--
Thanks,
Nisha Moond



Re: Intermittent failure with t/003_logical_slots.pl test on windows

From
Kyotaro Horiguchi
Date:
At Mon, 6 Nov 2023 19:42:21 +0530, Nisha Moond <nisha.moond412@gmail.com> wrote in 
> > Appending '2>&1 test:
> > The command still results in NULL and ends up failing as no data is
> > returned. Which means even no error message is returned. The error log

Thanks for confirmation. So, at least the child process was launced
successfully in the cmd.exe's view.

Upon a quick check on my end with Windows' _popen, I have obseved the
following:

- Once a child process is started, it seems to go undetected as an
  error by _popen or subsequent fgets calls if the process ends
  abnormally, with a non-zero exit status or even with a SEGV.

- After the child process has flushed data to stdout, it is possible
  to read from the pipe even if the child process crashes or ends
  thereafter.

- Even if fgets is called before the program starts, it will correctly
  block until the program outputs something. Specifically, when I used
  popen("sleep 5 & target.exe") and immediately performed fgets on the
  pipe, I was able to read the output of target.exe as the first line.

Therefore, based on the information available, it is conceivable that
the child process was killed by something right after it started, or
the program terminated on its own without any error messages.

By the way, in the case of aforementioned SEGV, Application Errors
corresponding to it were identifiable in the Event
Viewer. Additionally, regarding the exit statuses, they can be
captured by using a wrapper batch file (.bat) that records
%ERRORLEVEL% after running the target program. This may yield
insights, aothough its effectiveness is not guaranteed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Intermittent failure with t/003_logical_slots.pl test on windows

From
Shlok Kyal
Date:
Hi,
The same intermittent failure is reproducible on my system.
For the intermittent issues I found that many issues are due to errors
where commands like 'psql -V' are not returning any output.
To reproduce it in an easy way, I wrote a script (.bat file) with
'--version' option for different binaries. And found out that it was
not giving any output for some command (varies for each run).
Then I tried to run the same script after adding 'fflush(stdout)' in
the function called with  '--version' option and it started to give
output for each command.
I noticed the same for '--help' option and did the changes for the same.

I have attached the test script(changes the extension to .txt as gmail
is blocking it), output of test before the changes.
I have also attached the patch with changes which resolved the above issue.

This change has resolved most of the intermittent issues for me. I am
facing some more intermittent issues. Will analyse and share it as
well.

Thanks and regards
Shlok Kyal

On Tue, 7 Nov 2023 at 11:05, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Mon, 6 Nov 2023 19:42:21 +0530, Nisha Moond <nisha.moond412@gmail.com> wrote in
> > > Appending '2>&1 test:
> > > The command still results in NULL and ends up failing as no data is
> > > returned. Which means even no error message is returned. The error log
>
> Thanks for confirmation. So, at least the child process was launced
> successfully in the cmd.exe's view.
>
> Upon a quick check on my end with Windows' _popen, I have obseved the
> following:
>
> - Once a child process is started, it seems to go undetected as an
>   error by _popen or subsequent fgets calls if the process ends
>   abnormally, with a non-zero exit status or even with a SEGV.
>
> - After the child process has flushed data to stdout, it is possible
>   to read from the pipe even if the child process crashes or ends
>   thereafter.
>
> - Even if fgets is called before the program starts, it will correctly
>   block until the program outputs something. Specifically, when I used
>   popen("sleep 5 & target.exe") and immediately performed fgets on the
>   pipe, I was able to read the output of target.exe as the first line.
>
> Therefore, based on the information available, it is conceivable that
> the child process was killed by something right after it started, or
> the program terminated on its own without any error messages.
>
> By the way, in the case of aforementioned SEGV, Application Errors
> corresponding to it were identifiable in the Event
> Viewer. Additionally, regarding the exit statuses, they can be
> captured by using a wrapper batch file (.bat) that records
> %ERRORLEVEL% after running the target program. This may yield
> insights, aothough its effectiveness is not guaranteed.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>
>

Attachment

Re: Intermittent failure with t/003_logical_slots.pl test on windows

From
Nisha Moond
Date:
Thanks for working on it. I tested the patch on my system and it
resolved the issue with commands running -V (version check).

As you mentioned, I am also still seeing intermittent errors even with
the patch as below -
 in 'pg_upgrade/002_pg_upgrade' -

# Running: pg_upgrade --no-sync -d
D:\Project\pg2\postgres\build/testrun/pg_upgrade/002_pg_upgrade\data/t_002_pg_upgrade_old_node_data/pgdata
-D D:\Project\pg2\postgres\build/testrun/pg_upgrade/002_pg_upgrade\data/t_002_pg_upgrade_new_node_data/pgdata
-b D:/Project/pg2/postgres/build/tmp_install/Project/pg2/postgresql/bin
-B D:/Project/pg2/postgres/build/tmp_install/Project/pg2/postgresql/bin
-s 127.0.0.1 -p 56095 -P 56096 --copy --check
Performing Consistency Checks
-----------------------------
Checking cluster versions                                     ok

The source cluster lacks cluster state information:
Failure, exiting
[12:37:38.666](3.317s) not ok 12 - run of pg_upgrade --check for new instance
[12:37:38.666](0.000s) #   Failed test 'run of pg_upgrade --check for
new instance'
#   at D:/Project/pg2/postgres/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 375.

and in 'pg_upgrade/003_logical_slots'  -

[12:35:33.773](0.001s) not ok 6 - run of pg_upgrade of old cluster
with slots having unconsumed WAL records stdout /(?^:Your installation
contains logical replication slots that can't be upgraded.)/
[12:35:33.773](0.000s) #   Failed test 'run of pg_upgrade of old
cluster with slots having unconsumed WAL records stdout /(?^:Your
installation contains logical replication slots that can't be
upgraded.)/'
#   at D:/Project/pg2/postgres/src/bin/pg_upgrade/t/003_logical_slots.pl
line 102.
[12:35:33.773](0.000s) #                   'Performing Consistency Checks
# -----------------------------
# Checking cluster versions                                     ok
#
# The target cluster lacks cluster state information:
# Failure, exiting

It seems 'Performing Consistency Checks' fail due to a lack of some
information and possible that it can also be fixed on the same lines.

--
Thanks,
Nisha



Re: Intermittent failure with t/003_logical_slots.pl test on windows

From
Shlok Kyal
Date:
Hi,
Apart of these I am getting following some intermittent failure as below:

131/272 postgresql:pg_basebackup / pg_basebackup/010_pg_basebackup
                    ERROR            30.51s   (exit status 255 or
signal 127 SIGinvalid)
114/272 postgresql:libpq / libpq/001_uri
                    ERROR             9.66s   exit status 8
 34/272 postgresql:pg_upgrade / pg_upgrade/003_logical_slots
                    ERROR            99.14s   exit status 1
186/272 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
                    ERROR           306.22s   exit status 1
 29/272 postgresql:recovery / recovery/002_archiving
                    ERROR            89.62s   (exit status 255 or
signal 127 SIGinvalid)
138/272 postgresql:pg_resetwal / pg_resetwal/001_basic
                    ERROR             3.93s   (exit status 255 or
signal 127 SIGinvalid)

Have attached the regress logs for the same as well.

Thanks and Regards
Shlok Kyal

On Tue, 26 Dec 2023 at 17:39, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote:
>
> Hi,
> The same intermittent failure is reproducible on my system.
> For the intermittent issues I found that many issues are due to errors
> where commands like 'psql -V' are not returning any output.
> To reproduce it in an easy way, I wrote a script (.bat file) with
> '--version' option for different binaries. And found out that it was
> not giving any output for some command (varies for each run).
> Then I tried to run the same script after adding 'fflush(stdout)' in
> the function called with  '--version' option and it started to give
> output for each command.
> I noticed the same for '--help' option and did the changes for the same.
>
> I have attached the test script(changes the extension to .txt as gmail
> is blocking it), output of test before the changes.
> I have also attached the patch with changes which resolved the above issue.
>
> This change has resolved most of the intermittent issues for me. I am
> facing some more intermittent issues. Will analyse and share it as
> well.
>
> Thanks and regards
> Shlok Kyal
>
> On Tue, 7 Nov 2023 at 11:05, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >
> > At Mon, 6 Nov 2023 19:42:21 +0530, Nisha Moond <nisha.moond412@gmail.com> wrote in
> > > > Appending '2>&1 test:
> > > > The command still results in NULL and ends up failing as no data is
> > > > returned. Which means even no error message is returned. The error log
> >
> > Thanks for confirmation. So, at least the child process was launced
> > successfully in the cmd.exe's view.
> >
> > Upon a quick check on my end with Windows' _popen, I have obseved the
> > following:
> >
> > - Once a child process is started, it seems to go undetected as an
> >   error by _popen or subsequent fgets calls if the process ends
> >   abnormally, with a non-zero exit status or even with a SEGV.
> >
> > - After the child process has flushed data to stdout, it is possible
> >   to read from the pipe even if the child process crashes or ends
> >   thereafter.
> >
> > - Even if fgets is called before the program starts, it will correctly
> >   block until the program outputs something. Specifically, when I used
> >   popen("sleep 5 & target.exe") and immediately performed fgets on the
> >   pipe, I was able to read the output of target.exe as the first line.
> >
> > Therefore, based on the information available, it is conceivable that
> > the child process was killed by something right after it started, or
> > the program terminated on its own without any error messages.
> >
> > By the way, in the case of aforementioned SEGV, Application Errors
> > corresponding to it were identifiable in the Event
> > Viewer. Additionally, regarding the exit statuses, they can be
> > captured by using a wrapper batch file (.bat) that records
> > %ERRORLEVEL% after running the target program. This may yield
> > insights, aothough its effectiveness is not guaranteed.
> >
> > regards.
> >
> > --
> > Kyotaro Horiguchi
> > NTT Open Source Software Center
> >
> >

Attachment