Thread: Intermittent failure with t/003_logical_slots.pl test on windows
Hi hackers, There is a failure with 't/003_logical_slots.pl' test during the upgrade. The failure is intermittent and observed in the Windows environment. Details- Test - pg_upgrade/t/003_logical_slots.pl Result - t/003_logical_slots.pl .. 5/? # Failed test 'run of pg_upgrade of old cluster' # at t/003_logical_slots.pl line 165. t/003_logical_slots.pl .. 10/? # Failed test 'check the slot exists on new cluster' # at t/003_logical_slots.pl line 171. # got: '' # expected: 'regress_sub|t' # Tests were run but no plan was declared and done_testing() was not seen. # Looks like your test exited with 25 just after 11. t/003_logical_slots.pl .. Dubious, test returned 25 (wstat 6400, 0x1900) Failed 2/11 subtests Test Summary Report ------------------- t/003_logical_slots.pl (Wstat: 6400 (exited 25) Tests: 11 Failed: 2) Failed tests: 10-11 Non-zero exit status: 25 Parse errors: No plan found in TAP output Files=1, Tests=11, 32 wallclock secs ( 0.03 usr + 0.01 sys = 0.05 CPU) Result: FAIL log attached - 'regress_log_003_logical_slots'. The failure cause is - no data was returned by command ""D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V" check for "D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" failed: cannot execute Failure, exiting [16:24:21.144](6.275s) not ok 10 - run of pg_upgrade of old cluster If the same command is run manually, it succeeds - >"D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V pg_resetwal (PostgreSQL) 17devel The same test failure (intermittent) is also seen with different commands like pg_ctl and pg_dump as failure cause while retrieving version - Ex - no data was returned by command ""D:/Project/pg1/postgres/tmp_install/bin/pg_dump" -V" check for "D:/Project/pg1/postgres/tmp_install/bin/pg_dump" failed: cannot execute Failure, exiting [16:08:50.444](7.434s) not ok 10 - run of pg_upgrade of old cluster Has anyone come across this issue? I am not sure what is the issue here. Any thoughts? Thanks, Nisha Moond
Attachment
On Tue, Oct 31, 2023 at 4:53 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > There is a failure with 't/003_logical_slots.pl' test during the > upgrade. The failure is intermittent and observed in the Windows > environment. > How did you reach the conclusion that it is only for 't/003_logical_slots.pl'? I see that the failure is while pg_upgrade internally running pg_resetwal -V command to check the version which doesn't seem to be directly related to the newly added test or code. > Details- > Test - pg_upgrade/t/003_logical_slots.pl > Result - > t/003_logical_slots.pl .. 5/? > # Failed test 'run of pg_upgrade of old cluster' > # at t/003_logical_slots.pl line 165. > t/003_logical_slots.pl .. 10/? > # Failed test 'check the slot exists on new cluster' > # at t/003_logical_slots.pl line 171. > # got: '' > # expected: 'regress_sub|t' > # Tests were run but no plan was declared and done_testing() was not seen. > # Looks like your test exited with 25 just after 11. > t/003_logical_slots.pl .. Dubious, test returned 25 (wstat 6400, 0x1900) > Failed 2/11 subtests > > Test Summary Report > ------------------- > t/003_logical_slots.pl (Wstat: 6400 (exited 25) Tests: 11 Failed: 2) > Failed tests: 10-11 > Non-zero exit status: 25 > Parse errors: No plan found in TAP output > Files=1, Tests=11, 32 wallclock secs ( 0.03 usr + 0.01 sys = 0.05 CPU) > Result: FAIL > > log attached - 'regress_log_003_logical_slots'. > > The failure cause is - > no data was returned by command > ""D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V" > check for "D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" > failed: cannot execute > > Failure, exiting > [16:24:21.144](6.275s) not ok 10 - run of pg_upgrade of old cluster > > If the same command is run manually, it succeeds - > Can you add some LOGs in pg_resetwal to find out if the command has performed appropriately? -- With Regards, Amit Kapila.
On Tue, 31 Oct 2023 at 17:51, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Oct 31, 2023 at 4:53 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > > > There is a failure with 't/003_logical_slots.pl' test during the > > upgrade. The failure is intermittent and observed in the Windows > > environment. > > > > How did you reach the conclusion that it is only for > 't/003_logical_slots.pl'? I see that the failure is while pg_upgrade > internally running pg_resetwal -V command to check the version which > doesn't seem to be directly related to the newly added test or code. I also felt it is not related to the 003_logical_slots test, I felt the problem might be because of the pipe_read_line function: .... pipe_read_line(char *cmd, char *line, int maxsize) { FILE *pgver; fflush(NULL); errno = 0; if ((pgver = popen(cmd, "r")) == NULL) { perror("popen failure"); return NULL; } errno = 0; if (fgets(line, maxsize, pgver) == NULL) ... Few others are also facing this problem with similar code like in: https://stackoverflow.com/questions/15882799/fgets-returning-error-for-file-returned-by-popen Regards, Vignesh
RE: Intermittent failure with t/003_logical_slots.pl test on windows
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Nisha, > > The failure cause is - > no data was returned by command > ""D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V" > check for "D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" > failed: cannot execute > > Failure, exiting > [16:24:21.144](6.275s) not ok 10 - run of pg_upgrade of old cluster I thought it was not related with the feature. I doubt the pg_upgrade read the command result before it was really executed. First of all, The stack trace until the system call _popen() is as follows. ``` check_exec() pipe_read_line() popen() pgwin32_popen() _popen() // process was forked and command would be executed ``` I read MSdocs and said that _popen executes specified commands asynchronously [1]. > The _popen function creates a pipe. It then asynchronously executes a spawned > copy of the command processor, and uses command as the command line. Your failure meant that the binary was found but its output was not found by fgets(). So I thought that the forked process has not executed the command yet at that time. Thought? [1]: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/popen-wpopen?view=msvc-170 Best Regards, Hayato Kuroda FUJITSU LIMITED
At Tue, 31 Oct 2023 18:11:48 +0530, vignesh C <vignesh21@gmail.com> wrote in > Few others are also facing this problem with similar code like in: > https://stackoverflow.com/questions/15882799/fgets-returning-error-for-file-returned-by-popen I'm inclined to believe that the pipe won't enter the EOF state until the target command terminates (then the top-level cmd.exe). The target command likely terminated prematurely due to an error before priting any output. If we append "2>&1" to the command line, we can capture the error message through the returned pipe if any. Such error messages will cause the subsequent code to fail with an error such as "unexpected string: 'the output'". I'm not sure, but if this is permissive, the returned error messages could potentially provide insight into the underlying issue, paving the way for a potential solution. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Nov 2, 2023 at 11:52 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 31 Oct 2023 18:11:48 +0530, vignesh C <vignesh21@gmail.com> wrote in > > Few others are also facing this problem with similar code like in: > > https://stackoverflow.com/questions/15882799/fgets-returning-error-for-file-returned-by-popen > > I'm inclined to believe that the pipe won't enter the EOF state until > the target command terminates (then the top-level cmd.exe). The target > command likely terminated prematurely due to an error before priting > any output. > > If we append "2>&1" to the command line, we can capture the error > message through the returned pipe if any. Such error messages will > cause the subsequent code to fail with an error such as "unexpected > string: 'the output'". I'm not sure, but if this is permissive, the > returned error messages could potentially provide insight into the > underlying issue, paving the way for a potential solution. > Appending '2>&1 test: The command still results in NULL and ends up failing as no data is returned. Which means even no error message is returned. The error log with appended '2>$1' is - no data was returned by command ""D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V 2>&1" check for "D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" failed: cannot execute Failure, exiting Further observations: 1. To make sure the forked process completes before fgets(), I tested with Sleep(100) before fgets() call. ... ... if ((pgver = popen(cmd, "r")) == NULL) { perror("popen failure"); return NULL; } errno = 0; Sleep(100); if (fgets(line, maxsize, pgver) == NULL) { if (feof(pgver)) fprintf(stderr, "no data was returned by command \"%s\"\n", cmd); ... ... This also doesn't resolve the issue, the error is still seen intermittently. 2. Even though fgets() fails, the output is still getting captured in 'line' string. Tested with printing the 'line' in case of failure: ... ... if ((pgver = popen(cmd, "r")) == NULL) { perror("popen failure"); return NULL; } errno = 0; if (fgets(line, maxsize, pgver) == NULL) { if (line) fprintf(stderr, "cmd output - %s\n", line); if (feof(pgver)) fprintf(stderr, "no data was returned by command \"%s\"\n", cmd); … … And the log looks like - cmd output - postgres (PostgreSQL) 17devel no data was returned by command ""D:/Project/pg1/postgres/tmp_install/bin/pg_controldata" -V" check for "D:/Project/pg1/postgres/tmp_install/bin/pg_controldata" failed: cannot execute Failure, exiting Attached test result log for the same - "regress_log_003_logical_slots". Thanks, Nisha Moond
Attachment
On Fri, Nov 3, 2023 at 5:02 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > On Thu, Nov 2, 2023 at 11:52 AM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Tue, 31 Oct 2023 18:11:48 +0530, vignesh C <vignesh21@gmail.com> wrote in > > > Few others are also facing this problem with similar code like in: > > > https://stackoverflow.com/questions/15882799/fgets-returning-error-for-file-returned-by-popen > > > > I'm inclined to believe that the pipe won't enter the EOF state until > > the target command terminates (then the top-level cmd.exe). The target > > command likely terminated prematurely due to an error before priting > > any output. > > > > If we append "2>&1" to the command line, we can capture the error > > message through the returned pipe if any. Such error messages will > > cause the subsequent code to fail with an error such as "unexpected > > string: 'the output'". I'm not sure, but if this is permissive, the > > returned error messages could potentially provide insight into the > > underlying issue, paving the way for a potential solution. > > > > Appending '2>&1 test: > The command still results in NULL and ends up failing as no data is > returned. Which means even no error message is returned. The error log > with appended '2>$1' is - > > no data was returned by command > ""D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" -V 2>&1" > > check for "D:/Project/pg1/postgres/tmp_install/bin/pg_resetwal" > failed: cannot execute > Failure, exiting > > Further observations: > 1. To make sure the forked process completes before fgets(), I tested > with Sleep(100) before fgets() call. > ... > ... > if ((pgver = popen(cmd, "r")) == NULL) > { > perror("popen failure"); > return NULL; > } > > errno = 0; > Sleep(100); > if (fgets(line, maxsize, pgver) == NULL) > { > if (feof(pgver)) > fprintf(stderr, "no data was returned by command \"%s\"\n", cmd); > ... > ... > > This also doesn't resolve the issue, the error is still seen intermittently. > > 2. Even though fgets() fails, the output is still getting captured in > 'line' string. > Tested with printing the 'line' in case of failure: > ... > ... > if ((pgver = popen(cmd, "r")) == NULL) > { > perror("popen failure"); > return NULL; > } > > errno = 0; > if (fgets(line, maxsize, pgver) == NULL) > { > if (line) > fprintf(stderr, "cmd output - %s\n", line); > > if (feof(pgver)) > fprintf(stderr, "no data was returned by command \"%s\"\n", cmd); > … > … > And the log looks like - > cmd output - postgres (PostgreSQL) 17devel > no data was returned by command > ""D:/Project/pg1/postgres/tmp_install/bin/pg_controldata" -V" > > check for "D:/Project/pg1/postgres/tmp_install/bin/pg_controldata" > failed: cannot execute > Failure, exiting > > Attached test result log for the same - "regress_log_003_logical_slots". The same failure is observed with test 't\002_pg_upgrade.pl' too (intermittently). So, it is not limited to "t/003_logical_slots.pl" test alone. It is more closely associated with the pg_upgrade command run. -- Thanks, Nisha Moond
At Mon, 6 Nov 2023 19:42:21 +0530, Nisha Moond <nisha.moond412@gmail.com> wrote in > > Appending '2>&1 test: > > The command still results in NULL and ends up failing as no data is > > returned. Which means even no error message is returned. The error log Thanks for confirmation. So, at least the child process was launced successfully in the cmd.exe's view. Upon a quick check on my end with Windows' _popen, I have obseved the following: - Once a child process is started, it seems to go undetected as an error by _popen or subsequent fgets calls if the process ends abnormally, with a non-zero exit status or even with a SEGV. - After the child process has flushed data to stdout, it is possible to read from the pipe even if the child process crashes or ends thereafter. - Even if fgets is called before the program starts, it will correctly block until the program outputs something. Specifically, when I used popen("sleep 5 & target.exe") and immediately performed fgets on the pipe, I was able to read the output of target.exe as the first line. Therefore, based on the information available, it is conceivable that the child process was killed by something right after it started, or the program terminated on its own without any error messages. By the way, in the case of aforementioned SEGV, Application Errors corresponding to it were identifiable in the Event Viewer. Additionally, regarding the exit statuses, they can be captured by using a wrapper batch file (.bat) that records %ERRORLEVEL% after running the target program. This may yield insights, aothough its effectiveness is not guaranteed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, The same intermittent failure is reproducible on my system. For the intermittent issues I found that many issues are due to errors where commands like 'psql -V' are not returning any output. To reproduce it in an easy way, I wrote a script (.bat file) with '--version' option for different binaries. And found out that it was not giving any output for some command (varies for each run). Then I tried to run the same script after adding 'fflush(stdout)' in the function called with '--version' option and it started to give output for each command. I noticed the same for '--help' option and did the changes for the same. I have attached the test script(changes the extension to .txt as gmail is blocking it), output of test before the changes. I have also attached the patch with changes which resolved the above issue. This change has resolved most of the intermittent issues for me. I am facing some more intermittent issues. Will analyse and share it as well. Thanks and regards Shlok Kyal On Tue, 7 Nov 2023 at 11:05, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Mon, 6 Nov 2023 19:42:21 +0530, Nisha Moond <nisha.moond412@gmail.com> wrote in > > > Appending '2>&1 test: > > > The command still results in NULL and ends up failing as no data is > > > returned. Which means even no error message is returned. The error log > > Thanks for confirmation. So, at least the child process was launced > successfully in the cmd.exe's view. > > Upon a quick check on my end with Windows' _popen, I have obseved the > following: > > - Once a child process is started, it seems to go undetected as an > error by _popen or subsequent fgets calls if the process ends > abnormally, with a non-zero exit status or even with a SEGV. > > - After the child process has flushed data to stdout, it is possible > to read from the pipe even if the child process crashes or ends > thereafter. > > - Even if fgets is called before the program starts, it will correctly > block until the program outputs something. Specifically, when I used > popen("sleep 5 & target.exe") and immediately performed fgets on the > pipe, I was able to read the output of target.exe as the first line. > > Therefore, based on the information available, it is conceivable that > the child process was killed by something right after it started, or > the program terminated on its own without any error messages. > > By the way, in the case of aforementioned SEGV, Application Errors > corresponding to it were identifiable in the Event > Viewer. Additionally, regarding the exit statuses, they can be > captured by using a wrapper batch file (.bat) that records > %ERRORLEVEL% after running the target program. This may yield > insights, aothough its effectiveness is not guaranteed. > > regards. > > -- > Kyotaro Horiguchi > NTT Open Source Software Center > >
Attachment
Thanks for working on it. I tested the patch on my system and it resolved the issue with commands running -V (version check). As you mentioned, I am also still seeing intermittent errors even with the patch as below - in 'pg_upgrade/002_pg_upgrade' - # Running: pg_upgrade --no-sync -d D:\Project\pg2\postgres\build/testrun/pg_upgrade/002_pg_upgrade\data/t_002_pg_upgrade_old_node_data/pgdata -D D:\Project\pg2\postgres\build/testrun/pg_upgrade/002_pg_upgrade\data/t_002_pg_upgrade_new_node_data/pgdata -b D:/Project/pg2/postgres/build/tmp_install/Project/pg2/postgresql/bin -B D:/Project/pg2/postgres/build/tmp_install/Project/pg2/postgresql/bin -s 127.0.0.1 -p 56095 -P 56096 --copy --check Performing Consistency Checks ----------------------------- Checking cluster versions ok The source cluster lacks cluster state information: Failure, exiting [12:37:38.666](3.317s) not ok 12 - run of pg_upgrade --check for new instance [12:37:38.666](0.000s) # Failed test 'run of pg_upgrade --check for new instance' # at D:/Project/pg2/postgres/src/bin/pg_upgrade/t/002_pg_upgrade.pl line 375. and in 'pg_upgrade/003_logical_slots' - [12:35:33.773](0.001s) not ok 6 - run of pg_upgrade of old cluster with slots having unconsumed WAL records stdout /(?^:Your installation contains logical replication slots that can't be upgraded.)/ [12:35:33.773](0.000s) # Failed test 'run of pg_upgrade of old cluster with slots having unconsumed WAL records stdout /(?^:Your installation contains logical replication slots that can't be upgraded.)/' # at D:/Project/pg2/postgres/src/bin/pg_upgrade/t/003_logical_slots.pl line 102. [12:35:33.773](0.000s) # 'Performing Consistency Checks # ----------------------------- # Checking cluster versions ok # # The target cluster lacks cluster state information: # Failure, exiting It seems 'Performing Consistency Checks' fail due to a lack of some information and possible that it can also be fixed on the same lines. -- Thanks, Nisha
Hi, Apart of these I am getting following some intermittent failure as below: 131/272 postgresql:pg_basebackup / pg_basebackup/010_pg_basebackup ERROR 30.51s (exit status 255 or signal 127 SIGinvalid) 114/272 postgresql:libpq / libpq/001_uri ERROR 9.66s exit status 8 34/272 postgresql:pg_upgrade / pg_upgrade/003_logical_slots ERROR 99.14s exit status 1 186/272 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade ERROR 306.22s exit status 1 29/272 postgresql:recovery / recovery/002_archiving ERROR 89.62s (exit status 255 or signal 127 SIGinvalid) 138/272 postgresql:pg_resetwal / pg_resetwal/001_basic ERROR 3.93s (exit status 255 or signal 127 SIGinvalid) Have attached the regress logs for the same as well. Thanks and Regards Shlok Kyal On Tue, 26 Dec 2023 at 17:39, Shlok Kyal <shlok.kyal.oss@gmail.com> wrote: > > Hi, > The same intermittent failure is reproducible on my system. > For the intermittent issues I found that many issues are due to errors > where commands like 'psql -V' are not returning any output. > To reproduce it in an easy way, I wrote a script (.bat file) with > '--version' option for different binaries. And found out that it was > not giving any output for some command (varies for each run). > Then I tried to run the same script after adding 'fflush(stdout)' in > the function called with '--version' option and it started to give > output for each command. > I noticed the same for '--help' option and did the changes for the same. > > I have attached the test script(changes the extension to .txt as gmail > is blocking it), output of test before the changes. > I have also attached the patch with changes which resolved the above issue. > > This change has resolved most of the intermittent issues for me. I am > facing some more intermittent issues. Will analyse and share it as > well. > > Thanks and regards > Shlok Kyal > > On Tue, 7 Nov 2023 at 11:05, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > At Mon, 6 Nov 2023 19:42:21 +0530, Nisha Moond <nisha.moond412@gmail.com> wrote in > > > > Appending '2>&1 test: > > > > The command still results in NULL and ends up failing as no data is > > > > returned. Which means even no error message is returned. The error log > > > > Thanks for confirmation. So, at least the child process was launced > > successfully in the cmd.exe's view. > > > > Upon a quick check on my end with Windows' _popen, I have obseved the > > following: > > > > - Once a child process is started, it seems to go undetected as an > > error by _popen or subsequent fgets calls if the process ends > > abnormally, with a non-zero exit status or even with a SEGV. > > > > - After the child process has flushed data to stdout, it is possible > > to read from the pipe even if the child process crashes or ends > > thereafter. > > > > - Even if fgets is called before the program starts, it will correctly > > block until the program outputs something. Specifically, when I used > > popen("sleep 5 & target.exe") and immediately performed fgets on the > > pipe, I was able to read the output of target.exe as the first line. > > > > Therefore, based on the information available, it is conceivable that > > the child process was killed by something right after it started, or > > the program terminated on its own without any error messages. > > > > By the way, in the case of aforementioned SEGV, Application Errors > > corresponding to it were identifiable in the Event > > Viewer. Additionally, regarding the exit statuses, they can be > > captured by using a wrapper batch file (.bat) that records > > %ERRORLEVEL% after running the target program. This may yield > > insights, aothough its effectiveness is not guaranteed. > > > > regards. > > > > -- > > Kyotaro Horiguchi > > NTT Open Source Software Center > > > >