Thread: [HACKERS] Testlib.pm vs msys

[HACKERS] Testlib.pm vs msys

From
Andrew Dunstan
Date:
It turns out I was wrong about the problem jacana has been having with
the pg_ctl tests hanging. The problem was not the use of select as a
timeout mechanism, although I think the change to using
Time::Hires::usleep() is correct and shouldn't be reverted.

The problem is command_like's use of redirection to strings. Why this
should be a problem for this particular use is a matter of speculation.
I suspect it's to do with the fact that in this instance pg_ctl is
leaving behind some child processes (i.e. postmaster and children) after
it exits, and so on this particular path IPC::Run isn't detecting the
exit properly. The workaround I have found to work is to redirect
command_like's output instead to a couple of files and then slurp in
those files and delete them. A bit hacky, I know, so I'm open to other
suggestions.


cheers


andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Testlib.pm vs msys

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> It turns out I was wrong about the problem jacana has been having with
> the pg_ctl tests hanging. The problem was not the use of select as a
> timeout mechanism, although I think the change to using
> Time::Hires::usleep() is correct and shouldn't be reverted.

> The problem is command_like's use of redirection to strings. Why this
> should be a problem for this particular use is a matter of speculation.
> I suspect it's to do with the fact that in this instance pg_ctl is
> leaving behind some child processes (i.e. postmaster and children) after
> it exits, and so on this particular path IPC::Run isn't detecting the
> exit properly. The workaround I have found to work is to redirect
> command_like's output instead to a couple of files and then slurp in
> those files and delete them. A bit hacky, I know, so I'm open to other
> suggestions.

Yeah, I'd been eyeing that behavior of IPC::Run a month or so back,
though from the opposite direction.  If you are reading either stdout
or stderr of the executed command into Perl, then it detects command
completion by waiting till it gets EOF on those stream(s).  If you
are reading neither, then it goes into this wonky backoff behavior
where it sleeps a bit and then checks waitpid(WNOHANG), with the
value of "a bit" continually increasing until it reaches a fairly
large value, half a second or a second (I forget).  So you have
potentially some sizable fraction of a second that's just wasted after
command termination.  I'd been able to make a small but noticeable
improvement in the runtime of some of our TAP test suites by forcing
the first behavior, ie reading stdout even if we were going to throw
it away.

So I'm not really that excited about going in the other direction ;-).
It shouldn't matter much time-wise for short-lived commands, but it's
disturbing if the EOF technique fails entirely for some cases.

I looked at jacana's two recent pg_ctlCheck failures, and they both
seem to have failed on this:

command_like([ 'pg_ctl', 'start', '-D', "$tempdir/data",'-l', "$TestLib::log_path/001_start_stop_server.log"
],qr/done.*serverstarted/s, 'pg_ctl start');
 

That is redirecting the postmaster's stdout/stderr into a file,
for sure, so the child processes shouldn't impact EOF detection AFAICS.
It's also hard to explain this way why it only fails some of the time.

I think we need to look at what the recent changes were in this area
and try to form a better theory of why it's started to fail here.
        regards, tom lane



Re: [HACKERS] Testlib.pm vs msys

From
Tom Lane
Date:
I wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> The problem is command_like's use of redirection to strings. Why this
>> should be a problem for this particular use is a matter of speculation.

> I looked at jacana's two recent pg_ctlCheck failures, and they both
> seem to have failed on this:
> command_like([ 'pg_ctl', 'start', '-D', "$tempdir/data",
>     '-l', "$TestLib::log_path/001_start_stop_server.log" ],
>     qr/done.*server started/s, 'pg_ctl start');
> That is redirecting the postmaster's stdout/stderr into a file,
> for sure, so the child processes shouldn't impact EOF detection AFAICS.
> It's also hard to explain this way why it only fails some of the time.

I reflected a bit more on this issue, and realized that there's a pretty
obvious mechanism for this to happen.  While on Unix, we have pg_ctl
fork-and-exec the postmaster so that no extra processes are laying about,
that's not the case on Windows.  The Windows code in pg_ctl.c creates a
subprocess that runs CMD.EXE, which in turn runs the postmaster as a
subprocess.  The CMD.EXE process doesn't disappear until the postmaster
exits.  Now, we tell CMD to redirect the postmaster's stdout and stderr
into files, but there's no way to tell CMD to redirect its own handles.
So if the CMD process inherits pg_ctl's stdout and stderr, then the prove
process would not see EOF on those pipes after pg_ctl exits.

Now, this theory still has a Mack-truck-sized hole in it, which is
if that's the failure mechanism then why isn't it 100% effective?
Seems like the TAP test should fail every time, if this is a full
description.  But maybe it's part of the answer, so it seems worth
experimenting in this direction.

A bit of googling Microsoft's documentation suggests that maybe all
we have to do is pass CreateProcessAsUser's bInheritHandles parameter
as FALSE not TRUE in pg_ctl.c.  It's not apparent to me that there
are any handles we do need the CMD process to inherit.
        regards, tom lane



Re: [HACKERS] Testlib.pm vs msys

From
Andrew Dunstan
Date:

On 07/24/2017 09:33 PM, Tom Lane wrote:
>
> [good theory about why pg_ctl hangs in TAP test]
>
> Now, this theory still has a Mack-truck-sized hole in it, which is
> if that's the failure mechanism then why isn't it 100% effective?
> Seems like the TAP test should fail every time, if this is a full
> description.  But maybe it's part of the answer, so it seems worth
> experimenting in this direction.


The test that hangs is the only case where we call pg_ctl via
command_like. If we use other mechanisms such as command_ok that don't
try to read the output there is no problem.

Another data point is that this test doesn't hang on bowerbird, which is
an animal on the same machine but doing an MSVC build. Thus my thesis
that it's probably to do with the imteraction with the MSys perl and shell.

A simple workaround might be to provide two flavors of command_like, one
that uses files it then slurps in and one that uses direct scalars and
EOF detection.

cheers

andrew


>
> A bit of googling Microsoft's documentation suggests that maybe all
> we have to do is pass CreateProcessAsUser's bInheritHandles parameter
> as FALSE not TRUE in pg_ctl.c.  It's not apparent to me that there
> are any handles we do need the CMD process to inherit.
>
>             


Maybe.

cheers

andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Testlib.pm vs msys

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 07/24/2017 09:33 PM, Tom Lane wrote:
>> Seems like the TAP test should fail every time, if this is a full
>> description.  But maybe it's part of the answer, so it seems worth
>> experimenting in this direction.

> The test that hangs is the only case where we call pg_ctl via
> command_like. If we use other mechanisms such as command_ok that don't
> try to read the output there is no problem.

What I'm on about is that jacana still succeeds entirely, more than half
the time.  If this is a handle-duplication problem, how could it ever
succeed?
        regards, tom lane



Re: [HACKERS] Testlib.pm vs msys

From
Andrew Dunstan
Date:

On 07/25/2017 11:11 AM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> On 07/24/2017 09:33 PM, Tom Lane wrote:
>>> Seems like the TAP test should fail every time, if this is a full
>>> description.  But maybe it's part of the answer, so it seems worth
>>> experimenting in this direction.
>> The test that hangs is the only case where we call pg_ctl via
>> command_like. If we use other mechanisms such as command_ok that don't
>> try to read the output there is no problem.
> What I'm on about is that jacana still succeeds entirely, more than half
> the time.  If this is a handle-duplication problem, how could it ever
> succeed?
>



No, it hangs 100% of the time. The only time you see a result at all is
if I manually intervene. The pg_ctl test is thus currently disabled  on
jacana.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Testlib.pm vs msys

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 07/25/2017 11:11 AM, Tom Lane wrote:
>> What I'm on about is that jacana still succeeds entirely, more than half
>> the time.  If this is a handle-duplication problem, how could it ever
>> succeed?

> No, it hangs 100% of the time. The only time you see a result at all is
> if I manually intervene. The pg_ctl test is thus currently disabled  on
> jacana.

Oh.  So when was the last time it worked?  And why do the buildfarm
archives contain apparently-successful log_stage entries for pg_ctl-check
on jacana, up through yesterday when I looked?
        regards, tom lane



Re: [HACKERS] Testlib.pm vs msys

From
Andrew Dunstan
Date:

On 07/25/2017 11:25 AM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> On 07/25/2017 11:11 AM, Tom Lane wrote:
>>> What I'm on about is that jacana still succeeds entirely, more than half
>>> the time.  If this is a handle-duplication problem, how could it ever
>>> succeed?
>> No, it hangs 100% of the time. The only time you see a result at all is
>> if I manually intervene. The pg_ctl test is thus currently disabled  on
>> jacana.
> Oh.  So when was the last time it worked?  And why do the buildfarm
> archives contain apparently-successful log_stage entries for pg_ctl-check
> on jacana, up through yesterday when I looked?
>
>             


There was a buildfarm bug, since corrected, which was not properly
picking up errors in a corner case. You will see the errors if you look
at all the logs for pg_ctl-check from 12 days back if they exist. The
last known actual good run of this test was 33 days ago, i.e. 2017-06-22

cheers

andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Testlib.pm vs msys

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 07/25/2017 11:25 AM, Tom Lane wrote:
>> Oh.  So when was the last time it worked?  And why do the buildfarm
>> archives contain apparently-successful log_stage entries for pg_ctl-check
>> on jacana, up through yesterday when I looked?

> There was a buildfarm bug, since corrected, which was not properly
> picking up errors in a corner case. You will see the errors if you look
> at all the logs for pg_ctl-check from 12 days back if they exist. The
> last known actual good run of this test was 33 days ago, i.e. 2017-06-22

Hm.  This failure presumably started with commit f13ea95f9, which
introduced use of command_like() in the pg_ctl TAP test; but that
didn't go in until 2017-06-28.  Was jacana not running in the
interim?

Anyway, if we believe that it broke with f13ea95f9, hen it's plausible
that CMD.EXE has been sharing pg_ctl's stdout/stderr all along, and we
just had not noticed before.  Could you check what happens if we
change the bInheritHandles parameter as I suggested upthread?
        regards, tom lane



Re: [HACKERS] Testlib.pm vs msys

From
Andrew Dunstan
Date:

On 07/25/2017 01:41 PM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> On 07/25/2017 11:25 AM, Tom Lane wrote:
>>> Oh.  So when was the last time it worked?  And why do the buildfarm
>>> archives contain apparently-successful log_stage entries for pg_ctl-check
>>> on jacana, up through yesterday when I looked?
>> There was a buildfarm bug, since corrected, which was not properly
>> picking up errors in a corner case. You will see the errors if you look
>> at all the logs for pg_ctl-check from 12 days back if they exist. The
>> last known actual good run of this test was 33 days ago, i.e. 2017-06-22
> Hm.  This failure presumably started with commit f13ea95f9, which
> introduced use of command_like() in the pg_ctl TAP test; but that
> didn't go in until 2017-06-28.  Was jacana not running in the
> interim?


Looks like not, possibly because one of the other machines on this box
was hung (probably lorikeet).


>
> Anyway, if we believe that it broke with f13ea95f9, hen it's plausible
> that CMD.EXE has been sharing pg_ctl's stdout/stderr all along, and we
> just had not noticed before.  Could you check what happens if we
> change the bInheritHandles parameter as I suggested upthread?
>
>             


I'll try when I get all the animals caught up.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Testlib.pm vs msys

From
Andrew Dunstan
Date:

On 07/25/2017 02:45 PM, Andrew Dunstan wrote:
>> Anyway, if we believe that it broke with f13ea95f9, hen it's plausible
>> that CMD.EXE has been sharing pg_ctl's stdout/stderr all along, and we
>> just had not noticed before.  Could you check what happens if we
>> change the bInheritHandles parameter as I suggested upthread?
>>
>>             
>
> I'll try when I get all the animals caught up.
>


This made no difference. And I'm not really surprised, since the test is
not hanging when run from an MSVC build. The latter fact makes me
theorize that the problem arises from the fairly ancient perl that Msys
provides, and which we need to use to run prove so the TAP tests
understand the environment's virtualized file paths.


The attached patch should get around the problem without upsetting the
good work you've been doing in reducing TAP test times generally.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Testlib.pm vs msys

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> This made no difference. And I'm not really surprised, since the test is
> not hanging when run from an MSVC build.

Oh well.

> The latter fact makes me
> theorize that the problem arises from the fairly ancient perl that Msys
> provides, and which we need to use to run prove so the TAP tests
> understand the environment's virtualized file paths.

It might be interesting to look into its copy of IPC::Run and see if
the wait technology is different from later versions.

> The attached patch should get around the problem without upsetting the
> good work you've been doing in reducing TAP test times generally.

I still find this pretty ugly :-(.
        regards, tom lane



Re: [HACKERS] Testlib.pm vs msys

From
Andrew Dunstan
Date:

On 07/26/2017 09:33 AM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>
>> The latter fact makes me
>> theorize that the problem arises from the fairly ancient perl that Msys
>> provides, and which we need to use to run prove so the TAP tests
>> understand the environment's virtualized file paths.
> It might be interesting to look into its copy of IPC::Run and see if
> the wait technology is different from later versions.


It's using 0.94 just like my Linux box.

>
>> The attached patch should get around the problem without upsetting the
>> good work you've been doing in reducing TAP test times generally.
> I still find this pretty ugly :-(.
>
>             

I'm open to alternative suggestions. But I don't want to spend a heck of
a lot of time on this.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Testlib.pm vs msys

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 07/26/2017 09:33 AM, Tom Lane wrote:
>> It might be interesting to look into its copy of IPC::Run and see if
>> the wait technology is different from later versions.

> It's using 0.94 just like my Linux box.

Oh well, I hoped maybe we could update that.

>> I still find this pretty ugly :-(.

> I'm open to alternative suggestions. But I don't want to spend a heck of
> a lot of time on this.

Well, let's at least do the temp files a little more safely.
Maybe like this?

            regards, tom lane

diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl
index 9c3551f..3acc80e 100644
--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -32,9 +32,17 @@ else
     print $conf "listen_addresses = '127.0.0.1'\n";
 }
 close $conf;
-command_like([ 'pg_ctl', 'start', '-D', "$tempdir/data",
-    '-l', "$TestLib::log_path/001_start_stop_server.log" ],
-    qr/done.*server started/s, 'pg_ctl start');
+my $ctlcmd = [ 'pg_ctl', 'start', '-D', "$tempdir/data",
+               '-l', "$TestLib::log_path/001_start_stop_server.log" ];
+if ($Config{osname} ne 'msys')
+{
+    command_like($ctlcmd, qr/done.*server started/s, 'pg_ctl start');
+}
+else
+{
+    # use the version of command_like that doesn't hang on Msys here
+    command_like_safe($ctlcmd, qr/done.*server started/s, 'pg_ctl start');
+}

 # sleep here is because Windows builds can't check postmaster.pid exactly,
 # so they may mistake a pre-existing postmaster.pid for one created by the
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index fe09689..758c920 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -37,6 +37,7 @@ our @EXPORT = qw(
   program_version_ok
   program_options_handling_ok
   command_like
+  command_like_safe
   command_fails_like

   $windows_os
@@ -300,6 +301,24 @@ sub command_like
     like($stdout, $expected_stdout, "$test_name: matches");
 }

+sub command_like_safe
+{
+    # Doesn't rely on detecting end of file on the file descriptors,
+    # which can fail, causing the process to hang, notably on Msys
+    # when used with 'pg_ctl start'
+    my ($cmd, $expected_stdout, $test_name) = @_;
+    my ($stdout, $stderr);
+    my $stdoutfile = File::Temp->new();
+    my $stderrfile = File::Temp->new();
+    print("# Running: " . join(" ", @{$cmd}) . "\n");
+    my $result = IPC::Run::run $cmd, '>', $stdoutfile, '2>', $stderrfile;
+    $stdout = slurp_file($stdoutfile);
+    $stderr = slurp_file($stderrfile);
+    ok($result, "$test_name: exit code 0");
+    is($stderr, '', "$test_name: no stderr");
+    like($stdout, $expected_stdout, "$test_name: matches");
+}
+
 sub command_fails_like
 {
     my ($cmd, $expected_stderr, $test_name) = @_;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Testlib.pm vs msys

From
Andrew Dunstan
Date:

On 07/26/2017 11:12 AM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>
>>> I still find this pretty ugly :-(.
>> I'm open to alternative suggestions. But I don't want to spend a heck of
>> a lot of time on this.
> Well, let's at least do the temp files a little more safely.
> Maybe like this?
>
>             


OK  tested with that and it works.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services