Thread: [HACKERS] Testlib.pm vs msys
It turns out I was wrong about the problem jacana has been having with the pg_ctl tests hanging. The problem was not the use of select as a timeout mechanism, although I think the change to using Time::Hires::usleep() is correct and shouldn't be reverted. The problem is command_like's use of redirection to strings. Why this should be a problem for this particular use is a matter of speculation. I suspect it's to do with the fact that in this instance pg_ctl is leaving behind some child processes (i.e. postmaster and children) after it exits, and so on this particular path IPC::Run isn't detecting the exit properly. The workaround I have found to work is to redirect command_like's output instead to a couple of files and then slurp in those files and delete them. A bit hacky, I know, so I'm open to other suggestions. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > It turns out I was wrong about the problem jacana has been having with > the pg_ctl tests hanging. The problem was not the use of select as a > timeout mechanism, although I think the change to using > Time::Hires::usleep() is correct and shouldn't be reverted. > The problem is command_like's use of redirection to strings. Why this > should be a problem for this particular use is a matter of speculation. > I suspect it's to do with the fact that in this instance pg_ctl is > leaving behind some child processes (i.e. postmaster and children) after > it exits, and so on this particular path IPC::Run isn't detecting the > exit properly. The workaround I have found to work is to redirect > command_like's output instead to a couple of files and then slurp in > those files and delete them. A bit hacky, I know, so I'm open to other > suggestions. Yeah, I'd been eyeing that behavior of IPC::Run a month or so back, though from the opposite direction. If you are reading either stdout or stderr of the executed command into Perl, then it detects command completion by waiting till it gets EOF on those stream(s). If you are reading neither, then it goes into this wonky backoff behavior where it sleeps a bit and then checks waitpid(WNOHANG), with the value of "a bit" continually increasing until it reaches a fairly large value, half a second or a second (I forget). So you have potentially some sizable fraction of a second that's just wasted after command termination. I'd been able to make a small but noticeable improvement in the runtime of some of our TAP test suites by forcing the first behavior, ie reading stdout even if we were going to throw it away. So I'm not really that excited about going in the other direction ;-). It shouldn't matter much time-wise for short-lived commands, but it's disturbing if the EOF technique fails entirely for some cases. I looked at jacana's two recent pg_ctlCheck failures, and they both seem to have failed on this: command_like([ 'pg_ctl', 'start', '-D', "$tempdir/data",'-l', "$TestLib::log_path/001_start_stop_server.log" ],qr/done.*serverstarted/s, 'pg_ctl start'); That is redirecting the postmaster's stdout/stderr into a file, for sure, so the child processes shouldn't impact EOF detection AFAICS. It's also hard to explain this way why it only fails some of the time. I think we need to look at what the recent changes were in this area and try to form a better theory of why it's started to fail here. regards, tom lane
I wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> The problem is command_like's use of redirection to strings. Why this >> should be a problem for this particular use is a matter of speculation. > I looked at jacana's two recent pg_ctlCheck failures, and they both > seem to have failed on this: > command_like([ 'pg_ctl', 'start', '-D', "$tempdir/data", > '-l', "$TestLib::log_path/001_start_stop_server.log" ], > qr/done.*server started/s, 'pg_ctl start'); > That is redirecting the postmaster's stdout/stderr into a file, > for sure, so the child processes shouldn't impact EOF detection AFAICS. > It's also hard to explain this way why it only fails some of the time. I reflected a bit more on this issue, and realized that there's a pretty obvious mechanism for this to happen. While on Unix, we have pg_ctl fork-and-exec the postmaster so that no extra processes are laying about, that's not the case on Windows. The Windows code in pg_ctl.c creates a subprocess that runs CMD.EXE, which in turn runs the postmaster as a subprocess. The CMD.EXE process doesn't disappear until the postmaster exits. Now, we tell CMD to redirect the postmaster's stdout and stderr into files, but there's no way to tell CMD to redirect its own handles. So if the CMD process inherits pg_ctl's stdout and stderr, then the prove process would not see EOF on those pipes after pg_ctl exits. Now, this theory still has a Mack-truck-sized hole in it, which is if that's the failure mechanism then why isn't it 100% effective? Seems like the TAP test should fail every time, if this is a full description. But maybe it's part of the answer, so it seems worth experimenting in this direction. A bit of googling Microsoft's documentation suggests that maybe all we have to do is pass CreateProcessAsUser's bInheritHandles parameter as FALSE not TRUE in pg_ctl.c. It's not apparent to me that there are any handles we do need the CMD process to inherit. regards, tom lane
On 07/24/2017 09:33 PM, Tom Lane wrote: > > [good theory about why pg_ctl hangs in TAP test] > > Now, this theory still has a Mack-truck-sized hole in it, which is > if that's the failure mechanism then why isn't it 100% effective? > Seems like the TAP test should fail every time, if this is a full > description. But maybe it's part of the answer, so it seems worth > experimenting in this direction. The test that hangs is the only case where we call pg_ctl via command_like. If we use other mechanisms such as command_ok that don't try to read the output there is no problem. Another data point is that this test doesn't hang on bowerbird, which is an animal on the same machine but doing an MSVC build. Thus my thesis that it's probably to do with the imteraction with the MSys perl and shell. A simple workaround might be to provide two flavors of command_like, one that uses files it then slurps in and one that uses direct scalars and EOF detection. cheers andrew > > A bit of googling Microsoft's documentation suggests that maybe all > we have to do is pass CreateProcessAsUser's bInheritHandles parameter > as FALSE not TRUE in pg_ctl.c. It's not apparent to me that there > are any handles we do need the CMD process to inherit. > > Maybe. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On 07/24/2017 09:33 PM, Tom Lane wrote: >> Seems like the TAP test should fail every time, if this is a full >> description. But maybe it's part of the answer, so it seems worth >> experimenting in this direction. > The test that hangs is the only case where we call pg_ctl via > command_like. If we use other mechanisms such as command_ok that don't > try to read the output there is no problem. What I'm on about is that jacana still succeeds entirely, more than half the time. If this is a handle-duplication problem, how could it ever succeed? regards, tom lane
On 07/25/2017 11:11 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> On 07/24/2017 09:33 PM, Tom Lane wrote: >>> Seems like the TAP test should fail every time, if this is a full >>> description. But maybe it's part of the answer, so it seems worth >>> experimenting in this direction. >> The test that hangs is the only case where we call pg_ctl via >> command_like. If we use other mechanisms such as command_ok that don't >> try to read the output there is no problem. > What I'm on about is that jacana still succeeds entirely, more than half > the time. If this is a handle-duplication problem, how could it ever > succeed? > No, it hangs 100% of the time. The only time you see a result at all is if I manually intervene. The pg_ctl test is thus currently disabled on jacana. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On 07/25/2017 11:11 AM, Tom Lane wrote: >> What I'm on about is that jacana still succeeds entirely, more than half >> the time. If this is a handle-duplication problem, how could it ever >> succeed? > No, it hangs 100% of the time. The only time you see a result at all is > if I manually intervene. The pg_ctl test is thus currently disabled on > jacana. Oh. So when was the last time it worked? And why do the buildfarm archives contain apparently-successful log_stage entries for pg_ctl-check on jacana, up through yesterday when I looked? regards, tom lane
On 07/25/2017 11:25 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> On 07/25/2017 11:11 AM, Tom Lane wrote: >>> What I'm on about is that jacana still succeeds entirely, more than half >>> the time. If this is a handle-duplication problem, how could it ever >>> succeed? >> No, it hangs 100% of the time. The only time you see a result at all is >> if I manually intervene. The pg_ctl test is thus currently disabled on >> jacana. > Oh. So when was the last time it worked? And why do the buildfarm > archives contain apparently-successful log_stage entries for pg_ctl-check > on jacana, up through yesterday when I looked? > > There was a buildfarm bug, since corrected, which was not properly picking up errors in a corner case. You will see the errors if you look at all the logs for pg_ctl-check from 12 days back if they exist. The last known actual good run of this test was 33 days ago, i.e. 2017-06-22 cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On 07/25/2017 11:25 AM, Tom Lane wrote: >> Oh. So when was the last time it worked? And why do the buildfarm >> archives contain apparently-successful log_stage entries for pg_ctl-check >> on jacana, up through yesterday when I looked? > There was a buildfarm bug, since corrected, which was not properly > picking up errors in a corner case. You will see the errors if you look > at all the logs for pg_ctl-check from 12 days back if they exist. The > last known actual good run of this test was 33 days ago, i.e. 2017-06-22 Hm. This failure presumably started with commit f13ea95f9, which introduced use of command_like() in the pg_ctl TAP test; but that didn't go in until 2017-06-28. Was jacana not running in the interim? Anyway, if we believe that it broke with f13ea95f9, hen it's plausible that CMD.EXE has been sharing pg_ctl's stdout/stderr all along, and we just had not noticed before. Could you check what happens if we change the bInheritHandles parameter as I suggested upthread? regards, tom lane
On 07/25/2017 01:41 PM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> On 07/25/2017 11:25 AM, Tom Lane wrote: >>> Oh. So when was the last time it worked? And why do the buildfarm >>> archives contain apparently-successful log_stage entries for pg_ctl-check >>> on jacana, up through yesterday when I looked? >> There was a buildfarm bug, since corrected, which was not properly >> picking up errors in a corner case. You will see the errors if you look >> at all the logs for pg_ctl-check from 12 days back if they exist. The >> last known actual good run of this test was 33 days ago, i.e. 2017-06-22 > Hm. This failure presumably started with commit f13ea95f9, which > introduced use of command_like() in the pg_ctl TAP test; but that > didn't go in until 2017-06-28. Was jacana not running in the > interim? Looks like not, possibly because one of the other machines on this box was hung (probably lorikeet). > > Anyway, if we believe that it broke with f13ea95f9, hen it's plausible > that CMD.EXE has been sharing pg_ctl's stdout/stderr all along, and we > just had not noticed before. Could you check what happens if we > change the bInheritHandles parameter as I suggested upthread? > > I'll try when I get all the animals caught up. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 07/25/2017 02:45 PM, Andrew Dunstan wrote: >> Anyway, if we believe that it broke with f13ea95f9, hen it's plausible >> that CMD.EXE has been sharing pg_ctl's stdout/stderr all along, and we >> just had not noticed before. Could you check what happens if we >> change the bInheritHandles parameter as I suggested upthread? >> >> > > I'll try when I get all the animals caught up. > This made no difference. And I'm not really surprised, since the test is not hanging when run from an MSVC build. The latter fact makes me theorize that the problem arises from the fairly ancient perl that Msys provides, and which we need to use to run prove so the TAP tests understand the environment's virtualized file paths. The attached patch should get around the problem without upsetting the good work you've been doing in reducing TAP test times generally. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > This made no difference. And I'm not really surprised, since the test is > not hanging when run from an MSVC build. Oh well. > The latter fact makes me > theorize that the problem arises from the fairly ancient perl that Msys > provides, and which we need to use to run prove so the TAP tests > understand the environment's virtualized file paths. It might be interesting to look into its copy of IPC::Run and see if the wait technology is different from later versions. > The attached patch should get around the problem without upsetting the > good work you've been doing in reducing TAP test times generally. I still find this pretty ugly :-(. regards, tom lane
On 07/26/2017 09:33 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > >> The latter fact makes me >> theorize that the problem arises from the fairly ancient perl that Msys >> provides, and which we need to use to run prove so the TAP tests >> understand the environment's virtualized file paths. > It might be interesting to look into its copy of IPC::Run and see if > the wait technology is different from later versions. It's using 0.94 just like my Linux box. > >> The attached patch should get around the problem without upsetting the >> good work you've been doing in reducing TAP test times generally. > I still find this pretty ugly :-(. > > I'm open to alternative suggestions. But I don't want to spend a heck of a lot of time on this. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On 07/26/2017 09:33 AM, Tom Lane wrote: >> It might be interesting to look into its copy of IPC::Run and see if >> the wait technology is different from later versions. > It's using 0.94 just like my Linux box. Oh well, I hoped maybe we could update that. >> I still find this pretty ugly :-(. > I'm open to alternative suggestions. But I don't want to spend a heck of > a lot of time on this. Well, let's at least do the temp files a little more safely. Maybe like this? regards, tom lane diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl index 9c3551f..3acc80e 100644 --- a/src/bin/pg_ctl/t/001_start_stop.pl +++ b/src/bin/pg_ctl/t/001_start_stop.pl @@ -32,9 +32,17 @@ else print $conf "listen_addresses = '127.0.0.1'\n"; } close $conf; -command_like([ 'pg_ctl', 'start', '-D', "$tempdir/data", - '-l', "$TestLib::log_path/001_start_stop_server.log" ], - qr/done.*server started/s, 'pg_ctl start'); +my $ctlcmd = [ 'pg_ctl', 'start', '-D', "$tempdir/data", + '-l', "$TestLib::log_path/001_start_stop_server.log" ]; +if ($Config{osname} ne 'msys') +{ + command_like($ctlcmd, qr/done.*server started/s, 'pg_ctl start'); +} +else +{ + # use the version of command_like that doesn't hang on Msys here + command_like_safe($ctlcmd, qr/done.*server started/s, 'pg_ctl start'); +} # sleep here is because Windows builds can't check postmaster.pid exactly, # so they may mistake a pre-existing postmaster.pid for one created by the diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm index fe09689..758c920 100644 --- a/src/test/perl/TestLib.pm +++ b/src/test/perl/TestLib.pm @@ -37,6 +37,7 @@ our @EXPORT = qw( program_version_ok program_options_handling_ok command_like + command_like_safe command_fails_like $windows_os @@ -300,6 +301,24 @@ sub command_like like($stdout, $expected_stdout, "$test_name: matches"); } +sub command_like_safe +{ + # Doesn't rely on detecting end of file on the file descriptors, + # which can fail, causing the process to hang, notably on Msys + # when used with 'pg_ctl start' + my ($cmd, $expected_stdout, $test_name) = @_; + my ($stdout, $stderr); + my $stdoutfile = File::Temp->new(); + my $stderrfile = File::Temp->new(); + print("# Running: " . join(" ", @{$cmd}) . "\n"); + my $result = IPC::Run::run $cmd, '>', $stdoutfile, '2>', $stderrfile; + $stdout = slurp_file($stdoutfile); + $stderr = slurp_file($stderrfile); + ok($result, "$test_name: exit code 0"); + is($stderr, '', "$test_name: no stderr"); + like($stdout, $expected_stdout, "$test_name: matches"); +} + sub command_fails_like { my ($cmd, $expected_stderr, $test_name) = @_; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
On 07/26/2017 11:12 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > >>> I still find this pretty ugly :-(. >> I'm open to alternative suggestions. But I don't want to spend a heck of >> a lot of time on this. > Well, let's at least do the temp files a little more safely. > Maybe like this? > > OK tested with that and it works. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services