Thread: The real reason why TAP testing isn't ready for prime time
Buildfarm member hamster has failed a pretty significant fraction of its recent runs in the BinInstallCheck step:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&br=HEAD

Since other critters aren't equally distressed, it seems likely that this is just an out-of-disk-space type of problem. But maybe it's trying to tell us that there's a genuine platform-specific bug there. In any case, I challenge anyone to figure out what's happening from the information available from the buildfarm logs.

I don't know whether this is just that the buildfarm script isn't collecting data that it should be. But my experiences with the TAP test scripts haven't been very positive. When they fail, it takes a lot of digging to find out why. Basically, that entire mechanism sucks as far as debuggability is concerned.

I think there is a good argument for turning this off in the buildfarm until there is a better way of identifying and solving problems. It is not helping us that hamster is red half the time for undiscoverable reasons. That just conditions people to ignore it, and it may well be masking real problems that the machine could be finding if it weren't failing at this step.

regards, tom lane
On Mon, Jun 15, 2015 at 3:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Buildfarm member hamster has failed a pretty significant fraction of
> its recent runs in the BinInstallCheck step:
> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&br=HEAD
>
> Since other critters aren't equally distressed, it seems likely that
> this is just an out-of-disk-space type of problem. But maybe it's
> trying to tell us that there's a genuine platform-specific bug there.
> In any case, I challenge anyone to figure out what's happening from
> the information available from the buildfarm logs.
>
> I don't know whether this is just that the buildfarm script isn't
> collecting data that it should be. But my experiences with the
> TAP test scripts haven't been very positive. When they fail, it
> takes a lot of digging to find out why. Basically, that entire
> mechanism sucks as far as debuggability is concerned.

Indeed. I think that one step in the right direction would be to replace all the calls to system and system_or_bail with a wrapper routine that calls IPC::Run, able to catch the logs and store them in each test's base path. The same applies to the pg_rewind tests.

> I think there is a good argument for turning this off in the buildfarm
> until there is a better way of identifying and solving problems. It is
> not helping us that hamster is red half the time for undiscoverable
> reasons. That just conditions people to ignore it, and it may well be
> masking real problems that the machine could be finding if it weren't
> failing at this step.

hamster is legendarily slow and has a slow disk, which improves the chances of catching race conditions, and it is the only slow buildfarm machine enabling the TAP tests (by comparison, dangomushi has never failed the TAP tests), hence I would prefer to think that the problem is not specific to ArchLinux ARM.
In this case the failure seems to be related to the timing with which the test servers stop and start, even when the -w switch is used with pg_ctl, particularly since PGPORT is set to the same value for all servers... Still, for the time being I don't mind disabling them, and I just did so now. I will try to investigate further on the machine itself.
--
Michael
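The wrapper routine Michael proposes above, replacing calls to system and system_or_bail with something that captures all command output into a per-test log file, could be sketched roughly like this in core Perl. (This is only a sketch of the idea; the routine name, log path, and use of shell redirection instead of IPC::Run are assumptions made here for illustration.)

```perl
use strict;
use warnings;
use File::Path qw(make_path);

# Sketch of a logging wrapper for test commands: instead of silencing a
# command or sending its output to /dev/null, append both stdout and stderr
# to a per-test log file that can be inspected after a buildfarm failure.
# (Hypothetical names; the real wrapper discussed here is built on IPC::Run.)
sub run_logged
{
	my ($log_path, @cmd) = @_;

	# Let the shell handle the redirection; true is returned on exit status 0.
	my $status = system("@cmd >> $log_path 2>&1");
	return $status == 0;
}

# Example usage: the command's output lands in the log, not on the terminal.
make_path('tmp_check/log');
run_logged('tmp_check/log/regress_log_example', 'echo', 'server starting')
    or die "command failed";
```

The point is simply that nothing a test runs should be able to fail invisibly: every child process's stdout and stderr ends up somewhere a buildfarm client can collect.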
On Mon, Jun 15, 2015 at 8:26 AM, Michael Paquier wrote:
> hamster is legendarily slow and has a slow disk, which improves the
> chances of catching race conditions, and it is the only slow buildfarm
> machine enabling the TAP tests (by comparison, dangomushi has never
> failed the TAP tests), hence I would prefer to think that the problem
> is not specific to ArchLinux ARM. In this case the failure seems to be
> related to the timing with which the test servers stop and start, even
> when the -w switch is used with pg_ctl, particularly since PGPORT is
> set to the same value for all servers... Still, for the time being I
> don't mind disabling them, and I just did so now. I will try to
> investigate further on the machine itself.

So, I have been making some progress on this thing, and wrote the attached patch, which improves the way the TAP tests log their output using IPC::Run::run. This simply creates a log file as regress_log/test_name for each test run, capturing all the command outputs. For example, on current HEAD some pg_ctl commands are simply silenced or have their output redirected to /dev/null; with this patch, all their output is redirected to the log file. This is also compatible with Windows, and the logging of the pg_rewind tests is unified with the new structure of TestLib.pm.

I am going to apply that to hamster and see what is causing the error reported. I think that it would be useful as well to improve the buildfarm output. Thoughts?

Regards,
--
Michael
Attachment
On Thu, Jun 18, 2015 at 3:52 PM, Michael Paquier wrote:
> I think that it would be useful as well to improve the buildfarm
> output. Thoughts?

And after running the tests more or less 6~7 times in a row on a Pi, I have been able to trigger the problem, and I think that I have found its origin. First, the error has been triggered by the tests of pg_rewind:

t/002_databases.pl ... 1..4
Bailout called.  Further testing stopped:  run pg_ctl failed: 256
Bail out!  run pg_ctl failed: 256
FAILED--Further testing stopped: run pg_ctl failed: 256
Makefile:51: recipe for target 'check' failed
make[1]: *** [check] Error 255

And by looking at the logs obtained thanks to the previous patch, I could see the following (log attached for tests 1 and 2):

$ tail -n5 regress_log/regress_log_002_databases
waiting for server to start........ stopped waiting
pg_ctl: could not start server
Examine the log output.
LOG: received immediate shutdown request
LOG: received immediate shutdown request

pg_ctl should be able to start the server and should not fail here. This is confirmed by the fact that the first test has not stopped the servers. On a clean run, the immediate shutdown request is received and completed:

waiting for server to shut down....LOG: received immediate shutdown request
LOG: unexpected EOF on standby connection
done

But in the case of the failure this does not happen:

LOG: received immediate shutdown request
LOG: unexpected EOF on standby connection
LOG: received immediate shutdown request

See, the "done" is not there.
Now if we look at RewindTest.pm, there is the following code:

    if ($test_master_datadir)
    {
        system "pg_ctl -D $test_master_datadir -s -m immediate stop 2> /dev/null";
    }
    if ($test_standby_datadir)
    {
        system "pg_ctl -D $test_standby_datadir -s -m immediate stop 2> /dev/null";
    }

And I think that the problem is triggered because we are missing a -w switch here, meaning that we do not wait for confirmation that the server has stopped; and visibly, if the stop is slow enough, the next server cannot start because the port is still taken by the server currently stopping. Note as well that the last pg_ctl stop command in pg_ctl/t/002_status.pl does not use -w, so we have the same problem there.

Attached is a patch fixing those problems and improving the logging facility, as it really helped me out with those issues. The simplest fix would be to just add the missing -w switch to the tests of pg_rewind and pg_ctl, though. It would be good to get that fixed; then I would be able to re-enable the TAP tests on hamster. I have run the tests a dozen times again with this patch, and I could not trigger the failure anymore.

Regards,
--
Michael
Attachment
Michael Paquier <michael.paquier@gmail.com> writes:
> Attached is a patch fixing those problems and improving the log
> facility as it really helped me out with those issues. The simplest
> fix would be to include the -w switch missing in the tests of
> pg_rewind and pg_ctl though.

I agree with adding the -w switch and will go do that. As far as the rest of this patch goes, it seems like it could be made less invasive if the logs got dumped into a subdirectory of tmp_check rather than adding another top-level directory that has to be cleaned?

regards, tom lane
Michael Paquier <michael.paquier@gmail.com> writes:
> Now if we look at RewindTest.pm, there is the following code:
>
>     if ($test_master_datadir)
>     {
>         system "pg_ctl -D $test_master_datadir -s -m immediate stop 2> /dev/null";
>     }
>     if ($test_standby_datadir)
>     {
>         system "pg_ctl -D $test_standby_datadir -s -m immediate stop 2> /dev/null";
>     }
>
> And I think that the problem is triggered because we are missing a -w
> switch here, meaning that we do not wait until the confirmation that
> the server has stopped, and visibly if stop is slow enough the next
> server to use cannot start because the port is already taken by the
> server currently stopping.

After I woke up a bit more, I remembered that -w is already the default for "pg_ctl stop", so your diagnosis here is incorrect.

I suspect that the real problem is the arbitrary decision to use -m immediate. The postmaster would ordinarily wait for its children to die, but on a slow machine we could perhaps reach the end of that 5-second timeout, whereupon the postmaster would SIGKILL its children *and exit immediately*. I'm not sure how instantaneous SIGKILL is, but it seems possible that we could end up trying to start the new postmaster before all the children of the old one are dead. If the shmem interlock is working properly, that ought to fail.

I wonder whether it's such a good idea for the postmaster to give up waiting before all children are gone (postmaster.c:1722 in HEAD).

regards, tom lane
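Tom's point about SIGKILL not being instantaneous follows from the signal API itself: kill() returns as soon as the signal is queued, not once the target has died, so a parent that exits immediately after SIGKILLing its children has no guarantee they are gone. A minimal Perl illustration of that window (this is only a standalone demonstration of the race, not postmaster code):

```perl
use strict;
use warnings;

# kill() returns as soon as SIGKILL is *queued*, not once the target has
# actually exited; only waitpid() tells us the child is really gone.  This
# is the window in which a parent that exits right after SIGKILLing its
# children can race against whoever starts a replacement process.
my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0)
{
	sleep 60;    # child just lingers until it is killed
	exit 0;
}

kill('KILL', $pid) or die "kill failed: $!";
# The child may still exist at this instant; reap it to be sure it is dead.
my $reaped = waitpid($pid, 0);
print "reaped child $reaped\n";
```

Dropping the waitpid() here is the moral equivalent of the postmaster exiting before its children are gone: the parent has no idea whether the SIGKILL has actually taken effect yet.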
On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wonder whether it's such a good idea for the postmaster to give
> up waiting before all children are gone (postmaster.c:1722 in HEAD).

I doubt it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-06-19 11:16:18 -0400, Robert Haas wrote:
> On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I wonder whether it's such a good idea for the postmaster to give
> > up waiting before all children are gone (postmaster.c:1722 in HEAD).
>
> I doubt it.

Seconded. It's pretty bad to possibly not be able to start again after stopping it. I don't see the benefit in not waiting - sure, the postmaster exits faster, but postgres hasn't shut down at that point...
Andres Freund <andres@anarazel.de> writes:
> On 2015-06-19 11:16:18 -0400, Robert Haas wrote:
>> On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I wonder whether it's such a good idea for the postmaster to give
>>> up waiting before all children are gone (postmaster.c:1722 in HEAD).
>
>> I doubt it.
>
> Seconded. It's pretty bad to possibly not be able to start again after
> stopping it. I don't see the benefit in not waiting - sure, the
> postmaster exits faster, but postgres hasn't shut down at that point...

Yeah. I'll see about fixing this. Hard to be sure if it will fix hamster's issue though.

regards, tom lane
Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2015-06-19 11:16:18 -0400, Robert Haas wrote:
> >> On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> I wonder whether it's such a good idea for the postmaster to give
> >>> up waiting before all children are gone (postmaster.c:1722 in HEAD).
>
> >> I doubt it.
>
> > Seconded. It's pretty bad to possibly not be able to start again after
> > stopping it. I don't see the benefit in not waiting - sure, the
> > postmaster exits faster, but postgres hasn't shut down at that point...
>
> Yeah. I'll see about fixing this. Hard to be sure if it will fix
> hamster's issue though.

We discussed this when that patch got in (82233ce7ea42d6b). The reason for not waiting, it was argued, is that the most likely reason for those processes not to have already gone away by the time we send SIGKILL is that they are stuck somewhere in the kernel, and so we might not be able to actually get them to go away with the SIGKILL. As I recall, that was the actual problem that MauMau was trying to get fixed.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-06-19 13:56:21 -0300, Alvaro Herrera wrote:
> We discussed this when that patch got in (82233ce7ea42d6b). The reason
> for not waiting, it was argued, is that the most likely reason for those
> processes not to have already gone away by the time we send SIGKILL was
> that they are stuck somewhere in the kernel, and so we might not be able
> to actually get them to go away with the SIGKILL. As I recall, that was
> the actual problem that MauMau was trying to get fixed.

How does exiting before they're dead help? They're still going to be attached to shared memory and thus prevent a restart. I don't think hiding the problem by exiting the postmaster helps at all.

I don't think it's our job to protect against processes stuck in the kernel. If that happens, something has gone seriously wrong, and we shouldn't just continue without making that visible.
Alvaro Herrera wrote:
> We discussed this when that patch got in (82233ce7ea42d6b). The reason
> for not waiting, it was argued, is that the most likely reason for those
> processes not to have already gone away by the time we send SIGKILL was
> that they are stuck somewhere in the kernel, and so we might not be able
> to actually get them to go away with the SIGKILL. As I recall, that was
> the actual problem that MauMau was trying to get fixed.

Hm, I reviewed the discussion, and actually I'm wrong: I came up with the unkillable-process argument on my own. MauMau's problem was a process that deadlocked because it tried to malloc() on its way out. I still think that unkillable processes are an issue, but perhaps it's sane to have the postmaster wait for some time instead of just going away immediately.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres Freund <andres@anarazel.de> writes:
> On 2015-06-19 13:56:21 -0300, Alvaro Herrera wrote:
>> We discussed this when that patch got in (82233ce7ea42d6b). The reason
>> for not waiting, it was argued, is that the most likely reason for those
>> processes not to have already gone away by the time we send SIGKILL was
>> that they are stuck somewhere in the kernel, and so we might not be able
>> to actually get them to go away with the SIGKILL. As I recall, that was
>> the actual problem that MauMau was trying to get fixed.
>
> How does exiting before they're dead help? They're still going to be
> attached to shared memory and thus prevent a restart. I don't think
> hiding the problem by exiting the postmaster helps at all.

My thought exactly.

regards, tom lane
Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2015-06-19 13:56:21 -0300, Alvaro Herrera wrote:
> >> We discussed this when that patch got in (82233ce7ea42d6b). The reason
> >> for not waiting, it was argued, is that the most likely reason for those
> >> processes not to have already gone away by the time we send SIGKILL was
> >> that they are stuck somewhere in the kernel, and so we might not be able
> >> to actually get them to go away with the SIGKILL. As I recall, that was
> >> the actual problem that MauMau was trying to get fixed.
>
> > How does exiting before they're dead help? They're still going to be
> > attached to shared memory and thus prevent a restart. I don't think
> > hiding the problem by exiting the postmaster helps at all.
>
> My thought exactly.

I guess you have a point there.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jun 20, 2015 at 12:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Now if we look at RewindTest.pm, there is the following code:
>>
>>     if ($test_master_datadir)
>>     {
>>         system "pg_ctl -D $test_master_datadir -s -m immediate stop 2> /dev/null";
>>     }
>>     if ($test_standby_datadir)
>>     {
>>         system "pg_ctl -D $test_standby_datadir -s -m immediate stop 2> /dev/null";
>>     }
>>
>> And I think that the problem is triggered because we are missing a -w
>> switch here, meaning that we do not wait until the confirmation that
>> the server has stopped, and visibly if stop is slow enough the next
>> server to use cannot start because the port is already taken by the
>> server currently stopping.
>
> After I woke up a bit more, I remembered that -w is already the default
> for "pg_ctl stop", so your diagnosis here is incorrect.

Ah, right. I forgot that. Perhaps I just got lucky in my runs.

> I suspect that the real problem is the arbitrary decision to use -m
> immediate. The postmaster would ordinarily wait for its children to
> die, but on a slow machine we could perhaps reach the end of that
> 5-second timeout, whereupon the postmaster would SIGKILL its children
> *and exit immediately*. I'm not sure how instantaneous SIGKILL is,
> but it seems possible that we could end up trying to start the new
> postmaster before all the children of the old one are dead. If the
> shmem interlock is working properly that ought to fail.
>
> I wonder whether it's such a good idea for the postmaster to give
> up waiting before all children are gone (postmaster.c:1722 in HEAD).

I don't think so either.
--
Michael
On Sat, Jun 20, 2015 at 12:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2015-06-19 11:16:18 -0400, Robert Haas wrote:
>>> On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> I wonder whether it's such a good idea for the postmaster to give
>>>> up waiting before all children are gone (postmaster.c:1722 in HEAD).
>
>>> I doubt it.
>
>> Seconded. It's pretty bad to possibly not be able to start again after
>> stopping it. I don't see the benefit in not waiting - sure, the
>> postmaster exits faster, but postgres hasn't shut down at that point...
>
> Yeah. I'll see about fixing this. Hard to be sure if it will fix
> hamster's issue though.

I have re-enabled the TAP tests on hamster. What happens in the next couple of days will be interesting to see.
--
Michael
On Fri, Jun 19, 2015 at 11:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Attached is a patch fixing those problems and improving the log
>> facility as it really helped me out with those issues. The simplest
>> fix would be to include the -w switch missing in the tests of
>> pg_rewind and pg_ctl though.
>
> I agree with adding the -w switch and will go do that. As far as the
> rest of this patch goes, it seems like it could be made less invasive
> if the logs got dumped into a subdirectory of tmp_check rather than
> adding another top-level directory that has to be cleaned?

tmp_check has remained unused since dcae5fac (am I missing something, perhaps?), but I guess we could use it again to save the logs, and have pg_rewind do the same.
--
Michael
On Sat, Jun 20, 2015 at 7:06 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
>> As far as the
>> rest of this patch goes, it seems like it could be made less invasive
>> if the logs got dumped into a subdirectory of tmp_check rather than
>> adding another top-level directory that has to be cleaned?
>
> tmp_check remains unused since dcae5fac (missing something perhaps?),
> but I guess we could use it again to save the logs in it, and have
> pg_rewind do the same.

OK, so here is an updated patch. All the logs are saved in tmp_check/log, and each log file is named after the file that launched the test, to facilitate its identification. I noticed at the same time that some commands were called in silent mode; I removed that, and their output is now written to the log files as well.

Regards,
--
Michael
Attachment
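The naming scheme Michael describes — one log per test script under tmp_check/log, named after the script that produced it — might look something like this (a sketch only; the helper name, the regress_log_ prefix, and the directory layout are assumptions for illustration):

```perl
use strict;
use warnings;
use File::Basename;

# Build a log file path from a test script's own file name, so each log in
# tmp_check/log is trivially matched to the test that produced it.
# (Sketch only; prefix and directory layout are assumptions.)
sub log_path_for
{
	my ($script) = @_;
	my $test_name = basename($script, '.pl');    # strip directory and .pl
	return "tmp_check/log/regress_log_$test_name";
}

# A test script would typically call log_path_for($0).
print log_path_for('t/002_databases.pl'), "\n";
# prints "tmp_check/log/regress_log_002_databases"
```

Deriving the name from the script itself means the mapping from a red buildfarm step to the relevant log needs no extra bookkeeping.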
On Sat, Jun 20, 2015 at 6:53 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Jun 20, 2015 at 12:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2015-06-19 11:16:18 -0400, Robert Haas wrote:
>>>> On Fri, Jun 19, 2015 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>>> I wonder whether it's such a good idea for the postmaster to give
>>>>> up waiting before all children are gone (postmaster.c:1722 in HEAD).
>>
>>>> I doubt it.
>>
>>> Seconded. It's pretty bad to possibly not be able to start again after
>>> stopping it. I don't see the benefit in not waiting - sure, the
>>> postmaster exits faster, but postgres hasn't shut down at that point...
>>
>> Yeah. I'll see about fixing this. Hard to be sure if it will fix
>> hamster's issue though.
>
> I have re-enabled the TAP tests on hamster. What happens in the next
> couple of days will be interesting to see.

And we get a failure:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2015-06-20%2017%3A59%3A01

I am not sure why buildfarm runs make it more easily reproducible; one of the reasons may be that the perl scripts run in the background.
--
Michael
On Sun, Jun 21, 2015 at 7:06 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> And we get a failure:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2015-06-20%2017%3A59%3A01
> I am not sure why buildfarm runs make it more easily reproducible;
> one of the reasons may be that the perl scripts run in the background.

hamster has not complained for a couple of weeks now, and the issue used to be reproducible every 4~6 days:
http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&br=HEAD
Hence let's consider the issue resolved.
--
Michael
Michael Paquier <michael.paquier@gmail.com> writes:
> hamster has not complained for a couple of weeks now, and the issue
> was reproducible every 4~6 days:
> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&br=HEAD
> Hence let's consider the issue as resolved.

Nah, I'm afraid not. We are definitely still seeing a lot of unexplained pg_ctl failures; as a recent example:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-07-29%2018%3A00%3A19

Now maybe that's not the same thing as hamster's issue, but there's still something funny going on. That's why I've been bitching about the lack of debug information. I hope that we can make progress on understanding this once Andrew pushes out the new buildfarm release (especially since some of his machines are particularly prone to this type of failure, so I trust he'll update those critters promptly).

regards, tom lane
On 07/29/2015 11:17 PM, Tom Lane wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> hamster has not complained for a couple of weeks now, and the issue
>> was reproducible every 4~6 days:
>> http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=hamster&br=HEAD
>> Hence let's consider the issue as resolved.
>
> Nah, I'm afraid not. We are definitely still seeing a lot of unexplained
> pg_ctl failures; as a recent example,
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-07-29%2018%3A00%3A19
>
> Now maybe that's not the same thing as hamster's issue, but there's still
> something funny going on. That's why I've been bitching about the lack of
> debug information. I hope that we can make progress on understanding this
> once Andrew pushes out the new buildfarm release (especially since some of
> his machines are particularly prone to this type of failure, so I trust
> he'll update those critters promptly).

I want to deal with running the tests on MSVC first. Meanwhile I'll drop the latest run_build.pl on axolotl. Anyone who wants to be on the bleeding edge can grab it from
<https://raw.githubusercontent.com/PGBuildFarm/client-code/master/run_build.pl>

cheers

andrew