Thread: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timingis
[COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timingis
From
Tom Lane
Date:
Remove pgbench "progress" test pending solution of its timing issues. Buildfarm member skink shows that this is even more flaky than I thought. There are probably some actual pgbench bugs here as well as a timing dependency. But we can't have stuff this unstable in the buildfarm, it obscures other issues. Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/ad51c6fb5708342e603d12a730bbc4e663bd637e Modified Files -------------- src/bin/pgbench/t/001_pgbench_with_server.pl | 22 ---------------------- 1 file changed, 22 deletions(-) -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pendingsolution of its timing is
From
Fabien COELHO
Date:
Hello Tom, > Remove pgbench "progress" test pending solution of its timing issues. It would have been nicer to simply comment them out... > Buildfarm member skink shows that this is even more flaky than I > thought. There are probably some actual pgbench bugs here as well as a > timing dependency. According to the trace, the host was so loaded that it could not do anything for over two seconds, so that the first output is "progress: 2.6", i.e. the 0 thread did not get any significant time for the first 2.6 seconds... of a 1 second test. Hmmm, not very kind. Somehow this cannot be helped: if the system does not give any execution time to some pgbench thread, the expected output cannot be there. The checked condition could be relaxed to "progress: \d", which still assumes that the system was kind enough to give enough cpu time to the process in the two second run so that it could output something. The issue is probably the same on the second test on the generated log. My guess is that some transactions (we are talking about "-S"...) took several seconds, thus were completed and aggregate in rows over the expected limit. Could be considered bug, that is the aggregation should discard over the limit results rather than counting them in some late entry. I'll investigate that. Globally these test failures reveals pathological testing conditions: the inability to execute a simple select over several seconds. It could have been OOM or whatever else. If the test must run even when postgres doesn't, it is a little bit hard as a starting assumption for a benchmarking tool:-( -- Fabien. -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pendingsolution of its timing is
From
Andres Freund
Date:
Hi, On 2017-09-23 19:57:40 +0200, Fabien COELHO wrote: > According to the trace, the host was so loaded that it could not do anything > for over two seconds, so that the first output is "progress: 2.6", i.e. the > 0 thread did not get any significant time for the first 2.6 seconds... of a > 1 second test. Hmmm, not very kind. > > Somehow this cannot be helped: if the system does not give any execution > time to some pgbench thread, the expected output cannot be there. skink runs postgres under valgrind, so it's going to be slower. > If the test must run even when postgres doesn't, it is a little bit hard as > a starting assumption for a benchmarking tool:-( I'm unsure what the point of this is. It's not like we discussing removing pgbench, we're talking about an unreliable test. Greetings, Andres Freund -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is
From
Tom Lane
Date:
Fabien COELHO <coelho@cri.ensmp.fr> writes: > Globally these test failures reveals pathological testing conditions: the > inability to execute a simple select over several seconds. It could have > been OOM or whatever else. Well, I think it's mostly about valgrind making everything really slow. Since we have seen some passes from skink recently, perhaps there was also a component of more-load-on-the-machine-than-usual. But in the end this is just evidence for my point that regression tests have to be very much not timing-sensitive. We run them under all kinds of out-of-the-ordinary stress. regards, tom lane -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pendingsolution of its timing is
From
Fabien COELHO
Date:
>> If the test must run even when postgres doesn't, it is a little bit >> hard as a starting assumption for a benchmarking tool:-( > > I'm unsure what the point of this is. It's not like we discussing > removing pgbench, we're talking about an unreliable test. My sentence was probably not very clear. I just meant that devising a coverage test which does not fail when the benchmarking tool is not able to run because of load/valgrind/... is kind of hard. Anything even loosely related to time, here having a few simple SELECT transactions over 2 seconds and a few printf into a file or stdout every second, can always fail if conditions are bad enough. So this means somehow giving up on coverage, because of one host which can fail under very unusual testing conditions once in a while. I would prefer to keep the test and have a warning instead, something like "ok, although a test which is allowed to rarely fail failed", but at least the feature are tested most of the time. Sigh. -- Fabien. -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pending solution of its timing is
From
Tom Lane
Date:
Fabien COELHO <coelho@cri.ensmp.fr> writes: > So this means somehow giving up on coverage, because of one host which can > fail under very unusual testing conditions once in a while. You are attacking a straw man. There is no reason whatever to believe that the problem is confined to one host. We have quite a number of slow buildfarm hosts: some are using CLOBBER_CACHE_ALWAYS, some are overloaded, some are just plain slow. Also, regression test failures outside the context of our buildfarm are just as bad if not worse. I still remember the pain I had with mysql's regression tests, which routinely fell over in Red Hat's package buildfarm even though they always succeeded on a personal workstation with nothing else going on. I do not want our own packagers to have that kind of experience with Postgres. > I would prefer to keep the test and have a warning instead, something > like "ok, although a test which is allowed to rarely fail failed", but at > least the feature are tested most of the time. If the failures are ignored, how much of a test have you really got? Nobody inquires into nominally-passed buildfarm runs to see what actually happened. In the TAP-test context, you do have quite a bit of freedom to decide what is a pass and what is a fail. So maybe the path forward is to be more flexible about the pass conditions for this test. But I can't help thinking that these results have exposed some outright bugs too. regards, tom lane -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pendingsolution of its timing is
From
Fabien COELHO
Date:
> You are attacking a straw man. I thought it was tilting at windmills:-) > In the TAP-test context, you do have quite a bit of freedom to decide > what is a pass and what is a fail. So maybe the path forward is to be > more flexible about the pass conditions for this test. Hmmm. I'm not sure that there is a solution under "any" testing conditions if -T & -P are to be tested and some sensible output expected out of them. I'll think about it. > But I can't help thinking that these results have exposed some outright > bugs too. Yep, I agree that there is probably one bug/race/unexpected condition wrt what is output and time out handling. -- Fabien. -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pendingsolution of its timing is
From
Fabien COELHO
Date:
Hello Tom, > Well, I think it's mostly about valgrind making everything really slow. > Since we have seen some passes from skink recently, perhaps there was > also a component of more-load-on-the-machine-than-usual. But in the end > this is just evidence for my point that regression tests have to be very > much not timing-sensitive. We run them under all kinds of > out-of-the-ordinary stress. Attached is an attempt at improving the situation: (1) there are added comments to explain the whys, which is to provide coverage for pgbench time-related features... while still not being time-sensitive, which is a challenge. (2) the test now only expects "progress: \d" from the output, so it is enough that one progress is shown, whenever it is shown. (3) if the test is detected to have gone AWOL, detailed log checks are coldly skipped. This would have passed on "skink" under the special conditions it encountered. I cannot guaranty that it would pass under any circumstance, though. If it still encounters a failure, ISTM that it should only be a missing "progress:" in the output, which has not been encountered so far. If it occurs, a few options would remain, none of them very convincing: - give the test some more time, eg 3 seconds (hmmm... could still fail after any time...) - drop the "progress:" expectation (hmmm... but then it does not check anything). - make the "progress:" output check conditional to the running time (hmmm... it would require changing the command_checks_all function, and there is still a chance that the bench was stuck for 2 seconds then given time to realize that it has to stop right now...) - use an even simpler transaction, eg "SELECT;" which is less likely to get stuck (but still could get stuck...). For the random-ness related test (--sample-rate), we could add a feature to pgbench which allows to control the random seed, so that the number of samples could be known in advance for testing purposes. -- Fabien. -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Attachment
Re: [COMMITTERS] pgsql: Remove pgbench "progress" test pendingsolution of its timing is
From
Alvaro Herrera
Date:
Here's a rebased version of this patch. This is not an endorsement of it -- I just used it to locally increase coverage for the doCustom refactoring patch I'm about to push. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services