Re: Running tests under valgrind is getting slower at an alarming pace - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: Running tests under valgrind is getting slower at an alarming pace |
| Date | |
| Msg-id | 20211006164754.7fytp2r3va6sh66m@alap3.anarazel.de |
| In response to | Re: Running tests under valgrind is getting slower at an alarming pace (Robert Haas <robertmhaas@gmail.com>) |
| Responses | Re: Running tests under valgrind is getting slower at an alarming pace (×3) |
| List | pgsql-hackers |
Hi,

On 2021-10-06 12:09:36 -0400, Robert Haas wrote:
> Is the problem here that we're adding a lot of new test cases? Or is
> the problem that valgrind runs are getting slower for the same number
> of test cases?

I don't know precisely. It's probably a combination of several factors.

I do think we regressed somewhere around valgrind specifically - the leak
origin tracking in older branches seems to work better than in newer
branches. But I don't know whether that affects performance.

> If it's taking longer because we have more test cases, I'm honestly
> not sure that's really something we should try to fix.

I'm not arguing for having fewer tests. But clearly, executing them
serially is problematic when the times are going up like this.

Skink is hosted on a machine with a CPU clocking around ~3.9GHz for most
of the test run - getting a faster machine won't help that much. But most
of the time only a few cores are active.

This isn't just a problem with valgrind; the run times other animals
report aren't getting shorter either... It takes my workstation 2min20s
to execute check-world in parallel, but > 16min sequentially. The
buildfarm executes TAP tests sequentially...

> I mean, I'm sure we have some bad test cases here and there, but
> overall I think we still have too little test coverage, not too much.
> The recent discovery that recovery_end_command had zero test coverage
> is one fine example of that.
>
> But if we've done something that increases the relative cost of
> valgrind, maybe we can fix that in a centralized way.

There's probably some of that. The fact that the TAP test infrastructure
does all communication with the server via psql invocations that each
execute only a single query is a problem - connection startup is
expensive.

I think this is particularly a problem for things like
PostgresNode::poll_query_until(), which is also used by
::wait_for_catchup(), because a) process creation is more expensive under
valgrind and b) things take longer under valgrind, so we pay that price
many more times. At the same time, increasing the interval between poll
attempts also makes the tests slower, because all the waits for things
that have already finished add up.

I'd guess that, at the very least, driving individual poll_query_until()
calls via a psql that keeps running across queries would reduce this
substantially, and perhaps allow us to reduce the polling interval. But
there are probably some nontrivial challenges around recognizing query
boundaries :/ (see the sketch below)

Briefly looking at a valgrind profile, it looks like a lot of the CPU
time is spent doing syscalls related to logging. So far skink had
log_statement=all, log_connections=on and log_disconnections=on - I've
turned them off for the next runs. We'll see whether that helps.

I'll also try to print a bit more detail about the timing of each TAP
test; looks like I need to figure out how to pass PROVE_TEST='--timer'
through the buildfarm. Shouldn't be too hard.

One thing I think would really help is having the total time for each run
visible in an animal's run history. That way we could pinpoint
regressions reasonably efficiently; right now that's not easily possible
without writing nontrivial queries against the buildfarm database...

Greetings,

Andres Freund
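The persistent-psql idea mentioned above could look roughly like the following Perl sketch, modelled on how the TAP tests already drive psql through IPC::Run. It is an illustration only, not part of the original mail: the helper name `poll_query_until_persistent`, the `QUERY_DONE` sentinel and the `postgres` database name are made up, and the `\echo` sentinel is one simplistic way of recognizing query boundaries.

```perl
use strict;
use warnings;
use IPC::Run qw(start pump finish);

# Keep a single psql session open instead of spawning one per poll attempt.
my ($stdin, $stdout, $stderr) = ('', '', '');
my $psql = start [ 'psql', '-XAtq', '-d', 'postgres' ],
	'<', \$stdin, '>', \$stdout, '2>', \$stderr;

# Hypothetical variant of PostgresNode::poll_query_until(): reuse the open
# session and detect the end of each query's output with a sentinel.
sub poll_query_until_persistent
{
	my ($query, $expected, $max_attempts) = @_;

	foreach my $attempt (1 .. $max_attempts)
	{
		$stdout = '';
		# Run the query, then emit a sentinel marking the query boundary.
		$stdin .= "$query\n\\echo QUERY_DONE\n";
		pump $psql until $stdout =~ /^QUERY_DONE$/m;

		# Strip the sentinel; whatever is left is the query result.
		(my $result = $stdout) =~ s/QUERY_DONE\n$//;
		chomp $result;
		return 1 if $result eq $expected;
		sleep 1;
	}
	return 0;
}

# Example: wait until the server reports it has left recovery.
poll_query_until_persistent('SELECT pg_is_in_recovery();', 'f', 180)
	or die "timed out waiting for recovery to finish";

$stdin .= "\\q\n";
finish $psql;
```

A real implementation would also have to cope with multi-statement queries, errors arriving on stderr and psql dying mid-poll, which is where the boundary-recognition challenges mentioned above come in.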