Thread: Recording test runtimes with the buildfarm
Hi,

I've just had some thoughts about the possible usefulness of having the buildfarm record the run-time of each regression test to allow us to have some sort of ability to track the run-time history of each test.

I thought the usefulness might be two-fold:

1. We could quickly identify when someone adds some overly complex test and slows down the regression tests too much.
2. We might get some faster insight into performance regressions.

I can think of about 3 reasons that a test might slow down.

a) Someone adds some new tests within the test file.
b) Actual performance regression in Postgres
c) Animal busy with other work.

We likely could do a semi-decent job of telling a) and b) apart by just recording the latest commit that changed the .sql file for the test. We could also likely see when c) is at play by the results returning back to normal again a few runs after some spike. We'd only want to pay attention to consistent slowdowns. Perhaps there would be too much variability with the parallel tests, but maybe we could just record it for the serial tests in make check-world.

I only thought of this after reading [1]. If we went ahead with that, as of now, it feels like someone could quite easily break that optimisation and nobody would notice for a long time.

I admit to not having looked at the buildfarm code to determine how practical such a change would be. I've assumed there is a central database that stores all the results.

David

[1] https://www.postgresql.org/message-id/CAJ3gD9eEXJ2CHMSiOehvpTZu3Ap2GMi5jaXhoZuW=3XJLmZWpw@mail.gmail.com
Greetings,

* David Rowley (dgrowleyml@gmail.com) wrote:
> 1. We could quickly identify when someone adds some overly complex
> test and slows down the regression tests too much.

Sure, makes sense to me. We do track the individual 'stage_duration' but we don't track things down to a per-regression-test basis. To do that, I think we'd need the regression system to spit that kind of detailed information out somewhere (in a structured format) that the buildfarm client could then pick up and send to the server to write into an appropriate table.

> 2. We might get some faster insight into performance regressions.

There's some effort going into continuing to build out a "performance" farm, whose specific goal is to help with exactly this issue. Trying to do that with the buildfarm has the challenge that many of those systems aren't dedicated, and therefore the timing could vary wildly between runs due to things entirely independent of our code.

Would certainly be great to have more people working on that. Currently it's primarily Ilaria (GSoC student), Ads and I. The current repo is https://github.com/PGPerfFarm/pgperffarm if folks want to look at it, but we're in the process of making some pretty serious changes, so now might not be the best time to look at the code. We're coordinating on the 'Postgresteam' slack in #perffarm for anyone interested tho.

> I only thought of this after reading [1]. If we went ahead with that,
> as of now, it feels like someone could quite easily break that
> optimisation and nobody would notice for a long time.

One of the goals with the perffarm is to be able to support different types of benchmarks, beyond just pgbench, so that we'd be able to add a benchmark for "numeric", perhaps, or maybe create a script with pgbench that ends up being heavy on numerics or such.

> I admit to not having looked at the buildfarm code to determine how
> practical such a change would be. I've assumed there is a central
> database that stores all the results.

Yes, there's a central database where results are pushed, and that's what you see when you go to buildfarm.postgresql.org. There's also an archive server which has the logs going much farther back (and is quite a bit larger, of course..).

Thanks,

Stephen
On 6/10/20 8:58 AM, David Rowley wrote:
> Hi,
>
> I've just had some thoughts about the possible usefulness of having
> the buildfarm record the run-time of each regression test to allow us
> to have some sort of ability to track the run-time history of each
> test.
>
> I thought the usefulness might be two-fold:
>
> 1. We could quickly identify when someone adds some overly complex
> test and slows down the regression tests too much.
> 2. We might get some faster insight into performance regressions.
>
> I can think of about 3 reasons that a test might slow down.
>
> a) Someone adds some new tests within the test file.
> b) Actual performance regression in Postgres
> c) Animal busy with other work.
>
> We likely could do a semi-decent job of telling a) and b) apart by
> just recording the latest commit that changed the .sql file for the
> test. We could also likely see when c) is at play by the results
> returning back to normal again a few runs after some spike. We'd only
> want to pay attention to consistent slowdowns. Perhaps there would be
> too much variability with the parallel tests, but maybe we could just
> record it for the serial tests in make check-world.
>
> I only thought of this after reading [1]. If we went ahead with that,
> as of now, it feels like someone could quite easily break that
> optimisation and nobody would notice for a long time.
>
> I admit to not having looked at the buildfarm code to determine how
> practical such a change would be. I've assumed there is a central
> database that stores all the results.
>

David,

Yes we have a central database. But we don't have anything that digs into each step. In fact, on the server side the code isn't even really aware that it's Postgres that's being tested.

The basic structures are:

buildsystems - one row per animal
build_status - one row per run
build_status_log - one row per step within each run

The last table has a blob containing the output of the step (e.g. "make check") plus any relevant files (like the postmaster.log). But we don't have any structure corresponding to an individual regression test.

Architecturally, we could add a table for named times, and have the client extract and report them.

Alternatively, people with access to the database could extract the logs and post-process them using perl or python. That would involve no work on my part :-) But it would not be automated.

What we do record (in build_status_log) is the time each step took. So any regression test that suddenly blew out should likewise cause a blowout in the time the whole "make check" took. It might be possible to create a page that puts stats like that up. I think that's probably a better way to go.

Tying results to an individual commit is harder. There could be dozens of commits that happened between the current run and the previous run on a given animal. But usually what has caused any event is fairly clear when you look at it.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
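For illustration, a rough sketch of the kind of step-duration trend query this allows, assuming build_status_log exposes columns along the lines of sysname, branch, snapshot, log_stage and stage_duration (hypothetical names; the real schema may differ):

-- Pull the recorded duration of the "make check" step per animal and
-- branch, ordered by snapshot, so a sudden blowout stands out.
SELECT sysname, branch, snapshot, stage_duration
FROM build_status_log
WHERE log_stage = 'check.log'
ORDER BY sysname, branch, snapshot;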
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> Alternatively, people with access to the database could extract the logs
> and post-process them using perl or python. That would involve no work
> on my part :-) But it would not be automated.

Yeah, we could easily extract per-test-script runtimes, since pg_regress started to print those. But ...

> What we do record (in build_status_log) is the time each step took. So
> any regression test that suddenly blew out should likewise cause a
> blowout in the time the whole "make check" took.

I have in the past scraped the latter results and tried to make sense of them. They are *mighty* noisy, even when considering just one animal that I know to be running on a machine with little else to do. Maybe averaging across the whole buildfarm could reduce the noise level, but I'm not very hopeful. Per-test-script times would likely be even noisier (ISTM anyway, maybe I'm wrong).

The entire reason we've been discussing a separate performance farm is the expectation that buildfarm timings will be too noisy to be useful to detect any but the most obvious performance effects.

regards, tom lane
On 6/10/20 10:13 AM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> Alternatively, people with access to the database could extract the logs
>> and post-process them using perl or python. That would involve no work
>> on my part :-) But it would not be automated.
> Yeah, we could easily extract per-test-script runtimes, since pg_regress
> started to print those. But ...
>
>> What we do record (in build_status_log) is the time each step took. So
>> any regression test that suddenly blew out should likewise cause a
>> blowout in the time the whole "make check" took.
> I have in the past scraped the latter results and tried to make sense of
> them. They are *mighty* noisy, even when considering just one animal
> that I know to be running on a machine with little else to do. Maybe
> averaging across the whole buildfarm could reduce the noise level, but
> I'm not very hopeful. Per-test-script times would likely be even
> noisier (ISTM anyway, maybe I'm wrong).
>
> The entire reason we've been discussing a separate performance farm
> is the expectation that buildfarm timings will be too noisy to be
> useful to detect any but the most obvious performance effects.

Yes, but will the performance farm be testing regression timings? Maybe we're going to need several test suites, one of which could be regression tests. But the regression tests themselves are not really intended for performance testing.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings,

* Andrew Dunstan (andrew.dunstan@2ndquadrant.com) wrote:
> On 6/10/20 10:13 AM, Tom Lane wrote:
> > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> >> Alternatively, people with access to the database could extract the logs
> >> and post-process them using perl or python. That would involve no work
> >> on my part :-) But it would not be automated.
> > Yeah, we could easily extract per-test-script runtimes, since pg_regress
> > started to print those. But ...
> >
> >> What we do record (in build_status_log) is the time each step took. So
> >> any regression test that suddenly blew out should likewise cause a
> >> blowout in the time the whole "make check" took.
> > I have in the past scraped the latter results and tried to make sense of
> > them. They are *mighty* noisy, even when considering just one animal
> > that I know to be running on a machine with little else to do. Maybe
> > averaging across the whole buildfarm could reduce the noise level, but
> > I'm not very hopeful. Per-test-script times would likely be even
> > noisier (ISTM anyway, maybe I'm wrong).
> >
> > The entire reason we've been discussing a separate performance farm
> > is the expectation that buildfarm timings will be too noisy to be
> > useful to detect any but the most obvious performance effects.
>
> Yes, but will the performance farm be testing regression timings?

We are not currently envisioning that, no.

> Maybe we're going to need several test suites, one of which could be
> regression tests. But the regression tests themselves are not really
> intended for performance testing.

Agree with this- better would be tests which are specifically written to test performance instead.

Thanks,

Stephen
On Thu, 11 Jun 2020 at 02:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I have in the past scraped the latter results and tried to make sense of
> them. They are *mighty* noisy, even when considering just one animal
> that I know to be running on a machine with little else to do.

Do you recall if you looked at the parallel results or the serially executed ones?

I imagine that the parallel ones will have much more noise since we run the tests up to 20 at a time. I imagine probably none, or at most not many, of the animals have enough CPU cores not to be context switching a lot during those parallel runs. I thought the serial ones would be better but didn't have an idea of whether they'd be good enough to be useful.

David
On Thu, Jun 11, 2020 at 2:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I have in the past scraped the latter results and tried to make sense of
> them. They are *mighty* noisy, even when considering just one animal
> that I know to be running on a machine with little else to do. Maybe
> averaging across the whole buildfarm could reduce the noise level, but
> I'm not very hopeful. Per-test-script times would likely be even
> noisier (ISTM anyway, maybe I'm wrong).

I've been doing that in a little database that pulls down the results and analyses them with primitive regexes. First I wanted to know the pass/fail history for each individual regression, isolation and TAP script, then I wanted to build something that could identify tests that are 'flapping', and work out when they started and stopped flapping etc. I soon realised it was all too noisy, but then I figured that I could fix that by detecting crashes. So I classify every top level build farm run as SUCCESS, FAILURE or CRASH. If the top level run was CRASH, then I can disregard the individual per script results, because they're all BS.
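A minimal SQL sketch of that filtering, over hypothetical tables run(run_id, top_status) and test_result(run_id, testname, result) standing in for whatever the scraping produces:

-- Only trust per-script pass/fail results from runs whose top-level
-- classification was not CRASH; crashed runs poison the later scripts.
SELECT t.run_id, t.testname, t.result
FROM test_result t
JOIN run r USING (run_id)
WHERE r.top_status <> 'CRASH';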
David Rowley <dgrowleyml@gmail.com> writes:
> On Thu, 11 Jun 2020 at 02:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I have in the past scraped the latter results and tried to make sense of
>> them. They are *mighty* noisy, even when considering just one animal
>> that I know to be running on a machine with little else to do.

> Do you recall if you looked at the parallel results or the serially
> executed ones?

> I imagine that the parallel ones will have much more noise since we
> run the tests up to 20 at a time. I imagine probably none, or at most
> not many, of the animals have enough CPU cores not to be context
> switching a lot during those parallel runs. I thought the serial
> ones would be better but didn't have an idea of whether they'd be good
> enough to be useful.

I can't claim to recall specifically, but I agree with your theory about that, so I probably looked at the serial-schedule case.

Note that this is moot for animals using use_installcheck_parallel ... but it looks like that's still just a minority of them.

regards, tom lane
On Thu, Jun 11, 2020 at 9:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Jun 11, 2020 at 2:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I have in the past scraped the latter results and tried to make sense of
> > them. They are *mighty* noisy, even when considering just one animal
> > that I know to be running on a machine with little else to do. Maybe
> > averaging across the whole buildfarm could reduce the noise level, but
> > I'm not very hopeful. Per-test-script times would likely be even
> > noisier (ISTM anyway, maybe I'm wrong).
>
> I've been doing that in a little database that pulls down the results
> and analyses them with primitive regexes. First I wanted to know the
> pass/fail history for each individual regression, isolation and TAP
> script, then I wanted to build something that could identify tests
> that are 'flapping', and work out when they started and stopped
> flapping etc. I soon realised it was all too noisy, but then I
> figured that I could fix that by detecting crashes. So I classify
> every top level build farm run as SUCCESS, FAILURE or CRASH. If the
> top level run was CRASH, then I can disregard the individual per
> script results, because they're all BS.

With more coffee I realise that you were talking about noisy times, not noisy pass/fail results. But I still want to throw that idea out there, if we're considering analysing the logs.
Hi,

On June 10, 2020 2:13:51 PM PDT, David Rowley <dgrowleyml@gmail.com> wrote:
> On Thu, 11 Jun 2020 at 02:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I have in the past scraped the latter results and tried to make sense of
>> them. They are *mighty* noisy, even when considering just one animal
>> that I know to be running on a machine with little else to do.
>
> Do you recall if you looked at the parallel results or the serially
> executed ones?
>
> I imagine that the parallel ones will have much more noise since we
> run the tests up to 20 at a time. I imagine probably none, or at most
> not many, of the animals have enough CPU cores not to be context
> switching a lot during those parallel runs. I thought the serial
> ones would be better but didn't have an idea of whether they'd be good
> enough to be useful.

I'd assume that a rolling average (maybe 10 runs or so) would hide noise enough to see at least some trends, even for parallel runs.

We should be able to prototype this with a few queries over the bf database, right?

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
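A minimal sketch of such a prototype query, assuming a per-test table shaped like the run_details(animal, testname, commit, time, result, ms) table that appears later in the thread:

-- 10-run rolling average of each test's runtime per animal, to smooth
-- out one-off noise while still showing sustained slowdowns.
SELECT animal, testname, time, ms,
       avg(ms) OVER (PARTITION BY animal, testname
                     ORDER BY time
                     ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) AS rolling_avg_ms
FROM run_details
ORDER BY animal, testname, time;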
Thomas Munro <thomas.munro@gmail.com> writes:
> I've been doing that in a little database that pulls down the results
> and analyses them with primitive regexes. First I wanted to know the
> pass/fail history for each individual regression, isolation and TAP
> script, then I wanted to build something that could identify tests
> that are 'flapping', and work out when they started and stopped
> flapping etc. I soon realised it was all too noisy, but then I
> figured that I could fix that by detecting crashes. So I classify
> every top level build farm run as SUCCESS, FAILURE or CRASH. If the
> top level run was CRASH, then I can disregard the individual per
> script results, because they're all BS.

If you can pin the crash on a particular test script, it'd be useful to track that as a kind of failure. In general, though, both crashes and non-crash failures tend to cause collateral damage to later test scripts --- if you can't filter that out then the later scripts will have high false-positive rates.

regards, tom lane
On Thu, 11 Jun 2020 at 10:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Thomas Munro <thomas.munro@gmail.com> writes:
> > I've been doing that in a little database that pulls down the results
> > and analyses them with primitive regexes. First I wanted to know the
> > pass/fail history for each individual regression, isolation and TAP
> > script, then I wanted to build something that could identify tests
> > that are 'flapping', and work out when they started and stopped
> > flapping etc. I soon realised it was all too noisy, but then I
> > figured that I could fix that by detecting crashes. So I classify
> > every top level build farm run as SUCCESS, FAILURE or CRASH. If the
> > top level run was CRASH, then I can disregard the individual per
> > script results, because they're all BS.
>
> If you can pin the crash on a particular test script, it'd be useful
> to track that as a kind of failure. In general, though, both crashes
> and non-crash failures tend to cause collateral damage to later test
> scripts --- if you can't filter that out then the later scripts will
> have high false-positive rates.

I guess the fact that you've both needed to do analysis on individual tests shows that there might be a call for this beyond just recording the test's runtime.

If we had a table that stored the individual test details, pass/fail and just stored the timing information along with that, then, even if the timing was unstable, it could still be useful for some analysis. I'd be happy enough even if that was only available as a csv file download. I imagine the buildfarm does not need to provide us with any tools for doing analysis on this. Ideally, there would be some run_id that we could use to link it back to the test run, which would give us the commit SHA and the animal that it ran on. Joining to details about the animal could be useful too, e.g. perhaps a certain test always fails on 32-bit machines.

I suppose that maybe we could modify pg_regress to add a command line option to have it write out a machine-readable file, e.g. testname,result,runtime\n, then just have the buildfarm client ship that off to the buildfarm server to record in the database.

David
On 6/10/20 6:00 PM, Andres Freund wrote:
> On June 10, 2020 2:13:51 PM PDT, David Rowley <dgrowleyml@gmail.com> wrote:
>> On Thu, 11 Jun 2020 at 02:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I have in the past scraped the latter results and tried to make sense of
>>> them. They are *mighty* noisy, even when considering just one animal
>>> that I know to be running on a machine with little else to do.
>>
>> Do you recall if you looked at the parallel results or the serially
>> executed ones?
>>
>> I imagine that the parallel ones will have much more noise since we
>> run the tests up to 20 at a time. I imagine probably none, or at most
>> not many, of the animals have enough CPU cores not to be context
>> switching a lot during those parallel runs. I thought the serial
>> ones would be better but didn't have an idea of whether they'd be good
>> enough to be useful.
>
> I'd assume that a rolling average (maybe 10 runs or so) would hide noise enough to see at least some trends, even for parallel runs.
>
> We should be able to prototype this with a few queries over the bf database, right?

This seems to me like a perfect use case for control charts:

https://en.wikipedia.org/wiki/Control_chart

They are designed specifically to detect systematic changes in an environment with random noise.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
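As a rough illustration, crude control limits (mean ± 3 sigma per animal and test) could be computed over the same hypothetical run_details table; points outside the limits are the ones worth a closer look:

-- Flag runs whose time falls outside mean ± 3 standard deviations for
-- that animal/test combination (a very rough Shewhart-style chart).
WITH limits AS (
    SELECT animal, testname,
           avg(ms)         AS mean_ms,
           stddev_samp(ms) AS sd_ms
    FROM run_details
    GROUP BY animal, testname
    HAVING count(*) >= 20          -- need some history before flagging
)
SELECT r.animal, r.testname, r.time, r.ms
FROM run_details r
JOIN limits l USING (animal, testname)
WHERE l.sd_ms > 0
  AND abs(r.ms - l.mean_ms) > 3 * l.sd_ms
ORDER BY r.animal, r.testname, r.time;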
Greetings,

* David Rowley (dgrowleyml@gmail.com) wrote:
> On Thu, 11 Jun 2020 at 10:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > I've been doing that in a little database that pulls down the results
> > > and analyses them with primitive regexes. First I wanted to know the
> > > pass/fail history for each individual regression, isolation and TAP
> > > script, then I wanted to build something that could identify tests
> > > that are 'flapping', and work out when they started and stopped
> > > flapping etc. I soon realised it was all too noisy, but then I
> > > figured that I could fix that by detecting crashes. So I classify
> > > every top level build farm run as SUCCESS, FAILURE or CRASH. If the
> > > top level run was CRASH, then I can disregard the individual per
> > > script results, because they're all BS.
> >
> > If you can pin the crash on a particular test script, it'd be useful
> > to track that as a kind of failure. In general, though, both crashes
> > and non-crash failures tend to cause collateral damage to later test
> > scripts --- if you can't filter that out then the later scripts will
> > have high false-positive rates.
>
> I guess the fact that you've both needed to do analysis on individual
> tests shows that there might be a call for this beyond just recording
> the test's runtime.
>
> If we had a table that stored the individual test details, pass/fail
> and just stored the timing information along with that, then, even if
> the timing was unstable, it could still be useful for some analysis.
> I'd be happy enough even if that was only available as a csv file
> download. I imagine the buildfarm does not need to provide us with
> any tools for doing analysis on this. Ideally, there would be some
> run_id that we could use to link it back to the test run, which would give us
> the commit SHA and the animal that it ran on. Joining to details
> about the animal could be useful too, e.g. perhaps a certain test
> always fails on 32-bit machines.
>
> I suppose that maybe we could modify pg_regress to add a command line
> option to have it write out a machine-readable file, e.g.
> testname,result,runtime\n, then just have the buildfarm client ship
> that off to the buildfarm server to record in the database.

That seems like it'd be the best approach to me, though I'd defer to Andrew on it.

By the way, if you'd like access to the buildfarm archive server where all this stuff is stored, that can certainly be arranged, just let me know.

Thanks,

Stephen
On 6/11/20 10:21 AM, Stephen Frost wrote:
> Greetings,
>
> * David Rowley (dgrowleyml@gmail.com) wrote:
>> On Thu, 11 Jun 2020 at 10:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Thomas Munro <thomas.munro@gmail.com> writes:
>>>> I've been doing that in a little database that pulls down the results
>>>> and analyses them with primitive regexes. First I wanted to know the
>>>> pass/fail history for each individual regression, isolation and TAP
>>>> script, then I wanted to build something that could identify tests
>>>> that are 'flapping', and work out when they started and stopped
>>>> flapping etc. I soon realised it was all too noisy, but then I
>>>> figured that I could fix that by detecting crashes. So I classify
>>>> every top level build farm run as SUCCESS, FAILURE or CRASH. If the
>>>> top level run was CRASH, then I can disregard the individual per
>>>> script results, because they're all BS.
>>> If you can pin the crash on a particular test script, it'd be useful
>>> to track that as a kind of failure. In general, though, both crashes
>>> and non-crash failures tend to cause collateral damage to later test
>>> scripts --- if you can't filter that out then the later scripts will
>>> have high false-positive rates.
>> I guess the fact that you've both needed to do analysis on individual
>> tests shows that there might be a call for this beyond just recording
>> the test's runtime.
>>
>> If we had a table that stored the individual test details, pass/fail
>> and just stored the timing information along with that, then, even if
>> the timing was unstable, it could still be useful for some analysis.
>> I'd be happy enough even if that was only available as a csv file
>> download. I imagine the buildfarm does not need to provide us with
>> any tools for doing analysis on this. Ideally, there would be some
>> run_id that we could link it back to the test run which would give us
>> the commit SHA, and the animal that it ran on. Joining to details
>> about the animal could be useful too, e.g. perhaps a certain test
>> always fails on 32-bit machines.
>>
>> I suppose that maybe we could modify pg_regress to add a command line
>> option to have it write out a machine-readable file, e.g.
>> testname,result,runtime\n, then just have the buildfarm client ship
>> that off to the buildfarm server to record in the database.
> That seems like it'd be the best approach to me, though I'd defer to
> Andrew on it.
>
> By the way, if you'd like access to the buildfarm archive server where
> all this stuff is stored, that can certainly be arranged, just let me
> know.
>

Yeah, we'll need to work out where to stash the file. The client will pick up anything in src/regress/log for "make check", but would need adjusting for other steps that invoke pg_regress. I'm getting close to cutting a new client release, but I can delay it till we settle this.

On the server side, we could add a table with a key of <animal, snapshot, branch, step, testname> but we'd need to make sure those test names were unique. Maybe we need a way of telling pg_regress to prepend a module name (e.g. btree_gist or plperl) to the test name.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
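One possible shape for such a table, purely illustrative of the key described above (the real design would be up to the buildfarm code):

-- Illustrative only: per-test results keyed the way suggested above.
CREATE TABLE test_result (
    animal   text        NOT NULL,
    snapshot timestamptz NOT NULL,
    branch   text        NOT NULL,
    step     text        NOT NULL,  -- e.g. 'check' or 'install-check-C'
    testname text        NOT NULL,  -- possibly module-qualified, e.g. 'btree_gist/int4'
    result   text        NOT NULL,  -- e.g. 'ok' / 'FAILED'
    ms       integer,               -- NULL if no timing was reported
    PRIMARY KEY (animal, snapshot, branch, step, testname)
);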
On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
On 6/11/20 10:21 AM, Stephen Frost wrote:
> Greetings,
>
> * David Rowley (dgrowleyml@gmail.com) wrote:
>> On Thu, 11 Jun 2020 at 10:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Thomas Munro <thomas.munro@gmail.com> writes:
>>>> I've been doing that in a little database that pulls down the results
>>>> and analyses them with primitive regexes. First I wanted to know the
>>>> pass/fail history for each individual regression, isolation and TAP
>>>> script, then I wanted to build something that could identify tests
>>>> that are 'flapping', and work out when they started and stopped
>>>> flapping etc. I soon realised it was all too noisy, but then I
>>>> figured that I could fix that by detecting crashes. So I classify
>>>> every top level build farm run as SUCCESS, FAILURE or CRASH. If the
>>>> top level run was CRASH, then I can disregard the individual per
>>>> script results, because they're all BS.
>>> If you can pin the crash on a particular test script, it'd be useful
>>> to track that as a kind of failure. In general, though, both crashes
>>> and non-crash failures tend to cause collateral damage to later test
>>> scripts --- if you can't filter that out then the later scripts will
>>> have high false-positive rates.
>> I guess the fact that you've both needed to do analysis on individual
>> tests shows that there might be a call for this beyond just recording
>> the test's runtime.
>>
>> If we had a table that stored the individual test details, pass/fail
>> and just stored the timing information along with that, then, even if
>> the timing was unstable, it could still be useful for some analysis.
>> I'd be happy enough even if that was only available as a csv file
>> download. I imagine the buildfarm does not need to provide us with
>> any tools for doing analysis on this. Ideally, there would be some
>> run_id that we could link it back to the test run which would give us
>> the commit SHA, and the animal that it ran on. Joining to details
>> about the animal could be useful too, e.g perhaps a certain test
>> always fails on 32-bit machines.
>>
>> I suppose that maybe we could modify pg_regress to add a command line
>> option to have it write out a machine-readable file, e.g:
>> testname,result,runtime\n, then just have the buildfarm client ship
>> that off to the buildfarm server to record in the database.
> That seems like it'd be the best approach to me, though I'd defer to
> Andrew on it.
>
> By the way, if you'd like access to the buildfarm archive server where
> all this stuff is stored, that can certainly be arranged, just let me
> know.
>
Yeah, we'll need to work out where to stash the file. The client will
pick up anything in src/regress/log for "make check", but would need
adjusting for other steps that invoke pg_regress. I'm getting close to
cutting a new client release, but I can delay it till we settle this.
On the server side, we could add a table with a key of <animal,
snapshot, branch, step, testname> but we'd need to make sure those test
names were unique. Maybe we need a way of telling pg_regress to prepend
a module name (e.g. btree_gist or plperl) to the test name.
It seems pretty trivial to, for example, get all the tests out of check.log and their timings with a regexp. I just used '^(?:test)?\s+(\S+)\s+\.\.\. ok\s+(\d+) ms$' as the regexp. Running that against a few hundred build runs in the db generally looks fine, though I didn't look into it in detail.
Of course, that only looked at check.log, and more logic would be needed if we want to look into the other areas as well, but as long as it's pg_regress output I think it should be easy?
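A sketch of that scraping done directly in SQL over the stored logs, assuming the pg_regress output is available in a text column such as build_status_log.log_text alongside sysname, snapshot, branch and log_stage (hypothetical column names; the real schema may differ):

-- Split the stored check.log text into lines and apply the regexp above
-- to pull out each test name and its reported runtime in milliseconds.
SELECT l.sysname, l.snapshot, l.branch,
       m[1]      AS testname,
       m[2]::int AS ms
FROM build_status_log l,
     regexp_split_to_table(l.log_text, E'\n') AS line,
     regexp_match(line, '^(?:test)?\s+(\S+)\s+\.\.\. ok\s+(\d+) ms$') AS m
WHERE l.log_stage = 'check.log'
  AND m IS NOT NULL;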
Magnus Hagander <magnus@hagander.net> writes:
> On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <
> andrew.dunstan@2ndquadrant.com> wrote:
>> Yeah, we'll need to work out where to stash the file. The client will
>> pick up anything in src/regress/log for "make check", but would need
>> adjusting for other steps that invoke pg_regress. I'm getting close to
>> cutting a new client release, but I can delay it till we settle this.

> It seems pretty trivial to for example get all the steps out of check.log
> and their timing with a regexp.

Yeah, I don't see why we can't scrape this data from the existing buildfarm output, at least for the core regression tests. New infrastructure could make it easier/cheaper, but I don't think we should invest in that until we've got a provably useful tool.

regards, tom lane
On Thu, Jun 11, 2020 at 6:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
> On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <
> andrew.dunstan@2ndquadrant.com> wrote:
>> Yeah, we'll need to work out where to stash the file. The client will
>> pick up anything in src/regress/log for "make check", but would need
>> adjusting for other steps that invoke pg_regress. I'm getting close to
>> cutting a new client release, but I can delay it till we settle this.
> It seems pretty trivial to for example get all the steps out of check.log
> and their timing with a regexp.
Yeah, I don't see why we can't scrape this data from the existing
buildfarm output, at least for the core regression tests. New
infrastructure could make it easier/cheaper, but I don't think we
should invest in that until we've got a provably useful tool.
So, having spent a few minutes looking at my data, it is not quite that easy if we want to look at contrib checks, for example. Both btree_gin and btree_gist have checks called "int4", and the aforementioned regexp won't pick that up. But that's surely a fixable problem.
And perhaps we should at least start off with data for just "make check" to see if it's useful, before spending too much effort on it?
On 6/11/20 12:32 PM, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <
>> andrew.dunstan@2ndquadrant.com> wrote:
>>> Yeah, we'll need to work out where to stash the file. The client will
>>> pick up anything in src/regress/log for "make check", but would need
>>> adjusting for other steps that invoke pg_regress. I'm getting close to
>>> cutting a new client release, but I can delay it till we settle this.
>> It seems pretty trivial to for example get all the steps out of check.log
>> and their timing with a regexp.
> Yeah, I don't see why we can't scrape this data from the existing
> buildfarm output, at least for the core regression tests. New
> infrastructure could make it easier/cheaper, but I don't think we
> should invest in that until we've got a provably useful tool.

OK, that makes my life fairly easy. Are you wanting this to be automated, or just left as an exercise for researchers for now?

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 6/11/20 12:32 PM, Tom Lane wrote:
>> Yeah, I don't see why we can't scrape this data from the existing
>> buildfarm output, at least for the core regression tests. New
>> infrastructure could make it easier/cheaper, but I don't think we
>> should invest in that until we've got a provably useful tool.

> OK, that makes my life fairly easy. Are you wanting this to be
> automated, or just left as an exercise for researchers for now?

I'm envisioning the latter --- I think somebody should prove that useful results can be obtained before we spend any effort on making it easier to gather the input numbers.

Mind you, I'm not saying that I don't believe it's possible to get good results. But building infrastructure in advance of a solid use-case is a recipe for building the wrong thing.

regards, tom lane
Hello Tom,

> I have in the past scraped the latter results and tried to make sense of
> them. They are *mighty* noisy, even when considering just one animal
> that I know to be running on a machine with little else to do. Maybe
> averaging across the whole buildfarm could reduce the noise level, but
> I'm not very hopeful.

I'd try with median instead of average, so that bad cases due to animal overloading are ignored.

--
Fabien.
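A minimal sketch of that, again over the hypothetical run_details table used elsewhere in the thread:

-- Median per animal/test is far less sensitive to the occasional run
-- on an overloaded machine than the mean is.
SELECT animal, testname,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY ms) AS median_ms,
       avg(ms)                                         AS mean_ms
FROM run_details
GROUP BY animal, testname;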
On Fri, 12 Jun 2020 at 04:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Magnus Hagander <magnus@hagander.net> writes:
> > On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <
> > It seems pretty trivial to for example get all the steps out of check.log
> > and their timing with a regexp.
>
> Yeah, I don't see why we can't scrape this data from the existing
> buildfarm output, at least for the core regression tests. New
> infrastructure could make it easier/cheaper, but I don't think we
> should invest in that until we've got a provably useful tool.

Looking at the data mentioned in the logs for install-check-C, it seems some animals are quite stable with their timings but others are quite unstable.

Taking the lowest half of the timings of each animal,test combination, where the animal ran the test at least 50 times, the top 3 animals with the most consistent timing are.

postgres=# select animal, avg(stddev) from (select animal,testname, stddev(ms)
from (select animal,testname,unnest(ar[:array_length(ar,1)/2]) ms
from (select animal,testname,array_agg(ms order by ms) ar from run_details
group by animal,testname) a)b group by 1,2 having count(*) > 50) c
group by 1 order by avg(stddev) limit 3;
   animal   |        avg
------------+-------------------
 mule       | 4.750935819647279
 massasauga | 5.410419286413067
 eider      |  6.06834118301505
(3 rows)

And the least consistent 3 are:

postgres=# select animal, avg(stddev) from (select animal,testname, stddev(ms)
from (select animal,testname,unnest(ar[:array_length(ar,1)/2]) ms
from (select animal,testname,array_agg(ms order by ms) ar from run_details
group by animal,testname) a)b group by 1,2 having count(*) > 50) c
group by 1 order by avg(stddev) desc limit 3;
  animal  |        avg
----------+-------------------
 lorikeet | 830.9292062818336
 gharial  | 725.9874198764217
 dory     | 683.5182140482121
(3 rows)

If I look at a random test from mule:

postgres=# select commit,time,result,ms from run_details
where animal = 'mule' and testname = 'join' order by time desc limit 10;
 commit  |        time         | result | ms
---------+---------------------+--------+-----
 e532b1d | 2020-06-15 19:30:03 | ok     | 563
 7a3543c | 2020-06-15 15:30:03 | ok     | 584
 3baa7e3 | 2020-06-15 11:30:03 | ok     | 596
 47d4d0c | 2020-06-15 07:30:03 | ok     | 533
 decbe2b | 2020-06-14 15:30:04 | ok     | 535
 378badc | 2020-06-14 07:30:04 | ok     | 563
 23cbeda | 2020-06-13 19:30:04 | ok     | 557
 8f5b596 | 2020-06-13 07:30:04 | ok     | 553
 6472572 | 2020-06-13 03:30:04 | ok     | 580
 9a7fccd | 2020-06-12 23:30:04 | ok     | 561
(10 rows)

and from lorikeet:

postgres=# select commit,time,result,ms from run_details
where animal = 'lorikeet' and testname = 'join' order by time desc limit 10;
 commit  |        time         | result |  ms
---------+---------------------+--------+------
 47d4d0c | 2020-06-15 08:28:35 | ok     | 8890
 decbe2b | 2020-06-14 20:28:33 | ok     | 8878
 378badc | 2020-06-14 08:28:35 | ok     | 8854
 cc07264 | 2020-06-14 05:22:59 | ok     | 8883
 8f5b596 | 2020-06-13 10:36:14 | ok     | 8942
 2f48ede | 2020-06-12 20:28:41 | ok     | 8904
 ffd2582 | 2020-06-12 08:29:52 | ok     | 2016
 7aa4fb5 | 2020-06-11 23:21:26 | ok     | 1939
 ad9291f | 2020-06-11 09:56:48 | ok     | 1924
 c2bd1fe | 2020-06-10 23:01:42 | ok     | 1873
(10 rows)

I can supply the data I used for this, just send me an offlist email. It's about 19MB using bzip2.

I didn't look at the make check data and I do see some animals use the parallel_schedule for make installcheck, which my regex neglected to account for, so that data is missing from the set.

David