Thread: Recording test runtimes with the buildfarm

Recording test runtimes with the buildfarm

From
David Rowley
Date:
Hi,

I've just had some thoughts about the possible usefulness of having
the buildfarm record the run-time of each regression test to allow us
to have some sort of ability to track the run-time history of each
test.

I thought the usefulness might be two-fold:

1. We could quickly identify when someone adds some overly complex
test and slows down the regression tests too much.
2. We might get some faster insight into performance regressions.

I can think of about 3 reasons that a test might slow down.

a) Someone adds some new tests within the test file.
b) Actual performance regression in Postgres
c) Animal busy with other work.

We likely could do a semi-decent job of telling a) and b) apart by
just recording the latest commit that changed the .sql file for the
test. We could also likely see when c) is at play by the results
returning back to normal again a few runs after some spike.  We'd only
want to pay attention to consistent slowdowns. Perhaps there would be
too much variability with the parallel tests, but maybe we could just
record it for the serial tests in make check-world.

I only thought of this after reading [1]. If we went ahead with that,
as of now, it feels like someone could quite easily break that
optimisation and nobody would notice for a long time.

I admit to not having looked at the buildfarm code to determine how
practical such a change would be. I've assumed there is a central
database that stores all the results.

David

[1] https://www.postgresql.org/message-id/CAJ3gD9eEXJ2CHMSiOehvpTZu3Ap2GMi5jaXhoZuW=3XJLmZWpw@mail.gmail.com



Re: Recording test runtimes with the buildfarm

From
Stephen Frost
Date:
Greetings,

* David Rowley (dgrowleyml@gmail.com) wrote:
> 1. We could quickly identify when someone adds some overly complex
> test and slows down the regression tests too much.

Sure, makes sense to me.  We do track the individual 'stage_duration'
but we don't track things down to a per-regression-test basis.  To do
that, I think we'd need the regression system to spit that kind of
detailed information out somewhere (in a structured format) where the
buildfarm client would then be able to pick it up and send it to the server
to write into an appropriate table.

> 2. We might get some faster insight into performance regressions.

There's some effort going into continuing to build out a "performance"
farm, whose specific goal is to try and help exactly this issue.  Trying
to do that with the buildfarm has the challenge that many of those
systems aren't dedicated and therefore the timing could vary wildly
between runs due to things entirely independent of our code.

Would certainly be great to have more people working on that.  Currently
it's primarily Ilaria (GSoC student), Ads and I.

Current repo is https://github.com/PGPerfFarm/pgperffarm if folks want
to look at it, but we're in the process of making some pretty serious
changes, so now might not be the best time to look at the code.  We're
coordinating on the 'Postgresteam' slack in #perffarm for anyone
interested tho.

> I only thought of this after reading [1]. If we went ahead with that,
> as of now, it feels like someone could quite easily break that
> optimisation and nobody would notice for a long time.

One of the goals with the perffarm is to be able to support different
types of benchmarks, beyond just pgbench, so that we'd be able to add a
benchmark for "numeric", perhaps, or maybe create a script with pgbench
that ends up being heavy on numerics or such.

> I admit to not having looked at the buildfarm code to determine how
> practical such a change would be. I've assumed there is a central
> database that stores all the results.

Yes, there's a central database where results are pushed and that's what
you see when you go to buildfarm.postgresql.org; there's also an archive
server which has the logs going much farther back (and is quite a bit
larger, of course...).

Thanks,

Stephen


Re: Recording test runtimes with the buildfarm

From
Andrew Dunstan
Date:
On 6/10/20 8:58 AM, David Rowley wrote:
> Hi,
>
> I've just had some thoughts about the possible usefulness of having
> the buildfarm record the run-time of each regression test to allow us
> to have some sort of ability to track the run-time history of each
> test.
>
> I thought the usefulness might be two-fold:
>
> 1. We could quickly identify when someone adds some overly complex
> test and slows down the regression tests too much.
> 2. We might get some faster insight into performance regressions.
>
> I can think of about 3 reasons that a test might slow down.
>
> a) Someone adds some new tests within the test file.
> b) Actual performance regression in Postgres
> c) Animal busy with other work.
>
> We likely could do a semi-decent job of telling a) and b) apart by
> just recording the latest commit that changed the .sql file for the
> test. We could also likely see when c) is at play by the results
> returning back to normal again a few runs after some spike.  We'd only
> want to pay attention to consistent slowdowns. Perhaps there would be
> too much variability with the parallel tests, but maybe we could just
> record it for the serial tests in make check-world.
>
> I only thought of this after reading [1]. If we went ahead with that,
> as of now, it feels like someone could quite easily break that
> optimisation and nobody would notice for a long time.
>
> I admit to not having looked at the buildfarm code to determine how
> practical such a change would be. I've assumed there is a central
> database that stores all the results.
>

David,


Yes we have a central database. But we don't have anything that digs
into each step. In fact, on the server side the code isn't even really
aware that it's Postgres that's being tested.


The basic structures are:


    buildsystems - one row per animal

    build_status - one row per run

    build_status_log - one row per step within each run


The last table has a blob containing the output of the step (e.g. "make
check") plus any relevant files (like the postmaster.log).

But we don't have any structure corresponding to an individual
regression test.

Architecturally, we could add a table for named times, and have the
client extract and report them.

Alternatively, people with access to the database could extract the logs
and post-process them using perl or python. That would involve no work
on my part :-) But it would not be automated.

What we do record (in build_status_log) is the time each step took. So
any regression test that suddenly blew out should likewise cause a
blowout in the time the whole "make check" took. It might be possible to
create a page that puts stats like that up. I think that's probably a
better way to go.
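
As a sketch of the sort of query such a page could run (the table and
column names here are stand-ins, not the real schema):

    select animal, branch, snapshot::date as day,
           avg(stage_duration) as avg_check_duration
    from step_times        -- hypothetical; not the real build_status_log layout
    where log_stage = 'check'
    group by 1, 2, 3
    order by 1, 2, 3;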

Tying results to an individual commit is harder. There could be dozens
of commits that happened between the current run and the previous run on
a given animal. But usually what has caused any event is fairly clear
when you look at it.


cheers


andrew


-- 

Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Recording test runtimes with the buildfarm

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> Alternatively, people with access to the database could extract the logs
> and post-process them using perl or python. That would involve no work
> on my part :-) But it would not be automated.

Yeah, we could easily extract per-test-script runtimes, since pg_regress
started to print those.  But ...

> What we do record (in build_status_log) is the time each step took. So
> any regression test that suddenly blew out should likewise cause a
> blowout in the time the whole "make check" took.

I have in the past scraped the latter results and tried to make sense of
them.  They are *mighty* noisy, even when considering just one animal
that I know to be running on a machine with little else to do.  Maybe
averaging across the whole buildfarm could reduce the noise level, but
I'm not very hopeful.  Per-test-script times would likely be even
noisier (ISTM anyway, maybe I'm wrong).

The entire reason we've been discussing a separate performance farm
is the expectation that buildfarm timings will be too noisy to be
useful to detect any but the most obvious performance effects.

            regards, tom lane



Re: Recording test runtimes with the buildfarm

From
Andrew Dunstan
Date:
On 6/10/20 10:13 AM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> Alternatively, people with access to the database could extract the logs
>> and post-process them using perl or python. That would involve no work
>> on my part :-) But it would not be automated.
> Yeah, we could easily extract per-test-script runtimes, since pg_regress
> started to print those.  But ...
>
>> What we do record (in build_status_log) is the time each step took. So
>> any regression test that suddenly blew out should likewise cause a
>> blowout in the time the whole "make check" took.
> I have in the past scraped the latter results and tried to make sense of
> them.  They are *mighty* noisy, even when considering just one animal
> that I know to be running on a machine with little else to do.  Maybe
> averaging across the whole buildfarm could reduce the noise level, but
> I'm not very hopeful.  Per-test-script times would likely be even
> noisier (ISTM anyway, maybe I'm wrong).
>
> The entire reason we've been discussing a separate performance farm
> is the expectation that buildfarm timings will be too noisy to be
> useful to detect any but the most obvious performance effects.
>
>             



Yes, but will the performance farm be testing regression timings?


Maybe we're going to need several test suites, one of which could be
regression tests. But the regression tests themselves are not really
intended for performance testing.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Recording test runtimes with the buildfarm

From
Stephen Frost
Date:
Greetings,

* Andrew Dunstan (andrew.dunstan@2ndquadrant.com) wrote:
> On 6/10/20 10:13 AM, Tom Lane wrote:
> > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> >> Alternatively, people with access to the database could extract the logs
> >> and post-process them using perl or python. That would involve no work
> >> on my part :-) But it would not be automated.
> > Yeah, we could easily extract per-test-script runtimes, since pg_regress
> > started to print those.  But ...
> >
> >> What we do record (in build_status_log) is the time each step took. So
> >> any regression test that suddenly blew out should likewise cause a
> >> blowout in the time the whole "make check" took.
> > I have in the past scraped the latter results and tried to make sense of
> > them.  They are *mighty* noisy, even when considering just one animal
> > that I know to be running on a machine with little else to do.  Maybe
> > averaging across the whole buildfarm could reduce the noise level, but
> > I'm not very hopeful.  Per-test-script times would likely be even
> > noisier (ISTM anyway, maybe I'm wrong).
> >
> > The entire reason we've been discussing a separate performance farm
> > is the expectation that buildfarm timings will be too noisy to be
> > useful to detect any but the most obvious performance effects.
>
> Yes, but will the performance farm be testing regression timings?

We are not currently envisioning that, no.

> Maybe we're going to need several test suites, one of which could be
> regression tests. But the regression tests themselves are not really
> intended for performance testing.

Agree with this- better would be tests which are specifically written to
test performance instead.

Thanks,

Stephen


Re: Recording test runtimes with the buildfarm

From
David Rowley
Date:
On Thu, 11 Jun 2020 at 02:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I have in the past scraped the latter results and tried to make sense of
> them.  They are *mighty* noisy, even when considering just one animal
> that I know to be running on a machine with little else to do.

Do you recall if you looked at the parallel results or the serially
executed ones?

I imagine that the parallel ones will have much more noise since we
run the tests up to 20 at a time. I imagine probably none, or at most
not many of the animals have enough CPU cores not to be context
switching a lot during the parallel runs.  I thought the serial
ones would be better but didn't have an idea of whether they'd be good enough
to be useful.

David



Re: Recording test runtimes with the buildfarm

From
Thomas Munro
Date:
On Thu, Jun 11, 2020 at 2:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I have in the past scraped the latter results and tried to make sense of
> them.  They are *mighty* noisy, even when considering just one animal
> that I know to be running on a machine with little else to do.  Maybe
> averaging across the whole buildfarm could reduce the noise level, but
> I'm not very hopeful.  Per-test-script times would likely be even
> noisier (ISTM anyway, maybe I'm wrong).

I've been doing that in a little database that pulls down the results
and analyses them with primitive regexes.  First I wanted to know the
pass/fail history for each individual regression, isolation and TAP
script, then I wanted to build something that could identify tests
that are 'flapping', and work out when they started and stopped
flapping etc.  I soon realised it was all too noisy, but then I
figured that I could fix that by detecting crashes.  So I classify
every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
top level run was CRASH, then I can disregard the individual per
script results, because they're all BS.



Re: Recording test runtimes with the buildfarm

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Thu, 11 Jun 2020 at 02:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I have in the past scraped the latter results and tried to make sense of
>> them.  They are *mighty* noisy, even when considering just one animal
>> that I know to be running on a machine with little else to do.

> Do you recall if you looked at the parallel results or the serially
> executed ones?

> I imagine that the parallel ones will have much more noise since we
> run the tests up to 20 at a time. I imagine probably none, or at most
> not many of the animals have enough CPU cores not to be context
> switching a lot during the parallel runs.  I thought the serial
> ones would be better but didn't have an idea of whether they'd be good enough
> to be useful.

I can't claim to recall specifically, but I agree with your theory
about that, so I probably looked at the serial-schedule case.

Note that this is moot for animals using use_installcheck_parallel
... but it looks like that's still just a minority of them.

            regards, tom lane



Re: Recording test runtimes with the buildfarm

From
Thomas Munro
Date:
On Thu, Jun 11, 2020 at 9:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Jun 11, 2020 at 2:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I have in the past scraped the latter results and tried to make sense of
> > them.  They are *mighty* noisy, even when considering just one animal
> > that I know to be running on a machine with little else to do.  Maybe
> > averaging across the whole buildfarm could reduce the noise level, but
> > I'm not very hopeful.  Per-test-script times would likely be even
> > noisier (ISTM anyway, maybe I'm wrong).
>
> I've been doing that in a little database that pulls down the results
> and analyses them with primitive regexes.  First I wanted to know the
> pass/fail history for each individual regression, isolation and TAP
> script, then I wanted to build something that could identify tests
> that are 'flapping', and work out when they started and stopped
> flapping etc.  I soon realised it was all too noisy, but then I
> figured that I could fix that by detecting crashes.  So I classify
> every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
> top level run was CRASH, then I can disregard the individual per
> script results, because they're all BS.

With more coffee I realise that you were talking about noisy times,
not noisy pass/fail results.  But I still want to throw that idea out
there, if we're considering analysing the logs.



Re: Recording test runtimes with the buildfarm

From
Andres Freund
Date:
Hi,

On June 10, 2020 2:13:51 PM PDT, David Rowley <dgrowleyml@gmail.com> wrote:
>On Thu, 11 Jun 2020 at 02:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I have in the past scraped the latter results and tried to make sense
>of
>> them.  They are *mighty* noisy, even when considering just one animal
>> that I know to be running on a machine with little else to do.
>
>Do you recall if you looked at the parallel results or the serially
>executed ones?
>
>I imagine that the parallel ones will have much more noise since we
>run the tests up to 20 at a time. I imagine probably none, or at most
>not many of the animals have enough CPU cores not to be context
>switching a lot during the parallel runs.  I thought the serial
>ones would be better but didn't have an idea of whether they'd be good enough
>to be useful.

I'd assume that a rolling average (maybe 10 runs or so) would hide noise
enough to see at least some trends even for parallel runs.

We should be able to prototype this with a few queries over the bf database, right?
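
Something like this, say, assuming the per-test timings had already been
scraped into a hypothetical run_details(animal, testname, time, ms) table:

    select animal, testname, time, ms,
           avg(ms) over (partition by animal, testname
                         order by time
                         rows between 9 preceding and current row) as rolling_avg_ms
    from run_details
    order by animal, testname, time;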

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Recording test runtimes with the buildfarm

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> I've been doing that in a little database that pulls down the results
> and analyses them with primitive regexes.  First I wanted to know the
> pass/fail history for each individual regression, isolation and TAP
> script, then I wanted to build something that could identify tests
> that are 'flapping', and work out when they started and stopped
> flapping etc.  I soon realised it was all too noisy, but then I
> figured that I could fix that by detecting crashes.  So I classify
> every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
> top level run was CRASH, then I can disregard the individual per
> script results, because they're all BS.

If you can pin the crash on a particular test script, it'd be useful
to track that as a kind of failure.  In general, though, both crashes
and non-crash failures tend to cause collateral damage to later test
scripts --- if you can't filter that out then the later scripts will
have high false-positive rates.

            regards, tom lane



Re: Recording test runtimes with the buildfarm

From
David Rowley
Date:
On Thu, 11 Jun 2020 at 10:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Thomas Munro <thomas.munro@gmail.com> writes:
> > I've been doing that in a little database that pulls down the results
> > and analyses them with primitive regexes.  First I wanted to know the
> > pass/fail history for each individual regression, isolation and TAP
> > script, then I wanted to build something that could identify tests
> > that are 'flapping', and work out when they started and stopped
> > flapping etc.  I soon realised it was all too noisy, but then I
> > figured that I could fix that by detecting crashes.  So I classify
> > every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
> > top level run was CRASH, then I can disregard the individual per
> > script results, because they're all BS.
>
> If you can pin the crash on a particular test script, it'd be useful
> to track that as a kind of failure.  In general, though, both crashes
> and non-crash failures tend to cause collateral damage to later test
> scripts --- if you can't filter that out then the later scripts will
> have high false-positive rates.

I guess the fact that you've both needed to do analysis on individual
tests shows that there might be a call for this beyond just recording
the test's runtime.

If we had a table that stored the individual test details, pass/fail
and just stored the timing information along with that, then, even if
the timing was unstable, it could still be useful for some analysis.
I'd be happy enough even if that was only available as a csv file
download.  I imagine the buildfarm does not need to provide us with
any tools for doing analysis on this. Ideally, there would be some
run_id we could use to link back to the test run, which would give us
the commit SHA and the animal that it ran on. Joining to details
about the animal could be useful too, e.g. perhaps a certain test
always fails on 32-bit machines.

I suppose that maybe we could modify pg_regress to add a command line
option to have it write out a machine-readable file, e.g:
testname,result,runtime\n, then just have the buildfarm client ship
that off to the buildfarm server to record in the database.
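
Purely for illustration (the file name and staging table below are invented,
not something pg_regress or the buildfarm has today), the file could be one
line per test and trivial to load on the server side:

    -- hypothetical regression_times.csv written by pg_regress:
    --   boolean,ok,123
    --   join,ok,563
    create table test_times_staging
        (testname text, result text, runtime_ms integer);
    copy test_times_staging from '/path/to/regression_times.csv'
        with (format csv);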

David



Re: Recording test runtimes with the buildfarm

From
Joe Conway
Date:
On 6/10/20 6:00 PM, Andres Freund wrote:
> On June 10, 2020 2:13:51 PM PDT, David Rowley <dgrowleyml@gmail.com> wrote:
>>On Thu, 11 Jun 2020 at 02:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I have in the past scraped the latter results and tried to make sense
>>of
>>> them.  They are *mighty* noisy, even when considering just one animal
>>> that I know to be running on a machine with little else to do.
>>
>>Do you recall if you looked at the parallel results or the serially
>>executed ones?
>>
>>I imagine that the parallel ones will have much more noise since we
>>run the tests up to 20 at a time. I imagine probably none, or at most
>>not many of the animals have enough CPU cores not to be context
>>switching a lot during the parallel runs.  I thought the serial
>>ones would be better but didn't have an idea of whether they'd be good enough
>>to be useful.
> 
> I'd assume that a rolling average (maybe 10 runs or so) would hide noise
> enough to see at least some trends even for parallel runs.
> 
> We should be able to prototype this with a few queries over the bf database, right?


This seems to me like a perfect use case for control charts:

  https://en.wikipedia.org/wiki/Control_chart

They are designed specifically to detect systematic changes in an environment
with random noise.
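
A sketch of the textbook version, assuming per-test timings in a
hypothetical run_details(animal, testname, time, ms) table: flag any run
more than three standard deviations away from that animal/test
combination's historical mean.

    with stats as (
        select animal, testname,
               avg(ms)         as mean_ms,
               stddev_samp(ms) as sd_ms
        from run_details
        group by animal, testname
    )
    select r.animal, r.testname, r.time, r.ms
    from run_details r
    join stats s using (animal, testname)
    where s.sd_ms > 0
      and abs(r.ms - s.mean_ms) > 3 * s.sd_ms
    order by r.animal, r.testname, r.time;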

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



Re: Recording test runtimes with the buildfarm

From
Stephen Frost
Date:
Greetings,

* David Rowley (dgrowleyml@gmail.com) wrote:
> On Thu, 11 Jun 2020 at 10:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > I've been doing that in a little database that pulls down the results
> > > and analyses them with primitive regexes.  First I wanted to know the
> > > pass/fail history for each individual regression, isolation and TAP
> > > script, then I wanted to build something that could identify tests
> > > that are 'flapping', and work out when they started and stopped
> > > flapping etc.  I soon realised it was all too noisy, but then I
> > > figured that I could fix that by detecting crashes.  So I classify
> > > every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
> > > top level run was CRASH, then I can disregard the individual per
> > > script results, because they're all BS.
> >
> > If you can pin the crash on a particular test script, it'd be useful
> > to track that as a kind of failure.  In general, though, both crashes
> > and non-crash failures tend to cause collateral damage to later test
> > scripts --- if you can't filter that out then the later scripts will
> > have high false-positive rates.
>
> I guess the fact that you've both needed to do analysis on individual
> tests shows that there might be a call for this beyond just recording
> the test's runtime.
>
> If we had a table that stored the individual test details, pass/fail
> and just stored the timing information along with that, then, even if
> the timing was unstable, it could still be useful for some analysis.
> I'd be happy enough even if that was only available as a csv file
> download.  I imagine the buildfarm does not need to provide us with
> any tools for doing analysis on this. Ideally, there would be some
> run_id that we could link it back to the test run which would give us
> the commit SHA, and the animal that it ran on. Joining to details
> about the animal could be useful too, e.g perhaps a certain test
> always fails on 32-bit machines.
>
> I suppose that maybe we could modify pg_regress to add a command line
> option to have it write out a machine-readable file, e.g:
> testname,result,runtime\n, then just have the buildfarm client ship
> that off to the buildfarm server to record in the database.

That seems like it'd be the best approach to me, though I'd defer to
Andrew on it.

By the way, if you'd like access to the buildfarm archive server where
all this stuff is stored, that can certainly be arranged, just let me
know.

Thanks,

Stephen


Re: Recording test runtimes with the buildfarm

From
Andrew Dunstan
Date:
On 6/11/20 10:21 AM, Stephen Frost wrote:
> Greetings,
>
> * David Rowley (dgrowleyml@gmail.com) wrote:
>> On Thu, 11 Jun 2020 at 10:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Thomas Munro <thomas.munro@gmail.com> writes:
>>>> I've been doing that in a little database that pulls down the results
>>>> and analyses them with primitive regexes.  First I wanted to know the
>>>> pass/fail history for each individual regression, isolation and TAP
>>>> script, then I wanted to build something that could identify tests
>>>> that are 'flapping', and work out when they started and stopped
>>>> flapping etc.  I soon realised it was all too noisy, but then I
>>>> figured that I could fix that by detecting crashes.  So I classify
>>>> every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
>>>> top level run was CRASH, then I can disregard the individual per
>>>> script results, because they're all BS.
>>> If you can pin the crash on a particular test script, it'd be useful
>>> to track that as a kind of failure.  In general, though, both crashes
>>> and non-crash failures tend to cause collateral damage to later test
>>> scripts --- if you can't filter that out then the later scripts will
>>> have high false-positive rates.
>> I guess the fact that you've both needed to do analysis on individual
>> tests shows that there might be a call for this beyond just recording
>> the test's runtime.
>>
>> If we had a table that stored the individual test details, pass/fail
>> and just stored the timing information along with that, then, even if
>> the timing was unstable, it could still be useful for some analysis.
>> I'd be happy enough even if that was only available as a csv file
>> download.  I imagine the buildfarm does not need to provide us with
>> any tools for doing analysis on this. Ideally, there would be some
>> run_id that we could link it back to the test run which would give us
>> the commit SHA, and the animal that it ran on. Joining to details
>> about the animal could be useful too, e.g perhaps a certain test
>> always fails on 32-bit machines.
>>
>> I suppose that maybe we could modify pg_regress to add a command line
>> option to have it write out a machine-readable file, e.g:
>> testname,result,runtime\n, then just have the buildfarm client ship
>> that off to the buildfarm server to record in the database.
> That seems like it'd be the best approach to me, though I'd defer to
> Andrew on it.
>
> By the way, if you'd like access to the buildfarm archive server where
> all this stuff is stored, that can certainly be arranged, just let me
> know.
>


Yeah, we'll need to work out where to stash the file. The client will
pick up anything in src/regress/log for "make check", but would need
adjusting for other steps that invoke pg_regress. I'm getting close to
cutting a new client release, but I can delay it till we settle this.


On the server side, we could add a table with a key of <animal,
snapshot, branch, step, testname> but we'd need to make sure those test
names were unique. Maybe we need a way of telling pg_regress to prepend
a module name (e.g. btree_gist or plperl) to the test name.
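
Purely as a sketch (none of these names exist anywhere yet), with the
module prefix folded into the test name to keep the key unique:

    create table per_test_times (
        animal   text      not null,
        snapshot timestamp not null,
        branch   text      not null,
        step     text      not null,  -- e.g. 'check' or 'install-check-C'
        testname text      not null,  -- e.g. 'join' or 'btree_gist/int4'
        result   text      not null,  -- 'ok' / 'failed'
        ms       integer,
        primary key (animal, snapshot, branch, step, testname)
    );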


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Recording test runtimes with the buildfarm

From
Magnus Hagander
Date:
On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:

On 6/11/20 10:21 AM, Stephen Frost wrote:
> Greetings,
>
> * David Rowley (dgrowleyml@gmail.com) wrote:
>> On Thu, 11 Jun 2020 at 10:02, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Thomas Munro <thomas.munro@gmail.com> writes:
>>>> I've been doing that in a little database that pulls down the results
>>>> and analyses them with primitive regexes.  First I wanted to know the
>>>> pass/fail history for each individual regression, isolation and TAP
>>>> script, then I wanted to build something that could identify tests
>>>> that are 'flapping', and work out when they started and stopped
>>>> flapping etc.  I soon realised it was all too noisy, but then I
>>>> figured that I could fix that by detecting crashes.  So I classify
>>>> every top level build farm run as SUCCESS, FAILURE or CRASH.  If the
>>>> top level run was CRASH, then I can disregard the individual per
>>>> script results, because they're all BS.
>>> If you can pin the crash on a particular test script, it'd be useful
>>> to track that as a kind of failure.  In general, though, both crashes
>>> and non-crash failures tend to cause collateral damage to later test
>>> scripts --- if you can't filter that out then the later scripts will
>>> have high false-positive rates.
>> I guess the fact that you've both needed to do analysis on individual
>> tests shows that there might be a call for this beyond just recording
>> the test's runtime.
>>
>> If we had a table that stored the individual test details, pass/fail
>> and just stored the timing information along with that, then, even if
>> the timing was unstable, it could still be useful for some analysis.
>> I'd be happy enough even if that was only available as a csv file
>> download.  I imagine the buildfarm does not need to provide us with
>> any tools for doing analysis on this. Ideally, there would be some
>> run_id that we could link it back to the test run which would give us
>> the commit SHA, and the animal that it ran on. Joining to details
>> about the animal could be useful too, e.g perhaps a certain test
>> always fails on 32-bit machines.
>>
>> I suppose that maybe we could modify pg_regress to add a command line
>> option to have it write out a machine-readable file, e.g:
>> testname,result,runtime\n, then just have the buildfarm client ship
>> that off to the buildfarm server to record in the database.
> That seems like it'd be the best approach to me, though I'd defer to
> Andrew on it.
>
> By the way, if you'd like access to the buildfarm archive server where
> all this stuff is stored, that can certainly be arranged, just let me
> know.
>


Yeah, we'll need to work out where to stash the file. The client will
pick up anything in src/regress/log for "make check", but would need
adjusting for other steps that invoke pg_regress. I'm getting close to
cutting a new client release, but I can delay it till we settle this.


On the server side, we could add a table with a key of <animal,
snapshot, branch, step, testname> but we'd need to make sure those test
names were unique. Maybe we need a way of telling pg_regress to prepend
a module name (e.g. btree_gist or plperl) to the test name.

It seems pretty trivial to for example get all the steps out of check.log and their timing with a regexp. I just used '^(?:test)?\s+(\S+)\s+\.\.\. ok\s+(\d+) ms$' as the regexp. Running that against a few hundred build runs in the db generally looks fine, though I didn't look into it in detail. 

Of course, that only looked at check.log, and more logic would be needed if we want to look into the other areas as well, but as long as it's pg_regress output I think it should be easy?
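
In SQL that scrape is just a set-returning regexp_matches() call. A rough
sketch, with build_logs(animal, snapshot, branch, log_text) standing in for
wherever the raw check.log text lives (not the real schema):

    select animal, snapshot, branch,
           m[1]          as testname,
           m[2]::integer as ms
    from build_logs,
         regexp_matches(log_text,
                        '^(?:test)?\s+(\S+)\s+\.\.\. ok\s+(\d+) ms$',
                        'gm') as m;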

--

Re: Recording test runtimes with the buildfarm

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <
> andrew.dunstan@2ndquadrant.com> wrote:
>> Yeah, we'll need to work out where to stash the file. The client will
>> pick up anything in src/regress/log for "make check", but would need
>> adjusting for other steps that invoke pg_regress. I'm getting close to
>> cutting a new client release, but I can delay it till we settle this.

> It seems pretty trivial to for example get all the steps out of check.log
> and their timing with a regexp.

Yeah, I don't see why we can't scrape this data from the existing
buildfarm output, at least for the core regression tests.  New
infrastructure could make it easier/cheaper, but I don't think we
should invest in that until we've got a provably useful tool.

            regards, tom lane



Re: Recording test runtimes with the buildfarm

From
Magnus Hagander
Date:


On Thu, Jun 11, 2020 at 6:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
> On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <
> andrew.dunstan@2ndquadrant.com> wrote:
>> Yeah, we'll need to work out where to stash the file. The client will
>> pick up anything in src/regress/log for "make check", but would need
>> adjusting for other steps that invoke pg_regress. I'm getting close to
>> cutting a new client release, but I can delay it till we settle this.

> It seems pretty trivial to for example get all the steps out of check.log
> and their timing with a regexp.

Yeah, I don't see why we can't scrape this data from the existing
buildfarm output, at least for the core regression tests.  New
infrastructure could make it easier/cheaper, but I don't think we
should invest in that until we've got a provably useful tool.

So spending a few minutes to look at my data, it is not quite that easy if we want to look at contrib checks for example. Both btree_gin and btree_gist have checks called "int4" for example, and the aforementioned regexp won't pick that up. But that's surely a fixable problem.

And perhaps we should at least start off with data for just "make check" to see if it's useful, before investing too much work?
 
--

Re: Recording test runtimes with the buildfarm

From
Andrew Dunstan
Date:
On 6/11/20 12:32 PM, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <
>> andrew.dunstan@2ndquadrant.com> wrote:
>>> Yeah, we'll need to work out where to stash the file. The client will
>>> pick up anything in src/regress/log for "make check", but would need
>>> adjusting for other steps that invoke pg_regress. I'm getting close to
>>> cutting a new client release, but I can delay it till we settle this.
>> It seems pretty trivial to for example get all the steps out of check.log
>> and their timing with a regexp.
> Yeah, I don't see why we can't scrape this data from the existing
> buildfarm output, at least for the core regression tests.  New
> infrastructure could make it easier/cheaper, but I don't think we
> should invest in that until we've got a provably useful tool.


OK, that makes my life fairly easy. Are you wanting this to be
automated, or just left as an exercise for researchers for now?


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Recording test runtimes with the buildfarm

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 6/11/20 12:32 PM, Tom Lane wrote:
>> Yeah, I don't see why we can't scrape this data from the existing
>> buildfarm output, at least for the core regression tests.  New
>> infrastructure could make it easier/cheaper, but I don't think we
>> should invest in that until we've got a provably useful tool.

> OK, that makes my life fairly easy. Are you wanting this to be
> automated, or just left as an exercise for researchers for now?

I'm envisioning the latter --- I think somebody should prove that
useful results can be obtained before we spend any effort on making it
easier to gather the input numbers.  Mind you, I'm not saying that
I don't believe it's possible to get good results.  But building
infrastructure in advance of a solid use-case is a recipe for
building the wrong thing.

            regards, tom lane



Re: Recording test runtimes with the buildfarm

From
Fabien COELHO
Date:
Hello Tom,

> I have in the past scraped the latter results and tried to make sense of 
> them.  They are *mighty* noisy, even when considering just one animal 
> that I know to be running on a machine with little else to do.  Maybe 
> averaging across the whole buildfarm could reduce the noise level, but 
> I'm not very hopeful.

I'd try with median instead of average, so that bad cases due to animal 
overloading are ignored.
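
With per-test timings scraped into a hypothetical run_details(testname, ms)
table, that's just an ordered-set aggregate in place of avg():

    select testname,
           percentile_cont(0.5) within group (order by ms) as median_ms
    from run_details
    group by testname;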

-- 
Fabien.



Re: Recording test runtimes with the buildfarm

From
David Rowley
Date:
On Fri, 12 Jun 2020 at 04:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Magnus Hagander <magnus@hagander.net> writes:
> > On Thu, Jun 11, 2020 at 4:56 PM Andrew Dunstan <
> > It seems pretty trivial to for example get all the steps out of check.log
> > and their timing with a regexp.
>
> Yeah, I don't see why we can't scrape this data from the existing
> buildfarm output, at least for the core regression tests.  New
> infrastructure could make it easier/cheaper, but I don't think we
> should invest in that until we've got a provably useful tool.

Looking at the data mentioned in the logs for install-check-C, it
seems some animals are quite stable with their timings but others are
quite unstable.

Taking the lowest half of the timings of each animal/test combination,
where the animal ran the test at least 50 times, the top 3 animals
with the most consistent timing are:

postgres=# select animal, avg(stddev)
           from (select animal, testname, stddev(ms)
                 from (select animal, testname,
                              unnest(ar[:array_length(ar,1)/2]) ms
                       from (select animal, testname,
                                    array_agg(ms order by ms) ar
                             from run_details
                             group by animal, testname) a) b
                 group by 1, 2
                 having count(*) > 50) c
           group by 1
           order by avg(stddev)
           limit 3;
   animal   |        avg
------------+-------------------
 mule       | 4.750935819647279
 massasauga | 5.410419286413067
 eider      |  6.06834118301505
(3 rows)

And the least consistent 3 are:

postgres=# select animal, avg(stddev)
           from (select animal, testname, stddev(ms)
                 from (select animal, testname,
                              unnest(ar[:array_length(ar,1)/2]) ms
                       from (select animal, testname,
                                    array_agg(ms order by ms) ar
                             from run_details
                             group by animal, testname) a) b
                 group by 1, 2
                 having count(*) > 50) c
           group by 1
           order by avg(stddev) desc
           limit 3;
  animal  |        avg
----------+-------------------
 lorikeet | 830.9292062818336
 gharial  | 725.9874198764217
 dory     | 683.5182140482121
(3 rows)

If I look at a random test from mule:

postgres=# select commit,time,result,ms from run_details where animal
= 'mule' and testname = 'join' order by time desc limit 10;
 commit  |        time         | result | ms
---------+---------------------+--------+-----
 e532b1d | 2020-06-15 19:30:03 | ok     | 563
 7a3543c | 2020-06-15 15:30:03 | ok     | 584
 3baa7e3 | 2020-06-15 11:30:03 | ok     | 596
 47d4d0c | 2020-06-15 07:30:03 | ok     | 533
 decbe2b | 2020-06-14 15:30:04 | ok     | 535
 378badc | 2020-06-14 07:30:04 | ok     | 563
 23cbeda | 2020-06-13 19:30:04 | ok     | 557
 8f5b596 | 2020-06-13 07:30:04 | ok     | 553
 6472572 | 2020-06-13 03:30:04 | ok     | 580
 9a7fccd | 2020-06-12 23:30:04 | ok     | 561
(10 rows)

and from lorikeet:

postgres=# select commit,time,result,ms from run_details where animal
= 'lorikeet' and testname = 'join' order by time desc limit 10;
 commit  |        time         | result |  ms
---------+---------------------+--------+------
 47d4d0c | 2020-06-15 08:28:35 | ok     | 8890
 decbe2b | 2020-06-14 20:28:33 | ok     | 8878
 378badc | 2020-06-14 08:28:35 | ok     | 8854
 cc07264 | 2020-06-14 05:22:59 | ok     | 8883
 8f5b596 | 2020-06-13 10:36:14 | ok     | 8942
 2f48ede | 2020-06-12 20:28:41 | ok     | 8904
 ffd2582 | 2020-06-12 08:29:52 | ok     | 2016
 7aa4fb5 | 2020-06-11 23:21:26 | ok     | 1939
 ad9291f | 2020-06-11 09:56:48 | ok     | 1924
 c2bd1fe | 2020-06-10 23:01:42 | ok     | 1873
(10 rows)

I can supply the data I used for this, just send me an offlist email.
It's about 19MB using bzip2.

I didn't look at the make check data and I do see some animals use the
parallel_schedule for make installcheck, which my regex neglected to
account for, so that data is missing from the set.

David