Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED
Msg-id CA+hUKGL6SrmFV7j1kTkfxyQ0ed_V1pcC36PwYpudCynHRRD32g@mail.gmail.com
In response to Re: windows CI failing PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED  (Andrew Dunstan <andrew@dunslane.net>)
List pgsql-hackers
On Mon, Feb 20, 2023 at 2:46 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> On 2023-02-17 Fr 19:27, Thomas Munro wrote:
>> FWIW I have a thing I call bfbot for slurping up similar
>> data from the build farm.  It's not pretty enough for public
>> consumption, but I do know that this assertion hasn't failed there,
>> except the cases I mentioned earlier, and a load of failures on
>> lorikeet which was completely b0rked until recently.
>
> Are there things we need to do on the server side to make data extraction easier?

It's a good question.

One thought Andres mentioned to me is whether we might want to have an
in-tree tool to find interesting stuff.  That is, even locally during
development, but also in the CI + buildfarm, a common tool could find
and spit out human- and machine-readable highlights (backtraces,
PANICs, assertions, ... like cfbot is now doing).  Then the knowledge
of what's interesting would be maintained and extended by all of us.
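
To make the idea concrete, here is a minimal sketch of such a scanner. It is not what cfbot does; the pattern table, the labels and the function name find_highlights are all made up for illustration, and a real tool would need a much richer set of patterns.

```python
import json
import re

# Hypothetical pattern table.  The labels and regexes are illustrative
# assumptions, not the patterns cfbot's highlight analyser really uses.
PATTERNS = [
    ("assertion", re.compile(r"TRAP: ")),     # assertion failure in a server log
    ("panic", re.compile(r"\bPANIC: ")),      # PANIC-level log message
    ("backtrace", re.compile(r"^#\d+\s+")),   # debugger-style stack frame line
]


def find_highlights(lines):
    """Return (line_number, label, text) for every line matching a pattern."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for label, regex in PATTERNS:
            if regex.search(line):
                hits.append((lineno, label, line.rstrip("\n")))
                break  # the first matching label wins for this line
    return hits


if __name__ == "__main__":
    sample = [
        "LOG:  database system is ready to accept connections\n",
        "TRAP: failed Assert(\"example\"), File: \"example.c\", Line: 1\n",
        "PANIC:  example panic message\n",
    ]
    for lineno, label, text in find_highlights(sample):
        # One JSON object per line is machine-readable and still greppable.
        print(json.dumps({"line": lineno, "kind": label, "text": text}))
```

Keeping the patterns in one table is the point: adding a new thing to look for is a one-line change that everyone benefits from.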

On the other hand, as we think of new patterns over time to look out
for, it's also nice to be able to re-scan old data to see if the new
patterns occurred in the past (I've done this several times with
cfbot's new highlight analyser as I corrected mistakes and added
patterns).  So maybe that's also a good idea, but a separate thing.
Even if the analyser logic is not in-tree, we could try to make
something that works pretty much the same across CI and BF.  Perhaps
we could think about some of those ideas once the BF is using meson?
Aside from having just one system to think about, the meson build
system is a bit more structured: it has a more formal concept of test
suites and tests with machine readable results from the top level
(JSON files etc), with names strictly corresponding to directories
where the output is, etc.  I think I'd basically want a complete list
of available files (= like the artifacts on CI), and then I'd pull
down the meson test result file and then decide which other files I
also want to pull down (i.e. stuff relating to failed tests) to analyse.
(Not that any of that is intractable with the autoconf or handrolled
perl/MSVC stuff, it's just messier, and hard to get motivated when its
days are numbered.)
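
Roughly what I have in mind, as a sketch only. It assumes the result file is meson's testlog.json (one JSON object per line, with "name" and "result" fields), and it assumes test names look like "project:suite / suite/test" and map onto an output directory ending in suite/test. The function names and the path-matching rule are hypothetical.

```python
import json

# Results that do not call for pulling down extra files.
UNINTERESTING = {"OK", "SKIP", "EXPECTEDFAIL"}


def failed_tests(testlog_lines):
    """Return the names of tests whose result is worth investigating."""
    failed = []
    for line in testlog_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("result") not in UNINTERESTING:
            failed.append(record["name"])
    return failed


def files_to_fetch(artifact_paths, failed_names):
    """Pick the artifacts that live under a failed test's directory."""
    wanted = []
    for name in failed_names:
        # Assumption: "project:suite / suite/test" -> directory "suite/test".
        test_dir = name.split(" / ")[-1]
        for path in artifact_paths:
            if f"/{test_dir}/" in f"/{path}" and path not in wanted:
                wanted.append(path)
    return wanted
```

So the client would fetch the file list and the result file first, and only then the (possibly large) logs for whatever failed.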

One little thing I remembered while looking into this general topic is
the noise you get when we crash during pg_regress, which it'd be nice
to fix:

https://www.postgresql.org/message-id/flat/CA%2BhUKGL7hxqbadkto7e1FCOLQhuHg%3DwVn_PDZd6fDMbQrrZisA%40mail.gmail.com

Another topic I'm interested in is how to find useful signals in the
timing data.  For example, when Nathan and I worked on walreceiver
wakeup improvements, we didn't notice that we'd caused some tests to
become dramatically slower, because of a pre-existing bug/thinko we
hadn't noticed.  I want a computer to tell me about this stuff.
That's somewhat tricky because of all the noise, but hopefully it's
not beyond the powers of statistics to notice that a test unexpectedly
took a nap for 10s.
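
As a very rough sketch of the kind of check I mean, one could compare a test's latest duration against the median of its recent history, scaled by the median absolute deviation so that a few noisy runs don't dominate. The function name, the threshold and the half-second floor are all assumptions picked for illustration, not tuned on real data.

```python
from statistics import median


def is_slow_outlier(history, latest, threshold=5.0, min_scale=0.5):
    """Return True if `latest` (seconds) is far above the historical median.

    history:   recent durations of the same test on the same machine
    threshold: how many scaled deviations above the median counts as odd
    min_scale: floor (seconds) so near-constant histories don't flag jitter
    """
    med = median(history)
    mad = median(abs(x - med) for x in history)
    # 1.4826 makes the MAD comparable to a standard deviation for
    # normally distributed data; the floor guards against a MAD near zero.
    scale = max(1.4826 * mad, min_scale)
    return (latest - med) / scale > threshold


if __name__ == "__main__":
    durations = [2.1, 2.0, 2.3, 1.9, 2.2, 2.0]
    print(is_slow_outlier(durations, 12.2))  # a run that napped for ~10s
    print(is_slow_outlier(durations, 2.4))   # ordinary noise
```

This only looks at one test on one machine at a time; anything smarter (trends, comparing across animals) would need more thought.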
