Re: a very significant fraction of the buildfarm is now pink - Mailing list pgsql-hackers

From: James Hunter
Subject: Re: a very significant fraction of the buildfarm is now pink
Date:
Msg-id: CAJVSvF4TDkq+uYGngvBQd01ujmbPKKZQp3OBM+hfMUWUrfEfGQ@mail.gmail.com
In response to: a very significant fraction of the buildfarm is now pink (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Fri, Feb 21, 2025 at 5:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Feb 21, 2025 at 7:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > A very significant fraction of the buildfarm is now pink.
> > If you don't have a fix pretty nearly ready, please revert.
>
> When we're going to do a release, you want no commits for at least 24
> hours before the release so that we can make sure the buildfarm is
> clean. But when I commit something and the buildfarm fails, you want
> it reverted within a handful of hours before I can even be sure of
> having seen all the failures the commit caused, let alone had time to
> think about what the best fix might be. It doesn't make sense to me
> that we need 24 hours to be sure that the buildfarm is passing, but in
> 3 hours I'm supposed to see all the failures -- including the ones
> that only happen on buildfarm animals that run once a day, I guess? --
> and analyze them -- and decide on a fix -- naturally without any sort
> of discussion because there's no time for that -- and code the fix --
> and push it. I have a really hard time seeing how that's a reasonable
> expectation.
>
> I understand that the buildfarm can't be red all the time or nobody
> can distinguish the problems they caused from preexisting ones.

I think you're right, and the solution has to be something like:
1. If it's > 24 hours until a release, then test failures on the buildfarm
have to be tolerated.
2. The buildfarm needs to track failures at a fine enough granularity that
I can tell whether my change caused a given failure.

I've worked at a few database companies -- some had workable test
frameworks, others didn't. The biggest mistake some companies made was
to collapse an entire test suite into a single, binary result: "ALL
PASSED" vs. "AT LEAST ONE FAILED." (It's fine to collapse the test
suite *before you merge your commit,* because then you are testing
*only* your commit. After you merge it, however, CI / buildfarm ends
up testing *multiple* new commits at the same time, so the tests need
to report at finer granularity.)

If we can't do CI / buildfarm at the granularity of a single commit
(and, at some point, no organization can -- the test universe becomes
too large!), then we have to design for the possibility that one or
more of multiple commits will cause a test failure, while the
remaining commits are innocent. And so the test framework has to
identify the (likely) culprit for a particular new test failure, and
then give the author *enough time* to fix that failure. And while the
author is fixing his failure, other authors need to be able to merge
their commits, etc.

(Note that actual *compilation* failures are more severe than test
failures, because if PostgreSQL can't compile, no one who rebases onto
the tip of the branch can get any work done. But I think you are
discussing tests, not compilation.)

One company I worked for tracked test failures by individual unit
test. So, the "parallel" schedule would have 220+ tests, each reported as
SUCCESS, [EXISTING] FAILURE, or NEW FAILURE. On every NEW FAILURE,
the system notified the unit test owner (this is feasible inside a
corporation, but perhaps not for an open source community) and
concurrently started running the equivalent of "git bisect" to
identify the offending commit. Once it identified (what it thought
was) the offending commit, it notified that committer as well.
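
A minimal sketch of that kind of bookkeeping (the data format, helper
names, and example test names below are mine, purely for illustration,
not anything the buildfarm actually does): compare each run's per-test
results against a stored baseline, and only a NEW FAILURE triggers
notification and bisection.

from enum import Enum

class Status(Enum):
    SUCCESS = "success"
    EXISTING_FAILURE = "existing failure"   # was already failing before this run
    NEW_FAILURE = "new failure"             # candidate for notification / bisection

def classify(results, baseline_failures):
    """results: dict mapping test name -> passed? (bool);
    baseline_failures: set of tests already failing on the previous run."""
    out = {}
    for test, passed in results.items():
        if passed:
            out[test] = Status.SUCCESS
        elif test in baseline_failures:
            out[test] = Status.EXISTING_FAILURE
        else:
            out[test] = Status.NEW_FAILURE
    return out

if __name__ == "__main__":
    baseline = {"tablespace"}   # hypothetical: already red before this run
    today = {"boolean": True, "tablespace": False, "subscription": False}
    for test, status in classify(today, baseline).items():
        print(f"{test}: {status.value}")
    # Only "subscription" comes out as a NEW FAILURE, so only its likely
    # culprit commit(s) get flagged, e.g. by an automated "git bisect run".

The point is just that "the buildfarm is red" becomes "test X newly
failed," which is something you can plausibly attribute to a commit.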

> In the case of my commit today, the failures are the result of a
> 2-line regression diff with no functional impact that neither CI nor
> any of the 11 reviewers noticed.

This is exactly the sort of case where tracking NEW FAILURE at the
unit test level is useful. So you made a change that caused a
regression diff, and the new output is better than the old (or at
least neutral). Flag this individual unit test, send it to Robert, and
development continues.

Given enough unit tests, along with reasonable pre-commit CI testing,
the probability of two commits causing regressions / diffs in the same
unit test, over a 1-week period (let's say), is very small.
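
A back-of-the-envelope illustration of why (every number below is an
assumption I'm making up for the sketch, not a measurement of the actual
buildfarm):

from math import comb

n_tests = 220            # tests in the "parallel" schedule, per the example above
commits_per_week = 50    # assumed commit volume on master
p_regress = 0.05         # assumed chance a commit slips a regression past pre-commit CI

# If a slipped regression lands in a uniformly random test, two given commits
# collide (break the same test) with probability about p_regress**2 / n_tests.
expected_collisions = comb(commits_per_week, 2) * p_regress**2 / n_tests
print(f"expected same-test collisions per week: {expected_collisions:.3f}")
# ~0.014, i.e. roughly one ambiguous collision every 70 weeks under these assumptions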

So: +1 to your proposal.

James


