Thread: Probable CF bot degradation

Probable CF bot degradation

From
Pavel Borisov
Date:
Hi, hackers!
I've noticed that CF bot hasn't been running active branches since yesterday:
https://github.com/postgresql-cfbot/postgresql/branches/active

Also, there are no new results on the current CF page on cputube.
I don't know if it is a problem or some kind of scheduled maintenance though.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Probable CF bot degradation

From
Julien Rouhaud
Date:
Hi,

On Fri, Mar 18, 2022 at 07:43:47PM +0400, Pavel Borisov wrote:
> Hi, hackers!
> I've noticed that CF bot hasn't been running active branches since yesterday:
> https://github.com/postgresql-cfbot/postgresql/branches/active
> 
> Also, there are no new results on the current CF page on cputube.
> I don't know if it is a problem or some kind of scheduled maintenance though.

There was a github incident yesterday that was resolved a few hours ago ([1]);
maybe the cfbot didn't like that.

[1] https://www.githubstatus.com/incidents/dcnvr6zym66r



Re: Probable CF bot degradation

From
Thomas Munro
Date:
On Sat, Mar 19, 2022 at 5:07 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Fri, Mar 18, 2022 at 07:43:47PM +0400, Pavel Borisov wrote:
> > Hi, hackers!
> > I've noticed that CF bot hasn't been running active branches since yesterday:
> > https://github.com/postgresql-cfbot/postgresql/branches/active
> >
> > Also, there are no new results on the current CF page on cputube.
> > I don't know if it is a problem or some kind of scheduled maintenance though.
>
> There was a github incident yesterday that was resolved a few hours ago ([1]);
> maybe the cfbot didn't like that.

Yeah, for a while it was seeing:

remote: Internal Server Error
To github.com:postgresql-cfbot/postgresql.git
 ! [remote rejected]       commitfest/37/3489 -> commitfest/37/3489
(Internal Server Error)
error: failed to push some refs to 'github.com:postgresql-cfbot/postgresql.git'

Unfortunately cfbot didn't handle that failure very well and it was
waiting for a long timeout before scheduling more jobs.  It's going
again now, and I'll try to make it more resilient against that type of
failure...
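
Roughly what I have in mind is something like the following -- just a sketch,
not the actual cfbot code (the function and names here are invented): retry
the push a couple of times with a short backoff and then give up on that one
branch, rather than stalling the whole scheduler.

import subprocess
import time

def push_branch_with_retry(remote, branch, attempts=3, base_delay=5):
    # Sketch only: try a push, retry transient failures briefly, and if it
    # still fails, skip this branch for now instead of blocking the scheduler.
    for attempt in range(1, attempts + 1):
        result = subprocess.run(["git", "push", remote, branch],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True
        print("push of %s failed (attempt %d): %s"
              % (branch, attempt, result.stderr.strip()))
        time.sleep(base_delay * attempt)
    return False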



Re: Probable CF bot degradation

From
Pavel Borisov
Date:
> remote: Internal Server Error
> To github.com:postgresql-cfbot/postgresql.git
>  ! [remote rejected]       commitfest/37/3489 -> commitfest/37/3489
> (Internal Server Error)
> error: failed to push some refs to 'github.com:postgresql-cfbot/postgresql.git'

I am seeing commitfest/37/3489 in the "triggered" state for a long time. No progress is seen on this branch, though I started to see successful runs on the other branches now.
Could you check this particular branch and maybe restart it manually?

> Unfortunately cfbot didn't handle that failure very well and it was
> waiting for a long timeout before scheduling more jobs.  It's going
> again now, and I'll try to make it more resilient against that type of
> failure...

Thanks a lot!

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Probable CF bot degradation

From
Thomas Munro
Date:
On Sat, Mar 19, 2022 at 9:41 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>>
>> remote: Internal Server Error
>> To github.com:postgresql-cfbot/postgresql.git
>>  ! [remote rejected]       commitfest/37/3489 -> commitfest/37/3489
>> (Internal Server Error)
>> error: failed to push some refs to 'github.com:postgresql-cfbot/postgresql.git'
>
> I am seeing commitfest/37/3489 in the "triggered" state for a long time. No progress is seen on this branch,
> though I started to see successful runs on the other branches now.
> Could you check this particular branch and maybe restart it manually?

I don't seem to have a way to delete that...  it looks like when
github told us "Internal Server Error", it had partially succeeded and
the new branch (partially?) existed, but something was b0rked and it
confused Cirrus.  🤷  There is already another build for 3489 that is
almost finished now so I don't think that stale TRIGGERED one is
stopping anything from working and I guess it will eventually go away
by itself somehow...



Re: Probable CF bot degradation

From
Pavel Borisov
Date:
> confused Cirrus.  🤷  There is already another build for 3489 that is
> almost finished now so I don't think that stale TRIGGERED one is
> stopping anything from working and I guess it will eventually go away
> by itself somehow...

Indeed, I see it now. No problem anymore.
Thanks!

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Probable CF bot degradation

From
Matthias van de Meent
Date:
On Fri, 18 Mar 2022 at 19:52, Thomas Munro <thomas.munro@gmail.com> wrote:
> Unfortunately cfbot didn't handle that failure very well and it was
> waiting for a long timeout before scheduling more jobs.  It's going
> again now, and I'll try to make it more resilient against that type of
> failure...

I noticed that two of my patches (37/3543 and 37/3542) both failed due
to a bad commit on master (076f4d9). The issue was fixed an hour later
with b61e6214; but the pipeline for these patches hasn't run since.
Because doing a no-op update would only clutter people's inboxes, I
was waiting for CFBot to do its regular bitrot check; but that hasn't
happened yet after 4 days.
I understand that this is probably due to the high rate of new patch
revisions that get priority in the queue; but that doesn't give me the
information I'm looking for in this case.

Would you know how long the expected bitrot re-check period for CF
entries that haven't been updated is, or could the bitrot-checking
queue be displayed somewhere to indicate the position of a patch in
this queue?
Additionally, are there plans to validate commits of the main branch
before using them as a base for CF entries, so that "bad" commits on
master won't impact CFbot results as easily?

Kind regards,

Matthias van de Meent



Re: Probable CF bot degradation

From
Julien Rouhaud
Date:
On Sun, Mar 20, 2022 at 01:58:01PM +0100, Matthias van de Meent wrote:
>
> I noticed that two of my patches (37/3543 and 37/3542) both failed due
> to a bad commit on master (076f4d9). The issue was fixed an hour later
> with b61e6214; but the pipeline for these patches hasn't run since.
> Because doing a no-op update would only clutter people's inboxes, I
> was waiting for CFBot to do its regular bitrot check; but that hasn't
> happened yet after 4 days.
> I understand that this is probably due to the high rate of new patch
> revisions that get priority in the queue; but that doesn't give me the
> information I'm looking for in this case.

Just in case, if you only want to know whether the cfbot would be happy with
your patches you can run the exact same checks using a personal github repo, as
documented at src/tools/ci/README.

You could also send the URL of a successful run on the related threads, or add
it as an annotation on the cf entries, to let possible reviewers know that the
patch is still in good shape even if the cfbot is currently still broken.



Re: Probable CF bot degradation

From
Thomas Munro
Date:
On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Would you know how long the expected bitrot re-check period for CF
> entries that haven't been updated is, or could the bitrot-checking
> queue be displayed somewhere to indicate the position of a patch in
> this queue?

I see that your patches were eventually retested.

It was set to try to recheck every ~48 hours, though it couldn't quite
always achieve that when the total number of eligible submissions is
too large.  In this case it had stalled for too long after the github
outage, which I'm going to try to improve.  The reason for the 48+
hour cycle is the Windows tests now take ~25 minutes (since we started
actually running all the tests on that platform), and we could only
have two Windows tasks running at a time in practice, because the
limit for Windows was 8 CPUs, and we use 4 for each task, which means
we could only test ~115 branches per day, or actually a shade fewer
because it's pretty dumb and only wakes up once a minute to decide
what to do, and we currently have 242 submissions (though some don't
apply, those are free, so the number varies over time...).  There are
limits on the Unixes too but they are more generous, and the Unix
tests only take 4-10 minutes, so we can ignore that for now, it's all
down to Windows.
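
(Back-of-the-envelope, in case anyone wants to check my arithmetic -- this
just assumes 25 minute Windows runs and two concurrent Windows tasks:)

minutes_per_windows_run = 25
concurrent_windows_tasks = 2
runs_per_day = (24 * 60 / minutes_per_windows_run) * concurrent_windows_tasks
print(runs_per_day)  # 115.2, i.e. the ~115 branches/day mentioned above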

I had been meaning to stump up the USD$10/month it costs to double the
CPU limits from the basic free Cirrus account, and I've just now done
that and told cfbot it's allowed to test 4 branches at once and to try
to test every branch every 24 hours.  Let's see how that goes.

Here's hoping we can cut down the time it takes to run the tests on
Windows... there's some really dumb stuff happening there.  Top items
I'm aware of:  (1) general lack of test concurrency, (2) exec'ing new
backends is glacially slow on that OS but we do it for every SQL
statement in the TAP tests and every regression test script (I have
some patches for this to share after the code freeze).

> Additionally, are there plans to validate commits of the main branch
> before using them as a base for CF entries, so that "bad" commits on
> master won't impact CFbot results as easily?

How do you see this working?

I have wondered about some kind of way to click a button to say "do
this one again now", but I guess that sort of user interaction should
ideally happen after merging this thing into the Commitfest app,
because it already has auth, and interactive Python/Django web stuff.



Re: Probable CF bot degradation

From
Thomas Munro
Date:
On Mon, Mar 21, 2022 at 12:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Would you know how long the expected bitrot re-check period for CF
> > entries that haven't been updated is, or could the bitrot-checking
> > queue be displayed somewhere to indicate the position of a patch in
> > this queue?

Also, as for the show-me-the-queue page, yeah that's a good idea and
quite feasible.  I'll look into that in a bit.

> > Additionally, are there plans to validate commits of the main branch
> > before using them as a base for CF entries, so that "bad" commits on
> > master won't impact CFbot results as easily?
>
> How do you see this working?

[Now with more coffee on board]  Oh, right, I see, you're probably
thinking that we could look at
https://github.com/postgres/postgres/commits/master and take the most
recent passing commit as a base.  Hmm, interesting idea.
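
Something like this could work, I think -- a rough, untested sketch using the
GitHub REST API, assuming CI statuses are actually reported against the
postgres/postgres mirror (no auth or rate-limit handling here):

import requests

API = "https://api.github.com/repos/postgres/postgres"

def latest_passing_master_commit(max_commits=20):
    # Walk recent master commits and return the newest one whose combined
    # status is "success"; fall back to None (i.e. plain master tip) if
    # nothing in the window has passed.
    commits = requests.get("%s/commits" % API,
                           params={"sha": "master", "per_page": max_commits}).json()
    for commit in commits:
        status = requests.get("%s/commits/%s/status" % (API, commit["sha"])).json()
        if status.get("state") == "success":
            return commit["sha"]
    return None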



Re: Probable CF bot degradation

From
Andres Freund
Date:
Hi,

On 2022-03-21 12:23:02 +1300, Thomas Munro wrote:
> It was set to try to recheck every ~48 hours, though it couldn't quite
> always achieve that when the total number of eligible submissions is
> too large.  In this case it had stalled for too long after the github
> outage, which I'm going to try to improve.  The reason for the 48+
> hour cycle is the Windows tests now take ~25 minutes (since we started
> actually running all the tests on that platform)

I see 26-28 minutes regularly :(. And that doesn't even include the "boot
time" of the test, around 3-4 min, which is quite a bit higher for Windows
than for the other OSs.


> and we could only
> have two Windows tasks running at a time in practice, because the
> limit for Windows was 8 CPUs, and we use 4 for each task, which means
> we could only test ~115 branches per day, or actually a shade fewer
> because it's pretty dumb and only wakes up once a minute to decide
> what to do, and we currently have 242 submissions (though some don't
> apply, those are free, so the number varies over time...).  There are
> limits on the Unixes too but they are more generous, and the Unix
> tests only take 4-10 minutes, so we can ignore that for now, it's all
> down to Windows.

I wonder if it's worth using the number of concurrently running Windows tasks
as the limit, rather than the number of commits being tested
concurrently. It's not rare for Windows to fail more quickly than other
OSs. But probably the 4 concurrent tests are good enough for now...

I'd love to merge the patch adding mingw CI testing, which'd increase the
pressure substantially :/


> I had been meaning to stump up the USD$10/month it costs to double the
> CPU limits from the basic free Cirrus account, and I've just now done
> that and told cfbot it's allowed to test 4 branches at once and to try
> to test every branch every 24 hours.  Let's see how that goes.

Yay.


> Here's hoping we can cut down the time it takes to run the tests on
> Windows... there's some really dumb stuff happening there.  Top items
> I'm aware of:  (1) general lack of test concurrency, (2) exec'ing new
> backends is glacially slow on that OS but we do it for every SQL
> statement in the TAP tests and every regression test script (I have
> some patches for this to share after the code freeze).

(3) the build is quite slow and has no caching


With meson the difference for (1) and (3) is quite visible. Look at
https://cirrus-ci.com/build/5265480968568832

current buildsystem: 28:07 min
meson w/ msbuild: 22:21 min
meson w/ ninja: 19:24 min

meson runs quite a few tests that the "current buildsystem" doesn't, so the
win is actually bigger than the time difference indicates...


Greetings,

Andres Freund



Re: Probable CF bot degradation

From
Andres Freund
Date:
Hi,

On 2022-03-21 12:23:02 +1300, Thomas Munro wrote:
> or actually a shade fewer because it's pretty dumb and only wakes up once a
> minute to decide what to do

Might be worth using https://cirrus-ci.org/api/#webhooks to trigger a run of
the scheduler. Probably still want to have the timeout-based "scheduling
iterations", but perhaps at a lower frequency?

Greetings,

Andres Freund



Re: Probable CF bot degradation

From
Peter Geoghegan
Date:
On Sun, Mar 20, 2022 at 4:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Would you know how long the expected bitrot re-check period for CF
> > entries that haven't been updated is, or could the bitrot-checking
> > queue be displayed somewhere to indicate the position of a patch in
> > this queue?
>
> I see that your patches were eventually retested.

What about just seeing if the patch still applies cleanly against HEAD
much more frequently? Obviously that would be way cheaper than running
all of the tests again.

Perhaps Cirrus provides a way of taking advantage of that? (Or maybe
that happens already, in which case please enlighten me.)
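
Just to illustrate what I mean, the cheap pass could be little more than this
(a sketch, not a proposal for actual cfbot code; paths and names are invented):

import subprocess

def patch_still_applies(repo_dir, patch_file):
    # Cheap bitrot check: no build, no tests, just see whether the patch
    # still applies cleanly on top of current master.
    subprocess.run(["git", "-C", repo_dir, "fetch", "-q", "origin", "master"],
                   check=True)
    subprocess.run(["git", "-C", repo_dir, "checkout", "-q", "origin/master"],
                   check=True)
    result = subprocess.run(["git", "-C", repo_dir, "apply", "--check", patch_file],
                            capture_output=True)
    return result.returncode == 0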

BTW, I think that the usability of the CFBot website would be improved
if there was a better visual indicator of what each "green tick inside
a circle" link actually indicates -- what are we testing for each
green tick/red X shown?

I already see tooltips which show a descriptive string (for example a
tooltip that says "FreeBSD - 13: COMPLETED" which comes from
<title></title> tags), which is something. But seeing these tooltips
requires several seconds of mouseover on my browser (Chrome). I'd be
quite happy if I could see similar tooltips immediately on mouseover
(which isn't actually possible with standard generic tooltips IIUC),
or something equivalent. Any kind of visual feedback on the nature of
the thing tested by a particular CI run that the user can drill down
to (you know, a Debian logo next to the tick, that kind of thing).

> I had been meaning to stump up the USD$10/month it costs to double the
> CPU limits from the basic free Cirrus account, and I've just now done
> that and told cfbot it's allowed to test 4 branches at once and to try
> to test every branch every 24 hours.  Let's see how that goes.

Extravagance!

-- 
Peter Geoghegan



Re: Probable CF bot degradation

From
Thomas Munro
Date:
On Mon, Mar 21, 2022 at 1:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> BTW, I think that the usability of the CFBot website would be improved
> if there was a better visual indicator of what each "green tick inside
> a circle" link actually indicates -- what are we testing for each
> green tick/red X shown?
>
> I already see tooltips which show a descriptive string (for example a
> tooltip that says "FreeBSD - 13: COMPLETED" which comes from
> <title></title> tags), which is something. But seeing these tooltips
> requires several seconds of mouseover on my browser (Chrome). I'd be
> quite happy if I could see similar tooltips immediately on mouseover
> (which isn't actually possible with standard generic tooltips IIUC),
> or something equivalent. Any kind of visual feedback on the nature of
> the thing tested by a particular CI run that the user can drill down
> to (you know, a Debian logo next to the tick, that kind of thing).

Nice idea, if someone with graphics skills is interested in looking into it...

Those tooltips come from the "name" elements of the .cirrus.yml file
where tasks are defined, with Cirrus's task status appended.  If we
had a set of monochrome green and red icons with a Linux penguin,
FreeBSD daemon, Windows logo and Apple logo of matching dimensions, a
config file could map task names to icons, and fall back to
ticks/crosses for anything unknown/new, including the
"CompilerWarnings" one that doesn't have an obvious icon.  Another
thing to think about is the 'solid' and 'hollow' variants, the former
indicating a recent change.  So we'd need 4 variants of each logo.
Also I believe there is a proposal to add NetBSD and OpenBSD in the
works.



Re: Probable CF bot degradation

From
Peter Geoghegan
Date:
On Sun, Mar 20, 2022 at 6:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Nice idea, if someone with graphics skills is interested in looking into it...

The logo thing wasn't really the point for me. I'd just like to have
the information be more visible, sooner.

I was hoping that there might be a very simple method of making the
same information more visible, that you could implement in only a few
minutes. Perhaps that was optimistic.

-- 
Peter Geoghegan



Re: Probable CF bot degradation

From
Andres Freund
Date:
Hi,

On 2022-03-21 14:44:55 +1300, Thomas Munro wrote:
> Those tooltips come from the "name" elements of the .cirrus.yml file
> where tasks are defined, with Cirrus's task status appended.  If we
> had a set of monochrome green and red icons with a Linux penguin,
> FreeBSD daemon, Windows logo and Apple logo of matching dimensions, a
> config file could map task names to icons, and fall back to
> ticks/crosses for anything unknown/new, including the
> "CompilerWarnings" one that doesn't have an obvious icon.  Another
> thing to think about is the 'solid' and 'hollow' variants, the former
> indicating a recent change.  So we'd need 4 variants of each logo.
> Also I believe there is a proposal to add NetBSD and OpenBSD in the
> works.

Might even be sufficient to add just the first letter of the task inside the
circle, instead of the "check" and x. Right now the letters are unique.

Greetings,

Andres Freund



Re: Probable CF bot degradation

From
Thomas Munro
Date:
On Mon, Mar 21, 2022 at 3:11 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-03-21 14:44:55 +1300, Thomas Munro wrote:
> > Those tooltips come from the "name" elements of the .cirrus.yml file
> > where tasks are defined, with Cirrus's task status appended.  If we
> > had a set of monochrome green and red icons with a Linux penguin,
> > FreeBSD daemon, Windows logo and Apple logo of matching dimensions, a
> > config file could map task names to icons, and fall back to
> > ticks/crosses for anything unknown/new, including the
> > "CompilerWarnings" one that doesn't have an obvious icon.  Another
> > thing to think about is the 'solid' and 'hollow' variants, the former
> > indicating a recent change.  So we'd need 4 variants of each logo.
> > Also I believe there is a proposal to add NetBSD and OpenBSD in the
> > works.
>
> Might even be sufficient to add just the first letter of the task inside the
> circle, instead of the "check" and x. Right now the letters are unique.

Nice idea, because it retains the information density.  If someone
with web skills would like to pull down the cfbot page and hack up one
of the rows to show an example of a pass, fail, recent-pass,
recent-fail as a circle with a letter in it, and also an "in progress"
symbol that occupies the same amount of space, I'd be keen to try
that.  (The current "in progress" blue circle was originally supposed
to be a pie filling up slowly according to a prediction of finished
time based on past performance, but I never got to that... it's stuck
at 1/4 :-))
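
To make it concrete, the sort of markup I'm imagining is roughly this (a
throwaway sketch -- class names, colours and the bold-for-recent-change idea
are all just placeholders):

def task_badge(task_name, status, recent_change=False):
    # Render one task result as a letter in a circle, e.g. "F" for FreeBSD:
    # green for success, red for failure, bold when the result changed recently.
    letter = task_name[0].upper()
    colour = "green" if status == "COMPLETED" else "red"
    weight = "bold" if recent_change else "normal"
    return ('<span class="task-badge" title="%s: %s" '
            'style="color: %s; font-weight: %s; border: 1px solid %s; '
            'border-radius: 50%%; padding: 0 4px;">%s</span>'
            % (task_name, status, colour, weight, colour, letter))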



Re: Probable CF bot degradation

From
Thomas Munro
Date:
On Mon, Mar 21, 2022 at 12:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Mar 21, 2022 at 12:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > > Additionally, are there plans to validate commits of the main branch
> > > before using them as a base for CF entries, so that "bad" commits on
> > > master won't impact CFbot results as easily?
> >
> > How do you see this working?
>
> [Now with more coffee on board]  Oh, right, I see, you're probably
> thinking that we could look at
> https://github.com/postgres/postgres/commits/master and take the most
> recent passing commit as a base.  Hmm, interesting idea.

A nice case in point today: everything is breaking on Windows due to a
commit in master, which could easily be avoided by looking back a
certain distance for a passing commit from postgres/postgres to use as
a base.  Let me see if this is easy to fix...

https://www.postgresql.org/message-id/20220322231311.GK28503%40telsasoft.com



Re: Probable CF bot degradation

From
Justin Pryzby
Date:
On Wed, Mar 23, 2022 at 12:44:09PM +1300, Thomas Munro wrote:
> On Mon, Mar 21, 2022 at 12:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Mon, Mar 21, 2022 at 12:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> > > > Additionally, are there plans to validate commits of the main branch
> > > > before using them as a base for CF entries, so that "bad" commits on
> > > > master won't impact CFbot results as easily?
> > >
> > > How do you see this working?
> >
> > [Now with more coffee on board]  Oh, right, I see, you're probably
> > thinking that we could look at
> > https://github.com/postgres/postgres/commits/master and take the most
> > recent passing commit as a base.  Hmm, interesting idea.
> 
> A nice case in point today: everything is breaking on Windows due to a
> commit in master, which could easily be avoided by looking back a
> certain distance for a passing commit from postgres/postgres to use as
> a base.  Let me see if this is easy to fix...
> 
> https://www.postgresql.org/message-id/20220322231311.GK28503%40telsasoft.com

I suggest not to make it too sophisticated.  If something is broken, the CI
should show that rather than presenting a misleading conclusion.

Maybe you could keep track of how many consecutive, *new* failures there've
been (which were passing on the previous run for that task, for that patch) and
delay if it's more than (say) 5.  For bonus points, queue a rerun of all the
failed tasks once something passes.
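
In rough pseudo-Python, the heuristic I'm thinking of is something like this
(names and the threshold are made up):

consecutive_new_failures = 0

def record_result(task, patch, status, previous_status, threshold=5):
    # Count "new" failures (the task passed on this patch's previous run)
    # across successive runs; a long streak suggests master itself is broken,
    # so hold off on further testing/reporting rather than going all red.
    global consecutive_new_failures
    if status == "FAILED" and previous_status == "COMPLETED":
        consecutive_new_failures += 1
    elif status == "COMPLETED":
        consecutive_new_failures = 0
    return consecutive_new_failures > threshold  # True => delay the queue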

If you create a page to show the queue, maybe it should show the history of
results, too.  And maybe there should be a history of results for each patch.

If you implement interactive buttons, maybe it could allow re-queueing some
recent failures (add to end of queue).

-- 
Justin