Thread: Probable CF bot degradation
Hi, hackers!
I've noticed that CF bot hasn't been running active branches from yesterday:
https://github.com/postgresql-cfbot/postgresql/branches/active
Also, there are no new results on the current CF page on cputube.
I don't know if it is a problem or kind of scheduled maintenance though.
Hi,

On Fri, Mar 18, 2022 at 07:43:47PM +0400, Pavel Borisov wrote:
> Hi, hackers!
> I've noticed that CF bot hasn't been running active branches from yesterday:
> https://github.com/postgresql-cfbot/postgresql/branches/active
>
> Also, there are no new results on the current CF page on cputube.
> I don't know if it is a problem or kind of scheduled maintenance though.

There was a github incident yesterday, that was resolved a few hours ago ([1]),
maybe the cfbot didn't like that.

[1] https://www.githubstatus.com/incidents/dcnvr6zym66r
On Sat, Mar 19, 2022 at 5:07 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Fri, Mar 18, 2022 at 07:43:47PM +0400, Pavel Borisov wrote:
> > Hi, hackers!
> > I've noticed that CF bot hasn't been running active branches from yesterday:
> > https://github.com/postgresql-cfbot/postgresql/branches/active
> >
> > Also, there are no new results on the current CF page on cputube.
> > I don't know if it is a problem or kind of scheduled maintenance though.
>
> There was a github incident yesterday, that was resolved a few hours ago ([1]),
> maybe the cfbot didn't like that.

Yeah, for a while it was seeing:

remote: Internal Server Error
To github.com:postgresql-cfbot/postgresql.git
! [remote rejected] commitfest/37/3489 -> commitfest/37/3489
(Internal Server Error)
error: failed to push some refs to 'github.com:postgresql-cfbot/postgresql.git'

Unfortunately cfbot didn't handle that failure very well and it was
waiting for a long timeout before scheduling more jobs. It's going
again now, and I'll try to make it more resilient against that type of
failure...
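For illustration only, here is one way "more resilient" could look: retry a failed push a bounded number of times with a short backoff, instead of sitting on one long timeout. This is a minimal sketch and not cfbot's actual code; the `push` callable, the attempt count and the delays are all assumptions.

```python
import time
from typing import Callable


def push_with_retry(push: Callable[[], bool],
                    attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call push() until it succeeds or the attempts run out.

    push is a hypothetical callable that returns True on success and
    False on a failure such as "remote rejected".
    """
    for attempt in range(attempts):
        if push():
            return True
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    # Give up so the caller can move on to the next branch.
    return False
```

Returning False instead of blocking lets the scheduler carry on with other branches and come back to the failed one later.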
> remote: Internal Server Error
> To github.com:postgresql-cfbot/postgresql.git
> ! [remote rejected] commitfest/37/3489 -> commitfest/37/3489
> (Internal Server Error)
> error: failed to push some refs to 'github.com:postgresql-cfbot/postgresql.git'
I am seeing commitfest/37/3489 in "triggered" state for a long time. No progress is seen on this branch, though I started to see successful runs on the other branches now.
Could you see this particular branch and maybe restart it manually?
> Unfortunately cfbot didn't handle that failure very well and it was
> waiting for a long timeout before scheduling more jobs. It's going
> again now, and I'll try to make it more resilient against that type of
> failure...
Thanks a lot!
On Sat, Mar 19, 2022 at 9:41 AM Pavel Borisov <pashkin.elfe@gmail.com> wrote:
>> remote: Internal Server Error
>> To github.com:postgresql-cfbot/postgresql.git
>> ! [remote rejected] commitfest/37/3489 -> commitfest/37/3489
>> (Internal Server Error)
>> error: failed to push some refs to 'github.com:postgresql-cfbot/postgresql.git'
>
> I am seeing commitfest/37/3489 in "triggered" state for a long time. No progress is seen on this branch, though I started to see successful runs on the other branches now.
> Could you see this particular branch and maybe restart it manually?

I don't seem to have a way to delete that... it looks like when github
told us "Internal Server Error", it had partially succeeded and the new
branch (partially?) existed, but something was b0rked and it
confused Cirrus. 🤷 There is already another build for 3489 that is
almost finished now so I don't think that stale TRIGGERED one is
stopping anything from working and I guess it will eventually go away
by itself somehow...
> confused Cirrus. 🤷 There is already another build for 3489 that is
> almost finished now so I don't think that stale TRIGGERED one is
> stopping anything from working and I guess it will eventually go away
> by itself somehow...
Indeed, I saw this now. No problem anymore.
Thanks!
--
On Fri, 18 Mar 2022 at 19:52, Thomas Munro <thomas.munro@gmail.com> wrote:
> Unfortunately cfbot didn't handle that failure very well and it was
> waiting for a long timeout before scheduling more jobs. It's going
> again now, and I'll try to make it more resilient against that type of
> failure...

I noticed that two of my patches (37/3543 and 37/3542) both failed due
to a bad commit on master (076f4d9). The issue was fixed an hour later
with b61e6214; but the pipeline for these patches hasn't run since.
Because doing a no-op update would only clutter people's inboxes, I
was waiting for CFBot to do its regular bitrot check; but that hasn't
happened yet after 4 days.
I understand that this is probably due to the high rate of new patch
revisions that get priority in the queue; but that doesn't quite
fulfill my want for information in this case.

Would you know how long the expected bitrot re-check period for CF
entries that haven't been updated is, or could the bitrot-checking
queue be displayed somewhere to indicate the position of a patch in
this queue?

Additionally, are there plans to validate commits of the main branch
before using them as a base for CF entries, so that "bad" commits on
master won't impact CFbot results as easily?

Kind regards,

Matthias van de Meent
On Sun, Mar 20, 2022 at 01:58:01PM +0100, Matthias van de Meent wrote:
>
> I noticed that two of my patches (37/3543 and 37/3542) both failed due
> to a bad commit on master (076f4d9). The issue was fixed an hour later
> with b61e6214; but the pipeline for these patches hasn't run since.
> Because doing a no-op update would only clutter people's inboxes, I
> was waiting for CFBot to do its regular bitrot check; but that hasn't
> happened yet after 4 days.
> I understand that this is probably due to the high rate of new patch
> revisions that get priority in the queue; but that doesn't quite
> fulfill my want for information in this case.

Just in case, if you only want to know whether the cfbot would be happy with
your patches you can run the exact same checks using a personal github repo, as
documented at src/tools/ci/README.

You could also send the URL of a successful run on the related threads, or as
an annotation on the cf entries to let possible reviewers know that the patch
is still in good shape even if the cfbot is currently still broken.
On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> Would you know how long the expected bitrot re-check period for CF
> entries that haven't been updated is, or could the bitrot-checking
> queue be displayed somewhere to indicate the position of a patch in
> this queue?

I see that your patches were eventually retested.

It was set to try to recheck every ~48 hours, though it couldn't quite
always achieve that when the total number of eligible submissions is
too large. In this case it had stalled for too long after the github
outage, which I'm going to try to improve. The reason for the 48+
hour cycle is the Windows tests now take ~25 minutes (since we started
actually running all the tests on that platform), and we could only
have two Windows tasks running at a time in practice, because the
limit for Windows was 8 CPUs, and we use 4 for each task, which means
we could only test ~115 branches per day, or actually a shade fewer
because it's pretty dumb and only wakes up once a minute to decide
what to do, and we currently have 242 submissions (though some don't
apply, those are free, so the number varies over time...). There are
limits on the Unixes too but they are more generous, and the Unix
tests only take 4-10 minutes, so we can ignore that for now, it's all
down to Windows.

I had been meaning to stump up the USD$10/month it costs to double the
CPU limits from the basic free Cirrus account, and I've just now done
that and told cfbot it's allowed to test 4 branches at once and to try
to test every branch every 24 hours. Let's see how that goes.

Here's hoping we can cut down the time it takes to run the tests on
Windows... there's some really dumb stuff happening there. Top items
I'm aware of: (1) general lack of test concurrency, (2) exec'ing new
backends is glacially slow on that OS but we do it for every SQL
statement in the TAP tests and every regression test script (I have
some patches for this to share after the code freeze).

> Additionally, are there plans to validate commits of the main branch
> before using them as a base for CF entries, so that "bad" commits on
> master won't impact CFbot results as easily?

How do you see this working?

I have wondered about some kind of way to click a button to say "do
this one again now", but I guess that sort of user interaction should
ideally happen after merging this thing into the Commitfest app,
because it already has auth, and interactive Python/Django web stuff.
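The capacity arithmetic in this message can be checked with a few lines. This is only a back-of-the-envelope sketch: the function names are made up, the 16-CPU figure is inferred from "double the CPU limits", and it ignores both the once-a-minute wake-up and the submissions that don't apply.

```python
def branches_per_day(cpu_limit: int, cpus_per_task: int,
                     minutes_per_task: float) -> float:
    """Branches testable per day when Windows is the bottleneck."""
    concurrent = cpu_limit // cpus_per_task
    return concurrent * (24 * 60) / minutes_per_task


def cycle_hours(submissions: int, per_day: float) -> float:
    """Hours needed to get through every submission once."""
    return 24 * submissions / per_day


old = branches_per_day(8, 4, 25)    # two Windows tasks at a time
new = branches_per_day(16, 4, 25)   # assumed doubled limit, four at a time

print(round(old), round(cycle_hours(242, old), 1))   # prints: 115 50.4
print(round(new), round(cycle_hours(242, new), 1))   # prints: 230 25.2
```

With 242 submissions that is roughly a 50-hour cycle before the upgrade, in line with the "48+ hour cycle", and about 25 hours after it, so the 24-hour target looks reachable only when some entries don't apply.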
On Mon, Mar 21, 2022 at 12:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Would you know how long the expected bitrot re-check period for CF
> > entries that haven't been updated is, or could the bitrot-checking
> > queue be displayed somewhere to indicate the position of a patch in
> > this queue?

Also, as for the show-me-the-queue page, yeah that's a good idea and
quite feasible. I'll look into that in a bit.

> > Additionally, are there plans to validate commits of the main branch
> > before using them as a base for CF entries, so that "bad" commits on
> > master won't impact CFbot results as easily?
>
> How do you see this working?

[Now with more coffee on board] Oh, right, I see, you're probably
thinking that we could look at
https://github.com/postgres/postgres/commits/master and take the most
recent passing commit as a base. Hmm, interesting idea.
Hi,

On 2022-03-21 12:23:02 +1300, Thomas Munro wrote:
> It was set to try to recheck every ~48 hours, though it couldn't quite
> always achieve that when the total number of eligible submissions is
> too large. In this case it had stalled for too long after the github
> outage, which I'm going to try to improve. The reason for the 48+
> hour cycle is the Windows tests now take ~25 minutes (since we started
> actually running all the tests on that platform)

I see 26-28 minutes regularly :(. And that doesn't even include the "boot
time" of the test of around 3-4min, which is quite a bit higher for windows
than for the other OSs.

> and we could only
> have two Windows tasks running at a time in practice, because the
> limit for Windows was 8 CPUs, and we use 4 for each task, which means
> we could only test ~115 branches per day, or actually a shade fewer
> because it's pretty dumb and only wakes up once a minute to decide
> what to do, and we currently have 242 submissions (though some don't
> apply, those are free, so the number varies over time...). There are
> limits on the Unixes too but they are more generous, and the Unix
> tests only take 4-10 minutes, so we can ignore that for now, it's all
> down to Windows.

I wonder if it's worth using the number of concurrently running windows tasks
as the limit, rather than the number of commits being tested concurrently. It's
not rare for windows to fail more quickly than other OSs. But probably the 4
concurrent tests are good enough for now...

I'd love to merge the patch adding mingw CI testing, which'd increase the
pressure substantially :/

> I had been meaning to stump up the USD$10/month it costs to double the
> CPU limits from the basic free Cirrus account, and I've just now done
> that and told cfbot it's allowed to test 4 branches at once and to try
> to test every branch every 24 hours. Let's see how that goes.

Yay.

> Here's hoping we can cut down the time it takes to run the tests on
> Windows... there's some really dumb stuff happening there. Top items
> I'm aware of: (1) general lack of test concurrency, (2) exec'ing new
> backends is glacially slow on that OS but we do it for every SQL
> statement in the TAP tests and every regression test script (I have
> some patches for this to share after the code freeze).

3) build is quite slow and has no caching

With meson the difference from 1) and 3) is quite visible. Look at
https://cirrus-ci.com/build/5265480968568832

current buildsystem: 28:07 min
meson w/ msbuild: 22:21 min
meson w/ ninja: 19:24 min

meson runs quite a few tests that the "current buildsystem" doesn't, so the
win is actually bigger than the time difference indicates...

Greetings,

Andres Freund
Hi,

On 2022-03-21 12:23:02 +1300, Thomas Munro wrote:
> or actually a shade fewer because it's pretty dumb and only wakes up once a
> minute to decide what to do

Might be worth using https://cirrus-ci.org/api/#webhooks to trigger a run of
the scheduler. Probably still want to have the timeout based "scheduling
iterations", but perhaps at a lower frequency?

Greetings,

Andres Freund
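One possible shape for that, sketched with nothing but the standard library: the scheduler blocks on an event that a webhook handler would set, and still wakes up by itself when a timeout expires. The webhook receiver is not shown, and `wait_for_work` is a made-up name, not something cfbot has.

```python
import threading


def wait_for_work(wakeup: threading.Event, timeout: float) -> str:
    """Block until a webhook sets the event or the timeout expires.

    Returns "webhook" if woken early, "timeout" for a periodic
    scheduling iteration.
    """
    if wakeup.wait(timeout):
        # Reset so the next call blocks again until the next webhook.
        wakeup.clear()
        return "webhook"
    return "timeout"
```

A (hypothetical) HTTP handler for the webhook would just call `wakeup.set()`, and the timeout could then be much longer than a minute.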
On Sun, Mar 20, 2022 at 4:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> > Would you know how long the expected bitrot re-check period for CF
> > entries that haven't been updated is, or could the bitrot-checking
> > queue be displayed somewhere to indicate the position of a patch in
> > this queue?
>
> I see that your patches were eventually retested.

What about just seeing if the patch still applies cleanly against HEAD
much more frequently? Obviously that would be way cheaper than running
all of the tests again. Perhaps Cirrus provides a way of taking
advantage of that? (Or maybe that happens already, in which case
please enlighten me.)

BTW, I think that the usability of the CFBot website would be improved
if there was a better visual indicator of what each "green tick inside
a circle" link actually indicates -- what are we testing for each
green tick/red X shown?

I already see tooltips which show a descriptive string (for example a
tooltip that says "FreeBSD - 13: COMPLETED" which comes from
<title></title> tags), which is something. But seeing these tooltips
requires several seconds of mouseover on my browser (Chrome). I'd be
quite happy if I could see similar tooltips immediately on mouseover
(which isn't actually possible with standard generic tooltips IIUC),
or something equivalent. Any kind of visual feedback on the nature of
the thing tested by a particular CI run that the user can drill down
to (you know, a Debian logo next to the tick, that kind of thing).

> I had been meaning to stump up the USD$10/month it costs to double the
> CPU limits from the basic free Cirrus account, and I've just now done
> that and told cfbot it's allowed to test 4 branches at once and to try
> to test every branch every 24 hours. Let's see how that goes.

Extravagance!

--
Peter Geoghegan
On Mon, Mar 21, 2022 at 1:41 PM Peter Geoghegan <pg@bowt.ie> wrote:
> BTW, I think that the usability of the CFBot website would be improved
> if there was a better visual indicator of what each "green tick inside
> a circle" link actually indicates -- what are we testing for each
> green tick/red X shown?
>
> I already see tooltips which show a descriptive string (for example a
> tooltip that says "FreeBSD - 13: COMPLETED" which comes from
> <title></title> tags), which is something. But seeing these tooltips
> requires several seconds of mouseover on my browser (Chrome). I'd be
> quite happy if I could see similar tooltips immediately on mouseover
> (which isn't actually possible with standard generic tooltips IIUC),
> or something equivalent. Any kind of visual feedback on the nature of
> the thing tested by a particular CI run that the user can drill down
> to (you know, a Debian logo next to the tick, that kind of thing).

Nice idea, if someone with graphics skills is interested in looking
into it...

Those tooltips come from the "name" elements of the .cirrus.yml file
where tasks are defined, with Cirrus's task status appended. If we
had a set of monochrome green and red icons with a Linux penguin,
FreeBSD daemon, Windows logo and Apple logo of matching dimensions, a
config file could map task names to icons, and fall back to
ticks/crosses for anything unknown/new, including the
"CompilerWarnings" one that doesn't have an obvious icon. Another
thing to think about is the 'solid' and 'hollow' variants, the former
indicating a recent change. So we'd need 4 variants of each logo.
Also I believe there is a proposal to add NetBSD and OpenBSD in the
works.
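The name-to-icon mapping with a fallback could be as small as the sketch below. The icon identifiers, the prefix matching and the function name are all invented for illustration; only the task names "FreeBSD - 13" and "CompilerWarnings" come from this thread.

```python
# Hypothetical mapping from task-name prefix to an icon base name.
ICONS = {
    "Linux": "penguin",
    "FreeBSD": "daemon",
    "Windows": "windows-logo",
    "macOS": "apple-logo",
}


def icon_for(task_name: str, passed: bool, recent: bool) -> str:
    """Pick one of the 4 variants (green/red x solid/hollow) of an icon."""
    colour = "green" if passed else "red"
    style = "solid" if recent else "hollow"
    for prefix, icon in ICONS.items():
        if task_name.startswith(prefix):
            return f"{icon}-{colour}-{style}"
    # Unknown or new tasks fall back to the existing tick/cross.
    fallback = "tick" if passed else "cross"
    return f"{fallback}-{colour}-{style}"
```

For example, `icon_for("FreeBSD - 13", True, True)` gives `daemon-green-solid`, while a failing "CompilerWarnings" task falls back to a red cross.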
On Sun, Mar 20, 2022 at 6:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Nice idea, if someone with graphics skills is interested in looking into it...

The logo thing wasn't really the point for me. I'd just like to have
the information be more visible, sooner. I was hoping that there might
be a very simple method of making the same information more visible,
that you could implement in only a few minutes. Perhaps that was
optimistic.

--
Peter Geoghegan
Hi,

On 2022-03-21 14:44:55 +1300, Thomas Munro wrote:
> Those tooltips come from the "name" elements of the .cirrus.yml file
> where tasks are defined, with Cirrus's task status appended. If we
> had a set of monochrome green and red icons with a Linux penguin,
> FreeBSD daemon, Windows logo and Apple logo of matching dimensions, a
> config file could map task names to icons, and fall back to
> ticks/crosses for anything unknown/new, including the
> "CompilerWarnings" one that doesn't have an obvious icon. Another
> thing to think about is the 'solid' and 'hollow' variants, the former
> indicating a recent change. So we'd need 4 variants of each logo.
> Also I believe there is a proposal to add NetBSD and OpenBSD in the
> works.

Might even be sufficient to add just the first letter of the task inside the
circle, instead of the "check" and x. Right now the letters are unique.

Greetings,

Andres Freund
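The first-letter idea depends on the letters staying unique, which is cheap to check when generating the page. A minimal sketch, with a made-up function name and task names that are mostly assumed (only "FreeBSD - 13" and "CompilerWarnings" appear in this thread):

```python
def letter_badges(task_names):
    """Map each task name to its first letter, insisting on uniqueness."""
    letters = [name[0].upper() for name in task_names]
    if len(set(letters)) != len(letters):
        raise ValueError("first letters are not unique")
    return dict(zip(task_names, letters))


badges = letter_badges(
    ["FreeBSD - 13", "Linux", "macOS", "Windows", "CompilerWarnings"])
```

NetBSD and OpenBSD would still get distinct letters, but a task named, say, "MinGW" would collide with "macOS" and need some other rule.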
On Mon, Mar 21, 2022 at 3:11 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-03-21 14:44:55 +1300, Thomas Munro wrote:
> > Those tooltips come from the "name" elements of the .cirrus.yml file
> > where tasks are defined, with Cirrus's task status appended. If we
> > had a set of monochrome green and red icons with a Linux penguin,
> > FreeBSD daemon, Windows logo and Apple logo of matching dimensions, a
> > config file could map task names to icons, and fall back to
> > ticks/crosses for anything unknown/new, including the
> > "CompilerWarnings" one that doesn't have an obvious icon. Another
> > thing to think about is the 'solid' and 'hollow' variants, the former
> > indicating a recent change. So we'd need 4 variants of each logo.
> > Also I believe there is a proposal to add NetBSD and OpenBSD in the
> > works.
>
> Might even be sufficient to add just the first letter of the task inside the
> circle, instead of the "check" and x. Right now the letters are unique.

Nice idea, because it retains the information density. If someone
with web skills would like to pull down the cfbot page and hack up one
of the rows to show an example of a pass, fail, recent-pass,
recent-fail as a circle with a letter in it, and also an "in progress"
symbol that occupies the same amount of space, I'd be keen to try
that. (The current "in progress" blue circle was originally supposed
to be a pie filling up slowly according to a prediction of finished
time based on past performance, but I never got to that... it's stuck
at 1/4 :-))
On Mon, Mar 21, 2022 at 12:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Mar 21, 2022 at 12:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent
> > <boekewurm+postgres@gmail.com> wrote:
> > > Additionally, are there plans to validate commits of the main branch
> > > before using them as a base for CF entries, so that "bad" commits on
> > > master won't impact CFbot results as easily?
> >
> > How do you see this working?
>
> [Now with more coffee on board] Oh, right, I see, you're probably
> thinking that we could look at
> https://github.com/postgres/postgres/commits/master and take the most
> recent passing commit as a base. Hmm, interesting idea.

A nice case in point today: everything is breaking on Windows due to a
commit in master, which could easily be avoided by looking back a
certain distance for a passing commit from postgres/postgres to use as
a base. Let me see if this is easy to fix...

https://www.postgresql.org/message-id/20220322231311.GK28503%40telsasoft.com
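The "look back a certain distance for a passing commit" idea might be sketched like this. How the per-commit CI status for postgres/postgres would be fetched is left out entirely; the `(sha, status)` tuples, the "passed" string and the lookback default are assumptions for illustration.

```python
from typing import Optional, Sequence, Tuple


def pick_base(commits: Sequence[Tuple[str, str]],
              lookback: int = 10) -> Optional[str]:
    """Return the newest commit whose CI status is "passed".

    commits is ordered newest first; only the first `lookback`
    entries are considered.
    """
    for sha, status in commits[:lookback]:
        if status == "passed":
            return sha
    # Nothing passing within the window: let the caller decide.
    return None
```

Returning None when nothing in the window passes leaves room for the caller to fall back to the tip of master, so breakage is still reported instead of hidden.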
On Wed, Mar 23, 2022 at 12:44:09PM +1300, Thomas Munro wrote:
> On Mon, Mar 21, 2022 at 12:46 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Mon, Mar 21, 2022 at 12:23 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > On Mon, Mar 21, 2022 at 1:58 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> > > > Additionally, are there plans to validate commits of the main branch
> > > > before using them as a base for CF entries, so that "bad" commits on
> > > > master won't impact CFbot results as easily?
> > >
> > > How do you see this working?
> >
> > [Now with more coffee on board] Oh, right, I see, you're probably
> > thinking that we could look at
> > https://github.com/postgres/postgres/commits/master and take the most
> > recent passing commit as a base. Hmm, interesting idea.
>
> A nice case in point today: everything is breaking on Windows due to a
> commit in master, which could easily be avoided by looking back a
> certain distance for a passing commit from postgres/postgres to use as
> a base. Let me see if this is easy to fix...
>
> https://www.postgresql.org/message-id/20220322231311.GK28503%40telsasoft.com

I suggest not making it too sophisticated. If something is broken, the CI
should show that rather than presenting a misleading conclusion.

Maybe you could keep track of how many consecutive, *new* failures there've
been (which were passing on the previous run for that task, for that patch) and
delay if it's more than (say) 5. For bonus points, queue a rerun of all the
failed tasks once something passes.

If you create a page to show the queue, maybe it should show the history of
results, too. And maybe there should be a history of results for each patch.

If you implement interactive buttons, maybe it could allow re-queueing some
recent failures (add to end of queue).

--
Justin
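A rough sketch of that bookkeeping, assuming results arrive one task at a time together with whether the same task passed on the previous run for that patch. The class and method names are hypothetical, and the threshold of 5 is the "(say) 5" from the message above.

```python
from collections import deque


class FailureGate:
    """Count consecutive *new* failures and remember what to rerun."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.new_failures = 0
        self.rerun = deque()

    def record(self, patch, task, passed, passed_before):
        """Record one result; return the (patch, task) pairs to requeue."""
        if passed:
            # Something passes again: reset and rerun the failed tasks.
            self.new_failures = 0
            requeue = list(self.rerun)
            self.rerun.clear()
            return requeue
        if passed_before:
            # Only a task that passed last time counts as a new failure.
            self.new_failures += 1
            self.rerun.append((patch, task))
        return []

    def should_delay(self) -> bool:
        return self.new_failures > self.threshold
```

A task that was already failing on its previous run does not move the counter, so a patch that is broken in its own right would not trigger the delay.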