Thread: Several buildfarm animals fail tests because of shared memory error
Hello hackers,

I'd like to bring your attention to multiple buildfarm failures, which occurred this month, on master only, caused by "could not open shared memory segment ...: No such file or directory" errors.

The first such errors were produced on 2024-12-16 by:

leafhopper: Amazon Linux 2023 | gcc 11.4.1 | aarch64/graviton4/r8g.2xl | tharar [ a t ] amazon.com
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2012%3A27%3A01
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2020%3A40%3A09

and batta: sid | gcc recent | aarch64 | michael [ a t ] paquier.xyz
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=batta&dt=2024-12-16%2008%3A05%3A04

Then there was alligator: Ubuntu 24.04 LTS | gcc experimental (nightly build) | x86_64 | tharakan [ a t ] gmail.com
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alligator&dt=2024-12-19%2001%3A30%3A57

and parula: Amazon Linux 2 | gcc 13.2.0 | aarch64/Graviton3/c7g.2xl | tharar [ a t ] amazon.com
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-12-21%2009%3A56%3A28

Maybe it's a configuration issue (all the animals except batta are owned by Robins), as described here:
https://www.postgresql.org/docs/devel/kernel-resources.html#SYSTEMD-REMOVEIPC

And maybe leafhopper is faulty by itself, because it also produced very weird test outputs (on older branches), like:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2023%3A43%3A03
REL_15_STABLE:
-   Rows Removed by Filter: 9990
+   Rows Removed by Filter: 447009543

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-21%2022%3A18%3A04
REL_16_STABLE:
-   Rows Removed by Filter: 9990
+   Rows Removed by Filter: 9395

But still, why master only?

Unfortunately I'm unable to reproduce such failures locally, so I'm sorry for such raw information, but I see no way to investigate this further without assistance.
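For anyone unfamiliar with the kernel-resources link above: with systemd's default RemoveIPC=yes, logind may remove a user's shared memory when that user's last session ends, after which reopening a segment by name fails with exactly this error. A rough sketch of that failure mode, using an ordinary file in a temp directory as a stand-in for a /dev/shm segment (the file name is hypothetical):

```shell
#!/bin/sh
# Sketch of the suspected failure mode (hypothetical segment name).
# PostgreSQL keeps POSIX dynamic shared memory segments as files under
# /dev/shm; if something deletes such a file while the server still
# refers to it by name, the next open fails with ENOENT.
demo_removed_segment() {
    dir=$(mktemp -d)                 # stand-in for /dev/shm
    seg="$dir/PostgreSQL.12345678"   # hypothetical segment name
    echo data > "$seg"               # segment exists...
    rm -f "$seg"                     # ...until an IPC cleanup removes it
    cat "$seg" 2>&1                  # reopen by name -> ENOENT
    rmdir "$dir"
}

demo_removed_segment | grep -o 'No such file or directory'
```

This only mimics the symptom, of course; it says nothing about who removed the segment on these animals.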
Perhaps owners of these animals could shed some light on this...

Best regards,
Alexander
Hi Alexander,
Thanks for collating this list.
I'll try to add as much as I know, in hopes that it helps.
On Sun, 22 Dec 2024 at 16:30, Alexander Lakhin <exclusion@gmail.com> wrote:
I'd like to bring your attention to multiple buildfarm failures, which
occurred this month, on master only, caused by "could not open shared
memory segment ...: No such file or directory" errors.
- I am unsure how batta is set up, but until late last week none of my instances had RemoveIPC set correctly. I am sorry, I didn't know about this until Thomas pointed it out to me in another thread. So if that's a key reason here, then things should probably settle down by this time next week. I've begun setting it correctly (2 done, with a few more to go), although given that some machines are at work, I'll try to get to them this coming week.
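For anyone else checking their animals, here's a minimal sketch of the check I'm doing, assuming the stock /etc/systemd/logind.conf path (drop-in directories under logind.conf.d are not scanned here):

```shell
#!/bin/sh
# Sketch: report whether RemoveIPC is explicitly disabled in a
# logind.conf-style file. systemd's compiled-in default is RemoveIPC=yes,
# so an absent or commented-out line means IPC objects get removed.
removeipc_disabled() {
    grep -Eq '^[[:space:]]*RemoveIPC[[:space:]]*=[[:space:]]*no' "$1" 2>/dev/null
}

conf="${1:-/etc/systemd/logind.conf}"
if removeipc_disabled "$conf"; then
    echo "RemoveIPC=no: logind leaves IPC objects alone"
else
    echo "RemoveIPC not disabled: logind removes shared memory and"
    echo "semaphores when the user's last session ends (systemd default)"
fi
# After editing /etc/systemd/logind.conf, apply with:
#   systemctl restart systemd-logind
```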
But still why master only?
+1. It is interesting, though, why master is affected more often. This may be statistical, since master ends up with more commits and thus more test runs? Unsure.
Also:
- I recently (~2 days back) switched parula to the gcc-experimental nightly, after which I saw 4 of the recent errors, although the most recent run is green.
- The only info about leafhopper that may be relevant is that it's one of the newest machines (Graviton4), so it comes with recent hardware / kernel and the stock gcc 11.4.1.
Unfortunately I'm unable to reproduce such failures locally, so I'm sorry
for such raw information, but I see no way to investigate this further
without assistance. Perhaps owners of these animals could shed some light
on this...
Since the instances are created with work accounts, it isn't trivial to share access, but I could respond with any outputs / captures if that can help here.
Lastly, alligator has been on gcc nightly for a few months and is on x86_64, so if alligator is still stuttering by this time next week, I'm pretty sure there's more than just aarch64, gcc, or IPC config to blame here.
-
robins