
Several buildfarm animals fail tests because of shared memory error

From: Alexander Lakhin

Hello hackers,

I'd like to draw your attention to multiple buildfarm failures that
occurred this month, on master only, caused by "could not open shared
memory segment ...: No such file or directory" errors.
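
For context (my reading of the failure, not verified against the exact call
sites): the message appears when a backend tries to attach to an existing
POSIX dynamic shared memory segment under /dev/shm and the segment file is
already gone, so shm_open() fails with ENOENT. A minimal sketch, not
PostgreSQL's actual code, with a hypothetical segment name:

    /* sketch: attach to a pre-existing POSIX shared memory segment */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int
    main(void)
    {
        const char *name = "/PostgreSQL.12345678"; /* hypothetical name */
        int         fd = shm_open(name, O_RDWR, 0); /* attach only, no O_CREAT */

        if (fd < 0)  /* errno is ENOENT if the segment was removed */
            fprintf(stderr,
                    "could not open shared memory segment \"%s\": %s\n",
                    name, strerror(errno));
        return 0;
    }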

The first such errors were produced on 2024-12-16 by:
leafhopper
Amazon Linux 2023 | gcc 11.4.1 | aarch64/graviton4/r8g.2xl | tharar [ a t ] amazon.com
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2012%3A27%3A01
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2020%3A40%3A09

and batta:
sid | gcc recent | aarch64 | michael [ a t ] paquier.xyz
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=batta&dt=2024-12-16%2008%3A05%3A04

Then there was alligator:
Ubuntu 24.04 LTS | gcc experimental (nightly build) | x86_64 | tharakan [ a t ] gmail.com
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alligator&dt=2024-12-19%2001%3A30%3A57

and parula:
Amazon Linux 2 | gcc 13.2.0 | aarch64/Graviton3/c7g.2xl | tharar [ a t ] amazon.com
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-12-21%2009%3A56%3A28

Maybe it's a configuration issue (all animals except batta are owned by
Robins), as described here:
https://www.postgresql.org/docs/devel/kernel-resources.html#SYSTEMD-REMOVEIPC
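
For reference, the fix described there amounts to setting the following in
/etc/systemd/logind.conf and then restarting systemd-logind (or rebooting),
per that documentation:

    RemoveIPC=no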

And maybe leafhopper itself is faulty, because it also produced very
weird test outputs (on older branches) like:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2023%3A43%3A03
REL_15_STABLE
-               Rows Removed by Filter: 9990
+               Rows Removed by Filter: 447009543

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-21%2022%3A18%3A04
REL_16_STABLE
-               Rows Removed by Filter: 9990
+               Rows Removed by Filter: 9395

But still why master only?

Unfortunately, I'm unable to reproduce such failures locally, so I'm sorry
for the raw information, but I see no way to investigate this further
without assistance. Perhaps the owners of these animals could shed some
light on this...

Best regards,
Alexander



Re: Several buildfarm animals fail tests because of shared memory error

From: Robins Tharakan

Hi Alexander,

Thanks for collating this list.
I'll try to add as much as I know, in the hope that it helps.

On Sun, 22 Dec 2024 at 16:30, Alexander Lakhin <exclusion@gmail.com> wrote:
> I'd like to draw your attention to multiple buildfarm failures that
> occurred this month, on master only, caused by "could not open shared
> memory segment ...: No such file or directory" errors.


- I am unsure how batta is set up, but until late last week none of my instances had RemoveIPC set correctly. I'm sorry, I didn't know about this until Thomas pointed it out to me in another thread. So if that's a key reason here, then things should probably settle down by this time next week. I've begun setting it correctly (2 done, with a few more to go), although given that some machines are at work, I'll try to get to them this coming week.



> But still why master only?

+1. It is interesting, though, that master is affected more often. This may simply be statistical, since master ends up with more commits and thus more test runs? Unsure.

Also:
- I recently (~2 days back) switched parula to the gcc experimental nightly, after which I have seen 4 of the recent errors, although the most recent run is green.
- The only info about leafhopper that may be relevant is that it's one of the newest machines (Graviton4), so it comes with recent hardware / kernel / stock gcc 11.4.1.

> Unfortunately, I'm unable to reproduce such failures locally, so I'm sorry
> for the raw information, but I see no way to investigate this further
> without assistance. Perhaps the owners of these animals could shed some
> light on this...

Since the instances are created with work accounts, it isn't trivial to share access, but I can follow up with any outputs / captures if that would help here.

Lastly, alligator has been on the gcc nightly for a few months and is on x86_64, so if alligator is still stuttering by this time next week, I'm pretty sure there's more to blame here than just aarch64, gcc, or the IPC config.

-
robins