Thread: Strange failures on chipmunk
Hi,

chipmunk (an armv6l-powered original Raspberry Pi model 1?) has failed in a couple of weird ways recently on 14 and master.

On 14 I see what appears to be a corrupted log file name:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-06-16%2006%3A48%3A07

cp: cannot stat \342\200\230/home/pgbfarm/buildroot/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000003\342\200\231: No such file or directory

On master, you can ignore this failure, because it was addressed by 93759c66:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-05-11%2015%3A26%3A01

Then there's this one-off, that smells like WAL corruption:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-06-13%2015%3A12%3A44

2022-06-13 23:02:06.988 EEST [30121:5] LOG: incorrect resource manager data checksum in record at 0/79B4FE0

Hmmm. I suppose it's remotely possible that Linux/armv6l ext4 suffers from concurrency bugs like Linux/sparc. In that particular kernel bug's case it's zeroes, so I guess it'd be easier to speculate about if the log message included the checksum when it fails like that...
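On the point about including the checksum in the log message, here is a minimal sketch in Python of a verifier that reports both the stored and the recomputed value on a mismatch. The record layout (a 4-byte little-endian CRC-32C followed by the payload), the message wording, and the names make_record and check_record are all made up for illustration; this is not PostgreSQL's WAL record format or its actual log output. CRC-32C is used only because it is the checksum family PostgreSQL applies to WAL.

```python
import struct

# Reflected CRC-32C (Castagnoli) polynomial.
_POLY = 0x82F63B78


def _make_table():
    """Build the 256-entry lookup table for byte-at-a-time CRC-32C."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Plain table-driven CRC-32C, with no hardware acceleration."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def make_record(payload: bytes) -> bytes:
    """Hypothetical toy layout: 4-byte little-endian CRC, then the payload."""
    return struct.pack("<I", crc32c(payload)) + payload


def check_record(record: bytes):
    """Return None if the record verifies, else a message with both CRCs."""
    (stored,) = struct.unpack_from("<I", record, 0)
    payload = record[4:]
    computed = crc32c(payload)
    if stored == computed:
        return None
    msg = f"incorrect checksum: stored {stored:08X}, computed {computed:08X}"
    # An all-zero record is the signature of the kind of bug discussed above.
    if stored == 0 and not any(payload):
        msg += " (record is entirely zeroes)"
    return msg
```

With both values in the message, a stored checksum of 00000000 over an all-zero payload would point towards zeroed pages, while two unrelated non-zero values would look more like a flipped bit somewhere in the record.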
On Thu, Jun 30, 2022 at 10:07:18AM +1200, Thomas Munro wrote:
> Then there's this one-off, that smells like WAL corruption:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-06-13%2015%3A12%3A44
>
> 2022-06-13 23:02:06.988 EEST [30121:5] LOG: incorrect resource
> manager data checksum in record at 0/79B4FE0
>
> Hmmm. I suppose it's remotely possible that Linux/armv6l ext4 suffers
> from concurrency bugs like Linux/sparc.

Running sparc64-ext4-zeros.c from
https://marc.info/?l=linux-sparc&m=164539269632667&w=2 could confirm that
possibility.
On 30/06/2022 09:31, Noah Misch wrote:
> On Thu, Jun 30, 2022 at 10:07:18AM +1200, Thomas Munro wrote:
>> Then there's this one-off, that smells like WAL corruption:
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-06-13%2015%3A12%3A44
>>
>> 2022-06-13 23:02:06.988 EEST [30121:5] LOG: incorrect resource
>> manager data checksum in record at 0/79B4FE0
>>
>> Hmmm. I suppose it's remotely possible that Linux/armv6l ext4 suffers
>> from concurrency bugs like Linux/sparc.
>
> Running sparc64-ext4-zeros.c from
> https://marc.info/?l=linux-sparc&m=164539269632667&w=2 could confirm that
> possibility.

I ran sparc64-ext4-zeros on chipmunk for 10 minutes, and it didn't print anything.

It's possible that the SD card on chipmunk is simply wearing out and flipping bits. I can try to replace it. Anyone have suggestions on a test program I could run on the SD card, after replacing it, to verify if it was indeed worn out?

- Heikki
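As one possible shape for such a test program, here is a minimal write-then-verify sketch in Python. It is not an established tool: the function names, the block size, and the pattern generator are assumptions made for illustration. It fills a file on the card with reproducible pseudo-random blocks, syncs it, then reads the file back and lists the blocks that no longer match. One caveat: a read-back can be served from the page cache rather than from the card, so the sketch asks the kernel to drop the file's cached pages where os.posix_fadvise is available. That call is only a hint, so a file larger than RAM, or a remount between the write and the verify, is the safer approach.

```python
import hashlib
import os

BLOCK_SIZE = 1 << 20  # 1 MiB per block; an arbitrary choice for this sketch


def block_data(seed: int, index: int, size: int) -> bytes:
    """Deterministic pseudo-random block, reproducible from (seed, index)."""
    return hashlib.shake_128(f"{seed}:{index}".encode()).digest(size)


def write_pattern(path, n_blocks, seed=0, block_size=BLOCK_SIZE):
    """Write n_blocks of pattern data to path and force it to storage."""
    with open(path, "wb") as f:
        for i in range(n_blocks):
            f.write(block_data(seed, i, block_size))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path, n_blocks, seed=0, block_size=BLOCK_SIZE):
    """Re-read path and return the indexes of blocks that do not match."""
    bad = []
    with open(path, "rb") as f:
        # Hint that cached pages for this file can be dropped, so the reads
        # below are more likely to hit the device. Not a guarantee.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        for i in range(n_blocks):
            if f.read(block_size) != block_data(seed, i, block_size):
                bad.append(i)
    return bad
```

A run would call write_pattern() on a file on the suspect card with enough blocks to cover most of its free space, then call verify_pattern() with the same seed, possibly several times or after a reboot. Any non-empty result means data came back different from what was written.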
On Thu, Jun 30, 2022 at 8:21 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I ran sparc64-ext4-zeros on chipmunk for 10 minutes, and it didn't print
> anything.

Thanks for checking.

> It's possible that the SD card on chipmunk is simply wearing out and
> flipping bits. I can try to replace it. Anyone have suggestions on a
> test program I could run on the SD card, after replacing it, to verify
> if it was indeed worn out?

BTW its disk is full.

FWIW I run RPi4 build bots on higher end USB3.x sticks (SanDisk Extreme Pro, I'm sure there are others), and the performance is orders of magnitude higher and more consistent than the micro SD and cheap/random USB sticks I tried. Admittedly they cost more than the RPi4 boards themselves (back when you could get them).

I noticed another (presumed) Raspberry Pi apparently behaving strangely at the storage level (guessing it's a Pi by the armv7l architecture): dangomushi appears to get files mixed up. Here it is trying to compile a log file last week:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-07-14%2017%3A58%3A38

And the week before it tried to compile some Perl:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-07-09%2015%3A30%3A07
On Fri, Jul 22, 2022 at 04:35:30PM +1200, Thomas Munro wrote:
> I noticed another (presumed) Raspberry Pi apparently behaving
> strangely at the storage level (guessing it's a Pi by the armv7l
> architecture): dangomushi appears to get files mixed up. Here it is
> trying to compile a log file last week:

This is a Pi 2.

> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-07-14%2017%3A58%3A38
>
> And the week before it tried to compile some Perl:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-07-09%2015%3A30%3A07

The buildfarm runs happen on an SD card that has been in service for a couple of years now, so I would not be surprised if the issue comes from those years of use. A couple of fsck runs did not turn up anything, though, and I am keeping an eye on it.
--
Michael