Thread: Strange failures on chipmunk

Strange failures on chipmunk

From
Thomas Munro
Date:
Hi,

chipmunk (an armv6l-powered original Raspberry Pi model 1?) has failed
in a couple of weird ways recently on 14 and master.

On 14 I see what appears to be a corrupted log file name:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-06-16%2006%3A48%3A07

cp: cannot stat \342\200\230/home/pgbfarm/buildroot/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_002_archiving_primary_data/archives/000000010000000000000003\342\200\231: No such file or directory
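
A quick way to check whether the backslash-octal sequences around that path are damage or just escaped punctuation is to turn them back into bytes and decode them as UTF-8. The sketch below does that; the helper name decode_octal_escapes and the shortened sample path are illustrative assumptions, not anything the buildfarm itself provides.

```python
import re


def decode_octal_escapes(text: str) -> str:
    """Replace \\NNN octal escapes with the bytes they stand for, then decode as UTF-8."""
    raw = re.sub(
        rb"\\([0-3][0-7]{2})",                  # backslash followed by a 3-digit octal byte value
        lambda m: bytes([int(m.group(1), 8)]),
        text.encode("utf-8"),
    )
    return raw.decode("utf-8", errors="replace")


# Shortened form of the path in the log above (assumption: only the tail is kept).
sample = r"\342\200\230archives/000000010000000000000003\342\200\231"
decoded = decode_octal_escapes(sample)
print(decoded)
print([hex(ord(c)) for c in (decoded[0], decoded[-1])])
```

The bytes E2 80 98 and E2 80 99 are the UTF-8 encodings of U+2018 and U+2019, the typographic quotes that GNU tools print around file names in UTF-8 locales, so the escapes on their own do not show that the name was corrupted.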

On master, you can ignore this failure, because it was addressed by 93759c66:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-05-11%2015%3A26%3A01

Then there's this one-off, that smells like WAL corruption:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-06-13%2015%3A12%3A44

2022-06-13 23:02:06.988 EEST [30121:5] LOG:  incorrect resource
manager data checksum in record at 0/79B4FE0

Hmmm.  I suppose it's remotely possible that Linux/armv6l ext4 suffers
from concurrency bugs like Linux/sparc.  In that particular kernel
bug's case it's zeroes, so I guess it'd be easier to speculate about
if the log message included the checksum when it fails like that...
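
To illustrate why seeing the checksum values would help, here is a minimal sketch of a verifier that reports both the stored and the computed value and also flags an all-zero payload. It uses CRC-32C, the algorithm PostgreSQL uses for WAL records, but the function names and the flat "payload plus stored CRC" layout are assumptions for illustration and do not mirror the real WAL record format or the server's message text.

```python
from typing import Optional


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def describe_mismatch(payload: bytes, stored_crc: int) -> Optional[str]:
    """Return None if the checksum matches, else a message with both values."""
    computed = crc32c(payload)
    if computed == stored_crc:
        return None
    # Hypothetical extra hint: an all-zero payload points at zeroed blocks.
    zero_note = " (payload is all zeroes)" if payload and not any(payload) else ""
    return (
        f"checksum mismatch: stored {stored_crc:08X}, "
        f"computed {computed:08X}{zero_note}"
    )


print(describe_mismatch(bytes(32), 0x12345678))
```

With both values in the message, a reader could at least tell a zeroed region apart from a few flipped bits without having access to the WAL file itself.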



Re: Strange failures on chipmunk

From
Noah Misch
Date:
On Thu, Jun 30, 2022 at 10:07:18AM +1200, Thomas Munro wrote:
> Then there's this one-off, that smells like WAL corruption:
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-06-13%2015%3A12%3A44
> 
> 2022-06-13 23:02:06.988 EEST [30121:5] LOG:  incorrect resource
> manager data checksum in record at 0/79B4FE0
> 
> Hmmm.  I suppose it's remotely possible that Linux/armv6l ext4 suffers
> from concurrency bugs like Linux/sparc.

Running sparc64-ext4-zeros.c from
https://marc.info/?l=linux-sparc&m=164539269632667&w=2 could confirm that
possibility.



Re: Strange failures on chipmunk

From
Heikki Linnakangas
Date:
On 30/06/2022 09:31, Noah Misch wrote:
> On Thu, Jun 30, 2022 at 10:07:18AM +1200, Thomas Munro wrote:
>> Then there's this one-off, that smells like WAL corruption:
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=chipmunk&dt=2022-06-13%2015%3A12%3A44
>>
>> 2022-06-13 23:02:06.988 EEST [30121:5] LOG:  incorrect resource
>> manager data checksum in record at 0/79B4FE0
>>
>> Hmmm.  I suppose it's remotely possible that Linux/armv6l ext4 suffers
>> from concurrency bugs like Linux/sparc.
> 
> Running sparc64-ext4-zeros.c from
> https://marc.info/?l=linux-sparc&m=164539269632667&w=2 could confirm that
> possibility.

I ran sparc64-ext4-zeros on chipmunk for 10 minutes, and it didn't print 
anything.

It's possible that the SD card on chipmunk is simply wearing out and 
flipping bits. I can try to replace it. Anyone have suggestions on a 
test program I could run on the SD card, after replacing it, to verify 
if it was indeed worn out?

- Heikki
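
One simple form such a test program could take is a write-then-read-back check: fill a scratch file with random blocks, remember a digest of each, fsync, and compare on re-read. The sketch below is only that idea under stated assumptions; the function name write_and_verify and its parameters are hypothetical. Note that the read-back can be served from the kernel page cache rather than the card, so a meaningful run would write more data than the machine has RAM, or drop caches between the two phases.

```python
import hashlib
import os
import tempfile
from typing import List


def write_and_verify(directory: str, blocks: int = 64,
                     block_size: int = 1 << 20) -> List[int]:
    """Write random blocks to a scratch file in `directory`, read them back,
    and return the indexes of blocks whose contents changed."""
    digests = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="sdtest-")
    try:
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                block = os.urandom(block_size)
                digests.append(hashlib.sha256(block).digest())
                f.write(block)
            f.flush()
            os.fsync(f.fileno())    # push the data out to the device
        bad = []
        with open(path, "rb") as f:
            for index, expected in enumerate(digests):
                if hashlib.sha256(f.read(block_size)).digest() != expected:
                    bad.append(index)
        return bad
    finally:
        os.remove(path)             # always clean up the scratch file
```

Calling write_and_verify on a directory that lives on the card under test returns an empty list when every block reads back intact. A clean run does not prove the card is healthy, since worn flash can fail intermittently, but repeated mismatches would be strong evidence.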



Re: Strange failures on chipmunk

From
Thomas Munro
Date:
On Thu, Jun 30, 2022 at 8:21 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I ran sparc64-ext4-zeros on chipmunk for 10 minutes, and it didn't print
> anything.

Thanks for checking.

> It's possible that the SD card on chipmunk is simply wearing out and
> flipping bits. I can try to replace it. Anyone have suggestions on a
> test program I could run on the SD card, after replacing it, to verify
> if it was indeed worn out?

BTW its disk is full.

FWIW I run RPi4 build bots on higher-end USB3.x sticks (SanDisk
Extreme Pro, I'm sure there are others), and the performance is orders
of magnitude higher and more consistent than the micro SD and
cheap/random USB sticks I tried.  Admittedly they cost more than the
RPi4 boards themselves (back when you could get them).

I noticed another (presumed) Raspberry Pi apparently behaving
strangely at the storage level (guessing it's a Pi by the armv7l
architecture): dangomushi appears to get files mixed up.  Here it is
trying to compile a log file last week:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-07-14%2017%3A58%3A38

And the week before it tried to compile some Perl:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-07-09%2015%3A30%3A07



Re: Strange failures on chipmunk

From
Michael Paquier
Date:
On Fri, Jul 22, 2022 at 04:35:30PM +1200, Thomas Munro wrote:
> I noticed another (presumed) Raspberry Pi apparently behaving
> strangely at the storage level (guessing it's a Pi by the armv7l
> architecture): dangomushi appears to get files mixed up.  Here it is
> trying to compile a log file last week:

This is a Pi 2.

> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-07-14%2017%3A58%3A38
>
> And the week before it tried to compile some Perl:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2022-07-09%2015%3A30%3A07

The buildfarm runs from an SD card that has been in service for a
couple of years now, so I would not be surprised if the issue comes
from wear.  A couple of fsck runs did not turn up anything, but I am
keeping an eye on it.
--
Michael
