RISC-V animals sporadically produce weird memory-related failures

From
Alexander Lakhin
Date:
Hello hackers,

While investigating a recent copperhead failure [1] with the following
diagnostics:
2024-08-20 20:56:47.318 CEST [2179731:95] LOG:  server process (PID 2184722) was terminated by signal 11: Segmentation
fault
2024-08-20 20:56:47.318 CEST [2179731:96] DETAIL:  Failed process was running: COPY hash_f8_heap FROM 
'/home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/data/hash.data';

Core was generated by `postgres: pgbf regression [local] COPY                                        '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000002ac8e62674 in heap_multi_insert (relation=0x3f9525c890, slots=0x2ae68a5b30, ntuples=<optimized out>, 
cid=<optimized out>, options=<optimized out>, bistate=0x2ae6891c18) at heapam.c:2296
2296            tuple->t_tableOid = slots[i]->tts_tableOid;
#0  0x0000002ac8e62674 in heap_multi_insert (relation=0x3f9525c890, slots=0x2ae68a5b30, ntuples=<optimized out>, 
cid=<optimized out>, options=<optimized out>, bistate=0x2ae6891c18) at heapam.c:2296
#1  0x0000002ac8f41656 in table_multi_insert (bistate=<optimized out>, options=<optimized out>, cid=<optimized out>, 
nslots=1000, slots=0x2ae68a5b30, rel=<optimized out>) at ../../../src/include/access/tableam.h:1460
#2  CopyMultiInsertBufferFlush (miinfo=miinfo@entry=0x3ff87bceb0, buffer=0x2ae68a5b30, 
processed=processed@entry=0x3ff87bce90) at copyfrom.c:415
#3  0x0000002ac8f41f6c in CopyMultiInsertInfoFlush (processed=0x3ff87bce90, curr_rri=0x2ae67eacf8, miinfo=0x3ff87bceb0)
    at copyfrom.c:532
#4  CopyFrom (cstate=cstate@entry=0x2ae6897fc0) at copyfrom.c:1242
...
$1 = {si_signo = 11,  ... _sigfault = {si_addr = 0x2ae600cbcc}, ...

I discovered a similar-looking failure [2]:
2023-02-11 18:33:09.222 CET [2591215:73] LOG:  server process (PID 2596066) was terminated by signal 11: Segmentation
fault
2023-02-11 18:33:09.222 CET [2591215:74] DETAIL:  Failed process was running: COPY bt_i4_heap FROM 
'/home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/data/desc.data';

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000002adc9bc61a in heap_multi_insert (relation=0x3fa3bd53a8, slots=0x2b098a13c0, ntuples=<optimized out>, 
cid=<optimized out>, options=<optimized out>, bistate=0x2b097eda10) at heapam.c:2095
2095            tuple->t_tableOid = slots[i]->tts_tableOid;

But then I also found other failures on copperhead, all looking like
memory-related anomalies:
[3]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  fixempties (f=0x0, nfa=0x2b02a59410) at regc_nfa.c:2246
2246                for (a = inarcsorig[s2->no]; a != NULL; a = a->inchain)

[4]
pgsql.build/src/bin/pg_rewind/tmp_check/log/regress_log_004_pg_xlog_symlink
malloc(): memory corruption (fast)

[5]
2022-11-22 20:22:48.907 CET [1364156:4] LOG:  server process (PID 1364221) was terminated by signal 11: Segmentation
fault
2022-11-22 20:22:48.907 CET [1364156:5] DETAIL:  Failed process was running: BASE_BACKUP LABEL 'pg_basebackup base 
backup' PROGRESS NOWAIT  TABLESPACE_MAP  MANIFEST 'yes'

[6]
psql exited with signal 11 (core dumped): '' while running 'psql -XAtq -d port=60743 host=/tmp/zHq9Kzn2b5 
dbname='postgres' -f - -v ON_ERROR_STOP=1' at 
/home/pgbf/buildroot/REL_14_STABLE/pgsql.build/contrib/bloom/../../src/test/perl/PostgresNode.pm line 1855.

[7]
- locktype | classid | objid | objsubid |     mode      | granted
+ locktype | classid | objid | objsubid |     mode      | gr_nted
(the most mysterious case)

[8]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  GetMemoryChunkContext (pointer=0x2b21bca1f8) at ../../../../src/include/utils/memutils.h:128
128        context = *(MemoryContext *) (((char *) pointer) - sizeof(void *));
...
$1 = {si_signo = 11, ... _sigfault = {si_addr = 0x2b21bca1f0}, ...

[9]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  fixempties (f=0x0, nfa=0x2ac0bf4c60) at regc_nfa.c:2246
2246                for (a = inarcsorig[s2->no]; a != NULL; a = a->inchain)


Moreover, the other RISC-V animal, boomslang, produced weird failures too:
[10]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000002ae6b50abe in ExecInterpExpr (state=0x2b20ca0040, econtext=0x2b20c9fba8, isnull=<optimized out>) at 
execExprInterp.c:678
678                resultslot->tts_values[resultnum] = state->resvalue;

[11]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000002addf22728 in ExecInterpExpr (state=0x2ae0af8848, econtext=0x2ae0b16028, isnull=<optimized out>) at 
execExprInterp.c:666
666                resultslot->tts_values[resultnum] = scanslot->tts_values[attnum];

[12]
INSERT INTO ftable SELECT * FROM generate_series(1, 70000) i;

Core was generated by `postgres: buildfarm contrib_regression_postgres_fdw [local] INS'.
Program terminated with signal SIGABRT, Aborted.

As far as I can see, these animals run on Debian 10 with kernel version
5.15.5-2~bpo11+1 (2022-01-10), but RISC-V was declared an official Debian
architecture only on 2023-07-23 [14]. So maybe the installed OS version is
not stable enough for testing...
(I've tried running the regression tests on a RISC-V machine emulated with
qemu, running Debian trixie, kernel version 6.8.12-1 (2024-05-31), and got
no failures.)

Dear copperhead and boomslang owner, could you consider upgrading the OS on
these animals to rule out effects of OS anomalies that might already be
fixed? If that's not an option, could you perform stress testing of these
machines, say, with stress-ng?
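
For example (just a sketch; the exact options depend on the stress-ng
version installed), a run along these lines should exercise the CPU and
memory with result verification enabled:

    stress-ng --cpu 4 --vm 2 --vm-bytes 75% --verify --metrics --timeout 1h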

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-08-20%2017%3A59%3A12
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-02-11%2016%3A41%3A58
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-02-09%2001%3A25%3A06
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-03-21%2022%3A58%3A43
[5] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2022-11-22%2019%3A00%3A19
[6] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2022-11-24%2018%3A45%3A45
[7] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-03-19%2017%3A21%3A17
[8] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-03-11%2016%3A54%3A52
[9] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2022-11-11%2021%3A39%3A04
[10] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=boomslang&dt=2023-03-12%2008%3A32%3A48
[11] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=boomslang&dt=2022-09-22%2007%3A38%3A42
[12] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=boomslang&dt=2022-10-18%2006%3A51%3A13
[13] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=boomslang&dt=2022-09-27%2006%3A57%3A38
[14] https://lists.debian.org/debian-riscv/2023/07/msg00053.html

Best regards,
Alexander



Re: RISC-V animals sporadically produce weird memory-related failures

From
"Tom Turelinckx"
Date:
Hello Alexander,

On Thu, Aug 22, 2024, at 11:00 AM, Alexander Lakhin wrote:
> Dear copperhead, boomslang owner, could you consider upgrading OS on
> these animals to rule out effects of OS anomalies that might be fixed
> already? If it's not an option, couldn't you perform stress testing of
> these machines, say, with stress-ng?

Thank you for investigating and sorry about the delay on my side.

Both boomslang and copperhead were running on the same HiFive Unmatched board (with soldered RAM) [1]. When I
configured this machine in late 2021, riscv64 was a Debian ports architecture. When it became an official architecture,
all the packages were rebuilt, so upgrading this machine essentially meant re-installing it completely.

I have now done just that, but on a new HiFive Premier P550 board [2]. It is running Ubuntu 24.04 LTS with a
board-specific kernel, currently 6.6.21-9-premier (2024-11-09). The buildfarm client is executing within a Debian
Trixie container created from the official Debian repo.

This stack is a lot more recent, should be more future-proof, and the board is significantly faster too. Boomslang has
already built all branches and copperhead is currently going through them.

[1] https://www.sifive.com/boards/hifive-unmatched
[2] https://www.sifive.com/press/hifive-premier-p550-development-boards-now-shipping

Best regards,
Tom



Re: RISC-V animals sporadically produce weird memory-related failures

From
Alexander Lakhin
Date:
Hello Tom,

17.11.2024 20:28, Tom Turelinckx wrote:
> I have now done just that, but on a new HiFive Premier P550 board [2]. It is running Ubuntu 24.04 LTS with a
> board-specific kernel, currently 6.6.21-9-premier (2024-11-09). The buildfarm client is executing within a Debian
> Trixie container created from the official Debian repo.
>
> This stack is a lot more recent, should be more future-proof, and the board is significantly faster too. Boomslang
> has already built all branches and copperhead is currently going through them.

Thank you for upgrading these machines!

Could you please take a look at the new failures recently produced by
copperhead?
[1]
2024-11-30 19:34:53.302 CET [13395:4] LOG:  server process (PID 13439) was terminated by signal 11: Segmentation fault
2024-11-30 19:34:53.302 CET [13395:5] DETAIL:  Failed process was running: SELECT '' AS tf_12, BOOLTBL1.*, BOOLTBL2.*
        FROM BOOLTBL1, BOOLTBL2
        WHERE BOOLTBL2.f1 <> BOOLTBL1.f1;

[2]
2024-11-30 19:54:11.478 CET [27560:15] LOG:  server process (PID 28459) was terminated by signal 11: Segmentation
fault
2024-11-30 19:54:11.478 CET [27560:16] DETAIL:  Failed process was running: SELECT count(*) FROM test_tsvector WHERE a
@@ any ('{wr,qh}');

These crashes are hardly related to code changes, so maybe there are
platform-specific issues still...
I've run 100 iterations of `make check` for REL_13_STABLE, using
trixie/sid 6.8.12-riscv64 (gcc 14.2.0), emulated with qemu-system-riscv64,
with no failures.
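
(The iterations were driven by a trivial loop over an already-configured
source tree, something like:

    for i in $(seq 1 100); do make check || break; done
)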

Unfortunately, the saved log files don't include core dump information,
maybe because of an inappropriate core_pattern.
(Previously, a stack trace was extracted in case of a crash: [3].)

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-11-30%2018%3A16%3A37
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-11-30%2018%3A35%3A17
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-09-03%2016%3A38%3A46

Best regards,
Alexander



Re: RISC-V animals sporadically produce weird memory-related failures

From
"Tom Turelinckx"
Date:
Hi Alexander,

On Mon, Dec 2, 2024, at 2:00 PM, Alexander Lakhin wrote:
> These crashes are hardly related to code changes, so maybe there are
> platform-specific issues still...

I naively assumed that because llvm and clang are available in Trixie on riscv64, I could simply install them and
enable --with-llvm on copperhead, but I then discovered that this caused lots of segmentation faults and I had to
revert --with-llvm again. Sorry for not testing this first without submitting results.

> Unfortunately, the log files saved don't include coredump information,
> maybe because of inappropriate core_pattern.

I had increased the core file size limit in /etc/security/limits.conf, but in Trixie this is overruled by a default
/etc/security/limits.d/10-coredump-debian.conf. Moreover, the core_pattern was set by apport on the Ubuntu lxc host,
but apport is not available in the Trixie lxc guest. I have now corrected both issues, and a simple test resulted in a
core file being written to the current directory, like it was before the upgrade.
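
For reference, the fixes amounted to roughly the following (illustrative
only; I'm assuming the buildfarm user is still 'pgbf'):

    # In the Trixie guest: make the core file size limit effectively
    # unlimited for the buildfarm user (the shipped
    # 10-coredump-debian.conf had been overriding the limits.conf entry):
    #   pgbf  soft  core  unlimited

    # On the Ubuntu host (core_pattern is a global sysctl, shared with
    # lxc guests):
    sysctl kernel.core_pattern            # was an apport pipe handler
    sysctl -w kernel.core_pattern=core    # write plain "core" files again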
 

Best regards,
Tom



Re: RISC-V animals sporadically produce weird memory-related failures

From
Alexander Lakhin
Date:
02.12.2024 18:25, Tom Turelinckx wrote:
>> These crashes are hardly related to code changes, so maybe there are
>> platform-specific issues still...
> I naively assumed that because llvm and clang are available in Trixie on riscv64, I could simply install them and
> enable --with-llvm on copperhead, but I then discovered that this caused lots of segmentation faults and I had to
> revert --with-llvm again. Sorry for not testing this first without submitting results.

Thank you for the clarification!
I hadn't noticed the "--with-llvm" option added to the configuration...
Now I've re-run `make check` for an llvm-enabled build (made with clang
19.1.4) locally and got the same:
2024-12-02 16:49:47.620 UTC postmaster[21895] LOG:  server process (PID 21933) was terminated by signal 11:
Segmentation fault
2024-12-02 16:49:47.620 UTC postmaster[21895] DETAIL:  Failed process was running: SELECT '' AS tf_12, BOOLTBL1.*, 
BOOLTBL2.*
            FROM BOOLTBL1, BOOLTBL2
            WHERE BOOLTBL2.f1 <> BOOLTBL1.f1;

A build made with clang-19 without llvm passed `make check` successfully.
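
(For completeness, the llvm-enabled build was configured roughly like this;
the version suffixes are those of my local packages and the exact flags are
only illustrative:

    ./configure CC=clang-19 CLANG=clang-19 LLVM_CONFIG=llvm-config-19 \
        --with-llvm --enable-debug --enable-cassert
    make -j"$(nproc)" && make check

For the build without llvm, simply drop --with-llvm.)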

> I had increased the core file size limit in /etc/security/limits.conf, but in Trixie this is overruled by a default
> /etc/security/limits.d/10-coredump-debian.conf. Moreover, the core_pattern was set by apport on the Ubuntu lxc host,
> but apport is not available in the Trixie lxc guest. I have now corrected both issues, and a simple test resulted in
> a core file being written to the current directory, like it was before the upgrade.

Thank you for fixing this!

Best regards,
Alexander



Re: RISC-V animals sporadically produce weird memory-related failures

From
Thomas Munro
Date:
On Tue, Dec 3, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> A build made with clang-19 without llvm passed `make check` successfully.

We heard in another thread[1] that we'd need to use the JITLink API
for RISCV, instead of the RuntimeDyld API we're using.  I have a newer
patch to use JITLink on all architectures, starting at some LLVM
version, but it needs a bit more polish and research before sharing.
I'm surprised it's segfaulting instead of producing an error of some
sort, though.  I wonder why.  It would be nice if we could fail
gracefully instead.

Hmm, from a quick look at the LLVM main branch, it looks like a bunch
of RISCV stuff just landed in recent months under
llvm/lib/ExecutionEngine/RuntimeDyld, so maybe that's not true anymore
on bleeding-edge LLVM (20-devel). I have no idea what state that's in,
but IIUC there is no way RuntimeDyld could work on LLVM 16 or 19.

[1] https://www.postgresql.org/message-id/flat/20220829074622.2474104-1-alex.fan.q%40gmail.com