Thread: RISC-V animals sporadically produce weird memory-related failures
Hello hackers,

While investigating a recent copperhead failure [1] with the following diagnostics:

2024-08-20 20:56:47.318 CEST [2179731:95] LOG:  server process (PID 2184722) was terminated by signal 11: Segmentation fault
2024-08-20 20:56:47.318 CEST [2179731:96] DETAIL:  Failed process was running: COPY hash_f8_heap FROM '/home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/data/hash.data';

Core was generated by `postgres: pgbf regression [local] COPY '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000002ac8e62674 in heap_multi_insert (relation=0x3f9525c890, slots=0x2ae68a5b30, ntuples=<optimized out>, cid=<optimized out>, options=<optimized out>, bistate=0x2ae6891c18) at heapam.c:2296
2296            tuple->t_tableOid = slots[i]->tts_tableOid;
#0  0x0000002ac8e62674 in heap_multi_insert (relation=0x3f9525c890, slots=0x2ae68a5b30, ntuples=<optimized out>, cid=<optimized out>, options=<optimized out>, bistate=0x2ae6891c18) at heapam.c:2296
#1  0x0000002ac8f41656 in table_multi_insert (bistate=<optimized out>, options=<optimized out>, cid=<optimized out>, nslots=1000, slots=0x2ae68a5b30, rel=<optimized out>) at ../../../src/include/access/tableam.h:1460
#2  CopyMultiInsertBufferFlush (miinfo=miinfo@entry=0x3ff87bceb0, buffer=0x2ae68a5b30, processed=processed@entry=0x3ff87bce90) at copyfrom.c:415
#3  0x0000002ac8f41f6c in CopyMultiInsertInfoFlush (processed=0x3ff87bce90, curr_rri=0x2ae67eacf8, miinfo=0x3ff87bceb0) at copyfrom.c:532
#4  CopyFrom (cstate=cstate@entry=0x2ae6897fc0) at copyfrom.c:1242
...
$1 = {si_signo = 11, ... _sigfault = {si_addr = 0x2ae600cbcc}, ...

I discovered a similar-looking failure, [2]:

2023-02-11 18:33:09.222 CET [2591215:73] LOG:  server process (PID 2596066) was terminated by signal 11: Segmentation fault
2023-02-11 18:33:09.222 CET [2591215:74] DETAIL:  Failed process was running: COPY bt_i4_heap FROM '/home/pgbf/buildroot/HEAD/pgsql.build/src/test/regress/data/desc.data';

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000002adc9bc61a in heap_multi_insert (relation=0x3fa3bd53a8, slots=0x2b098a13c0, ntuples=<optimized out>, cid=<optimized out>, options=<optimized out>, bistate=0x2b097eda10) at heapam.c:2095
2095            tuple->t_tableOid = slots[i]->tts_tableOid;

But then I also found different failures on copperhead, all looking like memory-related anomalies:

[3]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  fixempties (f=0x0, nfa=0x2b02a59410) at regc_nfa.c:2246
2246            for (a = inarcsorig[s2->no]; a != NULL; a = a->inchain)

[4]
pgsql.build/src/bin/pg_rewind/tmp_check/log/regress_log_004_pg_xlog_symlink
malloc(): memory corruption (fast)

[5]
2022-11-22 20:22:48.907 CET [1364156:4] LOG:  server process (PID 1364221) was terminated by signal 11: Segmentation fault
2022-11-22 20:22:48.907 CET [1364156:5] DETAIL:  Failed process was running: BASE_BACKUP LABEL 'pg_basebackup base backup' PROGRESS NOWAIT TABLESPACE_MAP MANIFEST 'yes'

[6]
psql exited with signal 11 (core dumped): '' while running 'psql -XAtq -d port=60743 host=/tmp/zHq9Kzn2b5 dbname='postgres' -f - -v ON_ERROR_STOP=1' at /home/pgbf/buildroot/REL_14_STABLE/pgsql.build/contrib/bloom/../../src/test/perl/PostgresNode.pm line 1855.

[7]
- locktype | classid | objid | objsubid | mode | granted
+ locktype | classid | objid | objsubid | mode | gr_nted
(the most mysterious case)

[8]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  GetMemoryChunkContext (pointer=0x2b21bca1f8) at ../../../../src/include/utils/memutils.h:128
128             context = *(MemoryContext *) (((char *) pointer) - sizeof(void *));
...
$1 = {si_signo = 11, ... _sigfault = {si_addr = 0x2b21bca1f0}, ...

[9]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  fixempties (f=0x0, nfa=0x2ac0bf4c60) at regc_nfa.c:2246
2246            for (a = inarcsorig[s2->no]; a != NULL; a = a->inchain)

Moreover, the other RISC-V animal, boomslang, produced weird failures too:

[10]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000002ae6b50abe in ExecInterpExpr (state=0x2b20ca0040, econtext=0x2b20c9fba8, isnull=<optimized out>) at execExprInterp.c:678
678             resultslot->tts_values[resultnum] = state->resvalue;

[11]
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000002addf22728 in ExecInterpExpr (state=0x2ae0af8848, econtext=0x2ae0b16028, isnull=<optimized out>) at execExprInterp.c:666
666             resultslot->tts_values[resultnum] = scanslot->tts_values[attnum];

[12]
INSERT INTO ftable SELECT * FROM generate_series(1, 70000) i;
Core was generated by `postgres: buildfarm contrib_regression_postgres_fdw [local] INS'.
Program terminated with signal SIGABRT, Aborted.

As far as I can see, these animals run on Debian 10 with kernel version 5.15.5-2~bpo11+1 (2022-01-10), but RISC-V was declared an official Debian architecture only on 2023-07-23 [14]. So maybe the installed OS version is not stable enough for testing... (I've tried running the regression tests on a RISC-V machine emulated with qemu, running Debian trixie, kernel version 6.8.12-1 (2024-05-31), and got no failures.)

Dear copperhead and boomslang owner, could you consider upgrading the OS on these animals to rule out effects of OS anomalies that might already be fixed? If that's not an option, could you perform stress testing of these machines, say, with stress-ng?
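[The stress test suggested above could be run along these lines. This is a sketch only: the stressor selection, memory fraction, and timeouts are illustrative assumptions, not recommendations from this thread.]

```shell
# Exercise RAM with verification enabled, so that flaky memory shows up
# as reported verification failures rather than silent corruption.
stress-ng --vm 4 --vm-bytes 75% --vm-method all --verify --timeout 30m

# Also hammer the CPU caches, another common source of sporadic faults.
stress-ng --cache 4 --verify --timeout 30m
```

A nonzero exit status or any "verify" error in the output would point at a hardware problem rather than a PostgreSQL bug.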
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-08-20%2017%3A59%3A12
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-02-11%2016%3A41%3A58
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-02-09%2001%3A25%3A06
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-03-21%2022%3A58%3A43
[5] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2022-11-22%2019%3A00%3A19
[6] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2022-11-24%2018%3A45%3A45
[7] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-03-19%2017%3A21%3A17
[8] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2023-03-11%2016%3A54%3A52
[9] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2022-11-11%2021%3A39%3A04
[10] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=boomslang&dt=2023-03-12%2008%3A32%3A48
[11] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=boomslang&dt=2022-09-22%2007%3A38%3A42
[12] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=boomslang&dt=2022-10-18%2006%3A51%3A13
[13] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=boomslang&dt=2022-09-27%2006%3A57%3A38
[14] https://lists.debian.org/debian-riscv/2023/07/msg00053.html

Best regards,
Alexander
Hello Alexander,

On Thu, Aug 22, 2024, at 11:00 AM, Alexander Lakhin wrote:
> Dear copperhead and boomslang owner, could you consider upgrading the OS
> on these animals to rule out effects of OS anomalies that might already
> be fixed? If that's not an option, could you perform stress testing of
> these machines, say, with stress-ng?

Thank you for investigating, and sorry about the delay on my side.

Both boomslang and copperhead were running on the same HiFive Unmatched board (with soldered RAM) [1]. When I configured this machine in late 2021, riscv64 was a Debian ports architecture. When it became an official architecture, all the packages were rebuilt, so upgrading this machine essentially meant re-installing it completely.

I have now done just that, but on a new HiFive Premier P550 board [2]. It is running Ubuntu 24.04 LTS with a board-specific kernel, currently 6.6.21-9-premier (2024-11-09). The buildfarm client is executing within a Debian Trixie container created from the official Debian repo.

This stack is a lot more recent, should be more future-proof, and the board is significantly faster too. Boomslang has already built all branches and copperhead is currently going through them.

[1] https://www.sifive.com/boards/hifive-unmatched
[2] https://www.sifive.com/press/hifive-premier-p550-development-boards-now-shipping

Best regards,
Tom
Hello Tom,

17.11.2024 20:28, Tom Turelinckx wrote:
> I have now done just that, but on a new HiFive Premier P550 board [2]. It is running Ubuntu 24.04 LTS with a board-specific kernel, currently 6.6.21-9-premier (2024-11-09). The buildfarm client is executing within a Debian Trixie container created from the official Debian repo.
>
> This stack is a lot more recent, should be more future-proof, and the board is significantly faster too. Boomslang has already built all branches and copperhead is currently going through them.

Thank you for upgrading these machines!

Could you please take a look at new failures produced by copperhead recently?

[1]
2024-11-30 19:34:53.302 CET [13395:4] LOG:  server process (PID 13439) was terminated by signal 11: Segmentation fault
2024-11-30 19:34:53.302 CET [13395:5] DETAIL:  Failed process was running: SELECT '' AS tf_12, BOOLTBL1.*, BOOLTBL2.* FROM BOOLTBL1, BOOLTBL2 WHERE BOOLTBL2.f1 <> BOOLTBL1.f1;

[2]
2024-11-30 19:54:11.478 CET [27560:15] LOG:  server process (PID 28459) was terminated by signal 11: Segmentation fault
2024-11-30 19:54:11.478 CET [27560:16] DETAIL:  Failed process was running: SELECT count(*) FROM test_tsvector WHERE a @@ any ('{wr,qh}');

These crashes are hardly related to code changes, so maybe there are still platform-specific issues... I've run 100 iterations of `make check` for REL_13_STABLE, using trixie/sid 6.8.12-riscv64 (gcc 14.2.0), emulated with qemu-system-riscv64, with no failures.

Unfortunately, the saved log files don't include coredump information, maybe because of an inappropriate core_pattern. (Previously, a stack trace was extracted in case of a crash: [3].)

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-11-30%2018%3A16%3A37
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-11-30%2018%3A35%3A17
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2024-09-03%2016%3A38%3A46

Best regards,
Alexander
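[The repeated-run approach described above can be scripted roughly as follows. A sketch only: it assumes an already configured and built PostgreSQL source tree, and the iteration count is taken from the message above.]

```shell
# Repeat the regression suite to catch sporadic, hardware-like failures
# that a single run would miss. Stop at the first failing iteration.
for i in $(seq 1 100); do
    echo "=== make check, iteration $i ==="
    make check || { echo "failure on iteration $i" >&2; exit 1; }
done
```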
Hi Alexander,

On Mon, Dec 2, 2024, at 2:00 PM, Alexander Lakhin wrote:
> These crashes are hardly related to code changes, so maybe there are
> still platform-specific issues...

I naively assumed that because llvm and clang are available in Trixie on riscv64, I could simply install them and enable --with-llvm on copperhead, but I then discovered that this caused lots of segmentation faults and I had to revert the --with-llvm change again. Sorry for submitting results before testing this first.

> Unfortunately, the saved log files don't include coredump information,
> maybe because of an inappropriate core_pattern.

I had increased the core file size limit in /etc/security/limits.conf, but in Trixie this is overruled by a default /etc/security/limits.d/10-coredump-debian.conf. Moreover, the core_pattern was set by apport on the Ubuntu lxc host, but apport is not available in the Trixie lxc guest. I have now corrected both issues, and a simple test resulted in a core file being written to the current directory, as it was before the upgrade.

Best regards,
Tom
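[For anyone hitting the same missing-coredump problem, the two knobs involved can be inspected and set roughly like this. A sketch under the assumptions above (Debian/Ubuntu file locations); the exact limits.d override contents are illustrative.]

```shell
# Check the effective core file size limit ("0" means cores are disabled)
# and where the kernel writes or pipes core dumps.
ulimit -c
cat /proc/sys/kernel/core_pattern

# Distro defaults such as /etc/security/limits.d/10-coredump-debian.conf
# can override /etc/security/limits.conf; an override file later in the
# lexical order, e.g. /etc/security/limits.d/99-core-unlimited.conf, with
#     *    soft    core    unlimited
# restores unlimited cores. To get plain core files in the current
# directory instead of piping to apport/systemd-coredump:
#     sysctl -w kernel.core_pattern=core
```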
02.12.2024 18:25, Tom Turelinckx wrote:
>> These crashes are hardly related to code changes, so maybe there are
>> still platform-specific issues...
>
> I naively assumed that because llvm and clang are available in Trixie on riscv64, I could simply install them and enable --with-llvm on copperhead, but I then discovered that this caused lots of segmentation faults and I had to revert the --with-llvm change again. Sorry for submitting results before testing this first.

Thank you for the clarification! I hadn't noticed the "--with-llvm" option added in the configuration...

Now I've re-run `make check` for an llvm-enabled build (made with clang 19.1.4) locally and got the same:

2024-12-02 16:49:47.620 UTC postmaster[21895] LOG:  server process (PID 21933) was terminated by signal 11: Segmentation fault
2024-12-02 16:49:47.620 UTC postmaster[21895] DETAIL:  Failed process was running: SELECT '' AS tf_12, BOOLTBL1.*, BOOLTBL2.* FROM BOOLTBL1, BOOLTBL2 WHERE BOOLTBL2.f1 <> BOOLTBL1.f1;

A build made with clang-19 without llvm passed `make check` successfully.

> I had increased the core file size limit in /etc/security/limits.conf, but in Trixie this is overruled by a default /etc/security/limits.d/10-coredump-debian.conf. Moreover, the core_pattern was set by apport on the Ubuntu lxc host, but apport is not available in the Trixie lxc guest. I have now corrected both issues, and a simple test resulted in a core file being written to the current directory, as it was before the upgrade.

Thank you for fixing this!

Best regards,
Alexander
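[An llvm-enabled build like the one described above would be configured roughly as follows. A sketch: --with-llvm, LLVM_CONFIG, and CLANG are standard PostgreSQL configure knobs, but the "-19" package suffixes and paths are assumptions about the test environment.]

```shell
# Configure PostgreSQL with JIT support against clang/LLVM 19, then run
# the regression suite that reproduced the segfault.
CC=clang-19 CLANG=clang-19 LLVM_CONFIG=/usr/bin/llvm-config-19 \
    ./configure --with-llvm
make -j"$(nproc)"
make check
```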
On Tue, Dec 3, 2024 at 7:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> A build made with clang-19 without llvm passed `make check` successfully.

We heard in another thread [1] that we'd need to use the JITLink API for RISC-V, instead of the RuntimeDyld API we're using. I have a newer patch to use JITLink on all architectures, starting at some LLVM version, but it needs a bit more polish and research before sharing.

I'm surprised it's segfaulting instead of producing an error of some sort, though; I wonder why. It would be nice if we could fail gracefully instead.

Hmm, from a quick look at the LLVM main branch, it looks like a bunch of RISC-V stuff landed in recent months under llvm/lib/ExecutionEngine/RuntimeDyld, so maybe that's no longer true on bleeding-edge LLVM (20-devel). I have no idea what state that's in, but IIUC there is no way RuntimeDyld could work on LLVM 16 or 19.

[1] https://www.postgresql.org/message-id/flat/20220829074622.2474104-1-alex.fan.q%40gmail.com