Re: IO in wrong state on riscv64 - Mailing list pgsql-hackers
From | Alexander Lakhin |
---|---|
Subject | Re: IO in wrong state on riscv64 |
Date | |
Msg-id | 60bbebf4-0a80-4e5f-83c0-311c85f6bd5c@gmail.com Whole thread Raw |
In response to | Re: IO in wrong state on riscv64 (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: IO in wrong state on riscv64
Re: IO in wrong state on riscv64 |
List | pgsql-hackers |
Hello Thomas, 12.10.2025 06:35, Thomas Munro wrote: > On Sun, Oct 12, 2025 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: >> 2025-10-11 11:34:46.793 GMT [1169773:1] PANIC: !!!pgaio_io_wait| ioh->state changed from 0 to 1 at iteration 0 >> # no other iteration number observed > Can you please disassemble pgaio_io_update_state() and > pgaio_io_was_recycled()? I wonder if the memory barriers are not > being generated correctly, causing the state and generation to be > loaded out of order, or something like that... Please find those attached (gdb "disass/m pgaio_io_update_state" misses the start of the function (but it's still disassembled below), so I decided to share the whole output). This is from clean master, without any modifications, but with the issue confirmed for this build: 2025-10-11 20:29:11.724 UTC [1679534:1] [unknown] LOG: connection received: host=[local] 2025-10-11 20:29:11.725 UTC [1679536:1] [unknown] LOG: connection received: host=[local] 2025-10-11 20:29:11.726 UTC [1679538:1] [unknown] LOG: connection received: host=[local] 2025-10-11 20:29:11.724 UTC [1679533:1] [unknown] LOG: connection received: host=[local] 2025-10-11 20:29:11.724 UTC [1679537:1] [unknown] LOG: connection received: host=[local] 2025-10-11 20:29:11.724 UTC [1679535:1] [unknown] LOG: connection received: host=[local] 2025-10-11 20:29:11.729 UTC [1679539:1] [unknown] LOG: connection received: host=[local ... 2025-10-11 20:29:11.778 UTC [1679537:3] [unknown] LOG: connection authorized: user=debian database=regression application_name=pg_regress/create_schema 2025-10-11 20:29:11.778 UTC [1679533:3] [unknown] LOG: connection authorized: user=debian database=regression application_name=pg_regress/create_type 2025-10-11 20:29:11.778 UTC [1679539:3] [unknown] LOG: connection authorized: user=debian database=regression application_name=pg_regress/create_procedure 2025-10-11 20:29:11.778 UTC [1679536:3] [unknown] LOG: connection authorized: user=debian database=regression application_name=pg_regress/create_misc 2025-10-11 20:29:11.778 UTC [1679538:3] [unknown] LOG: connection authorized: user=debian database=regression application_name=pg_regress/create_table 2025-10-11 20:29:11.778 UTC [1679535:3] [unknown] LOG: connection authorized: user=debian database=regression application_name=pg_regress/create_function_c 2025-10-11 20:29:11.790 UTC [1679534:2] [unknown] FATAL: IO in wrong state: 0 2025-10-11 20:29:11.790 UTC [1679534:3] [unknown] BACKTRACE: postgres: debian regression [local] authentication(+0x4371ec) [0x555565cd61ec] postgres: debian regression [local] authentication(+0x442bb2) [0x555565ce1bb2] postgres: debian regression [local] authentication(StartBufferIO+0x66) [0x555565ce1a10] postgres: debian regression [local] authentication(+0x43df60) [0x555565cdcf60] postgres: debian regression [local] authentication(StartReadBuffer+0x308) [0x555565cdc8a4] postgres: debian regression [local] authentication(ReadBufferExtended+0x64) [0x555565cdaace] ... postgres: debian regression [local] authentication(postmaster_child_launch+0x132) [0x555565c7787c] postgres: debian regression [local] authentication(+0x3dcd52) [0x555565c7bd52] postgres: debian regression [local] authentication(InitProcessGlobals+0) [0x555565c79c96] postgres: debian regression [local] authentication(+0x3194b2) [0x555565bb84b2] /lib/riscv64-linux-gnu/libc.so.6(+0x2791c) [0x7fffa690891c] /lib/riscv64-linux-gnu/libc.so.6(__libc_start_main+0x74) [0x7fffa69089c4] postgres: debian regression [local] authentication(_start+0x20) [0x5555659842c8] > The previous failure on greenfly was a TIMEOUT in the same test, as if > a query was hanging. Yeah, I'll try to reproduce it too... > On Sun, Oct 12, 2025 at 2:00 AM Alexander Lakhin <exclusion@gmail.com> wrote: >> I've managed to reproduce it using qemu-system-riscv64 with Debian trixie > Huh, that's interesting. What is the host architecture? When I saw > that error myself and wondered about memory order, I dismissed the > idea of trying with qemu, figuring that my x86 host's TSO would affect > the coherency, but thinking again about that... I guess the compiler > might still reorder during riscv codegen if there is something wrong > with the barrier support, and even if it doesn't, the binary > translation to x86 might also feel free to reorder stuff if there are > no barrier instructions to prevent it? Or maybe that doesn't happen > but your host is ARM? I use AMD Ryzen 9 7900X, Ubuntu 24.04 and run the risc machine with: qemu-system-riscv64 -machine virt -m 1G -smp 8 -cpu rv64 -device virtio-blk-device,drive=hd \ -drive file=.../dqib_riscv64-virt/image.qcow2,if=none,id=hd -device virtio-net-device,netdev=net \ -netdev user,id=net,hostfwd=tcp::22222-:22 -bios /usr/lib/riscv64-linux-gnu/opensbi/generic/fw_jump.elf \ -kernel /usr/lib/u-boot/qemu-riscv64_smode/uboot.elf -object rng-random,filename=/dev/urandom,id=rng \ -device virtio-rng-device,rng=rng -nographic -append "root=LABEL=rootfs console=ttyS0" I don't know what hardware greenfly uses (CC'ing Greg in case he'd like to share some info on this), but timings are similar to what I'm seeing: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=greenfly&dt=2025-10-09%2000%3A06%3A03&stg=check ok 160 - select_parallel 5019 ms `make check` on mine, with select_parallel repeated: ok 160 - select_parallel 6042 ms ok 161 - select_parallel 6089 ms ok 162 - select_parallel 6116 ms copperhead and boomslang, which are using real RISC hardware [1], show better timings: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=copperhead&dt=2025-10-11%2019%3A36%3A33&stg=check ok 160 - select_parallel 2677 ms https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=boomslang&dt=2025-10-12%2002%3A17%3A53&stg=check ok 160 - select_parallel 3651 ms Thank you for looking into this! [1] https://www.postgresql.org/message-id/3db97903-884f-4b0c-b1cd-d7442e71ea75%40app.fastmail.com Best regards, Alexander
Attachment
pgsql-hackers by date: