Thread: Non-reproducible AIO failure
My buildfarm animal sifaka just failed like this [1]:

TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322
0   postgres                  0x0000000100e3df2c ExceptionalCondition + 108
1   postgres                  0x0000000100cbb154 WaitReadBuffers + 616
2   postgres                  0x0000000100f072bc read_stream_next_buffer.cold.1 + 184
3   postgres                  0x0000000100cb6dd8 read_stream_next_buffer + 300
4   postgres                  0x00000001009b75b4 heap_fetch_next_buffer + 136
5   postgres                  0x00000001009ad8c4 heapgettup + 368
6   postgres                  0x00000001009ad364 heap_getnext + 104
7   postgres                  0x00000001009b9770 heapam_index_build_range_scan + 700
8   postgres                  0x00000001009ef4b8 spgbuild + 480
9   postgres                  0x0000000100a41acc index_build + 428
10  postgres                  0x0000000100a43d14 reindex_index + 752
11  postgres                  0x0000000100ad09e0 ExecReindex + 1908
12  postgres                  0x0000000100d04628 ProcessUtilitySlow + 1300
13  postgres                  0x0000000100d03754 standard_ProcessUtility + 1528
14  pg_stat_statements.dylib  0x000000010158d7a0 pgss_ProcessUtility + 632
15  postgres                  0x0000000100d02cec PortalRunUtility + 184
16  postgres                  0x0000000100d02354 PortalRunMulti + 236
17  postgres                  0x0000000100d01d94 PortalRun + 456
18  postgres                  0x0000000100d00e10 exec_simple_query + 1308
...
2025-04-23 16:08:28.834 EDT [78788:4] LOG:  client backend (PID 79322) was terminated by signal 6: Abort trap: 6
2025-04-23 16:08:28.834 EDT [78788:5] DETAIL:  Failed process was running: reindex index spgist_point_idx;

Scraping the buildfarm database confirmed that there have been no other similar failures since AIO went in, and unsurprisingly some attempts to reproduce it manually didn't bear fruit.  So I'm suspecting a race condition with a very narrow window, but I don't have an idea where to look.  Any thoughts on how to gather more data?

			regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=sifaka&dt=2025-04-23%2020%3A03%3A24&stg=recovery-check
Hi,

On 2025-04-23 17:17:01 -0400, Tom Lane wrote:
> My buildfarm animal sifaka just failed like this [1]:

There's nothing really special about sifaka, is there? I see -std=gnu99 and a few debug -D cppflags, but they don't look like they could really be relevant here.

> TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322

Huh. I assume there's no core file left over? Would be helpful to query some more state (e.g. whether operation->io_wref is valid (think it has to be) and whether this was a partial read or such).

> 0   postgres                  0x0000000100e3df2c ExceptionalCondition + 108
> 1   postgres                  0x0000000100cbb154 WaitReadBuffers + 616
> 2   postgres                  0x0000000100f072bc read_stream_next_buffer.cold.1 + 184
> 3   postgres                  0x0000000100cb6dd8 read_stream_next_buffer + 300
> 4   postgres                  0x00000001009b75b4 heap_fetch_next_buffer + 136
> 5   postgres                  0x00000001009ad8c4 heapgettup + 368
> 6   postgres                  0x00000001009ad364 heap_getnext + 104
> 7   postgres                  0x00000001009b9770 heapam_index_build_range_scan + 700
> 8   postgres                  0x00000001009ef4b8 spgbuild + 480
> 9   postgres                  0x0000000100a41acc index_build + 428
> 10  postgres                  0x0000000100a43d14 reindex_index + 752
> 11  postgres                  0x0000000100ad09e0 ExecReindex + 1908
> 12  postgres                  0x0000000100d04628 ProcessUtilitySlow + 1300
> 13  postgres                  0x0000000100d03754 standard_ProcessUtility + 1528
> 14  pg_stat_statements.dylib  0x000000010158d7a0 pgss_ProcessUtility + 632
> 15  postgres                  0x0000000100d02cec PortalRunUtility + 184
> 16  postgres                  0x0000000100d02354 PortalRunMulti + 236
> 17  postgres                  0x0000000100d01d94 PortalRun + 456
> 18  postgres                  0x0000000100d00e10 exec_simple_query + 1308
> ...
> 2025-04-23 16:08:28.834 EDT [78788:4] LOG:  client backend (PID 79322) was terminated by signal 6: Abort trap: 6
> 2025-04-23 16:08:28.834 EDT [78788:5] DETAIL:  Failed process was running: reindex index spgist_point_idx;

> Scraping the buildfarm database confirmed that there have been no
> other similar failures since AIO went in, and unsurprisingly some
> attempts to reproduce it manually didn't bear fruit.  So I'm
> suspecting a race condition with a very narrow window, but I don't
> have an idea where to look.  Any thoughts on how to gather more
> data?

The weird thing is that the concrete symptom here doesn't actually involve anything that could really race - the relevant state is only updated by the process itself.  But of course there has to be something narrow here, otherwise we'd have seen this before :(.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2025-04-23 17:17:01 -0400, Tom Lane wrote: >> My buildfarm animal sifaka just failed like this [1]: > There's nothing really special about sifaka, is there? I see -std=gnu99 and a > few debug -D cppflags, but they don't look they could really be relevant here. No, it's intended to be a bog-standard macOS environment. >> TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322 > Huh. > I assume there's no core file left over? No such luck, I did not have that enabled. I will turn it on though, and also on its sibling indri. regards, tom lane
Alexander Lakhin <exclusion@gmail.com> writes:
> FWIW, that Assert has just triggered on another mac:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2025-05-23%2020%3A30%3A07

Yeah, I was just looking at that too.  There is a corefile from that crash, but lldb seems unable to extract anything from it :-(.  There is a low-resolution stack trace in the postmaster log though:

0   postgres    0x0000000105299c84 ExceptionalCondition + 108
1   postgres    0x00000001051159ac WaitReadBuffers + 616
2   postgres    0x00000001053611ec read_stream_next_buffer.cold.1 + 184
3   postgres    0x0000000105111630 read_stream_next_buffer + 300
4   postgres    0x0000000104e0b994 heap_fetch_next_buffer + 136
5   postgres    0x0000000104e018f4 heapgettup_pagemode + 204
6   postgres    0x0000000104e02010 heap_getnextslot + 84
7   postgres    0x0000000104faebb4 SeqNext + 160

Of note here is that indri and sifaka run on the same host --- indri uses some MacPorts packages while sifaka doesn't, but that shouldn't have anything to do with our AIO code.

I trawled the buildfarm database and confirmed that these two crashes are our only similar reports (grepping for "PGAIO_RS_UNKNOWN"):

 sysname | branch |      snapshot       |     stage     | l
---------+--------+---------------------+---------------+-------------------------------------------------------------------------------------------------------------
 indri   | HEAD   | 2025-05-23 20:30:07 | recoveryCheck | TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"),File: "bufmgr.c", Line: 1605, PID: 20931
 sifaka  | HEAD   | 2025-04-23 20:03:24 | recoveryCheck | TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"),File: "bufmgr.c", Line: 1605, PID: 79322
(2 rows)

So it seems that "very low-probability issue in our Mac AIO code" is the most probable description.

			regards, tom lane
Hello Thomas,

24.05.2025 14:42, Thomas Munro wrote:
> On Sat, May 24, 2025 at 3:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So it seems that "very low-probability issue in our Mac AIO code" is
>> the most probable description.
> There isn't any macOS-specific AIO code so my first guess would be
> that it might be due to aarch64 weak memory reordering (though Andres
> speculated that it should all be one backend, huh), if it's not just
> a timing luck thing.  Alexander, were the other OSes you tried all on
> x86?

As I wrote off-list before, I had tried x86_64 only, but since then I tried to reproduce the issue on an aarch64 server with Ubuntu 24.04, running 10, then 40 instances of t/027_stream_regress.pl in parallel.  I've also multiplied the "test: brin ..." line x10.  But the issue is still not reproduced (in 8+ hours).

However, I've managed to get an AIO-related assertion failure on macOS 14.5 (Darwin Kernel Version 23.5.0) quite easily, running 027_stream_regress.pl the same way:

2 # +++ tap check in src/test/recovery_2 +++
3 # +++ tap check in src/test/recovery_3 +++
4 # +++ tap check in src/test/recovery_4 +++
5 # +++ tap check in src/test/recovery_5 +++
1 # +++ tap check in src/test/recovery_1 +++
7 # +++ tap check in src/test/recovery_7 +++
6 # +++ tap check in src/test/recovery_6 +++
8 # +++ tap check in src/test/recovery_8 +++
1 [16:07:31] t/027_stream_regress.pl ..
1 Dubious, test returned 29 (wstat 7424, 0x1d00)
1 All 1 subtests passed
1 [16:07:31]
1
1 Test Summary Report
1 -------------------
1 t/027_stream_regress.pl (Wstat: 7424 Tests: 1 Failed: 0)
1   Non-zero exit status: 29
1   Parse errors: No plan found in TAP output
1 Files=1, Tests=1, 197 wallclock secs ( 0.01 usr  0.01 sys +  4.47 cusr  5.79 csys = 10.28 CPU)
1 Result: FAIL

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
0   postgres                  0x0000000104f078f4 ExceptionalCondition + 236
1   postgres                  0x0000000104c0ebd4 pgaio_io_before_start + 260
2   postgres                  0x0000000104c0ea94 pgaio_io_start_readv + 36
3   postgres                  0x0000000104c2d4e8 FileStartReadV + 252
4   postgres                  0x0000000104c807c8 mdstartreadv + 668
5   postgres                  0x0000000104c83db0 smgrstartreadv + 116
6   postgres                  0x0000000104c1675c AsyncReadBuffers + 1444
7   postgres                  0x0000000104c15868 StartReadBuffersImpl + 1196
8   postgres                  0x0000000104c153ac StartReadBuffers + 64
9   postgres                  0x0000000104c129b4 read_stream_start_pending_read + 1204
10  postgres                  0x0000000104c12018 read_stream_look_ahead + 812
11  postgres                  0x0000000104c11cd0 read_stream_next_buffer + 1396
12  postgres                  0x0000000104606c38 heap_fetch_next_buffer + 284
13  postgres                  0x00000001045f79d4 heapgettup_pagemode + 192
14  postgres                  0x00000001045f7fa4 heap_getnextslot + 84
15  postgres                  0x000000010497a404 table_scan_getnextslot + 340
16  postgres                  0x000000010497a210 SeqNext + 148
17  postgres                  0x000000010497a8e4 ExecScanFetch + 772
18  postgres                  0x000000010497a4b8 ExecScanExtended + 164
19  postgres                  0x0000000104979bf0 ExecSeqScanWithProject + 244
20  postgres                  0x0000000104953ce0 ExecProcNode + 68
21  postgres                  0x000000010494f148 MultiExecPrivateHash + 64
22  postgres                  0x000000010494ed48 MultiExecHash + 100
23  postgres                  0x0000000104925a18 MultiExecProcNode + 180
24  postgres                  0x0000000104957220 ExecHashJoinImpl + 628
25  postgres                  0x00000001049566dc ExecHashJoin + 32
26  postgres                  0x0000000104925958 ExecProcNodeFirst + 100
27  postgres                  0x000000010491c08c ExecProcNode + 68
28  postgres                  0x000000010491692c ExecutePlan + 268
29  postgres                  0x000000010491679c standard_ExecutorRun + 504
30  pg_stat_statements.dylib  0x00000001055e132c pgss_ExecutorRun + 212
31  postgres                  0x0000000104916580 ExecutorRun + 72
32  postgres                  0x000000010491f59c ParallelQueryMain + 508
33  postgres                  0x00000001046968d4 ParallelWorkerMain + 1712
34  postgres                  0x0000000104b476f4 BackgroundWorkerMain + 824
35  postgres                  0x0000000104b4b87c postmaster_child_launch + 492
36  postgres                  0x0000000104b56610 StartBackgroundWorker + 416
37  postgres                  0x0000000104b51378 maybe_start_bgworkers + 552
38  postgres                  0x0000000104b540cc LaunchMissingBackgroundProcesses + 1316
39  postgres                  0x0000000104b51650 ServerLoop + 616
40  postgres                  0x0000000104b4fef0 PostmasterMain + 6632
41  postgres                  0x00000001049bfe20 main + 952
42  dyld                      0x00000001849d60e0 start + 2360

2025-05-24 16:07:23.427 EDT [22120:4] LOG:  background worker "parallel worker" (PID 32355) was terminated by signal 6: Abort trap: 6
2025-05-24 16:07:23.427 EDT [22120:5] DETAIL:  Failed process was running: select count(*) from join_foo left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss

I'm yet to see the Assert triggered on the buildfarm, but this one looks interesting too.

(I can share the complete patch + script for such testing, if it can be helpful.)

Best regards,
Alexander Lakhin
Neon (https://neon.tech)
On Sun, May 25, 2025 at 3:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Can you get a core and print *ioh in the debugger?
>
> So far, I've failed to get anything useful out of core files
> from this failure.  The trace goes back no further than
>
> (lldb) bt
> * thread #1
> * frame #0: 0x000000018de39388 libsystem_kernel.dylib`__pthread_kill + 8
>
> That's quite odd in itself: while I don't find the debugging
> environment on macOS to be the greatest, it's not normally
> this unhelpful.  (And Alexander reported the same off-list.)

It's interesting that the elog.c backtrace stuff is able to analyse the stack and it looks normal AFAICS.  Could that be interfering with the stack in the core?!  I doubt it ...  I kinda wonder if the debugger might be confused about libsystem sometimes since it has ceased to be a regular Mach-O file on disk, but IDK; maybe gdb (from MacPorts etc) would offer a clue?

So far we have:

TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 20931
0   postgres    0x0000000105299c84 ExceptionalCondition + 108
1   postgres    0x00000001051159ac WaitReadBuffers + 616
2   postgres    0x00000001053611ec read_stream_next_buffer.cold.1 + 184
3   postgres    0x0000000105111630 read_stream_next_buffer + 300
4   postgres    0x0000000104e0b994 heap_fetch_next_buffer + 136
5   postgres    0x0000000104e018f4 heapgettup_pagemode + 204

Hmm, looking around that code and wracking my brain for things that might happen on one OS but not others, I wonder about partial I/Os.  Perhaps combined with some overlapping requests.

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
0   postgres    0x0000000104f078f4 ExceptionalCondition + 236
1   postgres    0x0000000104c0ebd4 pgaio_io_before_start + 260
2   postgres    0x0000000104c0ea94 pgaio_io_start_readv + 36
3   postgres    0x0000000104c2d4e8 FileStartReadV + 252
4   postgres    0x0000000104c807c8 mdstartreadv + 668
5   postgres    0x0000000104c83db0 smgrstartreadv + 116

But this one seems like a more basic confusion... wild writes somewhere?  Hmm, we need to see what's in that struct.  If we can't get a debugger to break there or a core file to be analysable, maybe we should try logging as much info as possible at those points to learn a bit more?

I would be digging like that myself but I haven't seen this failure on my little M4 MacBook Air yet (Sequoia 15.5, Apple clang-1700.0.13.3).  It is infected with corporate security-ware that intercepts at least file system stuff and slows it down, and I can't even convince it to dump core files right now.  Could you guys please share your exact repro steps?
Thomas Munro <thomas.munro@gmail.com> writes: > Could you guys please share your exact repro steps? I've just been running 027_stream_regress.pl over and over. It's not a recommendable answer though because the failure probability is tiny, under 1%. It sounded like Alexander had a better way. regards, tom lane
Hi, On 2025-05-25 20:05:49 -0400, Tom Lane wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > Could you guys please share your exact repro steps? > > I've just been running 027_stream_regress.pl over and over. > It's not a recommendable answer though because the failure > probability is tiny, under 1%. It sounded like Alexander > had a better way. Just FYI, I've been trying to reproduce this as well, without a single failure so far. Despite running all tests for a few hundred times (~2 days) and 027_stream_regress.pl many hundreds of times (~1 day). This is on a m4 mac mini. I'm wondering if there's some hardware specific memory ordering issue or disk speed based timing issue that I'm just not hitting. Greetings, Andres Freund
Thomas Munro <thomas.munro@gmail.com> writes:
> Could you please share your configure options?

The failures on indri and sifaka were during ordinary buildfarm runs, you can check the animals' details on the website.  (Note those are the same host machine; the difference is that indri uses some MacPorts packages while sifaka is meant to be a bare-bones-macOS build.)

On the other two, I believe I just used my normal Mac devel configuration.  On the M4 laptop that's

$ ./configure CC="ccache gcc" CFLAGS="-O2 -fno-common" --enable-debug --enable-cassert --with-bonjour --without-icu --enable-tap-tests --with-system-tzdata=/usr/share/zoneinfo

but I see I haven't quite got the same thing on the M1 mini:

$ ./configure CC="ccache clang" --enable-debug --enable-cassert --enable-tap-tests --with-system-tzdata=/usr/share/zoneinfo

In both cases the compiler is really

$ gcc -v
Apple clang version 17.0.0 (clang-1700.0.13.3)
Target: arm64-apple-darwin24.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

I did some experimentation with different -O levels to see if that would make the failure any more or less common.  But I've yet to reproduce it at anything except -O2.

			regards, tom lane
Andres Freund <andres@anarazel.de> writes: > This is on a m4 mac mini. I'm wondering if there's some hardware specific > memory ordering issue or disk speed based timing issue that I'm just not > hitting. I dunno, I've seen it on three different physical machines now (one M1, two M4 Pros). But it is darn hard to repro, for sure. regards, tom lane
Hi, On 2025-05-27 10:12:28 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > This is on a m4 mac mini. I'm wondering if there's some hardware specific > > memory ordering issue or disk speed based timing issue that I'm just not > > hitting. > > I dunno, I've seen it on three different physical machines now > (one M1, two M4 Pros). But it is darn hard to repro, for sure. I just meant that it seems that I can't reproduce it for some as of yet unknown reason. I've now been through 3k+ runs of 027_stream_regress, without a single failure, so there has to be *something* different about my environment than yours. Darwin m4-dev 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024; root:xnu-11215.41.3~3/RELEASE_ARM64_T8132arm64 cc -v Apple clang version 16.0.0 (clang-1600.0.26.4) Target: arm64-apple-darwin24.1.0 Thread model: posix I guess I'll try to update to a later version and see if it repros there? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > I just meant that it seems that I can't reproduce it for some as of yet > unknown reason. I've now been through 3k+ runs of 027_stream_regress, without > a single failure, so there has to be *something* different about my > environment than yours. > Darwin m4-dev 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024; root:xnu-11215.41.3~3/RELEASE_ARM64_T8132arm64 > cc -v > Apple clang version 16.0.0 (clang-1600.0.26.4) > Target: arm64-apple-darwin24.1.0 > Thread model: posix > I guess I'll try to update to a later version and see if it repros there? Maybe. All the machines I've seen it on are current-software: $ uname -a Darwin minim4.sss.pgh.pa.us 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6041arm64 $ cc -v Apple clang version 17.0.0 (clang-1700.0.13.3) Target: arm64-apple-darwin24.5.0 Thread model: posix InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin If it's OS-version-specific, that raises the odds IMO that this is Apple's fault more than ours. regards, tom lane
Hello Thomas, 25.05.2025 05:45, Thomas Munro wrote: >> TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355 > Can you get a core and print *ioh in the debugger? It looks like I managed to restore all the conditions needed to reproduce that Assert more or less reliably (within a couple of hours), so I can continue experiments. I've added the following debugging: +++ b/src/backend/storage/aio/aio_io.c @@ -158,6 +158,7 @@ pgaio_io_before_start(PgAioHandle *ioh) Assert(ioh->state == PGAIO_HS_HANDED_OUT); Assert(pgaio_my_backend->handed_out_io == ioh); Assert(pgaio_io_has_target(ioh)); +if (ioh->op != PGAIO_OP_INVALID) elog(LOG, "!!!pgaio_io_before_start| ioh: %p, ioh->op: %d", ioh, ioh->op); Assert(ioh->op == PGAIO_OP_INVALID); --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -67,6 +67,7 @@ #include "utils/resowner.h" #include "utils/timestamp.h" +#include "storage/aio_internal.h" ... @@ -1855,7 +1860,11 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress) pgaio_submit_staged(); ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return); +fprintf(stderr, "!!!AsyncReadBuffers [%d]| ioh(2): %p, op: %d, blocknum: %d\n", getpid(), ioh, ((PgAioHandle *)ioh)->op, blocknum); } +else +fprintf(stderr, "!!!AsyncReadBuffers [%d]| ioh(1): %p, op: %d, blocknum: %d\n", getpid(), ioh, ((PgAioHandle *)ioh)->op, blocknum); + /* * Check if we can start IO on the first to-be-read buffer. ... (using fprintf here to rule out side effects of interrupt processing inside elog(), and it looks like fprintf there eases the reproducing) --- a/src/test/perl/PostgreSQL/Test/Cluster.pm +++ b/src/test/perl/PostgreSQL/Test/Cluster.pm @@ -712,7 +712,7 @@ sub init print $conf "wal_log_hints = on\n"; print $conf "hot_standby = on\n"; # conservative settings to ensure we can run multiple postmasters: - print $conf "shared_buffers = 1MB\n"; + print $conf "shared_buffers = 512kB\n"; print $conf "max_connections = 10\n"; (it looks like it increases the probability of failure too, will gather more statistics...) And got: ... 
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 121 !!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 122 !!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 123 !!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 124 !!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 125 !!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 126 !!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 1, blocknum: 127 2025-05-30 14:38:40.812 EDT [97859:121] pg_regress/replica_identity LOG: !!!pgaio_io_before_start| ioh: 0x106efd540, ioh->op: 1 2025-05-30 14:38:40.812 EDT [97859:122] pg_regress/replica_identity STATEMENT: SELECT c.oid, n.nspname, c.relname FROM pg_catalog.pg_class c LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace WHERE c.relname OPERATOR(pg_catalog.~) '^(test_replica_identity3)$' COLLATE pg_catalog.default AND pg_catalog.pg_table_is_visible(c.oid) ORDER BY 2, 3; TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 162, PID: 97859 0 postgres 0x0000000105823878 ExceptionalCondition + 236 1 postgres 0x000000010552ab54 pgaio_io_before_start + 452 2 postgres 0x000000010552a954 pgaio_io_start_readv + 36 3 postgres 0x000000010554946c FileStartReadV + 252 4 postgres 0x000000010559c74c mdstartreadv + 668 5 postgres 0x000000010559fd34 smgrstartreadv + 116 6 postgres 0x00000001055326e0 AsyncReadBuffers + 1664 7 postgres 0x00000001055317e8 StartReadBuffersImpl + 1196 8 postgres 0x000000010553132c StartReadBuffers + 64 9 postgres 0x000000010552e934 read_stream_start_pending_read + 1204 10 postgres 0x000000010552df98 read_stream_look_ahead + 812 11 postgres 0x000000010552dc50 read_stream_next_buffer + 1396 12 postgres 0x0000000104f22af8 heap_fetch_next_buffer + 284 13 postgres 0x0000000104f13894 heapgettup_pagemode + 192 14 postgres 0x0000000104f13e64 heap_getnextslot + 84 15 postgres 0x00000001052962c4 table_scan_getnextslot + 340 16 postgres 0x00000001052960d0 SeqNext + 148 17 postgres 0x00000001052967a4 ExecScanFetch + 772 18 postgres 0x0000000105296378 ExecScanExtended + 164 ... And another failed run: ... 
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 30 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 31 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 34 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 35 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 36 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 37 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 38 !!!AsyncReadBuffers [5043]| ioh(1): 0x102d14f10, op: 0, blocknum: 253 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 39 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 40 !!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 41 !!!AsyncReadBuffers [5043]| ioh(1): 0x102d14f10, op: 0, blocknum: 254 !!!AsyncReadBuffers [5043]| ioh(1): 0x102d14f10, op: 0, blocknum: 255 2025-05-30 22:48:00.265 EDT [5043:16] pg_regress/memoize LOG: !!!pgaio_io_before_start| ioh: 0x102d14f10, ioh->op: 1 2025-05-30 22:48:00.265 EDT [5043:17] pg_regress/memoize CONTEXT: PL/pgSQL function explain_memoize(text,boolean) line 5 at FOR over EXECUTE statement 2025-05-30 22:48:00.265 EDT [5043:18] pg_regress/memoize STATEMENT: SELECT explain_memoize(' SELECT COUNT(*), AVG(t1.twenty) FROM tenk1 t1 LEFT JOIN LATERAL (SELECT t1.twenty AS c1, t2.unique1 AS c2, t2.two FROM tenk1 t2) s ON t1.two = s.two WHERE s.c1 = s.c2 AND t1.unique1 < 1000;', false); TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 162, PID: 5043 ... 0 postgres 0x000000010163b878 ExceptionalCondition + 236 1 postgres 0x0000000101342b54 pgaio_io_before_start + 452 2 postgres 0x0000000101342954 pgaio_io_start_readv + 36 3 postgres 0x000000010136146c FileStartReadV + 252 4 postgres 0x00000001013b474c mdstartreadv + 668 5 postgres 0x00000001013b7d34 smgrstartreadv + 116 6 postgres 0x000000010134a6e0 AsyncReadBuffers + 1664 7 postgres 0x00000001013497e8 StartReadBuffersImpl + 1196 8 postgres 0x000000010134932c StartReadBuffers + 64 9 postgres 0x0000000101346934 read_stream_start_pending_read + 1204 10 postgres 0x0000000101345f98 read_stream_look_ahead + 812 11 postgres 0x0000000101345c50 read_stream_next_buffer + 1396 12 postgres 0x0000000100d3aaf8 heap_fetch_next_buffer + 284 13 postgres 0x0000000100d2b894 heapgettup_pagemode + 192 14 postgres 0x0000000100d2be64 heap_getnextslot + 84 15 postgres 0x00000001010ae2c4 table_scan_getnextslot + 340 16 postgres 0x00000001010ae0d0 SeqNext + 148 ... I run the same loop as shown at [1], but with a random number of parallel processes: np=$((1 + $RANDOM % 5)), and as I can see the failure is more likely with np == 5. I also tried running stress-ng --hdd 1 --timeout 60m concurrently, but I'm still not sure if it affects the reproducing directly. I use MacBook M1 and don't know how to disable E/P-cores, moreover, from my observation, the test performs slower (up to 10%) when it runs for hours in a loop, probably because of the CPU heating up and lowering it's frequency, so if the issue is really timing-dependent, it's not that easy to get more or less the same timings once more (I tried to add sleep into the loop, to let CPU cool down...). [1] https://www.postgresql.org/message-id/87673f76-1ed7-44dc-8e8b-6aaf78e2817a%40gmail.com Best regards, Alexander Lakhin Neon (https://neon.tech)
31.05.2025 06:00, Alexander Lakhin wrote: > Hello Thomas, > > It looks like I managed to restore all the conditions needed to reproduce > that Assert more or less reliably (within a couple of hours), so I can > continue experiments. > > I've added the following debugging: > ... With additional logging (the patch is attached), I can see the following: !!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2 !!!AsyncReadBuffers [63817] (1)| blocknum: 15, ioh: 0x1046b5660, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0 !!!pgaio_io_start_readv [63817]| ioh: 0x1046b5660, ioh->op: 1 !!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2 !!!AsyncReadBuffers [63817] (1)| blocknum: 16, ioh: 0x1046b5660, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0 !!!pgaio_io_start_readv [63817]| ioh: 0x1046b5660, ioh->op: 1 !!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2 !!!AsyncReadBuffers [63817] (1)| blocknum: 17, ioh: 0x1046b5660, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0 !!!pgaio_io_start_readv [63817]| ioh: 0x1046b5660, ioh->op: 1 !!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2 !!!AsyncReadBuffers [63817] (1)| blocknum: 18, ioh: 0x1046b5660, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0 2025-06-02 00:45:11.391 EDT [63817:14] pg_regress/brin LOG: !!!pgaio_io_before_start| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2 TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 63817 0 postgres 0x0000000102fdb60c ExceptionalCondition + 236 1 postgres 0x0000000102ce233c pgaio_io_before_start + 504 2 postgres 0x0000000102ce20a8 pgaio_io_start_readv + 36 3 postgres 0x0000000102d01200 FileStartReadV + 252 4 postgres 0x0000000102d544e0 mdstartreadv + 668 5 postgres 0x0000000102d57ac8 smgrstartreadv + 116 6 postgres 0x0000000102cea474 AsyncReadBuffers + 1996 7 postgres 0x0000000102ce9430 StartReadBuffersImpl + 1196 8 postgres 0x0000000102ce8f74 StartReadBuffers + 64 9 postgres 0x0000000102ce657c read_stream_start_pending_read + 1204 10 postgres 0x0000000102ce5c60 read_stream_look_ahead + 812 11 postgres 0x0000000102ce57f8 read_stream_next_buffer + 2356 12 postgres 0x00000001026da1b8 heap_fetch_next_buffer + 284 13 postgres 0x00000001026caf54 heapgettup_pagemode + 192 14 postgres 0x00000001026cb524 heap_getnextslot + 84 15 postgres 0x0000000102a4d984 table_scan_getnextslot + 340 16 postgres 0x0000000102a4d790 SeqNext + 148 17 postgres 0x0000000102a4de64 ExecScanFetch + 772 18 postgres 0x0000000102a4da38 ExecScanExtended + 164 19 postgres 0x0000000102a4d374 ExecSeqScanWithQualProject + 248 20 postgres 0x0000000102a13248 ExecProcNode + 68 21 postgres 0x0000000102a12e70 fetch_input_tuple + 180 22 postgres 0x0000000102a12bb4 agg_retrieve_direct + 1664 23 postgres 0x0000000102a0f914 ExecAgg + 240 ... 2025-06-02 00:45:11.409 EDT [57192:4] LOG: client backend (PID 63817) was terminated by signal 6: Abort trap: 6 2025-06-02 00:45:11.409 EDT [57192:5] DETAIL: Failed process was running: DO $x$ DECLARE r record; r2 record; cond text; idx_ctids tid[]; ss_ctids tid[]; count int; plan_ok bool; plan_line text; BEGIN ... Best regards, Alexander Lakhin Neon (https://neon.tech)
Hi,

Thanks for working on investigating this.

On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
> 02.06.2025 09:00, Alexander Lakhin wrote:
> > With additional logging (the patch is attached), I can see the following:
> > ...
> > !!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2
> > !!!AsyncReadBuffers [63817] (1)| blocknum: 18, ioh: 0x1046b5660, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0
> > 2025-06-02 00:45:11.391 EDT [63817:14] pg_regress/brin LOG: !!!pgaio_io_before_start| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2
> >
> > TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 63817
>
> I've added logging of the ioh->generation field at the beginning and at
> the end of pgaio_io_reclaim() (the complete patch is attached) and got
> even more interesting output:
> ...
> !!!pgaio_io_reclaim [25175]| ioh: 0x104c3e1a0, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 21692
> !!!pgaio_io_reclaim [25175]| ioh: 0x104c3e1a0, ioh->op: 0, ioh->generation: 21693
> !!!AsyncReadBuffers [25175] (1)| blocknum: 40, ioh: 0x104c3e1a0, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 21693
> !!!pgaio_io_start_readv [25175]| ioh: 0x104c3e1a0, ioh->op: 1, ioh->generation: 21693

IO starts.

> !!!pgaio_io_reclaim [25176]| ioh: 0x104c3e080, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 21408
> !!!pgaio_io_reclaim [25176]| ioh: 0x104c3e080, ioh->op: 0, ioh->generation: 21409
> !!!AsyncReadBuffers [25176] (1)| blocknum: 49, ioh: 0x104c3e080, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 21409
> !!!pgaio_io_start_readv [25176]| ioh: 0x104c3e080, ioh->op: 1, ioh->generation: 21409

(other backend / other IO)

> !!!pgaio_io_reclaim [25175]| ioh: 0x104c3e1a0, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 21693
> !!!pgaio_io_reclaim [25175]| ioh: 0x104c3e1a0, ioh->op: 0, ioh->generation: 21694

It completes and is marked as idle, as we'd expect.

> !!!AsyncReadBuffers [25175] (1)| blocknum: 41, ioh: 0x104c3e1a0, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 21694

And here, before starting a new IO, the op is 0 (i.e. PGAIO_OP_INVALID), as we would expect.

> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh: 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 21694

But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any "normal" reason for that.  We know that the IO wasn't actually started, as otherwise pgaio_io_start_readv() would have logged that fact.

Which then immediately leads to an assert failure:

> 2025-06-03 00:19:09.282 EDT [25175:2] STATEMENT: explain (analyze, format 'json')
> select count(*) from join_foo
> left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
> on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
>
> TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 25175
>
> 0   postgres   0x0000000103563550 ExceptionalCondition + 236
> 1   postgres   0x000000010326a260 pgaio_io_before_start + 516
> 2   postgres   0x0000000103269fb4 pgaio_io_start_readv + 36
> 3   postgres   0x0000000103289144 FileStartReadV + 252
> 4   postgres   0x00000001032dc424 mdstartreadv + 668
> 5   postgres   0x00000001032dfa0c smgrstartreadv + 116
> 6   postgres   0x00000001032723b8 AsyncReadBuffers + 2028
> 7   postgres   0x0000000103271354 StartReadBuffersImpl + 1196
> 8   postgres   0x0000000103270e98 StartReadBuffers + 64
> 9   postgres   0x000000010326e4a0 read_stream_start_pending_read + 1204
> 10  postgres   0x000000010326db84 read_stream_look_ahead + 812
> 11  postgres   0x000000010326d71c read_stream_next_buffer + 2356

I don't see how this can happen.  There is no mac specific code in AIO as of now - why aren't we seeing this on any other platform?

IME the theory that there is something outside of a subsystem randomly overwriting memory is almost always wrong, but I'm kinda starting to suspect it here nonetheless.  I wonder if it's worth trying to put a fair bit of "padding" memory before/after AIO's shared memory region(s).  If that were to stop the crashes, it'd likely be something random writing there.  Unfortunately continuing to crash wouldn't necessarily guarantee it's aio specific.

Hm.  We had issues with wrong size calculations of shmem regions before, what about writing a patch that puts mprotect(PROT_NONE) pages before/after each ShmemAlloc'd region?

Greetings,

Andres Freund
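To illustrate the guard-page idea (this is not the proposed ShmemAlloc patch, just a stand-alone sketch with made-up names): reserve an extra page on each side of a region and mprotect() them to PROT_NONE, so that an out-of-bounds write faults at the offending instruction rather than silently corrupting whatever happens to sit next to it in shared memory.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Return "size" usable bytes with an inaccessible page before and after.
 * A stray write that runs off either end of the padded region faults
 * immediately instead of silently scribbling on a neighbouring allocation.
 */
static void *
alloc_with_guard_pages(size_t size)
{
	size_t		pagesz = (size_t) sysconf(_SC_PAGESIZE);
	size_t		payload = ((size + pagesz - 1) / pagesz) * pagesz;
	char	   *base;

	base = mmap(NULL, payload + 2 * pagesz, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANON, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	/* make the first and the last page inaccessible */
	if (mprotect(base, pagesz, PROT_NONE) != 0 ||
		mprotect(base + pagesz + payload, pagesz, PROT_NONE) != 0)
	{
		munmap(base, payload + 2 * pagesz);
		return NULL;
	}
	return base + pagesz;
}

int
main(void)
{
	char	   *p = alloc_with_guard_pages(1024);

	memset(p, 0, 1024);			/* fine: inside the guarded region */
	printf("guarded region at %p\n", (void *) p);
	/* p[-1] = 0; would hit the leading PROT_NONE page and crash here */
	return 0;
}
```

Doing something like this underneath each ShmemAlloc'd region would waste up to a couple of pages per allocation, which is presumably acceptable for a debugging build.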
On Thu, Jun 5, 2025 at 8:02 AM Andres Freund <andres@anarazel.de> wrote: > On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote: > > 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh: > > 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: > > 2, ioh->generation: 21694 > > But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any > "normal" reason for that. We know that the IO wasn't actually started as > otherwise pgaio_io_start_readv() would have logged that fact. A cheap/easy thing to try, maybe: give that enumeration more distinctive values like 0xaaaa, 0xbbbb, or whatever, to see if we get a wild 1 here?
Hello Thomas and Andres, 04.06.2025 23:32, Thomas Munro wrote: > On Thu, Jun 5, 2025 at 8:02 AM Andres Freund <andres@anarazel.de> wrote: >> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote: >>> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh: >>> 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: >>> 2, ioh->generation: 21694 >> But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any >> "normal" reason for that. We know that the IO wasn't actually started as >> otherwise pgaio_io_start_readv() would have logged that fact. > A cheap/easy thing to try, maybe: give that enumeration more > distinctive values like 0xaaaa, 0xbbbb, or whatever, to see if we get > a wild 1 here? Thank you for your attention to this and for the tip! Today I tried the following: --- a/src/include/storage/aio.h +++ b/src/include/storage/aio.h @@ -89,8 +89,8 @@ typedef enum PgAioOp /* intentionally the zero value, to help catch zeroed memory etc */ PGAIO_OP_INVALID = 0, - PGAIO_OP_READV, - PGAIO_OP_WRITEV, + PGAIO_OP_READV = 0xaa, + PGAIO_OP_WRITEV = 0xbb, And got:: !!!AsyncReadBuffers [68292] (1)| blocknum: 9, ioh: 0x102b26230, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 13038 !!!pgaio_io_start_readv [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->generation: 13038 !!!pgaio_io_reclaim [67729]| ioh: 0x102b25540, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 47468 !!!pgaio_io_reclaim [67729]| ioh: 0x102b25540, ioh->op: 0, ioh->generation: 47469 !!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 13038 !!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 0, ioh->generation: 13039 !!!AsyncReadBuffers [68292] (1)| blocknum: 10, ioh: 0x102b26230, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 13039 !!!pgaio_io_start_readv [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->generation: 13039 !!!pgaio_io_reclaim [67718]| ioh: 0x102b25390, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 35821 !!!pgaio_io_reclaim [67718]| ioh: 0x102b25390, ioh->op: 0, ioh->generation: 35822 !!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 13039 !!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 0, ioh->generation: 13040 !!!AsyncReadBuffers [68292] (1)| blocknum: 11, ioh: 0x102b26230, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 13040 2025-06-05 04:15:38.847 EDT [68292:1] LOG: !!!pgaio_io_before_start| ioh: 0x102b26230, ioh->op: 170, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 13040 ### I reproduced the failure a few times and observed "ioh->op: 170" each time. 
### 2025-06-05 04:15:38.847 EDT [68292:2] STATEMENT: select count(*) from simple r join extremely_skewed s using (id); TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 68292 0 postgres 0x000000010144b550 ExceptionalCondition + 236 1 postgres 0x0000000101152260 pgaio_io_before_start + 516 2 postgres 0x0000000101151fb4 pgaio_io_start_readv + 36 3 postgres 0x0000000101171144 FileStartReadV + 252 4 postgres 0x00000001011c4424 mdstartreadv + 668 5 postgres 0x00000001011c7a0c smgrstartreadv + 116 6 postgres 0x000000010115a3b8 AsyncReadBuffers + 2028 7 postgres 0x0000000101159354 StartReadBuffersImpl + 1196 8 postgres 0x0000000101158e98 StartReadBuffers + 64 9 postgres 0x00000001011564a0 read_stream_start_pending_read + 1204 10 postgres 0x0000000101155b84 read_stream_look_ahead + 812 11 postgres 0x000000010115571c read_stream_next_buffer + 2356 12 postgres 0x0000000100b4a04c heap_fetch_next_buffer + 284 13 postgres 0x0000000100b3ade8 heapgettup_pagemode + 192 14 postgres 0x0000000100b3b3b8 heap_getnextslot + 84 15 postgres 0x0000000100ebd818 table_scan_getnextslot + 340 16 postgres 0x0000000100ebd624 SeqNext + 148 17 postgres 0x0000000100ebdcf8 ExecScanFetch + 772 18 postgres 0x0000000100ebd8cc ExecScanExtended + 164 19 postgres 0x0000000100ebd004 ExecSeqScanWithProject + 244 20 postgres 0x0000000100e970f4 ExecProcNode + 68 I also tried reproducing this with -DFDDEBUG, no luck so far, probably due to significant change of the test duration. Best regards, Alexander Lakhin
On Thu, 5 Jun 2025 at 21:00, Alexander Lakhin <exclusion@gmail.com> wrote: > > Hello Thomas and Andres, > > 04.06.2025 23:32, Thomas Munro wrote: > > On Thu, Jun 5, 2025 at 8:02 AM Andres Freund <andres@anarazel.de> wrote: > >> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote: > >>> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh: > >>> 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: > >>> 2, ioh->generation: 21694 > >> But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any > >> "normal" reason for that. We know that the IO wasn't actually started as > >> otherwise pgaio_io_start_readv() would have logged that fact. > > A cheap/easy thing to try, maybe: give that enumeration more > > distinctive values like 0xaaaa, 0xbbbb, or whatever, to see if we get > > a wild 1 here? I have a very wild guess that's probably wrong in a weird way, but here goes anyway: Did anyone test if interleaving the enum-typed bitfield fields of PgAioHandle with the uint8 fields might solve the issue? I.e. {state, flags, target, num_callbacks, op} instead of {state, target, op, flags, num_callbacks}? I've looked at some decompiled code generated for pgaio_io_before_start and pgaio_io_start_readv. pgaio_io_start_readv looks rather normal, but in pgaio_io_before_start it's clear that pgaio_io_before_start does some weird things as a response to the bitfields of PgAioHandle: (manually annotated) _pgaio_io_before_start: 00000001002eaba8 stp x20, x19, [sp, #-0x20]! /* save registers */ 00000001002eabac stp x29, x30, [sp, #0x10] /* save registers */ 00000001002eabb0 add x29, sp, #0x10 /* x29 := stack+10 */ 00000001002eabb4 ldrb w8, [x0] /* w8 := ioh->state */ 00000001002eabb8 cmp w8, #0x1 /* w8 ?? PGAIO_HS_HANDED_OUT */ 00000001002eabbc b.ne 0x1002eac38 w8 here will contain the value 0x1. 00000001002eabc0 mov x19, x0 /* x19 := ioh */ 00000001002eabc4 adrp x8, 1461 ; 0x10089f000 /* x8 := 0x10089f000 */ 00000001002eabc8 add x8, x8, #0xd40 /* x8 += 0xd40 (0x10089fd40) */ 00000001002eabcc ldr x8, [x8] /* x8 := pgaio_my_backend */ 00000001002eabd0 ldr x8, [x8, #0x20] /* x8 := pgaio_my_backend->handed_out_io */ 00000001002eabd4 cmp x8, x0 /* x8 (pgaio_my_backend->handed_out_io) ?? x19 (ioh) */ 00000001002eabd8 b.ne 0x1002eac50 00000001002eabdc mov x0, x19 /* x0 := ioh */ 00000001002eabe0 bl _pgaio_io_has_target 00000001002eabe4 tbz w0, #0x0, 0x1002eac68 /* _pgaio_io_has_target(ioh)? */ 00000001002eabe8 ldrb w8, [x19, #0x2] /* w8 := ioh->op */ w8 now contains ioh->op, which is presumed to be 0x0. 00000001002eabec ldrh w9, [x19] /* w9 := ioh->[target | state] */ 00000001002eabf0 orr w8, w9, w8, lsl #16 /* w8 := w9 | (w8 << 16) // [ op | [target | state] ]*/ But here more operations are applied, pushing in the preceding bitfields of PgAioHandle, mixing in various values into w8. 00000001002eabf4 cmp w8, #0x10, lsl #12 /* w8 ?? 0x1'00'00 // op'state'target ?? 0x1'00'00 */ And finally the full set of bitfields is compared against a target value, rather than just ioh->op. This means we do 2 memory accesses for this field. 00000001002eabf8 b.hs 0x1002eac80 /* >= ?? ERROR : continue */ And finally we branch into the setup for _ExceptionalCondition. This shows that b->op is not exactly a stand-alone field, and might be influenced by the other sibling bitfield fields. 
If we changed the order of the fields to not have continuous bitfields, that might be able to convince the compiler that there are no data dependencies between state/target/op, and remove the two-phase load operation. (checks more decompiled code) Oh, the bitfields are not just read with multiple accesses, but also *written* using multiple accesses. See pgaio_io_set_target: _pgaio_io_set_target: 00000001002eafdc stp x29, x30, [sp, #-0x10]! 00000001002eafe0 mov x29, sp 00000001002eafe4 ldrb w8, [x0, #0x2] 00000001002eafe8 ldrh w9, [x0] 00000001002eafec orr w8, w9, w8, lsl #16 00000001002eaff0 and w9, w8, #0xff 00000001002eaff4 cmp w9, #0x1 00000001002eaff8 b.ne 0x1002eb01c 00000001002eaffc tst w8, #0xff00 00000001002eb000 b.ne 0x1002eb034 00000001002eb004 lsr w9, w8, #16 00000001002eb008 bfi w8, w1, #8, #24 00000001002eb00c strb w9, [x0, #0x2] /* ioh->op = w9 */ 00000001002eb010 strh w8, [x0] /* ioh->[state, target] = w8 */ ... and pgaio_io_update_state: _pgaio_io_update_state: 00000001002e7f08 sub sp, sp, #0x70 00000001002e7f0c stp x24, x23, [sp, #0x30] 00000001002e7f10 stp x22, x21, [sp, #0x40] 00000001002e7f14 stp x20, x19, [sp, #0x50] 00000001002e7f18 stp x29, x30, [sp, #0x60] 00000001002e7f1c add x29, sp, #0x60 00000001002e7f20 mov x19, x1 00000001002e7f24 mov x20, x0 00000001002e7f28 mov w0, #0xa 00000001002e7f2c mov x1, #0x0 00000001002e7f30 bl _errstart 00000001002e7f34 cbz w0, 0x1002e8010 [...] /* skip pgaio_debug_io log handling */ 00000001002e8010 dmb ish 00000001002e8014 ldurh w8, [x20, #0x1] 00000001002e8018 bfi w19, w8, #8, #16 00000001002e801c lsr w8, w8, #8 00000001002e8020 strb w8, [x20, #0x2] /* store ioh->op */ 00000001002e8024 strh w19, [x20] /* store ioh->{state, target} */ I'm sending this now without looking much deeper into this due to $err_date_changed and other time-related constraints, but might this be a candidate for further debugging research? Kind regards, Matthias van de Meent Databricks Inc.
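For reference, a rough sketch of the reordering being suggested here, with plain unsigned bitfields standing in for the real enum-typed members (the actual definitions live in the AIO headers and are not reproduced):

```c
/* current layout (simplified): three adjacent 8-bit bitfields */
struct handle_current
{
	unsigned	state:8;		/* PgAioHandleState in the real struct */
	unsigned	target:8;		/* PgAioTargetID */
	unsigned	op:8;			/* PgAioOp */
	unsigned char flags;
	unsigned char num_callbacks;
};

/*
 * Suggested interleaving: every bitfield is separated by a plain member,
 * so each bitfield sits in its own memory location and the compiler has
 * no licence to touch a neighbouring field when storing one of them.
 */
struct handle_interleaved
{
	unsigned	state:8;
	unsigned char flags;
	unsigned	target:8;
	unsigned char num_callbacks;
	unsigned	op:8;
};
```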
Matthias van de Meent <boekewurm+postgres@gmail.com> writes: > I have a very wild guess that's probably wrong in a weird way, but > here goes anyway: > Did anyone test if interleaving the enum-typed bitfield fields of > PgAioHandle with the uint8 fields might solve the issue? Ugh. I think you probably nailed it. IMO all those struct fields better be declared uint8. regards, tom lane
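A rough sketch of that shape of change (the struct is heavily abridged, the accessor is made up purely for illustration, and the field list is taken from the names used in this thread rather than from aio_internal.h):

```c
#include <stdint.h>

/* enum values as shown earlier in the thread */
typedef enum PgAioOp
{
	PGAIO_OP_INVALID = 0,
	PGAIO_OP_READV,
	PGAIO_OP_WRITEV,
} PgAioOp;

/* abridged: only the byte-sized header fields are shown */
typedef struct PgAioHandleSketch
{
	uint8_t		state;			/* PgAioHandleState values */
	uint8_t		target;			/* PgAioTargetID values */
	uint8_t		op;				/* PgAioOp values */
	uint8_t		flags;
	uint8_t		num_callbacks;
} PgAioHandleSketch;

/* made-up accessor, just to show the cast back to the enum */
static inline const char *
op_name(const PgAioHandleSketch *ioh)
{
	switch ((PgAioOp) ioh->op)
	{
		case PGAIO_OP_INVALID:
			return "invalid";
		case PGAIO_OP_READV:
			return "readv";
		case PGAIO_OP_WRITEV:
			return "writev";
	}
	return "???";
}
```

The struct stays the same size, every store becomes a plain single-byte write, and the price is a cast back to the enum wherever the value is switched on.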
Hello, 05.06.2025 22:00, Alexander Lakhin wrote: > Thank you for your attention to this and for the tip! Today I tried the > following: > --- a/src/include/storage/aio.h > +++ b/src/include/storage/aio.h > @@ -89,8 +89,8 @@ typedef enum PgAioOp > /* intentionally the zero value, to help catch zeroed memory etc */ > PGAIO_OP_INVALID = 0, > > - PGAIO_OP_READV, > - PGAIO_OP_WRITEV, > + PGAIO_OP_READV = 0xaa, > + PGAIO_OP_WRITEV = 0xbb, And one more bit of information from me, if you're not tired of it yet: I've added: @@ -2238,6 +2240,7 @@ FileStartReadV(PgAioHandle *ioh, File file, vfdP = &VfdCache[file]; +if (ioh->op != PGAIO_OP_INVALID) elog(LOG, "!!!FileStartReadV| ioh: %p, ioh->op: %d, ioh->state: %d, ioh->result: %d, ioh->num_callbacks: %d, ioh->generation: %llu", ioh, ((PgAioHandle *)ioh)->op, ((PgAioHandle *)ioh)->state, ((PgAioHandle *)ioh)->result, ((PgAioHandle *)ioh)->num_callbacks, ((PgAioHandle *)ioh)->generation); pgaio_io_start_readv(ioh, vfdP->fd, iovcnt, offset); and got: ... !!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29200 !!!AsyncReadBuffers [91056] (1)| blocknum: 70, ioh: 0x1066d15d0, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 29200 !!!pgaio_io_start_readv [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->generation: 29200 !!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 29200 !!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29201 !!!AsyncReadBuffers [91056] (1)| blocknum: 71, ioh: 0x1066d15d0, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 29201 !!!pgaio_io_start_readv [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->generation: 29201 !!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 29201 !!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29202 !!!AsyncReadBuffers [91056] (1)| blocknum: 72, ioh: 0x1066d15d0, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 29202 !!!pgaio_io_start_readv [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->generation: 29202 !!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 29202 !!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29203 !!!AsyncReadBuffers [91056] (1)| blocknum: 73, ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 29203 2025-06-05 19:52:44.174 EDT [91056:14] pg_regress/brin LOG: !!!FileStartReadV| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 29203 2025-06-05 19:52:44.174 EDT [91056:15] pg_regress/brin CONTEXT: SQL statement "SELECT array_agg(ctid) FROM brintest WHERE cidrcol = '10.2.14/24'::inet " PL/pgSQL function inline_code_block line 52 at EXECUTE 2025-06-05 19:52:44.174 EDT [91056:16] pg_regress/brin STATEMENT: DO $x$ ... 
2025-06-05 19:52:44.174 EDT [91056:17] pg_regress/brin LOG: !!!pgaio_io_before_start| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 29203 2025-06-05 19:52:44.174 EDT [91056:18] pg_regress/brin CONTEXT: SQL statement "SELECT array_agg(ctid) FROM brintest WHERE cidrcol = '10.2.14/24'::inet " PL/pgSQL function inline_code_block line 52 at EXECUTE 2025-06-05 19:52:44.174 EDT [91056:19] pg_regress/brin STATEMENT: DO $x$ DECLARE r record; r2 record; cond text; idx_ctids tid[]; ss_ctids tid[]; count int; plan_ok bool; plan_line text; BEGIN FOR r IN SELECT colname, oper, typ, value[ordinality], matches[ordinality] FROM brinopers, unnest(op) WITH ORDINALITY AS oper LOOP ... TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 91056 0 postgres 0x0000000104ff771c ExceptionalCondition + 236 1 postgres 0x0000000104cfe32c pgaio_io_before_start + 516 2 postgres 0x0000000104cfe080 pgaio_io_start_readv + 36 3 postgres 0x0000000104d1d310 FileStartReadV + 508 4 postgres 0x0000000104d705f0 mdstartreadv + 668 5 postgres 0x0000000104d73bd8 smgrstartreadv + 116 6 postgres 0x0000000104d06484 AsyncReadBuffers + 2028 7 postgres 0x0000000104d05420 StartReadBuffersImpl + 1196 8 postgres 0x0000000104d04f64 StartReadBuffers + 64 9 postgres 0x0000000104d0256c read_stream_start_pending_read + 1204 10 postgres 0x0000000104d01c50 read_stream_look_ahead + 812 11 postgres 0x0000000104d017e8 read_stream_next_buffer + 2356 12 postgres 0x00000001046f661c heap_fetch_next_buffer + 284 13 postgres 0x00000001046e73b8 heapgettup_pagemode + 192 14 postgres 0x00000001046e7988 heap_getnextslot + 84 Thus, this time ioh->op switched to PGAIO_OP_READV earlier, even before the logging in AsyncReadBuffers(). By the way, this "DO .. FOR ... FROM brinopers, unnest(op) ... " triggers the assertion not for the first time for me (I've counted 3 such failures in my notes). Best regards, Alexander Lakhin
On 06/06/2025 2:31 am, Tom Lane wrote:
> Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
>> I have a very wild guess that's probably wrong in a weird way, but
>> here goes anyway:
>> Did anyone test if interleaving the enum-typed bitfield fields of
>> PgAioHandle with the uint8 fields might solve the issue?
> Ugh.  I think you probably nailed it.
>
> IMO all those struct fields better be declared uint8.
>
> regards, tom lane

I also think that it could be a compiler issue.  Bitfields with different enum types look really exotic, so it is no wonder the optimizer can do something strange here.  I failed to reproduce the problem with an old version of clang (15.0).  Also, as far as I understand, nobody was able to reproduce the problem with disabled optimizations (-O0).  That definitely doesn't prove there is a bug in the optimizer - the timing may just be different.

Still, it is not quite clear to me how `PGAIO_OP_READV` manages to get written.  There is just one place in the code where it is assigned:

```
pgaio_io_start_readv(PgAioHandle *ioh, int fd, int iovcnt, uint64 offset)
{
...
    pgaio_io_stage(ioh, PGAIO_OP_READV);
}
```

and `pgaio_io_stage` should update both `state` and `op`:

```
    ioh->op = op;
    ioh->result = 0;

    pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
```

But as we see from the trace, state is still PGAIO_HS_HANDED_OUT, so it was not updated.  If there is some bug in the optimizer which incorrectly constructs the mask for the bitfield assignment, it is still not clear where it managed to get this PGAIO_OP_READV.  And we can be sure that it is really PGAIO_OP_READV and not just arbitrary garbage, because Alexander replaced its value with 0xaa and we see in the logs that this is what is really stored.

If there were a race condition in `pgaio_io_update_state` (which enforces a memory barrier before updating state), then, for example, inserting some sleep between the assignment of `op` and of `state` should increase the probability of the error.  But that doesn't happen.  Also, as far as I understand, `op` is updated and read by the same backend, so it should not be a synchronization issue.  So most likely it is a bug in the optimizer which generates incorrect code.

Can Alexander or somebody else who was able to reproduce the problem share the assembler code of the `pgaio_io_reclaim` function?  I am not sure that the bug is in this function - but it is the prime suspect.  Only `pgaio_io_start_readv` can set PGAIO_OP_READV, but we are almost sure that it was not called.  So it looks like `op` was not cleared, despite what we see in the logs.  But if `pgaio_io_reclaim` were compiled incorrectly, it should always work incorrectly - failing to clear `op` - yet in most cases it works...
There is really essential difference in code generated by clang 15 (working) and 16 (not working).
```
pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)
{
...
HOLD_INTERRUPTS();
ioh->op = op;
ioh->result = 0;
pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
...
}
```
clang15:
```
postgres[0x10035d4ac] <+84>: add x22, x22, #0x5d4 ; InterruptHoldoffCount
postgres[0x10035d4b0] <+88>: ldr w8, [x22]
postgres[0x10035d4b4] <+92>: add w8, w8, #0x1
postgres[0x10035d4b8] <+96>: str w8, [x22]
postgres[0x10035d4bc] <+100>: strb w20, [x19, #0x2]
postgres[0x10035d4c0] <+104>: str wzr, [x19, #0x14]
postgres[0x10035d4c4] <+108>: mov x0, x19
postgres[0x10035d4c8] <+112>: mov w1, #0x2
postgres[0x10035d4cc] <+116>: bl 0x10035cbdc ; pgaio_io_update_state at aio.c:384
```
clang16:
```
postgres[0x100699e8c] <+200>: add x9, x9, #0xd74 ; InterruptHoldoffCount
postgres[0x100699e90] <+204>: ldr w8, [x9]
postgres[0x100699e94] <+208>: mov w10, #0x1 ; =1
postgres[0x100699e98] <+212>: stur w10, [x29, #-0x18]
postgres[0x100699e9c] <+216>: add w8, w8, #0x1
postgres[0x100699ea0] <+220>: str w8, [x9]
postgres[0x100699ea4] <+224>: ldur w10, [x29, #-0xc]
postgres[0x100699ea8] <+228>: ldur x9, [x29, #-0x8]
postgres[0x100699eac] <+232>: ldrh w8, [x9]
postgres[0x100699eb0] <+236>: and w10, w10, #0xff
postgres[0x100699eb4] <+240>: orr w10, w8, w10, lsl #16
postgres[0x100699eb8] <+244>: lsr w8, w10, #16
postgres[0x100699ebc] <+248>: strh w10, [x9]
postgres[0x100699ec0] <+252>: strb w8, [x9, #0x2]
postgres[0x100699ec4] <+256>: ldur x8, [x29, #-0x8]
postgres[0x100699ec8] <+260>: str wzr, [x8, #0x14]
postgres[0x100699ecc] <+264>: ldur x0, [x29, #-0x8]
postgres[0x100699ed0] <+268>: mov w1, #0x2 ; =2
postgres[0x100699ed4] <+272>: bl 0x100698efc ; pgaio_io_update_state at aio.c:384
```
As you can see pg16 just updates third byte of the structure (`op` field). It is not affecting 'state' field at all.
While clang16 is also updating the `state` field! And it looks like that store is moved before the memory barrier (in the `pgaio_io_update_state` function),
Hi, On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote:
> There is really essential difference in code generated by clang 15 (working)
> and 16 (not working).

There also are code gen differences between upstream clang 17 and apple's clang, which is based on llvm 17 as well (I've updated the toolchain, it repros with that as well). The code generated for the bitfield access is so atrocious that it's worth changing to plain uint8 for that alone. It's annoying to have to add casts to all the switches, but clearly there isn't a better way for now.

What specific compilers / versions were you comparing? I don't quite see similar assembly...

I'm annotating the code below mostly for my understanding, I'm not fluent in armv8.

> ```
> pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)
> {
> ...
> HOLD_INTERRUPTS();
>
> ioh->op = op;
> ioh->result = 0;
>
> pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
> ...
> }
> ```
>
> clang15:
>
> ```
> postgres[0x10035d4ac] <+84>: add x22, x22, #0x5d4 ; InterruptHoldoffCount
> postgres[0x10035d4b0] <+88>: ldr w8, [x22]
> postgres[0x10035d4b4] <+92>: add w8, w8, #0x1
> postgres[0x10035d4b8] <+96>: str w8, [x22]

That's InterruptHoldoffCount++

> *postgres[0x10035d4bc] <+100>: strb w20, [x19, #0x2]

That's ioh->op = op

> *postgres[0x10035d4c0] <+104>: str wzr, [x19, #0x14]

ioh->result = 0;

> postgres[0x10035d4c4] <+108>: mov x0, x19

Saving the caller-saved x0 register.

> postgres[0x10035d4c8] <+112>: mov w1, #0x2
> postgres[0x10035d4cc] <+116>: bl 0x10035cbdc ; pgaio_io_update_state at aio.c:384

Calling pgaio_io_update_state() with new_state = PGAIO_HS_DEFINED

> ```
>
> clang16:
>
> ```
> postgres[0x100699e8c] <+200>: add x9, x9, #0xd74 ; InterruptHoldoffCount
> postgres[0x100699e90] <+204>: ldr w8, [x9]

That's the start of InterruptHoldoffCount++

> postgres[0x100699e94] <+208>: mov w10, #0x1 ; =1
> postgres[0x100699e98] <+212>: stur w10, [x29, #-0x18]

Isn't this setting w10 to 1 just to immediately overwrite it with another value? I assume this is restoring things from the stack?

> postgres[0x100699e9c] <+216>: add w8, w8, #0x1
> postgres[0x100699ea0] <+220>: str w8, [x9]

That's the end of InterruptHoldoffCount++

> postgres[0x100699ea4] <+224>: ldur w10, [x29, #-0xc]

One more overwriting of w10, for good measure?

> postgres[0x100699ea8] <+228>: ldur x9, [x29, #-0x8]

I assume this is restoring the pointer to ioh from the stack?

> postgres[0x100699eac] <+232>: ldrh w8, [x9]

Loading state + target as one 16-bit value.

> postgres[0x100699eb0] <+236>: and w10, w10, #0xff

Clearing the upper bits of w10 - not sure what w10 actually is, probably the op argument?

> postgres[0x100699eb4] <+240>: orr w10, w8, w10, lsl #16

Or-ing {ioh->state, ioh->target} and (I think) the 'op' argument shifted left by 16 bits, presumably leaving us with {ioh->state, ioh->target, op}.

> postgres[0x100699eb8] <+244>: lsr w8, w10, #16

So this is right shifting what I think is {ioh->state, ioh->target, op} by 16, which presumably leaves you just with op. WTF.

> *postgres[0x100699ebc] <+248>: strh w10, [x9]

Now we're writing 16 bits of {ioh->state, ioh->target, ioh->op} to ioh, thereby setting ioh->state and ioh->target, but not ioh->op.

> postgres[0x100699ec0] <+252>: strb w8, [x9, #0x2]

Setting ioh->op.

> *postgres[0x100699ec4] <+256>: ldur x8, [x29, #-0x8]

This was loaded into x9 above, why is this redone, x9 is not overwritten? Oh, I guess you removed the Assert related instructions, and that's why it's reloading it...

> postgres[0x100699ec8] <+260>: str wzr, [x8, #0x14]

ioh->result = 0.

> postgres[0x100699ed0] <+268>: mov w1, #0x2 ; =2

This puts 2, presumably PGAIO_HS_DEFINED, into w1 - presumably as an argument for the call to pgaio_io_update_state().

> postgres[0x100699ed4] <+272>: bl 0x100698efc ; pgaio_io_update_state at aio.c:384
> ```

> As you can see pg16 just updates third byte of the structure (`op` field).
> It is not affecting 'state' field at all.

It's really weird code. I think the state store would be in the call to pgaio_io_update_state().

> While clang16 is also updating `state` field!

I assume this was intending to refer to clang15? FWIW, I don't think it does update the state field in the assembly you shared.

This is very odd code, but I don't immediately see a bug in it :(

Greetings, Andres Freund
Konstantin Knizhnik <knizhnik@garret.ru> writes: > There is really essential difference in code generated by clang 15 > (working) and 16 (not working). It's a mistake to think that this is a compiler bug. The C standard explicitly allows compilers to use word-wide operations to access bit-field struct members. Such accesses may fetch or overwrite other bit-fields in the same word, using a non-atomic read/modify/write sequence. I don't have a recent C standard at hand, but I found this quote on the web [1]: The C Standard, 3.17, paragraph 3 [ISO/IEC 9899:2024], states Note 2 to entry: A bit-field and an adjacent non-bit-field member are in separate memory locations. The same applies to two bit-fields, if one is declared inside a nested structure declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field member declaration. It is not safe to concurrently update two non-atomic bit-fields in the same structure if all members declared between them are also (nonzero-length) bit-fields, no matter what the sizes of those intervening bit-fields happen to be. So it's our code that is busted. No doubt, what is happening is that process A is fetching two fields, modifying one of them, and storing the word back (with the observed value of the other field) concurrently with some other process trying to update the other field. So the other process's update is lost. If we want to preserve the compactness of this struct, we have to use primitive int types not bit-fields. regards, tom lane [1] https://wiki.sei.cmu.edu/confluence/display/c/CON32-C.+Prevent+data+races+when+accessing+bit-fields+from+multiple+threads
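To make the hazard described above concrete, here is a small stand-alone demonstration (hypothetical struct and function names, nothing from the PostgreSQL tree): two threads update adjacent bit-fields, which the quoted standard note treats as a single memory location, so a word-wide read-modify-write of one field can overwrite the neighbouring field with a stale value. Whether lost updates are actually observed depends on the compiler and CPU; with byte-aligned 8-bit fields many compilers emit plain byte stores and nothing is ever lost.

```
/*
 * Deliberate data race on adjacent bit-fields (undefined behavior); this is
 * exactly the pattern the standard note above warns about.  Hypothetical
 * names, not PostgreSQL code.  Build with -pthread.
 */
#include <pthread.h>
#include <stdio.h>

struct two_fields
{
	unsigned int a:8;			/* written only by main()   */
	unsigned int b:8;			/* written only by hammer() */
};

/* volatile so the compiler performs the loads/stores we want to observe */
static volatile struct two_fields shared;

static void *
hammer(void *arg)
{
	/*
	 * Each store to 'b' may be compiled as: load the containing word, merge
	 * the new 'b', store the whole word back - possibly writing back a
	 * stale value of 'a'.
	 */
	for (long i = 0; i < 50 * 1000 * 1000; i++)
		shared.b = (unsigned int) i;
	return NULL;
}

int
main(void)
{
	pthread_t	t;
	long		lost = 0;

	pthread_create(&t, NULL, hammer, NULL);

	for (long i = 0; i < 50 * 1000 * 1000; i++)
	{
		shared.a = 1;
		if (shared.a != 1)		/* was 'a' clobbered by the RMW of 'b'? */
			lost++;
	}

	pthread_join(t, NULL);
	printf("lost updates of 'a': %ld\n", lost);
	return 0;
}
```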
Hi, On 2025-06-06 15:21:13 -0400, Tom Lane wrote: > Konstantin Knizhnik <knizhnik@garret.ru> writes: > > There is really essential difference in code generated by clang 15 > > (working) and 16 (not working). > > It's a mistake to think that this is a compiler bug. The C standard > explicitly allows compilers to use word-wide operations to access > bit-field struct members. Such accesses may fetch or overwrite other > bit-fields in the same word, using a non-atomic read/modify/write > sequence. I don't have a recent C standard at hand, but I found this > quote on the web [1]: > > The C Standard, 3.17, paragraph 3 [ISO/IEC 9899:2024], states > > Note 2 to entry: A bit-field and an adjacent non-bit-field member > are in separate memory locations. The same applies to two > bit-fields, if one is declared inside a nested structure > declaration and the other is not, or if the two are separated by a > zero-length bit-field declaration, or if they are separated by a > non-bit-field member declaration. It is not safe to concurrently > update two non-atomic bit-fields in the same structure if all > members declared between them are also (nonzero-length) > bit-fields, no matter what the sizes of those intervening > bit-fields happen to be. > > So it's our code that is busted. No doubt, what is happening is > that process A is fetching two fields, modifying one of them, > and storing the word back (with the observed value of the other > field) concurrently with some other process trying to update > the other field. So the other process's update is lost. There shouldn't be any concurrent accesses here, so I don't really see how the above would explain the problem (the IO can only ever be modified by one backend, initially the "owning backend", then, when submitted, by the IO worker, and then again by the backend). Alexander's latest post has a good example (due to the added logging):
At the start of reclaim (looks sane):
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 29202
At the end of reclaim (looks sane), note ioh->op = 0:
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29203
Just a bit later (broken):
!!!AsyncReadBuffers [91056] (1)| blocknum: 73, ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 29203
Note that ioh->op went back to 170, despite the code that sets ioh->op not yet having been called. However the increment of ioh->generation survived. Between the second "!!!pgaio_io_reclaim" and "!!!AsyncReadBuffers" there definitely aren't any other processes accessing things. > If we want to preserve the compactness of this struct, we have > to use primitive int types not bit-fields. I think we have to do that, if for no other reason than how bad the generated code is. But I would still like to understand what actually causes the problem here - I'm far from sure it's actually the bitfield. Greetings, Andres Freund
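For reference, trace lines like the ones quoted above can be produced with a plain fprintf of the interesting handle fields at the call sites under investigation. The exact logging used in the thread is not shown anywhere, so the field list and format below are an assumption based on the quoted output; the sketch is meant to sit inside aio.c, where the needed PostgreSQL headers are already included.

```
/* hypothetical debugging helper, modelled on the "!!!" lines above */
static void
dbg_dump_ioh(const char *where, PgAioHandle *ioh)
{
	fprintf(stderr,
			"!!!%s [%d]| ioh: %p, ioh->op: %d, ioh->state: %d, ioh->result: %d, "
			"ioh->num_callbacks: %d, ioh->generation: " UINT64_FORMAT "\n",
			where, MyProcPid, (void *) ioh,
			(int) ioh->op, (int) ioh->state, (int) ioh->result,
			(int) ioh->num_callbacks, ioh->generation);
}

/* e.g.: dbg_dump_ioh("pgaio_io_reclaim", ioh); */
```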
06.06.2025 22:37, Andres Freund wrote:
On 2025-06-06 15:21:13 -0400, Tom Lane wrote:It's a mistake to think that this is a compiler bug. The C standard explicitly allows compilers to use word-wide operations to access bit-field struct members. Such accesses may fetch or overwrite other bit-fields in the same word, using a non-atomic read/modify/write sequence. I don't have a recent C standard at hand, but I found this quote on the web [1]: The C Standard, 3.17, paragraph 3 [ISO/IEC 9899:2024], states Note 2 to entry: A bit-field and an adjacent non-bit-field member are in separate memory locations. The same applies to two bit-fields, if one is declared inside a nested structure declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field member declaration ... So it's our code that is busted. No doubt, what is happening is that process A is fetching two fields, modifying one of them, and storing the word back (with the observed value of the other field) concurrently with some other process trying to update the other field. So the other process's update is lost.There shouldn't be any concurrent accesses here, so I don't really see how the above would explain the problem (the IO can only ever be modified by one backend, initially the "owning backend", then, when submitted, by the IO worker, and then again by the backend).
FWIW, I tried the idea Matthias proposed upthread:
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -98,16 +98,17 @@ struct PgAioHandle
/* all state updates should go through pgaio_io_update_state() */
PgAioHandleState state:8;
+ /* bitfield of PgAioHandleFlags */
+ uint8 flags;
+
/* what are we operating on */
PgAioTargetID target:8;
+ uint8 num_callbacks;
+
/* which IO operation */
PgAioOp op:8;
- /* bitfield of PgAioHandleFlags */
- uint8 flags;
-
- uint8 num_callbacks;
/* using the proper type here would use more space */
uint8 callbacks[PGAIO_HANDLE_MAX_CALLBACKS];
and got 120 iterations passing cleanly. Without this change, I got failures
on iterations 41, 5, 20. Not a proof, just an observation...
Maybe the op field gets broken because of a (concurrent?) write not to
that field, but to state or target, in some other place?
I also tried to reproduce the same on an ARM server again with clang 18,
using -O2 and -O3, and neither Konstantin nor I saw such sophisticated
assembly for pgaio_io_stage() — the code is much simpler and the failure
is not reproduced. The triplet there is:
PostgreSQL 18beta1 on aarch64-unknown-linux-gnu, compiled by Ubuntu clang version 18.1.3 (1ubuntu1), 64-bit
On my MacBook, the triplet is:
PostgreSQL 18beta1 on aarch64-apple-darwin23.5.0, compiled by Apple clang version 16.0.0 (clang-1600.0.26.6), 64-bit
indri:
PostgreSQL 18beta1 on aarch64-apple-darwin24.5.0, compiled by Apple clang version 17.0.0 (clang-1700.0.13.3), 64-bit
sifaka:
PostgreSQL 18devel on aarch64-apple-darwin24.4.0, compiled by Apple clang version 17.0.0 (clang-1700.0.13.3), 64-bit
So it looks like "aarch64-apple" enables some extra optimizations with
bitfields...
By the way, there are also some other places with adjacent bitfields, e.g.:
typedef union PgAioTargetData
{
struct
{
RelFileLocator rlocator; /* physical relation identifier */
BlockNumber blockNum; /* blknum relative to begin of reln */
BlockNumber nblocks;
ForkNumber forkNum:8; /* don't waste 4 byte for four values */
bool is_temp:1; /* proc can be inferred by owning AIO */
bool skip_fsync:1;
} smgr;
and
typedef struct PgStat_KindInfo
{
/*
* Do a fixed number of stats objects exist for this kind of stats (e.g.
* bgwriter stats) or not (e.g. tables).
*/
bool fixed_amount:1;
/*
* Can stats of this kind be accessed from another database? Determines
* whether a stats object gets included in stats snapshots.
*/
bool accessed_across_databases:1;
/* Should stats be written to the on-disk stats file? */
bool write_to_file:1;
...
Best regards,
Alexander Lakhin
On Fri, Jun 06, 2025 at 03:37:45PM -0400, Andres Freund wrote: > On 2025-06-06 15:21:13 -0400, Tom Lane wrote: > > So it's our code that is busted. No doubt, what is happening is > > that process A is fetching two fields, modifying one of them, > > and storing the word back (with the observed value of the other > > field) concurrently with some other process trying to update > > the other field. So the other process's update is lost. > > There shouldn't be any concurrent accesses here, so I don't really see > how the above would explain the problem (the IO can only ever be > modified by one backend, initially the "owning backend", then, when > submitted, by the IO worker, and then again by the backend). No concurrent accesses, not from threads anyways, but not even from kernel subsystems nor from signal handlers?
Hi, On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: There is really essential difference in code generated by clang 15 (working) and 16 (not working). There also are code gen differences between upstream clang 17 and apple's clang, which is based on llvm 17 as well (I've updated the toolchain, it repros with that as well). The code generated for the bitfield access is so atrocious that it's worth changing to plain uint8 for that alone. It's annoying to have to add casts to all the switches, but clearly there isn't a better way for now.
+1
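As an aside, the casts mentioned above are presumably about keeping the compiler's switch-exhaustiveness checking once the field becomes a plain integer; a stand-alone illustration of that trade-off (hypothetical names, not the real PostgreSQL definitions):

```
#include <stdint.h>
#include <stdio.h>

typedef enum HypotheticalOp		/* stand-in for an enum like PgAioOp */
{
	OP_INVALID = 0,
	OP_READV,
	OP_WRITEV
} HypotheticalOp;

typedef struct HypotheticalHandle	/* stand-in for a handle struct */
{
	uint8_t		op;				/* holds a HypotheticalOp value */
} HypotheticalHandle;

static const char *
op_name(const HypotheticalHandle *h)
{
	/*
	 * Casting back to the enum keeps -Wswitch useful: the compiler can
	 * still warn when a new enum member is added but not handled here.
	 * Switching on the raw uint8_t would silently lose that check.
	 */
	switch ((HypotheticalOp) h->op)
	{
		case OP_INVALID:
			return "invalid";
		case OP_READV:
			return "readv";
		case OP_WRITEV:
			return "writev";
	}
	return "?";
}

int
main(void)
{
	HypotheticalHandle h = {.op = OP_READV};

	puts(op_name(&h));
	return 0;
}
```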
What specific compilers / versions were you comparing? I don't quite see similar assembly...
My MacBook (bug is not reproduced):
Apple clang version 15.0.0 (clang-1500.0.40.1)
Target: arm64-apple-darwin22.6.0
Alexander's MacBook (bug is reproduced)
Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin23.5.0
As you can see pg16 just updates third byte of the structure (`op` field). It is not affecting 'state' field at all. It's really weird code. I think the state store would be in the call to pgaio_io_update_state().
Yes, and that seems to be the problem.
While clang16 is also updating `state` field! I assume this was intending to refer to clang15? FWIW, I don't think it does update the state field in the assembly you shared.
clang 15 (old) doesn't update `state` field. And this code works normally.
clang 16 - does.
This is very odd code, but I don't immediately see a bug in it :(
I am also not an expert in ARM assembler. My analysis of the code is mostly the same as yours.
Looks like the code generated by clang 16 does update all three bitfields in `pgaio_io_stage`.
As far as I understand, the assignment is correct: it uses the old (extracted) value of state and the new value of op (passed to the function as a parameter).
But I am not 100% sure here.
It is not quite clear to me why it happens, because `pgaio_io_update_state` is not actually inlined.
And there is at least one problem: the assignment of state is moved across the memory barrier (in `pgaio_io_update_state`).
On 06/06/2025 10:21 pm, Tom Lane wrote:
> Konstantin Knizhnik <knizhnik@garret.ru> writes:
>> There is really essential difference in code generated by clang 15
>> (working) and 16 (not working).
> It's a mistake to think that this is a compiler bug. The C standard
> explicitly allows compilers to use word-wide operations to access
> bit-field struct members.

Sorry, I did not say that I have found a compiler bug. The generated code (clang 16) overwrites all three bitfields, but it looks like it is doing so in a correct way (old values of state and target are preserved). My first thought was that the compiler moves the assignment of state forward out of `pgaio_io_update_state`. But that is not true: it just reads and writes back the old value of the `state` field. It is also possible to suspect that `pgaio_io_update_state` somehow "resurrects" the old value of the `op` field. But that is also not true:

```
postgres`pgaio_io_update_state:
...
postgres[0x100699070] <+372>: dmb ish                  ; memory write barrier
postgres[0x100699074] <+376>: ldur w10, [x29, #-0xc]   ; w10 = state
postgres[0x100699078] <+380>: ldur x9, [x29, #-0x8]    ; x9 = ioh
postgres[0x10069907c] <+384>: ldrh w8, [x9]            ; w8 = ioh->state, ioh->target
postgres[0x100699080] <+388>: ldrb w11, [x9, #0x2]     ; w11 = ioh->op
postgres[0x100699084] <+392>: orr w8, w8, w11, lsl #16 ; w8 = ioh->state, ioh->target, ioh->op
postgres[0x100699088] <+396>: and w10, w10, #0xff      ; w10 &= 0xff
postgres[0x10069908c] <+400>: and w8, w8, #0xffffff00  ; w8 &= ~0xff
postgres[0x100699090] <+404>: orr w10, w8, w10         ; w10 = state, ioh->target, ioh->op
postgres[0x100699094] <+408>: lsr w8, w10, #16         ; w8 = ioh->op
postgres[0x100699098] <+412>: strh w10, [x9]
postgres[0x10069909c] <+416>: strb w8, [x9, #0x2]
postgres[0x1006990a0] <+420>: ldp x29, x30, [sp, #0x60]
postgres[0x1006990a4] <+424>: add sp, sp, #0x70
postgres[0x1006990a8] <+428>: ret
```
Hi, On 2025-06-06 15:37:45 -0400, Andres Freund wrote: > There shouldn't be any concurrent accesses here, so I don't really see how the > above would explain the problem (the IO can only ever be modified by one > backend, initially the "owning backend", then, when submitted, by the IO > worker, and then again by the backend). The symptoms I can reproduce are slightly different than Alexander's - it's the assertion failure reported upthread by Tom. FWIW, I can continue to repro the assertion after removing the use of the bitfield in PgAioHandle. So the problem indeed seems to be independent of the bitfields. I'm continuing to investigate. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > The symptoms I can reproduce are slightly different than Alexander's - it's > the assertion failure reported upthread by Tom. > > FWIW, I can continue to repro the assertion after removing the use of the > bitfield in PgAioHandle. So the problem indeed seems to be be independent of > the bitfields. I noticed that the assertion you and I are seeing is also about a bitfield: TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322 typedef struct PgAioResult { ... /* of type PgAioResultStatus, see above */ uint32 status:PGAIO_RESULT_STATUS_BITS; ... } PgAioResult; That doesn't make a lot of sense to me, because as far as I've seen the bitfields in PgAioResult are not concurrently accessed. But it feels like a relevant observation. regards, tom lane
On Sat, Jun 7, 2025 at 6:47 AM Andres Freund <andres@anarazel.de> wrote: > On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: > > There is really essential difference in code generated by clang 15 (working) > > and 16 (not working). > > There also are code gen differences between upstream clang 17 and apple's > clang, which is based on llvm 17 as well (I've updated the toolchain, it > repros with that as well). Just for the record, Apple clang 17 (self-reported clobbered version) is said to be based on LLVM 19[1]. For a long time it was off by one but tada now it's apparently two. Might be relevant if people are comparing generated code up that close.... . o O (I wonder if one could corroborate that by running "strings" on upstream clang binaries (as compiled by MacPorts/whatever) for each major version and finding new strings, ie strings that don't appear in earlier major versions, and then seeing which ones are present in Apple's clang binaries... What a silly problem.) [1] https://en.wikipedia.org/wiki/Xcode#Xcode_15.0_-_16.x_(since_visionOS_support)
On 09/06/2025 2:05 am, Thomas Munro wrote: > On Sat, Jun 7, 2025 at 6:47 AM Andres Freund <andres@anarazel.de> wrote: >> On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: >>> There is really essential difference in code generated by clang 15 (working) >>> and 16 (not working). >> There also are code gen differences between upstream clang 17 and apple's >> clang, which is based on llvm 17 as well (I've updated the toolchain, it >> repros with that as well). > Just for the record, Apple clang 17 (self-reported clobbered version) > is said to be based on LLVM 19[1]. For a long time it was off by one > but tada now it's apparently two. Might be relevant if people are > comparing generated code up that close.... > > . o O (I wonder if one could corroborate that by running "strings" on > upstream clang binaries (as compiled by MacPorts/whatever) for each > major version and finding new strings, ie strings that don't appear in > earlier major versions, and then seeing which ones are present in > Apple's clang binaries... What a silly problem.) > > [1] https://en.wikipedia.org/wiki/Xcode#Xcode_15.0_-_16.x_(since_visionOS_support) Some updates: I was able to reproduce the problem on my Mac with the old clang (15.0), but only with optimization disabled (CFLAGS=-O0). So it is very unlikely to be a bug in the compiler. Why is it better reproduced in a debug build? Maybe because of timing. Or maybe because without optimization the compiler is doing stupid things: it loads all three bitfields from memory into a register (one half word + one byte), then does some manipulations with this register and writes it back to memory. Can the register somehow be clobbered between the read and the write (for example by a signal handler)? Very unlikely... So I still do not have any good hypothesis. But with the bitfields replaced with uint8 the bug is not reproduced any more. Maybe we should just do this change (which seems to be a good thing in any case)?
Hi, On 2025-06-10 17:28:11 +0300, Konstantin Knizhnik wrote: > On 09/06/2025 2:05 am, Thomas Munro wrote: > > On Sat, Jun 7, 2025 at 6:47 AM Andres Freund <andres@anarazel.de> wrote: > > > On 2025-06-06 14:03:12 +0300, Konstantin Knizhnik wrote: > > > > There is really essential difference in code generated by clang 15 (working) > > > > and 16 (not working). > > > There also are code gen differences between upstream clang 17 and apple's > > > clang, which is based on llvm 17 as well (I've updated the toolchain, it > > > repros with that as well). > > Just for the record, Apple clang 17 (self-reported clobbered version) > > is said to be based on LLVM 19[1]. For a long time it was off by one > > but tada now it's apparently two. Might be relevant if people are > > comparing generated code up that close.... You've got to be kidding me. Because the world otherwise would be too easy, I guess. > > . o O (I wonder if one could corroborate that by running "strings" on > > upstream clang binaries (as compiled by MacPorts/whatever) for each > > major version and finding new strings, ie strings that don't appear in > > earlier major versions, and then seeing which ones are present in > > Apple's clang binaries... What a silly problem.) > > > > [1] https://en.wikipedia.org/wiki/Xcode#Xcode_15.0_-_16.x_(since_visionOS_support) > > > Some updates: I was able to reproduce the problem at my Mac with old clang > (15.0) but only with disabled optimization (CFLAGS=-O0). > So very unlikely it is bug in compiler. I was able to reproduce it with gcc, too. > Why it is better reproduced in debug build? May be because of timing. Code-gen wise the biggest change I see is that there is more stack spilling due to assertion related code... > Or may be because without optimization compiler is doing stupid things: > loads all three bitfields from memory to register (one half word+one byte), > then does some manipulations with this register and writes it back to > memory. Can register somehow be clobbered between read and write (for > example by signal handler)? Very unlikely... > So still do not have any good hypothesis. > > But with bitfields replaced with uint8 the bug is not reproduced any more. > May be just do this change (which seems to be good thing in any case)? I've reproduced it without that bitfield, unfortunately :(. Unfortunately my current set of debugging output seems to have prevented the issue from re-occurring. Need to pare it down to re-trigger. But for me it only reproduces relatively rarely, so paring down the debug output is a rather slow process :( This is really a peculiar issue. I've now ran 10s of thousands of non-macos iterations, without triggering this or a related issue even once. The one good news is that currently the regression tests are remarkably stable, I think in the past I hardly could have run that many iterations without (independent) failures. Greetings, Andres Freund
Reproduced it once again with the write-protected io handle. But once again - no access violation, just an assert failure.
Previously the "op" field was overwritten somewhere between `pgaio_io_reclaim` and `AsyncReadBuffers`:
!!!pgaio_io_reclaim [20376]| ioh: 0x1019bc000, ioh->op: 0, ioh->generation: 19346
!!!AsyncReadBuffers [20376] (1)| blocknum: 21, ioh: 0x1019bc000, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 19346
Now it is overwritten after the print in `AsyncReadBuffers`:
!!!pgaio_io_reclaim [88932]| ioh: 0x105a5c000, ioh->op: 0, ioh->generation: 42848
!!!pgaio_io_acquire_nb[88932]| ioh: 0x105a5c000, ioh->op: 0, ioh->generation: 42848
!!!AsyncReadBuffers [88932] (1)| blocknum: 10, ioh: 0x105a5c000, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 42848
!!!pgaio_io_before_start| ioh: 0x105a5c000, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 42848
In this run I prohibited writes to the io handle in `pgaio_io_acquire_nb` and re-enabled them in `AsyncReadBuffers`.
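How the write protection was implemented isn't shown in the thread; one common way to do this kind of thing is mprotect() on the page(s) containing the handle, so that any unexpected store faults immediately. A rough sketch under that assumption (hypothetical helper; it only works if the handles sit in memory whose protection may be changed, such as mmap'ed shared memory, and everything else on the same page gets protected too):

```
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical debugging helper: change protection of the page(s) covering
 * an object, to catch stray writes with a fault.  This is a sketch of the
 * general technique, not the actual instrumentation used in the thread.
 */
static void
protect_range(void *addr, size_t len, int prot)
{
	uintptr_t	pagesz = (uintptr_t) sysconf(_SC_PAGESIZE);
	uintptr_t	start = (uintptr_t) addr & ~(pagesz - 1);
	uintptr_t	end = ((uintptr_t) addr + len + pagesz - 1) & ~(pagesz - 1);

	(void) mprotect((void *) start, end - start, prot);
}

/*
 * Usage, mirroring what is described above:
 *   after handing out the handle:  protect_range(ioh, sizeof(*ioh), PROT_READ);
 *   before it is legitimately
 *   written again:                 protect_range(ioh, sizeof(*ioh), PROT_READ | PROT_WRITE);
 */
```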
On 12/06/2025 4:13 pm, Andres Freund wrote: > Hi, > > On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote: >> Reproduced it once again with with write-protected io handle. >> But once again - no access violation, just assert failure. >> >> Previously "op" field was overwritten somewhere between `pgaio_io_reclaim` >> and `AsyncReadBuffers`: >> >> !!!pgaio_io_reclaim [20376]| ioh: 0x1019bc000, ioh->op: 0, ioh->generation: >> 19346 >> !!!AsyncReadBuffers [20376] (1)| blocknum: 21, ioh: 0x1019bc000, ioh->op: 1, >> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 19346 >> >> Now it is overwritten after print in AsyncReadBuffers: >> >> !!!pgaio_io_reclaim [88932]| ioh: 0x105a5c000, ioh->op: 0, ioh->generation: >> 42848 >> !!!pgaio_io_acquire_nb[88932]| ioh: 0x105a5c000, ioh->op: 0, >> ioh->generation: 42848 >> !!!AsyncReadBuffers [88932] (1)| blocknum: 10, ioh: 0x105a5c000, ioh->op: 0, >> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 42848 >> !!!pgaio_io_before_start| ioh: 0x105a5c000, ioh->op: 1, ioh->state: 1, >> ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 42848 >> >> In this run I prohibit writes to io handle in `pgaio_io_acquire_nb` and >> reenable them in `AsyncReadBuffer`. > I'm reasonably certain I found the issue, I think it's a missing memory > barrier on the read side. The CPU is reordering the read (or just using a > cached value) of ->distilled_result to be before the load of ->state. > > But it'll take a bit to verify that that's the issue... It is great! But I wonder how it correlates with your previous statement: > There shouldn't be any concurrent accesses here, so I don't really see how the > above would explain the problem (the IO can only ever be modified by one > backend, initially the "owning backend", then, when submitted, by the IO > worker, and then again by the backend). This is what I am observing myself: "op" field is modified and fetched by the same process. Certainly process can be rescheduled to some other CPU. But if such reschedule can cause loose of stored value, then nothing will work, will it? So assume that there is some variable "x" which is updated by process "x=1" executed at CPU1, then process is rescheduled to CPU2 which does "x=2", then process is once again rescheduled to CPU1 and we found out that "x==1". And to prevent it we need to explicitly enforce some memory barrier. Unless I missing something , nothing can work with such memory model.
Hi, On 2025-06-12 16:30:54 +0300, Konstantin Knizhnik wrote: > On 12/06/2025 4:13 pm, Andres Freund wrote: > > On 2025-06-12 15:12:00 +0300, Konstantin Knizhnik wrote: > > I'm reasonably certain I found the issue, I think it's a missing memory > > barrier on the read side. The CPU is reordering the read (or just using a > > cached value) of ->distilled_result to be before the load of ->state. > > > > But it'll take a bit to verify that that's the issue... > > It is great! > But I wonder how it correlates with your previous statement: > > > There shouldn't be any concurrent accesses here, so I don't really see how the > > above would explain the problem (the IO can only ever be modified by one > > backend, initially the "owning backend", then, when submitted, by the IO > > worker, and then again by the backend). The problem appears to be in that switch between "when submitted, by the IO worker" and "then again by the backend". It's not concurrent access in the sense of two processes writing to the same value, it's that when switching from the worker updating ->distilled_result to the issuer looking at that, the issuer didn't ensure that no outdated version of ->distilled_result could be used. Basically, the problem is that the worker would 1) set ->distilled_result 2) perform a write memory barrier 3) set ->state to COMPLETED_SHARED and then the issuer of the IO would: 4) check ->state is COMPLETED_SHARED 5) use ->distilled_result The problem is that there currently is no barrier between 4 & 5, which means an outdated ->distilled_result could be used. This also explains why the issue looked so weird - eventually, after fprintfs, after a core dump, etc, the updated ->distilled_result result would "arrive" in the issuing process, and suddenly look correct. > This is what I am observing myself: "op" field is modified and fetched by > the same process. Right - but I don't think the ->op field being wrong was actually part of the issue though. > Certainly process can be rescheduled to some other CPU. But if such > reschedule can cause loose of stored value, then nothing will work, will it? Yes, that'd be completely broken - but isn't the issue here. Greetings, Andres Freund
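Spelled out in code, the pairing described above looks roughly like this. It is a sketch of the pattern, not the actual aio.c code; pg_write_barrier()/pg_read_barrier() are the existing PostgreSQL primitives, and the field and state names follow the discussion in the thread.

```
/* completing side (e.g. the IO worker) */
ioh->distilled_result = result;			/* 1) publish the result            */
pg_write_barrier();						/* 2) order the result before state */
ioh->state = PGAIO_HS_COMPLETED_SHARED;	/* 3) announce completion           */

/* issuing side */
if (ioh->state == PGAIO_HS_COMPLETED_SHARED)	/* 4) observe completion */
{
	/*
	 * Without a read barrier here, the load of ->distilled_result below may
	 * be satisfied from a value read (or cached) before the ->state check,
	 * i.e. step 5 can effectively be reordered before step 4.
	 */
	pg_read_barrier();
	result = ioh->distilled_result;		/* 5) consume the result */
}
```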
On 12/06/2025 4:57 pm, Andres Freund wrote: > The problem appears to be in that switch between "when submitted, by the IO > worker" and "then again by the backend". It's not concurrent access in the > sense of two processes writing to the same value, it's that when switching > from the worker updating ->distilled_result to the issuer looking at that, the > issuer didn't ensure that no outdated version of ->distilled_result could be > used. > > Basically, the problem is that the worker would > > 1) set ->distilled_result > 2) perform a write memory barrier > 3) set ->state to COMPLETED_SHARED > > and then the issuer of the IO would: > > 4) check ->state is COMPLETED_SHARED > 5) use ->distilled_result > > The problem is that there currently is no barrier between 4 & 5, which means > an outdated ->distilled_result could be used. > > > This also explains why the issue looked so weird - eventually, after fprintfs, > after a core dump, etc, the updated ->distilled_result result would "arrive" > in the issuing process, and suddenly look correct. Thank you very much for explanation. Everything seems to be so simple after explanations, that you can not even believe that before you think that such behavior can be only caused by "black magic" or "OS bug":) Certainly using outdated result can explain such behavior. But in which particular place we loose read barrier between 4 and 5? I see `pgaio_io_wait` which as I expect should be called by backend to wait completion of IO. And it calls `pgaio_io_was_recycled` to get state and it in turn enforce read barrier: ``` bool pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state) { *state = ioh->state; pg_read_barrier(); return ioh->generation != ref_generation; } ```
Hi, On 2025-06-12 11:52:31 -0400, Andres Freund wrote: > On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote: > > On 12/06/2025 4:57 pm, Andres Freund wrote: > > > The problem appears to be in that switch between "when submitted, by the IO > > > worker" and "then again by the backend". It's not concurrent access in the > > > sense of two processes writing to the same value, it's that when switching > > > from the worker updating ->distilled_result to the issuer looking at that, the > > > issuer didn't ensure that no outdated version of ->distilled_result could be > > > used. > > > > > > Basically, the problem is that the worker would > > > > > > 1) set ->distilled_result > > > 2) perform a write memory barrier > > > 3) set ->state to COMPLETED_SHARED > > > > > > and then the issuer of the IO would: > > > > > > 4) check ->state is COMPLETED_SHARED > > > 5) use ->distilled_result > > > > > > The problem is that there currently is no barrier between 4 & 5, which means > > > an outdated ->distilled_result could be used. > > > > > > > > > This also explains why the issue looked so weird - eventually, after fprintfs, > > > after a core dump, etc, the updated ->distilled_result result would "arrive" > > > in the issuing process, and suddenly look correct. > > > > Thank you very much for explanation. > > Everything seems to be so simple after explanations, that you can not even > > believe that before you think that such behavior can be only caused by > > "black magic" or "OS bug":) > > Indeed. For a while I was really starting to doubt my sanity - not helped by > the fact that I'd loose most of the mental context between reproductions (the > problem did not re-occur from Saturday to Wednesday morning...). What finally > made it clearer was moving the failing assertion to earlier > (pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as > that took out a lot of the time in which the problem could occur. > > > Certainly using outdated result can explain such behavior. > > But in which particular place we loose read barrier between 4 and 5? > > I see `pgaio_io_wait` which as I expect should be called by backend to wait > > completion of IO. > > And it calls `pgaio_io_was_recycled` to get state and it in turn enforce > > read barrier: > > ``` > > > > bool > > pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, > > PgAioHandleState *state) > > { > > *state = ioh->state; > > pg_read_barrier(); > > > > return ioh->generation != ref_generation; > > } > > ``` > > Yes, I don't think that path has the issue. > > As far as I can tell the problem only ever happens if we end up reclaiming an > IO while waiting for a free IO handle. > > The known problematic path is > pgaio_io_acquire() > -> pgaio_io_wait_for_free() > -> pgaio_io_reclaim() > > In that path we don't use pgaio_io_was_recycled(). I couldn't find any other > path with the same issue [1]. > The issue only happening while waiting for free IO handles is presumably the > reason why it happend in 027_stream_regress.pl, due to the small > shared_buffers used for the test io_max_concurrency is 1, which means we > constantly need to wait for a free IO handles. Armed with that knowledge it turns out to not be too hard to create a repro that fails a lot quicker than multiple days. 
Starting postgres with

-c io_max_concurrency=1 -c io_combine_limit=1 -c synchronize_seqscans=false -c restart_after_crash=false -c max_parallel_workers_per_gather=0

and creating a test table with:

create table large as select generate_series(1, 100000) a, b from generate_series(1, 100) b;

makes the following pgbench fail within seconds for me:

c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 1000 -P 10

(interestingly at c=8 it takes closer to a minute on average)

Adding pg_read_barrier() before the two pgaio_io_reclaim() calls in pgaio_io_wait_for_free() has so far prevented a crash. I'll leave it running for a while longer, to make sure.

We really lack any meaningful stress testing of IO paths, or I suspect we would have found this much earlier. I wonder if we could add a smaller version of the above, with a small fixed runtime (maybe 3-5s), as a tap test.

Greetings, Andres Freund
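In terms of where that fix lands, the description corresponds to something like the following; the surrounding code is paraphrased from the description above, not copied from aio.c or the committed change.

```
/*
 * Inside pgaio_io_wait_for_free(), after observing that a handle has
 * reached a reclaimable (completed) state:
 *
 * Ensure that loads from the handle - in particular ->distilled_result,
 * which the completing process stores *before* advancing ->state - cannot
 * be reordered before the ->state check that got us here.
 */
pg_read_barrier();

pgaio_io_reclaim(ioh);
```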
On 13/06/2025 11:20 pm, Andres Freund wrote: > Hi, > > On 2025-06-12 12:23:13 -0400, Andres Freund wrote: >> On 2025-06-12 11:52:31 -0400, Andres Freund wrote: >>> On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote: >>>> On 12/06/2025 4:57 pm, Andres Freund wrote: >>>>> The problem appears to be in that switch between "when submitted, by the IO >>>>> worker" and "then again by the backend". It's not concurrent access in the >>>>> sense of two processes writing to the same value, it's that when switching >>>>> from the worker updating ->distilled_result to the issuer looking at that, the >>>>> issuer didn't ensure that no outdated version of ->distilled_result could be >>>>> used. >>>>> >>>>> Basically, the problem is that the worker would >>>>> >>>>> 1) set ->distilled_result >>>>> 2) perform a write memory barrier >>>>> 3) set ->state to COMPLETED_SHARED >>>>> >>>>> and then the issuer of the IO would: >>>>> >>>>> 4) check ->state is COMPLETED_SHARED >>>>> 5) use ->distilled_result >>>>> >>>>> The problem is that there currently is no barrier between 4 & 5, which means >>>>> an outdated ->distilled_result could be used. >>>>> >>>>> >>>>> This also explains why the issue looked so weird - eventually, after fprintfs, >>>>> after a core dump, etc, the updated ->distilled_result result would "arrive" >>>>> in the issuing process, and suddenly look correct. >>>> Thank you very much for explanation. >>>> Everything seems to be so simple after explanations, that you can not even >>>> believe that before you think that such behavior can be only caused by >>>> "black magic" or "OS bug":) >>> Indeed. For a while I was really starting to doubt my sanity - not helped by >>> the fact that I'd loose most of the mental context between reproductions (the >>> problem did not re-occur from Saturday to Wednesday morning...). What finally >>> made it clearer was moving the failing assertion to earlier >>> (pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as >>> that took out a lot of the time in which the problem could occur. >>>> Certainly using outdated result can explain such behavior. >>>> But in which particular place we loose read barrier between 4 and 5? >>>> I see `pgaio_io_wait` which as I expect should be called by backend to wait >>>> completion of IO. >>>> And it calls `pgaio_io_was_recycled` to get state and it in turn enforce >>>> read barrier: >>>> ``` >>>> >>>> bool >>>> pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, >>>> PgAioHandleState *state) >>>> { >>>> *state = ioh->state; >>>> pg_read_barrier(); >>>> >>>> return ioh->generation != ref_generation; >>>> } >>>> ``` >>> Yes, I don't think that path has the issue. >>> >>> As far as I can tell the problem only ever happens if we end up reclaiming an >>> IO while waiting for a free IO handle. >>> >>> The known problematic path is >>> pgaio_io_acquire() >>> -> pgaio_io_wait_for_free() >>> -> pgaio_io_reclaim() >>> >>> In that path we don't use pgaio_io_was_recycled(). I couldn't find any other >>> path with the same issue [1]. >>> The issue only happening while waiting for free IO handles is presumably the >>> reason why it happend in 027_stream_regress.pl, due to the small >>> shared_buffers used for the test io_max_concurrency is 1, which means we >>> constantly need to wait for a free IO handles. >> Armed with that knowledge it turns out to not be too hard to create a repro >> that fails a lot quicker than multiple days. 
>> >> Starting postgres with >> >> -c io_max_concurrency=1 -c io_combine_limit=1 -c synchronize_seqscans=false -c restart_after_crash=false -c max_parallel_workers_per_gather=0 >> >> and creating a test table with: >> create table large as select generate_series(1, 100000) a, b from generate_series(1, 100) b; >> >> makes the following pgbench fail within seconds for me: >> c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 1000 -P 10 >> >> (interestingly at c=8 it takes closer to a minute on average) >> >> Adding pg_read_barrier() before the two pgaio_io_reclaim() calls in >> pgaio_io_wait_for_free() has so far prevented a crash. I'll leave it running >> for a while longer, to make sure. >> >> We really lack any meaningful stress testing of IO paths, or I suspect we >> would have found this much earlier. I wonder if we could add a smaller >> version of the above, with a small fixed runtime (maybe 3-5s), as a >> tap test. > Attached is a patch that fixes the problem for me. Alexander, Konstantin, > could you verify that it also fixes the problem for you? I have inspected your patch and from my point of view it adds all necessary read barriers and so eliminates the race. I also ran proposed test during more than 4 hours and the problem is not reproduced at my system (without patch it is reproduced after 100 seconds). It's a little bit strange that pgbench is "stuck" - I have specified timeout 10000 seconds (-T 10000) but it didn't stop and continue working until 15000 seconds when I have terminated it. There were no stuck queries: all 16 backends are running queries which execution takes about 2 seconds: progress: 14960.0 s, 7.4 tps, lat 2189.006 ms stddev 133.277, 0 failed progress: 14970.0 s, 7.3 tps, lat 2113.871 ms stddev 164.008, 0 failed progress: 14980.0 s, 7.3 tps, lat 2038.736 ms stddev 152.869, 0 failed Looks like there is yet another race condition now in pgbench:) > > Given that it does address the problem for me, I'm inclined to push this > fairly soon, the barrier is pretty obviously required. > > Greetings, > > Andres Freund
One more update: with the proposed patch (memory barrier before `ConditionVariableBroadcast` in `pgaio_io_process_completion` and the bit fields replaced with `uint8`) the problem has not reproduced on my system in 50000 seconds.
Hi, On 2025-06-16 14:11:39 +0300, Konstantin Knizhnik wrote: > One more update: with the proposed patch (memory barrier before > `ConditionVariableBroadcast` in `pgaio_io_process_completion` I don't see how that barrier could be required for correctness - ConditionVariableBroadcast() is a barrier itself (as the comment above the call notes, too). On 2025-06-15 14:48:43 +0300, Konstantin Knizhnik wrote: > Also I think that replacing bitfields with `uint8` and may be even with > `int`, is good idea at least to avoids false sharing. I don't think there's false sharing here. And even if there were, the granularity at which false sharing occurs is a cache line size, so either 64 or 128 bytes. I unfortunately can't repro this issue so far. I don't think it's the same problem as my patch fixes, so I'll push my patch. How exactly did you reproduce the problem? Greetings, Andres Freund
Konstantin Knizhnik <knizhnik@garret.ru> writes: > On 16/06/2025 6:11 pm, Andres Freund wrote: >> I unfortunately can't repro this issue so far. > But unfortunately it means that the problem is not fixed. FWIW, I get similar results to Andres' on a Mac Mini M4 Pro using MacPorts' current compiler release (clang version 19.1.7). The currently-proposed test case fails within a few minutes on e9a3615a5^ but doesn't fail in a couple of hours on e9a3615a5. However, I cannot repro that on a slightly older Mini M1 using Apple's current release (clang-1700.0.13.5, which per wikipedia is really LLVM 19.1.4). It seems to work fine even without e9a3615a5. So the whole thing is still depressingly phase-of-the-moon-dependent. I don't doubt that Konstantin has found a different issue, but it's hard to be sure about the fix unless we can get it to be more reproducible. Neither of my machines has ever shown the symptom he's getting. regards, tom lane
Unfortunately I am still able to reproduce the assertion failure (with Andres's patch but without my "fixes" - uint8 instead of bitfields).
Apple M2 Pro, Ventura 13.7.6, clang 15.0
Postgres build options: --without-icu --enable-debug --enable-cassert CFLAGS=-O0
Postgres config:
io_max_concurrency=1 io_combine_limit=1 synchronize_seqscans=false restart_after_crash=false max_parallel_workers_per_gather=0 fsync=off

Scenario proposed by Andres:

c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 30000 -P 10

Now it takes me much more time to reproduce the error. An overnight run of 30000 seconds passed normally. I thought that it might somehow be caused or encouraged by hibernation. When I reproduced it last time, I had not changed the default display setting "Prevent automatic sleeping on power adapter when the display is off" (disabled). That caused the strange timing effect I reported: when the system is hibernated, the alarm is also "hibernated", so the total test time is much longer than specified with the -T option. Now I tried to explicitly force hibernation by periodically closing the laptop cover - it doesn't help. But I disabled this option once again and let it run. Several times it continued execution normally after wakeup. But finally I got an assertion failure:

progress: 18290.0 s, 7.3 tps, lat 2229.243 ms stddev 147.211, 0 failed
progress: 18300.0 s, 7.3 tps, lat 2164.216 ms stddev 176.251, 0 failed
progress: 18310.0 s, 7.5 tps, lat 2105.803 ms stddev 202.003, 0 failed
progress: 18320.0 s, 7.7 tps, lat 2162.209 ms stddev 209.344, 0 failed
progress: 18330.0 s, 7.0 tps, lat 2157.891 ms stddev 181.369, 0 failed
progress: 18340.0 s, 7.6 tps, lat 2120.269 ms stddev 169.287, 0 failed
progress: 18350.0 s, 7.3 tps, lat 2178.657 ms stddev 159.984, 0 failed
pgbench: error: client 4 aborted in command 0 (SQL) of script 0; perhaps the backend died while processing
WARNING: terminating connection because of crash of another server process

I am not sure whether it is related to hibernation or not (it looks like it happens at the moment when the computer is about to hibernate - not sure how to check that).
Backtrace is usual: * frame #0: 0x0000000187248704 libsystem_kernel.dylib`__pthread_kill + 8 frame #1: 0x000000018727fc28 libsystem_pthread.dylib`pthread_kill + 288 frame #2: 0x000000018718dae8 libsystem_c.dylib`abort + 180 frame #3: 0x00000001034b2080 postgres`ExceptionalCondition(conditionName="ioh->op == PGAIO_OP_INVALID", fileName="aio_io.c", lineNumber=161) at assert.c:66:2 frame #4: 0x00000001031bac04 postgres`pgaio_io_before_start(ioh=0x000000010c9f6e40) at aio_io.c:161:2 frame #5: 0x00000001031baac4 postgres`pgaio_io_start_readv(ioh=0x000000010c9f6e40, fd=10, iovcnt=1, offset=180944896) at aio_io.c:81:2 frame #6: 0x00000001031d985c postgres`FileStartReadV(ioh=0x000000010c9f6e40, file=3, iovcnt=1, offset=180944896, wait_event_info=167772181) at fd.c:2241:2 frame #7: 0x000000010322c894 postgres`mdstartreadv(ioh=0x000000010c9f6e40, reln=0x000000011f0289c8, forknum=MAIN_FORKNUM, blocknum=22088, buffers=0x000000016d2dd998, nblocks=1) at md.c:1019:8 frame #8: 0x000000010322fe4c postgres`smgrstartreadv(ioh=0x000000010c9f6e40, reln=0x000000011f0289c8, forknum=MAIN_FORKNUM, blocknum=22088, buffers=0x000000016d2dd998, nblocks=1) at smgr.c:758:2 frame #9: 0x00000001031c2b0c postgres`AsyncReadBuffers(operation=0x000000010f81b910, nblocks_progress=0x000000016d2ddeb4) at bufmgr.c:1959:3 frame #10: 0x00000001031c1ce8 postgres`StartReadBuffersImpl(operation=0x000000010f81b910, buffers=0x000000010f81b3e0, blockNum=22088, nblocks=0x000000016d2ddeb4, flags=0, allow_forwarding=true) at bufmgr.c:1428:18 frame #11: 0x00000001031c182c postgres`StartReadBuffers(operation=0x000000010f81b910, buffers=0x000000010f81b3e0, blockNum=22088, nblocks=0x000000016d2ddeb4, flags=0) at bufmgr.c:1500:9 frame #12: 0x00000001031bee44 postgres`read_stream_start_pending_read(stream=0x000000010f81b378) at read_stream.c:335:14 frame #13: 0x00000001031be528 postgres`read_stream_look_ahead(stream=0x000000010f81b378) at read_stream.c:493:3 frame #14: 0x00000001031be0c0 postgres`read_stream_next_buffer(stream=0x000000010f81b378, per_buffer_data=0x0000000000000000) at read_stream.c:971:2 frame #15: 0x0000000102bb3d34 postgres`heap_fetch_next_buffer(scan=0x000000010f81ae58, dir=ForwardScanDirection) at heapam.c:675:18 frame #16: 0x0000000102ba4a88 postgres`heapgettup_pagemode(scan=0x000000010f81ae58, dir=ForwardScanDirection, nkeys=0, key=0x0000000000000000) at heapam.c:1037:3 frame #17: 0x0000000102ba5058 postgres`heap_getnextslot(sscan=0x000000010f81ae58, direction=ForwardScanDirection, slot=0x000000010f811330) at heapam.c:1391:3 frame #18: 0x0000000102f26448 postgres`table_scan_getnextslot(sscan=0x000000010f81ae58, direction=ForwardScanDirection, slot=0x000000010f811330) at tableam.h:1031:9 frame #19: 0x0000000102f26254 postgres`SeqNext(node=0x000000010f811110) at nodeSeqscan.c:81:6 frame #20: 0x0000000102f2691c postgres`ExecScanFetch(node=0x000000010f811110, epqstate=0x0000000000000000, accessMtd=(postgres`SeqNext at nodeSeqscan.c:52), recheckMtd=(postgres`SeqRecheck at nodeSeqscan.c:91)) at execScan.h:126:9 If you are interested - I have core file. May be it is separate problem - not related with race condition caused by lack of memory barrier. But frankly speaking - unlikely. I think hibernation just affects processes schedule which cause this effect.
Hi, On 2025-06-16 20:22:00 -0400, Tom Lane wrote: > Konstantin Knizhnik <knizhnik@garret.ru> writes: > > On 16/06/2025 6:11 pm, Andres Freund wrote: > >> I unfortunately can't repro this issue so far. > > > But unfortunately it means that the problem is not fixed. > > FWIW, I get similar results to Andres' on a Mac Mini M4 Pro > using MacPorts' current compiler release (clang version 19.1.7). > The currently-proposed test case fails within a few minutes on > e9a3615a5^ but doesn't fail in a couple of hours on e9a3615a5. I'm surprised it takes that long, given it takes seconds to reproduce here with the config parameters I outlined. Did you try crank up the concurrency a bit? Yours has more cores than mine, and I found that that makes a huge difference. > However, I cannot repro that on a slightly older Mini M1 using Apple's > current release (clang-1700.0.13.5, which per wikipedia is really LLVM > 19.1.4). It seems to work fine even without e9a3615a5. So the whole > thing is still depressingly phase-of-the-moon-dependent. It's not entirely surprising that an M1 would have a harder time reproducing the issue, more cores, larger caches and a larger out-of-order execution window will make it more likely that the missing memory barriers have a visible effect. I'm reasonably sure that e9a3615a5 quashed that specific issue - I could repro it within seconds with e9a3615a5^ and with e9a3615a5 I ran it for several days without a single failure... > I don't doubt that Konstantin has found a different issue, but > it's hard to be sure about the fix unless we can get it to be > more reproducible. Neither of my machines has ever shown the > symptom he's getting. I've not been able to reproduce that symptom a single time either so far. The assertion continues to be inexplicable to me. It shows, within a single process, memory in shared memory going "backwards". But not always, just very occasionally. Because this is before the IO is defined, there's no concurrent access whatsoever. I stole^Wgot my partner's m1 macbook for a bit, trying to reproduce the issue there. It has "Apple clang version 16.0.0 (clang-1600.0.26.6)" on "Darwin Kernel Version 24.3.0" That's the same Apple-clang version that Alexander reported being able to reproduce the issue on [1], but unfortunately it's a newer kernel version. No dice in the first 55 test iterations. Konstantin, Alexander - are you using the same device to reproduce this or different ones? I wonder if this somehow depends on some MDM / corporate enforcement tooling running or such. What does: - profiles status -type enrollment - kextstat -l show? Greetings, Andres Freund [1] https://postgr.es/m/92b33ab2-0596-40fe-9db6-a6d821d08e8a%40gmail.com
On 17/06/2025 4:35 pm, Andres Freund wrote: > Konstantin, Alexander - are you using the same device to reproduce this or > different ones? I wonder if this somehow depends on some MDM / corporate > enforcement tooling running or such. > > What does: > - profiles status -type enrollment > - kextstat -l > show? I and Alexandr are using completely different devices with different hardware, OS and clang version. profiles status -type enrollment Enrolled via DEP: No MDM enrollment: Yes (User Approved) MDM server: https://apple.mdm.jumpcloud.com/command `kextstat -l` output very long list, should I post it all or search for some particular pattern?
Hi, On 2025-06-17 16:43:05 +0300, Konstantin Knizhnik wrote: > On 17/06/2025 4:35 pm, Andres Freund wrote: > > Konstantin, Alexander - are you using the same device to reproduce this or > > different ones? I wonder if this somehow depends on some MDM / corporate > > enforcement tooling running or such. > > > > What does: > > - profiles status -type enrollment > > - kextstat -l > > show? > > > I and Alexandr are using completely different devices with different > hardware, OS and clang version. Both of you are running Ventura, right? > profiles status -type enrollment > Enrolled via DEP: No > MDM enrollment: Yes (User Approved) > MDM server: https://apple.mdm.jumpcloud.com/command So it is enrolled in MDM, which, I guess, means there's some kernel component running... > `kextstat -l` output very long list, should I post it all or search for some > particular pattern? I guess kextstat -l|grep -v com.apple might suffice. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Both of you are running Ventura, right? FTR, the machines I'm trying this on are all running current Sequoia: [tgl@minim4 ~]$ uname -a Darwin minim4.sss.pgh.pa.us 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6041arm64 Interestingly, the M1 seems to be using a different kernel build than my two M4 machines: [tgl@minim1 ~]$ uname -a Darwin minim1.sss.pgh.pa.us 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:48:46 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T8103arm64 although Apple has a history of using kernels customized to specific core counts, so maybe that's all there is there. These are all personally-owned bare metal: [tgl@minim4 ~]$ profiles status -type enrollment Enrolled via DEP: No MDM enrollment: No [tgl@minim4 ~]$ kextstat -l | grep -v com.apple Executing: /usr/bin/kmutil showloaded --list-only No variant specified, falling back to release As for your previous question, I tried both c=16 and c=32 on the M4. regards, tom lane
On 12/06/2025 4:57 pm, Andres Freund wrote: > The problem appears to be in that switch between "when submitted, by the IO > worker" and "then again by the backend". It's not concurrent access in the > sense of two processes writing to the same value, it's that when switching > from the worker updating ->distilled_result to the issuer looking at that, the > issuer didn't ensure that no outdated version of ->distilled_result could be > used. > > Basically, the problem is that the worker would > > 1) set ->distilled_result > 2) perform a write memory barrier > 3) set ->state to COMPLETED_SHARED > > and then the issuer of the IO would: > > 4) check ->state is COMPLETED_SHARED > 5) use ->distilled_result > > The problem is that there currently is no barrier between 4 & 5, which means > an outdated ->distilled_result could be used. > > > This also explains why the issue looked so weird - eventually, after fprintfs, > after a core dump, etc, the updated ->distilled_result result would "arrive" > in the issuing process, and suddenly look correct. > Sorry, I realized that O do not completely understand how it can explained assertion failure in `pgaio_io_before_start`: Assert(ioh->op == PGAIO_OP_INVALID); I am not so familiar with AIO implementation in Postgress, so please excuse me if I missing something... Ok, `distilled_result` contains some deteriorated data. As far as I understand it is used for calling callbacks. Right now in debugger (loaded core file), I see thsat two callbacks are registered: `aio_md_readv_cb` and `aio_shared_buffer_readv_cb`. None of them is changing "op" field. This is full content of the handle: (PgAioHandle) { state = PGAIO_HS_HANDED_OUT target = PGAIO_TID_SMGR op = PGAIO_OP_READV flags = '\x04' num_callbacks = '\x02' callbacks = { [0] = '\x02' [1] = '\x01' [2] = '\0' [3] = '\0' } callbacks_data = { [0] = '\0' [1] = '\0' [2] = '\0' [3] = '\0' } handle_data_len = '\x01' owner_procno = 28 result = 0 node = { prev = 0x000000010c9ea368 next = 0x000000010c9ea368 } resowner = 0x000000014880ef28 resowner_node = { prev = 0x000000014880f1d8 next = 0x000000014880f1d8 } generation = 133159566 cv = { mutex = 0 wakeup = (head = -1, tail = -1) } distilled_result = (id = 0, status = 0, error_data = 0, result = 1) iovec_off = 448 report_return = 0x000000010f81b950 op_data = { read = (fd = 7, iov_length = 1, offset = 180936704) write = (fd = 7, iov_length = 1, offset = 180936704) } target_data = { smgr = { rlocator = (spcOid = 1663, dbOid = 5, relNumber = 16384) blockNum = 22088 nblocks = 1 forkNum = MAIN_FORKNUM is_temp = false skip_fsync = false } } } Actually `PGAIO_OP_READV` is set in the only place: `pgaio_io_start_readv`. And this function is obviously called in user's backed with procno=28 which letter gets this assert failure. So I try to understand dataflow. 1. Backend calls `pgaio_io_start_readv` to initiate IO. It assigns `op=PGAIO_OP_READV` 2. It is broadcasted to all CPUs because of write barrier in `pgaio_io_update_state(ioh, PGAIO_HS_DEFINED)` called fro `pgaio_io_stage`. 3. AIO thread gets notification about this request and somehow handle it changing status to `COMPLETED_SHARED` in `pgaio_io_process_completion` which notifies backend using condvar (which is expected to enforce write barrier). 4. Backend see that status is `COMPLETED_SHARED`. Backend can do it in `pgaio_io_wait` or in `pgaio_io_wait_for_free`. In first case there seems to be no problem, because `pgaio_io_was_recycled` is used which does read barrier. 
So you suspect `pgaio_io_wait_for_free` where read barrier was missed (before your patch). But it is not quite clear to me what can happen with "op" field without read barrier? In theory backend can see old value. In practice it seems to be impossible because as far as I know the unit of syncing data between CPU is cache line. So my hypothesis was the following: backend observed old value of "op" (PGAIO_OP_READV, because this is what it has written itself), then overwrites it with PGAIO_OP_INVALID in `pgaio_io_reclaim` and then delayed update from AIO worker is received which restores old value - PGAIO_OP_READV. But it seems to be not so realistic scenario. Can you explain please how according to you theory not up-to-date version of distilled_result can explain such behavior? It will be possible if result contains index of the handle and stale version of the index can cause addressing from IO handle. But it is not the case - as we can see from the trace, address of `ioh` in the backens is always the same. > >> This is what I am observing myself: "op" field is modified and fetched by >> the same process. > Right - but I don't think the ->op field being wrong was actually part of the > issue though. > > >> Certainly process can be rescheduled to some other CPU. But if such >> reschedule can cause loose of stored value, then nothing will work, will it? > Yes, that'd be completely broken - but isn't the issue here. > > Greetings, > > Andres Freund
I and Alexandr are using completely different devices with different hardware, OS and clang version. Both of you are running Ventura, right?
No, Alexandr is using darwin23.5
Alexandr also noticed that he can reproduce the problem only with --with-liburing.
And I am not using it.
I guess kextstat -l|grep -v com.apple might suffice.
kextstat -l| grep -v com.apple
Executing: /usr/bin/kmutil showloaded --list-only
No variant specified, falling back to release
On 2025-06-17 17:54:12 +0300, Konstantin Knizhnik wrote:
> On 12/06/2025 4:57 pm, Andres Freund wrote:
> > The problem appears to be in that switch between "when submitted, by the IO
> > worker" and "then again by the backend". It's not concurrent access in the
> > sense of two processes writing to the same value, it's that when switching
> > from the worker updating ->distilled_result to the issuer looking at that, the
> > issuer didn't ensure that no outdated version of ->distilled_result could be
> > used.
> >
> > Basically, the problem is that the worker would
> >
> > 1) set ->distilled_result
> > 2) perform a write memory barrier
> > 3) set ->state to COMPLETED_SHARED
> >
> > and then the issuer of the IO would:
> >
> > 4) check ->state is COMPLETED_SHARED
> > 5) use ->distilled_result
> >
> > The problem is that there currently is no barrier between 4 & 5, which means
> > an outdated ->distilled_result could be used.
> >
> >
> > This also explains why the issue looked so weird - eventually, after fprintfs,
> > after a core dump, etc, the updated ->distilled_result result would "arrive"
> > in the issuing process, and suddenly look correct.
>
> Sorry, I realize that I do not completely understand how this can explain the
> assertion failure in `pgaio_io_before_start`:
>
>     Assert(ioh->op == PGAIO_OP_INVALID);

I don't think it can - this must be an independent bug from the one that Tom
and I were encountering.

Greetings,

Andres Freund
On 2025-06-17 18:08:30 +0300, Konstantin Knizhnik wrote:
> On 17/06/2025 4:47 pm, Andres Freund wrote:
> > > I and Alexandr are using completely different devices with different
> > > hardware, OS and clang version.
> > Both of you are running Ventura, right?
>
> No, Alexandr is using darwin23.5.

I.e. Sonoma.

> Alexandr also noticed that he can reproduce the problem only with
> --with-liburing. And I am not using it.

That must be a misunderstanding - liburing is linux only.

> > I guess
> > kextstat -l|grep -v com.apple
> > might suffice.
>
> kextstat -l | grep -v com.apple
> Executing: /usr/bin/kmutil showloaded --list-only
> No variant specified, falling back to release

Thanks.

Greetings,

Andres Freund
On Thu, Jun 19, 2025 at 4:08 AM Andres Freund <andres@anarazel.de> wrote:
> Konstantin, Alexander, can you share what commit you're testing and what
> precise changes have been applied to the source? I've now tested this on a
> significant number of apple machines for many many days without being able to
> reproduce this a single time, despite using various compiler
> [versions]. Something has to be different on the two systems you're testing
> on.

What do you see in "System Firmware Version" from "About This Mac" ->
"More Info" -> "General" -> "System Report"? My M4 Air running Sequoia 15.5
shows "11881.121.1". My M1 Air is still running Sequoia 15.4.1 and shows
"11881.101.1" (I will upgrade it shortly).

The version before that, Sequoia 15.4 (no ".1"), actually broke my M1 Air for
a couple of weeks: there was something funny going on with the firmware
involving a kernel panic that mentioned t6000dart (I believe that is the
IOMMU, i.e. the virtual memory hardware for IO), fixed by 15.4.1. It's just a
paranoid thought, but I wonder whether all that stuff rolls out in sync across
all territories.

I have also not managed to reproduce this issue despite running for many hours
yesterday...