Thread: Non-reproducible AIO failure
My buildfarm animal sifaka just failed like this [1]: TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322 0 postgres 0x0000000100e3df2c ExceptionalCondition + 108 1 postgres 0x0000000100cbb154 WaitReadBuffers + 616 2 postgres 0x0000000100f072bc read_stream_next_buffer.cold.1 + 184 3 postgres 0x0000000100cb6dd8 read_stream_next_buffer + 300 4 postgres 0x00000001009b75b4 heap_fetch_next_buffer + 136 5 postgres 0x00000001009ad8c4 heapgettup + 368 6 postgres 0x00000001009ad364 heap_getnext + 104 7 postgres 0x00000001009b9770 heapam_index_build_range_scan + 700 8 postgres 0x00000001009ef4b8 spgbuild + 480 9 postgres 0x0000000100a41acc index_build + 428 10 postgres 0x0000000100a43d14 reindex_index + 752 11 postgres 0x0000000100ad09e0 ExecReindex + 1908 12 postgres 0x0000000100d04628 ProcessUtilitySlow + 1300 13 postgres 0x0000000100d03754 standard_ProcessUtility + 1528 14 pg_stat_statements.dylib 0x000000010158d7a0 pgss_ProcessUtility + 632 15 postgres 0x0000000100d02cec PortalRunUtility + 184 16 postgres 0x0000000100d02354 PortalRunMulti + 236 17 postgres 0x0000000100d01d94 PortalRun + 456 18 postgres 0x0000000100d00e10 exec_simple_query + 1308 ... 2025-04-23 16:08:28.834 EDT [78788:4] LOG: client backend (PID 79322) was terminated by signal 6: Abort trap: 6 2025-04-23 16:08:28.834 EDT [78788:5] DETAIL: Failed process was running: reindex index spgist_point_idx; Scraping the buildfarm database confirmed that there have been no other similar failures since AIO went in, and unsurprisingly some attempts to reproduce it manually didn't bear fruit. So I'm suspecting a race condition with a very narrow window, but I don't have an idea where to look. Any thoughts on how to gather more data? regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=sifaka&dt=2025-04-23%2020%3A03%3A24&stg=recovery-check
Hi, On 2025-04-23 17:17:01 -0400, Tom Lane wrote: > My buildfarm animal sifaka just failed like this [1]: There's nothing really special about sifaka, is there? I see -std=gnu99 and a few debug -D cppflags, but they don't look they could really be relevant here. > TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322 Huh. I assume there's no core file left over? Would be helpful to query some more state (e.g. whether operation->io_wref is valid (think it has to be) and whether this was a partial read or such). > 0 postgres 0x0000000100e3df2c ExceptionalCondition + 108 > 1 postgres 0x0000000100cbb154 WaitReadBuffers + 616 > 2 postgres 0x0000000100f072bc read_stream_next_buffer.cold.1 + 184 > 3 postgres 0x0000000100cb6dd8 read_stream_next_buffer + 300 > 4 postgres 0x00000001009b75b4 heap_fetch_next_buffer + 136 > 5 postgres 0x00000001009ad8c4 heapgettup + 368 > 6 postgres 0x00000001009ad364 heap_getnext + 104 > 7 postgres 0x00000001009b9770 heapam_index_build_range_scan + 700 > 8 postgres 0x00000001009ef4b8 spgbuild + 480 > 9 postgres 0x0000000100a41acc index_build + 428 > 10 postgres 0x0000000100a43d14 reindex_index + 752 > 11 postgres 0x0000000100ad09e0 ExecReindex + 1908 > 12 postgres 0x0000000100d04628 ProcessUtilitySlow + 1300 > 13 postgres 0x0000000100d03754 standard_ProcessUtility + 1528 > 14 pg_stat_statements.dylib 0x000000010158d7a0 pgss_ProcessUtility + 632 > 15 postgres 0x0000000100d02cec PortalRunUtility + 184 > 16 postgres 0x0000000100d02354 PortalRunMulti + 236 > 17 postgres 0x0000000100d01d94 PortalRun + 456 > 18 postgres 0x0000000100d00e10 exec_simple_query + 1308 > ... > 2025-04-23 16:08:28.834 EDT [78788:4] LOG: client backend (PID 79322) was terminated by signal 6: Abort trap: 6 > 2025-04-23 16:08:28.834 EDT [78788:5] DETAIL: Failed process was running: reindex index spgist_point_idx; > Scraping the buildfarm database confirmed that there have been no > other similar failures since AIO went in, and unsurprisingly some > attempts to reproduce it manually didn't bear fruit. So I'm > suspecting a race condition with a very narrow window, but I don't > have an idea where to look. Any thoughts on how to gather more > data? The weird thing is that the concrete symptom here doesn't actually involve anything that could really race - the relevant state is only updated by the process itself. But of course there has to be something narrow here, otherwise we'd have seen this before :(. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2025-04-23 17:17:01 -0400, Tom Lane wrote: >> My buildfarm animal sifaka just failed like this [1]: > There's nothing really special about sifaka, is there? I see -std=gnu99 and a > few debug -D cppflags, but they don't look they could really be relevant here. No, it's intended to be a bog-standard macOS environment. >> TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322 > Huh. > I assume there's no core file left over? No such luck, I did not have that enabled. I will turn it on though, and also on its sibling indri. regards, tom lane