Re: Non-reproducible AIO failure - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Non-reproducible AIO failure
Date:
Msg-id: oj3ydw3irpcrjirw5ihcfcqlav2c26x35fhfq2pk4fo5sl6zh5@c25nlzjx3b6e
In response to: Non-reproducible AIO failure (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Non-reproducible AIO failure
List: pgsql-hackers
Hi,

On 2025-04-23 17:17:01 -0400, Tom Lane wrote:
> My buildfarm animal sifaka just failed like this [1]:

There's nothing really special about sifaka, is there? I see -std=gnu99 and a
few debug -D cppflags, but they don't look like they could really be relevant
here.

> TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322

Huh. I assume there's no core file left over? Would be helpful to query some
more state (e.g. whether operation->io_wref is valid (think it has to be) and
whether this was a partial read or such).

> 0   postgres                    0x0000000100e3df2c ExceptionalCondition + 108
> 1   postgres                    0x0000000100cbb154 WaitReadBuffers + 616
> 2   postgres                    0x0000000100f072bc read_stream_next_buffer.cold.1 + 184
> 3   postgres                    0x0000000100cb6dd8 read_stream_next_buffer + 300
> 4   postgres                    0x00000001009b75b4 heap_fetch_next_buffer + 136
> 5   postgres                    0x00000001009ad8c4 heapgettup + 368
> 6   postgres                    0x00000001009ad364 heap_getnext + 104
> 7   postgres                    0x00000001009b9770 heapam_index_build_range_scan + 700
> 8   postgres                    0x00000001009ef4b8 spgbuild + 480
> 9   postgres                    0x0000000100a41acc index_build + 428
> 10  postgres                    0x0000000100a43d14 reindex_index + 752
> 11  postgres                    0x0000000100ad09e0 ExecReindex + 1908
> 12  postgres                    0x0000000100d04628 ProcessUtilitySlow + 1300
> 13  postgres                    0x0000000100d03754 standard_ProcessUtility + 1528
> 14  pg_stat_statements.dylib    0x000000010158d7a0 pgss_ProcessUtility + 632
> 15  postgres                    0x0000000100d02cec PortalRunUtility + 184
> 16  postgres                    0x0000000100d02354 PortalRunMulti + 236
> 17  postgres                    0x0000000100d01d94 PortalRun + 456
> 18  postgres                    0x0000000100d00e10 exec_simple_query + 1308
> ...
> 2025-04-23 16:08:28.834 EDT [78788:4] LOG: client backend (PID 79322) was terminated by signal 6: Abort trap: 6
> 2025-04-23 16:08:28.834 EDT [78788:5] DETAIL: Failed process was running: reindex index spgist_point_idx;

> Scraping the buildfarm database confirmed that there have been no
> other similar failures since AIO went in, and unsurprisingly some
> attempts to reproduce it manually didn't bear fruit. So I'm
> suspecting a race condition with a very narrow window, but I don't
> have an idea where to look. Any thoughts on how to gather more
> data?

The weird thing is that the concrete symptom here doesn't actually involve
anything that could really race - the relevant state is only updated by the
process itself. But of course there has to be something narrow here, otherwise
we'd have seen this before :(.

Greetings,

Andres Freund