Thread: Non-reproducible AIO failure

Non-reproducible AIO failure

From
Tom Lane
Date:
My buildfarm animal sifaka just failed like this [1]:

TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322
0   postgres                            0x0000000100e3df2c ExceptionalCondition + 108
1   postgres                            0x0000000100cbb154 WaitReadBuffers + 616
2   postgres                            0x0000000100f072bc read_stream_next_buffer.cold.1 + 184
3   postgres                            0x0000000100cb6dd8 read_stream_next_buffer + 300
4   postgres                            0x00000001009b75b4 heap_fetch_next_buffer + 136
5   postgres                            0x00000001009ad8c4 heapgettup + 368
6   postgres                            0x00000001009ad364 heap_getnext + 104
7   postgres                            0x00000001009b9770 heapam_index_build_range_scan + 700
8   postgres                            0x00000001009ef4b8 spgbuild + 480
9   postgres                            0x0000000100a41acc index_build + 428
10  postgres                            0x0000000100a43d14 reindex_index + 752
11  postgres                            0x0000000100ad09e0 ExecReindex + 1908
12  postgres                            0x0000000100d04628 ProcessUtilitySlow + 1300
13  postgres                            0x0000000100d03754 standard_ProcessUtility + 1528
14  pg_stat_statements.dylib            0x000000010158d7a0 pgss_ProcessUtility + 632
15  postgres                            0x0000000100d02cec PortalRunUtility + 184
16  postgres                            0x0000000100d02354 PortalRunMulti + 236
17  postgres                            0x0000000100d01d94 PortalRun + 456
18  postgres                            0x0000000100d00e10 exec_simple_query + 1308
...
2025-04-23 16:08:28.834 EDT [78788:4] LOG:  client backend (PID 79322) was terminated by signal 6: Abort trap: 6
2025-04-23 16:08:28.834 EDT [78788:5] DETAIL:  Failed process was running: reindex index spgist_point_idx;


Scraping the buildfarm database confirmed that there have been no
other similar failures since AIO went in, and unsurprisingly some
attempts to reproduce it manually didn't bear fruit.  So I'm
suspecting a race condition with a very narrow window, but I don't
have an idea where to look.  Any thoughts on how to gather more
data?

            regards, tom lane

[1]
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=sifaka&dt=2025-04-23%2020%3A03%3A24&stg=recovery-check



Re: Non-reproducible AIO failure

From
Andres Freund
Date:
Hi,

On 2025-04-23 17:17:01 -0400, Tom Lane wrote:
> My buildfarm animal sifaka just failed like this [1]:

There's nothing really special about sifaka, is there? I see -std=gnu99 and a
few debug -D cppflags, but they don't look they could really be relevant here.


> TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322

Huh.

I assume there's no core file left over? Would be helpful to query some more
state (e.g. whether operation->io_wref is valid (think it has to be) and
whether this was a partial read or such).


> 0   postgres                            0x0000000100e3df2c ExceptionalCondition + 108
> 1   postgres                            0x0000000100cbb154 WaitReadBuffers + 616
> 2   postgres                            0x0000000100f072bc read_stream_next_buffer.cold.1 + 184
> 3   postgres                            0x0000000100cb6dd8 read_stream_next_buffer + 300
> 4   postgres                            0x00000001009b75b4 heap_fetch_next_buffer + 136
> 5   postgres                            0x00000001009ad8c4 heapgettup + 368
> 6   postgres                            0x00000001009ad364 heap_getnext + 104
> 7   postgres                            0x00000001009b9770 heapam_index_build_range_scan + 700
> 8   postgres                            0x00000001009ef4b8 spgbuild + 480
> 9   postgres                            0x0000000100a41acc index_build + 428
> 10  postgres                            0x0000000100a43d14 reindex_index + 752
> 11  postgres                            0x0000000100ad09e0 ExecReindex + 1908
> 12  postgres                            0x0000000100d04628 ProcessUtilitySlow + 1300
> 13  postgres                            0x0000000100d03754 standard_ProcessUtility + 1528
> 14  pg_stat_statements.dylib            0x000000010158d7a0 pgss_ProcessUtility + 632
> 15  postgres                            0x0000000100d02cec PortalRunUtility + 184
> 16  postgres                            0x0000000100d02354 PortalRunMulti + 236
> 17  postgres                            0x0000000100d01d94 PortalRun + 456
> 18  postgres                            0x0000000100d00e10 exec_simple_query + 1308
> ...
> 2025-04-23 16:08:28.834 EDT [78788:4] LOG:  client backend (PID 79322) was terminated by signal 6: Abort trap: 6
> 2025-04-23 16:08:28.834 EDT [78788:5] DETAIL:  Failed process was running: reindex index spgist_point_idx;


> Scraping the buildfarm database confirmed that there have been no
> other similar failures since AIO went in, and unsurprisingly some
> attempts to reproduce it manually didn't bear fruit.  So I'm
> suspecting a race condition with a very narrow window, but I don't
> have an idea where to look.  Any thoughts on how to gather more
> data?

The weird thing is that the concrete symptom here doesn't actually involve
anything that could really race - the relevant state is only updated by the
process itself.  But of course there has to be something narrow here,
otherwise we'd have seen this before :(.

Greetings,

Andres Freund



Re: Non-reproducible AIO failure

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2025-04-23 17:17:01 -0400, Tom Lane wrote:
>> My buildfarm animal sifaka just failed like this [1]:

> There's nothing really special about sifaka, is there? I see -std=gnu99 and a
> few debug -D cppflags, but they don't look they could really be relevant here.

No, it's intended to be a bog-standard macOS environment.

>> TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322

> Huh.

> I assume there's no core file left over?

No such luck, I did not have that enabled.  I will turn it on though,
and also on its sibling indri.

            regards, tom lane