Thread: Non-reproducible AIO failure

Non-reproducible AIO failure

From: Tom Lane
My buildfarm animal sifaka just failed like this [1]:

TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322
0   postgres                            0x0000000100e3df2c ExceptionalCondition + 108
1   postgres                            0x0000000100cbb154 WaitReadBuffers + 616
2   postgres                            0x0000000100f072bc read_stream_next_buffer.cold.1 + 184
3   postgres                            0x0000000100cb6dd8 read_stream_next_buffer + 300
4   postgres                            0x00000001009b75b4 heap_fetch_next_buffer + 136
5   postgres                            0x00000001009ad8c4 heapgettup + 368
6   postgres                            0x00000001009ad364 heap_getnext + 104
7   postgres                            0x00000001009b9770 heapam_index_build_range_scan + 700
8   postgres                            0x00000001009ef4b8 spgbuild + 480
9   postgres                            0x0000000100a41acc index_build + 428
10  postgres                            0x0000000100a43d14 reindex_index + 752
11  postgres                            0x0000000100ad09e0 ExecReindex + 1908
12  postgres                            0x0000000100d04628 ProcessUtilitySlow + 1300
13  postgres                            0x0000000100d03754 standard_ProcessUtility + 1528
14  pg_stat_statements.dylib            0x000000010158d7a0 pgss_ProcessUtility + 632
15  postgres                            0x0000000100d02cec PortalRunUtility + 184
16  postgres                            0x0000000100d02354 PortalRunMulti + 236
17  postgres                            0x0000000100d01d94 PortalRun + 456
18  postgres                            0x0000000100d00e10 exec_simple_query + 1308
...
2025-04-23 16:08:28.834 EDT [78788:4] LOG:  client backend (PID 79322) was terminated by signal 6: Abort trap: 6
2025-04-23 16:08:28.834 EDT [78788:5] DETAIL:  Failed process was running: reindex index spgist_point_idx;


Scraping the buildfarm database confirmed that there have been no
other similar failures since AIO went in, and unsurprisingly some
attempts to reproduce it manually didn't bear fruit.  So I'm
suspecting a race condition with a very narrow window, but I don't
have an idea where to look.  Any thoughts on how to gather more
data?

            regards, tom lane

[1]
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=sifaka&dt=2025-04-23%2020%3A03%3A24&stg=recovery-check



Re: Non-reproducible AIO failure

From: Andres Freund
Hi,

On 2025-04-23 17:17:01 -0400, Tom Lane wrote:
> My buildfarm animal sifaka just failed like this [1]:

There's nothing really special about sifaka, is there? I see -std=gnu99 and a
few debug -D cppflags, but they don't look like they could really be relevant here.


> TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322

Huh.

I assume there's no core file left over? Would be helpful to query some more
state (e.g. whether operation->io_wref is valid (think it has to be) and
whether this was a partial read or such).


> 0   postgres                            0x0000000100e3df2c ExceptionalCondition + 108
> 1   postgres                            0x0000000100cbb154 WaitReadBuffers + 616
> 2   postgres                            0x0000000100f072bc read_stream_next_buffer.cold.1 + 184
> 3   postgres                            0x0000000100cb6dd8 read_stream_next_buffer + 300
> 4   postgres                            0x00000001009b75b4 heap_fetch_next_buffer + 136
> 5   postgres                            0x00000001009ad8c4 heapgettup + 368
> 6   postgres                            0x00000001009ad364 heap_getnext + 104
> 7   postgres                            0x00000001009b9770 heapam_index_build_range_scan + 700
> 8   postgres                            0x00000001009ef4b8 spgbuild + 480
> 9   postgres                            0x0000000100a41acc index_build + 428
> 10  postgres                            0x0000000100a43d14 reindex_index + 752
> 11  postgres                            0x0000000100ad09e0 ExecReindex + 1908
> 12  postgres                            0x0000000100d04628 ProcessUtilitySlow + 1300
> 13  postgres                            0x0000000100d03754 standard_ProcessUtility + 1528
> 14  pg_stat_statements.dylib            0x000000010158d7a0 pgss_ProcessUtility + 632
> 15  postgres                            0x0000000100d02cec PortalRunUtility + 184
> 16  postgres                            0x0000000100d02354 PortalRunMulti + 236
> 17  postgres                            0x0000000100d01d94 PortalRun + 456
> 18  postgres                            0x0000000100d00e10 exec_simple_query + 1308
> ...
> 2025-04-23 16:08:28.834 EDT [78788:4] LOG:  client backend (PID 79322) was terminated by signal 6: Abort trap: 6
> 2025-04-23 16:08:28.834 EDT [78788:5] DETAIL:  Failed process was running: reindex index spgist_point_idx;


> Scraping the buildfarm database confirmed that there have been no
> other similar failures since AIO went in, and unsurprisingly some
> attempts to reproduce it manually didn't bear fruit.  So I'm
> suspecting a race condition with a very narrow window, but I don't
> have an idea where to look.  Any thoughts on how to gather more
> data?

The weird thing is that the concrete symptom here doesn't actually involve
anything that could really race - the relevant state is only updated by the
process itself.  But of course there has to be something narrow here,
otherwise we'd have seen this before :(.

Greetings,

Andres Freund



Re: Non-reproducible AIO failure

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> On 2025-04-23 17:17:01 -0400, Tom Lane wrote:
>> My buildfarm animal sifaka just failed like this [1]:

> There's nothing really special about sifaka, is there? I see -std=gnu99 and a
> few debug -D cppflags, but they don't look like they could really be relevant here.

No, it's intended to be a bog-standard macOS environment.

>> TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322

> Huh.

> I assume there's no core file left over?

No such luck, I did not have that enabled.  I will turn it on though,
and also on its sibling indri.

            regards, tom lane



Re: Non-reproducible AIO failure

From: Tom Lane
Alexander Lakhin <exclusion@gmail.com> writes:
> FWIW, that Assert has just triggered on another Mac:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2025-05-23%2020%3A30%3A07

Yeah, I was just looking at that too.  There is a corefile
from that crash, but lldb seems unable to extract anything
from it :-(.  There is a low-resolution stack trace in the
postmaster log though:

0   postgres                            0x0000000105299c84 ExceptionalCondition + 108
1   postgres                            0x00000001051159ac WaitReadBuffers + 616
2   postgres                            0x00000001053611ec read_stream_next_buffer.cold.1 + 184
3   postgres                            0x0000000105111630 read_stream_next_buffer + 300
4   postgres                            0x0000000104e0b994 heap_fetch_next_buffer + 136
5   postgres                            0x0000000104e018f4 heapgettup_pagemode + 204
6   postgres                            0x0000000104e02010 heap_getnextslot + 84
7   postgres                            0x0000000104faebb4 SeqNext + 160


Of note here is that indri and sifaka run on the same host ---
indri uses some MacPorts packages while sifaka doesn't, but
that shouldn't have anything to do with our AIO code.

I trawled the buildfarm database and confirmed that these two crashes
are our only similar reports (grepping for "PGAIO_RS_UNKNOWN"):

 sysname | branch |      snapshot       |     stage     | l
---------+--------+---------------------+---------------+------------------------------------------------------------------------------------------------------------
 indri   | HEAD   | 2025-05-23 20:30:07 | recoveryCheck | TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 20931
 sifaka  | HEAD   | 2025-04-23 20:03:24 | recoveryCheck | TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 79322
(2 rows)

So it seems that "very low-probability issue in our Mac AIO code" is
the most probable description.

            regards, tom lane



Re: Non-reproducible AIO failure

From: Alexander Lakhin
Hello Thomas,

24.05.2025 14:42, Thomas Munro wrote:
> On Sat, May 24, 2025 at 3:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So it seems that "very low-probability issue in our Mac AIO code" is
>> the most probable description.
> There isn't any macOS-specific AIO code so my first guess would be
> that it might be due to aarch64 weak memory reordering (though Andres
> speculated that it should all be one backend, huh), if it's not just
> a timing luck thing.  Alexander, were the other OSes you tried all on
> x86?

As I wrote off-list before, I had tried x86_64 only, but since then I've
tried to reproduce the issue on an aarch64 server with Ubuntu 24.04,
running 10, then 40 instances of t/027_stream_regress.pl in parallel. I've
also multiplied the "test: brin ..." line x10. But the issue still hasn't
reproduced (in 8+ hours).

However, I've managed to get an AIO-related assertion failure on macOS 14.5
(Darwin Kernel Version 23.5.0) quite easily, running 027_stream_regress.pl
the same way:
2       # +++ tap check in src/test/recovery_2 +++
3       # +++ tap check in src/test/recovery_3 +++
4       # +++ tap check in src/test/recovery_4 +++
5       # +++ tap check in src/test/recovery_5 +++
1       # +++ tap check in src/test/recovery_1 +++
7       # +++ tap check in src/test/recovery_7 +++
6       # +++ tap check in src/test/recovery_6 +++
8       # +++ tap check in src/test/recovery_8 +++
1       [16:07:31] t/027_stream_regress.pl ..
1       Dubious, test returned 29 (wstat 7424, 0x1d00)
1       All 1 subtests passed
1       [16:07:31]
1
1       Test Summary Report
1       -------------------
1       t/027_stream_regress.pl (Wstat: 7424 Tests: 1 Failed: 0)
1         Non-zero exit status: 29
1         Parse errors: No plan found in TAP output
1       Files=1, Tests=1, 197 wallclock secs ( 0.01 usr  0.01 sys + 4.47 cusr  5.79 csys = 10.28 CPU)
1       Result: FAIL

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
0   postgres                            0x0000000104f078f4 ExceptionalCondition + 236
1   postgres                            0x0000000104c0ebd4 pgaio_io_before_start + 260
2   postgres                            0x0000000104c0ea94 pgaio_io_start_readv + 36
3   postgres                            0x0000000104c2d4e8 FileStartReadV + 252
4   postgres                            0x0000000104c807c8 mdstartreadv + 668
5   postgres                            0x0000000104c83db0 smgrstartreadv + 116
6   postgres                            0x0000000104c1675c AsyncReadBuffers + 1444
7   postgres                            0x0000000104c15868 StartReadBuffersImpl + 1196
8   postgres                            0x0000000104c153ac StartReadBuffers + 64
9   postgres                            0x0000000104c129b4 read_stream_start_pending_read + 1204
10  postgres                            0x0000000104c12018 read_stream_look_ahead + 812
11  postgres                            0x0000000104c11cd0 read_stream_next_buffer + 1396
12  postgres                            0x0000000104606c38 heap_fetch_next_buffer + 284
13  postgres                            0x00000001045f79d4 heapgettup_pagemode + 192
14  postgres                            0x00000001045f7fa4 heap_getnextslot + 84
15  postgres                            0x000000010497a404 table_scan_getnextslot + 340
16  postgres                            0x000000010497a210 SeqNext + 148
17  postgres                            0x000000010497a8e4 ExecScanFetch + 772
18  postgres                            0x000000010497a4b8 ExecScanExtended + 164
19  postgres                            0x0000000104979bf0 ExecSeqScanWithProject + 244
20  postgres                            0x0000000104953ce0 ExecProcNode + 68
21  postgres                            0x000000010494f148 MultiExecPrivateHash + 64
22  postgres                            0x000000010494ed48 MultiExecHash + 100
23  postgres                            0x0000000104925a18 MultiExecProcNode + 180
24  postgres                            0x0000000104957220 ExecHashJoinImpl + 628
25  postgres                            0x00000001049566dc ExecHashJoin + 32
26  postgres                            0x0000000104925958 ExecProcNodeFirst + 100
27  postgres                            0x000000010491c08c ExecProcNode + 68
28  postgres                            0x000000010491692c ExecutePlan + 268
29  postgres                            0x000000010491679c standard_ExecutorRun + 504
30  pg_stat_statements.dylib            0x00000001055e132c pgss_ExecutorRun + 212
31  postgres                            0x0000000104916580 ExecutorRun + 72
32  postgres                            0x000000010491f59c ParallelQueryMain + 508
33  postgres                            0x00000001046968d4 ParallelWorkerMain + 1712
34  postgres                            0x0000000104b476f4 BackgroundWorkerMain + 824
35  postgres                            0x0000000104b4b87c postmaster_child_launch + 492
36  postgres                            0x0000000104b56610 StartBackgroundWorker + 416
37  postgres                            0x0000000104b51378 maybe_start_bgworkers + 552
38  postgres                            0x0000000104b540cc LaunchMissingBackgroundProcesses + 1316
39  postgres                            0x0000000104b51650 ServerLoop + 616
40  postgres                            0x0000000104b4fef0 PostmasterMain + 6632
41  postgres                            0x00000001049bfe20 main + 952
42  dyld                                0x00000001849d60e0 start + 2360
2025-05-24 16:07:23.427 EDT [22120:4] LOG:  background worker "parallel worker" (PID 32355) was terminated by signal 6: Abort trap: 6
2025-05-24 16:07:23.427 EDT [22120:5] DETAIL:  Failed process was running: select count(*) from join_foo
           left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss

I've yet to see this Assert triggered on the buildfarm, but this one looks
interesting too.

(I can share the complete patch + script for such testing, if it can be helpful.)

Best regards,
Alexander Lakhin
Neon (https://neon.tech)



Re: Non-reproducible AIO failure

From: Thomas Munro
On Sun, May 25, 2025 at 3:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Can you get a core and print *ioh in the debugger?
>
> So far, I've failed to get anything useful out of core files
> from this failure.  The trace goes back no further than
>
> (lldb) bt
> * thread #1
>   * frame #0: 0x000000018de39388 libsystem_kernel.dylib`__pthread_kill + 8
>
> That's quite odd in itself: while I don't find the debugging
> environment on macOS to be the greatest, it's not normally
> this unhelpful.

(And Alexander reported the same off-list.)  It's interesting that the
elog.c backtrace stuff is able to analyse the stack, and it looks
normal AFAICS.  Could that be interfering with the stack in the core?!
I doubt it ... I kinda wonder if the debugger might be confused about
libsystem sometimes, since it has ceased to be a regular Mach-O file on
disk, but IDK; maybe gdb (from MacPorts etc.) would offer a clue?

So far we have:

TRAP: failed Assert("aio_ret->result.status != PGAIO_RS_UNKNOWN"), File: "bufmgr.c", Line: 1605, PID: 20931
0   postgres                            0x0000000105299c84 ExceptionalCondition + 108
1   postgres                            0x00000001051159ac WaitReadBuffers + 616
2   postgres                            0x00000001053611ec read_stream_next_buffer.cold.1 + 184
3   postgres                            0x0000000105111630 read_stream_next_buffer + 300
4   postgres                            0x0000000104e0b994 heap_fetch_next_buffer + 136
5   postgres                            0x0000000104e018f4 heapgettup_pagemode + 204

Hmm, looking around that code and wracking my brain for things that
might happen on one OS but not others, I wonder about partial I/Os.
Perhaps combined with some overlapping requests.

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
0   postgres                            0x0000000104f078f4 ExceptionalCondition + 236
1   postgres                            0x0000000104c0ebd4 pgaio_io_before_start + 260
2   postgres                            0x0000000104c0ea94 pgaio_io_start_readv + 36
3   postgres                            0x0000000104c2d4e8 FileStartReadV + 252
4   postgres                            0x0000000104c807c8 mdstartreadv + 668
5   postgres                            0x0000000104c83db0 smgrstartreadv + 116
But this one seems like a more basic confusion...  wild writes
somewhere?  Hmm, we need to see what's in that struct.

If we can't get a debugger to break there or a core file to be
analysable, maybe we should try logging as much info as possible at
those points to learn a bit more?  I would be digging like that myself
but I haven't seen this failure on my little M4 MacBook Air yet
(Sequoia 15.5, Apple clang-1700.0.13.3).  It is infected with
corporate security-ware that intercepts at least file system stuff and
slows it down and I can't even convince it to dump core files right
now.  Could you guys please share your exact repro steps?



Re: Non-reproducible AIO failure

From: Tom Lane
Thomas Munro <thomas.munro@gmail.com> writes:
> Could you guys please share your exact repro steps?

I've just been running 027_stream_regress.pl over and over.
It's not a recommendable answer though because the failure
probability is tiny, under 1%.  It sounded like Alexander
had a better way.
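That brute-force approach amounts to a run-until-failure loop; a generic sketch (the test invocation shown in the comment is an assumption, substitute whatever command actually drives the test):

```shell
# Run a command repeatedly until it fails, reporting how many
# iterations succeeded first. Pass the real test invocation as "$@",
# e.g.:
#   run_until_failure make -C src/test/recovery check \
#       PROVE_TESTS=t/027_stream_regress.pl
run_until_failure() {
  n=0
  while "$@"; do
    n=$((n + 1))
    echo "iteration $n ok"
  done
  echo "failed after $n successful iterations"
}
```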

            regards, tom lane



Re: Non-reproducible AIO failure

From: Andres Freund
Hi,

On 2025-05-25 20:05:49 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Could you guys please share your exact repro steps?
> 
> I've just been running 027_stream_regress.pl over and over.
> It's not a recommendable answer though because the failure
> probability is tiny, under 1%.  It sounded like Alexander
> had a better way.

Just FYI, I've been trying to reproduce this as well, without a single failure
so far. Despite running all tests for a few hundred times (~2 days) and
027_stream_regress.pl many hundreds of times (~1 day).

This is on an M4 Mac mini.  I'm wondering if there's some hardware-specific
memory ordering issue or disk-speed-based timing issue that I'm just not
hitting.

Greetings,

Andres Freund



Re: Non-reproducible AIO failure

From: Tom Lane
Thomas Munro <thomas.munro@gmail.com> writes:
> Could you please share your configure options?

The failures on indri and sifaka were during ordinary buildfarm
runs, you can check the animals' details on the website.
(Note those are same host machine, the difference is that
indri uses some MacPorts packages while sifaka is meant to be
a bare-bones-macOS build.)

On the other two, I believe I just used my normal Mac devel
configuration.  On the M4 laptop that's

$ ./configure CC="ccache gcc" CFLAGS="-O2 -fno-common" --enable-debug --enable-cassert --with-bonjour --without-icu --enable-tap-tests --with-system-tzdata=/usr/share/zoneinfo

but I see I haven't quite got the same thing on the M1 mini:

$ ./configure CC="ccache clang" --enable-debug --enable-cassert --enable-tap-tests --with-system-tzdata=/usr/share/zoneinfo

In both cases the compiler is really

$ gcc -v
Apple clang version 17.0.0 (clang-1700.0.13.3)
Target: arm64-apple-darwin24.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

I did some experimentation with different -O levels to see if that
would make the failure any more or less common.  But I've yet to
reproduce it at anything except -O2.

            regards, tom lane



Re: Non-reproducible AIO failure

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> This is on an M4 Mac mini.  I'm wondering if there's some hardware-specific
> memory ordering issue or disk-speed-based timing issue that I'm just not
> hitting.

I dunno, I've seen it on three different physical machines now
(one M1, two M4 Pros).  But it is darn hard to repro, for sure.

            regards, tom lane



Re: Non-reproducible AIO failure

From: Andres Freund
Hi,

On 2025-05-27 10:12:28 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > This is on an M4 Mac mini.  I'm wondering if there's some hardware-specific
> > memory ordering issue or disk-speed-based timing issue that I'm just not
> > hitting.
> 
> I dunno, I've seen it on three different physical machines now
> (one M1, two M4 Pros).  But it is darn hard to repro, for sure.

I just meant that it seems that I can't reproduce it for some as of yet
unknown reason. I've now been through 3k+ runs of 027_stream_regress, without
a single failure, so there has to be *something* different about my
environment than yours.

Darwin m4-dev 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024; root:xnu-11215.41.3~3/RELEASE_ARM64_T8132 arm64

cc -v
Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: arm64-apple-darwin24.1.0
Thread model: posix

I guess I'll try to update to a later version and see if it repros there?

Greetings,

Andres Freund



Re: Non-reproducible AIO failure

From: Tom Lane
Andres Freund <andres@anarazel.de> writes:
> I just meant that it seems that I can't reproduce it for some as of yet
> unknown reason. I've now been through 3k+ runs of 027_stream_regress, without
> a single failure, so there has to be *something* different about my
> environment than yours.

> Darwin m4-dev 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024; root:xnu-11215.41.3~3/RELEASE_ARM64_T8132 arm64

> cc -v
> Apple clang version 16.0.0 (clang-1600.0.26.4)
> Target: arm64-apple-darwin24.1.0
> Thread model: posix

> I guess I'll try to update to a later version and see if it repros there?

Maybe.  All the machines I've seen it on are current-software:

$ uname -a
Darwin minim4.sss.pgh.pa.us 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:53:27 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6041 arm64
$ cc -v
Apple clang version 17.0.0 (clang-1700.0.13.3)
Target: arm64-apple-darwin24.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

If it's OS-version-specific, that raises the odds IMO that this
is Apple's fault more than ours.

            regards, tom lane



Re: Non-reproducible AIO failure

From: Alexander Lakhin
Hello Thomas,

25.05.2025 05:45, Thomas Munro wrote:
>> TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
> Can you get a core and print *ioh in the debugger?
It looks like I managed to restore all the conditions needed to reproduce
that Assert more or less reliably (within a couple of hours), so I can
continue experiments.

I've added the following debugging:
+++ b/src/backend/storage/aio/aio_io.c
@@ -158,6 +158,7 @@ pgaio_io_before_start(PgAioHandle *ioh)
         Assert(ioh->state == PGAIO_HS_HANDED_OUT);
         Assert(pgaio_my_backend->handed_out_io == ioh);
         Assert(pgaio_io_has_target(ioh));
+if (ioh->op != PGAIO_OP_INVALID) elog(LOG, "!!!pgaio_io_before_start| ioh: %p, ioh->op: %d", ioh, ioh->op);
         Assert(ioh->op == PGAIO_OP_INVALID);

--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -67,6 +67,7 @@
  #include "utils/resowner.h"
  #include "utils/timestamp.h"

+#include "storage/aio_internal.h"
...
@@ -1855,7 +1860,11 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
                 pgaio_submit_staged();

                 ioh = pgaio_io_acquire(CurrentResourceOwner, &operation->io_return);
+fprintf(stderr, "!!!AsyncReadBuffers [%d]| ioh(2): %p, op: %d, blocknum: %d\n", getpid(), ioh, ((PgAioHandle *)ioh)->op, blocknum);
         }
+else
+fprintf(stderr, "!!!AsyncReadBuffers [%d]| ioh(1): %p, op: %d, blocknum: %d\n", getpid(), ioh, ((PgAioHandle *)ioh)->op, blocknum);
+
+

         /*
          * Check if we can start IO on the first to-be-read buffer.
...

(using fprintf here to rule out side effects of interrupt processing
inside elog(); it also looks like using fprintf there makes the issue
easier to reproduce)

--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -712,7 +712,7 @@ sub init
                 print $conf "wal_log_hints = on\n";
                 print $conf "hot_standby = on\n";
                 # conservative settings to ensure we can run multiple postmasters:
-               print $conf "shared_buffers = 1MB\n";
+               print $conf "shared_buffers = 512kB\n";
                 print $conf "max_connections = 10\n";

(it looks like it increases the probability of failure too, will gather
more statistics...)

And got:
...
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 121
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 122
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 123
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 124
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 125
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 0, blocknum: 126
!!!AsyncReadBuffers [97859]| ioh(1): 0x106efd540, op: 1, blocknum: 127
2025-05-30 14:38:40.812 EDT [97859:121] pg_regress/replica_identity LOG:  !!!pgaio_io_before_start| ioh: 0x106efd540, ioh->op: 1
2025-05-30 14:38:40.812 EDT [97859:122] pg_regress/replica_identity STATEMENT:  SELECT c.oid,
           n.nspname,
           c.relname
         FROM pg_catalog.pg_class c
              LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
         WHERE c.relname OPERATOR(pg_catalog.~) '^(test_replica_identity3)$' COLLATE pg_catalog.default
           AND pg_catalog.pg_table_is_visible(c.oid)
         ORDER BY 2, 3;
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 162, PID: 97859

0   postgres                            0x0000000105823878 ExceptionalCondition + 236
1   postgres                            0x000000010552ab54 pgaio_io_before_start + 452
2   postgres                            0x000000010552a954 pgaio_io_start_readv + 36
3   postgres                            0x000000010554946c FileStartReadV + 252
4   postgres                            0x000000010559c74c mdstartreadv + 668
5   postgres                            0x000000010559fd34 smgrstartreadv + 116
6   postgres                            0x00000001055326e0 AsyncReadBuffers + 1664
7   postgres                            0x00000001055317e8 StartReadBuffersImpl + 1196
8   postgres                            0x000000010553132c StartReadBuffers + 64
9   postgres                            0x000000010552e934 read_stream_start_pending_read + 1204
10  postgres                            0x000000010552df98 read_stream_look_ahead + 812
11  postgres                            0x000000010552dc50 read_stream_next_buffer + 1396
12  postgres                            0x0000000104f22af8 heap_fetch_next_buffer + 284
13  postgres                            0x0000000104f13894 heapgettup_pagemode + 192
14  postgres                            0x0000000104f13e64 heap_getnextslot + 84
15  postgres                            0x00000001052962c4 table_scan_getnextslot + 340
16  postgres                            0x00000001052960d0 SeqNext + 148
17  postgres                            0x00000001052967a4 ExecScanFetch + 772
18  postgres                            0x0000000105296378 ExecScanExtended + 164
...

And another failed run:
...
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 30
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 31
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 34
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 35
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 36
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 37
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 38
!!!AsyncReadBuffers [5043]| ioh(1): 0x102d14f10, op: 0, blocknum: 253
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 39
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 40
!!!AsyncReadBuffers [5036]| ioh(1): 0x102d15a50, op: 0, blocknum: 41
!!!AsyncReadBuffers [5043]| ioh(1): 0x102d14f10, op: 0, blocknum: 254
!!!AsyncReadBuffers [5043]| ioh(1): 0x102d14f10, op: 0, blocknum: 255
2025-05-30 22:48:00.265 EDT [5043:16] pg_regress/memoize LOG: !!!pgaio_io_before_start| ioh: 0x102d14f10, ioh->op: 1
2025-05-30 22:48:00.265 EDT [5043:17] pg_regress/memoize CONTEXT: PL/pgSQL function explain_memoize(text,boolean) line 5 at FOR over EXECUTE statement
2025-05-30 22:48:00.265 EDT [5043:18] pg_regress/memoize STATEMENT: SELECT explain_memoize('
         SELECT COUNT(*), AVG(t1.twenty) FROM tenk1 t1 LEFT JOIN
         LATERAL (SELECT t1.twenty AS c1, t2.unique1 AS c2, t2.two FROM tenk1 t2) s
         ON t1.two = s.two
         WHERE s.c1 = s.c2 AND t1.unique1 < 1000;', false);
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 162, PID: 5043
...
0   postgres                            0x000000010163b878 ExceptionalCondition + 236
1   postgres                            0x0000000101342b54 pgaio_io_before_start + 452
2   postgres                            0x0000000101342954 pgaio_io_start_readv + 36
3   postgres                            0x000000010136146c FileStartReadV + 252
4   postgres                            0x00000001013b474c mdstartreadv + 668
5   postgres                            0x00000001013b7d34 smgrstartreadv + 116
6   postgres                            0x000000010134a6e0 AsyncReadBuffers + 1664
7   postgres                            0x00000001013497e8 StartReadBuffersImpl + 1196
8   postgres                            0x000000010134932c StartReadBuffers + 64
9   postgres                            0x0000000101346934 read_stream_start_pending_read + 1204
10  postgres                            0x0000000101345f98 read_stream_look_ahead + 812
11  postgres                            0x0000000101345c50 read_stream_next_buffer + 1396
12  postgres                            0x0000000100d3aaf8 heap_fetch_next_buffer + 284
13  postgres                            0x0000000100d2b894 heapgettup_pagemode + 192
14  postgres                            0x0000000100d2be64 heap_getnextslot + 84
15  postgres                            0x00000001010ae2c4 table_scan_getnextslot + 340
16  postgres                            0x00000001010ae0d0 SeqNext + 148
...

I run the same loop as shown at [1], but with a random number of parallel
processes: np=$((1 + $RANDOM % 5)), and as far as I can see the failure is
more likely with np == 5. I also tried running stress-ng --hdd 1 --timeout
60m concurrently, but I'm still not sure whether it affects reproduction
directly.
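A sketch of that randomized driver (the per-instance worker below is a stand-in; per the thread, the real command would run t/027_stream_regress.pl in each copied recovery_N directory):

```shell
# Pick a random degree of parallelism between 1 and 5, as in the loop
# described above ($RANDOM is a bash-ism; in plain sh np falls back to 1).
np=$((1 + RANDOM % 5))
echo "launching $np parallel instances"
for i in $(seq 1 "$np"); do
  # Stand-in worker; the real run would be something like:
  #   (cd "src/test/recovery_$i" && prove t/027_stream_regress.pl)
  ( echo "instance $i done" ) &
done
wait
echo "all instances finished"
```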
I use a MacBook M1 and don't know how to disable E/P-cores. Moreover,
from my observation, the test performs slower (up to 10%) when it runs
for hours in a loop, probably because the CPU heats up and lowers its
frequency, so if the issue is really timing-dependent, it's not that
easy to get more or less the same timings again (I tried adding a sleep
into the loop to let the CPU cool down...).

[1] https://www.postgresql.org/message-id/87673f76-1ed7-44dc-8e8b-6aaf78e2817a%40gmail.com
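For illustration, the randomized parallelism described above can be sketched as follows; the test body itself is the loop from [1] and is represented here only by a hypothetical placeholder function:

```shell
# Sketch only: run_test stands in for the actual test commands from [1].
run_test() { :; }

np=$((1 + RANDOM % 5))   # picks 1..5 parallel processes
echo "np=$np"
for i in $(seq "$np"); do
    run_test "$i" &
done
wait
```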

Best regards,
Alexander Lakhin
Neon (https://neon.tech)



Re: Non-reproducible AIO failure

From
Alexander Lakhin
Date:
31.05.2025 06:00, Alexander Lakhin wrote:
> Hello Thomas,
>
> It looks like I managed to restore all the conditions needed to reproduce
> that Assert more or less reliably (within a couple of hours), so I can
> continue experiments.
>
> I've added the following debugging:
> ...

With additional logging (the patch is attached), I can see the following:
!!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2
!!!AsyncReadBuffers [63817] (1)| blocknum: 15, ioh: 0x1046b5660, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0
!!!pgaio_io_start_readv [63817]| ioh: 0x1046b5660, ioh->op: 1
!!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2
!!!AsyncReadBuffers [63817] (1)| blocknum: 16, ioh: 0x1046b5660, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0
!!!pgaio_io_start_readv [63817]| ioh: 0x1046b5660, ioh->op: 1
!!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2
!!!AsyncReadBuffers [63817] (1)| blocknum: 17, ioh: 0x1046b5660, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0
!!!pgaio_io_start_readv [63817]| ioh: 0x1046b5660, ioh->op: 1
!!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2
!!!AsyncReadBuffers [63817] (1)| blocknum: 18, ioh: 0x1046b5660, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0
2025-06-02 00:45:11.391 EDT [63817:14] pg_regress/brin LOG: !!!pgaio_io_before_start| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 63817
0   postgres                            0x0000000102fdb60c ExceptionalCondition + 236
1   postgres                            0x0000000102ce233c pgaio_io_before_start + 504
2   postgres                            0x0000000102ce20a8 pgaio_io_start_readv + 36
3   postgres                            0x0000000102d01200 FileStartReadV + 252
4   postgres                            0x0000000102d544e0 mdstartreadv + 668
5   postgres                            0x0000000102d57ac8 smgrstartreadv + 116
6   postgres                            0x0000000102cea474 AsyncReadBuffers + 1996
7   postgres                            0x0000000102ce9430 StartReadBuffersImpl + 1196
8   postgres                            0x0000000102ce8f74 StartReadBuffers + 64
9   postgres                            0x0000000102ce657c read_stream_start_pending_read + 1204
10  postgres                            0x0000000102ce5c60 read_stream_look_ahead + 812
11  postgres                            0x0000000102ce57f8 read_stream_next_buffer + 2356
12  postgres                            0x00000001026da1b8 heap_fetch_next_buffer + 284
13  postgres                            0x00000001026caf54 heapgettup_pagemode + 192
14  postgres                            0x00000001026cb524 heap_getnextslot + 84
15  postgres                            0x0000000102a4d984 table_scan_getnextslot + 340
16  postgres                            0x0000000102a4d790 SeqNext + 148
17  postgres                            0x0000000102a4de64 ExecScanFetch + 772
18  postgres                            0x0000000102a4da38 ExecScanExtended + 164
19  postgres                            0x0000000102a4d374 ExecSeqScanWithQualProject + 248
20  postgres                            0x0000000102a13248 ExecProcNode + 68
21  postgres                            0x0000000102a12e70 fetch_input_tuple + 180
22  postgres                            0x0000000102a12bb4 agg_retrieve_direct + 1664
23  postgres                            0x0000000102a0f914 ExecAgg + 240
...
2025-06-02 00:45:11.409 EDT [57192:4] LOG:  client backend (PID 63817) was terminated by signal 6: Abort trap: 6
2025-06-02 00:45:11.409 EDT [57192:5] DETAIL:  Failed process was running: DO $x$
         DECLARE
                 r record;
                 r2 record;
                 cond text;
                 idx_ctids tid[];
                 ss_ctids tid[];
                 count int;
                 plan_ok bool;
                 plan_line text;
         BEGIN
...

Best regards,
Alexander Lakhin
Neon (https://neon.tech)
Attachment

Re: Non-reproducible AIO failure

From
Andres Freund
Date:
Hi,

Thanks for working on investigating this.


On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
> 02.06.2025 09:00, Alexander Lakhin wrote:
> > With additional logging (the patch is attached), I can see the following:
> > ...
> > !!!pgaio_io_reclaim [63817]| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2
> > !!!AsyncReadBuffers [63817] (1)| blocknum: 18, ioh: 0x1046b5660,
> > ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0
> > 2025-06-02 00:45:11.391 EDT [63817:14] pg_regress/brin LOG:
> > !!!pgaio_io_before_start| ioh: 0x1046b5660, ioh->op: 1, ioh->state: 1,
> > ioh->result: 0, ioh->num_callbacks: 2
> > 
> > TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 63817
> 
> I've added logging of the ioh->generation field at the beginning and at
> the end of pgaio_io_reclaim() (the complete patch is attached) and got
> even more interesting output:
> ...
> !!!pgaio_io_reclaim [25175]| ioh: 0x104c3e1a0, ioh->op: 1, ioh->state: 6,
> ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 21692
> !!!pgaio_io_reclaim [25175]| ioh: 0x104c3e1a0, ioh->op: 0, ioh->generation: 21693
> !!!AsyncReadBuffers [25175] (1)| blocknum: 40, ioh: 0x104c3e1a0, ioh->op: 0,
> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 21693
> !!!pgaio_io_start_readv [25175]| ioh: 0x104c3e1a0, ioh->op: 1, ioh->generation: 21693

IO starts.


> !!!pgaio_io_reclaim [25176]| ioh: 0x104c3e080, ioh->op: 1, ioh->state: 6,
> ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 21408
> !!!pgaio_io_reclaim [25176]| ioh: 0x104c3e080, ioh->op: 0, ioh->generation: 21409
> !!!AsyncReadBuffers [25176] (1)| blocknum: 49, ioh: 0x104c3e080, ioh->op: 0,
> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 21409
> !!!pgaio_io_start_readv [25176]| ioh: 0x104c3e080, ioh->op: 1, ioh->generation: 21409

(other backend / other IO)


> !!!pgaio_io_reclaim [25175]| ioh: 0x104c3e1a0, ioh->op: 1, ioh->state: 6,
> ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 21693
> !!!pgaio_io_reclaim [25175]| ioh: 0x104c3e1a0, ioh->op: 0, ioh->generation: 21694

The IO completes and is marked as idle, as we'd expect.


> !!!AsyncReadBuffers [25175] (1)| blocknum: 41, ioh: 0x104c3e1a0, ioh->op: 0,
> ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 21694

And here, before starting a new IO, the op is 0 (i.e. PGAIO_OP_INVALID), as we
would expect.


> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh:
> 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks:
> 2, ioh->generation: 21694

But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any
"normal" reason for that. We know that the IO wasn't actually started as
otherwise pgaio_io_start_readv() would have logged that fact.

Which then immediately leads to an assert failure:

> 2025-06-03 00:19:09.282 EDT [25175:2] STATEMENT:  explain (analyze, format 'json')
>           select count(*) from join_foo
>             left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss
>             on join_foo.id < ss.id + 1 and join_foo.id > ss.id - 1;
> 
> TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 25175
> 
> 0   postgres                            0x0000000103563550 ExceptionalCondition + 236
> 1   postgres                            0x000000010326a260 pgaio_io_before_start + 516
> 2   postgres                            0x0000000103269fb4 pgaio_io_start_readv + 36
> 3   postgres                            0x0000000103289144 FileStartReadV + 252
> 4   postgres                            0x00000001032dc424 mdstartreadv + 668
> 5   postgres                            0x00000001032dfa0c smgrstartreadv + 116
> 6   postgres                            0x00000001032723b8 AsyncReadBuffers + 2028
> 7   postgres                            0x0000000103271354 StartReadBuffersImpl + 1196
> 8   postgres                            0x0000000103270e98 StartReadBuffers + 64
> 9   postgres                            0x000000010326e4a0 read_stream_start_pending_read + 1204
> 10  postgres                            0x000000010326db84 read_stream_look_ahead + 812
> 11  postgres                            0x000000010326d71c read_stream_next_buffer + 2356

I don't see how this can happen.

There is no Mac-specific code in AIO as of now, so why aren't we seeing this
on any other platform?

IME the theory that there is something outside of a subsystem randomly
overwriting memory is almost always wrong, but I'm kinda starting to suspect
it here nonetheless.


I wonder if it's worth trying to put a fair bit of "padding" memory
before/after AIO's shared memory region(s). If that were to stop the crashes,
it'd likely be something random writing there. Unfortunately continuing to
crash wouldn't necessarily guarantee it's aio specific.

Hm. We had issues with wrong size calculations of shmem regions before, what
about writing a patch that puts mprotect(PROT_NONE) pages before/after each
ShmemAlloc'd region?

Greetings,

Andres Freund



Re: Non-reproducible AIO failure

From
Thomas Munro
Date:
On Thu, Jun 5, 2025 at 8:02 AM Andres Freund <andres@anarazel.de> wrote:
> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
> > 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh:
> > 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks:
> > 2, ioh->generation: 21694
>
> But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any
> "normal" reason for that. We know that the IO wasn't actually started as
> otherwise pgaio_io_start_readv() would have logged that fact.

A cheap/easy thing to try, maybe: give that enumeration more
distinctive values like 0xaaaa, 0xbbbb, or whatever, to see if we get
a wild 1 here?



Re: Non-reproducible AIO failure

From
Alexander Lakhin
Date:
Hello Thomas and Andres,

04.06.2025 23:32, Thomas Munro wrote:
> On Thu, Jun 5, 2025 at 8:02 AM Andres Freund <andres@anarazel.de> wrote:
>> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
>>> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh:
>>> 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks:
>>> 2, ioh->generation: 21694
>> But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any
>> "normal" reason for that. We know that the IO wasn't actually started as
>> otherwise pgaio_io_start_readv() would have logged that fact.
> A cheap/easy thing to try, maybe: give that enumeration more
> distinctive values like 0xaaaa, 0xbbbb, or whatever, to see if we get
> a wild 1 here?

Thank you for your attention to this and for the tip! Today I tried the
following:
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -89,8 +89,8 @@ typedef enum PgAioOp
         /* intentionally the zero value, to help catch zeroed memory etc */
         PGAIO_OP_INVALID = 0,

-       PGAIO_OP_READV,
-       PGAIO_OP_WRITEV,
+       PGAIO_OP_READV = 0xaa,
+       PGAIO_OP_WRITEV = 0xbb,

And got:
!!!AsyncReadBuffers [68292] (1)| blocknum: 9, ioh: 0x102b26230, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 13038
!!!pgaio_io_start_readv [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->generation: 13038
!!!pgaio_io_reclaim [67729]| ioh: 0x102b25540, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 47468
!!!pgaio_io_reclaim [67729]| ioh: 0x102b25540, ioh->op: 0, ioh->generation: 47469
!!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 13038
!!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 0, ioh->generation: 13039
!!!AsyncReadBuffers [68292] (1)| blocknum: 10, ioh: 0x102b26230, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 13039
!!!pgaio_io_start_readv [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->generation: 13039
!!!pgaio_io_reclaim [67718]| ioh: 0x102b25390, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 35821
!!!pgaio_io_reclaim [67718]| ioh: 0x102b25390, ioh->op: 0, ioh->generation: 35822
!!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 13039
!!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 0, ioh->generation: 13040
!!!AsyncReadBuffers [68292] (1)| blocknum: 11, ioh: 0x102b26230, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 13040
2025-06-05 04:15:38.847 EDT [68292:1] LOG: !!!pgaio_io_before_start| ioh: 0x102b26230, ioh->op: 170, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 13040
### I reproduced the failure a few times and observed "ioh->op: 170" each time. ###
2025-06-05 04:15:38.847 EDT [68292:2] STATEMENT:  select count(*) from simple r join extremely_skewed s using (id);
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 68292

0   postgres                            0x000000010144b550 ExceptionalCondition + 236
1   postgres                            0x0000000101152260 pgaio_io_before_start + 516
2   postgres                            0x0000000101151fb4 pgaio_io_start_readv + 36
3   postgres                            0x0000000101171144 FileStartReadV + 252
4   postgres                            0x00000001011c4424 mdstartreadv + 668
5   postgres                            0x00000001011c7a0c smgrstartreadv + 116
6   postgres                            0x000000010115a3b8 AsyncReadBuffers + 2028
7   postgres                            0x0000000101159354 StartReadBuffersImpl + 1196
8   postgres                            0x0000000101158e98 StartReadBuffers + 64
9   postgres                            0x00000001011564a0 read_stream_start_pending_read + 1204
10  postgres                            0x0000000101155b84 read_stream_look_ahead + 812
11  postgres                            0x000000010115571c read_stream_next_buffer + 2356
12  postgres                            0x0000000100b4a04c heap_fetch_next_buffer + 284
13  postgres                            0x0000000100b3ade8 heapgettup_pagemode + 192
14  postgres                            0x0000000100b3b3b8 heap_getnextslot + 84
15  postgres                            0x0000000100ebd818 table_scan_getnextslot + 340
16  postgres                            0x0000000100ebd624 SeqNext + 148
17  postgres                            0x0000000100ebdcf8 ExecScanFetch + 772
18  postgres                            0x0000000100ebd8cc ExecScanExtended + 164
19  postgres                            0x0000000100ebd004 ExecSeqScanWithProject + 244
20  postgres                            0x0000000100e970f4 ExecProcNode + 68

I also tried reproducing this with -DFDDEBUG, but no luck so far, probably
due to the significant change in the test duration.

Best regards,
Alexander Lakhin



Re: Non-reproducible AIO failure

From
Matthias van de Meent
Date:
On Thu, 5 Jun 2025 at 21:00, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello Thomas and Andres,
>
> 04.06.2025 23:32, Thomas Munro wrote:
> > On Thu, Jun 5, 2025 at 8:02 AM Andres Freund <andres@anarazel.de> wrote:
> >> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
> >>> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh:
> >>> 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks:
> >>> 2, ioh->generation: 21694
> >> But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any
> >> "normal" reason for that. We know that the IO wasn't actually started as
> >> otherwise pgaio_io_start_readv() would have logged that fact.
> > A cheap/easy thing to try, maybe: give that enumeration more
> > distinctive values like 0xaaaa, 0xbbbb, or whatever, to see if we get
> > a wild 1 here?

I have a very wild guess that's probably wrong in a weird way, but
here goes anyway:

Did anyone test if interleaving the enum-typed bitfield fields of
PgAioHandle with the uint8 fields might solve the issue? I.e. {state,
flags, target, num_callbacks, op} instead of {state, target, op,
flags, num_callbacks}?

I've looked at some of the decompiled code generated for
pgaio_io_before_start and pgaio_io_start_readv. pgaio_io_start_readv
looks rather normal, but it's clear that pgaio_io_before_start does some
weird things in response to the bitfields of PgAioHandle:

(manually annotated)

_pgaio_io_before_start:
00000001002eaba8    stp    x20, x19, [sp, #-0x20]!     /* save registers */
00000001002eabac    stp    x29, x30, [sp, #0x10]       /* save registers */
00000001002eabb0    add    x29, sp, #0x10              /* x29 := stack+10 */
00000001002eabb4    ldrb    w8, [x0]                /* w8 := ioh->state */
00000001002eabb8    cmp    w8, #0x1                    /* w8 ?? PGAIO_HS_HANDED_OUT */
00000001002eabbc    b.ne    0x1002eac38

w8 here will contain the value 0x1.

00000001002eabc0    mov    x19, x0                     /* x19 := ioh */
00000001002eabc4    adrp    x8, 1461 ; 0x10089f000  /* x8 := 0x10089f000 */
00000001002eabc8    add    x8, x8, #0xd40              /* x8 += 0xd40 (0x10089fd40) */
00000001002eabcc    ldr    x8, [x8]                    /* x8 := pgaio_my_backend */
00000001002eabd0    ldr    x8, [x8, #0x20]             /* x8 := pgaio_my_backend->handed_out_io */
00000001002eabd4    cmp    x8, x0                      /* x8 (pgaio_my_backend->handed_out_io) ?? x19 (ioh) */
00000001002eabd8    b.ne    0x1002eac50
00000001002eabdc    mov    x0, x19                     /* x0 := ioh */
00000001002eabe0    bl    _pgaio_io_has_target
00000001002eabe4    tbz    w0, #0x0, 0x1002eac68       /* _pgaio_io_has_target(ioh)? */
00000001002eabe8    ldrb    w8, [x19, #0x2]         /* w8 := ioh->op */

w8 now contains ioh->op, which is presumed to be 0x0.

00000001002eabec    ldrh    w9, [x19]               /* w9 := ioh->[target | state] */
00000001002eabf0    orr    w8, w9, w8, lsl #16         /* w8 := w9 | (w8 << 16) // [ op | [target | state] ] */

But here more operations are applied, pushing in the preceding
bitfields of PgAioHandle, mixing in various values into w8.

00000001002eabf4    cmp    w8, #0x10, lsl #12          /* w8 ?? 0x1'00'00 // op'state'target ?? 0x1'00'00 */

And finally the full set of bitfields is compared against a target
value, rather than just ioh->op. This means we do 2 memory accesses
for this field.

00000001002eabf8    b.hs    0x1002eac80             /* >= ?? ERROR : continue */

And finally we branch into the setup for _ExceptionalCondition.

This shows that ioh->op is not exactly a stand-alone field, and might be
influenced by its sibling bitfield fields. If we changed the order of
the fields so that the bitfields are not contiguous, that might convince
the compiler that there are no data dependencies between
state/target/op, and remove the two-phase load operation.

(checks more decompiled code)

Oh, the bitfields are not just read with multiple accesses, but also
*written* using multiple accesses. See pgaio_io_set_target:

_pgaio_io_set_target:
00000001002eafdc    stp    x29, x30, [sp, #-0x10]!
00000001002eafe0    mov    x29, sp
00000001002eafe4    ldrb    w8, [x0, #0x2]
00000001002eafe8    ldrh    w9, [x0]
00000001002eafec    orr    w8, w9, w8, lsl #16
00000001002eaff0    and    w9, w8, #0xff
00000001002eaff4    cmp    w9, #0x1
00000001002eaff8    b.ne    0x1002eb01c
00000001002eaffc    tst    w8, #0xff00
00000001002eb000    b.ne    0x1002eb034
00000001002eb004    lsr    w9, w8, #16
00000001002eb008    bfi    w8, w1, #8, #24
00000001002eb00c    strb    w9, [x0, #0x2] /* ioh->op = w9 */
00000001002eb010    strh    w8, [x0]           /* ioh->[state, target] = w8 */

... and pgaio_io_update_state:

_pgaio_io_update_state:
00000001002e7f08    sub    sp, sp, #0x70
00000001002e7f0c    stp    x24, x23, [sp, #0x30]
00000001002e7f10    stp    x22, x21, [sp, #0x40]
00000001002e7f14    stp    x20, x19, [sp, #0x50]
00000001002e7f18    stp    x29, x30, [sp, #0x60]
00000001002e7f1c    add    x29, sp, #0x60
00000001002e7f20    mov    x19, x1
00000001002e7f24    mov    x20, x0
00000001002e7f28    mov    w0, #0xa
00000001002e7f2c    mov    x1, #0x0
00000001002e7f30    bl    _errstart
00000001002e7f34    cbz    w0, 0x1002e8010
[...] /* skip pgaio_debug_io log handling */
00000001002e8010    dmb    ish
00000001002e8014    ldurh    w8, [x20, #0x1]
00000001002e8018    bfi    w19, w8, #8, #16
00000001002e801c    lsr    w8, w8, #8
00000001002e8020    strb    w8, [x20, #0x2] /* store ioh->op */
00000001002e8024    strh    w19, [x20]        /* store ioh->{state, target} */

I'm sending this now without looking much deeper into this due to
$err_date_changed and other time-related constraints, but might this
be a candidate for further debugging research?


Kind regards,

Matthias van de Meent
Databricks Inc.



Re: Non-reproducible AIO failure

From
Tom Lane
Date:
Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
> I have a very wild guess that's probably wrong in a weird way, but
> here goes anyway:
> Did anyone test if interleaving the enum-typed bitfield fields of
> PgAioHandle with the uint8 fields might solve the issue?

Ugh.  I think you probably nailed it.

IMO all those struct fields better be declared uint8.

            regards, tom lane



Re: Non-reproducible AIO failure

From
Alexander Lakhin
Date:
Hello,

05.06.2025 22:00, Alexander Lakhin wrote:
> Thank you for your attention to this and for the tip! Today I tried the
> following:
> --- a/src/include/storage/aio.h
> +++ b/src/include/storage/aio.h
> @@ -89,8 +89,8 @@ typedef enum PgAioOp
>         /* intentionally the zero value, to help catch zeroed memory etc */
>         PGAIO_OP_INVALID = 0,
>
> -       PGAIO_OP_READV,
> -       PGAIO_OP_WRITEV,
> +       PGAIO_OP_READV = 0xaa,
> +       PGAIO_OP_WRITEV = 0xbb,

And one more bit of information from me, if you're not tired of it yet:
I've added:
@@ -2238,6 +2240,7 @@ FileStartReadV(PgAioHandle *ioh, File file,

         vfdP = &VfdCache[file];

+if (ioh->op != PGAIO_OP_INVALID) elog(LOG, "!!!FileStartReadV| ioh: %p, ioh->op: %d, ioh->state: %d, ioh->result: %d, ioh->num_callbacks: %d, ioh->generation: %llu", ioh, ((PgAioHandle *)ioh)->op, ((PgAioHandle *)ioh)->state, ((PgAioHandle *)ioh)->result, ((PgAioHandle *)ioh)->num_callbacks, ((PgAioHandle *)ioh)->generation);
         pgaio_io_start_readv(ioh, vfdP->fd, iovcnt, offset);

and got:
...
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29200
!!!AsyncReadBuffers [91056] (1)| blocknum: 70, ioh: 0x1066d15d0, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 29200
!!!pgaio_io_start_readv [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->generation: 29200
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 29200
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29201
!!!AsyncReadBuffers [91056] (1)| blocknum: 71, ioh: 0x1066d15d0, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 29201
!!!pgaio_io_start_readv [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->generation: 29201
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 29201
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29202
!!!AsyncReadBuffers [91056] (1)| blocknum: 72, ioh: 0x1066d15d0, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 29202
!!!pgaio_io_start_readv [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->generation: 29202
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 29202
!!!pgaio_io_reclaim [91056]| ioh: 0x1066d15d0, ioh->op: 0, ioh->generation: 29203
!!!AsyncReadBuffers [91056] (1)| blocknum: 73, ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 29203
2025-06-05 19:52:44.174 EDT [91056:14] pg_regress/brin LOG: !!!FileStartReadV| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 29203
2025-06-05 19:52:44.174 EDT [91056:15] pg_regress/brin CONTEXT:  SQL statement "SELECT array_agg(ctid) FROM brintest WHERE cidrcol = '10.2.14/24'::inet "
         PL/pgSQL function inline_code_block line 52 at EXECUTE
2025-06-05 19:52:44.174 EDT [91056:16] pg_regress/brin STATEMENT: DO $x$
...
2025-06-05 19:52:44.174 EDT [91056:17] pg_regress/brin LOG: !!!pgaio_io_before_start| ioh: 0x1066d15d0, ioh->op: 170, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 29203
2025-06-05 19:52:44.174 EDT [91056:18] pg_regress/brin CONTEXT:  SQL statement "SELECT array_agg(ctid) FROM brintest WHERE cidrcol = '10.2.14/24'::inet "
         PL/pgSQL function inline_code_block line 52 at EXECUTE
2025-06-05 19:52:44.174 EDT [91056:19] pg_regress/brin STATEMENT: DO $x$
         DECLARE
                 r record;
                 r2 record;
                 cond text;
                 idx_ctids tid[];
                 ss_ctids tid[];
                 count int;
                 plan_ok bool;
                 plan_line text;
         BEGIN
                 FOR r IN SELECT colname, oper, typ, value[ordinality], matches[ordinality] FROM brinopers, unnest(op)

WITH ORDINALITY AS oper LOOP
...
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 164, PID: 91056
0   postgres                            0x0000000104ff771c ExceptionalCondition + 236
1   postgres                            0x0000000104cfe32c pgaio_io_before_start + 516
2   postgres                            0x0000000104cfe080 pgaio_io_start_readv + 36
3   postgres                            0x0000000104d1d310 FileStartReadV + 508
4   postgres                            0x0000000104d705f0 mdstartreadv + 668
5   postgres                            0x0000000104d73bd8 smgrstartreadv + 116
6   postgres                            0x0000000104d06484 AsyncReadBuffers + 2028
7   postgres                            0x0000000104d05420 StartReadBuffersImpl + 1196
8   postgres                            0x0000000104d04f64 StartReadBuffers + 64
9   postgres                            0x0000000104d0256c read_stream_start_pending_read + 1204
10  postgres                            0x0000000104d01c50 read_stream_look_ahead + 812
11  postgres                            0x0000000104d017e8 read_stream_next_buffer + 2356
12  postgres                            0x00000001046f661c heap_fetch_next_buffer + 284
13  postgres                            0x00000001046e73b8 heapgettup_pagemode + 192
14  postgres                            0x00000001046e7988 heap_getnextslot + 84

Thus, this time ioh->op switched to PGAIO_OP_READV even earlier, before
the logging in AsyncReadBuffers(). By the way, this is not the first time
the "DO .. FOR ... FROM brinopers, unnest(op) ..." statement has triggered
the assertion for me (I've counted 3 such failures in my notes).

Best regards,
Alexander Lakhin



Re: Non-reproducible AIO failure

From
Konstantin Knizhnik
Date:
On 06/06/2025 2:31 am, Tom Lane wrote:
> Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
>> I have a very wild guess that's probably wrong in a weird way, but
>> here goes anyway:
>> Did anyone test if interleaving the enum-typed bitfield fields of
>> PgAioHandle with the uint8 fields might solve the issue?
> Ugh.  I think you probably nailed it.
>
> IMO all those struct fields better be declared uint8.
>
>             regards, tom lane

I also think that it can be a compiler issue. Bitfields with different enum
types look really exotic, so it's no wonder the optimizer can do something
strange here.
I failed to reproduce the problem with an old version of clang (15.0). Also,
as far as I understand, nobody was able to reproduce the problem with
optimizations disabled (-O0).
That definitely doesn't mean there is a bug in the optimizer - the timing
may just have changed.

Still, it is not quite clear to me how `PGAIO_OP_READV` manages to get
written.
There is just one place in the code where it is assigned:

```
pgaio_io_start_readv(PgAioHandle *ioh,
                      int fd, int iovcnt, uint64 offset)
{
     ...

     pgaio_io_stage(ioh, PGAIO_OP_READV);
}

```

and `pgaio_io_stage` should update both `state` and `op`:

```
     ioh->op = op;
     ioh->result = 0;

     pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
```

But as we see from the trace, the state is still PGAIO_HS_HANDED_OUT, so it
was not updated.

If there is some bug in the optimizer which incorrectly constructs the mask
for the bitfield assignment, it is still not clear where it managed to get
this PGAIO_OP_READV.
And we can be sure that it really is PGAIO_OP_READV and not just arbitrary
garbage, because Alexander replaced its value with 0xaa and we see in the
logs that this value is really stored.

If there were a race condition in `pgaio_io_update_state` (which enforces a
memory barrier before updating the state), then, for example, inserting some
sleep between the assignment and the state update should increase the
probability of the error. But it doesn't. Also, as far as I understand,
op is updated and read by the same backend, so it should not be a
synchronization issue.

So most likely it is a bug in the optimizer which generates incorrect code.
Can Alexander, or somebody else who is able to reproduce the problem,
share the assembler code of the `pgaio_io_reclaim` function?
I am not sure that the bug is in this function - but it is the prime
suspect. Only `pgaio_io_start_readv` can set PGAIO_OP_READV, but we are
almost sure that it was not called.
So it looks like `op` was not cleared, despite what we see in the logs.
But if there were incorrect code in `pgaio_io_reclaim`, then it should
always work incorrectly - never clearing "op" - yet in most cases it works...








Re: Non-reproducible AIO failure

From
Konstantin Knizhnik
Date:

There is really essential difference in code generated by clang 15 (working) and 16 (not working).

```

pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)

{

     ...

     HOLD_INTERRUPTS();

    ioh->op = op;
    ioh->result = 0;

    pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
    ...

}

```

clang15:

```

postgres[0x10035d4ac] <+84>:  add    x22, x22, #0x5d4          ; InterruptHoldoffCount
postgres[0x10035d4b0] <+88>:  ldr    w8, [x22]
postgres[0x10035d4b4] <+92>:  add    w8, w8, #0x1
postgres[0x10035d4b8] <+96>:  str    w8, [x22]
postgres[0x10035d4bc] <+100>: strb   w20, [x19, #0x2]
postgres[0x10035d4c0] <+104>: str    wzr, [x19, #0x14]
postgres[0x10035d4c4] <+108>: mov    x0, x19
postgres[0x10035d4c8] <+112>: mov    w1, #0x2
postgres[0x10035d4cc] <+116>: bl     0x10035cbdc               ; pgaio_io_update_state at aio.c:384
```

clang16:

```

postgres[0x100699e8c] <+200>: add    x9, x9, #0xd74 ; InterruptHoldoffCount
postgres[0x100699e90] <+204>: ldr    w8, [x9]
postgres[0x100699e94] <+208>: mov    w10, #0x1 ; =1
postgres[0x100699e98] <+212>: stur   w10, [x29, #-0x18]
postgres[0x100699e9c] <+216>: add    w8, w8, #0x1
postgres[0x100699ea0] <+220>: str    w8, [x9]
postgres[0x100699ea4] <+224>: ldur   w10, [x29, #-0xc]
postgres[0x100699ea8] <+228>: ldur   x9, [x29, #-0x8]
postgres[0x100699eac] <+232>: ldrh   w8, [x9]
postgres[0x100699eb0] <+236>: and    w10, w10, #0xff
postgres[0x100699eb4] <+240>: orr    w10, w8, w10, lsl #16
postgres[0x100699eb8] <+244>: lsr    w8, w10, #16
postgres[0x100699ebc] <+248>: strh   w10, [x9]
postgres[0x100699ec0] <+252>: strb   w8, [x9, #0x2]
postgres[0x100699ec4] <+256>: ldur   x8, [x29, #-0x8]
postgres[0x100699ec8] <+260>: str    wzr, [x8, #0x14]
postgres[0x100699ecc] <+264>: ldur   x0, [x29, #-0x8]
postgres[0x100699ed0] <+268>: mov    w1, #0x2 ; =2
postgres[0x100699ed4] <+272>: bl     0x100698efc    ; pgaio_io_update_state at aio.c:384
```

As you can see, clang 15 just updates the third byte of the structure (the `op` field). It is not affecting the `state` field at all.

While clang 16 also updates the `state` field! And it looks like that store is moved before the memory barrier (in the `pgaio_io_update_state` function),