Re: Non-reproducible AIO failure - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: Non-reproducible AIO failure
Date
Msg-id f975aa25-eddd-45ca-90b5-855338ed75d2@gmail.com
Whole thread Raw
In response to Non-reproducible AIO failure  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hello Thomas,

24.05.2025 14:42, Thomas Munro wrote:
> On Sat, May 24, 2025 at 3:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So it seems that "very low-probability issue in our Mac AIO code" is
>> the most probable description.
> There isn't any macOS-specific AIO code so my first guess would be
> that it might be due to aarch64 weak memory reordering (though Andres
> speculated that itt should all be one backend, huh), if it's not just
> a timing luck thing.  Alexander, were the other OSes you tried all on
> x86?

As I wrote off-list before, I had tried x86_64 only, but since then I
tried to reproduce the issue on an aarch64 server with Ubuntu 24.04,
running 10, then 40 instances of t/027_stream_regress.pl in parallel. I've
also multiplied "test: brin ..." line x10. But the issue is still not
reproduced (in 8+ hours).

However, I've managed to get an AIO-related assertion failure on macOS 14.5
(Darwin Kernel Version 23.5.0) quite easily, running 027_stream_regress.pl
the same way:
2       # +++ tap check in src/test/recovery_2 +++
3       # +++ tap check in src/test/recovery_3 +++
4       # +++ tap check in src/test/recovery_4 +++
5       # +++ tap check in src/test/recovery_5 +++
1       # +++ tap check in src/test/recovery_1 +++
7       # +++ tap check in src/test/recovery_7 +++
6       # +++ tap check in src/test/recovery_6 +++
8       # +++ tap check in src/test/recovery_8 +++
1       [16:07:31] t/027_stream_regress.pl ..
1       Dubious, test returned 29 (wstat 7424, 0x1d00)
1       All 1 subtests passed
1       [16:07:31]
1
1       Test Summary Report
1       -------------------
1       t/027_stream_regress.pl (Wstat: 7424 Tests: 1 Failed: 0)
1         Non-zero exit status: 29
1         Parse errors: No plan found in TAP output
1       Files=1, Tests=1, 197 wallclock secs ( 0.01 usr  0.01 sys + 4.47 cusr  5.79 csys = 10.28 CPU)
1       Result: FAIL

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 161, PID: 32355
0   postgres                            0x0000000104f078f4 ExceptionalCondition + 236
1   postgres                            0x0000000104c0ebd4 pgaio_io_before_start + 260
2   postgres                            0x0000000104c0ea94 pgaio_io_start_readv + 36
3   postgres                            0x0000000104c2d4e8 FileStartReadV + 252
4   postgres                            0x0000000104c807c8 mdstartreadv + 668
5   postgres                            0x0000000104c83db0 smgrstartreadv + 116
6   postgres                            0x0000000104c1675c AsyncReadBuffers + 1444
7   postgres                            0x0000000104c15868 StartReadBuffersImpl + 1196
8   postgres                            0x0000000104c153ac StartReadBuffers + 64
9   postgres                            0x0000000104c129b4 read_stream_start_pending_read + 1204
10  postgres                            0x0000000104c12018 read_stream_look_ahead + 812
11  postgres                            0x0000000104c11cd0 read_stream_next_buffer + 1396
12  postgres                            0x0000000104606c38 heap_fetch_next_buffer + 284
13  postgres                            0x00000001045f79d4 heapgettup_pagemode + 192
14  postgres                            0x00000001045f7fa4 heap_getnextslot + 84
15  postgres                            0x000000010497a404 table_scan_getnextslot + 340
16  postgres                            0x000000010497a210 SeqNext + 148
17  postgres                            0x000000010497a8e4 ExecScanFetch + 772
18  postgres                            0x000000010497a4b8 ExecScanExtended + 164
19  postgres                            0x0000000104979bf0 ExecSeqScanWithProject + 244
20  postgres                            0x0000000104953ce0 ExecProcNode + 68
21  postgres                            0x000000010494f148 MultiExecPrivateHash + 64
22  postgres                            0x000000010494ed48 MultiExecHash + 100
23  postgres                            0x0000000104925a18 MultiExecProcNode + 180
24  postgres                            0x0000000104957220 ExecHashJoinImpl + 628
25  postgres                            0x00000001049566dc ExecHashJoin + 32
26  postgres                            0x0000000104925958 ExecProcNodeFirst + 100
27  postgres                            0x000000010491c08c ExecProcNode + 68
28  postgres                            0x000000010491692c ExecutePlan + 268
29  postgres                            0x000000010491679c standard_ExecutorRun + 504
30  pg_stat_statements.dylib            0x00000001055e132c pgss_ExecutorRun + 212
31  postgres                            0x0000000104916580 ExecutorRun + 72
32  postgres                            0x000000010491f59c ParallelQueryMain + 508
33  postgres                            0x00000001046968d4 ParallelWorkerMain + 1712
34  postgres                            0x0000000104b476f4 BackgroundWorkerMain + 824
35  postgres                            0x0000000104b4b87c postmaster_child_launch + 492
36  postgres                            0x0000000104b56610 StartBackgroundWorker + 416
37  postgres                            0x0000000104b51378 maybe_start_bgworkers + 552
38  postgres                            0x0000000104b540cc LaunchMissingBackgroundProcesses + 1316
39  postgres                            0x0000000104b51650 ServerLoop + 616
40  postgres                            0x0000000104b4fef0 PostmasterMain + 6632
41  postgres                            0x00000001049bfe20 main + 952
42  dyld                                0x00000001849d60e0 start + 2360
2025-05-24 16:07:23.427 EDT [22120:4] LOG:  background worker "parallel worker" (PID 32355) was terminated by signal 6:

Abort trap: 6
2025-05-24 16:07:23.427 EDT [22120:5] DETAIL:  Failed process was running: select count(*) from join_foo
           left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using (id)) ss

I'm yet to see the Assert triggered on the buildfarm, but this one looks
interesting too.

(I can share the complete patch + script for such testing, if it can be helpful.)

Best regards,
Alexander Lakhin
Neon (https://neon.tech)



pgsql-hackers by date:

Previous
From: "David E. Wheeler"
Date:
Subject: Re: PATCH: jsonpath string methods: lower, upper, initcap, l/r/btrim, replace, split_part
Next
From: "David E. Wheeler"
Date:
Subject: Re: PATCH: jsonpath string methods: lower, upper, initcap, l/r/btrim, replace, split_part