Re: Non-reproducible AIO failure - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: Non-reproducible AIO failure
Date
Msg-id 48b239e4-6428-415f-9c92-ec63a17c993a@garret.ru
In response to Re: Non-reproducible AIO failure  (Konstantin Knizhnik <knizhnik@garret.ru>)
List pgsql-hackers


On 17/06/2025 3:22 am, Tom Lane wrote:
> Konstantin Knizhnik <knizhnik@garret.ru> writes:
>> On 16/06/2025 6:11 pm, Andres Freund wrote:
>>> I unfortunately can't repro this issue so far.
>> But unfortunately it means that the problem is not fixed.
>
> FWIW, I get similar results to Andres' on a Mac Mini M4 Pro
> using MacPorts' current compiler release (clang version 19.1.7).
> The currently-proposed test case fails within a few minutes on
> e9a3615a5^ but doesn't fail in a couple of hours on e9a3615a5.
>
> However, I cannot repro that on a slightly older Mini M1 using Apple's
> current release (clang-1700.0.13.5, which per wikipedia is really LLVM
> 19.1.4).  It seems to work fine even without e9a3615a5.  So the whole
> thing is still depressingly phase-of-the-moon-dependent.
>
> I don't doubt that Konstantin has found a different issue, but
> it's hard to be sure about the fix unless we can get it to be
> more reproducible.  Neither of my machines has ever shown the
> symptom he's getting.
>
> 			regards, tom lane


Unfortunately I am still able to reproduce the assertion failure (with Andres' patch, but without my "fixes": uint8 instead of bitfields).

Apple M2 Pro, Ventura 13.7.6, clang 15.0

Postgres build options: --without-icu --enable-debug --enable-cassert CFLAGS=-O0

Postgres config:

io_max_concurrency=1
io_combine_limit=1
synchronize_seqscans=false
restart_after_crash=false
max_parallel_workers_per_gather=0
fsync=off

Scenario proposed by Andres:
c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 30000 -P 10


It now takes me much longer to reproduce the error: an overnight run of 30000 seconds completed normally.
I suspected the failure might somehow be caused, or at least encouraged, by hibernation.
When I last reproduced it, I had not changed the default display setting "Prevent automatic sleeping on power adapter when the display is off" (disabled).
That caused the strange timing effect I reported earlier: when the system hibernates, the alarm is "hibernated" too, so the total test time is much longer than specified with the -T option.
This time I tried to force hibernation explicitly by periodically closing the laptop lid; that did not help. Then I disabled this option once again and let the test run. Several times it continued normally after wakeup, but finally I got the assertion failure:

progress: 18290.0 s, 7.3 tps, lat 2229.243 ms stddev 147.211, 0 failed
progress: 18300.0 s, 7.3 tps, lat 2164.216 ms stddev 176.251, 0 failed
progress: 18310.0 s, 7.5 tps, lat 2105.803 ms stddev 202.003, 0 failed
progress: 18320.0 s, 7.7 tps, lat 2162.209 ms stddev 209.344, 0 failed
progress: 18330.0 s, 7.0 tps, lat 2157.891 ms stddev 181.369, 0 failed
progress: 18340.0 s, 7.6 tps, lat 2120.269 ms stddev 169.287, 0 failed
progress: 18350.0 s, 7.3 tps, lat 2178.657 ms stddev 159.984, 0 failed
pgbench: error: client 4 aborted in command 0 (SQL) of script 0; perhaps the backend died while processing
WARNING:  terminating connection because of crash of another server process

I am not sure whether it is related to hibernation (it looks like it happens at the moment the computer is about to hibernate, but I am not sure how to check that).
The backtrace is the usual one:
  * frame #0: 0x0000000187248704 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x000000018727fc28 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x000000018718dae8 libsystem_c.dylib`abort + 180
    frame #3: 0x00000001034b2080 postgres`ExceptionalCondition(conditionName="ioh->op == PGAIO_OP_INVALID", fileName="aio_io.c", lineNumber=161) at assert.c:66:2
    frame #4: 0x00000001031bac04 postgres`pgaio_io_before_start(ioh=0x000000010c9f6e40) at aio_io.c:161:2
    frame #5: 0x00000001031baac4 postgres`pgaio_io_start_readv(ioh=0x000000010c9f6e40, fd=10, iovcnt=1, offset=180944896) at aio_io.c:81:2
    frame #6: 0x00000001031d985c postgres`FileStartReadV(ioh=0x000000010c9f6e40, file=3, iovcnt=1, offset=180944896, wait_event_info=167772181) at fd.c:2241:2
    frame #7: 0x000000010322c894 postgres`mdstartreadv(ioh=0x000000010c9f6e40, reln=0x000000011f0289c8, forknum=MAIN_FORKNUM, blocknum=22088, buffers=0x000000016d2dd998, nblocks=1) at md.c:1019:8
    frame #8: 0x000000010322fe4c postgres`smgrstartreadv(ioh=0x000000010c9f6e40, reln=0x000000011f0289c8, forknum=MAIN_FORKNUM, blocknum=22088, buffers=0x000000016d2dd998, nblocks=1) at smgr.c:758:2
    frame #9: 0x00000001031c2b0c postgres`AsyncReadBuffers(operation=0x000000010f81b910, nblocks_progress=0x000000016d2ddeb4) at bufmgr.c:1959:3
    frame #10: 0x00000001031c1ce8 postgres`StartReadBuffersImpl(operation=0x000000010f81b910, buffers=0x000000010f81b3e0, blockNum=22088, nblocks=0x000000016d2ddeb4, flags=0, allow_forwarding=true) at bufmgr.c:1428:18
    frame #11: 0x00000001031c182c postgres`StartReadBuffers(operation=0x000000010f81b910, buffers=0x000000010f81b3e0, blockNum=22088, nblocks=0x000000016d2ddeb4, flags=0) at bufmgr.c:1500:9
    frame #12: 0x00000001031bee44 postgres`read_stream_start_pending_read(stream=0x000000010f81b378) at read_stream.c:335:14
    frame #13: 0x00000001031be528 postgres`read_stream_look_ahead(stream=0x000000010f81b378) at read_stream.c:493:3
    frame #14: 0x00000001031be0c0 postgres`read_stream_next_buffer(stream=0x000000010f81b378, per_buffer_data=0x0000000000000000) at read_stream.c:971:2
    frame #15: 0x0000000102bb3d34 postgres`heap_fetch_next_buffer(scan=0x000000010f81ae58, dir=ForwardScanDirection) at heapam.c:675:18
    frame #16: 0x0000000102ba4a88 postgres`heapgettup_pagemode(scan=0x000000010f81ae58, dir=ForwardScanDirection, nkeys=0, key=0x0000000000000000) at heapam.c:1037:3
    frame #17: 0x0000000102ba5058 postgres`heap_getnextslot(sscan=0x000000010f81ae58, direction=ForwardScanDirection, slot=0x000000010f811330) at heapam.c:1391:3
    frame #18: 0x0000000102f26448 postgres`table_scan_getnextslot(sscan=0x000000010f81ae58, direction=ForwardScanDirection, slot=0x000000010f811330) at tableam.h:1031:9
    frame #19: 0x0000000102f26254 postgres`SeqNext(node=0x000000010f811110) at nodeSeqscan.c:81:6
    frame #20: 0x0000000102f2691c postgres`ExecScanFetch(node=0x000000010f811110, epqstate=0x0000000000000000, accessMtd=(postgres`SeqNext at nodeSeqscan.c:52), recheckMtd=(postgres`SeqRecheck at nodeSeqscan.c:91)) at execScan.h:126:9

If you are interested, I have a core file.
It may be a separate problem, unrelated to the race condition caused by the missing memory barrier, but frankly speaking that seems unlikely.
I think hibernation just perturbs process scheduling, which makes the race easier to hit.
