Re: Non-reproducible AIO failure - Mailing list pgsql-hackers

From:           Konstantin Knizhnik
Subject:        Re: Non-reproducible AIO failure
Msg-id:         48b239e4-6428-415f-9c92-ec63a17c993a@garret.ru
In response to: Re: Non-reproducible AIO failure (Konstantin Knizhnik <knizhnik@garret.ru>)
List:           pgsql-hackers
On 17/06/2025 3:22 am, Tom Lane wrote:
> Konstantin Knizhnik <knizhnik@garret.ru> writes:
>> On 16/06/2025 6:11 pm, Andres Freund wrote:
>>> I unfortunately can't repro this issue so far.
>> But unfortunately it means that the problem is not fixed.
>
> FWIW, I get similar results to Andres' on a Mac Mini M4 Pro using
> MacPorts' current compiler release (clang version 19.1.7).  The
> currently-proposed test case fails within a few minutes on e9a3615a5^
> but doesn't fail in a couple of hours on e9a3615a5.
>
> However, I cannot repro that on a slightly older Mini M1 using Apple's
> current release (clang-1700.0.13.5, which per wikipedia is really LLVM
> 19.1.4).  It seems to work fine even without e9a3615a5.
>
> So the whole thing is still depressingly phase-of-the-moon-dependent.
> I don't doubt that Konstantin has found a different issue, but it's
> hard to be sure about the fix unless we can get it to be more
> reproducible.  Neither of my machines has ever shown the symptom he's
> getting.
>
>             regards, tom lane
Unfortunately I am still able to reproduce the assertion failure (with Andres' patch, but without my "fixes" - uint8 instead of bitfields).
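(To clarify what I mean by "uint8 instead of bitfields": C allows a store to one bitfield member to be compiled as a read-modify-write of the whole containing storage unit, so concurrent updates of adjacent bitfields from different processes can lose each other's writes. The sketch below only illustrates that idea - it is not the actual patch, and the struct/field names are only suggestive, not the real aio definitions.)

#include <stdint.h>
#include <stdio.h>

/* Illustration only, not PostgreSQL source and not my patch. */

/* Bitfield variant: state/target/op may share one storage unit, so
 * "h->op = ..." can be compiled as load-unit / modify / store-unit and
 * silently overwrite a concurrent update of a neighbouring field. */
struct handle_bitfields
{
    unsigned state:8;
    unsigned target:8;
    unsigned op:8;
};

/* Plain uint8 variant: each field occupies its own byte, so a store to
 * one field cannot clobber the others (ordering still needs proper
 * memory barriers, but there is no read-modify-write of neighbours). */
struct handle_plain
{
    uint8_t state;
    uint8_t target;
    uint8_t op;
};

int
main(void)
{
    printf("bitfields: %zu bytes, plain: %zu bytes\n",
           sizeof(struct handle_bitfields), sizeof(struct handle_plain));
    return 0;
}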
Apple M2 Pro, Ventura 13.7.6, clang 15.0
Postgres build options: --without-icu --enable-debug --enable-cassert CFLAGS=-O0

Postgres config:

io_max_concurrency=1
io_combine_limit=1
synchronize_seqscans=false
restart_after_crash=false
max_parallel_workers_per_gather=0
fsync=off

Scenario proposed by Andres:

c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 30000 -P 10

Now it takes me much longer to reproduce the error. Overnight runs of 30000 seconds passed normally. I thought the failure might somehow be caused or triggered by hibernation. When I reproduced it last time, I had not changed the default value of the "Prevent automatic sleeping on power adapter when the display is off" setting (disabled). That was the cause of the strange timing effect I reported earlier: when the system hibernates, the alarm is "hibernated" too, so the total test time is much longer than specified with the -T option.

Now I tried to explicitly force hibernation by periodically closing the laptop lid - it didn't help. But I disabled this option once again and let the test run. Several times it continued execution normally after wakeup, but finally I got the assertion failure:

progress: 18290.0 s, 7.3 tps, lat 2229.243 ms stddev 147.211, 0 failed
progress: 18300.0 s, 7.3 tps, lat 2164.216 ms stddev 176.251, 0 failed
progress: 18310.0 s, 7.5 tps, lat 2105.803 ms stddev 202.003, 0 failed
progress: 18320.0 s, 7.7 tps, lat 2162.209 ms stddev 209.344, 0 failed
progress: 18330.0 s, 7.0 tps, lat 2157.891 ms stddev 181.369, 0 failed
progress: 18340.0 s, 7.6 tps, lat 2120.269 ms stddev 169.287, 0 failed
progress: 18350.0 s, 7.3 tps, lat 2178.657 ms stddev 159.984, 0 failed
pgbench: error: client 4 aborted in command 0 (SQL) of script 0; perhaps the backend died while processing
WARNING:  terminating connection because of crash of another server process

I am not sure whether it is related to hibernation or not (it looks like it happens at the moment the computer is about to hibernate, but I am not sure how to check that).
The backtrace is the usual one:

  * frame #0: 0x0000000187248704 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x000000018727fc28 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x000000018718dae8 libsystem_c.dylib`abort + 180
    frame #3: 0x00000001034b2080 postgres`ExceptionalCondition(conditionName="ioh->op == PGAIO_OP_INVALID", fileName="aio_io.c", lineNumber=161) at assert.c:66:2
    frame #4: 0x00000001031bac04 postgres`pgaio_io_before_start(ioh=0x000000010c9f6e40) at aio_io.c:161:2
    frame #5: 0x00000001031baac4 postgres`pgaio_io_start_readv(ioh=0x000000010c9f6e40, fd=10, iovcnt=1, offset=180944896) at aio_io.c:81:2
    frame #6: 0x00000001031d985c postgres`FileStartReadV(ioh=0x000000010c9f6e40, file=3, iovcnt=1, offset=180944896, wait_event_info=167772181) at fd.c:2241:2
    frame #7: 0x000000010322c894 postgres`mdstartreadv(ioh=0x000000010c9f6e40, reln=0x000000011f0289c8, forknum=MAIN_FORKNUM, blocknum=22088, buffers=0x000000016d2dd998, nblocks=1) at md.c:1019:8
    frame #8: 0x000000010322fe4c postgres`smgrstartreadv(ioh=0x000000010c9f6e40, reln=0x000000011f0289c8, forknum=MAIN_FORKNUM, blocknum=22088, buffers=0x000000016d2dd998, nblocks=1) at smgr.c:758:2
    frame #9: 0x00000001031c2b0c postgres`AsyncReadBuffers(operation=0x000000010f81b910, nblocks_progress=0x000000016d2ddeb4) at bufmgr.c:1959:3
    frame #10: 0x00000001031c1ce8 postgres`StartReadBuffersImpl(operation=0x000000010f81b910, buffers=0x000000010f81b3e0, blockNum=22088, nblocks=0x000000016d2ddeb4, flags=0, allow_forwarding=true) at bufmgr.c:1428:18
    frame #11: 0x00000001031c182c postgres`StartReadBuffers(operation=0x000000010f81b910, buffers=0x000000010f81b3e0, blockNum=22088, nblocks=0x000000016d2ddeb4, flags=0) at bufmgr.c:1500:9
    frame #12: 0x00000001031bee44 postgres`read_stream_start_pending_read(stream=0x000000010f81b378) at read_stream.c:335:14
    frame #13: 0x00000001031be528 postgres`read_stream_look_ahead(stream=0x000000010f81b378) at read_stream.c:493:3
    frame #14: 0x00000001031be0c0 postgres`read_stream_next_buffer(stream=0x000000010f81b378, per_buffer_data=0x0000000000000000) at read_stream.c:971:2
    frame #15: 0x0000000102bb3d34 postgres`heap_fetch_next_buffer(scan=0x000000010f81ae58, dir=ForwardScanDirection) at heapam.c:675:18
    frame #16: 0x0000000102ba4a88 postgres`heapgettup_pagemode(scan=0x000000010f81ae58, dir=ForwardScanDirection, nkeys=0, key=0x0000000000000000) at heapam.c:1037:3
    frame #17: 0x0000000102ba5058 postgres`heap_getnextslot(sscan=0x000000010f81ae58, direction=ForwardScanDirection, slot=0x000000010f811330) at heapam.c:1391:3
    frame #18: 0x0000000102f26448 postgres`table_scan_getnextslot(sscan=0x000000010f81ae58, direction=ForwardScanDirection, slot=0x000000010f811330) at tableam.h:1031:9
    frame #19: 0x0000000102f26254 postgres`SeqNext(node=0x000000010f811110) at nodeSeqscan.c:81:6
    frame #20: 0x0000000102f2691c postgres`ExecScanFetch(node=0x000000010f811110, epqstate=0x0000000000000000, accessMtd=(postgres`SeqNext at nodeSeqscan.c:52), recheckMtd=(postgres`SeqRecheck at nodeSeqscan.c:91)) at execScan.h:126:9

If you are interested, I have the core file. It may be a separate problem, not related to the race condition caused by the lack of a memory barrier, but frankly speaking that seems unlikely. I think hibernation just perturbs process scheduling, which produces this effect.
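To spell out what the failing check itself means, here is a minimal self-contained illustration (only the "ioh->op == PGAIO_OP_INVALID" condition and the frame names come from the backtrace; everything else, including the enum and struct, is simplified and is not the real aio code):

#include <assert.h>

/* Simplified illustration, not PostgreSQL source: pgaio_io_before_start()
 * (aio_io.c:161 in the backtrace) asserts that a handle about to start a
 * new I/O does not still carry an operation from a previous use. */
typedef enum { PGAIO_OP_INVALID = 0, PGAIO_OP_READV, PGAIO_OP_WRITEV } PgAioOp;

typedef struct { PgAioOp op; } PgAioHandle;

static void
before_start(PgAioHandle *ioh)
{
    /* The invariant violated in my run: the handle handed out for the new
     * readv still looks "in use", as if the reset of op was lost or is not
     * yet visible to this backend. */
    assert(ioh->op == PGAIO_OP_INVALID);
}

int
main(void)
{
    PgAioHandle ioh = { .op = PGAIO_OP_READV };  /* stale op from a previous I/O */
    before_start(&ioh);                          /* aborts, as in the backtrace */
    return 0;
}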