Re: Non-reproducible AIO failure - Mailing list pgsql-hackers
From:           Konstantin Knizhnik
Subject:        Re: Non-reproducible AIO failure
Msg-id:         7b70a28d-512b-4564-9930-c8df196ae54b@garret.ru
In response to: Re: Non-reproducible AIO failure (Andres Freund <andres@anarazel.de>)
List:           pgsql-hackers
On 13/06/2025 11:20 pm, Andres Freund wrote:
> Hi,
>
> On 2025-06-12 12:23:13 -0400, Andres Freund wrote:
>> On 2025-06-12 11:52:31 -0400, Andres Freund wrote:
>>> On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
>>>> On 12/06/2025 4:57 pm, Andres Freund wrote:
>>>>> The problem appears to be in that switch between "when submitted, by the IO
>>>>> worker" and "then again by the backend". It's not concurrent access in the
>>>>> sense of two processes writing to the same value, it's that when switching
>>>>> from the worker updating ->distilled_result to the issuer looking at that, the
>>>>> issuer didn't ensure that no outdated version of ->distilled_result could be
>>>>> used.
>>>>>
>>>>> Basically, the problem is that the worker would
>>>>>
>>>>> 1) set ->distilled_result
>>>>> 2) perform a write memory barrier
>>>>> 3) set ->state to COMPLETED_SHARED
>>>>>
>>>>> and then the issuer of the IO would:
>>>>>
>>>>> 4) check ->state is COMPLETED_SHARED
>>>>> 5) use ->distilled_result
>>>>>
>>>>> The problem is that there currently is no barrier between 4 & 5, which means
>>>>> an outdated ->distilled_result could be used.
>>>>>
>>>>>
>>>>> This also explains why the issue looked so weird - eventually, after fprintfs,
>>>>> after a core dump, etc, the updated ->distilled_result result would "arrive"
>>>>> in the issuing process, and suddenly look correct.
>>>> Thank you very much for the explanation.
>>>> Everything seems so simple after the explanation that it is hard to believe
>>>> that, before it, such behavior looked like it could only be caused by
>>>> "black magic" or an "OS bug" :)
>>> Indeed. For a while I was really starting to doubt my sanity - not helped by
>>> the fact that I'd lose most of the mental context between reproductions (the
>>> problem did not re-occur from Saturday to Wednesday morning...). What finally
>>> made it clearer was moving the failing assertion to earlier
>>> (pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as
>>> that took out a lot of the time in which the problem could occur.
>>>> Certainly using an outdated result can explain such behavior.
>>>> But in which particular place do we lose the read barrier between 4 and 5?
>>>> I see `pgaio_io_wait`, which I expect to be called by the backend to wait for
>>>> completion of the IO.
>>>> And it calls `pgaio_io_was_recycled` to get the state, which in turn enforces a
>>>> read barrier:
>>>> ```
>>>>
>>>> bool
>>>> pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
>>>>                       PgAioHandleState *state)
>>>> {
>>>>     *state = ioh->state;
>>>>     pg_read_barrier();
>>>>
>>>>     return ioh->generation != ref_generation;
>>>> }
>>>> ```
>>> Yes, I don't think that path has the issue.
>>>
>>> As far as I can tell the problem only ever happens if we end up reclaiming an
>>> IO while waiting for a free IO handle.
>>>
>>> The known problematic path is
>>> pgaio_io_acquire()
>>> -> pgaio_io_wait_for_free()
>>> -> pgaio_io_reclaim()
>>>
>>> In that path we don't use pgaio_io_was_recycled(). I couldn't find any other
>>> path with the same issue [1].
>>> The issue only happening while waiting for free IO handles is presumably the
>>> reason why it happened in 027_stream_regress.pl - due to the small
>>> shared_buffers used for the test, io_max_concurrency is 1, which means we
>>> constantly need to wait for a free IO handle.
>> Armed with that knowledge it turns out to not be too hard to create a repro
>> that fails a lot quicker than multiple days.
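To make the ordering requirement above concrete, here is a minimal standalone
illustration of steps 1)-5), using C11 fences as stand-ins for
pg_write_barrier()/pg_read_barrier() and simplified names rather than the real
PgAioHandle fields:

```c
/* Standalone illustration of the 1)-5) protocol above; the names mirror the
 * discussion (state, distilled_result) but this is not the real PgAioHandle. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

enum { SUBMITTED, COMPLETED_SHARED };

static struct
{
    int         distilled_result;   /* plain field, published by the worker */
    _Atomic int state;              /* SUBMITTED or COMPLETED_SHARED */
}           ioh = {0, SUBMITTED};

static void *
io_worker(void *arg)
{
    ioh.distilled_result = 42;                  /* 1) set ->distilled_result */
    atomic_thread_fence(memory_order_release);  /* 2) ~ pg_write_barrier() */
    atomic_store_explicit(&ioh.state, COMPLETED_SHARED,
                          memory_order_relaxed);    /* 3) set ->state */
    return NULL;
}

int
main(void)
{
    pthread_t   tid;

    pthread_create(&tid, NULL, io_worker, NULL);

    /* issuer: 4) wait until ->state is COMPLETED_SHARED */
    while (atomic_load_explicit(&ioh.state, memory_order_relaxed) != COMPLETED_SHARED)
        ;
    atomic_thread_fence(memory_order_acquire);  /* the barrier that was missing */
    printf("distilled_result = %d\n", ioh.distilled_result);   /* 5) use it */

    pthread_join(tid, NULL);
    return 0;
}
```

Without the acquire fence after step 4, step 5 can observe a stale
distilled_result on weakly ordered hardware, which is exactly the failure
described above (build with e.g. cc -std=c11 -pthread).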
>>
>> Starting postgres with
>>
>> -c io_max_concurrency=1 -c io_combine_limit=1 -c synchronize_seqscans=false -c restart_after_crash=false -c max_parallel_workers_per_gather=0
>>
>> and creating a test table with:
>> create table large as select generate_series(1, 100000) a, b from generate_series(1, 100) b;
>>
>> makes the following pgbench fail within seconds for me:
>> c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 1000 -P 10
>>
>> (interestingly at c=8 it takes closer to a minute on average)
>>
>> Adding pg_read_barrier() before the two pgaio_io_reclaim() calls in
>> pgaio_io_wait_for_free() has so far prevented a crash. I'll leave it running
>> for a while longer, to make sure.
>>
>> We really lack any meaningful stress testing of IO paths, or I suspect we
>> would have found this much earlier. I wonder if we could add a smaller
>> version of the above, with a small fixed runtime (maybe 3-5s), as a
>> tap test.
> Attached is a patch that fixes the problem for me. Alexander, Konstantin,
> could you verify that it also fixes the problem for you?

I have inspected your patch and, from my point of view, it adds all the
necessary read barriers and so eliminates the race.

I also ran the proposed test for more than 4 hours and the problem was not
reproduced on my system (without the patch it reproduced after 100 seconds).

It is a little bit strange that pgbench got "stuck": I specified a timeout of
10000 seconds (-T 10000), but it didn't stop and kept running until about
15000 seconds, when I terminated it. There were no stuck queries: all 16
backends were running queries whose execution takes about 2 seconds:

progress: 14960.0 s, 7.4 tps, lat 2189.006 ms stddev 133.277, 0 failed
progress: 14970.0 s, 7.3 tps, lat 2113.871 ms stddev 164.008, 0 failed
progress: 14980.0 s, 7.3 tps, lat 2038.736 ms stddev 152.869, 0 failed

Looks like there is yet another race condition now, in pgbench :)

> Given that it does address the problem for me, I'm inclined to push this
> fairly soon, the barrier is pretty obviously required.
>
> Greetings,
>
> Andres Freund
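P.S. For anyone reading this thread later, this is roughly the shape of the
pairing the patch restores in the issuer path, as I read it - a hedged sketch
with a simplified state check, not the actual code from aio.c:

```c
/*
 * Hedged sketch only - not the real pgaio_io_wait_for_free(). The point is
 * the pairing: the completing process sets ->distilled_result, issues
 * pg_write_barrier(), then sets ->state; the issuer, after observing ->state,
 * must issue pg_read_barrier() before consuming ->distilled_result (which
 * pgaio_io_reclaim() does).
 */
if (ioh->state == PGAIO_HS_COMPLETED_SHARED)    /* simplified check (assumption) */
{
    pg_read_barrier();      /* pairs with the completer's pg_write_barrier() */
    pgaio_io_reclaim(ioh);  /* now safe to look at ->distilled_result */
}
```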