Re: Non-reproducible AIO failure - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: Non-reproducible AIO failure |
| Msg-id | mncqmsiizh6mbgkce5zjp3nbaytuan6djwdpze6lmb4hujbrne@dcdkfinpg2vg |
| In response to | Re: Non-reproducible AIO failure (Konstantin Knizhnik <knizhnik@garret.ru>) |
| List | pgsql-hackers |
Hi,

On 2025-06-12 11:52:31 -0400, Andres Freund wrote:
> On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
> > On 12/06/2025 4:57 pm, Andres Freund wrote:
> > > The problem appears to be in that switch between "when submitted, by the
> > > IO worker" and "then again by the backend". It's not concurrent access in
> > > the sense of two processes writing to the same value, it's that when
> > > switching from the worker updating ->distilled_result to the issuer
> > > looking at that, the issuer didn't ensure that no outdated version of
> > > ->distilled_result could be used.
> > >
> > > Basically, the problem is that the worker would
> > >
> > > 1) set ->distilled_result
> > > 2) perform a write memory barrier
> > > 3) set ->state to COMPLETED_SHARED
> > >
> > > and then the issuer of the IO would:
> > >
> > > 4) check ->state is COMPLETED_SHARED
> > > 5) use ->distilled_result
> > >
> > > The problem is that there currently is no barrier between 4 & 5, which
> > > means an outdated ->distilled_result could be used.
> > >
> > >
> > > This also explains why the issue looked so weird - eventually, after
> > > fprintfs, after a core dump, etc, the updated ->distilled_result would
> > > "arrive" in the issuing process, and suddenly look correct.
> > Thank you very much for the explanation.
> > Everything seems so simple after the explanation that you can hardly
> > believe that, before it, you thought such behavior could only be caused by
> > "black magic" or an "OS bug" :)
>
> Indeed. For a while I was really starting to doubt my sanity - not helped by
> the fact that I'd lose most of the mental context between reproductions (the
> problem did not re-occur from Saturday to Wednesday morning...). What finally
> made it clearer was moving the failing assertion earlier
> (pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as
> that took out a lot of the time in which the problem could occur.
>
> > Certainly using an outdated result can explain such behavior.
> > But in which particular place do we lose the read barrier between 4 and 5?
> > I see `pgaio_io_wait`, which I expect should be called by the backend to
> > wait for completion of the IO.
> > And it calls `pgaio_io_was_recycled` to get the state, and it in turn
> > enforces a read barrier:
> > ```
> >
> > bool
> > pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
> >                       PgAioHandleState *state)
> > {
> >     *state = ioh->state;
> >     pg_read_barrier();
> >
> >     return ioh->generation != ref_generation;
> > }
> > ```
>
> Yes, I don't think that path has the issue.
>
> As far as I can tell, the problem only ever happens if we end up reclaiming
> an IO while waiting for a free IO handle.
>
> The known problematic path is
> pgaio_io_acquire()
> -> pgaio_io_wait_for_free()
> -> pgaio_io_reclaim()
>
> In that path we don't use pgaio_io_was_recycled(). I couldn't find any other
> path with the same issue [1].
> The issue only happening while waiting for free IO handles is presumably the
> reason why it happened in 027_stream_regress.pl: due to the small
> shared_buffers used for the test, io_max_concurrency is 1, which means we
> constantly need to wait for a free IO handle.

Armed with that knowledge, it turns out to not be too hard to create a repro
that fails a lot quicker than multiple days.
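To restate the ordering requirement described above in code form, here is a
minimal sketch (the struct, the DEMO_COMPLETED_SHARED constant and the
function names are illustrative stand-ins, not the actual PgAioHandle
definitions; only the placement of the pg_write_barrier()/pg_read_barrier()
macros from port/atomics.h is the point):

```
#include "postgres.h"
#include "port/atomics.h"

/* Simplified stand-in for PgAioHandle; only the fields relevant here. */
typedef struct DemoAioHandle
{
	int			state;				/* stand-in for ->state */
	int			distilled_result;	/* stand-in for ->distilled_result */
} DemoAioHandle;

#define DEMO_COMPLETED_SHARED 1		/* stand-in for COMPLETED_SHARED */

/* IO worker side: publish the result before advertising completion. */
static void
demo_worker_complete(DemoAioHandle *ioh, int result)
{
	ioh->distilled_result = result;		/* 1) set ->distilled_result */
	pg_write_barrier();					/* 2) write memory barrier */
	ioh->state = DEMO_COMPLETED_SHARED;	/* 3) set ->state */
}

/* Issuer side: the read barrier between 4) and 5) is what was missing. */
static bool
demo_issuer_consume(DemoAioHandle *ioh, int *result)
{
	if (ioh->state != DEMO_COMPLETED_SHARED)	/* 4) check ->state */
		return false;

	/*
	 * Without this barrier, the load of ->distilled_result below may observe
	 * a stale value even though the state check above already succeeded.
	 */
	pg_read_barrier();

	*result = ioh->distilled_result;	/* 5) use ->distilled_result */
	return true;
}
```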
Starting postgres with

  -c io_max_concurrency=1 -c io_combine_limit=1 -c synchronize_seqscans=false -c restart_after_crash=false -c max_parallel_workers_per_gather=0

and creating a test table with:

  create table large as select generate_series(1, 100000) a, b from generate_series(1, 100) b;

makes the following pgbench fail within seconds for me:

  c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 1000 -P 10

(interestingly at c=8 it takes closer to a minute on average)

Adding pg_read_barrier() before the two pgaio_io_reclaim() calls in
pgaio_io_wait_for_free() has so far prevented a crash. I'll leave it running
for a while longer, to make sure.

We really lack any meaningful stress testing of IO paths, or I suspect we
would have found this much earlier. I wonder if we could add a smaller
version of the above, with a small fixed runtime (maybe 3-5s), as a tap test.

Greetings,

Andres Freund
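For context, the added barrier sits roughly like this (a hypothetical, heavily
simplified sketch of the relevant spot, not the actual pgaio_io_wait_for_free()
body; the PGAIO_HS_COMPLETED_SHARED spelling of the state constant is an
assumption here):

```
/*
 * Hypothetical sketch only - not the real pgaio_io_wait_for_free().  It just
 * shows where the added pg_read_barrier() sits relative to the state check
 * and the reclaim of the handle.
 */
if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
{
	/*
	 * Make sure everything the IO worker published before setting the state
	 * (notably ->distilled_result) is visible before the handle is reclaimed
	 * and reused.
	 */
	pg_read_barrier();

	pgaio_io_reclaim(ioh);
}
```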