Re: Non-reproducible AIO failure - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: Non-reproducible AIO failure
Date
Msg-id 7b70a28d-512b-4564-9930-c8df196ae54b@garret.ru
In response to Re: Non-reproducible AIO failure  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 13/06/2025 11:20 pm, Andres Freund wrote:
> Hi,
>
> On 2025-06-12 12:23:13 -0400, Andres Freund wrote:
>> On 2025-06-12 11:52:31 -0400, Andres Freund wrote:
>>> On 2025-06-12 17:22:22 +0300, Konstantin Knizhnik wrote:
>>>> On 12/06/2025 4:57 pm, Andres Freund wrote:
>>>>> The problem appears to be in that switch between "when submitted, by the IO
>>>>> worker" and "then again by the backend".  It's not concurrent access in the
>>>>> sense of two processes writing to the same value, it's that when switching
>>>>> from the worker updating ->distilled_result to the issuer looking at that, the
>>>>> issuer didn't ensure that no outdated version of ->distilled_result could be
>>>>> used.
>>>>>
>>>>> Basically, the problem is that the worker would
>>>>>
>>>>> 1) set ->distilled_result
>>>>> 2) perform a write memory barrier
>>>>> 3) set ->state to COMPLETED_SHARED
>>>>>
>>>>> and then the issuer of the IO would:
>>>>>
>>>>> 4) check ->state is COMPLETED_SHARED
>>>>> 5) use ->distilled_result
>>>>>
>>>>> The problem is that there currently is no barrier between 4 & 5, which means
>>>>> an outdated ->distilled_result could be used.
>>>>>
>>>>>
>>>>> This also explains why the issue looked so weird - eventually, after fprintfs,
>>>>> after a core dump, etc, the updated ->distilled_result result would "arrive"
>>>>> in the issuing process, and suddenly look correct.
>>>> Thank you very much for the explanation.
>>>> Everything seems so simple once explained that it is hard to believe that,
>>>> before the explanation, such behavior looked like it could only be caused by
>>>> "black magic" or an "OS bug" :)
>>> Indeed. For a while I was really starting to doubt my sanity - not helped by
>>> the fact that I'd lose most of the mental context between reproductions (the
>>> problem did not re-occur from Saturday to Wednesday morning...).  What finally
>>> made it clearer was moving the failing assertion earlier
>>> (pgaio_io_call_complete_local() and shared_buffer_readv_complete_local()), as
>>> that took out a lot of the time in which the problem could occur.
>>>> Certainly using an outdated result can explain such behavior.
>>>> But in which particular place are we missing the read barrier between 4 and 5?
>>>> I see `pgaio_io_wait`, which I expect to be called by the backend to wait for
>>>> completion of the IO.
>>>> It calls `pgaio_io_was_recycled` to get the state, which in turn enforces a
>>>> read barrier:
>>>> ```
>>>>
>>>> bool
>>>> pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
>>>> PgAioHandleState *state)
>>>> {
>>>>      *state = ioh->state;
>>>>      pg_read_barrier();
>>>>
>>>>      return ioh->generation != ref_generation;
>>>> }
>>>> ```
>>> Yes, I don't think that path has the issue.
>>>
>>> As far as I can tell the problem only ever happens if we end up reclaiming an
>>> IO while waiting for a free IO handle.
>>>
>>> The known problematic path is
>>> pgaio_io_acquire()
>>> -> pgaio_io_wait_for_free()
>>> -> pgaio_io_reclaim()
>>>
>>> In that path we don't use pgaio_io_was_recycled().  I couldn't find any other
>>> path with the same issue [1].
>>> The issue only happening while waiting for free IO handles is presumably the
>>> reason why it happened in 027_stream_regress.pl: due to the small
>>> shared_buffers used for the test, io_max_concurrency is 1, which means we
>>> constantly need to wait for free IO handles.
>> Armed with that knowledge it turns out not to be too hard to create a repro
>> that fails a lot more quickly than after multiple days.
>>
>> Starting postgres with
>>
>>    -c io_max_concurrency=1 -c io_combine_limit=1 -c synchronize_seqscans=false -c restart_after_crash=false -c max_parallel_workers_per_gather=0
>>
>> and creating a test table with:
>>    create table large as select generate_series(1, 100000) a, b from generate_series(1, 100) b;
>>
>> makes the following pgbench fail within seconds for me:
>>    c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 1000 -P 10
>>
>> (interestingly at c=8 it takes closer to a minute on average)
>>
>> Adding pg_read_barrier() before the two pgaio_io_reclaim() calls in
>> pgaio_io_wait_for_free() has so far prevented a crash.  I'll leave it running
>> for a while longer, to make sure.
>>
>> We really lack any meaningful stress testing of IO paths, or I suspect we
>> would have found this much earlier.  I wonder if we could add a smaller
>> version of the above, with a small fixed runtime (maybe 3-5s), as a
>> tap test.
> Attached is a patch that fixes the problem for me. Alexander, Konstantin,
> could you verify that it also fixes the problem for you?


I have inspected your patch, and from my point of view it adds all the 
necessary read barriers and so eliminates the race.
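
To restate the ordering the patch enforces, here is a minimal, self-contained 
C11 sketch of the pattern (C11 fences stand in for pg_write_barrier() / 
pg_read_barrier(); the struct and constant below are simplified stand-ins, not 
the actual aio.c definitions):

```
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified stand-in for the PgAioHandle fields discussed above. */
typedef struct
{
    int         distilled_result;   /* stands in for ->distilled_result */
    _Atomic int state;              /* stands in for ->state            */
} FakeAioHandle;

#define COMPLETED_SHARED 1

/* IO worker: 1) set ->distilled_result, 2) write barrier, 3) set ->state */
void
worker_complete(FakeAioHandle *ioh, int result)
{
    ioh->distilled_result = result;
    atomic_thread_fence(memory_order_release);      /* pg_write_barrier() */
    atomic_store_explicit(&ioh->state, COMPLETED_SHARED,
                          memory_order_relaxed);
}

/* Issuer: 4) check ->state, then (after a read barrier) 5) use the result */
bool
issuer_consume(FakeAioHandle *ioh, int *result)
{
    if (atomic_load_explicit(&ioh->state, memory_order_relaxed) !=
        COMPLETED_SHARED)
        return false;

    /* This is the barrier that was missing between steps 4 and 5. */
    atomic_thread_fence(memory_order_acquire);      /* pg_read_barrier() */

    *result = ioh->distilled_result;
    return true;
}
```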
I also ran the proposed test for more than 4 hours and the problem was not 
reproduced on my system (without the patch it reproduces after 100 seconds).
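
For reference, what I kept running was essentially the reproducer from 
upthread, with the same server options and test table, just a longer run time:

```
c=16; pgbench -c $c -j $c -M prepared -n -f <(echo "select count(*) FROM large;") -T 10000 -P 10
```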

It's a little strange that pgbench got "stuck": I specified a timeout of 
10000 seconds (-T 10000), but it didn't stop and kept running until 15000 
seconds, when I terminated it.
There were no stuck queries: all 16 backends were running queries whose 
execution takes about 2 seconds:

progress: 14960.0 s, 7.4 tps, lat 2189.006 ms stddev 133.277, 0 failed
progress: 14970.0 s, 7.3 tps, lat 2113.871 ms stddev 164.008, 0 failed
progress: 14980.0 s, 7.3 tps, lat 2038.736 ms stddev 152.869, 0 failed

Looks like there is yet another race condition, now in pgbench :)




>
> Given that it does address the problem for me, I'm inclined to push this
> fairly soon, the barrier is pretty obviously required.
>
> Greetings,
>
> Andres Freund


