Re: Non-reproducible AIO failure - Mailing list pgsql-hackers

From: Konstantin Knizhnik
Subject: Re: Non-reproducible AIO failure
Msg-id: 2938968e-8010-4d14-a211-baa2d5778e1b@garret.ru
In response to: Re: Non-reproducible AIO failure (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On 12/06/2025 4:57 pm, Andres Freund wrote:
> The problem appears to be in that switch between "when submitted, by the IO
> worker" and "then again by the backend".  It's not concurrent access in the
> sense of two processes writing to the same value, it's that when switching
> from the worker updating ->distilled_result to the issuer looking at that, the
> issuer didn't ensure that no outdated version of ->distilled_result could be
> used.
>
> Basically, the problem is that the worker would
>
> 1) set ->distilled_result
> 2) perform a write memory barrier
> 3) set ->state to COMPLETED_SHARED
>
> and then the issuer of the IO would:
>
> 4) check ->state is COMPLETED_SHARED
> 5) use ->distilled_result
>
> The problem is that there currently is no barrier between 4 & 5, which means
> an outdated ->distilled_result could be used.
>
>
> This also explains why the issue looked so weird - eventually, after fprintfs,
> after a core dump, etc, the updated ->distilled_result result would "arrive"
> in the issuing process, and suddenly look correct.
>
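
If I restate your description in code, the protocol looks roughly like this (just a sketch using the pg_write_barrier()/pg_read_barrier() primitives; the field and state names follow your description, not necessarily the exact source):

    /* worker, after completing the IO: */
    ioh->distilled_result = result;            /* 1) publish the result */
    pg_write_barrier();                        /* 2) order the stores   */
    ioh->state = PGAIO_HS_COMPLETED_SHARED;    /* 3) publish the state  */

    /* issuer: */
    if (ioh->state == PGAIO_HS_COMPLETED_SHARED)   /* 4) */
    {
        /*
         * Without a pg_read_barrier() here, the load below may return a
         * stale ->distilled_result even though the state load above
         * already saw COMPLETED_SHARED.
         */
        pg_read_barrier();
        result = ioh->distilled_result;            /* 5) */
    }
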
Sorry, I realized that I do not completely understand how this can
explain the assertion failure in `pgaio_io_before_start`:

     Assert(ioh->op == PGAIO_OP_INVALID);

I am not that familiar with the AIO implementation in Postgres, so please
excuse me if I am missing something...

OK, so `distilled_result` contains some stale data.
As far as I understand, it is used for calling the completion callbacks.

Right now in the debugger (with the core file loaded), I see that two
callbacks are registered: `aio_md_readv_cb` and `aio_shared_buffer_readv_cb`.
Neither of them changes the "op" field.

This is the full content of the handle:

(PgAioHandle) {
   state = PGAIO_HS_HANDED_OUT
   target = PGAIO_TID_SMGR
   op = PGAIO_OP_READV
   flags = '\x04'
   num_callbacks = '\x02'
   callbacks = {
     [0] = '\x02'
     [1] = '\x01'
     [2] = '\0'
     [3] = '\0'
   }
   callbacks_data = {
     [0] = '\0'
     [1] = '\0'
     [2] = '\0'
     [3] = '\0'
   }
   handle_data_len = '\x01'
   owner_procno = 28
   result = 0
   node = {
     prev = 0x000000010c9ea368
     next = 0x000000010c9ea368
   }
   resowner = 0x000000014880ef28
   resowner_node = {
     prev = 0x000000014880f1d8
     next = 0x000000014880f1d8
   }
   generation = 133159566
   cv = {
     mutex = 0
     wakeup = (head = -1, tail = -1)
   }
   distilled_result = (id = 0, status = 0, error_data = 0, result = 1)
   iovec_off = 448
   report_return = 0x000000010f81b950
   op_data = {
     read = (fd = 7, iov_length = 1, offset = 180936704)
     write = (fd = 7, iov_length = 1, offset = 180936704)
   }
   target_data = {
     smgr = {
       rlocator = (spcOid = 1663, dbOid = 5, relNumber = 16384)
       blockNum = 22088
       nblocks = 1
       forkNum = MAIN_FORKNUM
       is_temp = false
       skip_fsync = false
     }
   }
}


Actually, `PGAIO_OP_READV` is set in only one place: `pgaio_io_start_readv`.
And this function is obviously called in the user's backend with procno=28,
which later hits this assertion failure.

So I tried to trace the data flow.

1. The backend calls `pgaio_io_start_readv` to initiate the IO. It assigns
`op=PGAIO_OP_READV`.
2. This store becomes visible to all CPUs because of the write barrier in
`pgaio_io_update_state(ioh, PGAIO_HS_DEFINED)`, called from `pgaio_io_stage`.
3. The AIO worker is notified about this request and handles it, changing
the state to `COMPLETED_SHARED` in `pgaio_io_process_completion`, which
notifies the backend via a condition variable (which is expected to enforce
a write barrier).
4. The backend sees that the state is `COMPLETED_SHARED`. It can do this
either in `pgaio_io_wait` or in `pgaio_io_wait_for_free`. In the first case
there seems to be no problem, because `pgaio_io_was_recycled` is used, which
performs a read barrier (as sketched below). So you suspect
`pgaio_io_wait_for_free`, where the read barrier was missing (before your
patch). But it is not quite clear to me what can happen to the "op" field
without a read barrier. In theory the backend could see an old value. In
practice that seems impossible, because as far as I know the unit of
synchronization between CPUs is a cache line.
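
To make the difference between the two paths concrete, the safe path looks roughly like this (a sketch from my reading of the code; the exact signature may differ):

    /* used by pgaio_io_wait: re-reads the state, then the generation */
    bool
    pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation,
                          PgAioHandleState *state)
    {
        *state = ioh->state;
        pg_read_barrier();      /* pairs with the worker's write barrier */

        return ioh->generation != ref_generation;
    }

whereas `pgaio_io_wait_for_free` (before your patch) checked the state without such a barrier, so loads of other handle fields issued after the state check were not ordered against it.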

So my hypothesis was the following: the backend observed the old value of
"op" (`PGAIO_OP_READV`, because that is what it had written itself), then
overwrote it with `PGAIO_OP_INVALID` in `pgaio_io_reclaim`, and then a
delayed update from the AIO worker arrived and restored the old value -
`PGAIO_OP_READV`.
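
In timeline form, the interleaving I had in mind (hypothetical, for illustration only) would be:

    backend:  ioh->op = PGAIO_OP_READV      (pgaio_io_start_readv)
    worker:   completes the IO, publishes the result
    backend:  ioh->op = PGAIO_OP_INVALID    (pgaio_io_reclaim)
    backend:  reuses the handle, yet observes op == PGAIO_OP_READV,
              as if a delayed store had "restored" the old value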
But that does not seem like a very realistic scenario.

Can you please explain how, according to your theory, an out-of-date
version of `distilled_result` can explain such behavior?
It would be possible if the result contained the index of the handle and a
stale version of that index caused the wrong IO handle to be addressed.
But that is not the case - as we can see from the trace, the address of
`ioh` in the backend is always the same.




>
>> This is what I am observing myself: "op"  field is modified and fetched by
>> the same process.
> Right - but I don't think the ->op field being wrong was actually part of the
> issue though.
>
>
>> Certainly a process can be rescheduled to some other CPU. But if such a
>> reschedule can cause loss of a stored value, then nothing will work, will it?
> Yes, that'd be completely broken - but isn't the issue here.
>
> Greetings,
>
> Andres Freund


