Re: Non-reproducible AIO failure - Mailing list pgsql-hackers
From: Konstantin Knizhnik
Subject: Re: Non-reproducible AIO failure
Msg-id: 2938968e-8010-4d14-a211-baa2d5778e1b@garret.ru
In response to: Re: Non-reproducible AIO failure (Andres Freund <andres@anarazel.de>)
Responses: Re: Non-reproducible AIO failure
List: pgsql-hackers
On 12/06/2025 4:57 pm, Andres Freund wrote:

> The problem appears to be in that switch between "when submitted, by the IO
> worker" and "then again by the backend". It's not concurrent access in the
> sense of two processes writing to the same value, it's that when switching
> from the worker updating ->distilled_result to the issuer looking at that,
> the issuer didn't ensure that no outdated version of ->distilled_result
> could be used.
>
> Basically, the problem is that the worker would
>
> 1) set ->distilled_result
> 2) perform a write memory barrier
> 3) set ->state to COMPLETED_SHARED
>
> and then the issuer of the IO would:
>
> 4) check ->state is COMPLETED_SHARED
> 5) use ->distilled_result
>
> The problem is that there currently is no barrier between 4 & 5, which means
> an outdated ->distilled_result could be used.
>
> This also explains why the issue looked so weird - eventually, after
> fprintfs, after a core dump, etc, the updated ->distilled_result result
> would "arrive" in the issuing process, and suddenly look correct.

Sorry, I realized that I do not completely understand how this can explain
the assertion failure in `pgaio_io_before_start`:

    Assert(ioh->op == PGAIO_OP_INVALID);

I am not so familiar with the AIO implementation in Postgres, so please
excuse me if I am missing something...

Ok, `distilled_result` contains some stale data. As far as I understand, it
is used for calling the callbacks. Right now in the debugger (loaded core
file), I see that two callbacks are registered: `aio_md_readv_cb` and
`aio_shared_buffer_readv_cb`. Neither of them changes the "op" field.

This is the full content of the handle:

    (PgAioHandle) {
      state = PGAIO_HS_HANDED_OUT
      target = PGAIO_TID_SMGR
      op = PGAIO_OP_READV
      flags = '\x04'
      num_callbacks = '\x02'
      callbacks = {
        [0] = '\x02'
        [1] = '\x01'
        [2] = '\0'
        [3] = '\0'
      }
      callbacks_data = {
        [0] = '\0'
        [1] = '\0'
        [2] = '\0'
        [3] = '\0'
      }
      handle_data_len = '\x01'
      owner_procno = 28
      result = 0
      node = {
        prev = 0x000000010c9ea368
        next = 0x000000010c9ea368
      }
      resowner = 0x000000014880ef28
      resowner_node = {
        prev = 0x000000014880f1d8
        next = 0x000000014880f1d8
      }
      generation = 133159566
      cv = {
        mutex = 0
        wakeup = (head = -1, tail = -1)
      }
      distilled_result = (id = 0, status = 0, error_data = 0, result = 1)
      iovec_off = 448
      report_return = 0x000000010f81b950
      op_data = {
        read = (fd = 7, iov_length = 1, offset = 180936704)
        write = (fd = 7, iov_length = 1, offset = 180936704)
      }
      target_data = {
        smgr = {
          rlocator = (spcOid = 1663, dbOid = 5, relNumber = 16384)
          blockNum = 22088
          nblocks = 1
          forkNum = MAIN_FORKNUM
          is_temp = false
          skip_fsync = false
        }
      }
    }

Actually, `PGAIO_OP_READV` is assigned in only one place:
`pgaio_io_start_readv`. And this function is obviously called in the user's
backend with procno=28, which later hits this assertion failure.

So I am trying to understand the dataflow:

1. The backend calls `pgaio_io_start_readv` to initiate the IO. It assigns
`op=PGAIO_OP_READV`.

2. The store is made visible to all CPUs because of the write barrier in
`pgaio_io_update_state(ioh, PGAIO_HS_DEFINED)`, called from `pgaio_io_stage`.

3. The AIO worker gets a notification about this request and handles it,
changing the state to `COMPLETED_SHARED` in `pgaio_io_process_completion`,
which notifies the backend using a condvar (which is expected to enforce a
write barrier).

4. The backend sees that the state is `COMPLETED_SHARED`. It can do this
either in `pgaio_io_wait` or in `pgaio_io_wait_for_free`. In the first case
there seems to be no problem, because `pgaio_io_was_recycled` is used, which
does a read barrier.
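To restate the ordering contract from your steps 1)-5) in terms of
PostgreSQL's barrier primitives, here is a minimal sketch (my simplification,
not the actual code, which is spread across `pgaio_io_process_completion` and
the wait paths):

```c
/* IO worker, after completing the IO: */
ioh->distilled_result = result;             /* 1) publish the result       */
pg_write_barrier();                         /* 2) order result vs. state   */
ioh->state = PGAIO_HS_COMPLETED_SHARED;     /* 3) publish the state        */

/* Issuing backend: */
if (ioh->state == PGAIO_HS_COMPLETED_SHARED)    /* 4) observe the state    */
{
    pg_read_barrier();              /* pairs with 2); without it, step 5)
                                     * below may read a stale value        */
    result = ioh->distilled_result;             /* 5) consume the result   */
}
```

In this picture, `pgaio_io_was_recycled` supplies the read barrier for the
`pgaio_io_wait` path, per point 4 of the dataflow above; the question is only
about the `pgaio_io_wait_for_free` path.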
So you suspect `pgaio_io_wait_for_free`, where a read barrier was missing
(before your patch); I sketch my understanding of the fixed path at the end
of this mail. But it is not quite clear to me what can happen to the "op"
field without a read barrier. In theory the backend can see an old value. In
practice this seems to be impossible, because as far as I know the unit of
data synchronization between CPUs is a cache line.

So my hypothesis was the following: the backend observed the old value of
"op" (PGAIO_OP_READV, because this is what it had written itself), then
overwrote it with PGAIO_OP_INVALID in `pgaio_io_reclaim`, and then the
delayed update from the AIO worker arrived and restored the old value -
PGAIO_OP_READV. But this does not seem to be a realistic scenario.

Can you please explain how, according to your theory, an out-of-date version
of `distilled_result` can explain such behavior? It would be possible if the
result contained the index of the handle and a stale version of that index
caused the wrong IO handle to be addressed. But this is not the case: as we
can see from the trace, the address of `ioh` in the backend is always the
same.

>> This is what I am observing myself: "op" field is modified and fetched by
>> the same process.
>
> Right - but I don't think the ->op field being wrong was actually part of
> the issue though.
>
>> Certainly process can be rescheduled to some other CPU. But if such
>> reschedule can cause loose of stored value, then nothing will work, will
>> it?
>
> Yes, that'd be completely broken - but isn't the issue here.
>
> Greetings,
>
> Andres Freund
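P.S. Here is my understanding of what the patched `pgaio_io_wait_for_free`
path has to do, as a rough sketch only (the real function also walks the free
list, deals with generations, etc.; the wait-event name below is my
assumption, not taken from the source):

```c
/* Issuer side, waiting until the handle reaches COMPLETED_SHARED (sketch). */
while (ioh->state != PGAIO_HS_COMPLETED_SHARED)
    ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_IO_COMPLETION);
ConditionVariableCancelSleep();

/*
 * Pairs with the worker's pg_write_barrier(): everything the worker stored
 * before setting ->state (distilled_result, but in principle any other
 * handle field) must be observed at least as new as the state read above.
 */
pg_read_barrier();

pgaio_io_reclaim(ioh);      /* now safe to look at distilled_result etc. */
```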