On 2025-06-17 17:54:12 +0300, Konstantin Knizhnik wrote:
>
> On 12/06/2025 4:57 pm, Andres Freund wrote:
> > The problem appears to be in that switch between "when submitted, by the IO
> > worker" and "then again by the backend". It's not concurrent access in the
> > sense of two processes writing to the same value, it's that when switching
> > from the worker updating ->distilled_result to the issuer looking at that, the
> > issuer didn't ensure that no outdated version of ->distilled_result could be
> > used.
> >
> > Basically, the problem is that the worker would
> >
> > 1) set ->distilled_result
> > 2) perform a write memory barrier
> > 3) set ->state to COMPLETED_SHARED
> >
> > and then the issuer of the IO would:
> >
> > 4) check ->state is COMPLETED_SHARED
> > 5) use ->distilled_result
> >
> > The problem is that there currently is no barrier between 4 & 5, which means
> > an outdated ->distilled_result could be used.
> >
> >
> > This also explains why the issue looked so weird - eventually, after fprintfs,
> > after a core dump, etc, the updated ->distilled_result result would "arrive"
> > in the issuing process, and suddenly look correct.
> >
> Sorry, I realized that I do not completely understand how it can explain the
> assertion failure in `pgaio_io_before_start`:
>
> Assert(ioh->op == PGAIO_OP_INVALID);

I don't think it can - this must be an independent bug from the one that Tom
and I were encountering.
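
For reference, the ordering problem in the quoted steps 1-5 can be sketched
standalone. This is only an illustration using C11 atomics, not the actual AIO
code: the struct, the enum and the demo_* function names are made up, and in
PostgreSQL the two fences would roughly correspond to pg_write_barrier() and
pg_read_barrier().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the real PgAioHandle definitions. */
enum demo_state { DEMO_SUBMITTED = 0, DEMO_COMPLETED_SHARED = 1 };

struct demo_handle
{
    int         distilled_result;   /* plain field, published via 'state' */
    _Atomic int state;
};

/* Worker side: steps 1-3 from the quoted mail. */
static void
demo_worker_complete(struct demo_handle *h, int result)
{
    h->distilled_result = result;                       /* 1) set result */
    atomic_thread_fence(memory_order_release);          /* 2) write barrier */
    atomic_store_explicit(&h->state, DEMO_COMPLETED_SHARED,
                          memory_order_relaxed);        /* 3) set state */
}

/* Issuer side: steps 4-5, with a read barrier between them. */
static bool
demo_issuer_get_result(struct demo_handle *h, int *result)
{
    assert(result != NULL);

    /* 4) check the state */
    if (atomic_load_explicit(&h->state, memory_order_relaxed)
        != DEMO_COMPLETED_SHARED)
        return false;

    /*
     * Without this fence the load of distilled_result below may be
     * satisfied with a value older than the state we just observed.
     */
    atomic_thread_fence(memory_order_acquire);

    *result = h->distilled_result;                      /* 5) use result */
    return true;
}
```

The release fence only helps if the reader pairs it with an acquire fence (or
an acquire load of the state); a barrier on just one side orders nothing for
the other process.
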

Greetings,

Andres Freund