Re: Non-reproducible AIO failure - Mailing list pgsql-hackers

From Alexander Lakhin
Subject Re: Non-reproducible AIO failure
Date
Msg-id 92b33ab2-0596-40fe-9db6-a6d821d08e8a@gmail.com
Whole thread Raw
In response to Re: Non-reproducible AIO failure  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hello Andres and Tom,

06.06.2025 22:37, Andres Freund wrote:
On 2025-06-06 15:21:13 -0400, Tom Lane wrote:
It's a mistake to think that this is a compiler bug.  The C standard
explicitly allows compilers to use word-wide operations to access
bit-field struct members.  Such accesses may fetch or overwrite other
bit-fields in the same word, using a non-atomic read/modify/write
sequence.  I don't have a recent C standard at hand, but I found this
quote on the web [1]:

The C Standard, 3.17, paragraph 3 [ISO/IEC 9899:2024], states
    Note 2 to entry: A bit-field and an adjacent non-bit-field member    are in separate memory locations. The same applies to two    bit-fields, if one is declared inside a nested structure    declaration and the other is not, or if the two are separated by a    zero-length bit-field declaration, or if they are separated by a    non-bit-field member declaration ...

So it's our code that is busted.  No doubt, what is happening is
that process A is fetching two fields, modifying one of them,
and storing the word back (with the observed value of the other
field) concurrently with some other process trying to update
the other field.  So the other process's update is lost.
There shouldn't be any concurrent accesses here, so I don't really see how the
above would explain the problem (the IO can only ever be modified by one
backend, initially the "owning backend", then, when submitted, by the IO
worker, and then again by the backend).

FWIW,  I tried the idea Matthias proposed upthread:
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -98,16 +98,17 @@ struct PgAioHandle
        /* all state updates should go through pgaio_io_update_state() */
        PgAioHandleState state:8;
 
+       /* bitfield of PgAioHandleFlags */
+       uint8           flags;
+
        /* what are we operating on */
        PgAioTargetID target:8;
 
+       uint8           num_callbacks;
+
        /* which IO operation */
        PgAioOp         op:8;
 
-       /* bitfield of PgAioHandleFlags */
-       uint8           flags;
-
-       uint8           num_callbacks;
 
        /* using the proper type here would use more space */
        uint8           callbacks[PGAIO_HANDLE_MAX_CALLBACKS];

and got 120 iterations passed cleanly. Without this change, I got failures
on iterations 41, 5, 20, Not a proof, just an observation...

May be the op field gets broken because of a (concurrent?) write to
not that field, by to state or target in some other place?

I also tried to reproduce the same on an ARM server again with clang 18,
using -O2 and -O3, and we with Konstantin didn't see such a sophisticated
assembly for pgaio_io_stage() — the code is much simpler and the failure
is not reproduced. The triplet there is:
PostgreSQL 18beta1 on aarch64-unknown-linux-gnu, compiled by Ubuntu clang version 18.1.3 (1ubuntu1), 64-bit

On that my MacBook, the triplet is:
PostgreSQL 18beta1 on aarch64-apple-darwin23.5.0, compiled by Apple clang version 16.0.0 (clang-1600.0.26.6), 64-bit
indri:
PostgreSQL 18beta1 on aarch64-apple-darwin24.5.0, compiled by Apple clang version 17.0.0 (clang-1700.0.13.3), 64-bit
sifaka:
PostgreSQL 18devel on aarch64-apple-darwin24.4.0, compiled by Apple clang version 17.0.0 (clang-1700.0.13.3), 64-bit

So it looks like "aarch64-apple" enables some extra optimizations with
bitfields...

By the way, there are also some other places with adjacent bitfields, e, g.:
typedef union PgAioTargetData
{
    struct
    {
        RelFileLocator rlocator;    /* physical relation identifier */
        BlockNumber blockNum;   /* blknum relative to begin of reln */
        BlockNumber nblocks;
        ForkNumber  forkNum:8;  /* don't waste 4 byte for four values */
        bool        is_temp:1;  /* proc can be inferred by owning AIO */
        bool        skip_fsync:1;
    }           smgr;

and
typedef struct PgStat_KindInfo
{
    /*
     * Do a fixed number of stats objects exist for this kind of stats (e.g.
     * bgwriter stats) or not (e.g. tables).
     */
    bool        fixed_amount:1;

    /*
     * Can stats of this kind be accessed from another database? Determines
     * whether a stats object gets included in stats snapshots.
     */
    bool        accessed_across_databases:1;

    /* Should stats be written to the on-disk stats file? */
    bool        write_to_file:1;
...

Best regards,
Alexander Lakhin

pgsql-hackers by date:

Previous
From: David Rowley
Date:
Subject: Re: [PATCH] Re: Proposal to Enable/Disable Index using ALTER INDEX
Next
From: Nico Williams
Date:
Subject: Re: Non-reproducible AIO failure