Re: Non-reproducible AIO failure - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Non-reproducible AIO failure
Msg-id of6nnksyqlbqikhpiwspalskgtx5dax6te2dwn3ojmj5k7obh4@hrteef7hiwvp
In response to Re: Non-reproducible AIO failure  (Konstantin Knizhnik <knizhnik@garret.ru>)
Responses Re: Non-reproducible AIO failure
List pgsql-hackers
Hi,

On 2025-06-16 20:22:00 -0400, Tom Lane wrote:
> Konstantin Knizhnik <knizhnik@garret.ru> writes:
> > On 16/06/2025 6:11 pm, Andres Freund wrote:
> >> I unfortunately can't repro this issue so far.
>
> > But unfortunately it means that the problem is not fixed.
>
> FWIW, I get similar results to Andres' on a Mac Mini M4 Pro
> using MacPorts' current compiler release (clang version 19.1.7).
> The currently-proposed test case fails within a few minutes on
> e9a3615a5^ but doesn't fail in a couple of hours on e9a3615a5.

I'm surprised it takes that long, given it takes seconds to reproduce here
with the config parameters I outlined. Did you try cranking up the concurrency
a bit? Yours has more cores than mine, and I found that that makes a huge
difference.


> However, I cannot repro that on a slightly older Mini M1 using Apple's
> current release (clang-1700.0.13.5, which per wikipedia is really LLVM
> 19.1.4).  It seems to work fine even without e9a3615a5.  So the whole
> thing is still depressingly phase-of-the-moon-dependent.

It's not entirely surprising that an M1 would have a harder time reproducing
the issue: more cores, larger caches and a larger out-of-order execution
window all make it more likely that the missing memory barriers have a
visible effect.
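
To illustrate the general class of problem (a minimal sketch using C11 atomics
and pthreads, not the PostgreSQL AIO code; Handoff, producer, consumer and
run_handoff are hypothetical names made up for this example): one thread
writes plain data and then sets a flag, another waits for the flag and reads
the data. The release/acquire pair is what guarantees the reader sees the
data. If both orderings were memory_order_relaxed instead, a weakly ordered
CPU would be allowed to make the flag visible before the payload, and how
often that actually shows up would depend on the hardware.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical handoff slot for illustration; not a PostgreSQL structure. */
typedef struct
{
    int        payload;   /* plain data, written before publication */
    atomic_int ready;     /* publication flag */
    int        seen;      /* what the consumer observed */
} Handoff;

static void *
producer(void *arg)
{
    Handoff *h = arg;

    h->payload = 42;
    /* Release: the payload store may not be reordered after this store. */
    atomic_store_explicit(&h->ready, 1, memory_order_release);
    return NULL;
}

static void *
consumer(void *arg)
{
    Handoff *h = arg;

    /* Acquire: pairs with the release store in producer(). */
    while (atomic_load_explicit(&h->ready, memory_order_acquire) == 0)
        ;                   /* spin until the flag is published */
    h->seen = h->payload;
    return NULL;
}

/* Runs the handoff `rounds` times; returns how often stale data was seen. */
int
run_handoff(int rounds)
{
    int stale = 0;

    for (int i = 0; i < rounds; i++)
    {
        Handoff   h = {.payload = 0, .seen = -1};
        pthread_t prod, cons;
        int       rc;

        atomic_init(&h.ready, 0);

        rc = pthread_create(&cons, NULL, consumer, &h);
        assert(rc == 0);
        rc = pthread_create(&prod, NULL, producer, &h);
        assert(rc == 0);
        (void) rc;          /* silence unused warning under NDEBUG */

        pthread_join(prod, NULL);
        pthread_join(cons, NULL);

        if (h.seen != 42)
            stale++;
    }
    return stale;
}
```

With the barriers in place run_handoff() has to return 0 for any number of
rounds. The relaxed variant is the interesting one, precisely because it can
run clean for a long time on one machine and fail quickly on another.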

I'm reasonably sure that e9a3615a5 quashed that specific issue - I could repro
it within seconds with e9a3615a5^ and with e9a3615a5 I ran it for several days
without a single failure...


> I don't doubt that Konstantin has found a different issue, but
> it's hard to be sure about the fix unless we can get it to be
> more reproducible.  Neither of my machines has ever shown the
> symptom he's getting.

I've not been able to reproduce that symptom a single time either so far.

The assertion continues to be inexplicable to me. It shows, within a single
process, memory in shared memory going "backwards". But not always, just very
occasionally. Because this is before the IO is defined, there's no concurrent
access whatsoever.


I stole^Wgot my partner's M1 MacBook for a bit, trying to reproduce the issue
there. It has
"Apple clang version 16.0.0 (clang-1600.0.26.6)"
on
"Darwin Kernel Version 24.3.0"


That's the same Apple-clang version that Alexander reported being able to
reproduce the issue on [1], but unfortunately it's a newer kernel version. No
dice in the first 55 test iterations.


Konstantin, Alexander - are you using the same device to reproduce this or
different ones? I wonder if this somehow depends on some MDM / corporate
enforcement tooling running or such.

What does:
- profiles status -type enrollment
- kextstat -l
show?

Greetings,

Andres Freund

[1] https://postgr.es/m/92b33ab2-0596-40fe-9db6-a6d821d08e8a%40gmail.com


