Hi,
On 2025-06-16 20:22:00 -0400, Tom Lane wrote:
> Konstantin Knizhnik <knizhnik@garret.ru> writes:
> > On 16/06/2025 6:11 pm, Andres Freund wrote:
> >> I unfortunately can't repro this issue so far.
>
> > But unfortunately it means that the problem is not fixed.
>
> FWIW, I get similar results to Andres' on a Mac Mini M4 Pro
> using MacPorts' current compiler release (clang version 19.1.7).
> The currently-proposed test case fails within a few minutes on
> e9a3615a5^ but doesn't fail in a couple of hours on e9a3615a5.
I'm surprised it takes that long, given it takes seconds to reproduce here
with the config parameters I outlined. Did you try cranking up the concurrency
a bit? Your machine has more cores than mine, and I found that that makes a
huge difference.
> However, I cannot repro that on a slightly older Mini M1 using Apple's
> current release (clang-1700.0.13.5, which per wikipedia is really LLVM
> 19.1.4). It seems to work fine even without e9a3615a5. So the whole
> thing is still depressingly phase-of-the-moon-dependent.
It's not entirely surprising that an M1 would have a harder time reproducing
the issue: more cores, larger caches and a larger out-of-order execution
window all make it more likely that the missing memory barriers have a
visible effect.
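To illustrate the general class of problem (a minimal sketch with made-up
names, not the actual code touched by e9a3615a5): a writer publishes plain
data and then sets a flag, and a reader waits for the flag before reading the
data. The release/acquire pair below is what keeps the two stores ordered. If
both were weakened to memory_order_relaxed, a weakly ordered CPU would be
allowed to let the reader see the flag before the payload, and how often that
actually shows up depends heavily on the hardware.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names, for illustration only; this is not PostgreSQL code. */
static int payload;         /* plain data published by the writer */
static atomic_int ready;    /* flag announcing that payload is valid */

static void *
writer(void *arg)
{
    payload = *(int *) arg;
    /* Release: the payload store may not be reordered after the flag store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *
reader(void *arg)
{
    /* Acquire: pairs with the release store in writer(). */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    *(int *) arg = payload;
    return NULL;
}

/* Runs the handoff `rounds` times; returns how many rounds read a stale payload. */
int
count_stale_reads(int rounds)
{
    int stale = 0;

    for (int i = 1; i <= rounds; i++)
    {
        pthread_t w, r;
        int sent = i;
        int seen = 0;

        /* Reset state before either thread exists. */
        payload = 0;
        atomic_store(&ready, 0);

        pthread_create(&r, NULL, reader, &seen);
        pthread_create(&w, NULL, writer, &sent);
        pthread_join(w, NULL);
        pthread_join(r, NULL);

        if (seen != sent)
            stale++;
    }
    return stale;
}
```

With the barriers in place count_stale_reads() should always return 0. The
interesting experiment is the relaxed variant on different machines, which is
roughly the situation we are in with the M1 vs. M4 results.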
I'm reasonably sure that e9a3615a5 quashed that specific issue - I could repro
it within seconds with e9a3615a5^ and with e9a3615a5 I ran it for several days
without a single failure...
> I don't doubt that Konstantin has found a different issue, but
> it's hard to be sure about the fix unless we can get it to be
> more reproducible. Neither of my machines has ever shown the
> symptom he's getting.
I've not been able to reproduce that symptom a single time either so far.
The assertion continues to be inexplicable to me. It shows, within a single
process, memory in shared memory going "backwards". But not always, just very
occasionally. Because this is before the IO is defined, there's no concurrent
access whatsoever.
I stole^Wgot my partner's M1 MacBook for a bit, trying to reproduce the issue
there. It has
"Apple clang version 16.0.0 (clang-1600.0.26.6)"
on
"Darwin Kernel Version 24.3.0"
That's the same Apple-clang version that Alexander reported being able to
reproduce the issue on [1], but unfortunately it's a newer kernel version. No
dice in the first 55 test iterations.
Konstantin, Alexander - are you using the same device to reproduce this or
different ones? I wonder if this somehow depends on some MDM / corporate
enforcement tooling running or such.
What does:
- profiles status -type enrollment
- kextstat -l
show?
Greetings,
Andres Freund
[1] https://postgr.es/m/92b33ab2-0596-40fe-9db6-a6d821d08e8a%40gmail.com