Re: Segfault in jit tuple deforming on arm64 due to LLVM issue - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Segfault in jit tuple deforming on arm64 due to LLVM issue
Date
Msg-id CA+hUKGL+rRGWneZ7+v0dkF-4yWPKoJU16XQ7Qg0j5sOYGwhRfg@mail.gmail.com
In response to Re: Segfault in jit tuple deforming on arm64 due to LLVM issue  (Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>)
Responses Re: Segfault in jit tuple deforming on arm64 due to LLVM issue
List pgsql-hackers
On Sat, Aug 24, 2024 at 12:22 AM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
> On Thu, Aug 22, 2024 at 12:33 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I fear that back-porting, for the LLVM project, would mean "we fix it
> > in main/20.x, and also back-port it to 19.x".  Do distros back-port
> > further?
>
> That's also my fear, I'm not familiar with distros back-port policy
> but eyeballing ubuntu package changelog[1], it seems to be mostly
> build fixes.
>
> Given that there's no visible way to fix the relocation issue, I
> wonder if jit shouldn't be disabled for arm64 until either the
> RuntimeDyld fix is merged or the switch to JITLink is done. Disabling
> jit tuple deforming may be enough but I'm not confident the issue
> won't happen in a different part.

We've experienced something a little similar before: in the early days
of PostgreSQL's LLVM-based JIT, it didn't work at all on ARM or POWER.
We sent a trivial fix[1] upstream that landed in LLVM 7; since it was a small
and obvious problem and it took a long time for some distros to ship
LLVM 7, we even contemplated hot-patching that LLVM function with our
own copy (but, ugh, only for about 7 nanoseconds).  That was before we
turned JIT on by default, and was also easier to deal with because it
was an obvious consistent failure in basic tests, so packagers
probably just disabled the build option on those architectures.  IIUC
this one is a random and rare crash depending on malloc() and perhaps
also the working size of your virtual memory dart board.  (Annoyingly,
I had tried, unsuccessfully, to reproduce this quite a few times on
small ARM systems when earlier reports came in, d'oh!)

This degree of support window mismatch is probably what triggered RHEL
to develop their new rolling LLVM version policy.  Unfortunately, it's
the other distros that tell *us* which versions to support, and not
the reverse (for example CF #4920 is about to drop support for LLVM <
14, but that will only be for PostgreSQL 18+).

Ultimately, if it doesn't work, and doesn't get fixed, it's hard for
us to do much about it.  But hmm, this is probably madness... I wonder
if it would be feasible to detect address span overflow ourselves at a
useful time, as a kind of band-aid defence...
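
To make that idea slightly more concrete, here is a rough sketch of what
such a check could look like.  Everything in it is hypothetical: the
JitSectionRange struct and jit_sections_within_span() are illustrative
names, not PostgreSQL or LLVM APIs, and the 4 GiB limit is my assumption
about the reach of the page-relative relocations involved.  Where the
section addresses would come from (presumably some memory manager hook)
is left open.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one JIT-allocated section (not a real PG/LLVM type). */
typedef struct JitSectionRange
{
	uint64_t	start;			/* address of the first byte */
	uint64_t	size;			/* length in bytes */
} JitSectionRange;

/* Assumed limit: 4 GiB between the lowest and highest section byte. */
#define ASSUMED_MAX_SPAN (UINT64_C(1) << 32)

/*
 * Return true if all sections fit inside a window of max_span bytes,
 * i.e. the distance from the lowest start to the highest end is small
 * enough that no section-to-section reference can exceed max_span.
 */
static bool
jit_sections_within_span(const JitSectionRange *sections, size_t nsections,
						 uint64_t max_span)
{
	uint64_t	lowest = UINT64_MAX;
	uint64_t	highest = 0;

	if (nsections == 0)
		return true;

	for (size_t i = 0; i < nsections; i++)
	{
		uint64_t	end = sections[i].start + sections[i].size;

		if (sections[i].start < lowest)
			lowest = sections[i].start;
		if (end > highest)
			highest = end;
	}

	return (highest - lowest) <= max_span;
}
```

A caller that gets false back could refuse to run the generated code and
fall back to the interpreter instead of crashing.  Checking the total
span is deliberately conservative; a real check might only need to look
at the pairs of sections that actually reference each other.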

[1] https://www.postgresql.org/message-id/CAEepm%3D39F_B3Ou8S3OrUw%2BhJEUP3p%3DwCu0ug-TTW67qKN53g3w%40mail.gmail.com


