Re: Need help debugging SIGBUS crashes - Mailing list pgsql-hackers

From Peter 'PMc' Much
Subject Re: Need help debugging SIGBUS crashes
Msg-id ablzSvIqaleFirLx@disp.intra.daemon.contact
In response to Re: Need help debugging SIGBUS crashes  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List pgsql-hackers
On Tue, Mar 17, 2026 at 02:50:25PM +0100, Jakub Wartak wrote:
! 
! Not an answer from a regular FreeBSD guy, but more questions:
! 
! So have you removed those ZFS patches or not? (You said You reverted only
! NUMA ones)?

They are completely removed now. 

! Maybe those ZFS patches they corrupt some memory and jemalloc just
! hits those regions? I would revert the kernel to stock thing

Yes, I would, too, but I can't. There are patches for Kerberos
(FreeBSD 14 still uses that very old Heimdal implementation, which
is why I am kind of stuck with PG 15, and upgrading that one will
be a bit of work), there are patches to make IPv6 fragmentation work
with the firewalls - in short, removing all of the patches will make
the SSO and networking fall apart entirely, and make the site
nonfunctional.

OTOH this crash seems to prefer happening in production. Last night
when it happened, the machine was busy rebuilding the OS etc. for
other nodes to upgrade to 14.4, and then I got bored and additionally
ran an LLM for entertainment. So the server had some 25 GB paged
out when the nightly housekeeping started to push daily log data
into the databases - which then led to the crash.

That means,
 A) I have no good idea how to properly reproduce such conditions
    in a test scenario, and
 B) it is not impossible that there is a bug (somewhere) that just
    doesn't usually hit orderly people who run their databases
    in rather overprovisioned conditions.

! Are You using hugepages? The jemalloc stack also contains "_large_" so can we
! assume jemalloc is using hugepages ?

I think I once tried to, but hugepages with Postgres do not
work on FreeBSD. The docs also say:
   "this setting is supported only on Linux and Windows."

! I don't know if that might help, but last time I hunted down SIGBUS [0] it was
! due to our incorrect patches (causing NUMA hugepages imbalances across nodes;
! our patch has some pause there, but what I did to track it down was to
! stack trace to Linux's kernel do_sigbus() routine via eBPF). Possibly You could
! hijack/detect some traps and/or hijack some routines using DTrace that's in
! FreeBSD and that would get some hints?

Thank You, currently everything helps. :)
DTrace is super cool, but one first needs to understand the code
before getting useful insight from it.
So any approach will imply a bunch of work, and I am currently looking
for the shortest path to an unknown target. ;)

PMc


