Re: Need help debugging SIGBUS crashes - Mailing list pgsql-hackers
| From | Jakub Wartak |
|---|---|
| Subject | Re: Need help debugging SIGBUS crashes |
| Date | |
| Msg-id | CAKZiRmyQz+jZWLC4GbyuCa6cjurS0nECgFbYVyjgxB3Hgo+VnQ@mail.gmail.com |
| In response to | Need help debugging SIGBUS crashes ("Peter 'PMc' Much" <pmc@citylink.dinoex.sub.org>) |
| Responses | Re: Need help debugging SIGBUS crashes |
| List | pgsql-hackers |
Hi,

On Tue, Mar 17, 2026 at 1:27 PM Peter 'PMc' Much <pmc@citylink.dinoex.sub.org> wrote:
>
> Hello,
> please excuse I am writing here, I wrote earlier to the users list
> but got no answer.
>
> I am observing repeated SIGBUS crashes of the postgres backend binary
> on FreeBSD, starting at Feb 2, every couple of weeks.
> The postgres is 15.15, the FreeBSD Release was 14.3, the crashes
> happen in malloc().
>
> The crashes happened on different PG clusters (running off the same
> binaries), so they cannot be pinpointed to a specific application.
>
> After following a few red herrings, I figured that I had patched
> into the NUMA allocation policy in the kernel at Dec 18, so I
> obviously thought this being the actual cause for the crashes. But
> apparently it isn't. I removed the patches that would relate to
> malloc() (and left only those relating to ZFS) - and after some
> days got another crash.
>
> So, yesterday I upgraded to FreeBSD 14.4, removed all my patches
> for NUMA, and in addition disabled NUMA entirely with
> vm.numa.disabled=1
> and added debugging info for libc. I intended to also add debugging
> to postgres - but tonight I already got another crash: the problem
> is apparently not related to NUMA.
[..]
> frame #6: 0x0000000829687afd libc.so.7`__je_arena_extent_alloc_large(tsdn=<unavailable>, arena=0x00003e616aa00980,usize=32768, alignment=<unavailable>, zero=0x0000000820c5bedf) at jemalloc_arena.c:448:12
> frame #7: 0x00000008296afca0 libc.so.7`__je_large_palloc(tsdn=0x00003e616a889090, arena=<unavailable>, usize=<unavailable>,alignment=64, zero=<unavailable>) at jemalloc_large.c:47:43
> frame #8: 0x00000008296afb02 libc.so.7`__je_large_malloc(tsdn=<unavailable>, arena=<unavailable>, usize=<unavailable>,zero=<unavailable>) at jemalloc_large.c:17:9 [artificial]
[..]

Not an answer from a regular FreeBSD guy, but more questions:

So have you removed those ZFS patches or not? (You said you reverted only the NUMA ones.) Maybe those ZFS patches corrupt some memory and jemalloc just hits those regions? I would revert the kernel to stock, as otherwise nobody will be able to tell what's happening there :)

Are you using hugepages? The jemalloc stack also contains "_large_", so can we assume jemalloc is using hugepages?

I don't know if this helps, but the last time I hunted down a SIGBUS [0] it was due to our incorrect patches (causing NUMA hugepage imbalances across nodes; our patch has some pause there, but what I did to track it down was to stack trace to Linux's kernel do_sigbus() routine via eBPF). Possibly you could detect some traps and/or hijack some routines using DTrace, which is available in FreeBSD, and that would give some hints?

-J.

[0] - https://www.postgresql.org/message-id/CAKZiRmww2P6QAzu6W%2BvxB89i5Ha-YRSHMeyr6ax2Lymcu3LUcw%40mail.gmail.com
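
For reference, the general class of failure (a fault on a mapped page that the kernel cannot back) can be reproduced in isolation. The sketch below is only an illustration under my own assumptions, not a diagnosis of the crash in this thread: it maps a temporary file, shrinks the file underneath the mapping, touches the mapping in a child process, and reports which signal killed the child. The names `CHILD` and `run_child` are hypothetical helpers, and it assumes a Unix-like system.

```python
import signal
import subprocess
import sys
import textwrap

# Child program: map one page of a file, shrink the file underneath the
# mapping, then touch the page that no longer has backing store.
CHILD = textwrap.dedent("""
    import mmap, os, resource, tempfile
    # Do not leave a core file behind from this deliberate crash.
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
    fd, path = tempfile.mkstemp()
    os.unlink(path)
    os.ftruncate(fd, mmap.PAGESIZE)      # file is one page long
    mm = mmap.mmap(fd, mmap.PAGESIZE)    # shared mapping of that page
    os.ftruncate(fd, 0)                  # backing store disappears
    mm[0]                                # fault cannot be satisfied
""")


def run_child():
    """Run the child and return the signal number that killed it, or None."""
    proc = subprocess.run([sys.executable, "-c", CHILD])
    # subprocess reports death by signal N as return code -N.
    return -proc.returncode if proc.returncode < 0 else None


if __name__ == "__main__":
    signo = run_child()
    if signo is None:
        print("child exited normally")
    else:
        print("child killed by", signal.Signals(signo).name)
```

Seeing the signal arrive this way in a controlled case may make it easier to recognize the same pattern when tracing the real backend.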