Re: Need help debugging SIGBUS crashes - Mailing list pgsql-hackers
| From | Peter 'PMc' Much |
|---|---|
| Subject | Re: Need help debugging SIGBUS crashes |
| Date | |
| Msg-id | acxgzmNqBCuRGCf6@disp.intra.daemon.contact |
| In response to | Re: Need help debugging SIGBUS crashes (Tom Lane <tgl@sss.pgh.pa.us>) |
| List | pgsql-hackers |
On Tue, Mar 17, 2026 at 04:56:48PM -0400, Tom Lane wrote:
! "Peter 'PMc' Much" <pmc@citylink.dinoex.sub.org> writes:
! > On Tue, Mar 17, 2026 at 10:12:07AM -0400, Tom Lane wrote:
! > ! Why it was okay in older FreeBSD and not so much in v14, who knows?
!
! > Maybe it wasn't. Here it appeared out of thin air in February, while
! > the system was upgraded from 13.5 to 14.3 in July'25, and did run
! > without problems for these eight months.
! > So this is not directly or solely related to FBSD R.14, and while it
! > happens more likely during massive memory use, but this also is not
! > stingent. Neither did I find any other solid determining condition.
!
! Yeah, it seems likely that there is some additional triggering
! condition that we don't understand; otherwise there would be more
! people complaining than just you.

Dear hackers ;)

I have now analyzed three of the memory dumps from crashing servers. That means I walked through the actual code of malloc() and did all the computations manually, in order to understand why and where a SIGBUS would be triggered.

What I found is an area of memory, about 4 or 8 MB long, where jemalloc stores a lookup tree. That area is zeroed and sparsely populated with pointers to other memory locations that jemalloc uses. But within this area there are one or two 4 kB pages containing data that does not belong there. That data is slightly structured, but there is no unique signature by which I could identify an owner - it is not fully random, but quite random, and also very different between the three crashes.

When a memory pointer is fetched from such a page, it can point anywhere, and that explains why using such a pointer gives either SIGSEGV or SIGBUS.

There is also one other person who has seen the exact same backtraces (and attributed them to autovacuum, and filed a bug report against FreeBSD) - this rules out a possible hardware issue.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=294039

The serious abdominal pain that I currently have is this: if something can replace pages in a table used internally by jemalloc, can it also replace pages in memory that are vital to the database itself? In other words, can this lead to silent data corruption?

In my samples I found about 0.1% of the memory corrupted, and I still assume that memory exhaustion is involved as an additional factor. Together these might explain why the crashes are observed only rarely.

For now it is confirmed that the crashes can happen in FreeBSD 14.3, 14.4 and 15.0, and with PG r14, r15 and r16.

Furthermore (as you can read in the mentioned bug report), our PG maintainer Palle Girgensohn had the idea that the errata notice FreeBSD-EN-26:03.vm might be causing the issues. The installation of that patch aligns well with the appearance of the crashes. For now I have removed that patch from my kernel and have been hammering on the database for nearly two days without another crash - but that is still too short to say anything with certainty.

I am unsure about what to do next. In the worst case, quite a number of professional installations might be in subtle danger, so maybe something should be done? Certainly, I could just as well decide that removing this patch (hopefully) solves my issue, that I am (hopefully) done with this, and go to sleep again, as everybody else may just care for themselves...

I'll be thankful for inspirations.

cheerio,
PMc
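
The page-by-page inspection of the lookup-tree area described above can be sketched as follows. This is a minimal sketch, not the procedure actually used on the dumps: it assumes the area has already been extracted from the core file into a byte string, that words are 64-bit little-endian, and that a page counts as suspicious when more than half of its words are nonzero. The function names and the threshold are hypothetical.

```python
import struct

PAGE_SIZE = 4096   # 4 kB pages, as in the description above
WORD_SIZE = 8      # assumption: 64-bit little-endian words


def nonzero_fraction(page):
    """Return the fraction of 8-byte words in the page that are not zero."""
    n = len(page) // WORD_SIZE
    if n == 0:
        return 0.0
    words = struct.unpack(f"<{n}Q", page[: n * WORD_SIZE])
    return sum(1 for w in words if w != 0) / n


def suspicious_pages(region, threshold=0.5):
    """Return byte offsets of pages whose nonzero-word fraction exceeds
    the (hypothetical) threshold; a sparse table should stay far below it."""
    offsets = []
    for off in range(0, len(region), PAGE_SIZE):
        page = region[off : off + PAGE_SIZE]
        if nonzero_fraction(page) > threshold:
            offsets.append(off)
    return offsets
```

A densely populated but legitimate page would be flagged as well, so hits from such a scan only narrow down where to look by hand.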