Need help debugging SIGBUS crashes - Mailing list pgsql-hackers

From: Peter 'PMc' Much
Subject: Need help debugging SIGBUS crashes
Msg-id: ablGaVLQ3N9WsHcc@disp.intra.daemon.contact
List: pgsql-hackers
Hello,
  please excuse me for writing here; I wrote earlier to the -users
list but got no answer.

I am observing repeated SIGBUS crashes of the postgres backend binary
on FreeBSD, starting on Feb 2 and recurring every couple of weeks.
PostgreSQL is 15.15, the FreeBSD release was 14.3, and the crashes
happen in malloc().

The crashes happened on different PG clusters (running off the same
binaries), so they cannot be pinpointed to a specific application.

After following a few red herrings, I remembered that I had patched
the NUMA allocation policy in the kernel on Dec 18, so I
naturally thought this was the actual cause of the crashes. But
apparently it isn't: I removed the patches that would relate to
malloc() (and left only those relating to ZFS) - and after some
days got another crash.

So, yesterday I upgraded to FreeBSD 14.4, removed all my NUMA
patches, additionally disabled NUMA entirely with
   vm.numa.disabled=1
and added debugging info for libc. I intended to also add debugging
info to postgres - but tonight I already got another crash: the
problem is apparently not related to NUMA.
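In the meantime, one thing I might try - assuming FreeBSD's jemalloc honors the MALLOC_CONF environment variable and the /etc/malloc.conf symlink as described in malloc.conf(5) - is enabling junk filling, so that heap corruption is more likely to trap right where it happens instead of much later inside extent coalescing (the data directory path below is just my setup):

```shell
# Sketch: run the cluster with jemalloc junk filling enabled.
# junk:true fills allocations with 0xa5 and freed memory with 0x5a,
# so stale pointers and use-after-free tend to crash immediately.
MALLOC_CONF=junk:true pg_ctl -D /var/db/pg-col/data15 start

# System-wide alternative (affects every process, see malloc.conf(5)):
#   ln -s 'junk:true' /etc/malloc.conf
```

The overhead should be modest compared to a full debug rebuild, which is why I'd try it first.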

The backtrace is disgusting - all the interesting things are optimized
away. :(
So I am now quite clueless about how to proceed further, and could
really use some educated inspiration. I cannot even say whether this is
a postgres issue or a FreeBSD issue (but it doesn't happen to any
other program). I could probably rebuild the OS with -O0 - but is
this the best way to proceed?
(Filing a bug report against FreeBSD without specific reproduction
info would most likely just mean that it stays on my desk anyway.)
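Rather than rebuilding the whole OS with -O0, a cheaper option might be rebuilding just libc with jemalloc's internal assertions enabled. My understanding (from src.conf(5) - please correct me if this knob doesn't do what I think) is that release builds set MALLOC_PRODUCTION, which strips those assertions:

```shell
# Sketch, assuming a /usr/src checkout matching the running release:
# build libc with jemalloc assertions enabled, so the corruption is
# caught by an assert near the culprit instead of a later SIGBUS.
echo 'WITHOUT_MALLOC_PRODUCTION=yes' >> /etc/src.conf
cd /usr/src
make buildworld && make installworld
```

If a targeted rebuild of lib/libc alone is sufficient, that would shorten the cycle further, but I haven't verified that.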

I'm now following your guide on what info you would like to have in
bug reports - I'm still thinking there might be something I have
changed on the system (I'm always thinking that, because I tend
to change things to my liking, although in most cases it then turns
out to be an actual bug ;) ) and that I shouldn't report a bug yet -
but I can't think of anything now.


  ## My configuration file:

hba_file = '/usr/local/etc/pg-col/pg_hba.conf'
ident_file = '/usr/local/etc/pg-col/pg_ident.conf'
listen_addresses = 'admn-e.intra.daemon.contact'
port = 5434                             # (change requires restart)
max_connections = 60                    # (change requires restart)
unix_socket_permissions = 0777          # begin with 0 to use octal notation
krb_server_keyfile = '/usr/local/etc/pg-col/krb5.keytab'
shared_buffers = 40MB                   # min 128kB
temp_buffers = 20MB                     # min 800kB
work_mem = 50MB                         # min 64kB
maintenance_work_mem = 50MB             # min 1MB
max_stack_depth = 40MB                  # min 100kB
dynamic_shared_memory_type = posix      # the default is the first option
max_files_per_process = 200             # min 25
effective_io_concurrency = 12           # 1-1000; 0 disables prefetching
max_parallel_workers_per_gather = 3     # taken from max_parallel_workers
synchronous_commit = off                # synchronization level;
wal_sync_method = fsync         # the default is the first option
full_page_writes = off                  # recover from partial page writes
wal_compression = on                    # enable compression of full-page writes
wal_init_zero = off                     # zero-fill new WAL files
wal_writer_delay = 2000ms               # 1-10000 milliseconds
checkpoint_timeout = 240min             # range 30s-1d
checkpoint_completion_target = 0.0      # checkpoint target duration, 0.0 - 1.0
max_wal_size = 4GB
archive_mode = on               # enables archiving; off, on, or always
archive_command = '/ext/libexec/RedoLog.copy "%f" "%p"'
archive_timeout = 86400         # force a logfile segment switch after this

seq_page_cost = 0.5                     # measured on an arbitrary scale
random_page_cost = 1.2                  # same scale as above / PMc: SSD
effective_cache_size = 1GB
default_statistics_target = 1000        # range 1-10000
log_destination = 'syslog'
syslog_ident = 'pg-col'
log_min_messages = info                 # values in order of decreasing detail:
log_min_duration_statement = 10000      # -1 is disabled, 0 logs all statements
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = terse             # terse, default, or verbose messages
log_line_prefix = '%u:%d[%r] '          # special values:
log_lock_waits = on                     # log lock waits >= deadlock_timeout
log_temp_files = 10000                  # log temporary files equal or larger
log_timezone = 'Europe/Berlin'
update_process_title = on
track_io_timing = on
track_wal_io_timing = on
autovacuum = on                         # Enable autovacuum subprocess?  'on'
autovacuum_naptime = 5min               # time between autovacuum runs
autovacuum_vacuum_scale_factor = 0.01   # fraction of table size before vacuum
autovacuum_analyze_scale_factor = 0.05  # fraction of table size before analyze
temp_tablespaces = 'l1only'             # a list of tablespace names, '' uses
datestyle = 'german, dmy'
timezone = 'Europe/Berlin'
lc_messages = 'en_US.UTF-8'                     # locale for system error message
lc_monetary = 'en_US.UTF-8'                     # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'                      # locale for number formatting
lc_time = 'de_DE.UTF-8'                         # locale for time formatting
default_text_search_config = 'pg_catalog.german'

  ## settings for startup:

postgresql_data="/var/db/pg-col/data15"
postgresql_flags="-w -m fast -o --config_file=/usr/local/etc/pg-col/postgresql.conf"
postgresql_initdb_flags="--encoding=utf-8 --lc-collate=de_DE.UTF-8 --lc-ctype=de_DE.UTF-8 --lc-messages=en_US.UTF-8 --lc-monetary=en_US.UTF-8 --lc-numeric=en_US.UTF-8 --lc-time=en_US.UTF-8"
 

 # postgres --version
postgres (PostgreSQL) 15.15

 # uname -a
FreeBSD 14.4-RELEASE FreeBSD 14.4-RELEASE (commit a456f852d145)

And this is the backtrace - a couple of other partial backtraces
are here:
https://forums.freebsd.org/threads/trouble-bus-errors-and-segfaults-in-libc-so-from-postgres.101876/
(those apparently come from differing locations, but seem to crash at
the same place)

* thread #1, name = 'postgres', stop reason = signal SIGBUS
  * frame #0: 0x00000008296ad173 libc.so.7`extent_try_coalesce_impl(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, rtree_ctx=0x00003e616a8890c0, extents=0x00003e616aa058d8, extent=0x00003e616ab0ec00, coalesced=0x0000000000000000, growing_retained=<unavailable>, inactive_only=<unavailable>) at jemalloc_extent.c:0
    frame #1: 0x00000008296aa0d8 libc.so.7`extent_record [inlined] extent_try_coalesce(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, rtree_ctx=0x00003e616a8890c0, extents=0x00003e616aa058d8, extent=0x00003e616ab0ec00, coalesced=0x0000000000000000, growing_retained=<unavailable>) at jemalloc_extent.c:1678:9
    frame #2: 0x00000008296aa0c3 libc.so.7`extent_record(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, extents=0x00003e616aa058d8, extent=0x00003e616ab0ec00, growing_retained=<unavailable>) at jemalloc_extent.c:1717:12
    frame #3: 0x00000008296aaf39 libc.so.7`__je_extent_alloc_wrapper [inlined] extent_grow_retained(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, size=32768, pad=4096, alignment=<unavailable>, slab=<unavailable>, szind=<unavailable>, zero=<unavailable>, commit=<unavailable>) at jemalloc_extent.c:1383:4
    frame #4: 0x00000008296aaef7 libc.so.7`__je_extent_alloc_wrapper [inlined] extent_alloc_retained(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, new_addr=0x0000000000000000, size=32768, pad=4096, alignment=<unavailable>, slab=<unavailable>, szind=<unavailable>, zero=<unavailable>, commit=<unavailable>) at jemalloc_extent.c:1480:12
    frame #5: 0x00000008296aaef7 libc.so.7`__je_extent_alloc_wrapper(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, new_addr=0x0000000000000000, size=32768, pad=4096, alignment=64, slab=<unavailable>, szind=40, zero=0x0000000820c5bedf, commit=0x0000000820c5be87) at jemalloc_extent.c:1539:21
    frame #6: 0x0000000829687afd libc.so.7`__je_arena_extent_alloc_large(tsdn=<unavailable>, arena=0x00003e616aa00980, usize=32768, alignment=<unavailable>, zero=0x0000000820c5bedf) at jemalloc_arena.c:448:12
    frame #7: 0x00000008296afca0 libc.so.7`__je_large_palloc(tsdn=0x00003e616a889090, arena=<unavailable>, usize=<unavailable>, alignment=64, zero=<unavailable>) at jemalloc_large.c:47:43
    frame #8: 0x00000008296afb02 libc.so.7`__je_large_malloc(tsdn=<unavailable>, arena=<unavailable>, usize=<unavailable>, zero=<unavailable>) at jemalloc_large.c:17:9 [artificial]
    frame #9: 0x000000082967c8a7 libc.so.7`__je_malloc_default [inlined] tcache_alloc_large(tsd=0x00003e616a889090, arena=<unavailable>, tcache=0x00003e616a889280, size=<unavailable>, binind=<unavailable>, zero=false, slow_path=false) at tcache_inlines.h:124:9
    frame #10: 0x000000082967c82c libc.so.7`__je_malloc_default [inlined] arena_malloc(tsdn=0x00003e616a889090, arena=0x0000000000000000, size=<unavailable>, ind=<unavailable>, zero=false, tcache=0x00003e616a889280, slow_path=false) at arena_inlines_b.h:169:11
    frame #11: 0x000000082967c818 libc.so.7`__je_malloc_default [inlined] iallocztm(tsdn=0x00003e616a889090, size=<unavailable>, ind=<unavailable>, zero=false, tcache=0x00003e616a889280, is_internal=false, arena=0x0000000000000000, slow_path=false) at jemalloc_internal_inlines_c.h:53:8
    frame #12: 0x000000082967c818 libc.so.7`__je_malloc_default [inlined] imalloc_no_sample(sopts=<unavailable>, dopts=<unavailable>, tsd=0x00003e616a889090, size=<unavailable>, usize=32768, ind=<unavailable>) at jemalloc_jemalloc.c:1953:9
    frame #13: 0x000000082967c818 libc.so.7`__je_malloc_default [inlined] imalloc_body(sopts=<unavailable>, dopts=<unavailable>, tsd=0x00003e616a889090) at jemalloc_jemalloc.c:2153:16
    frame #14: 0x000000082967c818 libc.so.7`__je_malloc_default [inlined] imalloc(sopts=<unavailable>, dopts=<unavailable>) at jemalloc_jemalloc.c:2262:10
    frame #15: 0x000000082967c74a libc.so.7`__je_malloc_default(size=<unavailable>) at jemalloc_jemalloc.c:2293:2
    frame #16: 0x00000000009c0577 postgres`AllocSetContextCreateInternal + 199
    frame #17: 0x000000000058e56c postgres`StartTransaction + 332
    frame #18: 0x000000000058e3be postgres`StartTransactionCommand + 30
    frame #19: 0x00000000009a82fd postgres`InitPostgres + 461
    frame #20: 0x00000000007b8e0d postgres`AutoVacWorkerMain + 765
    frame #21: 0x00000000007b8ac7 postgres`StartAutoVacWorker + 39
    frame #22: 0x00000000007c0c21 postgres`sigusr1_handler + 785
    frame #23: 0x0000000822ae79b6 libthr.so.3`handle_signal(actp=0x0000000820c5c600, sig=30, info=0x0000000820c5c9f0, ucp=0x0000000820c5c680) at thr_sig.c:318:3
    frame #24: 0x0000000822ae6eba libthr.so.3`thr_sighandler(sig=30, info=0x0000000820c5c9f0, _ucp=0x0000000820c5c680) at thr_sig.c:261:2
    frame #25: 0x00000008210662d3
    frame #26: 0x00000000007c2545 postgres`ServerLoop + 1605
    frame #27: 0x00000000007bffa3 postgres`PostmasterMain + 3251
    frame #28: 0x0000000000720601 postgres`main + 801
    frame #29: 0x000000082958015c libc.so.7`__libc_start1(argc=4, argv=0x0000000820c5d6b0, env=0x0000000820c5d6d8, cleanup=<unavailable>, mainX=(postgres`main)) at libc_start1.c:180:7
    frame #30: 0x00000000004ff4e4 postgres`_start + 36



