Need help debugging SIGBUS crashes - Mailing list pgsql-hackers
| From | Peter 'PMc' Much |
|---|---|
| Subject | Need help debugging SIGBUS crashes |
| Date | |
| Msg-id | ablGaVLQ3N9WsHcc@disp.intra.daemon.contact |
| Responses | Re: Need help debugging SIGBUS crashes<br>Re: Need help debugging SIGBUS crashes |
| List | pgsql-hackers |
Hello, please excuse my writing here; I wrote earlier to the users list but got no answer.

I am observing repeated SIGBUS crashes of the postgres backend binary on FreeBSD, starting Feb 2 and recurring every couple of weeks. Postgres is 15.15, the FreeBSD release was 14.3, and the crashes happen in malloc(). The crashes happened on different PG clusters (running off the same binaries), so they cannot be pinpointed to a specific application.

After following a few red herrings, I recalled that on Dec 18 I had patched the NUMA allocation policy in the kernel, so I obviously thought this was the actual cause of the crashes. But apparently it isn't: I removed the patches that would relate to malloc() (leaving only those relating to ZFS), and after some days got another crash.

So, yesterday I upgraded to FreeBSD 14.4, removed all my NUMA patches, disabled NUMA entirely with vm.numa.disabled=1, and added debugging info for libc. I intended to also add debugging to postgres, but tonight I already got another crash: the problem is apparently not related to NUMA.

The backtrace is disgusting - all interesting things are optimized away. :(

So I am now quite clueless how to proceed further, and could really use some educated inspiration. I cannot even say whether this is a postgres issue or a FreeBSD issue (though it doesn't happen to any other program). I could probably rebuild the OS with -O0 - but is this the best way to proceed? (Filing a bug report against FreeBSD without specific reproduction info would most likely just mean it stays on my desk anyway.)

I'm now following your paper on what info you like to have in bug reports. I still think there might be something I have changed on the system (I always think that, because I tend to change things to my liking, albeit in most cases it then actually is a bug ;) ) and that I shouldn't report a bug yet - but I can't think of anything now.
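One thing that might catch the corruption closer to its source than a rebuild with -O0: jemalloc's own run-time debugging knobs. A sketch, assuming FreeBSD's libc-bundled jemalloc honors the standard `MALLOC_CONF` option names (junk filling poisons freed memory, so a use-after-free tends to trip immediately instead of much later inside extent coalescing):

```shell
# Sketch: poison allocations/frees and abort on internal warnings.
# (Assumption: the libc-bundled jemalloc accepts these standard options.)
export MALLOC_CONF="junk:true,abort:true"

# Then restart the cluster with this variable in the postmaster's
# environment, e.g. via the rc script:
#   service postgresql restart
echo "MALLOC_CONF=$MALLOC_CONF"
```

This only narrows the window between the corrupting write and the crash; it is not free (junk filling costs some CPU), so it is something to run temporarily on one affected cluster.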
## My configuration file:

```
hba_file = '/usr/local/etc/pg-col/pg_hba.conf'
ident_file = '/usr/local/etc/pg-col/pg_ident.conf'
listen_addresses = 'admn-e.intra.daemon.contact'
port = 5434                             # (change requires restart)
max_connections = 60                    # (change requires restart)
unix_socket_permissions = 0777          # begin with 0 to use octal notation
krb_server_keyfile = '/usr/local/etc/pg-col/krb5.keytab'
shared_buffers = 40MB                   # min 128kB
temp_buffers = 20MB                     # min 800kB
work_mem = 50MB                         # min 64kB
maintenance_work_mem = 50MB             # min 1MB
max_stack_depth = 40MB                  # min 100kB
dynamic_shared_memory_type = posix      # the default is the first option
max_files_per_process = 200             # min 25
effective_io_concurrency = 12           # 1-1000; 0 disables prefetching
max_parallel_workers_per_gather = 3     # taken from max_parallel_workers
synchronous_commit = off                # synchronization level;
wal_sync_method = fsync                 # the default is the first option
full_page_writes = off                  # recover from partial page writes
wal_compression = on                    # enable compression of full-page writes
wal_init_zero = off                     # zero-fill new WAL files
wal_writer_delay = 2000ms               # 1-10000 milliseconds
checkpoint_timeout = 240min             # range 30s-1d
checkpoint_completion_target = 0.0      # checkpoint target duration, 0.0 - 1.0
max_wal_size = 4GB
archive_mode = on                       # enables archiving; off, on, or always
archive_command = '/ext/libexec/RedoLog.copy "%f" "%p"'
archive_timeout = 86400                 # force a logfile segment switch after this
seq_page_cost = 0.5                     # measured on an arbitrary scale
random_page_cost = 1.2                  # same scale as above / PMc: SSD
effective_cache_size = 1GB
default_statistics_target = 1000        # range 1-10000
log_destination = 'syslog'
syslog_ident = 'pg-col'
log_min_messages = info                 # values in order of decreasing detail:
log_min_duration_statement = 10000      # -1 is disabled, 0 logs all statements
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = terse             # terse, default, or verbose messages
log_line_prefix = '%u:%d[%r] '          # special values:
log_lock_waits = on                     # log lock waits >= deadlock_timeout
log_temp_files = 10000                  # log temporary files equal or larger
log_timezone = 'Europe/Berlin'
update_process_title = on
track_io_timing = on
track_wal_io_timing = on
autovacuum = on                         # Enable autovacuum subprocess?  'on'
autovacuum_naptime = 5min               # time between autovacuum runs
autovacuum_vacuum_scale_factor = 0.01   # fraction of table size before vacuum
autovacuum_analyze_scale_factor = 0.05  # fraction of table size before analyze
temp_tablespaces = 'l1only'             # a list of tablespace names, '' uses
datestyle = 'german, dmy'
timezone = 'Europe/Berlin'
lc_messages = 'en_US.UTF-8'             # locale for system error message
lc_monetary = 'en_US.UTF-8'             # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'              # locale for number formatting
lc_time = 'de_DE.UTF-8'                 # locale for time formatting
default_text_search_config = 'pg_catalog.german'
```

## settings for startup:

```
postgresql_data="/var/db/pg-col/data15"
postgresql_flags="-w -m fast -o --config_file=/usr/local/etc/pg-col/postgresql.conf"
postgresql_initdb_flags="--encoding=utf-8 --lc-collate=de_DE.UTF-8 --lc-ctype=de_DE.UTF-8 --lc-messages=en_US.UTF-8 --lc-monetary=en_US.UTF-8 --lc-numeric=en_US.UTF-8 --lc-time=en_US.UTF-8"
```

```
# postgres --version
postgres (PostgreSQL) 15.15
# uname -a
FreeBSD 14.4-RELEASE FreeBSD 14.4-RELEASE (commit a456f852d145)
```

And this is the backtrace - a couple of other partial backtraces are here:
https://forums.freebsd.org/threads/trouble-bus-errors-and-segfaults-in-libc-so-from-postgres.101876/
(this apparently comes from differing locations, but seems to crash at the same place)

```
* thread #1, name = 'postgres', stop reason = signal SIGBUS
  * frame #0: 0x00000008296ad173 libc.so.7`extent_try_coalesce_impl(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, rtree_ctx=0x00003e616a8890c0, extents=0x00003e616aa058d8, extent=0x00003e616ab0ec00, coalesced=0x0000000000000000, growing_retained=<unavailable>, inactive_only=<unavailable>) at jemalloc_extent.c:0
    frame #1: 0x00000008296aa0d8 libc.so.7`extent_record [inlined] extent_try_coalesce(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, rtree_ctx=0x00003e616a8890c0, extents=0x00003e616aa058d8, extent=0x00003e616ab0ec00, coalesced=0x0000000000000000, growing_retained=<unavailable>) at jemalloc_extent.c:1678:9
    frame #2: 0x00000008296aa0c3 libc.so.7`extent_record(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, extents=0x00003e616aa058d8, extent=0x00003e616ab0ec00, growing_retained=<unavailable>) at jemalloc_extent.c:1717:12
    frame #3: 0x00000008296aaf39 libc.so.7`__je_extent_alloc_wrapper [inlined] extent_grow_retained(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, size=32768, pad=4096, alignment=<unavailable>, slab=<unavailable>, szind=<unavailable>, zero=<unavailable>, commit=<unavailable>) at jemalloc_extent.c:1383:4
    frame #4: 0x00000008296aaef7 libc.so.7`__je_extent_alloc_wrapper [inlined] extent_alloc_retained(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, new_addr=0x0000000000000000, size=32768, pad=4096, alignment=<unavailable>, slab=<unavailable>, szind=<unavailable>, zero=<unavailable>, commit=<unavailable>) at jemalloc_extent.c:1480:12
    frame #5: 0x00000008296aaef7 libc.so.7`__je_extent_alloc_wrapper(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, new_addr=0x0000000000000000, size=32768, pad=4096, alignment=64, slab=<unavailable>, szind=40, zero=0x0000000820c5bedf, commit=0x0000000820c5be87) at jemalloc_extent.c:1539:21
    frame #6: 0x0000000829687afd libc.so.7`__je_arena_extent_alloc_large(tsdn=<unavailable>, arena=0x00003e616aa00980, usize=32768, alignment=<unavailable>, zero=0x0000000820c5bedf) at jemalloc_arena.c:448:12
    frame #7: 0x00000008296afca0 libc.so.7`__je_large_palloc(tsdn=0x00003e616a889090, arena=<unavailable>, usize=<unavailable>, alignment=64, zero=<unavailable>) at jemalloc_large.c:47:43
    frame #8: 0x00000008296afb02 libc.so.7`__je_large_malloc(tsdn=<unavailable>, arena=<unavailable>, usize=<unavailable>, zero=<unavailable>) at jemalloc_large.c:17:9 [artificial]
    frame #9: 0x000000082967c8a7 libc.so.7`__je_malloc_default [inlined] tcache_alloc_large(tsd=0x00003e616a889090, arena=<unavailable>, tcache=0x00003e616a889280, size=<unavailable>, binind=<unavailable>, zero=false, slow_path=false) at tcache_inlines.h:124:9
    frame #10: 0x000000082967c82c libc.so.7`__je_malloc_default [inlined] arena_malloc(tsdn=0x00003e616a889090, arena=0x0000000000000000, size=<unavailable>, ind=<unavailable>, zero=false, tcache=0x00003e616a889280, slow_path=false) at arena_inlines_b.h:169:11
    frame #11: 0x000000082967c818 libc.so.7`__je_malloc_default [inlined] iallocztm(tsdn=0x00003e616a889090, size=<unavailable>, ind=<unavailable>, zero=false, tcache=0x00003e616a889280, is_internal=false, arena=0x0000000000000000, slow_path=false) at jemalloc_internal_inlines_c.h:53:8
    frame #12: 0x000000082967c818 libc.so.7`__je_malloc_default [inlined] imalloc_no_sample(sopts=<unavailable>, dopts=<unavailable>, tsd=0x00003e616a889090, size=<unavailable>, usize=32768, ind=<unavailable>) at jemalloc_jemalloc.c:1953:9
    frame #13: 0x000000082967c818 libc.so.7`__je_malloc_default [inlined] imalloc_body(sopts=<unavailable>, dopts=<unavailable>, tsd=0x00003e616a889090) at jemalloc_jemalloc.c:2153:16
    frame #14: 0x000000082967c818 libc.so.7`__je_malloc_default [inlined] imalloc(sopts=<unavailable>, dopts=<unavailable>) at jemalloc_jemalloc.c:2262:10
    frame #15: 0x000000082967c74a libc.so.7`__je_malloc_default(size=<unavailable>) at jemalloc_jemalloc.c:2293:2
    frame #16: 0x00000000009c0577 postgres`AllocSetContextCreateInternal + 199
    frame #17: 0x000000000058e56c postgres`StartTransaction + 332
    frame #18: 0x000000000058e3be postgres`StartTransactionCommand + 30
    frame #19: 0x00000000009a82fd postgres`InitPostgres + 461
    frame #20: 0x00000000007b8e0d postgres`AutoVacWorkerMain + 765
    frame #21: 0x00000000007b8ac7 postgres`StartAutoVacWorker + 39
    frame #22: 0x00000000007c0c21 postgres`sigusr1_handler + 785
    frame #23: 0x0000000822ae79b6 libthr.so.3`handle_signal(actp=0x0000000820c5c600, sig=30, info=0x0000000820c5c9f0, ucp=0x0000000820c5c680) at thr_sig.c:318:3
    frame #24: 0x0000000822ae6eba libthr.so.3`thr_sighandler(sig=30, info=0x0000000820c5c9f0, _ucp=0x0000000820c5c680) at thr_sig.c:261:2
    frame #25: 0x00000008210662d3
    frame #26: 0x00000000007c2545 postgres`ServerLoop + 1605
    frame #27: 0x00000000007bffa3 postgres`PostmasterMain + 3251
    frame #28: 0x0000000000720601 postgres`main + 801
    frame #29: 0x000000082958015c libc.so.7`__libc_start1(argc=4, argv=0x0000000820c5d6b0, env=0x0000000820c5d6d8, cleanup=<unavailable>, mainX=(postgres`main)) at libc_start1.c:180:7
    frame #30: 0x00000000004ff4e4 postgres`_start + 36
```
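Since the inlined jemalloc frames lose most argument values, a persistent core file from each crashing backend is probably the most useful artifact for further analysis. A sketch of the relevant FreeBSD knobs (I believe these sysctl names and `%`-specifiers are standard on 14.x; the target directory is only an example and must exist):

```
# /etc/sysctl.conf fragment - keep a distinctly named core per backend
# (%N = process name, %P = pid; /var/coredumps is an example path)
kern.corefile=/var/coredumps/%N.%P.core
```

With that in place, plus an unlimited core size limit in the postmaster's environment, each crash leaves a core that can be reopened later once full debug symbols for libc and postgres are available.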