v15b1: FailedAssertion("segment_map->header->magic == (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index)", File: "dsa.c", ..)
From
Justin Pryzby
Date:
Yes, $SUBJECT is correct.

On an old centos6 VM which I'd forgotten about and never removed from
monitoring, I noticed that a process had recently crashed...

Maybe this is an issue which was already fixed, but I looked and find no
bug report nor patch about it. Feel free to dismiss the problem report
if it's not interesting or useful.

postgres was compiled locally at 4e54d231a. It'd been running
continuously since September without crashing until a couple weeks ago
(and running nearly-continuously for months before that). The VM is
essentially idle, so maybe that's related to the crash.

TRAP: FailedAssertion("segment_map->header->magic == (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index)", File: "dsa.c", Line: 1770, PID: 24257)

postgres: telsasoft old_ts ::1(50284) authentication(ExceptionalCondition+0x91)[0x991451]
postgres: telsasoft old_ts ::1(50284) authentication[0x9b9f97]
postgres: telsasoft old_ts ::1(50284) authentication(dsa_get_address+0x92)[0x9ba192]
postgres: telsasoft old_ts ::1(50284) authentication(pgstat_get_entry_ref+0x442)[0x868892]
postgres: telsasoft old_ts ::1(50284) authentication(pgstat_prep_pending_entry+0x54)[0x862b14]
postgres: telsasoft old_ts ::1(50284) authentication(pgstat_assoc_relation+0x54)[0x866764]
postgres: telsasoft old_ts ::1(50284) authentication(_bt_first+0xb1b)[0x51399b]
postgres: telsasoft old_ts ::1(50284) authentication(btgettuple+0xc2)[0x50e792]
postgres: telsasoft old_ts ::1(50284) authentication(index_getnext_tid+0x51)[0x4ff271]
postgres: telsasoft old_ts ::1(50284) authentication(index_getnext_slot+0x72)[0x4ff442]
postgres: telsasoft old_ts ::1(50284) authentication(systable_getnext+0x132)[0x4fe282]
postgres: telsasoft old_ts ::1(50284) authentication[0x9775cb]
postgres: telsasoft old_ts ::1(50284) authentication(SearchCatCache+0x20d)[0x978e6d]
postgres: telsasoft old_ts ::1(50284) authentication(GetSysCacheOid+0x30)[0x98c7c0]
postgres: telsasoft old_ts ::1(50284) authentication(get_role_oid+0x2d)[0x86b1ad]
postgres: telsasoft old_ts ::1(50284) authentication(hba_getauthmethod+0x22)[0x6f5592]
postgres: telsasoft old_ts ::1(50284) authentication(ClientAuthentication+0x39)[0x6f1f59]
postgres: telsasoft old_ts ::1(50284) authentication(InitPostgres+0x8c6)[0x9a33d6]
postgres: telsasoft old_ts ::1(50284) authentication(PostgresMain+0x109)[0x84bb79]
postgres: telsasoft old_ts ::1(50284) authentication(PostmasterMain+0x1a6a)[0x7ac1aa]
postgres: telsasoft old_ts ::1(50284) authentication(main+0x461)[0x6fe5a1]
/lib64/libc.so.6(__libc_start_main+0x100)[0x36e041ed20]

(gdb) bt
#0  0x00000036e04324f5 in raise () from /lib64/libc.so.6
#1  0x00000036e0433cd5 in abort () from /lib64/libc.so.6
#2  0x0000000000991470 in ExceptionalCondition (conditionName=<value optimized out>, errorType=<value optimized out>, fileName=<value optimized out>, lineNumber=1770) at assert.c:69
#3  0x00000000009b9f97 in get_segment_by_index (area=0x22818c0, index=<value optimized out>) at dsa.c:1769
#4  0x00000000009ba192 in dsa_get_address (area=0x22818c0, dp=1099511703168) at dsa.c:953
#5  0x0000000000868892 in pgstat_get_entry_ref (kind=PGSTAT_KIND_RELATION, dboid=<value optimized out>, objoid=<value optimized out>, create=true, created_entry=0x0) at pgstat_shmem.c:508
#6  0x0000000000862b14 in pgstat_prep_pending_entry (kind=PGSTAT_KIND_RELATION, dboid=0, objoid=2676, created_entry=0x0) at pgstat.c:1067
#7  0x0000000000866764 in pgstat_prep_relation_pending (rel=0x22beba8) at pgstat_relation.c:855
#8  pgstat_assoc_relation (rel=0x22beba8) at pgstat_relation.c:138
#9  0x000000000051399b in _bt_first (scan=0x22eb5c8, dir=ForwardScanDirection) at nbtsearch.c:882
#10 0x000000000050e792 in btgettuple (scan=0x22eb5c8, dir=ForwardScanDirection) at nbtree.c:243
#11 0x00000000004ff271 in index_getnext_tid (scan=0x22eb5c8, direction=<value optimized out>) at indexam.c:533
#12 0x00000000004ff442 in index_getnext_slot (scan=0x22eb5c8, direction=ForwardScanDirection, slot=0x22eb418) at indexam.c:625
#13 0x00000000004fe282 in systable_getnext (sysscan=0x22eb3c0) at genam.c:511
#14 0x00000000009775cb in SearchCatCacheMiss (cache=0x229e280, nkeys=<value optimized out>, hashValue=3877703461, hashIndex=5, v1=<value optimized out>, v2=<value optimized out>, v3=0, v4=0) at catcache.c:1364
#15 0x0000000000978e6d in SearchCatCacheInternal (cache=0x229e280, v1=36056248, v2=0, v3=0, v4=0) at catcache.c:1295
#16 SearchCatCache (cache=0x229e280, v1=36056248, v2=0, v3=0, v4=0) at catcache.c:1149
#17 0x000000000098c7c0 in GetSysCacheOid (cacheId=10, oidcol=<value optimized out>, key1=<value optimized out>, key2=<value optimized out>, key3=<value optimized out>, key4=<value optimized out>) at syscache.c:1293
#18 0x000000000086b1ad in get_role_oid (rolname=0x2262cb8 "telsasoft", missing_ok=true) at acl.c:5181
#19 0x00000000006f5592 in check_hba (port=<value optimized out>) at hba.c:2100
#20 hba_getauthmethod (port=<value optimized out>) at hba.c:2699
#21 0x00000000006f1f59 in ClientAuthentication (port=0x224b5d0) at auth.c:396
#22 0x00000000009a33d6 in PerformAuthentication (in_dbname=0x2268a48 "old_ts", dboid=0, username=0x2262cb8 "telsasoft", useroid=0, out_dbname=0x0, override_allow_connections=false) at postinit.c:245
#23 InitPostgres (in_dbname=0x2268a48 "old_ts", dboid=0, username=0x2262cb8 "telsasoft", useroid=0, out_dbname=0x0, override_allow_connections=false) at postinit.c:836
#24 0x000000000084bb79 in PostgresMain (dbname=0x2268a48 "old_ts", username=0x2262cb8 "telsasoft") at postgres.c:4130
#25 0x00000000007ac1aa in BackendRun (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4504
#26 BackendStartup (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:4232
#27 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1806
#28 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1478
#29 0x00000000006fe5a1 in main (argc=3, argv=0x22239a0) at main.c:202

Unfortunately:

(gdb) p area->control->handle
$3 = 0
(gdb) p segment_map->header->magic
value has been optimized out
(gdb) p index
$4 = <value optimized out>

--
Justin
Re: v15b1: FailedAssertion("segment_map->header->magic == (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index)", File: "dsa.c", ..)
From
Andres Freund
Date:
Hi,

On 2023-04-11 14:46:23 -0500, Justin Pryzby wrote:
> Yes, $SUBJECT is correct.
>
> On an old centos6 VM which I'd forgotten about and never removed from
> monitoring, I noticed that a process had recently crashed...
>
> Maybe this is an issue which was already fixed, but I looked and find no
> bug report nor patch about it. Feel free to dismiss the problem report
> if it's not interesting or useful.
>
> postgres was compiled locally at 4e54d231a. It'd been running
> continuously since September without crashing until a couple weeks ago
> (and running nearly-continuously for months before that).

It possibly could be:

Author: Andres Freund <andres@anarazel.de>
Branch: master [cb2e7ddfe] 2022-12-02 18:10:30 -0800
Branch: REL_15_STABLE Release: REL_15_2 [c6a60471a] 2022-12-02 18:07:47 -0800
Branch: REL_14_STABLE Release: REL_14_7 [6344bc097] 2022-12-02 18:10:30 -0800
Branch: REL_13_STABLE Release: REL_13_10 [7944d2d8c] 2022-12-02 18:13:40 -0800
Branch: REL_12_STABLE Release: REL_12_14 [35b99a18f] 2022-12-02 18:16:14 -0800
Branch: REL_11_STABLE Release: REL_11_19 [af3517c15] 2022-12-02 18:17:54 -0800

    Prevent pgstats from getting confused when relkind of a relation changes

But the fact that it's on a catalog table's stats makes it less likely,
although not impossible.

Any chance there were conversions from tables to views in that connection?

Greetings,

Andres Freund
Re: v15b1: FailedAssertion("segment_map->header->magic == (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index)", File: "dsa.c", ..)
From
Andres Freund
Date:
Hi,

On 2023-04-11 13:35:38 -0700, Andres Freund wrote:
> On 2023-04-11 14:46:23 -0500, Justin Pryzby wrote:
> > Yes, $SUBJECT is correct.
> >
> > On an old centos6 VM which I'd forgotten about and never removed from
> > monitoring, I noticed that a process had recently crashed...
> >
> > Maybe this is an issue which was already fixed, but I looked and find no
> > bug report nor patch about it. Feel free to dismiss the problem report
> > if it's not interesting or useful.
> >
> > postgres was compiled locally at 4e54d231a. It'd been running
> > continuously since September without crashing until a couple weeks ago
> > (and running nearly-continuously for months before that).
>
> It possibly could be:
>
> Author: Andres Freund <andres@anarazel.de>
> Branch: master [cb2e7ddfe] 2022-12-02 18:10:30 -0800
> Branch: REL_15_STABLE Release: REL_15_2 [c6a60471a] 2022-12-02 18:07:47 -0800
> Branch: REL_14_STABLE Release: REL_14_7 [6344bc097] 2022-12-02 18:10:30 -0800
> Branch: REL_13_STABLE Release: REL_13_10 [7944d2d8c] 2022-12-02 18:13:40 -0800
> Branch: REL_12_STABLE Release: REL_12_14 [35b99a18f] 2022-12-02 18:16:14 -0800
> Branch: REL_11_STABLE Release: REL_11_19 [af3517c15] 2022-12-02 18:17:54 -0800
>
>     Prevent pgstats from getting confused when relkind of a relation changes
>
> But the fact that it's on a catalog table's stats makes it less likely,
> although not impossible.
>
> Any chance there were conversions from tables to views in that connection?

Nope, not possible - the stack trace actually shows this is during connection
establishment.

Thomas, see stack trace upthread?

Greetings,

Andres Freund
Re: v15b1: FailedAssertion("segment_map->header->magic == (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index)", File: "dsa.c", ..)
From
Thomas Munro
Date:
On Wed, Apr 12, 2023 at 7:46 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Unfortunately:
> (gdb) p area->control->handle
> $3 = 0
> (gdb) p segment_map->header->magic
> value has been optimized out
> (gdb) p index
> $4 = <value optimized out>

Hmm, well index I can find from parameters:

> #2 0x0000000000991470 in ExceptionalCondition (conditionName=<value optimized out>, errorType=<value optimized out>, fileName=<value optimized out>, lineNumber=1770) at assert.c:69
> #3 0x00000000009b9f97 in get_segment_by_index (area=0x22818c0, index=<value optimized out>) at dsa.c:1769
> #4 0x00000000009ba192 in dsa_get_address (area=0x22818c0, dp=1099511703168) at dsa.c:953

We have dp=1099511703168 == 0x10000012680, so index == 1 and the rest is the
offset into that segment. It's not the initial segment in the main shared
memory area created by the postmaster with dsa_create_in_place() (that'd be
index 0), it's in an extra segment that was created with shm_open(). We
managed to open and mmap() that segment, but it contains unexpected garbage.

Can you print *area->control? And then can you see that the DSM handle is in
index 1 in "segment_handles" in there? Then can you see if your system has a
file with that number in its name under /dev/shm/, and can you tell me what
"od -c /dev/shm/..." shows as the first few lines of stuff at the top, so we
can see what that unexpected garbage looks like?

Side rant: I don't think there's any particular indication that it's the
issue here, but while it's on my mind: I really wish we didn't use random
numbers for DSM handles. I understand where it came from: the need to manage
SysV shmem keyspace (a DSM mode that almost nobody uses, but whose
limitations apply to all modes). We've debugged issues relating to handle
collisions before, causing unrelated DSM segments to be confused, back when
the random seed was not different in each backend, making collisions likely.
For every other mode, we could instead use something like (slot, generation)
to keep collisions as far apart as possible (generation wraparound), and
avoid collisions between unrelated clusters by using the pgdata path as a
shm_open() prefix.

Another idea is to add a new DSM mode that would use memfd and similar things
and pass fds between backends, so that the segments are entirely anonymous
and don't need to be cleaned up after a crash (I thought about that while
studying the reasons why PostgreSQL can't run on Capsicum (a capabilities
research project) or Android (a telephone), both of which banned SysV *and*
POSIX shm because system-global namespaces are bad).
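For readers following the arithmetic: the decoding above can be reproduced
with plain shell arithmetic. This is only a sketch; it assumes the 64-bit
dsa_pointer layout from dsa.c (a 40-bit in-segment offset, per
DSA_OFFSET_WIDTH, with the segment index in the high bits), which is not
something the crash output itself states.

# Sketch: decode a 64-bit dsa_pointer, assuming the low 40 bits are the
# in-segment offset and the high bits are the segment index.
$ dp=1099511703168
$ echo "segment index: $(( dp >> 40 ))"
segment index: 1
$ printf 'offset: 0x%x\n' $(( dp & ((1 << 40) - 1) ))
offset: 0x12680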
Re: v15b1: FailedAssertion("segment_map->header->magic == (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index)", File: "dsa.c", ..)
From
Justin Pryzby
Date:
On Wed, Apr 12, 2023 at 11:18:36AM +1200, Thomas Munro wrote:
> Can you print *area->control?

(gdb) p *area->control
$1 = {segment_header = {magic = 216163848, usable_pages = 62, size = 1048576,
    prev = 1, next = 18446744073709551615, bin = 4, freed = false},
  handle = 0,
  segment_handles = {0, 3696856876, 433426374, 1403332952, 2754923922, 0 <repeats 1019 times>},
  segment_bins = {18446744073709551615, 18446744073709551615, 18446744073709551615,
    18446744073709551615, 3, 4, 18446744073709551615, 18446744073709551615,
    18446744073709551615, 18446744073709551615, 18446744073709551615,
    18446744073709551615, 18446744073709551615, 18446744073709551615,
    18446744073709551615, 18446744073709551615},
  pools = {
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 2199025295360, 8192, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 2199025296088, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 2199025296648, 2199025298608, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 2199025298496, 2199025298664, 2199025297936}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 8416, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 8304, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 8528, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 8248, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 8584, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 0, 0, 0}},
    {lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}, spans = {0, 8640, 0, 0}}},
  total_segment_size = 9699328, max_total_segment_size = 18446744073709551615,
  high_segment_index = 4, refcnt = 8455469, pinned = true,
  freed_segment_counter = 0, lwlock_tranche_id = 72,
  lock = {tranche = 72, state = {value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}}

> And then can you see that the DSM handle is in index 1 in "segment_handles"
> in there?

(gdb) p area->control->segment_handles
$2 = {0, 3696856876, 433426374, 1403332952, 2754923922, 0 <repeats 1019 times>}

> Then can you see if your system has a file with that number in its name under
> /dev/shm/,

$ ls /dev/shm/ |grep 3696856876 || echo not found
not found

(In case it matters: the vm has been up for 1558 days).

If it's helpful, I could provide the corefile, unstripped binaries, and
libc.so, which would be enough to use gdb on your side with "set
solib-search-path".

--
Justin
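(For anyone wanting to repeat that kind of offline inspection: the idea is to
point gdb at the copied binaries and libraries instead of the local system
ones. A minimal sketch, where every path is a placeholder:)

# All paths below are hypothetical; the point is that gdb resolves shared
# libraries from the copied directory rather than the local /lib64.
$ gdb -ex 'set solib-search-path /path/to/copied/libs' \
      -ex 'bt' \
      /path/to/unstripped/postgres /path/to/corefile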
Re: v15b1: FailedAssertion("segment_map->header->magic == (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index)", File: "dsa.c", ..)
From
Thomas Munro
Date:
On Wed, Apr 12, 2023 at 11:37 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> $ ls /dev/shm/ |grep 3696856876 || echo not found
> not found

Oh, of course it would have restarted after it crashed and unlinked that...
So the remaining traces of that memory *might* be in the core file, depending
(IIRC) on the core filter settings (you definitely get shared anonymous memory
like our main shm region by default, but IIRC there's something extra needed
if you want the shm_open'd DSM segments to be dumped too...)

> (In case it matters: the vm has been up for 1558 days).

I will refrain from invoking cosmic radiation at this point :-)

> If it's helpful, I could provide the corefile, unstripped binaries, and
> libc.so, which would be enough to use gdb on your side with "set
> solib-search-path".

Sounds good, thanks, please send them over off-list and I'll see if I can
figure anything out ...
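The "something extra" referred to here is the per-process coredump filter.
The sketch below is based on the bit definitions in core(5) and uses a
made-up pid; whether abrt preserves the extra mappings on this particular
system is a separate question.

# /proc/<pid>/coredump_filter decides which mappings are written to a core.
# Bit 3 (0x8) covers file-backed shared mappings, which is what the
# shm_open()'d /dev/shm/PostgreSQL.* segments are; the usual default of 0x33
# leaves them out.  The setting is inherited across fork(), so setting it on
# the postmaster covers future backends.
$ cat /proc/12345/coredump_filter
00000033
$ echo 0x3b > /proc/12345/coredump_filter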
Re: v15b1: FailedAssertion("segment_map->header->magic == (DSA_SEGMENT_HEADER_MAGIC ^ area->control->handle ^ index)", File: "dsa.c", ..)
From
Justin Pryzby
Date:
On Wed, Apr 12, 2023 at 11:49:51AM +1200, Thomas Munro wrote:
> On Wed, Apr 12, 2023 at 11:37 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > $ ls /dev/shm/ |grep 3696856876 || echo not found
> > not found
>
> Oh, of course it would have restarted after it crashed and unlinked
> that... So the remaining traces of that memory *might* be in the core
> file, depending (IIRC) on the core filter settings (you definitely get
> shared anonymous memory like our main shm region by default, but IIRC
> there's something extra needed if you want the shm_open'd DSM segments
> to be dumped too...)

I scrounged around and found:
/var/spool/abrt/ccpp-2023-03-11-13:07:02-24257/maps

Which has (including the size):

$ sudo cat /var/spool/abrt/ccpp-2023-03-11-13:07:02-24257/maps |awk --non-decimal-data -F'[- ]' '/shm|zero|SYSV/{a="0x"$1;b="0x"$2; print $0,0+b-a}'
7fd39c981000-7fd39ca81000 rw-s 00000000 00:0f 1698691690 /dev/shm/PostgreSQL.3696856876 1048576
7fd39ca81000-7fd39cc81000 rw-s 00000000 00:0f 1699005881 /dev/shm/PostgreSQL.433426374 2097152
7fd39cc81000-7fd39d081000 rw-s 00000000 00:0f 2443340900 /dev/shm/PostgreSQL.2754923922 4194304
7fd39d081000-7fd3b6a09000 rw-s 00000000 00:04 1698308066 /dev/zero (deleted) 429424640
7fd3bcf58000-7fd3bcf63000 rw-s 00000000 00:0f 1698308074 /dev/shm/PostgreSQL.2386569568 45056
7fd3bcf63000-7fd3bcf64000 rw-s 00000000 00:04 9732096 /SYSV0001581b (deleted) 4096

But except for the last one, none of these are available in the corefile.

--
Justin
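One way to confirm that from the core itself rather than from gdb: the kernel
still writes a program header for every mapping, and mappings skipped by the
coredump filter simply get a zero file size. A sketch, assuming abrt saved
the core as "coredump" next to the maps file (the path is a guess):

# PT_LOAD entries whose FileSiz is 0 were mapped in the process but their
# contents were not written to the core.
$ readelf --segments /var/spool/abrt/ccpp-2023-03-11-13:07:02-24257/coredump |
      grep -A1 LOAD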