Re: pg11.1: dsa_area could not attach to segment - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: pg11.1: dsa_area could not attach to segment |
Date | |
Msg-id | CAEepm=34-Kq+fLi+0FTmS-nBdhC2d-z1i7Pe2ZDD4MbEWQeE9g@mail.gmail.com Whole thread Raw |
In response to | Re: pg11.1: dsa_area could not attach to segment (Justin Pryzby <pryzby@telsasoft.com>) |
List | pgsql-hackers |
On Tue, Feb 12, 2019 at 1:14 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Tue, Feb 12, 2019 at 10:57:51AM +1100, Thomas Munro wrote: > > > On current REL_11_STABLE branch with PANIC level i see this backtrace for failed parallel process: > > > > > > #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 > > > #1 0x00007f3b36983535 in __GI_abort () at abort.c:79 > > > #2 0x000055f03ab87a4e in errfinish (dummy=dummy@entry=0) at elog.c:555 > > > #3 0x000055f03ab899e0 in elog_finish (elevel=elevel@entry=22, fmt=fmt@entry=0x55f03ad86900 "dsa_area could not attachto segment") at elog.c:1376 > > > #4 0x000055f03abaa1e2 in get_segment_by_index (area=area@entry=0x55f03cdd6bf0, index=index@entry=7) at dsa.c:1743 > > > #5 0x000055f03abaa8ab in get_best_segment (area=area@entry=0x55f03cdd6bf0, npages=npages@entry=8) at dsa.c:1993 > > > #6 0x000055f03ababdb8 in dsa_allocate_extended (area=0x55f03cdd6bf0, size=size@entry=32768, flags=flags@entry=0) atdsa.c:701 > > > > Ok, this contains some clues I didn't have before. Here we see that a > > request for a 32KB chunk of memory led to a traversal the linked list > > of segments in a given bin, and at some point we followed a link to > > segment index number 7, which turned out to be bogus. We tried to > > attach to the segment whose handle is stored in > > area->control->segment_handles[7] and it was not known to dsm.c. It > > wasn't DSM_HANDLE_INVALID, or you'd have got a different error > > message. That means that it wasn't a segment that had been freed by > > destroy_superblock(), or it'd hold DSM_HANDLE_INVALID. > > > > Hmm. So perhaps the bin list was corrupted (the segment index was bad > > I think there is corruption *somewhere* due to never being able to do > this (and looks very broken?) > > (gdb) p segment_map > $1 = (dsa_segment_map *) 0x1 > > (gdb) print segment_map->header > Cannot access memory at address 0x11 If you're in get_segment_by_index() in a core dumped at the "could not attach" error, the variable segment_map hasn't been assigned a value yet so that's uninitialised junk on the stack. > > Can we please see the stderr output of dsa_dump(area), added just > > before the PANIC? Can we see the value of "handle" when the error is > > raised, and the directory listing for /dev/shm (assuming Linux) after > > the crash (maybe you need restart_after_crash = off to prevent > > automatic cleanup)? > > PANIC: dsa_area could not attach to segment index:8 handle:1076305344 > > I think it needs to be: > > | if (segment == NULL) { > | LWLockRelease(DSA_AREA_LOCK(area)); > | dsa_dump(area); > | elog(PANIC, "dsa_area could not attach to segment index:%zd handle:%d", index, handle); > | } Right. > ..but that triggers recursion: > > #0 0x00000037b9c32495 in raise () from /lib64/libc.so.6 > #1 0x00000037b9c33c75 in abort () from /lib64/libc.so.6 > #2 0x0000000000a395c0 in errfinish (dummy=0) at elog.c:567 > #3 0x0000000000a3bbf6 in elog_finish (elevel=22, fmt=0xc9faa0 "dsa_area could not attach to segment index:%zd handle:%d")at elog.c:1389 > #4 0x0000000000a6b97a in get_segment_by_index (area=0x1659200, index=8) at dsa.c:1747 > #5 0x0000000000a6a3dc in dsa_dump (area=0x1659200) at dsa.c:1093 > #6 0x0000000000a6b946 in get_segment_by_index (area=0x1659200, index=8) at dsa.c:1744 Ok, that makes sense. > [pryzbyj@telsasoft-db postgresql]$ ls -lt /dev/shm |head > total 353056 > -rw-------. 1 pryzbyj pryzbyj 1048576 Feb 11 13:51 PostgreSQL.821164732 > -rw-------. 1 pryzbyj pryzbyj 2097152 Feb 11 13:51 PostgreSQL.1990121974 > -rw-------. 1 pryzbyj pryzbyj 2097152 Feb 11 12:54 PostgreSQL.847060172 > -rw-------. 1 pryzbyj pryzbyj 2097152 Feb 11 12:48 PostgreSQL.1369859581 > -rw-------. 1 postgres postgres 21328 Feb 10 21:00 PostgreSQL.1155375187 > -rw-------. 1 pryzbyj pryzbyj 196864 Feb 10 18:52 PostgreSQL.2136009186 > -rw-------. 1 pryzbyj pryzbyj 2097152 Feb 10 18:49 PostgreSQL.1648026537 > -rw-------. 1 pryzbyj pryzbyj 2097152 Feb 10 18:49 PostgreSQL.827867206 > -rw-------. 1 pryzbyj pryzbyj 2097152 Feb 10 18:49 PostgreSQL.1684837530 Hmm. While contemplating this evidence that you have multiple postgres clusters running, I started wondering if there could be some way for two different DSA areas to get their DSM segments mixed up, perhaps related to the way that 11 generates identical sequences of random() numbers in all parallel workers. But I'm not seeing it; the O_CREAT | O_EXCL protocol seems to avoid that (not to mention permissions). You truncated that list at 10 lines with "head"... do you know if PostgreSQL.1076305344 was present? Or if you do it again, the one who name matches the value of "handle". -- Thomas Munro http://www.enterprisedb.com
pgsql-hackers by date: