Re: pg11.1: dsa_area could not attach to segment - Mailing list pgsql-hackers

From Justin Pryzby
Subject Re: pg11.1: dsa_area could not attach to segment
Date
Msg-id 20190212021428.GA31721@telsasoft.com
Whole thread Raw
In response to Re: pg11.1: dsa_area could not attach to segment  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: pg11.1: dsa_area could not attach to segment
Re: pg11.1: dsa_area could not attach to segment
List pgsql-hackers
On Tue, Feb 12, 2019 at 10:57:51AM +1100, Thomas Munro wrote:
> > On current REL_11_STABLE branch with PANIC level i see this backtrace for failed parallel process:
> >
> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
> > #1  0x00007f3b36983535 in __GI_abort () at abort.c:79
> > #2  0x000055f03ab87a4e in errfinish (dummy=dummy@entry=0) at elog.c:555
> > #3  0x000055f03ab899e0 in elog_finish (elevel=elevel@entry=22, fmt=fmt@entry=0x55f03ad86900 "dsa_area could not
attachto segment") at elog.c:1376
 
> > #4  0x000055f03abaa1e2 in get_segment_by_index (area=area@entry=0x55f03cdd6bf0, index=index@entry=7) at dsa.c:1743
> > #5  0x000055f03abaa8ab in get_best_segment (area=area@entry=0x55f03cdd6bf0, npages=npages@entry=8) at dsa.c:1993
> > #6  0x000055f03ababdb8 in dsa_allocate_extended (area=0x55f03cdd6bf0, size=size@entry=32768, flags=flags@entry=0)
atdsa.c:701
 
> 
> Ok, this contains some clues I didn't have before.  Here we see that a
> request for a 32KB chunk of memory led to a traversal the linked list
> of segments in a given bin, and at some point we followed a link to
> segment index number 7, which turned out to be bogus.  We tried to
> attach to the segment whose handle is stored in
> area->control->segment_handles[7] and it was not known to dsm.c.  It
> wasn't DSM_HANDLE_INVALID, or you'd have got a different error
> message.  That means that it wasn't a segment that had been freed by
> destroy_superblock(), or it'd hold DSM_HANDLE_INVALID.
> 
> Hmm.  So perhaps the bin list was corrupted (the segment index was bad

I think there is corruption *somewhere* due to never being able to do
this (and looks very broken?)

(gdb) p segment_map
$1 = (dsa_segment_map *) 0x1

(gdb) print segment_map->header 
Cannot access memory at address 0x11

> Can we please see the stderr output of dsa_dump(area), added just
> before the PANIC?  Can we see the value of "handle" when the error is
> raised, and the directory listing for /dev/shm (assuming Linux) after
> the crash (maybe you need restart_after_crash = off to prevent
> automatic cleanup)?

PANIC:  dsa_area could not attach to segment index:8 handle:1076305344

I think it needs to be:

|               if (segment == NULL) {
|                       LWLockRelease(DSA_AREA_LOCK(area));
|                       dsa_dump(area);
|                       elog(PANIC, "dsa_area could not attach to segment index:%zd handle:%d", index, handle);
|               }

..but that triggers recursion:

#0  0x00000037b9c32495 in raise () from /lib64/libc.so.6
#1  0x00000037b9c33c75 in abort () from /lib64/libc.so.6
#2  0x0000000000a395c0 in errfinish (dummy=0) at elog.c:567
#3  0x0000000000a3bbf6 in elog_finish (elevel=22, fmt=0xc9faa0 "dsa_area could not attach to segment index:%zd
handle:%d")at elog.c:1389
 
#4  0x0000000000a6b97a in get_segment_by_index (area=0x1659200, index=8) at dsa.c:1747
#5  0x0000000000a6a3dc in dsa_dump (area=0x1659200) at dsa.c:1093
#6  0x0000000000a6b946 in get_segment_by_index (area=0x1659200, index=8) at dsa.c:1744
[...]
#717 0x0000000000a6a3dc in dsa_dump (area=0x1659200) at dsa.c:1093
#718 0x0000000000a6b946 in get_segment_by_index (area=0x1659200, index=8) at dsa.c:1744
#719 0x0000000000a6a3dc in dsa_dump (area=0x1659200) at dsa.c:1093
#720 0x0000000000a6b946 in get_segment_by_index (area=0x1659200, index=8) at dsa.c:1744
#721 0x0000000000a6c150 in get_best_segment (area=0x1659200, npages=8) at dsa.c:1997
#722 0x0000000000a69680 in dsa_allocate_extended (area=0x1659200, size=32768, flags=0) at dsa.c:701
#723 0x00000000007052eb in ExecParallelHashTupleAlloc (hashtable=0x7f56ff9b40e8, size=112, shared=0x7fffda8c36a0) at
nodeHash.c:2837
#724 0x00000000007034f3 in ExecParallelHashTableInsert (hashtable=0x7f56ff9b40e8, slot=0x1608948, hashvalue=2677813320)
atnodeHash.c:1693
 
#725 0x0000000000700ba3 in MultiExecParallelHash (node=0x1607f40) at nodeHash.c:288
#726 0x00000000007007ce in MultiExecHash (node=0x1607f40) at nodeHash.c:112
#727 0x00000000006e94d7 in MultiExecProcNode (node=0x1607f40) at execProcnode.c:501
[...]

[pryzbyj@telsasoft-db postgresql]$ ls -lt /dev/shm |head
total 353056
-rw-------. 1 pryzbyj  pryzbyj   1048576 Feb 11 13:51 PostgreSQL.821164732
-rw-------. 1 pryzbyj  pryzbyj   2097152 Feb 11 13:51 PostgreSQL.1990121974
-rw-------. 1 pryzbyj  pryzbyj   2097152 Feb 11 12:54 PostgreSQL.847060172
-rw-------. 1 pryzbyj  pryzbyj   2097152 Feb 11 12:48 PostgreSQL.1369859581
-rw-------. 1 postgres postgres    21328 Feb 10 21:00 PostgreSQL.1155375187
-rw-------. 1 pryzbyj  pryzbyj    196864 Feb 10 18:52 PostgreSQL.2136009186
-rw-------. 1 pryzbyj  pryzbyj   2097152 Feb 10 18:49 PostgreSQL.1648026537
-rw-------. 1 pryzbyj  pryzbyj   2097152 Feb 10 18:49 PostgreSQL.827867206
-rw-------. 1 pryzbyj  pryzbyj   2097152 Feb 10 18:49 PostgreSQL.1684837530

Justin


pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Reporting script runtimes in pg_regress
Next
From: Justin Pryzby
Date:
Subject: Re: pg11.1: dsa_area could not attach to segment