Re: SegFault on 9.6.14 - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: SegFault on 9.6.14
Msg-id 20190716082204.iecyfh6mhu6khdhz@development
In response to Re: SegFault on 9.6.14  (Jerry Sievers <gsievers19@comcast.net>)
Responses Re: SegFault on 9.6.14  (Thomas Munro <thomas.munro@gmail.com>)
Re: SegFault on 9.6.14  (Jerry Sievers <gsievers19@comcast.net>)
List pgsql-hackers
On Mon, Jul 15, 2019 at 08:20:00PM -0500, Jerry Sievers wrote:
>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>
>> On Mon, Jul 15, 2019 at 07:22:55PM -0500, Jerry Sievers wrote:
>>
>>>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>>
>>>> On Mon, Jul 15, 2019 at 06:48:05PM -0500, Jerry Sievers wrote:
>>>>
>>>>>Greetings Hackers.
>>>>>
>>>>>We have a reproducible case of $subject that issues a backtrace such as
>>>>>seen below.
>>>>>
>>>>>The query, which I'd prefer to sanitize before sending, is <30 lines of, at
>>>>>a glance, not terribly complex logic.
>>>>>
>>>>>It nonetheless dies hard after a few seconds of running and as expected,
>>>>>results in an automatic all-backend restart.
>>>>>
>>>>>Please advise on how to proceed.  Thanks!
>>>>>
>>>>>bt
>>>>>#0  initscan (scan=scan@entry=0x55d7a7daa0b0, key=0x0, keep_startblock=keep_startblock@entry=1 '\001')
>>>>>    at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/access/heap/heapam.c:233
>>>>>#1  0x000055d7a72fa8d0 in heap_rescan (scan=0x55d7a7daa0b0, key=key@entry=0x0)
>>>>>    at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/access/heap/heapam.c:1529
>>>>>#2  0x000055d7a7451fef in ExecReScanSeqScan (node=node@entry=0x55d7a7d85100)
>>>>>    at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeSeqscan.c:280
>>>>>#3  0x000055d7a742d36e in ExecReScan (node=0x55d7a7d85100)
>>>>>    at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:158
>>>>>#4  0x000055d7a7445d38 in ExecReScanGather (node=node@entry=0x55d7a7d84d30)
>>>>>    at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeGather.c:475
>>>>>#5  0x000055d7a742d255 in ExecReScan (node=0x55d7a7d84d30)
>>>>>    at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:166
>>>>>#6  0x000055d7a7448673 in ExecReScanHashJoin (node=node@entry=0x55d7a7d84110)
>>>>>    at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/nodeHashjoin.c:1019
>>>>>#7  0x000055d7a742d29e in ExecReScan (node=node@entry=0x55d7a7d84110)
>>>>>    at /build/postgresql-9.6-5O8OLM/postgresql-9.6-9.6.14/build/../src/backend/executor/execAmi.c:226
>>>>><about 30 lines omitted>
>>>>>
>>>>
>>>> Hmmm, that means it's crashing here:
>>>>
>>>>    if (scan->rs_parallel != NULL)
>>>>        scan->rs_nblocks = scan->rs_parallel->phs_nblocks;     <--- here
>>>>    else
>>>>        scan->rs_nblocks = RelationGetNumberOfBlocks(scan->rs_rd);
>>>>
>>>> But clearly, scan is valid (otherwise it'd crash on the if condition),
>>>> and scan->rs_parallel must be non-NULL. Which probably means the pointer
>>>> is (no longer) valid.
>>>>
>>>> Could it be that the rs_parallel DSM disappears on rescan, or something
>>>> like that?
>>>
>>>No clue, but something I just tried was to disable parallelism by setting
>>>max_parallel_workers_per_gather to 0. Although the query has not
>>>finished after a few minutes, there is no crash.
>>>
>>
>> That might be a hint my rough analysis was somewhat correct. The
>> question is whether the non-parallel plan does the same thing. Maybe it
>> picks a plan that does not require rescans, or something like that.
>>
>>>Please advise.
>>>
>>
>> It would be useful to see (a) the execution plan of the query, (b) a full
>> backtrace and (c) a bit of context for the place where it crashed.
>>
>> Something like (in gdb):
>>
>>    bt full
>>    list
>>    p *scan
>
>The p *scan did nothing unless I ran it first; however, my gdb $foo isn't
>strong presently.

Hmm, the rs_parallel pointer looks sane (it's not obvious garbage). Can
you try this?

   p *scan->rs_parallel

Another question - are you sure this is not an OOM issue? That might
sometimes look like SIGSEGV due to overcommit. What's the memory
consumption / is there anything in dmesg?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
