Re: [sqlsmith] Unpinning error in parallel worker - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: [sqlsmith] Unpinning error in parallel worker
Msg-id CAEepm=0gtExezsnVabv79hKSzn61dbdEbzWRxqyJf1nf8hzppQ@mail.gmail.com
In response to Re: [sqlsmith] Unpinning error in parallel worker  (Jonathan Rudenberg <jonathan@titanous.com>)
Responses Re: [sqlsmith] Unpinning error in parallel worker
List pgsql-hackers
On Wed, Apr 18, 2018 at 8:52 AM, Jonathan Rudenberg
<jonathan@titanous.com> wrote:
> Hundreds of queries stuck with a wait_event of DynamicSharedMemoryControlLock and pg_terminate_backend did not terminate the queries.
>
> In the log:
>
>> FATAL:  cannot unpin a segment that is not pinned

Thanks for the report.  That error is reachable via two paths:

1.  Cleanup of a DSA area at the end of a query, giving back all
segments.  This is how the bug originally reported in this thread
reached it, and that's because of a case where we tried to
double-destroy the DSA area when refcount went down to zero, then back
up again, and then back to zero (a late-starting parallel worker that
attached in a narrow time window).  That was fixed in fddf45b3: once
it reaches zero we recognise it as already destroyed and don't even
let anyone attach.

2.  In destroy_superblock(), called by dsa_free(), at the point where
we've determined that a 64kB superblock can be given back to the DSM
segment, and that the DSM segment is now entirely free so can be given
back to the operating system.  To do that, after we put the pages back
into the free page manager we test fpm_largest(segment_map->fpm) ==
segment_map->header->usable_pages to see if the largest span of free
pages is now the same size as the whole segment.
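
To make the two paths concrete, here is a toy model in C.  It is only
a sketch under stated assumptions, not the code in dsa.c: ToyArea,
ToySegment and every toy_* function are hypothetical names invented
for illustration, and a plain bool array stands in for the free page
manager.  The first half models the rule from path 1 (once the
reference count reaches zero, nobody may attach again); the second
half models the path 2 test (the segment counts as entirely free only
when the largest run of free pages equals usable_pages).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Path 1 (hypothetical model): reference-counted area. */
typedef struct ToyArea
{
	int			refcnt;			/* number of attached backends */
	int			destroy_count;	/* times cleanup ran; must never exceed 1 */
} ToyArea;

/* Refuse to attach once the count has reached zero. */
static bool
toy_attach(ToyArea *area)
{
	if (area->refcnt == 0)
		return false;			/* treated as already destroyed */
	area->refcnt++;
	return true;
}

/* Drop one reference; the last one out runs cleanup exactly once. */
static void
toy_detach(ToyArea *area)
{
	assert(area->refcnt > 0);
	if (--area->refcnt == 0)
		area->destroy_count++;
}

/* Path 2 (hypothetical model): one flag per usable page. */
typedef struct ToySegment
{
	size_t		usable_pages;
	const bool *page_is_free;	/* usable_pages entries */
} ToySegment;

/* Length of the longest run of consecutive free pages. */
static size_t
toy_largest_free_run(const ToySegment *seg)
{
	size_t		best = 0;
	size_t		run = 0;

	assert(seg != NULL);
	for (size_t i = 0; i < seg->usable_pages; i++)
	{
		run = seg->page_is_free[i] ? run + 1 : 0;
		if (run > best)
			best = run;
	}
	return best;
}

/* Analogue of the fpm_largest(...) == usable_pages comparison. */
static bool
toy_segment_is_entirely_free(const ToySegment *seg)
{
	return toy_largest_free_run(seg) == seg->usable_pages;
}
```

In this model a late toy_attach() after the last toy_detach() simply
fails, so cleanup cannot run twice, and a segment with even one page
still in use is never reported as entirely free.  Without the
refcnt == 0 check in toy_attach(), the count could go back up and
then down to zero a second time, which is the double-destroy shape
described in path 1.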

I don't have any theories about how that could be going wrong right
now, but I'm looking into it.  There could be a logic bug in dsa.c, or
a logic bug in client code running an invalid sequence of
dsa_allocate(), dsa_free() calls that corrupts state (I wonder if a
well-timed double dsa_free() could produce this effect), or a
common-or-garden overrun bug somewhere that trashes control state.

> I don't have a backtrace yet, but I will provide them if/when the issue happens again.

Thanks, that would be much appreciated, as would any clues about what
workload you're running.  Do you know what the query plan looks like
for the queries that crashed?

-- 
Thomas Munro
http://www.enterprisedb.com

