Thread: Re: [sqlsmith] Unpinning error in parallel worker
On Mon, Mar 27, 2017 at 8:38 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Mon, Mar 27, 2017 at 4:18 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
>> Hi,
>>
>> today's testing with master as of d253b0f6e3 yielded two clusters that
>> stopped processing queries. Symptoms:
>>
>> [...]
>
> Thanks Andreas.  Investigating.

First, the hanging stems from reentering dsm_backend_shutdown and trying to acquire DynamicSharedMemoryControlLock, which we already acquired further down the stack in dsm_unpin_segment when it raised an error. That's obviously not great, but the real question is how we reached this this-cannot-happen error condition.

I reproduced this by inserting a sleep before dsa_attach_in_place, inserting a call to dsa_allocate into ExecInitParallelPlan so that the executor's DSA area owns at least one segment, and then cancelling a parallel query before the sleepy worker has managed to attach. The DSA area is destroyed before the worker attaches, but the worker doesn't know this, and goes on to destroy it again after it learns that the query has been cancelled.

In an earlier version of DSA, attaching would have failed in this scenario because the handle would be invalid. Based on complaints about creating an extra DSM segment all the time even if we don't turn out to need it, I implemented "in place" DSA areas where the control object is in user-supplied shared memory, in this case in the parallel query's main DSM segment. But that created a new hazard: if you try to attach to a piece of memory that contains the remains of an already-destroyed DSA area, we don't do anything to detect that. Oops.

The attached patch fixes that one way: it detects refcnt == 0 as a defunct DSA area and raises an error when you try to attach.

Another approach I could explore would be to "reset" the DSA area instead of destroying it when the last backend detaches. I'm not sure whether that would ever be useful as a feature in its own right, but it would at least allow a very late worker to attach and then detach in an orderly fashion in this query-cancelled case, so you wouldn't get a different error message in the worker in this rare case.

Thoughts?

--
Thomas Munro
http://www.enterprisedb.com
Attachment
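The idea behind the patch can be illustrated with a minimal standalone C sketch (not the committed dsa.c code; the struct, function, and message names below are illustrative assumptions): once the reference count on the in-place control object has reached zero, the area is treated as destroyed, so a late attach is refused instead of resurrecting the area and destroying it a second time on detach.

/* Standalone sketch of the attach-time check; all names are illustrative. */
#include <stdbool.h>
#include <stdio.h>

typedef struct area_control
{
    int refcnt;                 /* attached backends; 0 means destroyed */
} area_control;

static bool
area_attach(area_control *control)
{
    if (control->refcnt == 0)
    {
        /* already destroyed: refuse instead of resurrecting it */
        fprintf(stderr, "could not attach to dynamic shared area\n");
        return false;
    }
    control->refcnt++;
    return true;
}

static void
area_detach(area_control *control)
{
    if (--control->refcnt == 0)
        printf("last backend detached: area destroyed\n");
}

int
main(void)
{
    area_control control = { .refcnt = 1 };     /* only the leader is attached */

    area_detach(&control);       /* query cancelled: leader detaches, area destroyed */
    if (!area_attach(&control))  /* late-starting worker is now refused */
        printf("late worker gets an error instead of a double destroy\n");
    return 0;
}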
On Mon, Mar 27, 2017 at 6:53 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Mon, Mar 27, 2017 at 8:38 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Mon, Mar 27, 2017 at 4:18 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
>>> Hi,
>>>
>>> today's testing with master as of d253b0f6e3 yielded two clusters that
>>> stopped processing queries. Symptoms:
>>>
>>> [...]
>>
>> Thanks Andreas.  Investigating.
>
> First, the hanging stems from reentering dsm_backend_shutdown and
> trying to acquire DynamicSharedMemoryControlLock, which we already
> acquired further down the stack in dsm_unpin_segment when it raised an
> error. That's obviously not great, but the real question is how we
> reached this this-cannot-happen error condition.
>
> I reproduced this by inserting a sleep before dsa_attach_in_place,
> inserting a call to dsa_allocate into ExecInitParallelPlan so that the
> executor's DSA area owns at least one segment, and then cancelling a
> parallel query before the sleepy worker has managed to attach. The
> DSA area is destroyed before the worker attaches, but the worker
> doesn't know this, and goes on to destroy it again after it learns
> that the query has been cancelled.
>
> In an earlier version of DSA, attaching would have failed in this
> scenario because the handle would be invalid. Based on complaints
> about creating an extra DSM segment all the time even if we don't turn
> out to need it, I implemented "in place" DSA areas where the control
> object is in user-supplied shared memory, in this case in the parallel
> query's main DSM segment. But that created a new hazard: if you try to
> attach to a piece of memory that contains the remains of an
> already-destroyed DSA area, we don't do anything to detect that.
> Oops.
>
> The attached patch fixes that one way: it detects refcnt == 0 as a
> defunct DSA area and raises an error when you try to attach.
>
> Another approach I could explore would be to "reset" the DSA area
> instead of destroying it when the last backend detaches. I'm not sure
> whether that would ever be useful as a feature in its own right, but
> it would at least allow a very late worker to attach and then detach
> in an orderly fashion in this query-cancelled case, so you wouldn't
> get a different error message in the worker in this rare case.
>
> Thoughts?

Added to open items.

I considered whether the error message could be improved, but it matches the message for an existing similar case (where you try to attach to an unknown handle). The alternative approach I mentioned above doesn't seem warranted, as you can already get various different failure messages depending on timing.

Based on feedback on another thread about how to make reviewers' and committers' jobs easier, here is a format-patch version with a short description as raw material for a commit message, in case that is helpful.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
On Wed, Mar 29, 2017 at 1:31 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I considered whether the error message could be improved, but it
> matches the message for an existing similar case (where you try to
> attach to an unknown handle).

Ugh, OK. I committed this, but I think this whole file needs a visit from the message style police.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thomas Munro writes:
> Based on feedback on another thread about how to make reviewers' and
> committers' jobs easier, here is a format-patch version with a short
> description as raw material for a commit message, in case that is
> helpful.

+1, it's quite convenient. Otherwise I have to be creative myself, writing commit messages in order to keep track of which patches are applied on the branch sqlsmith is crunching on.

regards,
Andreas
On Wed, Mar 29, 2017, at 10:50, Robert Haas wrote:
> On Wed, Mar 29, 2017 at 1:31 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > I considered whether the error message could be improved, but it
> > matches the message for an existing similar case (where you try to
> > attach to an unknown handle).
>
> Ugh, OK. I committed this, but I think this whole file needs a visit
> from the message style police.

I ran into an issue that matches the symptoms of this bug on a production cluster running 10.3 (Ubuntu 10.3-1.pgdg16.04+1) twice in the past week. Hundreds of queries were stuck with a wait_event of DynamicSharedMemoryControlLock, and pg_terminate_backend did not terminate them.

In the log:

> FATAL: cannot unpin a segment that is not pinned

I don't have a backtrace yet, but I will provide one if/when the issue happens again.

Jonathan
On Wed, Apr 18, 2018 at 8:52 AM, Jonathan Rudenberg <jonathan@titanous.com> wrote:
> Hundreds of queries were stuck with a wait_event of DynamicSharedMemoryControlLock, and pg_terminate_backend did not terminate them.
>
> In the log:
>
>> FATAL: cannot unpin a segment that is not pinned

Thanks for the report. That error is reachable via two paths:

1. Cleanup of a DSA area at the end of a query, giving back all segments. This is how the bug originally reported in this thread reached it: we tried to double-destroy the DSA area when the refcount went down to zero, then back up again, and then back to zero (a late-starting parallel worker that attached in a narrow time window). That was fixed in fddf45b3: once the refcount reaches zero we recognise the area as already destroyed and don't even let anyone attach.

2. In destroy_superblock(), called by dsa_free(), once we've determined that a 64kB superblock can be given back to the DSM segment, and that the DSM segment is now entirely free so it can be given back to the operating system. To do that, after we put the pages back into the free page manager we test fpm_largest(segment_map->fpm) == segment_map->header->usable_pages to see whether the largest span of free pages is now the same size as the whole segment.

I don't have any theories about how that could be going wrong right now, but I'm looking into it. There could be a logic bug in dsa.c, or a logic bug in client code running an invalid sequence of dsa_allocate() and dsa_free() calls that corrupts state (I wonder if a well timed double dsa_free() could produce this effect), or a common-or-garden overrun bug somewhere that trashes control state.

> I don't have a backtrace yet, but I will provide one if/when the issue happens again.

Thanks, that would be much appreciated, as would any clues about what workload you're running. Do you know what the query plan looks like for the queries that crashed?

--
Thomas Munro
http://www.enterprisedb.com
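The whole-segment-free test described in path 2 above can be sketched in standalone C as follows. This is a simplification under assumed names: the real free page manager in dsa.c tracks spans of pages rather than the per-page boolean array used here.

/* Standalone sketch of the "segment entirely free?" test; names are illustrative. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define USABLE_PAGES 16

typedef struct free_page_manager
{
    bool page_free[USABLE_PAGES];   /* true if the page is free */
} free_page_manager;

/* length of the longest run of contiguous free pages */
static size_t
fpm_largest(const free_page_manager *fpm)
{
    size_t best = 0;
    size_t run = 0;

    for (size_t i = 0; i < USABLE_PAGES; i++)
    {
        run = fpm->page_free[i] ? run + 1 : 0;
        if (run > best)
            best = run;
    }
    return best;
}

int
main(void)
{
    free_page_manager fpm = {{ false }};

    /* put every page back, as destroy_superblock() would for the last superblock */
    for (size_t i = 0; i < USABLE_PAGES; i++)
        fpm.page_free[i] = true;

    /* the test described above: is the largest free span the whole segment? */
    if (fpm_largest(&fpm) == USABLE_PAGES)
        printf("segment entirely free: give it back to the operating system\n");
    return 0;
}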
On Tue, Apr 17, 2018, at 18:38, Thomas Munro wrote:
> I don't have any theories about how that could be going wrong right
> now, but I'm looking into it.

Thank you!

> > I don't have a backtrace yet, but I will provide one if/when the issue happens again.
>
> Thanks, that would be much appreciated, as would any clues about what
> workload you're running. Do you know what the query plan looks like
> for the queries that crashed?

Yep, I think I know approximately what it looked like; I've attached a lightly redacted plan. All of the hung queries were running some variant of this plan as far as I can tell.

Thanks again!

Jonathan
Attachment
On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg <jonathan@titanous.com> wrote:
> On Tue, Apr 17, 2018, at 18:38, Thomas Munro wrote:
>> Thanks, that would be much appreciated, as would any clues about what
>> workload you're running. Do you know what the query plan looks like
>> for the queries that crashed?
>
> Yep, I think I know approximately what it looked like; I've attached a lightly redacted plan. All of the hung queries were running some variant of this plan as far as I can tell.

Hmm, that isn't a parallel query. I was expecting to see "Gather" and "Parallel" in there.

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
> <jonathan@titanous.com> wrote:
> > On Tue, Apr 17, 2018, at 18:38, Thomas Munro wrote:
> >> Thanks, that would be much appreciated, as would any clues about what
> >> workload you're running. Do you know what the query plan looks like
> >> for the queries that crashed?
> >
> > Yep, I think I know approximately what it looked like; I've attached a lightly redacted plan. All of the hung queries were running some variant of this plan as far as I can tell.
>
> Hmm, that isn't a parallel query. I was expecting to see "Gather" and
> "Parallel" in there.

Oops, I'm really sorry about that. I only have the first part of the hung queries, and there are a few variants. Here's one that's parallel.
Attachment
On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg <jonathan@titanous.com> wrote:
> On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
>> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
>> <jonathan@titanous.com> wrote:
>> > Yep, I think I know approximately what it looked like; I've attached a lightly redacted plan. All of the hung queries were running some variant of this plan as far as I can tell.
>>
>> Hmm, that isn't a parallel query. I was expecting to see "Gather" and
>> "Parallel" in there.
>
> Oops, I'm really sorry about that. I only have the first part of the hung queries, and there are a few variants. Here's one that's parallel.

I spent some time trying to reproduce this failure without any luck, using query plans similar to your Gather plan fragment, and using some test harness code for the allocator stuff in isolation.

I had an idea that (1) freeing a large object that releases and unpins a segment in one backend and then (2) freeing it again in another backend (illegally) might produce this effect with sufficiently bad luck. I'm still trying to reproduce that without any success, but I get other kinds of failures which I think you'd be seeing too if that hunch were right. Still looking...

--
Thomas Munro
http://www.enterprisedb.com
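That double-free hypothesis can be sketched in standalone C (pure illustration, not PostgreSQL code; the struct and function names are assumptions): the first free of the last large object in a segment unpins it, and an illegal second free reaches the unpin path again and trips the "not pinned" check.

/* Standalone sketch of the hypothesised double dsa_free(); names are illustrative. */
#include <stdbool.h>
#include <stdio.h>

typedef struct segment
{
    bool pinned;        /* extra reference held while objects live in it */
    int  live_objects;  /* objects allocated in this segment */
} segment;

static void
unpin_segment(segment *seg)
{
    if (!seg->pinned)
    {
        /* the error seen in the log, reached via the illegal second free */
        fprintf(stderr, "cannot unpin a segment that is not pinned\n");
        return;
    }
    seg->pinned = false;
}

/* free one object; an empty segment is unpinned so it can be given back */
static void
free_large_object(segment *seg)
{
    seg->live_objects--;
    if (seg->live_objects <= 0)
        unpin_segment(seg);
}

int
main(void)
{
    segment seg = { .pinned = true, .live_objects = 1 };

    free_large_object(&seg);    /* legal free: segment emptied and unpinned */
    free_large_object(&seg);    /* hypothesised illegal double free: unpins again */
    return 0;
}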
On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg
> <jonathan@titanous.com> wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
> >> <jonathan@titanous.com> wrote:
> >> > Yep, I think I know approximately what it looked like; I've attached a lightly redacted plan. All of the hung queries were running some variant of this plan as far as I can tell.
> >>
> >> Hmm, that isn't a parallel query. I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that. I only have the first part of the hung queries, and there are a few variants. Here's one that's parallel.
>
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
>
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck. I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right. Still looking...

Thanks for investigating! We have monitoring in place and a plan to grab stack traces if this happens again. Is there anything else that would be useful for us to try to get, in addition to a stack trace from the process handling the stuck query?
On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg
> <jonathan@titanous.com> wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
> >> <jonathan@titanous.com> wrote:
> >> > Yep, I think I know approximately what it looked like; I've attached a lightly redacted plan. All of the hung queries were running some variant of this plan as far as I can tell.
> >>
> >> Hmm, that isn't a parallel query. I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that. I only have the first part of the hung queries, and there are a few variants. Here's one that's parallel.
>
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
>
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck. I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right. Still looking...

This issue happened again in production. Here are the stack traces from three of the >400 hanging backends, grabbed before we nuked them all.

process: /proc/24194/mem thread: 0x7f46263648c0, lwp: 24194, type: 1
#0 0x00007f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1 0x00007f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2 0x00007f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3 0x000055fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
#5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
#6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
#7 0x000055fccb815f27 in ExecInitParallelPlan+807() at /usr/lib/postgresql/10/bin/postgres at execParallel.c:470
#8 0x000055fccb825a50 in ExecGather+1168() at /usr/lib/postgresql/10/bin/postgres at nodeGather.c:158
#9 0x000055fccb81dbd3 in ExecAppend+83() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#10 0x000055fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#11 0x000055fccb820293 in ExecAgg+1235() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2539
#12 0x000055fccb835235 in CteScanNext+229() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#13 0x000055fccb817f0d in ExecScan+557() at /usr/lib/postgresql/10/bin/postgres at execScan.c:97
#14 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#15 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#16 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#17 0x000055fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#18 0x000055fccb8201eb in ExecAgg+1067() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2347
#19 0x000055fccb811dfb in standard_ExecutorRun+379() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#20 0x000055fccb94d19b in PortalRunSelect+507() at /usr/lib/postgresql/10/bin/postgres at pquery.c:932
#21 0x000055fccb94e798 in PortalRun+824() at /usr/lib/postgresql/10/bin/postgres at pquery.c:773
#22 0x000055fccb94b329 in PostgresMain+3721() at /usr/lib/postgresql/10/bin/postgres at postgres.c:1984
#23 0x000055fccb6840d9 in ServerLoop+3412() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:4405
#24 0x000055fccb8da78b in PostmasterMain+4235() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:1363
#25 0x000055fccb6854d5 in main+2165() at /usr/lib/postgresql/10/bin/postgres at main.c:228
#26 0x00007f4623c07830 in __libc_start_main+240() at /lib/x86_64-linux-gnu/libc.so.6 at libc-start.c:291
#27 0x000055fccb685559 in _start!+41() at /usr/lib/postgresql/10/bin/postgres

process: /proc/24231/mem thread: 0x7f46263648c0, lwp: 24231, type: 1
#0 0x00007f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1 0x00007f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2 0x00007f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3 0x000055fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
#5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
#6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
#7 0x000055fccb815f27 in ExecInitParallelPlan+807() at /usr/lib/postgresql/10/bin/postgres at execParallel.c:470
#8 0x000055fccb825a50 in ExecGather+1168() at /usr/lib/postgresql/10/bin/postgres at nodeGather.c:158
#9 0x000055fccb81dbd3 in ExecAppend+83() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#10 0x000055fccb8348d6 in ExecSort+278() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#11 0x000055fccb834bd7 in ExecUnique+135() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#12 0x000055fccb835235 in CteScanNext+229() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#13 0x000055fccb817f0d in ExecScan+557() at /usr/lib/postgresql/10/bin/postgres at execScan.c:97
#14 0x000055fccb811dfb in standard_ExecutorRun+379() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#15 0x000055fccb94d19b in PortalRunSelect+507() at /usr/lib/postgresql/10/bin/postgres at pquery.c:932
#16 0x000055fccb94e798 in PortalRun+824() at /usr/lib/postgresql/10/bin/postgres at pquery.c:773
#17 0x000055fccb94b329 in PostgresMain+3721() at /usr/lib/postgresql/10/bin/postgres at postgres.c:1984
#18 0x000055fccb6840d9 in ServerLoop+3412() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:4405
#19 0x000055fccb8da78b in PostmasterMain+4235() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:1363
#20 0x000055fccb6854d5 in main+2165() at /usr/lib/postgresql/10/bin/postgres at main.c:228
#21 0x00007f4623c07830 in __libc_start_main+240() at /lib/x86_64-linux-gnu/libc.so.6 at libc-start.c:291
#22 0x000055fccb685559 in _start!+41() at /usr/lib/postgresql/10/bin/postgres

process: /proc/24136/mem thread: 0x7f46263648c0, lwp: 24136, type: 1
#0 0x00007f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1 0x00007f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2 0x00007f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3 0x000055fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
#5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
#6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
#7 0x000055fccb815f27 in ExecInitParallelPlan+807() at /usr/lib/postgresql/10/bin/postgres at execParallel.c:470
#8 0x000055fccb825a50 in ExecGather+1168() at /usr/lib/postgresql/10/bin/postgres at nodeGather.c:158
#9 0x000055fccb81dbd3 in ExecAppend+83() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#10 0x000055fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#11 0x000055fccb820293 in ExecAgg+1235() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2539
#12 0x000055fccb835235 in CteScanNext+229() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#13 0x000055fccb817f0d in ExecScan+557() at /usr/lib/postgresql/10/bin/postgres at execScan.c:97
#14 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#15 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#16 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#17 0x000055fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#18 0x000055fccb8201eb in ExecAgg+1067() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2347
#19 0x000055fccb811dfb in standard_ExecutorRun+379() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#20 0x000055fccb94d19b in PortalRunSelect+507() at /usr/lib/postgresql/10/bin/postgres at pquery.c:932
#21 0x000055fccb94e798 in PortalRun+824() at /usr/lib/postgresql/10/bin/postgres at pquery.c:773
#22 0x000055fccb94b329 in PostgresMain+3721() at /usr/lib/postgresql/10/bin/postgres at postgres.c:1984
#23 0x000055fccb6840d9 in ServerLoop+3412() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:4405
#24 0x000055fccb8da78b in PostmasterMain+4235() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:1363
#25 0x000055fccb6854d5 in main+2165() at /usr/lib/postgresql/10/bin/postgres at main.c:228
#26 0x00007f4623c07830 in __libc_start_main+240() at /lib/x86_64-linux-gnu/libc.so.6 at libc-start.c:291
#27 0x000055fccb685559 in _start!+41() at /usr/lib/postgresql/10/bin/postgres
On Wed, Apr 25, 2018 at 2:21 AM, Jonathan Rudenberg <jonathan@titanous.com> wrote:
> This issue happened again in production. Here are the stack traces from three of the >400 hanging backends, grabbed before we nuked them all.
>
> [...]
> #4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
> #5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
> #6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
> [...]

Thank you. These stacks are all blocked trying to acquire DynamicSharedMemoryControlLock. My theory is that they can't because one backend -- the one that emitted the error "FATAL: cannot unpin a segment that is not pinned" -- is deadlocked against itself. After emitting that error, you can see from Andreas's "seabisquit" stack that shmem_exit() runs dsm_backend_shutdown(), which runs dsm_detach(), which tries to acquire DynamicSharedMemoryControlLock again, even though we already hold it at that point.

I'll write a patch to fix that unpleasant symptom. While holding DynamicSharedMemoryControlLock we shouldn't raise any errors without releasing it first, because the error handling path will try to acquire it again. That's a horrible failure mode, as you have discovered.

But that isn't the root problem: we shouldn't be raising that error in the first place, and I'd love to see the stack of the one process that did that and then self-deadlocked. I will have another go at trying to reproduce it here today.

--
Thomas Munro
http://www.enterprisedb.com
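The self-deadlock and the proposed fix can be modelled with a plain pthread mutex standing in for the LWLock (all names here are illustrative assumptions, not the actual dsm.c code): release the lock before reporting the error, so the shutdown cleanup that reacquires it no longer blocks forever.

/* Standalone analogue of the symptom and fix; names are illustrative. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t control_lock = PTHREAD_MUTEX_INITIALIZER;

/* analogue of dsm_backend_shutdown(): error cleanup that needs the lock */
static void
shutdown_cleanup(void)
{
    pthread_mutex_lock(&control_lock);  /* self-deadlocks if the lock is still held */
    /* ... detach remaining segments ... */
    pthread_mutex_unlock(&control_lock);
}

/* raise an error: release the lock first so the cleanup path can acquire it */
static void
error_while_locked(const char *msg)
{
    pthread_mutex_unlock(&control_lock);
    fprintf(stderr, "FATAL: %s\n", msg);
    shutdown_cleanup();
    exit(EXIT_FAILURE);
}

static void
unpin_segment(bool pinned)
{
    pthread_mutex_lock(&control_lock);
    if (!pinned)
        error_while_locked("cannot unpin a segment that is not pinned");
    /* ... clear the pin ... */
    pthread_mutex_unlock(&control_lock);
}

int
main(void)
{
    unpin_segment(false);   /* exits with the error instead of hanging on the lock */
    return 0;
}

Built with cc -pthread, this exits with the error message; with the unlock in error_while_locked() removed, it typically hangs in shutdown_cleanup(), which mirrors the stuck DynamicSharedMemoryControlLock waits reported above.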
On Tue, Apr 24, 2018, at 16:06, Thomas Munro wrote:
> On Wed, Apr 25, 2018 at 2:21 AM, Jonathan Rudenberg
> <jonathan@titanous.com> wrote:
> > This issue happened again in production. Here are the stack traces from three of the >400 hanging backends, grabbed before we nuked them all.
> >
> > [...]
> > #4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
> > #5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
> > #6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
> > [...]
>
> Thank you. These stacks are all blocked trying to acquire
> DynamicSharedMemoryControlLock. My theory is that they can't because
> one backend -- the one that emitted the error "FATAL: cannot unpin a
> segment that is not pinned" -- is deadlocked against itself. After
> emitting that error, you can see from Andreas's "seabisquit" stack that
> shmem_exit() runs dsm_backend_shutdown(), which runs dsm_detach(),
> which tries to acquire DynamicSharedMemoryControlLock again, even
> though we already hold it at that point.
>
> I'll write a patch to fix that unpleasant symptom. While holding
> DynamicSharedMemoryControlLock we shouldn't raise any errors without
> releasing it first, because the error handling path will try to
> acquire it again. That's a horrible failure mode, as you have
> discovered.
>
> But that isn't the root problem: we shouldn't be raising that error in
> the first place, and I'd love to see the stack of the one process that
> did that and then self-deadlocked. I will have another go at trying to
> reproduce it here today.

Thanks for the update!

We have turned off parallel queries (using max_parallel_workers_per_gather = 0) for now, as the production impact of this bug is unfortunately quite problematic.

What will this failure look like with the patch you've proposed?

Thanks again,

Jonathan
On Wed, Apr 25, 2018 at 8:15 AM, Jonathan Rudenberg <jonathan@titanous.com> wrote:
> On Tue, Apr 24, 2018, at 16:06, Thomas Munro wrote:
>> I'll write a patch to fix that unpleasant symptom. While holding
>> DynamicSharedMemoryControlLock we shouldn't raise any errors without
>> releasing it first, because the error handling path will try to
>> acquire it again. That's a horrible failure mode, as you have
>> discovered.
>>
>> But that isn't the root problem: we shouldn't be raising that error in
>> the first place, and I'd love to see the stack of the one process that
>> did that and then self-deadlocked. I will have another go at trying to
>> reproduce it here today.
>
> Thanks for the update!
>
> We have turned off parallel queries (using max_parallel_workers_per_gather = 0) for now, as the production impact of this bug is unfortunately quite problematic.

Apologies for the delay... I've tried a few times to reproduce the problem, in vain. There may be a timing element here.

> What will this failure look like with the patch you've proposed?

On second thoughts, I think it depends on what is going wrong, and I'm not entirely sure the result would necessarily be better. It might be that errors are raised and normal service resumes, or it might be that we have tangled up our resources in a bad way. I need to get to the bottom of this.

--
Thomas Munro
http://www.enterprisedb.com