Thread: Re: [sqlsmith] Unpinning error in parallel worker
On Mon, Mar 27, 2017 at 8:38 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Mon, Mar 27, 2017 at 4:18 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
>> Hi,
>>
>> today's testing with master as of d253b0f6e3 yielded two clusters that
>> stopped processing queries. Symptoms:
>>
>> [...]
>
> Thanks Andreas.  Investigating.

First, the hanging stems from reentering dsm_backend_shutdown and trying to acquire DynamicSharedMemoryControlLock, which we already acquired further down the stack in dsm_unpin_segment when it raised an error. That's obviously not great, but the real question is how we reached this this-cannot-happen error condition.

I reproduced this by inserting a sleep before dsa_attach_in_place, inserting a call to dsa_allocate into ExecInitParallelPlan so that the executor's DSA area owns at least one segment, and then cancelling a parallel query before the sleepy worker has managed to attach. The DSA area is destroyed before the worker attaches, but the worker doesn't know this, and goes on to destroy it again after it learns that the query has been cancelled.

In an earlier version of DSA, attaching would have failed in this scenario because the handle would be invalid. Based on complaints about creating an extra DSM segment all the time even if we don't turn out to need it, I implemented "in place" DSA areas where the control object is in user-supplied shared memory, in this case in the parallel query's main DSM segment. But that created a new hazard: if you try to attach to a piece of memory that contains the remains of an already-destroyed DSA area, we don't do anything to detect that. Oops.

The attached patch fixes that one way: it detects refcnt == 0 as a defunct DSA area and raises an error when you try to attach.

Another approach I could explore would be to "reset" the DSA area instead of destroying it when the last backend detaches. I'm not sure whether that would ever be useful as a feature in its own right, but it would at least allow a very late worker to attach and then detach in an orderly fashion in this query-cancelled case, so you wouldn't get a different error message in the worker in this rare case.

Thoughts?

--
Thomas Munro
http://www.enterprisedb.com
Attachment
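The idea behind the patch can be illustrated with a minimal standalone C sketch (not the committed dsa.c code; the struct, function, and message names below are illustrative assumptions): once the reference count on the in-place control object has reached zero, the area is treated as destroyed, so a late attach is refused instead of resurrecting the area and destroying it a second time on detach.

/* Standalone sketch of the attach-time check; all names are illustrative. */
#include <stdbool.h>
#include <stdio.h>

typedef struct area_control
{
    int refcnt;                 /* attached backends; 0 means destroyed */
} area_control;

static bool
area_attach(area_control *control)
{
    if (control->refcnt == 0)
    {
        /* already destroyed: refuse instead of resurrecting it */
        fprintf(stderr, "could not attach to dynamic shared area\n");
        return false;
    }
    control->refcnt++;
    return true;
}

static void
area_detach(area_control *control)
{
    if (--control->refcnt == 0)
        printf("last backend detached: area destroyed\n");
}

int
main(void)
{
    area_control control = { .refcnt = 1 };     /* only the leader is attached */

    area_detach(&control);       /* query cancelled: leader detaches, area destroyed */
    if (!area_attach(&control))  /* late-starting worker is now refused */
        printf("late worker gets an error instead of a double destroy\n");
    return 0;
}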
On Mon, Mar 27, 2017 at 6:53 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Mon, Mar 27, 2017 at 8:38 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Mon, Mar 27, 2017 at 4:18 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
>>> Hi,
>>>
>>> today's testing with master as of d253b0f6e3 yielded two clusters that
>>> stopped processing queries. Symptoms:
>>>
>>> [...]
>>
>> Thanks Andreas.  Investigating.
>
> First, the hanging stems from reentering dsm_backend_shutdown and
> trying to acquire DynamicSharedMemoryControlLock, which we already
> acquired further down the stack in dsm_unpin_segment when it raised an
> error. That's obviously not great, but the real question is how we
> reached this this-cannot-happen error condition.
>
> I reproduced this by inserting a sleep before dsa_attach_in_place,
> inserting a call to dsa_allocate into ExecInitParallelPlan so that the
> executor's DSA area owns at least one segment, and then cancelling a
> parallel query before the sleepy worker has managed to attach. The
> DSA area is destroyed before the worker attaches, but the worker
> doesn't know this, and goes on to destroy it again after it learns
> that the query has been cancelled.
>
> In an earlier version of DSA, attaching would have failed in this
> scenario because the handle would be invalid. Based on complaints
> about creating an extra DSM segment all the time even if we don't turn
> out to need it, I implemented "in place" DSA areas where the control
> object is in user-supplied shared memory, in this case in the parallel
> query's main DSM segment. But that created a new hazard: if you try to
> attach to a piece of memory that contains the remains of an
> already-destroyed DSA area, we don't do anything to detect that.
> Oops.
>
> The attached patch fixes that one way: it detects refcnt == 0 as a
> defunct DSA area and raises an error when you try to attach.
>
> Another approach I could explore would be to "reset" the DSA area
> instead of destroying it when the last backend detaches. I'm not sure
> whether that would ever be useful as a feature in its own right, but
> it would at least allow a very late worker to attach and then detach
> in an orderly fashion in this query-cancelled case, so you wouldn't
> get a different error message in the worker in this rare case.
>
> Thoughts?

Added to open items.

I considered whether the error message could be improved, but it matches the message for an existing similar case (where you try to attach to an unknown handle). The alternative approach I mentioned above doesn't seem warranted, as you can already get various different failure messages depending on timing.

Based on feedback on another thread about how to make reviewers' and committers' jobs easier, here is a format-patch version with a short description as raw material for a commit message, in case that is helpful.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
On Wed, Mar 29, 2017 at 1:31 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I considered whether the error message could be improved, but it
> matches the message for an existing similar case (where you try to
> attach to an unknown handle).

Ugh, OK. I committed this, but I think this whole file needs a visit from the message style police.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thomas Munro writes:
> Based on feedback on another thread about how to make reviewers' and
> committers' jobs easier, here is a format-patch version with a short
> description as raw material for a commit message, in case that is
> helpful.

+1, it's quite convenient. Otherwise I have to be creative myself, writing commit messages in order to keep track of which patches are applied on the branch sqlsmith is crunching on.

regards,
Andreas
On Wed, Mar 29, 2017, at 10:50, Robert Haas wrote:
> On Wed, Mar 29, 2017 at 1:31 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > I considered whether the error message could be improved, but it
> > matches the message for an existing similar case (where you try to
> > attach to an unknown handle).
>
> Ugh, OK. I committed this, but I think this whole file needs a visit
> from the message style police.

I ran into an issue that matches the symptoms of this bug on a production cluster running 10.3 (Ubuntu 10.3-1.pgdg16.04+1) twice in the past week. Hundreds of queries were stuck with a wait_event of DynamicSharedMemoryControlLock, and pg_terminate_backend did not terminate them.

In the log:

> FATAL: cannot unpin a segment that is not pinned

I don't have a backtrace yet, but I will provide one if/when the issue happens again.

Jonathan
On Wed, Apr 18, 2018 at 8:52 AM, Jonathan Rudenberg <jonathan@titanous.com> wrote:
> Hundreds of queries were stuck with a wait_event of DynamicSharedMemoryControlLock, and pg_terminate_backend did not terminate them.
>
> In the log:
>
>> FATAL: cannot unpin a segment that is not pinned

Thanks for the report. That error is reachable via two paths:

1. Cleanup of a DSA area at the end of a query, giving back all segments. This is how the bug originally reported in this thread reached it: we tried to double-destroy the DSA area when the refcount went down to zero, then back up again, and then back to zero (a late-starting parallel worker that attached in a narrow time window). That was fixed in fddf45b3: once the refcount reaches zero we recognise the area as already destroyed and don't even let anyone attach.

2. In destroy_superblock(), called by dsa_free(), once we've determined that a 64kB superblock can be given back to the DSM segment, and that the DSM segment is now entirely free so it can be given back to the operating system. To do that, after we put the pages back into the free page manager we test fpm_largest(segment_map->fpm) == segment_map->header->usable_pages to see whether the largest span of free pages is now the same size as the whole segment.

I don't have any theories about how that could be going wrong right now, but I'm looking into it. There could be a logic bug in dsa.c, or a logic bug in client code running an invalid sequence of dsa_allocate() and dsa_free() calls that corrupts state (I wonder if a well timed double dsa_free() could produce this effect), or a common-or-garden overrun bug somewhere that trashes control state.

> I don't have a backtrace yet, but I will provide one if/when the issue happens again.

Thanks, that would be much appreciated, as would any clues about what workload you're running. Do you know what the query plan looks like for the queries that crashed?

--
Thomas Munro
http://www.enterprisedb.com
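The whole-segment-free test described in path 2 above can be sketched in standalone C as follows. This is a simplification under assumed names: the real free page manager in dsa.c tracks spans of pages rather than the per-page boolean array used here.

/* Standalone sketch of the "segment entirely free?" test; names are illustrative. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define USABLE_PAGES 16

typedef struct free_page_manager
{
    bool page_free[USABLE_PAGES];   /* true if the page is free */
} free_page_manager;

/* length of the longest run of contiguous free pages */
static size_t
fpm_largest(const free_page_manager *fpm)
{
    size_t best = 0;
    size_t run = 0;

    for (size_t i = 0; i < USABLE_PAGES; i++)
    {
        run = fpm->page_free[i] ? run + 1 : 0;
        if (run > best)
            best = run;
    }
    return best;
}

int
main(void)
{
    free_page_manager fpm = {{ false }};

    /* put every page back, as destroy_superblock() would for the last superblock */
    for (size_t i = 0; i < USABLE_PAGES; i++)
        fpm.page_free[i] = true;

    /* the test described above: is the largest free span the whole segment? */
    if (fpm_largest(&fpm) == USABLE_PAGES)
        printf("segment entirely free: give it back to the operating system\n");
    return 0;
}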
On Tue, Apr 17, 2018, at 18:38, Thomas Munro wrote:
> I don't have any theories about how that could be going wrong right
> now, but I'm looking into it.

Thank you!

> > I don't have a backtrace yet, but I will provide one if/when the issue happens again.
>
> Thanks, that would be much appreciated, as would any clues about what
> workload you're running. Do you know what the query plan looks like
> for the queries that crashed?

Yep, I think I know approximately what it looked like; I've attached a lightly redacted plan. All of the hung queries were running some variant of this plan as far as I can tell.

Thanks again!

Jonathan
Attachment
On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg <jonathan@titanous.com> wrote:
> On Tue, Apr 17, 2018, at 18:38, Thomas Munro wrote:
>> Thanks, that would be much appreciated, as would any clues about what
>> workload you're running. Do you know what the query plan looks like
>> for the queries that crashed?
>
> Yep, I think I know approximately what it looked like; I've attached a lightly redacted plan. All of the hung queries were running some variant of this plan as far as I can tell.

Hmm, that isn't a parallel query. I was expecting to see "Gather" and "Parallel" in there.

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
> <jonathan@titanous.com> wrote:
> > On Tue, Apr 17, 2018, at 18:38, Thomas Munro wrote:
> >> Thanks, that would be much appreciated, as would any clues about what
> >> workload you're running. Do you know what the query plan looks like
> >> for the queries that crashed?
> >
> > Yep, I think I know approximately what it looked like; I've attached a lightly redacted plan. All of the hung queries were running some variant of this plan as far as I can tell.
>
> Hmm, that isn't a parallel query. I was expecting to see "Gather" and
> "Parallel" in there.

Oops, I'm really sorry about that. I only have the first part of the hung queries, and there are a few variants. Here's one that's parallel.
Attachment
On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg <jonathan@titanous.com> wrote:
> On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
>> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
>> <jonathan@titanous.com> wrote:
>> > Yep, I think I know approximately what it looked like; I've attached a lightly redacted plan. All of the hung queries were running some variant of this plan as far as I can tell.
>>
>> Hmm, that isn't a parallel query. I was expecting to see "Gather" and
>> "Parallel" in there.
>
> Oops, I'm really sorry about that. I only have the first part of the hung queries, and there are a few variants. Here's one that's parallel.

I spent some time trying to reproduce this failure without any luck, using query plans similar to your Gather plan fragment, and using some test harness code for the allocator stuff in isolation.

I had an idea that (1) freeing a large object that releases and unpins a segment in one backend and then (2) freeing it again in another backend (illegally) might produce this effect with sufficiently bad luck. I'm still trying to reproduce that without any success, but I get other kinds of failures which I think you'd be seeing too if that hunch were right. Still looking...

--
Thomas Munro
http://www.enterprisedb.com
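That double-free hypothesis can be sketched in standalone C (pure illustration, not PostgreSQL code; the struct and function names are assumptions): the first free of the last large object in a segment unpins it, and an illegal second free reaches the unpin path again and trips the "not pinned" check.

/* Standalone sketch of the hypothesised double dsa_free(); names are illustrative. */
#include <stdbool.h>
#include <stdio.h>

typedef struct segment
{
    bool pinned;        /* extra reference held while objects live in it */
    int  live_objects;  /* objects allocated in this segment */
} segment;

static void
unpin_segment(segment *seg)
{
    if (!seg->pinned)
    {
        /* the error seen in the log, reached via the illegal second free */
        fprintf(stderr, "cannot unpin a segment that is not pinned\n");
        return;
    }
    seg->pinned = false;
}

/* free one object; an empty segment is unpinned so it can be given back */
static void
free_large_object(segment *seg)
{
    seg->live_objects--;
    if (seg->live_objects <= 0)
        unpin_segment(seg);
}

int
main(void)
{
    segment seg = { .pinned = true, .live_objects = 1 };

    free_large_object(&seg);    /* legal free: segment emptied and unpinned */
    free_large_object(&seg);    /* hypothesised illegal double free: unpins again */
    return 0;
}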
On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg
> <jonathan@titanous.com> wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
> >> <jonathan@titanous.com> wrote:
> >> > Yep, I think I know approximately what it looked like; I've attached a lightly redacted plan. All of the hung queries were running some variant of this plan as far as I can tell.
> >>
> >> Hmm, that isn't a parallel query. I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that. I only have the first part of the hung queries, and there are a few variants. Here's one that's parallel.
>
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
>
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck. I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right. Still looking...

Thanks for investigating! We have monitoring in place and a plan to grab stack traces if this happens again. Is there anything else that would be useful for us to try to get, in addition to a stack trace from the process handling the stuck query?
On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg
> <jonathan@titanous.com> wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
> >> <jonathan@titanous.com> wrote:
> >> > Yep, I think I know approximately what it looked like; I've attached a lightly redacted plan. All of the hung queries were running some variant of this plan as far as I can tell.
> >>
> >> Hmm, that isn't a parallel query. I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that. I only have the first part of the hung queries, and there are a few variants. Here's one that's parallel.
>
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
>
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck. I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right. Still looking...

This issue happened again in production. Here are the stack traces from three of the >400 hanging backends, grabbed before we nuked them all.

process: /proc/24194/mem thread: 0x7f46263648c0, lwp: 24194, type: 1
#0 0x00007f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1 0x00007f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2 0x00007f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3 0x000055fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
#5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
#6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
#7 0x000055fccb815f27 in ExecInitParallelPlan+807() at /usr/lib/postgresql/10/bin/postgres at execParallel.c:470
#8 0x000055fccb825a50 in ExecGather+1168() at /usr/lib/postgresql/10/bin/postgres at nodeGather.c:158
#9 0x000055fccb81dbd3 in ExecAppend+83() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#10 0x000055fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#11 0x000055fccb820293 in ExecAgg+1235() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2539
#12 0x000055fccb835235 in CteScanNext+229() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#13 0x000055fccb817f0d in ExecScan+557() at /usr/lib/postgresql/10/bin/postgres at execScan.c:97
#14 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#15 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#16 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#17 0x000055fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#18 0x000055fccb8201eb in ExecAgg+1067() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2347
#19 0x000055fccb811dfb in standard_ExecutorRun+379() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#20 0x000055fccb94d19b in PortalRunSelect+507() at /usr/lib/postgresql/10/bin/postgres at pquery.c:932
#21 0x000055fccb94e798 in PortalRun+824() at /usr/lib/postgresql/10/bin/postgres at pquery.c:773
#22 0x000055fccb94b329 in PostgresMain+3721() at /usr/lib/postgresql/10/bin/postgres at postgres.c:1984
#23 0x000055fccb6840d9 in ServerLoop+3412() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:4405
#24 0x000055fccb8da78b in PostmasterMain+4235() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:1363
#25 0x000055fccb6854d5 in main+2165() at /usr/lib/postgresql/10/bin/postgres at main.c:228
#26 0x00007f4623c07830 in __libc_start_main+240() at /lib/x86_64-linux-gnu/libc.so.6 at libc-start.c:291
#27 0x000055fccb685559 in _start!+41() at /usr/lib/postgresql/10/bin/postgres

process: /proc/24231/mem thread: 0x7f46263648c0, lwp: 24231, type: 1
#0 0x00007f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1 0x00007f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2 0x00007f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3 0x000055fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
#5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
#6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
#7 0x000055fccb815f27 in ExecInitParallelPlan+807() at /usr/lib/postgresql/10/bin/postgres at execParallel.c:470
#8 0x000055fccb825a50 in ExecGather+1168() at /usr/lib/postgresql/10/bin/postgres at nodeGather.c:158
#9 0x000055fccb81dbd3 in ExecAppend+83() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#10 0x000055fccb8348d6 in ExecSort+278() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#11 0x000055fccb834bd7 in ExecUnique+135() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#12 0x000055fccb835235 in CteScanNext+229() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#13 0x000055fccb817f0d in ExecScan+557() at /usr/lib/postgresql/10/bin/postgres at execScan.c:97
#14 0x000055fccb811dfb in standard_ExecutorRun+379() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#15 0x000055fccb94d19b in PortalRunSelect+507() at /usr/lib/postgresql/10/bin/postgres at pquery.c:932
#16 0x000055fccb94e798 in PortalRun+824() at /usr/lib/postgresql/10/bin/postgres at pquery.c:773
#17 0x000055fccb94b329 in PostgresMain+3721() at /usr/lib/postgresql/10/bin/postgres at postgres.c:1984
#18 0x000055fccb6840d9 in ServerLoop+3412() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:4405
#19 0x000055fccb8da78b in PostmasterMain+4235() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:1363
#20 0x000055fccb6854d5 in main+2165() at /usr/lib/postgresql/10/bin/postgres at main.c:228
#21 0x00007f4623c07830 in __libc_start_main+240() at /lib/x86_64-linux-gnu/libc.so.6 at libc-start.c:291
#22 0x000055fccb685559 in _start!+41() at /usr/lib/postgresql/10/bin/postgres

process: /proc/24136/mem thread: 0x7f46263648c0, lwp: 24136, type: 1
#0 0x00007f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1 0x00007f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2 0x00007f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3 0x000055fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
#5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
#6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
#7 0x000055fccb815f27 in ExecInitParallelPlan+807() at /usr/lib/postgresql/10/bin/postgres at execParallel.c:470
#8 0x000055fccb825a50 in ExecGather+1168() at /usr/lib/postgresql/10/bin/postgres at nodeGather.c:158
#9 0x000055fccb81dbd3 in ExecAppend+83() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#10 0x000055fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#11 0x000055fccb820293 in ExecAgg+1235() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2539
#12 0x000055fccb835235 in CteScanNext+229() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#13 0x000055fccb817f0d in ExecScan+557() at /usr/lib/postgresql/10/bin/postgres at execScan.c:97
#14 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#15 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#16 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#17 0x000055fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#18 0x000055fccb8201eb in ExecAgg+1067() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2347
#19 0x000055fccb811dfb in standard_ExecutorRun+379() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#20 0x000055fccb94d19b in PortalRunSelect+507() at /usr/lib/postgresql/10/bin/postgres at pquery.c:932
#21 0x000055fccb94e798 in PortalRun+824() at /usr/lib/postgresql/10/bin/postgres at pquery.c:773
#22 0x000055fccb94b329 in PostgresMain+3721() at /usr/lib/postgresql/10/bin/postgres at postgres.c:1984
#23 0x000055fccb6840d9 in ServerLoop+3412() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:4405
#24 0x000055fccb8da78b in PostmasterMain+4235() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:1363
#25 0x000055fccb6854d5 in main+2165() at /usr/lib/postgresql/10/bin/postgres at main.c:228
#26 0x00007f4623c07830 in __libc_start_main+240() at /lib/x86_64-linux-gnu/libc.so.6 at libc-start.c:291
#27 0x000055fccb685559 in _start!+41() at /usr/lib/postgresql/10/bin/postgres
On Wed, Apr 25, 2018 at 2:21 AM, Jonathan Rudenberg <jonathan@titanous.com> wrote:
> This issue happened again in production. Here are the stack traces from three of the >400 hanging backends, grabbed before we nuked them all.
>
> [...]
> #4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
> #5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
> #6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
> [...]

Thank you. These stacks are all blocked trying to acquire DynamicSharedMemoryControlLock. My theory is that they can't because one backend -- the one that emitted the error "FATAL: cannot unpin a segment that is not pinned" -- is deadlocked against itself. After emitting that error, you can see from Andreas's "seabisquit" stack that shmem_exit() runs dsm_backend_shutdown(), which runs dsm_detach(), which tries to acquire DynamicSharedMemoryControlLock again, even though we already hold it at that point.

I'll write a patch to fix that unpleasant symptom. While holding DynamicSharedMemoryControlLock we shouldn't raise any errors without releasing it first, because the error handling path will try to acquire it again. That's a horrible failure mode, as you have discovered.

But that isn't the root problem: we shouldn't be raising that error in the first place, and I'd love to see the stack of the one process that did that and then self-deadlocked. I will have another go at trying to reproduce it here today.

--
Thomas Munro
http://www.enterprisedb.com
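The self-deadlock and the proposed fix can be modelled with a plain pthread mutex standing in for the LWLock (all names here are illustrative assumptions, not the actual dsm.c code): release the lock before reporting the error, so the shutdown cleanup that reacquires it no longer blocks forever.

/* Standalone analogue of the symptom and fix; names are illustrative. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t control_lock = PTHREAD_MUTEX_INITIALIZER;

/* analogue of dsm_backend_shutdown(): error cleanup that needs the lock */
static void
shutdown_cleanup(void)
{
    pthread_mutex_lock(&control_lock);  /* self-deadlocks if the lock is still held */
    /* ... detach remaining segments ... */
    pthread_mutex_unlock(&control_lock);
}

/* raise an error: release the lock first so the cleanup path can acquire it */
static void
error_while_locked(const char *msg)
{
    pthread_mutex_unlock(&control_lock);
    fprintf(stderr, "FATAL: %s\n", msg);
    shutdown_cleanup();
    exit(EXIT_FAILURE);
}

static void
unpin_segment(bool pinned)
{
    pthread_mutex_lock(&control_lock);
    if (!pinned)
        error_while_locked("cannot unpin a segment that is not pinned");
    /* ... clear the pin ... */
    pthread_mutex_unlock(&control_lock);
}

int
main(void)
{
    unpin_segment(false);   /* exits with the error instead of hanging on the lock */
    return 0;
}

Built with cc -pthread, this exits with the error message; with the unlock in error_while_locked() removed, it typically hangs in shutdown_cleanup(), which mirrors the stuck DynamicSharedMemoryControlLock waits reported above.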
On Tue, Apr 24, 2018, at 16:06, Thomas Munro wrote:
> On Wed, Apr 25, 2018 at 2:21 AM, Jonathan Rudenberg
> <jonathan@titanous.com> wrote:
> > This issue happened again in production. Here are the stack traces from three of the >400 hanging backends, grabbed before we nuked them all.
> >
> > [...]
> > #4 0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
> > #5 0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
> > #6 0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
> > [...]
>
> Thank you. These stacks are all blocked trying to acquire
> DynamicSharedMemoryControlLock. My theory is that they can't because
> one backend -- the one that emitted the error "FATAL: cannot unpin a
> segment that is not pinned" -- is deadlocked against itself. After
> emitting that error, you can see from Andreas's "seabisquit" stack that
> shmem_exit() runs dsm_backend_shutdown(), which runs dsm_detach(),
> which tries to acquire DynamicSharedMemoryControlLock again, even
> though we already hold it at that point.
>
> I'll write a patch to fix that unpleasant symptom. While holding
> DynamicSharedMemoryControlLock we shouldn't raise any errors without
> releasing it first, because the error handling path will try to
> acquire it again. That's a horrible failure mode, as you have
> discovered.
>
> But that isn't the root problem: we shouldn't be raising that error in
> the first place, and I'd love to see the stack of the one process that
> did that and then self-deadlocked. I will have another go at trying to
> reproduce it here today.

Thanks for the update!

We have turned off parallel queries (using max_parallel_workers_per_gather = 0) for now, as the production impact of this bug is unfortunately quite problematic.

What will this failure look like with the patch you've proposed?

Thanks again,

Jonathan
On Wed, Apr 25, 2018 at 8:15 AM, Jonathan Rudenberg <jonathan@titanous.com> wrote:
> On Tue, Apr 24, 2018, at 16:06, Thomas Munro wrote:
>> I'll write a patch to fix that unpleasant symptom. While holding
>> DynamicSharedMemoryControlLock we shouldn't raise any errors without
>> releasing it first, because the error handling path will try to
>> acquire it again. That's a horrible failure mode, as you have
>> discovered.
>>
>> But that isn't the root problem: we shouldn't be raising that error in
>> the first place, and I'd love to see the stack of the one process that
>> did that and then self-deadlocked. I will have another go at trying to
>> reproduce it here today.
>
> Thanks for the update!
>
> We have turned off parallel queries (using max_parallel_workers_per_gather = 0) for now, as the production impact of this bug is unfortunately quite problematic.

Apologies for the delay... I've tried a few times to reproduce the problem, in vain. There may be a timing element here.

> What will this failure look like with the patch you've proposed?

On second thoughts, I think it depends on what is going wrong, and I'm not entirely sure the result would necessarily be better. It might be that errors are raised and normal service resumes, or it might be that we have tangled up our resources in a bad way. I need to get to the bottom of this.

--
Thomas Munro
http://www.enterprisedb.com