Re: [sqlsmith] Unpinning error in parallel worker - Mailing list pgsql-hackers

From Jonathan Rudenberg
Subject Re: [sqlsmith] Unpinning error in parallel worker
Date
Msg-id 1524579700.3786618.1349037056.03F1C731@webmail.messagingengine.com
Whole thread Raw
In response to Re: [sqlsmith] Unpinning error in parallel worker  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: [sqlsmith] Unpinning error in parallel worker
List pgsql-hackers

On Fri, Apr 20, 2018, at 00:42, Thomas Munro wrote:
> On Wed, Apr 18, 2018 at 11:43 AM, Jonathan Rudenberg
> <jonathan@titanous.com> wrote:
> > On Tue, Apr 17, 2018, at 19:31, Thomas Munro wrote:
> >> On Wed, Apr 18, 2018 at 11:01 AM, Jonathan Rudenberg
> >> <jonathan@titanous.com> wrote:
> >> > Yep, I think I know approximately what it looked like, I've attached a lightly redacted plan. All of the hung
querieswere running some variant of this plan as far as I can tell.
 
> >>
> >> Hmm, that isn't a parallel query.  I was expecting to see "Gather" and
> >> "Parallel" in there.
> >
> > Oops, I'm really sorry about that. I only have the first part of the hung queries, and there are a few variants.
Here'sone that's parallel.
 
> 
> I spent some time trying to reproduce this failure without any luck,
> using query plans similar to your Gather plan fragment, and using some
> test harness code for the allocator stuff in isolation.
> 
> I had an idea that (1) freeing a large object that releases and unpins
> a segment in one backend and then (2) freeing it again in another
> backend (illegally) might produce this effect with sufficiently bad
> luck.  I'm still trying to reproduce that without any success, but I
> get other kinds of failures which I think you'd be seeing too if that
> hunch were right.  Still looking...


This issue happened again in production, here are the stack traces from three we grabbed before nuking the >400 hanging
backends.

process: /proc/24194/mem
thread: 0x7f46263648c0, lwp: 24194, type: 1
#0  0x00007f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1  0x00007f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2  0x00007f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3  0x000055fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4  0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
#5  0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
#6  0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
#7  0x000055fccb815f27 in ExecInitParallelPlan+807() at /usr/lib/postgresql/10/bin/postgres at execParallel.c:470
#8  0x000055fccb825a50 in ExecGather+1168() at /usr/lib/postgresql/10/bin/postgres at nodeGather.c:158
#9  0x000055fccb81dbd3 in ExecAppend+83() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#10 0x000055fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#11 0x000055fccb820293 in ExecAgg+1235() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2539
#12 0x000055fccb835235 in CteScanNext+229() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#13 0x000055fccb817f0d in ExecScan+557() at /usr/lib/postgresql/10/bin/postgres at execScan.c:97
#14 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#15 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#16 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#17 0x000055fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#18 0x000055fccb8201eb in ExecAgg+1067() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2347
#19 0x000055fccb811dfb in standard_ExecutorRun+379() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#20 0x000055fccb94d19b in PortalRunSelect+507() at /usr/lib/postgresql/10/bin/postgres at pquery.c:932
#21 0x000055fccb94e798 in PortalRun+824() at /usr/lib/postgresql/10/bin/postgres at pquery.c:773
#22 0x000055fccb94b329 in PostgresMain+3721() at /usr/lib/postgresql/10/bin/postgres at postgres.c:1984
#23 0x000055fccb6840d9 in ServerLoop+3412() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:4405
#24 0x000055fccb8da78b in PostmasterMain+4235() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:1363
#25 0x000055fccb6854d5 in main+2165() at /usr/lib/postgresql/10/bin/postgres at main.c:228
#26 0x00007f4623c07830 in __libc_start_main+240() at /lib/x86_64-linux-gnu/libc.so.6 at libc-start.c:291
#27 0x000055fccb685559 in _start!+41() at /usr/lib/postgresql/10/bin/postgres

process: /proc/24231/mem
thread: 0x7f46263648c0, lwp: 24231, type: 1
#0  0x00007f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1  0x00007f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2  0x00007f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3  0x000055fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4  0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
#5  0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
#6  0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
#7  0x000055fccb815f27 in ExecInitParallelPlan+807() at /usr/lib/postgresql/10/bin/postgres at execParallel.c:470
#8  0x000055fccb825a50 in ExecGather+1168() at /usr/lib/postgresql/10/bin/postgres at nodeGather.c:158
#9  0x000055fccb81dbd3 in ExecAppend+83() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#10 0x000055fccb8348d6 in ExecSort+278() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#11 0x000055fccb834bd7 in ExecUnique+135() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#12 0x000055fccb835235 in CteScanNext+229() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#13 0x000055fccb817f0d in ExecScan+557() at /usr/lib/postgresql/10/bin/postgres at execScan.c:97
#14 0x000055fccb811dfb in standard_ExecutorRun+379() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#15 0x000055fccb94d19b in PortalRunSelect+507() at /usr/lib/postgresql/10/bin/postgres at pquery.c:932
#16 0x000055fccb94e798 in PortalRun+824() at /usr/lib/postgresql/10/bin/postgres at pquery.c:773
#17 0x000055fccb94b329 in PostgresMain+3721() at /usr/lib/postgresql/10/bin/postgres at postgres.c:1984
#18 0x000055fccb6840d9 in ServerLoop+3412() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:4405
#19 0x000055fccb8da78b in PostmasterMain+4235() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:1363
#20 0x000055fccb6854d5 in main+2165() at /usr/lib/postgresql/10/bin/postgres at main.c:228
#21 0x00007f4623c07830 in __libc_start_main+240() at /lib/x86_64-linux-gnu/libc.so.6 at libc-start.c:291
#22 0x000055fccb685559 in _start!+41() at /usr/lib/postgresql/10/bin/postgres

process: /proc/24136/mem
thread: 0x7f46263648c0, lwp: 24136, type: 1
#0  0x00007f4625fdc827 in do_futex_wait+55() at /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:205
#1  0x00007f4625fdc8d4 in __new_sem_wait_slow+84() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_waitcommon.c:181
#2  0x00007f4625fdc97a in __new_sem_wait+58() at /lib/x86_64-linux-gnu/libpthread.so.0 at sem_wait.c:29
#3  0x000055fccb8c8832 in PGSemaphoreLock+34() at /usr/lib/postgresql/10/bin/postgres at pg_sema.c:310
#4  0x000055fccb93b21c in LWLockAcquire+188() at /usr/lib/postgresql/10/bin/postgres at lwlock.c:1233
#5  0x000055fccb925fa7 in dsm_create+151() at /usr/lib/postgresql/10/bin/postgres at dsm.c:493
#6  0x000055fccb6f2a6f in InitializeParallelDSM+511() at /usr/lib/postgresql/10/bin/postgres at parallel.c:266
#7  0x000055fccb815f27 in ExecInitParallelPlan+807() at /usr/lib/postgresql/10/bin/postgres at execParallel.c:470
#8  0x000055fccb825a50 in ExecGather+1168() at /usr/lib/postgresql/10/bin/postgres at nodeGather.c:158
#9  0x000055fccb81dbd3 in ExecAppend+83() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#10 0x000055fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#11 0x000055fccb820293 in ExecAgg+1235() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2539
#12 0x000055fccb835235 in CteScanNext+229() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#13 0x000055fccb817f0d in ExecScan+557() at /usr/lib/postgresql/10/bin/postgres at execScan.c:97
#14 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#15 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#16 0x000055fccb832155 in ExecNestLoop+165() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#17 0x000055fccb81df2c in fetch_input_tuple+172() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#18 0x000055fccb8201eb in ExecAgg+1067() at /usr/lib/postgresql/10/bin/postgres at nodeAgg.c:2347
#19 0x000055fccb811dfb in standard_ExecutorRun+379() at /usr/lib/postgresql/10/bin/postgres at executor.h:250
#20 0x000055fccb94d19b in PortalRunSelect+507() at /usr/lib/postgresql/10/bin/postgres at pquery.c:932
#21 0x000055fccb94e798 in PortalRun+824() at /usr/lib/postgresql/10/bin/postgres at pquery.c:773
#22 0x000055fccb94b329 in PostgresMain+3721() at /usr/lib/postgresql/10/bin/postgres at postgres.c:1984
#23 0x000055fccb6840d9 in ServerLoop+3412() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:4405
#24 0x000055fccb8da78b in PostmasterMain+4235() at /usr/lib/postgresql/10/bin/postgres at postmaster.c:1363
#25 0x000055fccb6854d5 in main+2165() at /usr/lib/postgresql/10/bin/postgres at main.c:228
#26 0x00007f4623c07830 in __libc_start_main+240() at /lib/x86_64-linux-gnu/libc.so.6 at libc-start.c:291
#27 0x000055fccb685559 in _start!+41() at /usr/lib/postgresql/10/bin/postgres


pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: [HACKERS] Runtime Partition Pruning
Next
From: Adam Brusselback
Date:
Subject: Re: Built-in connection pooling