Thread: BUG #14973: hung queries

BUG #14973: hung queries

From: skaurus@gmail.com

The following bug has been logged on the website:

Bug reference:      14973
Logged by:          Dmitry Shalashov
Email address:      skaurus@gmail.com
PostgreSQL version: 10.1
Operating system:   Debian 9
Description:

We stumbled upon queries running for a day or more. They are simple ones, so
that should not be happening. And most of the time it doesn't: only a very
small share of these queries ends up like this.

Moreover, these queries couldn't be stopped.

pg_stat_activity says that they all have wait_event_type = IPC, wait_event =
BtreePage, state = active

strace shows that they are all inside the epoll_wait syscall

grepping the ps output shows that they are all "postgres: bgworker: parallel
worker for PID ..."

Maybe it is a bug in parallel seq scan?

We are going to disable parallel seq scan and restart our server in about 4
hours. I can collect more debugging information if asked before then.


Re: BUG #14973: hung queries

From: Thomas Munro

On Fri, Dec 15, 2017 at 1:31 AM,  <skaurus@gmail.com> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      14973
> Logged by:          Dmitry Shalashov
> Email address:      skaurus@gmail.com
> PostgreSQL version: 10.1
> Operating system:   Debian 9
> Description:
>
> We stumbled upon queries running for a day or more. They are simple ones, so
> that should not be happening. And most of the time it doesn't: only a very
> small share of these queries ends up like this.
>
> Moreover, these queries couldn't be stopped.
>
> pg_stat_activity says that they all have wait_event_type = IPC, wait_event =
> BtreePage, state = active
>
> strace shows that they are all inside the epoll_wait syscall
>
> grepping the ps output shows that they are all "postgres: bgworker: parallel
> worker for PID ..."
>
> Maybe it is a bug in parallel seq scan?
>
> We are going to disable parallel seq scan and restart our server in about 4
> hours. I can collect more debugging information if asked before then.

Hello Dmitry,

Thank you for the report.  It sounds like a known bug in 10.0 and 10.1
that was recently fixed:

https://www.postgresql.org/message-id/E1ePESn-0005PV-S9%40gemulon.postgresql.org

The problem is in Parallel Index Scan for btree.  The fix will be in
10.2.  One workaround in the meantime would be to disable parallelism
for that query (SET max_parallel_workers_per_gather = 0).
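A minimal sketch of applying that workaround to one query only, assuming the statement can run inside its own transaction: SET LOCAL lasts only until the transaction ends, so other sessions and later statements keep their normal parallelism settings. The table and column names below are hypothetical placeholders.

```sql
BEGIN;
-- SET LOCAL reverts automatically at COMMIT or ROLLBACK, so only the
-- statements inside this transaction are planned without parallel workers.
SET LOCAL max_parallel_workers_per_gather = 0;
-- Hypothetical query: substitute the statement that was hanging.
SELECT count(*) FROM my_table WHERE my_indexed_col = 42;
COMMIT;
```

A plain SET, as written above, would instead stay in effect for the rest of the session.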

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #14973: hung queries

From: Thomas Munro

On Tue, Dec 19, 2017 at 6:38 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Dec 15, 2017 at 1:31 AM,  <skaurus@gmail.com> wrote:
>> pg_stat_activity says that they all have wait_event_type = IPC, wait_event =
>> BtreePage, state = active
>
> https://www.postgresql.org/message-id/E1ePESn-0005PV-S9%40gemulon.postgresql.org
>
> The problem is in Parallel Index Scan for btree.  The fix will be in
> 10.2.  One workaround in the meantime would be to disable parallelism
> for that query (SET max_parallel_workers_per_gather = 0).

On second thoughts, a more targeted workaround to avoid just these
buggy parallel index scans without disabling parallelism in general
might be:

SET min_parallel_index_scan_size = '5TB';

(Assuming you don't have any indexes that large.)
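To check that assumption, a query along these lines (a sketch that only covers the current database) lists the largest indexes by on-disk size:

```sql
-- Five largest indexes in the current database, by on-disk size,
-- to confirm that none comes anywhere near the '5TB' setting.
SELECT indexrelid::regclass AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid::regclass)) AS index_size
FROM pg_index
ORDER BY pg_relation_size(indexrelid::regclass) DESC
LIMIT 5;
```

Run it once per database, since pg_index only lists the indexes of the database you are connected to.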

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #14973: hung queries

From: Dmitry Shalashov

Hi Thomas,

I'm glad to help. Thanks for the advice!

By the way, there was a mistake in my bug report - wait_event actually was BgWorkerShutdown.


Dmitry Shalashov, relap.io & surfingbird.ru

On 2017-12-18 22:55 GMT+03:00, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Tue, Dec 19, 2017 at 6:38 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Fri, Dec 15, 2017 at 1:31 AM,  <skaurus@gmail.com> wrote:
>>> pg_stat_activity says that they all have wait_event_type = IPC, wait_event =
>>> BtreePage, state = active
>>
>> https://www.postgresql.org/message-id/E1ePESn-0005PV-S9%40gemulon.postgresql.org
>>
>> The problem is in Parallel Index Scan for btree.  The fix will be in
>> 10.2.  One workaround in the meantime would be to disable parallelism
>> for that query (SET max_parallel_workers_per_gather = 0).
>
> On second thoughts, a more targeted workaround to avoid just these
> buggy parallel index scans without disabling parallelism in general
> might be:
>
> SET min_parallel_index_scan_size = '5TB';
>
> (Assuming you don't have any indexes that large.)

Re: BUG #14973: hung queries

From: Amit Kapila

On Tue, Dec 19, 2017 at 2:48 AM, Dmitry Shalashov <skaurus@gmail.com> wrote:
> Hi Thomas,
>
> I'm glad to help. Thanks for the advice!
>
> By the way, there was a mistake in my bug report - wait_event actually was
> BgWorkerShutdown.
>

I think the BgWorkerShutdown wait event can only appear in the master
backend, not in all the workers.  Are there any other wait events?
Can we get a stack trace of one or more workers?
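A sketch of a query that could answer the first question, assuming PostgreSQL 10 or later, where pg_stat_activity also lists background workers and has a backend_type column. It shows every active process that is currently waiting, so the leader's wait event can be compared with its workers':

```sql
-- Every active process that is currently waiting, with what it waits on.
-- backend_type helps tell the leader backend apart from parallel workers.
SELECT pid,
       backend_type,
       wait_event_type,
       wait_event,
       left(query, 60) AS query_head
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event IS NOT NULL
ORDER BY wait_event_type, wait_event, pid;
```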


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: BUG #14973: hung queries

From: Michael Paquier

On Tue, Dec 19, 2017 at 4:02 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Dec 19, 2017 at 2:48 AM, Dmitry Shalashov <skaurus@gmail.com> wrote:
>> Hi Thomas,
>>
>> I'm glad to help. Thanks for the advice!
>>
>> By the way, there was a mistake in my bug report - wait_event actually was
>> BgWorkerShutdown.
>
> I think the BgWorkerShutdown wait event can only appear in the master
> backend, not in all the workers.

Yeah, that's what happens when calling
WaitForBackgroundWorkerShutdown(), as the primary backend waits for all
the workers to stop. By the way, you can also see this wait event in
the logical replication launcher.
-- 
Michael