Thread: BUG #15290: Stuck Parallel Index Scan query
The following bug has been logged on the website:

Bug reference:      15290
Logged by:          Victor Yegorov
Email address:      vyegorov@gmail.com
PostgreSQL version: 10.4
Operating system:   Debian GNU/Linux 8.7 (jessie)
Description:

We've just encountered an issue on the streaming replica of a client.
In short, the query has been active for 8 hours and we're not able to terminate it:
all processes (leader and workers) are ignoring all signals. Symptoms are similar
to the ones described in #15036.

As a result, we had to perform an immediate shutdown of the instance. After restart
we got another such stuck query within minutes, so it's not a single occurrence.

Query in subject
----------------

-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------
ts_age      | 08:32:17.286343
state       | active
query_age   | 08:32:17.286345
change_age  | 08:32:17.286344
datname     | coub
pid         | 2877
usename     | app
waiting     | f
client_addr |
client_port | -1
query       | select count(*) as value from coubs where type='Coub::Simple' and is_done=false and in_process=false

                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=666001.67..666001.72 rows=1 width=8)
   ->  Gather  (cost=666001.20..666001.65 rows=8 width=8)
         Workers Planned: 8
         ->  Partial Aggregate  (cost=665901.20..665901.25 rows=1 width=8)
               ->  Parallel Index Scan using coubs_type_is_done_partial_simple on coubs  (cost=0.56..663016.01 rows=1154077 width=0)
                     Index Cond: (is_done = false)
                     Filter: ((NOT is_done) AND (NOT in_process))
(7 rows)

In reality this query worked with just 4 workers:

 2877 ?  ts  0:08  \_ postgres: 10/main: app coub [local] SELECT
 3416 ?  Ss  0:00      \_ postgres: 10/main: bgworker: parallel worker for PID 2877
 3417 ?  Ss  0:00      \_ postgres: 10/main: bgworker: parallel worker for PID 2877
 3418 ?  Ss  0:00      \_ postgres: 10/main: bgworker: parallel worker for PID 2877
 3419 ?  Ss  0:00      \_ postgres: 10/main: bgworker: parallel worker for PID 2877

We have backtraces of the main process and all workers, and also a `pg_stat_activity` snapshot — I will attach them to the new e-mail.
Sun, 22 Jul 2018 at 21:31, PG Bug reporting form <noreply@postgresql.org>:
The following bug has been logged on the website:
Bug reference: 15290
Logged by: Victor Yegorov
Email address: vyegorov@gmail.com
PostgreSQL version: 10.4
Operating system: Debian GNU/Linux 8.7 (jessie)
Victor Yegorov
Attachment
Sun, 22 Jul 2018 at 21:31, PG Bug reporting form <noreply@postgresql.org>:
The following bug has been logged on the website:
Bug reference: 15290
Logged by: Victor Yegorov
Email address: vyegorov@gmail.com
PostgreSQL version: 10.4
Operating system: Debian GNU/Linux 8.7 (jessie)
Some more details.
The index that was chosen for the Parallel Index Scan was a partial index, and it is quite bloated:
"coubs_type_is_done_partial_simple" btree (type, is_done) WHERE type::text = 'Coub::Simple'::text
6766MB
What we did was create another, ordinary index:
"index_coubs_on_is_done_and_in_process_and_type" btree (is_done, in_process, type)
2141MB
So now the planner prefers to use the new index and the issue is not re-appearing. The query executes within 10-20ms.
Old bloated index is still around for analysis. `amcheck` showed no issues with it, though.
Victor Yegorov
On Mon, Jul 23, 2018 at 8:52 AM, Victor Yegorov <vyegorov@gmail.com> wrote:
> Sun, 22 Jul 2018 at 21:31, PG Bug reporting form <noreply@postgresql.org>:
>>
>> The following bug has been logged on the website:
>>
>> Bug reference:      15290
>> Logged by:          Victor Yegorov
>> Email address:      vyegorov@gmail.com
>> PostgreSQL version: 10.4
>> Operating system:   Debian GNU/Linux 8.7 (jessie)

Thanks for the detailed report!  To summarise your back traces, we have:

PID 2877 is the master process and has decided to abort and is waiting
for the workers to exit:

  WaitLatch
  WaitForBackgroundWorkerShutdown
  WaitForParallelWorkersToExit
  DestroyParallelContext
  AtEOXact_Parallel
  AbortTransaction
  AbortCurrentTransaction
  PostgresMain

PIDs 3416, 3417, 3418, 3419 meanwhile are waiting to seize the scan head:

  WaitEventSetWaitBlock
  ConditionVariableSleep
  _bt_parallel_seize
  _bt_readnextpage

Presumably 2877 has it (?), but aborted (do you have an error message
in the server log?), and the workers have somehow survived
TerminateBackgroundWorker() (called by DestroyParallelContext()).

Hmm.  TerminateBackgroundWorker() sets a shm flag and signals the
postmaster, the postmaster signals the worker with SIGTERM, the worker
handles SIGTERM in die() by setting ProcDiePending = true,
InterruptPending = true and then setting its own latch, and the
ConditionVariableSleep() loop should wake up, go around its loop again
and run CHECK_FOR_INTERRUPTS(), which sees InterruptPending and runs
ProcessInterrupts(), which sees ProcDiePending and reports FATAL.  So
which domino fell short here?

--
Thomas Munro
http://www.enterprisedb.com
On 2018-07-23 16:30:28 +1200, Thomas Munro wrote:
> TerminateBackgroundWorker() sets a shm flag and signals the
> postmaster, the postmaster signals the worker with SIGTERM, the worker
> handles SIGTERM in die() by setting ProcDiePending = true,
> InterruptPending = true and then setting its own latch, and the
> ConditionVariableSleep() loop should wake up, go around its loop again
> and run CHECK_FOR_INTERRUPTS() which sees InterruptPending and runs
> ProcessInterrupts(), which sees ProcDiePending and reports FATAL.  So
> which domino fell short here?

I've not looked at this, but is it possible that interrupts aren't
accepted because they're held (e.g. because of an lwlock)?

Greetings,

Andres Freund
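To make the two ideas above concrete, here is a tiny standalone model of the mechanism (illustration only: names such as on_sigterm, check_for_interrupts and hold_count are invented stand-ins for PostgreSQL's die(), CHECK_FOR_INTERRUPTS()/ProcessInterrupts() and InterruptHoldoffCount, not actual server code). With the holdoff counter at zero, a SIGTERM is acted on at the next trip around the wait loop, which is the chain Thomas describes; with it nonzero, as it is while an LWLock such as a buffer content lock is held, the pending die request is never serviced and the process appears to ignore signals.

    /* toy_holdoff.c: illustrative model only, not PostgreSQL source.
     * Build with: cc -o toy_holdoff toy_holdoff.c
     * Run it and send SIGTERM: with HOLDING_LOCK 0 it exits promptly;
     * with HOLDING_LOCK 1 it models "waiting while a buffer lock is held"
     * and keeps ignoring SIGTERM, like the stuck workers. */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define HOLDING_LOCK 0                 /* set to 1 to model the buggy case */

    static volatile sig_atomic_t interrupt_pending = 0;
    static volatile sig_atomic_t die_pending = 0;
    static int hold_count = HOLDING_LOCK;  /* models InterruptHoldoffCount */

    /* Models die(): the handler only records the request; no real work here. */
    static void on_sigterm(int signo)
    {
        (void) signo;
        interrupt_pending = 1;
        die_pending = 1;                   /* the process latch would also be set */
    }

    /* Models CHECK_FOR_INTERRUPTS()/ProcessInterrupts(). */
    static void check_for_interrupts(void)
    {
        if (!interrupt_pending)
            return;
        if (hold_count != 0)               /* interrupts held off: do nothing yet */
            return;
        if (die_pending)
        {
            fprintf(stderr, "FATAL: terminating connection\n");
            exit(1);
        }
    }

    int main(void)
    {
        signal(SIGTERM, on_sigterm);
        printf("pid %d waiting; try: kill -TERM %d\n", (int) getpid(), (int) getpid());
        fflush(stdout);

        /* Models the ConditionVariableSleep() retry loop. */
        for (;;)
        {
            check_for_interrupts();        /* should notice the SIGTERM promptly */
            pause();                       /* models the WaitEventSetWait() sleep */
        }
    }

The only point of the model is that the second early return in check_for_interrupts() is enough to turn a terminable wait into an unterminable one.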
Mon, 23 Jul 2018 at 7:31, Thomas Munro <thomas.munro@enterprisedb.com>:
PID 2877 is the master process and has decided to abort and is waiting
for the workers to exit:
WaitLatch
WaitForBackgroundWorkerShutdown
WaitForParallelWorkersToExit
DestroyParallelContext
AtEOXact_Parallel
AbortTransaction
AbortCurrentTransaction
PostgresMain
PIDs 3416, 3417, 3418, 3419 meanwhile are waiting to seize the scan head:
WaitEventSetWaitBlock
ConditionVariableSleep
_bt_parallel_seize
_bt_readnextpage
Presumably 2877 has it (?), but aborted (do you have an error message
in the server log?), and the workers have somehow survived
TerminateBackgroundWorker() (called by DestroyParallelContext()).
The query was stuck for 8 hours when we tried to terminate it. That makes me think that the master process was
still waiting for the bgworkers to finish, as a test run finished in 11ms for me.
As I mentioned, this case re-appeared while I was preparing
the report (we had to restart the DB a second time). I think I can make it happen again, if necessary.
There is not much in the logs:
- a bunch of `FATAL: connection to client lost` errors, but from another (web) user (a couple of errors per hour)
- `ERROR: canceling statement due to conflict with recovery`, which happened right when our problematic query started, from the same user
- errors related to shutdown/startup of the DB.
Victor Yegorov
On Mon, Jul 23, 2018 at 7:57 PM, Victor Yegorov <vyegorov@gmail.com> wrote:
> - `ERROR: canceling statement due to conflict with recovery`, happened
> right when our problematic query started, same user

Ok, so that would explain how the master was cancelled.  In 2877's
stack we see that it was aborting here:

#11 0x00007f539697ba5e in PostgresMain (argc=1,
    argv=argv@entry=0x7f5398d1bbc8, dbname=0x7f5398d1bb98 "coub",
    username=0x7f5398d1bbb0 "app") at
    /build/postgresql-10-U6N320/postgresql-10-10.4/build/../src/backend/tcop/postgres.c:3879

That line calls AbortCurrentTransaction(), just after the call to
EmitErrorReport() that wrote something in your log.  Andres's theory
(interrupts 'held') seems promising...  perhaps there could be a bug
where parallel index scans leak a share-locked page or something like
that.  I tried to reproduce this a bit, but no cigar so far.  I wonder
if there could be something about your bloated index that reaches
buggy behaviour...

If you happen to have a core file for a worker that is waiting in
ConditionVariableSleep(), or it happens again, you'd be able to see if
an LWLock is causing this by printing num_held_lwlocks.

--
Thomas Munro
http://www.enterprisedb.com
Mon, 23 Jul 2018 at 11:47, Thomas Munro <thomas.munro@enterprisedb.com>:
On Mon, Jul 23, 2018 at 7:57 PM, Victor Yegorov <vyegorov@gmail.com> wrote:
> - `ERROR: canceling statement due to conflict with recovery`, happened
> right when our problematic query started, same user
Ok, so that would explain how the master was cancelled. In 2877's
stack we see that it was aborting here:
Right:
ERROR: canceling statement due to conflict with recovery
DETAIL: User was holding shared buffer pin for too long.
#11 0x00007f539697ba5e in PostgresMain (argc=1,
argv=argv@entry=0x7f5398d1bbc8, dbname=0x7f5398d1bb98 "coub",
username=0x7f5398d1bbb0 "app") at
/build/postgresql-10-U6N320/postgresql-10-10.4/build/../src/backend/tcop/postgres.c:3879
That line calls AbortCurrentTransaction(), just after the call to
EmitErrorReport() that wrote something in your log. Andres's theory
(interrupts 'held') seems promising... perhaps there could be a bug
where parallel index scans leak a share-locked page or something like
that. I tried to reproduce this a bit, but no cigar so far. I wonder
if there could be something about your bloated index that reaches
buggy behaviour...
If you happen to have a core file for a worker that is waiting in
ConditionVariableSleep(), or it happens again, you'd be able to see if
an LWLock is causing this by printing num_held_lwlocks.
No, we do not have core files around. And so far I have not been able to reproduce this situation.
I will keep monitoring. In case I hit it again, what else (besides num_held_lwlocks)
should I check for?
Victor Yegorov
Ok, I see it:

                /* check for interrupts while we're not holding any buffer lock */
                CHECK_FOR_INTERRUPTS();
                /* step right one page */
                so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
                ...
                /* nope, keep going */
                if (scan->parallel_scan != NULL)
                {
                    status = _bt_parallel_seize(scan, &blkno);

That leads to a condition variable wait, while we still hold that
buffer lock.  That prevents interrupts.  Oops.

--
Thomas Munro
http://www.enterprisedb.com
On 2018-07-25 14:04:11 +1200, Thomas Munro wrote:
> Ok, I see it:
>
>                 /* check for interrupts while we're not holding any buffer lock */
>                 CHECK_FOR_INTERRUPTS();
>                 /* step right one page */
>                 so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
>                 ...
>                 /* nope, keep going */
>                 if (scan->parallel_scan != NULL)
>                 {
>                     status = _bt_parallel_seize(scan, &blkno);
>
> That leads to a condition variable wait, while we still hold that
> buffer lock.  That prevents interrupts.  Oops.

Heh, guessed right.  I kinda wonder if we shouldn't add a
CHECK_FOR_INTERRUPTS_FOR_REALZ() that asserts that interrupts aren't
held.  There are plenty of places where we rely on that being the case,
for correctness.

Greetings,

Andres Freund
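For readers following along, a sketch of what such a macro could look like, assembled from existing pieces (Assert(), CHECK_FOR_INTERRUPTS(), InterruptHoldoffCount and CritSectionCount are real PostgreSQL symbols; the macro itself and its name are only the hypothetical proposal made above, and nothing like it existed at the time of this thread):

    /* Hypothetical sketch only, not part of PostgreSQL.  The idea: a
     * CHECK_FOR_INTERRUPTS() variant for wait points that must remain
     * cancellable, failing fast in assert-enabled builds if the caller
     * has blocked interrupt processing, instead of hanging in production. */
    #define CHECK_FOR_INTERRUPTS_FOR_REALZ() \
        do { \
            Assert(InterruptHoldoffCount == 0 && CritSectionCount == 0); \
            CHECK_FOR_INTERRUPTS(); \
        } while (0)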
On Wed, Jul 25, 2018 at 2:08 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-07-25 14:04:11 +1200, Thomas Munro wrote:
>> Ok, I see it:
>>
>>                 /* check for interrupts while we're not holding any buffer lock */
>>                 CHECK_FOR_INTERRUPTS();
>>                 /* step right one page */
>>                 so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
>>                 ...
>>                 /* nope, keep going */
>>                 if (scan->parallel_scan != NULL)
>>                 {
>>                     status = _bt_parallel_seize(scan, &blkno);
>>
>> That leads to a condition variable wait, while we still hold that
>> buffer lock.  That prevents interrupts.  Oops.
>
> Heh, guessed right.  I kinda wonder if we shouldn't add a
> CHECK_FOR_INTERRUPTS_FOR_REALZ() that asserts that interrupts aren't
> held.  There are plenty of places where we rely on that being the case,
> for correctness.

Yeah, I was wondering something similar: perhaps WaitEventSetWait()
should assert that interrupts are not held, unless you explicitly told
it somehow that it's OK?  You couldn't do it unconditionally, because
we sometimes do that on purpose:

        HOLD_INTERRUPTS();
        WaitForParallelWorkersToExit(pcxt);
        RESUME_INTERRUPTS();

Here's a reproducer (adjust timeout to suit your machine):

drop table if exists t;
create table t as select generate_series(1, 1000000)::int i;
create index on t(i);
alter table t set (autovacuum_enabled = false);
delete from t;  -- 100% bloat please
set max_parallel_workers_per_gather = 2;
set min_parallel_index_scan_size = 0;
set enable_seqscan = false;
set enable_bitmapscan = false;
set parallel_tuple_cost = 0;
set parallel_setup_cost = 0;
set statement_timeout = '100ms';  -- enough to fork, not enough to complete
explain analyze select count(*) from t;

--
Thomas Munro
http://www.enterprisedb.com
On Wed, Jul 25, 2018 at 8:29 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, Jul 25, 2018 at 2:08 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2018-07-25 14:04:11 +1200, Thomas Munro wrote:
>>> Ok, I see it:
>>> [code snippet quoted above]
>>>
>>> That leads to a condition variable wait, while we still hold that
>>> buffer lock.  That prevents interrupts.  Oops.

Well spotted.  I think here we can release the current page lock
before calling _bt_parallel_seize as we don't need it to get the next
page.  See the backward scan case; in particular, I am referring to the
below code:

_bt_readnextpage()
{
..
         * For parallel scans, get the last page scanned as it is quite
         * possible that by the time we try to seize the scan, some other
         * worker has already advanced the scan to a different page.  We
         * must continue based on the latest page scanned by any worker.
         */
        if (scan->parallel_scan != NULL)
        {
            _bt_relbuf(rel, so->currPos.buf);
            status = _bt_parallel_seize(scan, &blkno);
..
}

This needs some more analysis.  I will continue the analysis and share
my findings.  Thanks, Thomas, for pinging me offlist and including me
here.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
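In other words, the change being discussed would make the forward-scan path mirror the backward-scan code quoted above. A rough sketch of that reordering follows (a sketch of the idea only, reusing the names from the quoted snippets; the actual patch is the attachment in the next message and may differ in its details):

    /* Sketch only, not the attached patch: forward-scan case of
     * _bt_readnextpage().  Release the current page before trying to
     * seize the parallel scan, so that a condition-variable sleep inside
     * _bt_parallel_seize() happens without a buffer lock held (and thus
     * without interrupts held off). */
    if (scan->parallel_scan != NULL)
    {
        _bt_relbuf(rel, so->currPos.buf);          /* drop lock and pin first */
        status = _bt_parallel_seize(scan, &blkno); /* may sleep; now cancellable */
        if (!status)
            return false;                          /* scan finished; cleanup elided */
    }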
On Wed, Jul 25, 2018 at 8:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Jul 25, 2018 at 8:29 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> [...]
>
> Well spotted.  I think here we can release the current page lock
> before calling _bt_parallel_seize as we don't need it to get the next
> page.

I have written a patch on the above lines and manually verified (by
reproducing the issue via debugger) that it fixes the issue.  Thomas,
Victor, is it possible for you guys to see if the attached fixes the
issue for you?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Jul 25, 2018 at 5:46 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think here we can release the current page lock
>> before calling _bt_parallel_seize as we don't need it to get the next
>> page.
>
> I have written a patch on the above lines and manually verified (by
> reproducing the issue via debugger) that it fixes the issue.  Thomas,
> Victor, is it possible for you guys to see if the attached fixes the
> issue for you?

The patch looks good to me.  It fixes the problem for my reproducer
(BTW that works better with an extra zero on the generate_series
number).

--
Thomas Munro
http://www.enterprisedb.com
Wed, 25 Jul 2018 at 8:46, Amit Kapila <amit.kapila16@gmail.com>:
On Wed, Jul 25, 2018 at 8:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Well spotted. I think here we can release the current page lock
> before calling _bt_parallel_seize as we don't need it to get the next
> page.
>
I have written a patch on the above lines and manually verified (by
reproducing the issue via debugger) that it fixes the issue. Thomas,
Victor, is it possible for you guys to see if the attached fixes the
issue for you?
First, I failed to reproduce Thomas' case on a VM with just 1 core.
But it reproduces just fine on an 8-core machine; make sure to warm up the index first, though.
The patch fixes the issue for me here.
Victor Yegorov
On Thu, Jul 26, 2018 at 3:53 PM, Victor Yegorov <vyegorov@gmail.com> wrote:
> Wed, 25 Jul 2018 at 8:46, Amit Kapila <amit.kapila16@gmail.com>:
>>
>> On Wed, Jul 25, 2018 at 8:43 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > Well spotted.  I think here we can release the current page lock
>> > before calling _bt_parallel_seize as we don't need it to get the next
>> > page.
>> >
>>
>> I have written a patch on the above lines and manually verified (by
>> reproducing the issue via debugger) that it fixes the issue.  Thomas,
>> Victor, is it possible for you guys to see if the attached fixes the
>> issue for you?
>
> First, I failed to reproduce Thomas' case on a VM with just 1 core.
> But it reproduces just fine on an 8-core machine; make sure to warm up
> the index first, though.
>
> The patch fixes the issue for me here.

Thanks, pushed the fix.  I have not included Thomas's test as it is
timing dependent and it seems risky to include such a test.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com