Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
Date
Msg-id CA+TgmoYaqJQKtvvbATFzsTsWVZkoB-rff16Ts4osn0fCzVe=CA@mail.gmail.com
In response to Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests
List pgsql-hackers
On Thu, Jun 15, 2017 at 10:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yes, I think it is for next query.  If you refer the log below from lorikeet:
>
> 2017-06-13 16:44:57.179 EDT [59404ec6.2758:63] LOG:  statement:
> EXPLAIN (analyze, timing off, summary off, costs off) SELECT * FROM
> tenk1;
> 2017-06-13 16:44:57.247 EDT [59404ec9.2e78:1] ERROR:  could not map
> dynamic shared memory segment
> 2017-06-13 16:44:57.248 EDT [59404dec.2d9c:5] LOG:  worker process:
> parallel worker for PID 10072 (PID 11896) exited with exit code 1
> 2017-06-13 16:44:57.273 EDT [59404ec6.2758:64] LOG:  statement: select
> stringu1::int2 from tenk1 where unique1 = 1;
> TRAP: FailedAssertion("!(BackgroundWorkerData->parallel_register_count
> - BackgroundWorkerData->parallel_terminate_count <= 1024)", File:
> "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/postmaster/bgworker.c",
> Line: 974)
> 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:6] LOG:  server process
> (PID 10072) was terminated by signal 6: Aborted
> 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:7] DETAIL:  Failed process
> was running: select stringu1::int2 from tenk1 where unique1 = 1;
> 2017-06-13 16:45:02.652 EDT [59404dec.2d9c:8] LOG:  terminating any
> other active server processes
>
> Error "could not map dynamic shared memory segment" is due to query
> "EXPLAIN .. SELECT * FROM tenk1" and Assertion failure is due to
> another statement "select stringu1::int2 from tenk1 where unique1 =
> 1;".

I think you're right.  So here's a theory:

1. The ERROR about mapping the DSM segment is just a case of the worker
losing a race, and isn't a bug.

2. But when that happens, parallel_terminate_count is getting bumped
twice for some reason.

3. So then the leader process fails that assertion when it tries to
launch the parallel workers for the next query.
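
To make steps 2 and 3 concrete, here is a rough sketch of why one extra
bump would be enough to trip that check. It assumes the two counters are
unsigned 32-bit integers, so the subtraction wraps around instead of
going negative. The struct, the function names, and the constant name
are made up for illustration; only the two field names and the limit of
1024 come from the assertion text in the log. This is not the bgworker.c
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; the value 1024 is taken from the assertion in the log. */
#define SKETCH_WORKER_LIMIT 1024

/* Hypothetical stand-in for the shared counters (assumed to be uint32). */
typedef struct
{
    uint32_t parallel_register_count;
    uint32_t parallel_terminate_count;
} CountersSketch;

/* Same shape as the failed assertion: registered - terminated <= limit. */
static bool
counters_ok(const CountersSketch *c)
{
    /* The cast keeps the wraparound explicit regardless of int width. */
    uint32_t active = (uint32_t) (c->parallel_register_count -
                                  c->parallel_terminate_count);

    return active <= SKETCH_WORKER_LIMIT;
}

/* The leader checks the invariant when it registers a worker. */
static void
register_worker(CountersSketch *c)
{
    assert(counters_ok(c));
    c->parallel_register_count++;
}

/* Called once per worker exit; the theory is that it ran twice. */
static void
terminate_worker(CountersSketch *c)
{
    c->parallel_terminate_count++;
}
```

With one register_worker() followed by two terminate_worker() calls, the
difference becomes 1 - 2, which wraps to 4294967295 in 32-bit unsigned
arithmetic, so the next register_worker() would hit the assertion. That
would match the failure showing up on the following query rather than on
the one whose worker hit the ERROR.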

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


