Home > mailing lists

Re: BUG #15585: infinite DynamicSharedMemoryControlLock waiting inparallel query - Mailing list pgsql-bugs

From	Maksim Milyutin
Subject	Re: BUG #15585: infinite DynamicSharedMemoryControlLock waiting inparallel query
Date	January 16, 2019 14:23:01
Msg-id	458c691a-1b2d-7a70-1c04-58361b894011@gmail.com Whole thread Raw
In response to	Re: BUG #15585: infinite DynamicSharedMemoryControlLock waiting inparallel query (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses	Re: BUG #15585: infinite DynamicSharedMemoryControlLock waiting inparallel query
List	pgsql-bugs

Tree view

On 1/11/19 8:16 AM, Thomas Munro wrote:

On Fri, Jan 11, 2019 at 10:04 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I am trying to reproduce it.

I ran squillions of concurrent parallel queries today, making sure
they would allocate and free entire segments a lot (based on the
assumption that something along those lines must be required).  I made
sure to use Linux, GCC, -O2, no asserts (since both reports came from
that environment, and I'd previously failed to reproduce this on my
usual tool chain/platforms), and I used a multi-socket box (in case
some cache coherency effect was not occurring on my development
laptops).  I did learn some interesting things about parallel query
performance that I plan to follow up on, but I had no luck in
reproducing this error.  Rats.

One observation is that the other report involved a fairly simple
Parallel Hash Join query (on 11), and this report involves Parallel
Index Scan and Parallel Bitmap Index Scan (on 10), so that suggests
that it's probably not a bug in the parallel executor code (for
example an access-after-free, whose undefined behaviour could in
theory look like this with unlucky timing, I speculate) but rather
something lower level.

Recently one of our customers encountered the same problem on PG 11.1. The query structure and error messages are similar to one that was described in [1].

OS/PG settings:

Ubuntu 16.04

mem 128 GB

cpu 16 cores
huge_pages is turned off

# sudo cat /proc/sys/kernel/shmmax18446744073692774399

max_connections = 200

shared_buffers = 33GB

dynamic_shared_memory_type = posix

max_worker_processes = 16max_parallel_maintenance_workers = 6max_parallel_workers_per_gather = 6max_parallel_workers = 12

The query was:

SELECT ev.date, r.indicator_id, r.service_reg_id, count(r.value), count(CASE WHEN r.value::int > 3 THEN 1 ELSE NULL END), avg(r.value::int)FROM base.hershel_events AS ev INNER JOIN base.hershel_rates AS r ON r.event_id = ev.idWHERE ev.date between '2018-09-15' AND '2018-12-15'GROUP BY ev.date, r.indicator_id, r.service_reg_idORDER BY ev.date ASC

Its plan with record statistics:

Finalize GroupAggregate (actual rows=81332937 loops=1) Group Key: ev.date, r.indicator_id, r.service_reg_id -> Gather Merge (actual rows=81332937 loops=1) Workers Planned: 7 Workers Launched: 7 -> Partial GroupAggregate (actual rows=10166617 loops=8) Group Key: ev.date, r.indicator_id, r.service_reg_id -> Sort (actual rows=10166617 loops=8) Sort Key: ev.date, r.indicator_id, r.service_reg_id Sort Method: external merge Disk: 549392kB Worker 0: Sort Method: external merge Disk: 516104kB Worker 1: Sort Method: external merge Disk: 533200kB Worker 2: Sort Method: external merge Disk: 540680kB Worker 3: Sort Method: external merge Disk: 540760kB Worker 4: Sort Method: external merge Disk: 530512kB Worker 5: Sort Method: external merge Disk: 472240kB Worker 6: Sort Method: external merge Disk: 539256kB -> Parallel Hash Join (actual rows=10166617 loops=8) Hash Cond: (r.event_id = ev.id) -> Parallel Seq Scan on hershel_rates r (actual rows=15000000 loops=8) -> Parallel Hash (actual rows=9319175 loops=8) Buckets: 4194304 Batches: 32 Memory Usage: 124064kB -> Parallel Seq Scan on hershel_events ev (actual rows=9319175 loops=8) Filter: ((date >= '2018-09-15'::date) AND (date <= '2018-12-15'::date)) Rows Removed by Filter: 4430825

At some moment one of parallel workers emitted to log the error message: dsa_area could not attach to segment. After terminating all workers (the leader also bumped into the same error) the last one began to emit to log the following type of messages: LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp5027.0.sharedfileset/i55of128.p1.0", size 458752. The summary size of temp files that was mentioned in the log was about *14GB*. And eventually that last worker got the fatal error: cannot unpin a segment that is not pinned. After that the DynamicSharedMemoryControlLock left unreleased and soon new backends hanged on it.

Another time some backend (more likely, it was some utility worker) bumped into the error: could not open shared memory segment "/PostgreSQL.1948558832": No such file or directory. After that other backends stopped on DynamicSharedMemoryControlLock. Before this issue there were parallel aggregate queries.

Unfortunately our customer doesn't want to catch up this error on production system again. He turned off the query parallelism. And this issue doesn't reproduce on test environment. AFAIK this is heisenbug and is reproducible on heavy load. I'll try to run some stress tests to catch up stack trace.

[1] https://www.postgresql.org/message-id/flat/20181231221734.GB25379%40telsasoft.com

-- 
Regards, Maksim Milyutin
Postgres Professional:http://www.postgrespro.ru
The Russian Postgres Company

pgsql-bugs by date:

From: Amit Langote
Date: 16 January 2019, 09:30:02
Subject: Re: BUG #15587: Partitions with ALTER TABLE ADD CONSTRAINT

From: PG Bug reporting form
Date: 16 January 2019, 14:56:34
Subject: BUG #15594: Unstable tests in contrib/test_decoding/output_iso/

Re: BUG #15585: infinite DynamicSharedMemoryControlLock waiting inparallel query - Mailing list pgsql-bugs

Previous

Next