Re: BUG #15585: infinite DynamicSharedMemoryControlLock waiting in parallel query - Mailing list pgsql-bugs

From: Maksim Milyutin
Subject: Re: BUG #15585: infinite DynamicSharedMemoryControlLock waiting in parallel query
Date:
Msg-id: 458c691a-1b2d-7a70-1c04-58361b894011@gmail.com
In response to: Re: BUG #15585: infinite DynamicSharedMemoryControlLock waiting in parallel query (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses: Re: BUG #15585: infinite DynamicSharedMemoryControlLock waiting in parallel query
List: pgsql-bugs
On Fri, Jan 11, 2019 at 10:04 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I am trying to reproduce it.
>
> I ran squillions of concurrent parallel queries today, making sure they
> would allocate and free entire segments a lot (based on the assumption
> that something along those lines must be required). I made sure to use
> Linux, GCC, -O2, no asserts (since both reports came from that
> environment, and I'd previously failed to reproduce this on my usual
> tool chain/platforms), and I used a multi-socket box (in case some
> cache coherency effect was not occurring on my development laptops).
> I did learn some interesting things about parallel query performance
> that I plan to follow up on, but I had no luck in reproducing this
> error. Rats.
>
> One observation is that the other report involved a fairly simple
> Parallel Hash Join query (on 11), and this report involves Parallel
> Index Scan and Parallel Bitmap Index Scan (on 10), so that suggests
> that it's probably not a bug in the parallel executor code (for example
> an access-after-free, whose undefined behaviour could in theory look
> like this with unlucky timing, I speculate) but rather something lower
> level.
Recently one of our customers encountered the same problem on PG 11.1. The query structure and error messages are similar to the ones described in [1].
OS/PG settings:
Ubuntu 16.04
huge_pages is turned off
18446744073692774399
max_parallel_maintenance_workers = 6
max_parallel_workers_per_gather = 6
max_parallel_workers = 12
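For completeness, the effective parallel-query settings can be cross-checked on a live system with an ordinary catalog query. This is only a generic sketch, nothing in it is specific to the customer's installation:

-- List the parallel-related GUCs and the DSM implementation in use;
-- pg_settings is a standard system view available on PG 10/11.
SELECT name, setting, source
FROM pg_settings
WHERE name LIKE '%parallel%'
   OR name = 'dynamic_shared_memory_type'
ORDER BY name;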
The query was:
SELECT
ev.date,
r.indicator_id,
r.service_reg_id,
count(r.value),
count(CASE WHEN r.value::int > 3 THEN 1 ELSE NULL END),
avg(r.value::int)
FROM base.hershel_events AS ev
INNER JOIN base.hershel_rates AS r ON r.event_id = ev.id
WHERE ev.date between '2018-09-15' AND '2018-12-15'
GROUP BY ev.date, r.indicator_id, r.service_reg_id
ORDER BY ev.date ASC
Its plan with runtime statistics:
Finalize GroupAggregate (actual rows=81332937 loops=1)
  Group Key: ev.date, r.indicator_id, r.service_reg_id
  ->  Gather Merge (actual rows=81332937 loops=1)
        Workers Planned: 7
        Workers Launched: 7
        ->  Partial GroupAggregate (actual rows=10166617 loops=8)
              Group Key: ev.date, r.indicator_id, r.service_reg_id
              ->  Sort (actual rows=10166617 loops=8)
                    Sort Key: ev.date, r.indicator_id, r.service_reg_id
                    Sort Method: external merge Disk: 549392kB
                    Worker 0: Sort Method: external merge Disk: 516104kB
                    Worker 1: Sort Method: external merge Disk: 533200kB
                    Worker 2: Sort Method: external merge Disk: 540680kB
                    Worker 3: Sort Method: external merge Disk: 540760kB
                    Worker 4: Sort Method: external merge Disk: 530512kB
                    Worker 5: Sort Method: external merge Disk: 472240kB
                    Worker 6: Sort Method: external merge Disk: 539256kB
                    ->  Parallel Hash Join (actual rows=10166617 loops=8)
                          Hash Cond: (r.event_id = ev.id)
                          ->  Parallel Seq Scan on hershel_rates r (actual rows=15000000 loops=8)
                          ->  Parallel Hash (actual rows=9319175 loops=8)
                                Buckets: 4194304 Batches: 32 Memory Usage: 124064kB
                                ->  Parallel Seq Scan on hershel_events ev (actual rows=9319175 loops=8)
                                      Filter: ((date >= '2018-09-15'::date) AND (date <= '2018-12-15'::date))
                                      Rows Removed by Filter: 4430825
At some point one of the parallel workers emitted the error message "dsa_area could not attach to segment" to the log. After all the workers had terminated (the leader also bumped into the same error), the last one to exit began to emit log messages of the following kind: LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp5027.0.sharedfileset/i55of128.p1.0", size 458752. The total size of the temp files mentioned in the log was about *14GB*. Eventually that last worker got the fatal error "cannot unpin a segment that is not pinned". After that, DynamicSharedMemoryControlLock was left unreleased and new backends soon hung on it.
On another occasion some backend (more likely, some utility worker) bumped into the error: could not open shared memory segment "/PostgreSQL.1948558832": No such file or directory. After that, other backends got stuck on DynamicSharedMemoryControlLock. Parallel aggregate queries had been running shortly before this issue.
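For anyone who runs into this again: while the system is stuck, the pile-up is visible in pg_stat_activity. A minimal sketch (the exact wait_event spelling for this lock differs slightly between major versions, hence the pattern match):

-- Show backends currently waiting on the dynamic shared memory control
-- LWLock; on 10/11 this should be reported as DynamicSharedMemoryControlLock.
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event_type LIKE 'LWLock%'
  AND wait_event LIKE '%DynamicSharedMemoryControl%'
ORDER BY pid;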
Unfortunately our customer doesn't want to risk hitting this error on the production system again and has turned off query parallelism, and the issue doesn't reproduce in the test environment. AFAIK this is a heisenbug that is reproducible only under heavy load. I'll try to run some stress tests to catch a stack trace.
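The plan for those stress tests is simply to run many copies of a parallel aggregate over a hash join concurrently, so that DSM segments are created and torn down as fast as possible. A rough sketch of such a script (table and column names are made up, not the customer's schema; it is meant to be fed to pgbench with a high client count):

-- stress.sql: with settings similar to the ones above and large enough
-- tables, this should be planned as a Parallel Hash Join under a parallel
-- aggregate, so every execution creates and destroys DSM segments.
-- Run with e.g.: pgbench -n -c 16 -j 8 -T 600 -f stress.sql <dbname>
SELECT e.date, count(*)
FROM events e
JOIN rates r ON r.event_id = e.id
GROUP BY e.date
ORDER BY e.date;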
[1] https://www.postgresql.org/message-id/flat/20181231221734.GB25379%40telsasoft.com
--
Regards,
Maksim Milyutin
Postgres Professional: http://www.postgrespro.ru
The Russian Postgres Company