Performance issues with parallelism and LIMIT - Mailing list pgsql-hackers

From David Geier
Subject Performance issues with parallelism and LIMIT
Date
Msg-id 2c4686b2-635b-8cb8-8f22-03e1fa336f14@gmail.com
Whole thread Raw
Responses Re: Performance issues with parallelism and LIMIT
List pgsql-hackers
Hi hackers,

While migrating from PostgreSQL 14 to 15, we encountered the following 
performance degradation caused by commit 46846433a03dff: "shm_mq: Update 
mq_bytes_written less often", discussion in [1].

The batching can make queries with a LIMIT clause run significantly 
slower compared to PostgreSQL 14, because neither the ring buffer write 
position is updated, nor the latch to inform the leader that there's 
data available is set, before a worker's queue is 1/4th full. This can 
be seen in the number of rows produced by a parallel worker. Worst-case, 
the data set is large and all rows to answer the query appear early, but 
are not big enough to fill the queue to 1/4th (e.g. when the LIMIT and 
the tuple sizes are small). Here is an example to reproduce the problem.

CREATE TABLE t(id1 INT, id2 INT, id3 INT, id4 INT, id5 INT);
INSERT INTO t(id1, id2, id3, id4, id5) SELECT i%1000, i, i, i, i FROM 
generate_series(1, 10000000) AS i;
ANALYZE t;
SET parallel_tuple_cost = 0;
SET parallel_setup_cost = 0;
SET min_parallel_table_scan_size = 0;
SET max_parallel_workers_per_gather = 8;
EXPLAIN ANALYZE VERBOSE SELECT id2 FROM t WHERE id1 = 100 LIMIT 100;

PostgreSQL 15:

  Limit  (cost=0.00..797.43 rows=100 width=4) (actual 
time=65.083..69.207 rows=100 loops=1)
    Output: id2
    ->  Gather  (cost=0.00..79320.18 rows=9947 width=4) (actual 
time=65.073..68.417 rows=100 loops=1)
          Output: id2
          Workers Planned: 8
          Workers Launched: 7
          ->  Parallel Seq Scan on public.t (cost=0.00..79320.18 
rows=1243 width=4) (actual time=0.204..33.049 rows=100 loops=7)
                Output: id2
                Filter: (t.id1 = 100)
                Rows Removed by Filter: 99345
                Worker 0:  actual time=0.334..32.284 rows=100 loops=1
                Worker 1:  actual time=0.060..32.680 rows=100 loops=1
                Worker 2:  actual time=0.637..33.954 rows=98 loops=1
                Worker 3:  actual time=0.136..33.301 rows=100 loops=1
                Worker 4:  actual time=0.140..31.942 rows=100 loops=1
                Worker 5:  actual time=0.062..33.673 rows=100 loops=1
                Worker 6:  actual time=0.062..33.512 rows=100 loops=1
  Planning Time: 0.113 ms
  Execution Time: 69.772 ms

PostgreSQL 14:

  Limit  (cost=0.00..797.75 rows=100 width=4) (actual 
time=30.602..38.459 rows=100 loops=1)
    Output: id2
    ->  Gather  (cost=0.00..79320.18 rows=9943 width=4) (actual 
time=30.592..37.669 rows=100 loops=1)
          Output: id2
          Workers Planned: 8
          Workers Launched: 7
          ->  Parallel Seq Scan on public.t (cost=0.00..79320.18 
rows=1243 width=4) (actual time=0.221..5.181 rows=15 loops=7)
                Output: id2
                Filter: (t.id1 = 100)
                Rows Removed by Filter: 15241
                Worker 0:  actual time=0.129..4.840 rows=15 loops=1
                Worker 1:  actual time=0.125..4.924 rows=15 loops=1
                Worker 2:  actual time=0.314..5.249 rows=17 loops=1
                Worker 3:  actual time=0.252..5.341 rows=15 loops=1
                Worker 4:  actual time=0.163..5.179 rows=15 loops=1
                Worker 5:  actual time=0.422..5.248 rows=15 loops=1
                Worker 6:  actual time=0.139..5.489 rows=16 loops=1
  Planning Time: 0.084 ms
  Execution Time: 38.880 ms

I had a quick look at the code and I started wondering if we can't 
achieve the same performance improvement without batching by e.g.:

- Only set the latch if new data is written to an empty queue. 
Otherwise, the leader should anyways keep try reading from the queues 
without waiting for the latch, so no need to set the latch again.

- Reorganize struct shm_mq. There seems to be false sharing happening 
between at least mq_ring_size and the atomics and potentially also 
between the atomics. I'm wondering if the that's not the root cause of 
the "slow atomics" observed in [1]? I'm happy to do some profiling.

Alternatively, we could always set the latch if numberTuples in 
ExecutePlan() is reasonably low. To do so, the DestReceiver's receive() 
method would only need an additional "force flush" argument.


A slightly different but related problem is when some workers have 
already produced enough rows to answer the LIMIT query, but other 
workers are still running without producing any new rows. In that case 
the "already done" workers will stop running even though they haven't 
reached 1/4th of the queue size, because the for-loop in execMain.c 
bails out in the following condition:

         if (numberTuples && numberTuples == current_tuple_count)
             break;

Subsequently, the leader will end the plan and then wait in the Gather 
node for all workers to shutdown. However, workers still running but not 
producing any new rows will never reach the following condition in 
execMain.c to check if they're supposed to stop (the shared memory queue 
dest receiver will return false on detached queues):

             /*
              * If we are not able to send the tuple, we assume the 
destination
              * has closed and no more tuples can be sent. If that's the 
case,
              * end the loop.
              */
             if (!dest->receiveSlot(slot, dest))
                 break;

Reproduction steps for this problem are below. Here the worker getting 
the first table page will be done right away, but the query takes as 
long as it takes to scan all pages of the entire table.

CREATE TABLE bar (col INT);
INSERT INTO bar SELECT generate_series(1, 5000000);
SET max_parallel_workers_per_gather = 8;
EXPLAIN ANALYZE VERBOSE SELECT col FROM bar WHERE col = 1 LIMIT 1;

  Limit  (cost=0.00..1.10 rows=1 width=4) (actual time=32.289..196.200 
rows=1 loops=1)
    Output: col
    ->  Gather  (cost=0.00..30939.03 rows=28208 width=4) (actual 
time=32.278..196.176 rows=1 loops=1)
          Output: col
          Workers Planned: 8
          Workers Launched: 7
          ->  Parallel Seq Scan on public.bar (cost=0.00..30939.03 
rows=3526 width=4) (actual time=137.251..137.255 rows=0 loops=7)
                Output: col
                Filter: (bar.col = 1)
                Rows Removed by Filter: 713769
                Worker 0:  actual time=160.177..160.181 rows=0 loops=1
                Worker 1:  actual time=160.111..160.115 rows=0 loops=1
                Worker 2:  actual time=0.043..0.047 rows=1 loops=1
                Worker 3:  actual time=160.040..160.044 rows=0 loops=1
                Worker 4:  actual time=160.167..160.171 rows=0 loops=1
                Worker 5:  actual time=160.018..160.022 rows=0 loops=1
                Worker 6:  actual time=160.201..160.205 rows=0 loops=1
  Planning Time: 0.087 ms
  Execution Time: 196.247 ms

We would need something similar to CHECK_FOR_INTERRUPTS() which returns 
a NULL slot if a parallel worker is supposed to stop execution (we could 
e.g. check if the queue got detached). Or could we amend 
CHECK_FOR_INTERRUPTS() to just stop the worker gracefully if the queue 
got detached?

Jasper Smit, Spiros Agathos and Dimos Stamatakis helped working on this.

[1] 
https://www.postgresql.org/message-id/flat/CAFiTN-tVXqn_OG7tHNeSkBbN%2BiiCZTiQ83uakax43y1sQb2OBA%40mail.gmail.com

-- 
David Geier
(ServiceNow)



pgsql-hackers by date:

Previous
From: Jakub Wartak
Date:
Subject: Re: Syncrep and improving latency due to WAL throttling
Next
From: Ilya Gladyshev
Date:
Subject: Re: Progress report of CREATE INDEX for nested partitioned tables