Re: Parallel Seq Scan - Mailing list pgsql-hackers

From Thom Brown
Subject Re: Parallel Seq Scan
Date
Msg-id CAA-aLv5ykvV=gxi9Xv25sx3aVkWKtHL3NLxZ8v2OS0fUPndgAw@mail.gmail.com
In response to Re: Parallel Seq Scan  (Thom Brown <thom@linux.com>)
List pgsql-hackers
On 11 November 2015 at 19:51, Thom Brown <thom@linux.com> wrote:
> On 11 November 2015 at 19:26, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Nov 11, 2015 at 12:59 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>> I have a first query
>>>
>>> I looked at the EXPLAIN ANALYZE output and the numbers of filtered
>>> rows are different
>>
>> Hmm, I see I was right about people finding more bugs once this was
>> committed.  That didn't take long.
>>
>> There's supposed to be code to handle this - see the
>> SharedPlanStateInstrumentation stuff in execParallel.c - but it's
>> evidently a few bricks shy of a load.
>> ExecParallelReportInstrumentation is supposed to transfer the counts
>> from each worker to the DSM:
>>
>>         ps_instrument = &instrumentation->ps_instrument[i];
>>         SpinLockAcquire(&ps_instrument->mutex);
>>         InstrAggNode(&ps_instrument->instr, planstate->instrument);
>>         SpinLockRelease(&ps_instrument->mutex);
>>
>> And ExecParallelRetrieveInstrumentation is supposed to slurp those
>> counts back into the leader's PlanState objects:
>>
>>         /* No need to acquire the spinlock here; workers have exited already. */
>>         ps_instrument = &instrumentation->ps_instrument[i];
>>         InstrAggNode(planstate->instrument, &ps_instrument->instr);
>>
>> This might be a race condition, or it might be just wrong logic.
>> Could you test what happens if you insert something like a 1-second
>> sleep in ExecParallelFinish just after the call to
>> WaitForParallelWorkersToFinish()?  If that makes the results
>> consistent, this is a race.  If it doesn't, something else is wrong:
>> then it would be useful to know whether the workers are actually
>> calling ExecParallelReportInstrumentation, and whether the leader is
>> actually calling ExecParallelRetrieveInstrumentation, and if so
>> whether they are doing it for the correct set of nodes.
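(For concreteness, I read that suggestion as something along these
lines; the placement and the use of pg_usleep are my interpretation,
not Robert's exact words:)

/* In ExecParallelFinish (execParallel.c), just after the wait;
 * a debugging aid only.  pg_usleep takes microseconds. */
WaitForParallelWorkersToFinish(pei->pcxt);
pg_usleep(1000000L);        /* pause for 1 second */

(pei->pcxt here is the ParallelContext stashed in ParallelExecutorInfo.)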
>
> Hmm... I made the change, but clearly it's not sleeping properly (I'm
> expecting a total runtime in excess of 1 second):
>
> max_parallel_degree = 4:
>
> # explain (analyse, buffers, timing, verbose, costs) select count(*)
> from js where content->'tags'->0->>'term' like 'design%' or
> content->'tags'->0->>'term' like 'web%';
>
>                                                                                     QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=49578.18..49578.19 rows=1 width=0) (actual time=797.518..797.518 rows=1 loops=1)
>    Output: count(*)
>    Buffers: shared hit=174883 read=540
>    ->  Gather  (cost=1000.00..49546.93 rows=12500 width=0) (actual time=0.245..784.959 rows=55151 loops=1)
>          Output: content
>          Number of Workers: 4
>          Buffers: shared hit=174883 read=540
>          ->  Parallel Seq Scan on public.js  (cost=0.00..47296.93 rows=12500 width=0) (actual time=0.019..6153.679 rows=94503 loops=1)
>                Output: content
>                Filter: (((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'web%'::text))
>                Rows Removed by Filter: 2051330
>                Buffers: shared hit=299224 read=907
>  Planning time: 0.086 ms
>  Execution time: 803.026 ms
>
>
> max_parallel_degree = 0:
>
> # explain (analyse, buffers, timing, verbose, costs) select count(*)
> from js where content->'tags'->0->>'term' like 'design%' or
> content->'tags'->0->>'term' like 'web%';
>
>                                                                      QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=212867.43..212867.44 rows=1 width=0) (actual time=1278.717..1278.717 rows=1 loops=1)
>    Output: count(*)
>    Buffers: shared hit=174671 read=572
>    ->  Seq Scan on public.js  (cost=0.00..212836.18 rows=12500 width=0) (actual time=0.018..1272.030 rows=55151 loops=1)
>          Output: content
>          Filter: (((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'design%'::text) OR ((((js.content -> 'tags'::text) -> 0) ->> 'term'::text) ~~ 'web%'::text))
>          Rows Removed by Filter: 1197822
>          Buffers: shared hit=174671 read=572
>  Planning time: 0.064 ms
>  Execution time: 1278.741 ms
> (10 rows)
>
> Time: 1279.145 ms
>
>
> I did, however, notice that repeated runs of the query with
> max_parallel_degree = 4 yield different counts of rows removed by the
> filter:
>
> Run 1: 2051330
> Run 2: 2081252
> Run 3: 2065112
> Run 4: 2022045
> Run 5: 2025384
> Run 6: 2059360
> Run 7: 2079620
> Run 8: 2058541

Here's another oddity, with max_parallel_degree = 1:

# explain (analyse, buffers, timing, verbose, costs) select count(*)
from js where content->'tags'->>'title' like '%design%';
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=132489.34..132489.35 rows=1 width=0) (actual time=382.987..382.987 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=175288
   ->  Gather  (cost=1000.00..132488.34 rows=401 width=0) (actual time=382.983..382.983 rows=0 loops=1)
         Output: content
         Number of Workers: 1
         Buffers: shared hit=175288
         ->  Parallel Seq Scan on public.js  (cost=0.00..131448.24 rows=401 width=0) (actual time=379.407..1141.437 rows=0 loops=1)
               Output: content
               Filter: (((js.content -> 'tags'::text) ->> 'title'::text) ~~ '%design%'::text)
               Rows Removed by Filter: 1724810
               Buffers: shared hit=241201
 Planning time: 0.104 ms
 Execution time: 403.045 ms
(14 rows)

Time: 403.596 ms

The actual time reported for the parallel seq scan node was
1141.437 ms, yet the total execution time was only 403.045 ms.
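
One guess at how that could happen: if InstrAggNode adds each worker's
totals into the leader's node, like this sketch (an assumption about
its behaviour on my part, not the actual instrument.c code):

/* hypothetical aggregation: summing, not taking the maximum */
static void
instr_agg_node(Instrumentation *dst, const Instrumentation *add)
{
    dst->total += add->total;       /* seconds spent in this node */
    dst->ntuples += add->ntuples;
    dst->nloops += add->nloops;
}

then time spent concurrently in the leader and its workers is counted
once per process, so a node's aggregated "actual time" can exceed
wall-clock execution time.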

And successive runs with max_parallel_degree = 1 also yield a
different number of rows removed by the filter, as well as a different
number of buffers being hit:

Run: rows removed / buffers hit
1: 1738517 / 243143
2: 1729361 / 241900
3: 1737168 / 242974
4: 1734440 / 242591
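
For anyone who wants to reproduce the variation, a minimal libpq
harness along these lines should do it (the connection string, GUC
setting, and the js table from the examples above are assumed):

/*
 * repro.c -- re-run the query and print the instrumentation lines
 * that ought to be stable across runs but aren't.
 *
 * Build: cc repro.c -o repro -lpq
 */
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");
    int         run;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    PQclear(PQexec(conn, "SET max_parallel_degree = 1"));

    for (run = 1; run <= 4; run++)
    {
        PGresult   *res = PQexec(conn,
                "EXPLAIN (ANALYSE, BUFFERS) SELECT count(*) FROM js "
                "WHERE content->'tags'->>'title' LIKE '%design%'");
        int         i;

        for (i = 0; i < PQntuples(res); i++)
        {
            const char *line = PQgetvalue(res, i, 0);

            /* these two counts are the ones that wobble */
            if (strstr(line, "Rows Removed by Filter") ||
                strstr(line, "Buffers: shared"))
                printf("run %d: %s\n", run, line);
        }
        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}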

Thom


