Re: Parallel Seq Scan - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Parallel Seq Scan
Date
Msg-id CAA4eK1KGXPz-SzhgsdU2H3p8qO0NxRBfjT3BDvKZTXPDDb2jtQ@mail.gmail.com
Whole thread Raw
In response to Re: Parallel Seq Scan  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Fri, Nov 13, 2015 at 10:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Nov 12, 2015 at 10:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The number of shared buffers hit could be different across different runs
> > because the read sequence of parallel workers can't be guaranteed, also
> > I don't think same is even guaranteed for Seq Scan node,
>
> The number of hits could be different.  However, it seems like any
> sequential scan, parallel or not, should have a number of accesses
> (hit + read) equal to the size of the relation.  Not sure if that's
> what is happening here.
>

After patch provided above to fix the issue reported by Pavel, that is
the behaviour, but I think there are few more things which we might
want to consider, just refer the below plan:

Total pages in table
--------------------------------
postgres=# select relname,relpages from pg_class where relname like '%t2%';
 relname | relpages
---------+----------
 t2      |     5406
(1 row)


Parallel Plan
-----------------------------
postgres=# explain (analyze,buffers,timing) select count(*) from t2 where c1 % 1
0 = 0;
                                                           QUERY PLAN

--------------------------------------------------------------------------------
------------------------------------------------
 Aggregate  (cost=8174.90..8174.91 rows=1 width=0) (actual time=1055.294..1055.2
94 rows=1 loops=1)
   Buffers: shared hit=446 read=5054
   ->  Gather  (cost=0.00..8162.40 rows=5000 width=0) (actual time=79.787..959.6
51 rows=100000 loops=1)
         Number of Workers: 2
         Buffers: shared hit=446 read=5054
         ->  Parallel Seq Scan on t2  (cost=0.00..8162.40 rows=5000 width=0) (ac
tual time=30.771..2518.844 rows=100000 loops=1)
               Filter: ((c1 % 10) = 0)
               Rows Removed by Filter: 900000
               Buffers: shared hit=352 read=5054
 Planning time: 0.170 ms
 Execution time: 1059.400 ms
(11 rows)

Lets focus on Buffers and actual time in the above plan:

Buffers - At Parallel Seq Scan node, it shows total of 5406 (352+5054)
buffers which tallys with what is expected.  However at Gather node,
it shows 5500 (446+5054) and the reason for the same is that we
accumulate overall buffer usage for parallel execution of worker which
includes start of node as well, refer ParallelQueryMain() and when the
that gets counted even towards the buffer calculation of Gather node.
The theory behind collecting overall buffer usage for parallel execution
was that we need it for pg_stat_statements where the stats is accumulated
for overall execution not on node-by-node basis refer queryDesc->totaltime
usage in standard_ExecutorRun().
I think here we need to decide what is the right value to display at
Gather node:
1. Display the same number of buffers at Gather node as at Parallel
Seq Scan node.
2. Display the number of buffers at Parallel Seq Scan node plus the
additional buffers used by parallel workers for miscellaneous work
like ExecutorStart(), etc.
3. Don't account for buffers used for parallel workers.
4. Anything better?

Also in conjuction with above, we need to see what should be accounted for
pg_stat_statements?

actual_time - 
Actual time at Gather node: actual time = 79.787..959.651
Actual time at Parallel Seq Scan node = 30.771..2518.844

Time at Parallel Seq Scan node is more than time at Gather node as
the time for parallel workers is also accumulated for Parallel Seq Scan
node, whereas some doesn't get accounted for Gather node.  Now it
could be confusing for users because time displayed at Parallel Seq
Scan node will be equal to - time_taken_by_worker-1 +
time_taken_by_worker-2 + ...
This time could be more than the actual time taken by query because each
worker is getting executed parallely but we have accounted the time such
that each one is executing serially. 
I think the time for fetching the tuples from workers is already accounted
for Gather node, so may be for Parallel Seq Scan node we can omit
adding the time for each of the parallel workers.

Thoughts?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

pgsql-hackers by date:

Previous
From: Thom Brown
Date:
Subject: Re: Parallel Seq Scan
Next
From: Amit Kapila
Date:
Subject: Re: Parallel Seq Scan